Intel® 64 And IA 32 Architectures Software Developer’s Manual Volume 2B: Instruction Set Reference, M U Intel 2018 11 [Intel Vol.2B Instructi


Chapter 4 Instruction Set Reference, M-U

Intel® 64 and IA-32 Architectures

Software Developer’s Manual

Volume 2B:

Instruction Set Reference, M-U

NOTE: The Intel® 64 and IA-32 Architectures Software Developer's Manual consists of ten volumes:

Basic Architecture, Order Number 253665; Instruction Set Reference A-L, Order Number 253666;

Instruction Set Reference M-U, Order Number 253667; Instruction Set Reference V-Z, Order Number

326018; Instruction Set Reference, Order Number 334569; System Programming Guide, Part 1, Order

Number 253668; System Programming Guide, Part 2, Order Number 253669; System Programming

Guide, Part 3, Order Number 326019; System Programming Guide, Part 4, Order Number 332831;

Model-Specific Registers, Order Number 335592. Refer to all ten volumes when evaluating your design

needs.

Order Number: 253667-068US

November 2018

Intel technologies features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Learn

more at intel.com, or from the OEM or retailer.

No computer system can be absolutely secure. Intel does not assume any liability for lost or stolen data or systems or any damages resulting

from such losses.

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products

described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject

matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting http://www.intel.com/design/literature.htm.

Intel, the Intel logo, Intel Atom, Intel Core, Intel SpeedStep, MMX, Pentium, VTune, and Xeon are trademarks of Intel Corporation in the U.S.

and/or other countries.

*Other names and brands may be claimed as the property of others.

Vol. 2B 4-1

CHAPTER 4

INSTRUCTION SET REFERENCE, M-U

4.1 IMM8 CONTROL BYTE OPERATION FOR PCMPESTRI / PCMPESTRM /

PCMPISTRI / PCMPISTRM

The notations introduced in this section are referenced in the reference pages of PCMPESTRI, PCMPESTRM, PCMPISTRI, and PCMPISTRM. The operation of the immediate control byte is common to these four string text processing instructions of SSE4.2. This section describes the common operations.

4.1.1 General Description

The operation of PCMPESTRI, PCMPESTRM, PCMPISTRI, and PCMPISTRM is defined by the combination of the respective opcode and the interpretation of an immediate control byte that is part of the instruction encoding.

The opcode controls the relationship of input bytes/words to each other (it determines whether the inputs are terminated strings or whether lengths are expressed explicitly) as well as the desired output (index or mask).

The Imm8 Control Byte for PCMPESTRM/PCMPESTRI/PCMPISTRM/PCMPISTRI encodes a significant amount of

programmable control over the functionality of those instructions. Some functionality is unique to each instruction

while some is common across some or all of the four instructions. This section describes functionality which is

common across the four instructions.

The arithmetic flags (ZF, CF, SF, OF, AF, PF) are set as a result of these instructions. However, the meanings of the

flags have been overloaded from their typical meanings in order to provide additional information regarding the

relationships of the two inputs.

PCMPxSTRx instructions perform arithmetic comparisons between all possible pairs of bytes or words, one from

each packed input source operand. The boolean results of those comparisons are then aggregated in order to

produce meaningful results. The Imm8 Control Byte is used to affect the interpretation of individual input elements

as well as control the arithmetic comparisons used and the specific aggregation scheme.

Specifically, the Imm8 Control Byte consists of bit fields that control the following attributes:

• Source data format: Byte/word data element granularity, signed or unsigned elements

• Aggregation operation: Encodes the mode of per-element comparison operation and the aggregation of per-element comparisons into an intermediate result

• Polarity: Specifies intermediate processing to be performed on the intermediate result

• Output selection: Specifies final operation to produce the output (depending on index or mask) from the intermediate result


4.1.2 Source Data Format

If the Imm8 Control Byte has bit[0] cleared, each source contains 16 packed bytes. If the bit is set each source

contains 8 packed words. If the Imm8 Control Byte has bit[1] cleared, each input contains unsigned data. If the bit

is set each source contains signed data.

4.1.3 Aggregation Operation

All 256 (64) possible comparisons are always performed. The individual Boolean results of those comparisons are referred to as “BoolRes[Reg/Mem element index, Reg element index].” Comparisons evaluating to “True” are represented with a 1, False with a 0 (positive logic). The initial results are then aggregated into a 16-bit (8-bit) intermediate result (IntRes1) using one of the modes described in the table below, as determined by Imm8 Control Byte bit[3:2].

Table 4-1. Source Data Format

Imm8[1:0] | Meaning | Description
00b | Unsigned bytes | Both 128-bit sources are treated as packed, unsigned bytes.
01b | Unsigned words | Both 128-bit sources are treated as packed, unsigned words.
10b | Signed bytes | Both 128-bit sources are treated as packed, signed bytes.
11b | Signed words | Both 128-bit sources are treated as packed, signed words.

Table 4-2. Aggregation Operation

Imm8[3:2] | Mode | Comparison
00b | Equal any | The arithmetic comparison is “equal.”
01b | Ranges | Arithmetic comparison is “greater than or equal” between even indexed bytes/words of reg and each byte/word of reg/mem. Arithmetic comparison is “less than or equal” between odd indexed bytes/words of reg and each byte/word of reg/mem. (reg/mem[m] >= reg[n] for n = even, reg/mem[m] <= reg[n] for n = odd)
10b | Equal each | The arithmetic comparison is “equal.”
11b | Equal ordered | The arithmetic comparison is “equal.”


See Section 4.1.6 for a description of the overrideIfDataInvalid() function used in Table 4-3.

4.1.4 Polarity

IntRes1 may then be further modified by performing a 1’s complement, according to the value of the Imm8 Control Byte bit[4]. Optionally, a mask may be used such that only those IntRes1 bits which correspond to “valid” reg/mem input elements are complemented (note that the definition of a valid input element is dependent on the specific opcode and is defined in each opcode’s description). The result of the possible negation is referred to as IntRes2.

Table 4-3. Aggregation Operation

Mode | Pseudocode

Equal any (find characters from a set)
    UpperBound = imm8[0] ? 7 : 15;
    IntRes1 = 0;
    For j = 0 to UpperBound, j++
        For i = 0 to UpperBound, i++
            IntRes1[j] OR= overrideIfDataInvalid(BoolRes[j,i])

Ranges (find characters from ranges)
    UpperBound = imm8[0] ? 7 : 15;
    IntRes1 = 0;
    For j = 0 to UpperBound, j++
        For i = 0 to UpperBound, i+=2
            IntRes1[j] OR= (overrideIfDataInvalid(BoolRes[j,i]) AND
                            overrideIfDataInvalid(BoolRes[j,i+1]))

Equal each (string compare)
    UpperBound = imm8[0] ? 7 : 15;
    IntRes1 = 0;
    For i = 0 to UpperBound, i++
        IntRes1[i] = overrideIfDataInvalid(BoolRes[i,i])

Equal ordered (substring search)
    UpperBound = imm8[0] ? 7 : 15;
    IntRes1 = imm8[0] ? FFH : FFFFH
    For j = 0 to UpperBound, j++
        For i = 0 to UpperBound-j, k=j to UpperBound, k++, i++
            IntRes1[j] AND= overrideIfDataInvalid(BoolRes[k,i])
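The four aggregation modes can be modeled in a few lines of Python. This is a minimal sketch for unsigned byte elements, not a description of the hardware: the names `override_if_data_invalid`, `int_res1`, and `pad` are illustrative, and validity is assumed to be an explicit length per operand (the PCMPESTRx style), with every element at or beyond that length treated as invalid. The forcing rules follow Table 4-7.

```python
EQUAL_ANY, RANGES, EQUAL_EACH, EQUAL_ORDERED = 0, 1, 2, 3  # Imm8[3:2] values

def override_if_data_invalid(raw, reg_valid, mem_valid, mode):
    """Apply the forcing rules of Table 4-7 to one raw comparison result."""
    if reg_valid and mem_valid:
        return raw                                # do not force
    if mode in (EQUAL_ANY, RANGES):
        return False                              # any invalid element forces false
    if mode == EQUAL_EACH:
        return not reg_valid and not mem_valid    # true only when both are invalid
    return not reg_valid                          # equal ordered: invalid reg forces true

def int_res1(reg, reg_len, mem, mem_len, mode):
    """Aggregate BoolRes[reg/mem index, reg index] into IntRes1 (as an int)."""
    n = len(reg)  # 16 for bytes, 8 for words

    def res(j, i):
        if mode == RANGES:
            raw = mem[j] >= reg[i] if i % 2 == 0 else mem[j] <= reg[i]
        else:
            raw = mem[j] == reg[i]
        return override_if_data_invalid(raw, i < reg_len, j < mem_len, mode)

    if mode == EQUAL_ANY:
        bits = [any(res(j, i) for i in range(n)) for j in range(n)]
    elif mode == RANGES:
        bits = [any(res(j, i) and res(j, i + 1) for i in range(0, n, 2))
                for j in range(n)]
    elif mode == EQUAL_EACH:
        bits = [res(i, i) for i in range(n)]
    else:  # EQUAL_ORDERED: k runs from j to n-1 while i = k - j runs from 0
        bits = [all(res(k, k - j) for k in range(j, n)) for j in range(n)]
    return sum(1 << j for j, bit in enumerate(bits) if bit)

def pad(text, n=16):
    """Return (elements zero-padded to n, valid length) for an ASCII string."""
    data = list(text.encode("ascii"))
    return data + [0] * (n - len(data)), len(data)

reg, reg_len = pad("aeiou")          # character set held in the reg operand
mem, mem_len = pad("hello world")    # text held in the reg/mem operand
vowel_bits = int_res1(reg, reg_len, mem, mem_len, EQUAL_ANY)
```

In this model, bit j of the result corresponds to element j of the reg/mem operand, so `vowel_bits` has one bit set for each vowel position in the text.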

Table 4-4. Polarity

Imm8[5:4] | Operation | Description
00b | Positive Polarity (+) | IntRes2 = IntRes1
01b | Negative Polarity (-) | IntRes2 = -1 XOR IntRes1
10b | Masked (+) | IntRes2 = IntRes1
11b | Masked (-) | IntRes2[i] = IntRes1[i] if reg/mem[i] invalid, else = ~IntRes1[i]
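The polarity step in Table 4-4 can be sketched as follows. This is an illustrative model only: the function name `apply_polarity` is hypothetical, and validity of the reg/mem operand is assumed to be a contiguous prefix of `mem_valid_len` elements.

```python
def apply_polarity(int_res1, imm8, mem_valid_len):
    """Derive IntRes2 from IntRes1 according to Imm8[5:4]."""
    n = 8 if imm8 & 0x01 else 16          # element count: 8 words or 16 bytes
    all_ones = (1 << n) - 1
    polarity = (imm8 >> 4) & 0x3
    if polarity == 0b01:                  # negative: complement every bit
        return int_res1 ^ all_ones
    if polarity == 0b11:                  # masked negative: complement valid bits only
        valid_mask = (1 << min(mem_valid_len, n)) - 1
        return int_res1 ^ valid_mask
    return int_res1 & all_ones            # 00b and 10b leave IntRes1 unchanged
```

With a 16-bit IntRes1 of 0092H and 11 valid reg/mem elements, the masked negative form inverts only bits 0 through 10, while the plain negative form inverts all 16 bits.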


4.1.5 Output Selection

For PCMPESTRI/PCMPISTRI, the Imm8 Control Byte bit[6] is used to determine if the index is of the least significant

or most significant bit of IntRes2.

Specifically for PCMPESTRM/PCMPISTRM, the Imm8 Control Byte bit[6] is used to determine if the mask is a 16 (8)

bit mask or a 128 bit byte/word mask.

4.1.6 Valid/Invalid Override of Comparisons

PCMPxSTRx instructions allow for the possibility that an end-of-string (EOS) situation may occur within the 128-bit

packed data value (see the instruction descriptions below for details). Any data elements on either source that are

determined to be past the EOS are considered to be invalid, and the treatment of invalid data within a comparison

pair varies depending on the aggregation function being performed.

In general, the individual comparison result for each element pair BoolRes[i,j] can be forced true or false if one or more elements in the pair are invalid. See Table 4-7.

Table 4-5. Output Selection

Imm8[6] | Operation | Description
0b | Least significant index | The index returned to ECX is of the least significant set bit in IntRes2.
1b | Most significant index | The index returned to ECX is of the most significant set bit in IntRes2.

Table 4-6. Output Selection

Imm8[6] | Operation | Description
0b | Bit mask | IntRes2 is returned as the mask to the least significant bits of XMM0 with zero extension to 128 bits.
1b | Byte/word mask | IntRes2 is expanded into a byte/word mask (based on imm8[0]) and placed in XMM0. The expansion is performed by replicating each bit into all of the bits of the byte/word of the same index.
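Both output forms can be modeled with integer arithmetic. The sketch below uses hypothetical helper names (`select_index`, `select_mask`) and treats XMM0 as a 128-bit Python integer. Two points come from outside this section and should be read as assumptions of the sketch: the element width is taken from imm8[0], as defined in Section 4.1.2, and an IntRes2 with no set bits yields an index equal to the element count (16 or 8), as stated on the PCMPESTRI/PCMPISTRI reference pages.

```python
def select_index(int_res2, imm8):
    """Index form (PCMPESTRI/PCMPISTRI): value written to ECX."""
    n = 8 if imm8 & 0x01 else 16
    int_res2 &= (1 << n) - 1
    if int_res2 == 0:
        return n                                   # no set bit: element count
    if imm8 & 0x40:
        return int_res2.bit_length() - 1           # most significant set bit
    return (int_res2 & -int_res2).bit_length() - 1  # least significant set bit

def select_mask(int_res2, imm8):
    """Mask form (PCMPESTRM/PCMPISTRM): value written to XMM0."""
    n = 8 if imm8 & 0x01 else 16
    int_res2 &= (1 << n) - 1
    if not imm8 & 0x40:
        return int_res2                            # bit mask, zero extended
    width = 128 // n                               # 8-bit or 16-bit lanes
    lane = (1 << width) - 1
    out = 0
    for i in range(n):
        if (int_res2 >> i) & 1:
            out |= lane << (i * width)             # replicate bit i across lane i
    return out
```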

Table 4-7. Comparison Result for Each Element Pair BoolRes[i,j]

xmm1 byte/word | xmm2/m128 byte/word | Imm8[3:2] = 00b (equal any) | Imm8[3:2] = 01b (ranges) | Imm8[3:2] = 10b (equal each) | Imm8[3:2] = 11b (equal ordered)
Invalid | Invalid | Force false | Force false | Force true | Force true
Invalid | Valid | Force false | Force false | Force false | Force true
Valid | Invalid | Force false | Force false | Force false | Force false
Valid | Valid | Do not force | Do not force | Do not force | Do not force


4.1.7 Summary of Imm8 Control Byte

Table 4-8. Summary of Imm8 Control Byte

Imm8 Description

-------0b 128-bit sources treated as 16 packed bytes.

-------1b 128-bit sources treated as 8 packed words.

------0-b Packed bytes/words are unsigned.

------1-b Packed bytes/words are signed.

----00--b Mode is equal any.

----01--b Mode is ranges.

----10--b Mode is equal each.

----11--b Mode is equal ordered.

---0----b IntRes1 is unmodified.

---1----b IntRes1 is negated (1’s complement).

--0-----b Negation of IntRes1 is for all 16 (8) bits.

--1-----b Negation of IntRes1 is masked by reg/mem validity.

-0------b Index form (PCMPESTRI/PCMPISTRI): index of the least significant set bit is used (regardless of corresponding input element validity). Mask form (PCMPESTRM/PCMPISTRM): IntRes2 is returned in least significant bits of XMM0.

-1------b Index form (PCMPESTRI/PCMPISTRI): index of the most significant set bit is used (regardless of corresponding input element validity). Mask form (PCMPESTRM/PCMPISTRM): each bit of IntRes2 is expanded to byte/word.

0-------b This bit currently has no defined effect, should be 0.

1-------b This bit currently has no defined effect, should be 0.
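The bit fields summarized in Table 4-8 can be unpacked with a small helper. This is an illustrative sketch; the function name `decode_imm8` and the dictionary keys are not part of the architecture.

```python
MODES = ("equal any", "ranges", "equal each", "equal ordered")

def decode_imm8(imm8):
    """Split an Imm8 control byte into the fields of Table 4-8."""
    return {
        "element": "word" if imm8 & 0x01 else "byte",   # bit 0
        "signed": bool(imm8 & 0x02),                    # bit 1
        "mode": MODES[(imm8 >> 2) & 0x3],               # bits 3:2
        "negate": bool(imm8 & 0x10),                    # bit 4
        "masked": bool(imm8 & 0x20),                    # bit 5
        "msb_or_expand": bool(imm8 & 0x40),             # bit 6
    }

# 0CH selects unsigned bytes, equal ordered, positive polarity,
# and the least significant index (or the bit mask form).
fields = decode_imm8(0x0C)
```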


4.1.8 Diagram Comparison and Aggregation Process

[Figure 4-1. Operation of PCMPSTRx and PCMPESTRx]

4.2 COMMON TRANSFORMATION AND PRIMITIVE FUNCTIONS FOR SHA1XXX AND SHA256XXX

The following primitive functions and transformations are used in the algorithmic descriptions of the SHA1 and SHA256 instruction extensions SHA1NEXTE, SHA1RNDS4, SHA1MSG1, SHA1MSG2, SHA256RNDS4, SHA256MSG1 and SHA256MSG2. The operands of these primitives and transformations are generally 32-bit DWORD integers.

• f0(): A bit oriented logical operation that derives a new dword from three SHA1 state variables (dword). This function is used in SHA1 round 1 to 20 processing.

f0(B,C,D) ← (B AND C) XOR ((NOT B) AND D)

• f1(): A bit oriented logical operation that derives a new dword from three SHA1 state variables (dword). This function is used in SHA1 round 21 to 40 processing.

f1(B,C,D) ← B XOR C XOR D

• f2(): A bit oriented logical operation that derives a new dword from three SHA1 state variables (dword). This function is used in SHA1 round 41 to 60 processing.

f2(B,C,D) ← (B AND C) XOR (B AND D) XOR (C AND D)

• f3(): A bit oriented logical operation that derives a new dword from three SHA1 state variables (dword). This function is used in SHA1 round 61 to 80 processing. It is the same as f1().

f3(B,C,D) ← B XOR C XOR D

• Ch(): A bit oriented logical operation that derives a new dword from three SHA256 state variables (dword).

Ch(E,F,G) ← (E AND F) XOR ((NOT E) AND G)

• Maj(): A bit oriented logical operation that derives a new dword from three SHA256 state variables (dword).

Maj(A,B,C) ← (A AND B) XOR (A AND C) XOR (B AND C)

ROR is the rotate right operation

(A ROR N) ← A[N-1:0] || A[Width-1:N]

ROL is the rotate left operation

(A ROL N) ← A ROR (Width-N)

SHR is the right shift operation

(A SHR N) ← ZEROES[N-1:0] || A[Width-1:N]

• Σ0( ): A bit oriented logical and rotational transformation performed on a dword SHA256 state variable.

Σ0(A) ← (A ROR 2) XOR (A ROR 13) XOR (A ROR 22)

• Σ1( ): A bit oriented logical and rotational transformation performed on a dword SHA256 state variable.

Σ1(E) ← (E ROR 6) XOR (E ROR 11) XOR (E ROR 25)

• σ0( ): A bit oriented logical and rotational transformation performed on a SHA256 message dword used in the message scheduling.

σ0(W) ← (W ROR 7) XOR (W ROR 18) XOR (W SHR 3)

• σ1( ): A bit oriented logical and rotational transformation performed on a SHA256 message dword used in the message scheduling.

σ1(W) ← (W ROR 17) XOR (W ROR 19) XOR (W SHR 10)

• Ki: SHA1 Constants dependent on immediate i.

K0 = 0x5A827999

K1 = 0x6ED9EBA1

K2 = 0x8F1BBCDC

K3 = 0xCA62C1D6
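The primitives listed above translate directly into 32-bit integer operations. The following Python sketch is one way to write them; the lowercase function names and the `MASK32` helper are illustrative, and Python's unbounded integers are masked to 32 bits to mimic dword arithmetic.

```python
MASK32 = 0xFFFFFFFF  # keep results within a 32-bit dword

def ror(a, n):
    """Rotate a 32-bit value right by n bits (0 <= n <= 32)."""
    return ((a >> n) | (a << (32 - n))) & MASK32

def rol(a, n):
    """Rotate left, defined as a rotate right by (Width - n)."""
    return ror(a, 32 - n)

def shr(a, n):
    """Logical right shift with zero fill."""
    return (a & MASK32) >> n

# SHA1 round functions
def f0(b, c, d): return ((b & c) ^ (~b & d)) & MASK32
def f1(b, c, d): return (b ^ c ^ d) & MASK32
def f2(b, c, d): return ((b & c) ^ (b & d) ^ (c & d)) & MASK32
f3 = f1  # rounds 61 to 80 reuse the parity function

# SHA256 functions
def ch(e, f, g): return ((e & f) ^ (~e & g)) & MASK32
def maj(a, b, c): return ((a & b) ^ (a & c) ^ (b & c)) & MASK32
def big_sigma0(a): return ror(a, 2) ^ ror(a, 13) ^ ror(a, 22)
def big_sigma1(e): return ror(e, 6) ^ ror(e, 11) ^ ror(e, 25)
def small_sigma0(w): return ror(w, 7) ^ ror(w, 18) ^ shr(w, 3)
def small_sigma1(w): return ror(w, 17) ^ ror(w, 19) ^ shr(w, 10)

# SHA1 constants, indexed by the immediate i
K = (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)
```

Note that f0 and Ch compute the same bitwise selection, and f2 and Maj compute the same bitwise majority; only the state variables they are applied to differ.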

4.3 INSTRUCTIONS (M-U)

Chapter 4 continues an alphabetical discussion of Intel® 64 and IA-32 instructions (M-U). See also: Chapter 3, “Instruction Set Reference, A-L,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, and Chapter 5, “Instruction Set Reference, V-Z,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2C.

MASKMOVDQU—Store Selected Bytes of Double Quadword


Instruction Operand Encoding (see footnote 1)

Description

Stores selected bytes from the source operand (first operand) into a 128-bit memory location. The mask operand (second operand) selects which bytes from the source operand are written to memory. The source and mask operands are XMM registers. The memory location is specified by the effective address in the DI/EDI/RDI register (the default segment register is DS, but this may be overridden with a segment-override prefix). The memory location does not need to be aligned on a natural boundary. (The size of the store address depends on the address-size attribute.)

The most significant bit in each byte of the mask operand determines whether the corresponding byte in the source

operand is written to the corresponding byte location in memory: 0 indicates no write and 1 indicates write.

The MASKMOVDQU instruction generates a non-temporal hint to the processor to minimize cache pollution. The non-temporal hint is implemented by using a write combining (WC) memory type protocol (see “Caching of Temporal vs. Non-Temporal Data” in Chapter 10, of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1). Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MASKMOVDQU instructions if multiple processors might use different memory types to read/write the destination memory locations.

Behavior with a mask of all 0s is as follows:

• No data will be written to memory.

• Signaling of breakpoints (code or data) is not guaranteed; different processor implementations may signal or not signal these breakpoints.

• Exceptions associated with addressing memory and page faults may still be signaled (implementation dependent).

• If the destination memory region is mapped as UC or WP, enforcement of associated semantics for these memory types is not guaranteed (that is, is reserved) and is implementation-specific.

The MASKMOVDQU instruction can be used to improve performance of algorithms that need to merge data on a byte-by-byte basis. MASKMOVDQU should not cause a read for ownership; doing so generates unnecessary bandwidth since data is to be written directly using the byte-mask without allocating old data prior to the store.

In 64-bit mode, use of the REX.R prefix permits this instruction to access additional registers (XMM8-XMM15).

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b; otherwise the instruction will #UD.

If VMASKMOVDQU is encoded with VEX.L = 1, an attempt to execute the instruction will cause a #UD exception.

Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
66 0F F7 /r MASKMOVDQU xmm1, xmm2 | RM | V/V | SSE2 | Selectively write bytes from xmm1 to memory location using the byte mask in xmm2. The default memory location is specified by DS:DI/EDI/RDI.
VEX.128.66.0F.WIG F7 /r VMASKMOVDQU xmm1, xmm2 | RM | V/V | AVX | Selectively write bytes from xmm1 to memory location using the byte mask in xmm2. The default memory location is specified by DS:DI/EDI/RDI.

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r) | ModRM:r/m (r) | NA | NA

1. ModRM.MOD = 011B required


Operation

IF (MASK[7] = 1)
    THEN DEST[DI/EDI] ← SRC[7:0] ELSE (* Memory location unchanged *); FI;
IF (MASK[15] = 1)
    THEN DEST[DI/EDI +1] ← SRC[15:8] ELSE (* Memory location unchanged *); FI;
(* Repeat operation for 3rd through 15th bytes in source operand *)
IF (MASK[127] = 1)
    THEN DEST[DI/EDI +15] ← SRC[127:120] ELSE (* Memory location unchanged *); FI;
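The byte-select behavior described by this pseudocode can be modeled on a plain byte buffer. The sketch below is illustrative only: `maskmovdqu` is a hypothetical Python function, memory is a `bytearray`, and the non-temporal hint, faults, and breakpoint behavior are not modeled.

```python
def maskmovdqu(memory, address, src, mask):
    """Write src[k] to memory[address + k] when bit 7 of mask[k] is set."""
    for k in range(16):
        if mask[k] & 0x80:          # only the most significant bit of each mask byte matters
            memory[address + k] = src[k]
        # otherwise the memory byte is left unchanged

memory = bytearray(b"." * 20)
src = bytes(range(ord("A"), ord("A") + 16))                     # "ABCDEFGHIJKLMNOP"
mask = bytes(0x80 if k % 2 == 0 else 0x7F for k in range(16))   # select even bytes
maskmovdqu(memory, 2, src, mask)
```

A mask byte of 7FH does not select its byte even though seven of its bits are set, because only bit 7 is examined.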

Intel C/C++ Compiler Intrinsic Equivalent

void _mm_maskmoveu_si128(__m128i d, __m128i n, char * p)

Other Exceptions

See Exceptions Type 4; additionally

#UD If VEX.L= 1

If VEX.vvvv ≠ 1111B.

MASKMOVQ—Store Selected Bytes of Quadword


Instruction Operand Encoding

Description

Stores selected bytes from the source operand (first operand) into a 64-bit memory location. The mask operand (second operand) selects which bytes from the source operand are written to memory. The source and mask operands are MMX technology registers. The memory location is specified by the effective address in the DI/EDI/RDI register (the default segment register is DS, but this may be overridden with a segment-override prefix). The memory location does not need to be aligned on a natural boundary. (The size of the store address depends on the address-size attribute.)

The most significant bit in each byte of the mask operand determines whether the corresponding byte in the source

operand is written to the corresponding byte location in memory: 0 indicates no write and 1 indicates write.

The MASKMOVQ instruction generates a non-temporal hint to the processor to minimize cache pollution. The non-temporal hint is implemented by using a write combining (WC) memory type protocol (see “Caching of Temporal vs. Non-Temporal Data” in Chapter 10, of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1). Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MASKMOVQ instructions if multiple processors might use different memory types to read/write the destination memory locations.

This instruction causes a transition from x87 FPU to MMX technology state (that is, the x87 FPU top-of-stack pointer

is set to 0 and the x87 FPU tag word is set to all 0s [valid]).

The behavior of the MASKMOVQ instruction with a mask of all 0s is as follows:

• No data will be written to memory.

• Transition from x87 FPU to MMX technology state will occur.

• Exceptions associated with addressing memory and page faults may still be signaled (implementation dependent).

• Signaling of breakpoints (code or data) is not guaranteed (implementation dependent).

• If the destination memory region is mapped as UC or WP, enforcement of associated semantics for these memory types is not guaranteed (that is, is reserved) and is implementation-specific.

The MASKMOVQ instruction can be used to improve performance for algorithms that need to merge data on a byte-by-byte basis. It should not cause a read for ownership; doing so generates unnecessary bandwidth since data is to be written directly using the byte-mask without allocating old data prior to the store.

In 64-bit mode, the memory address is specified by DS:RDI.

Opcode/Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
NP 0F F7 /r MASKMOVQ mm1, mm2 | RM | Valid | Valid | Selectively write bytes from mm1 to memory location using the byte mask in mm2. The default memory location is specified by DS:DI/EDI/RDI.

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r) | ModRM:r/m (r) | NA | NA


Operation

IF (MASK[7] = 1)
    THEN DEST[DI/EDI] ← SRC[7:0] ELSE (* Memory location unchanged *); FI;
IF (MASK[15] = 1)
    THEN DEST[DI/EDI +1] ← SRC[15:8] ELSE (* Memory location unchanged *); FI;
(* Repeat operation for 3rd through 7th bytes in source operand *)
IF (MASK[63] = 1)
    THEN DEST[DI/EDI +7] ← SRC[63:56] ELSE (* Memory location unchanged *); FI;

Intel C/C++ Compiler Intrinsic Equivalent

void _mm_maskmove_si64(__m64 d, __m64 n, char * p)

Other Exceptions

See Table 22-8, “Exception Conditions for Legacy SIMD/MMX Instructions without FP Exception,” in the Intel® 64

and IA-32 Architectures Software Developer’s Manual, Volume 3A.

MAXPD—Maximum of Packed Double-Precision Floating-Point Values


Instruction Operand Encoding

Description

Performs a SIMD compare of the packed double-precision floating-point values in the first source operand and the

second source operand and returns the maximum value for each pair of values to the destination operand.

If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is

returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that

is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN

or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source

operand (from either the first or second operand) be returned, the action of MAXPD can be emulated using a

sequence of instructions, such as a comparison followed by AND, ANDN and OR.
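One possible form of such a sequence is an unordered compare of the first operand with itself to build a lane mask, followed by AND, ANDN and OR to blend the first operand with the ordinary MAXPD result. The Python sketch below models a single 64-bit lane with integer bit patterns. It is an illustration of the blending idea, not a sequence prescribed by this manual, and the helper names are hypothetical.

```python
import math
import struct

ALL_ONES = 0xFFFFFFFFFFFFFFFF

def bits(x):
    """Reinterpret a double as its 64-bit pattern."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def from_bits(b):
    """Reinterpret a 64-bit pattern as a double."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def maxpd_lane(a, b):
    """Per-lane MAXPD behavior: the second operand wins on NaN or equal zeros."""
    return a if a > b else b

def cmpunord_mask(a, b):
    """All ones when either input is NaN, else zero (unordered compare)."""
    return ALL_ONES if (math.isnan(a) or math.isnan(b)) else 0

def nan_propagating_max(a, b):
    m = bits(maxpd_lane(a, b))          # already returns b when b is NaN
    mask = cmpunord_mask(a, a)          # all ones only when a is NaN
    blended = (mask & bits(a)) | (~mask & ALL_ONES & m)   # AND, ANDN, OR
    return from_bits(blended)
```

With this blend, a NaN in either operand reaches the result, whereas the plain instruction returns the second operand whenever exactly one input is a NaN.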

EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second

source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector

broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally

updated with writemask k1.

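VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are zeroed.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.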

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 5F /r MAXPD xmm1, xmm2/m128 | A | V/V | SSE2 | Return the maximum double-precision floating-point values between xmm1 and xmm2/m128.
VEX.128.66.0F.WIG 5F /r VMAXPD xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Return the maximum double-precision floating-point values between xmm2 and xmm3/m128.
VEX.256.66.0F.WIG 5F /r VMAXPD ymm1, ymm2, ymm3/m256 | B | V/V | AVX | Return the maximum packed double-precision floating-point values between ymm2 and ymm3/m256.
EVEX.128.66.0F.W1 5F /r VMAXPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | C | V/V | AVX512VL AVX512F | Return the maximum packed double-precision floating-point values between xmm2 and xmm3/m128/m64bcst and store result in xmm1 subject to writemask k1.
EVEX.256.66.0F.W1 5F /r VMAXPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | C | V/V | AVX512VL AVX512F | Return the maximum packed double-precision floating-point values between ymm2 and ymm3/m256/m64bcst and store result in ymm1 subject to writemask k1.
EVEX.512.66.0F.W1 5F /r VMAXPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{sae} | C | V/V | AVX512F | Return the maximum packed double-precision floating-point values between zmm2 and zmm3/m512/m64bcst and store result in zmm1 subject to writemask k1.

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA
C | Full | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA


128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.

Operation

MAX(SRC1, SRC2)
{
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
        ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
        ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
        ELSE IF (SRC1 > SRC2) THEN DEST ← SRC1;
        ELSE DEST ← SRC2;
    FI;
}
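The selection rules of MAX() can be checked with a scalar model. The sketch below is a simplification under stated assumptions: `max_lane` is a hypothetical name, Python floats stand in for one double-precision lane, and Python cannot distinguish SNaN from QNaN, so every NaN is treated the way the Description treats a single NaN input (the second operand is returned). Exception signaling is not modeled.

```python
import math

def max_lane(src1, src2):
    """Scalar model of MAX(SRC1, SRC2) for one lane."""
    if src1 == 0.0 and src2 == 0.0:
        return src2                      # zeros of either sign: second operand
    if math.isnan(src1) or math.isnan(src2):
        return src2                      # any NaN: second operand is forwarded
    return src1 if src1 > src2 else src2

# The sign of the result shows which zero was returned.
sign_a = math.copysign(1.0, max_lane(0.0, -0.0))
sign_b = math.copysign(1.0, max_lane(-0.0, 0.0))
```

Because every special case returns the second operand, the whole function behaves like the single expression `src1 if src1 > src2 else src2`; the operation is therefore not commutative for zeros and NaNs.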

VMAXPD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN
                    DEST[i+63:i] ← MAX(SRC1[i+63:i], SRC2[63:0])
                ELSE
                    DEST[i+63:i] ← MAX(SRC1[i+63:i], SRC2[i+63:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0
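The writemask and broadcast handling in the EVEX form can be modeled with Python lists, one list entry per 64-bit lane. This is an illustrative sketch: the function `vmaxpd_evex` and its parameters are hypothetical, `k1=None` stands for "no writemask", and the zeroing of bits above VL is not represented.

```python
def vmaxpd_evex(dest, src1, src2, k1=None, zeroing=False, broadcast=False):
    """Model of the EVEX lane loop; returns the new destination lanes."""
    kl = len(src1)                       # 2, 4 or 8 lanes
    out = list(dest)
    for j in range(kl):
        if k1 is None or (k1 >> j) & 1:
            b = src2[0] if broadcast else src2[j]   # m64bcst repeats element 0
            a = src1[j]
            out[j] = a if a > b else b              # MAX(): second operand on ties and NaN
        elif zeroing:
            out[j] = 0.0                            # zeroing-masking
        # merging-masking: lane keeps its previous value
    return out

dest = [9.0, 9.0, 9.0, 9.0]
src1 = [1.0, 5.0, 2.0, 8.0]
src2 = [4.0, 3.0, 7.0, 6.0]
merged = vmaxpd_evex(dest, src1, src2, k1=0b0101)
zeroed = vmaxpd_evex(dest, src1, src2, k1=0b0101, zeroing=True)
```

With mask 0101b only lanes 0 and 2 are computed; the other lanes either keep the old destination value or become zero, depending on the masking mode.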

VMAXPD (VEX.256 encoded version)
DEST[63:0] ← MAX(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← MAX(SRC1[127:64], SRC2[127:64])
DEST[191:128] ← MAX(SRC1[191:128], SRC2[191:128])
DEST[255:192] ← MAX(SRC1[255:192], SRC2[255:192])
DEST[MAXVL-1:256] ← 0

VMAXPD (VEX.128 encoded version)
DEST[63:0] ← MAX(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← MAX(SRC1[127:64], SRC2[127:64])
DEST[MAXVL-1:128] ← 0


MAXPD (128-bit Legacy SSE version)
DEST[63:0] ← MAX(DEST[63:0], SRC[63:0])
DEST[127:64] ← MAX(DEST[127:64], SRC[127:64])
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VMAXPD __m512d _mm512_max_pd( __m512d a, __m512d b);

VMAXPD __m512d _mm512_mask_max_pd(__m512d s, __mmask8 k, __m512d a, __m512d b);

VMAXPD __m512d _mm512_maskz_max_pd( __mmask8 k, __m512d a, __m512d b);

VMAXPD __m512d _mm512_max_round_pd( __m512d a, __m512d b, int);

VMAXPD __m512d _mm512_mask_max_round_pd(__m512d s, __mmask8 k, __m512d a, __m512d b, int);

VMAXPD __m512d _mm512_maskz_max_round_pd( __mmask8 k, __m512d a, __m512d b, int);

VMAXPD __m256d _mm256_mask_max_pd(__m256d s, __mmask8 k, __m256d a, __m256d b);

VMAXPD __m256d _mm256_maskz_max_pd( __mmask8 k, __m256d a, __m256d b);

VMAXPD __m128d _mm_mask_max_pd(__m128d s, __mmask8 k, __m128d a, __m128d b);

VMAXPD __m128d _mm_maskz_max_pd( __mmask8 k, __m128d a, __m128d b);

VMAXPD __m256d _mm256_max_pd (__m256d a, __m256d b);

(V)MAXPD __m128d _mm_max_pd (__m128d a, __m128d b);

SIMD Floating-Point Exceptions

Invalid (including QNaN Source Operand), Denormal

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 2.

EVEX-encoded instruction, see Exceptions Type E2.

MAXPS—Maximum of Packed Single-Precision Floating-Point Values


Instruction Operand Encoding

Description

Performs a SIMD compare of the packed single-precision floating-point values in the first source operand and the

second source operand and returns the maximum value for each pair of values to the destination operand.

If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is

returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that

is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source operand (from either the first or second operand) be returned, the action of MAXPS can be emulated using a sequence of instructions, such as a comparison followed by AND, ANDN and OR.

EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second

source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector

broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally

updated with writemask k1.

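VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are zeroed.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.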

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 5F /r MAXPS xmm1, xmm2/m128 | A | V/V | SSE | Return the maximum single-precision floating-point values between xmm1 and xmm2/mem.
VEX.128.0F.WIG 5F /r VMAXPS xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Return the maximum single-precision floating-point values between xmm2 and xmm3/mem.
VEX.256.0F.WIG 5F /r VMAXPS ymm1, ymm2, ymm3/m256 | B | V/V | AVX | Return the maximum single-precision floating-point values between ymm2 and ymm3/mem.
EVEX.128.0F.W0 5F /r VMAXPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | C | V/V | AVX512VL AVX512F | Return the maximum packed single-precision floating-point values between xmm2 and xmm3/m128/m32bcst and store result in xmm1 subject to writemask k1.
EVEX.256.0F.W0 5F /r VMAXPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | C | V/V | AVX512VL AVX512F | Return the maximum packed single-precision floating-point values between ymm2 and ymm3/m256/m32bcst and store result in ymm1 subject to writemask k1.
EVEX.512.0F.W0 5F /r VMAXPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{sae} | C | V/V | AVX512F | Return the maximum packed single-precision floating-point values between zmm2 and zmm3/m512/m32bcst and store result in zmm1 subject to writemask k1.

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA
C | Full | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA

Operation

MAX(SRC1, SRC2)
{
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC1 > SRC2) THEN DEST ← SRC1;
    ELSE DEST ← SRC2;
    FI;
}

VMAXPS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN
                    DEST[i+31:i] ← MAX(SRC1[i+31:i], SRC2[31:0])
                ELSE
                    DEST[i+31:i] ← MAX(SRC1[i+31:i], SRC2[i+31:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0
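The writemask handling in the EVEX loop can be illustrated with a small C model. This is a sketch only: `vmaxps_masked_model` and its parameters are hypothetical, lanes are plain array elements, and embedded broadcast, SAE, and the zeroing of bits above VL are left out.

```c
#include <assert.h>
#include <stdint.h>

/* One lane, per the MAX() pseudocode: src2 wins unless src1 > src2. */
static float max_lane(float src1, float src2) {
    return (src1 > src2) ? src1 : src2;
}

/* Model of the masked loop over kl lanes.
   k1:      writemask, bit j controls lane j
   zeroing: nonzero models {z} (zeroing-masking), zero models merging-masking */
static void vmaxps_masked_model(float *dest, const float *src1,
                                const float *src2, int kl,
                                uint16_t k1, int zeroing) {
    assert(kl >= 0 && kl <= 16);
    for (int j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u) {
            dest[j] = max_lane(src1[j], src2[j]);
        } else if (zeroing) {
            dest[j] = 0.0f;
        }
        /* merging-masking: dest[j] keeps its previous value */
    }
}
```

With merging-masking the previous destination contents are an input to the operation, which is why the masked intrinsics take an extra source argument `s`.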

VMAXPS (VEX.256 encoded version)
DEST[31:0] ← MAX(SRC1[31:0], SRC2[31:0])
DEST[63:32] ← MAX(SRC1[63:32], SRC2[63:32])
DEST[95:64] ← MAX(SRC1[95:64], SRC2[95:64])
DEST[127:96] ← MAX(SRC1[127:96], SRC2[127:96])
DEST[159:128] ← MAX(SRC1[159:128], SRC2[159:128])
DEST[191:160] ← MAX(SRC1[191:160], SRC2[191:160])
DEST[223:192] ← MAX(SRC1[223:192], SRC2[223:192])
DEST[255:224] ← MAX(SRC1[255:224], SRC2[255:224])
DEST[MAXVL-1:256] ← 0

VMAXPS (VEX.128 encoded version)
DEST[31:0] ← MAX(SRC1[31:0], SRC2[31:0])
DEST[63:32] ← MAX(SRC1[63:32], SRC2[63:32])
DEST[95:64] ← MAX(SRC1[95:64], SRC2[95:64])
DEST[127:96] ← MAX(SRC1[127:96], SRC2[127:96])
DEST[MAXVL-1:128] ← 0

MAXPS (128-bit Legacy SSE version)

DEST[31:0] MAX(DEST[31:0], SRC[31:0])

DEST[63:32] MAX(DEST[63:32], SRC[63:32])

DEST[95:64] MAX(DEST[95:64], SRC[95:64])

DEST[127:96] MAX(DEST[127:96], SRC[127:96])

DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VMAXPS __m512 _mm512_max_ps( __m512 a, __m512 b);

VMAXPS __m512 _mm512_mask_max_ps(__m512 s, __mmask16 k, __m512 a, __m512 b);

VMAXPS __m512 _mm512_maskz_max_ps( __mmask16 k, __m512 a, __m512 b);

VMAXPS __m512 _mm512_max_round_ps( __m512 a, __m512 b, int);

VMAXPS __m512 _mm512_mask_max_round_ps(__m512 s, __mmask16 k, __m512 a, __m512 b, int);

VMAXPS __m512 _mm512_maskz_max_round_ps( __mmask16 k, __m512 a, __m512 b, int);

VMAXPS __m256 _mm256_mask_max_ps(__m256 s, __mmask8 k, __m256 a, __m256 b);

VMAXPS __m256 _mm256_maskz_max_ps( __mmask8 k, __m256 a, __m256 b);

VMAXPS __m128 _mm_mask_max_ps(__m128 s, __mmask8 k, __m128 a, __m128 b);

VMAXPS __m128 _mm_maskz_max_ps( __mmask8 k, __m128 a, __m128 b);

VMAXPS __m256 _mm256_max_ps (__m256 a, __m256 b);

MAXPS __m128 _mm_max_ps (__m128 a, __m128 b);

SIMD Floating-Point Exceptions

Invalid (including QNaN Source Operand), Denormal

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 2.

EVEX-encoded instruction, see Exceptions Type E2.

MAXSD—Return Maximum Scalar Double-Precision Floating-Point Value

Instruction Operand Encoding

Description

Compares the low double-precision floating-point values in the first source operand and the second source

operand, and returns the maximum value to the low quadword of the destination operand. The second source

operand can be an XMM register or a 64-bit memory location. The first source and destination operands are XMM

registers. When the second source operand is a memory operand, only 64 bits are accessed.

If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If

a value in the second source operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a

QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid floating-point value, is written to the result. If the NaN of either source operand must be returned instead, the action of MAXSD can be emulated using a sequence of instructions, such as a comparison followed by AND, ANDN, and OR.
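The zero and NaN rules above reduce to a single comparison: the first source is returned only when it compares strictly greater. The following C sketch models the returned value only; `maxsd_value` and `is_negative_zero` are hypothetical names, and exception signaling is not modeled.

```c
#include <math.h>

/* Value rule from the description: src1 is returned only when it
   compares strictly greater. Two zeros of either sign, or a NaN in
   either operand, fall through to src2. */
static double maxsd_value(double src1, double src2) {
    return (src1 > src2) ? src1 : src2;
}

/* Hypothetical helper: 1 if v is a zero with its sign bit set. */
static int is_negative_zero(double v) {
    return v == 0.0 && signbit(v) != 0;
}
```

Under this model `maxsd_value(0.0, -0.0)` yields the negative zero from the second operand, and `maxsd_value(NAN, 1.0)` yields 1.0 rather than the NaN, so the operation is not commutative for these inputs.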

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAXVL-1:64) of the

corresponding destination register remain unchanged.

VEX.128 and EVEX encoded version: Bits (127:64) of the XMM register destination are copied from corresponding

bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.

EVEX encoded version: The low quadword element of the destination operand is updated according to the

writemask.

Software should ensure VMAXSD is encoded with VEX.L=0. Encoding VMAXSD with VEX.L=1 may result in unpredictable behavior across different processor generations.

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

F2 0F 5F /r | MAXSD xmm1, xmm2/m64 | A | V/V | SSE2 | Return the maximum scalar double-precision floating-point value between xmm2/m64 and xmm1.

VEX.LIG.F2.0F.WIG 5F /r | VMAXSD xmm1, xmm2, xmm3/m64 | B | V/V | AVX | Return the maximum scalar double-precision floating-point value between xmm3/m64 and xmm2.

EVEX.LIG.F2.0F.W1 5F /r | VMAXSD xmm1 {k1}{z}, xmm2, xmm3/m64{sae} | C | V/V | AVX512F | Return the maximum scalar double-precision floating-point value between xmm3/m64 and xmm2.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (r, w) ModRM:r/m (r) NA NA

B NA ModRM:reg (w) VEX.vvvv ModRM:r/m (r) NA

C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv ModRM:r/m (r) NA

Operation

MAX(SRC1, SRC2)
{
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC1 > SRC2) THEN DEST ← SRC1;
    ELSE DEST ← SRC2;
    FI;
}

VMAXSD (EVEX encoded version)
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← MAX(SRC1[63:0], SRC2[63:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0

VMAXSD (VEX.128 encoded version)
DEST[63:0] ← MAX(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0

MAXSD (128-bit Legacy SSE version)
DEST[63:0] ← MAX(DEST[63:0], SRC[63:0])
DEST[MAXVL-1:64] (Unmodified)
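The difference between the VEX.128 and legacy SSE forms lies in where the upper quadword comes from. The C sketch below models only bits 127:0 with a hypothetical two-lane struct `xmm_pd`; the function names are illustrative and bits above 127 are not represented.

```c
/* Hypothetical model of the low 128 bits of a vector register. */
typedef struct { double lane[2]; } xmm_pd;

static double max_sd(double src1, double src2) {
    return (src1 > src2) ? src1 : src2;
}

/* VEX.128 form: three operands, upper quadword copied from src1. */
static xmm_pd vmaxsd_model(xmm_pd src1, xmm_pd src2) {
    xmm_pd dest;
    dest.lane[0] = max_sd(src1.lane[0], src2.lane[0]);
    dest.lane[1] = src1.lane[1];
    return dest;
}

/* Legacy SSE form: dest is also the first source, and its upper
   quadword is left untouched. */
static void maxsd_model(xmm_pd *dest, xmm_pd src) {
    dest->lane[0] = max_sd(dest->lane[0], src.lane[0]);
}
```

In both models the upper quadword of the second source never reaches the destination.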

Intel C/C++ Compiler Intrinsic Equivalent

VMAXSD __m128d _mm_max_round_sd( __m128d a, __m128d b, int);

VMAXSD __m128d _mm_mask_max_round_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, int);

VMAXSD __m128d _mm_maskz_max_round_sd( __mmask8 k, __m128d a, __m128d b, int);

MAXSD __m128d _mm_max_sd(__m128d a, __m128d b)

SIMD Floating-Point Exceptions

Invalid (Including QNaN Source Operand), Denormal

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 3.

EVEX-encoded instruction, see Exceptions Type E3.

MAXSS—Return Maximum Scalar Single-Precision Floating-Point Value

Instruction Operand Encoding

Description

Compares the low single-precision floating-point values in the first source operand and the second source operand,

and returns the maximum value to the low doubleword of the destination operand.

If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If

a value in the second source operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a

QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid floating-point value, is written to the result. If the NaN from either source operand must be returned instead, the action of MAXSS can be emulated using a sequence of instructions, such as a comparison followed by AND, ANDN, and OR.
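Because the second source operand is returned whenever a NaN is involved, operand order matters when the operation is chained, for example in a running maximum. The C sketch below models the value rule only; `maxss_value`, `running_max_acc_second`, and `running_max_acc_first` are hypothetical names, and the starting value of negative infinity is an assumption of this example.

```c
#include <math.h>

/* Value rule: src2 is returned unless src1 compares strictly greater. */
static float maxss_value(float src1, float src2) {
    return (src1 > src2) ? src1 : src2;
}

/* Accumulator as second operand: a NaN element never compares
   greater, so the accumulator is kept and the NaN is skipped. */
static float running_max_acc_second(const float *x, int n) {
    float acc = -INFINITY;
    for (int i = 0; i < n; i++) {
        acc = maxss_value(x[i], acc);
    }
    return acc;
}

/* Accumulator as first operand: a NaN element replaces the
   accumulator, and the next element then replaces the NaN. */
static float running_max_acc_first(const float *x, int n) {
    float acc = -INFINITY;
    for (int i = 0; i < n; i++) {
        acc = maxss_value(acc, x[i]);
    }
    return acc;
}
```

For the input {5, NaN, 1} the first loop returns 5, while the second returns 1 because the NaN discards the running maximum.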

The second source operand can be an XMM register or a 32-bit memory location. The first source and destination

operands are XMM registers.

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAXVL-1:32) of the corresponding destination register remain unchanged.

VEX.128 and EVEX encoded version: The first source operand is an XMM register encoded by (E)VEX.vvvv. Bits (127:32) of the XMM register destination are copied from corresponding bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.

EVEX encoded version: The low doubleword element of the destination operand is updated according to the

writemask.

Software should ensure VMAXSS is encoded with VEX.L=0. Encoding VMAXSS with VEX.L=1 may result in unpredictable behavior across different processor generations.

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

F3 0F 5F /r | MAXSS xmm1, xmm2/m32 | A | V/V | SSE | Return the maximum scalar single-precision floating-point value between xmm2/m32 and xmm1.

VEX.LIG.F3.0F.WIG 5F /r | VMAXSS xmm1, xmm2, xmm3/m32 | B | V/V | AVX | Return the maximum scalar single-precision floating-point value between xmm3/m32 and xmm2.

EVEX.LIG.F3.0F.W0 5F /r | VMAXSS xmm1 {k1}{z}, xmm2, xmm3/m32{sae} | C | V/V | AVX512F | Return the maximum scalar single-precision floating-point value between xmm3/m32 and xmm2.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (r, w) ModRM:r/m (r) NA NA

B NA ModRM:reg (w) VEX.vvvv ModRM:r/m (r) NA

C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv ModRM:r/m (r) NA

Operation

MAX(SRC1, SRC2)
{
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC1 > SRC2) THEN DEST ← SRC1;
    ELSE DEST ← SRC2;
    FI;
}

VMAXSS (EVEX encoded version)
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← MAX(SRC1[31:0], SRC2[31:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0

VMAXSS (VEX.128 encoded version)
DEST[31:0] ← MAX(SRC1[31:0], SRC2[31:0])
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0

MAXSS (128-bit Legacy SSE version)
DEST[31:0] ← MAX(DEST[31:0], SRC[31:0])
DEST[MAXVL-1:32] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VMAXSS __m128 _mm_max_round_ss( __m128 a, __m128 b, int);

VMAXSS __m128 _mm_mask_max_round_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, int);

VMAXSS __m128 _mm_maskz_max_round_ss( __mmask8 k, __m128 a, __m128 b, int);

MAXSS __m128 _mm_max_ss(__m128 a, __m128 b)

SIMD Floating-Point Exceptions

Invalid (Including QNaN Source Operand), Denormal

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 3.

EVEX-encoded instruction, see Exceptions Type E3.

MFENCE—Memory Fence

Instruction Operand Encoding

Description

Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior to the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction (see footnote 1). The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction). MFENCE does not serialize the instruction stream.
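A typical place where this ordering matters is a store followed by a load of a different location. The sketch below uses the portable C11 fence rather than the instruction itself. It assumes that the compiler lowers a sequentially consistent fence on x86 to MFENCE or to a LOCK-prefixed instruction, which is common but not guaranteed. `announce_and_check` is a hypothetical function; on x86 builds the `_mm_mfence()` intrinsic listed for this instruction could be used in place of the C11 fence.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical two-flag handshake: announce intent by storing to
   *mine, then check whether the peer has announced in *theirs.
   The fence keeps the earlier store from being ordered after the
   later load. */
static int announce_and_check(atomic_int *mine, atomic_int *theirs) {
    assert(mine != theirs);
    atomic_store_explicit(mine, 1, memory_order_relaxed);
    /* Assumption: lowered to MFENCE or a LOCK-prefixed instruction on x86. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(theirs, memory_order_relaxed) == 0;
}
```

If two threads call this function with the flags swapped, the fences ensure that at most one of them can observe the other's flag as still clear.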

Weakly ordered memory types can be used to achieve higher processor performance through such techniques as

out-of-order issue, speculative reads, write-combining, and write-collapsing. The degree to which a consumer of

data recognizes or knows that the data is weakly ordered varies among applications and may be unknown to the

producer of this data. The MFENCE instruction provides a performance-efficient way of ensuring load and store

ordering between routines that produce weakly-ordered results and routines that consume that data.

Processors are free to fetch and cache data speculatively from regions of system memory that use the WB, WC, and

WT memory types. This speculative fetching can occur at any time and is not tied to instruction execution. Thus, it

is not ordered with respect to executions of the MFENCE instruction; data can be brought into the caches speculatively just before, during, or after the execution of an MFENCE instruction.

This instruction’s operation is the same in non-64-bit modes and 64-bit mode.

The opcode specified for this instruction indicates a ModR/M byte of F0. For this instruction, the processor ignores the r/m field of the ModR/M byte. Thus, MFENCE is encoded by any opcode of the form 0F AE Fx, where x is in the range 0-7.
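The encoding rule above amounts to masking off the low three bits (the r/m field) of the third byte. A minimal C sketch follows; `is_mfence_encoding` is a hypothetical helper that inspects exactly three bytes and does not handle prefixes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the bytes at p match 0F AE Fx with x in 0..7.
   Masking with 0xF8 discards the r/m field, which is ignored. */
static int is_mfence_encoding(const uint8_t *p, size_t len) {
    assert(p != NULL);
    if (len < 3) {
        return 0;
    }
    return p[0] == 0x0F && p[1] == 0xAE && (p[2] & 0xF8) == 0xF0;
}
```

A third byte of F8 or above, or anything below F0, fails the masked comparison and is not treated as MFENCE by this check.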

Operation

Wait_On_Following_Loads_And_Stores_Until(preceding_loads_and_stores_globally_visible);

Intel C/C++ Compiler Intrinsic Equivalent

void _mm_mfence(void)

Exceptions (All Modes of Operation)

#UD If CPUID.01H:EDX.SSE2[bit 26] = 0.

If the LOCK prefix is used.

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description

NP 0F AE F0 | MFENCE | ZO | Valid | Valid | Serializes load and store operations.

Op/En Operand 1 Operand 2 Operand 3 Operand 4

ZO NA NA NA NA

1. A load instruction is considered to become globally visible when the value to be loaded into its destination register is determined.

MINPD—Minimum of Packed Double-Precision Floating-Point Values

Instruction Operand Encoding

Description

Performs a SIMD compare of the packed double-precision floating-point values in the first source operand and the

second source operand and returns the minimum value for each pair of values to the destination operand.

If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned. If a value in the second operand is an SNaN, that SNaN is forwarded unchanged to the destination (that is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN or a valid floating-point value, is written to the result. If the NaN source operand (from either the first or second operand) must be returned instead, the action of MINPD can be emulated using a sequence of instructions, such as a comparison followed by AND, ANDN, and OR.

EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second

source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector

broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally

updated with writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are zeroed.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

66 0F 5D /r | MINPD xmm1, xmm2/m128 | A | V/V | SSE2 | Return the minimum double-precision floating-point values between xmm1 and xmm2/mem.

VEX.128.66.0F.WIG 5D /r | VMINPD xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Return the minimum double-precision floating-point values between xmm2 and xmm3/mem.

VEX.256.66.0F.WIG 5D /r | VMINPD ymm1, ymm2, ymm3/m256 | B | V/V | AVX | Return the minimum packed double-precision floating-point values between ymm2 and ymm3/mem.

EVEX.128.66.0F.W1 5D /r | VMINPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | C | V/V | AVX512VL AVX512F | Return the minimum packed double-precision floating-point values between xmm2 and xmm3/m128/m64bcst and store result in xmm1 subject to writemask k1.

EVEX.256.66.0F.W1 5D /r | VMINPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | C | V/V | AVX512VL AVX512F | Return the minimum packed double-precision floating-point values between ymm2 and ymm3/m256/m64bcst and store result in ymm1 subject to writemask k1.

EVEX.512.66.0F.W1 5D /r | VMINPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{sae} | C | V/V | AVX512F | Return the minimum packed double-precision floating-point values between zmm2 and zmm3/m512/m64bcst and store result in zmm1 subject to writemask k1.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (r, w) ModRM:r/m (r) NA NA

B NA ModRM:reg (w) VEX.vvvv ModRM:r/m (r) NA

C Full ModRM:reg (w) EVEX.vvvv ModRM:r/m (r) NA

Operation

MIN(SRC1, SRC2)
{
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC1 < SRC2) THEN DEST ← SRC1;
    ELSE DEST ← SRC2;
    FI;
}

VMINPD (EVEX encoded version)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN
                    DEST[i+63:i] ← MIN(SRC1[i+63:i], SRC2[63:0])
                ELSE
                    DEST[i+63:i] ← MIN(SRC1[i+63:i], SRC2[i+63:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0
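The embedded-broadcast branch (EVEX.b = 1 with a memory source) compares every lane of the first source against the same 64-bit element. A small C model of the unmasked case is sketched below; `vminpd_bcst_model` and its `bcst` flag are hypothetical, and masking and upper-bit zeroing are omitted.

```c
#include <assert.h>

/* One lane of MIN(): src2 is returned unless src1 < src2. */
static double min_lane(double src1, double src2) {
    return (src1 < src2) ? src1 : src2;
}

/* Model of the unmasked loop over kl lanes.
   bcst nonzero mimics EVEX.b = 1 with a memory source: every lane
   is compared against src2[0] instead of src2[j]. */
static void vminpd_bcst_model(double *dest, const double *src1,
                              const double *src2, int kl, int bcst) {
    assert(kl > 0 && kl <= 8);
    for (int j = 0; j < kl; j++) {
        double s2 = bcst ? src2[0] : src2[j];
        dest[j] = min_lane(src1[j], s2);
    }
}
```

In the broadcast case only one element is read from `src2`, matching the m64bcst operand form in which a single 64-bit memory location supplies all lanes.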

VMINPD (VEX.256 encoded version)
DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← MIN(SRC1[127:64], SRC2[127:64])
DEST[191:128] ← MIN(SRC1[191:128], SRC2[191:128])
DEST[255:192] ← MIN(SRC1[255:192], SRC2[255:192])
DEST[MAXVL-1:256] ← 0

VMINPD (VEX.128 encoded version)
DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← MIN(SRC1[127:64], SRC2[127:64])
DEST[MAXVL-1:128] ← 0

MINPD (128-bit Legacy SSE version)
DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← MIN(SRC1[127:64], SRC2[127:64])
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VMINPD __m512d _mm512_min_pd( __m512d a, __m512d b);

VMINPD __m512d _mm512_mask_min_pd(__m512d s, __mmask8 k, __m512d a, __m512d b);

VMINPD __m512d _mm512_maskz_min_pd( __mmask8 k, __m512d a, __m512d b);

VMINPD __m512d _mm512_min_round_pd( __m512d a, __m512d b, int);

VMINPD __m512d _mm512_mask_min_round_pd(__m512d s, __mmask8 k, __m512d a, __m512d b, int);

VMINPD __m512d _mm512_maskz_min_round_pd( __mmask8 k, __m512d a, __m512d b, int);

VMINPD __m256d _mm256_mask_min_pd(__m256d s, __mmask8 k, __m256d a, __m256d b);

VMINPD __m256d _mm256_maskz_min_pd( __mmask8 k, __m256d a, __m256d b);

VMINPD __m128d _mm_mask_min_pd(__m128d s, __mmask8 k, __m128d a, __m128d b);

VMINPD __m128d _mm_maskz_min_pd( __mmask8 k, __m128d a, __m128d b);

VMINPD __m256d _mm256_min_pd (__m256d a, __m256d b);

MINPD __m128d _mm_min_pd (__m128d a, __m128d b);

SIMD Floating-Point Exceptions

Invalid (including QNaN Source Operand), Denormal

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 2.

EVEX-encoded instruction, see Exceptions Type E2.

MINPS—Minimum of Packed Single-Precision Floating-Point Values

Instruction Operand Encoding

Description

Performs a SIMD compare of the packed single-precision floating-point values in the first source operand and the

second source operand and returns the minimum value for each pair of values to the destination operand.

If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned. If a value in the second operand is an SNaN, that SNaN is forwarded unchanged to the destination (that is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN or a valid floating-point value, is written to the result. If the NaN source operand (from either the first or second operand) must be returned instead, the action of MINPS can be emulated using a sequence of instructions, such as a comparison followed by AND, ANDN, and OR.

EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second

source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector

broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally

updated with writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are zeroed.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

NP 0F 5D /r | MINPS xmm1, xmm2/m128 | A | V/V | SSE | Return the minimum single-precision floating-point values between xmm1 and xmm2/mem.

VEX.128.0F.WIG 5D /r | VMINPS xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Return the minimum single-precision floating-point values between xmm2 and xmm3/mem.

VEX.256.0F.WIG 5D /r | VMINPS ymm1, ymm2, ymm3/m256 | B | V/V | AVX | Return the minimum single-precision floating-point values between ymm2 and ymm3/mem.

EVEX.128.0F.W0 5D /r | VMINPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | C | V/V | AVX512VL AVX512F | Return the minimum packed single-precision floating-point values between xmm2 and xmm3/m128/m32bcst and store result in xmm1 subject to writemask k1.

EVEX.256.0F.W0 5D /r | VMINPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | C | V/V | AVX512VL AVX512F | Return the minimum packed single-precision floating-point values between ymm2 and ymm3/m256/m32bcst and store result in ymm1 subject to writemask k1.

EVEX.512.0F.W0 5D /r | VMINPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{sae} | C | V/V | AVX512F | Return the minimum packed single-precision floating-point values between zmm2 and zmm3/m512/m32bcst and store result in zmm1 subject to writemask k1.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (r, w) ModRM:r/m (r) NA NA

B NA ModRM:reg (w) VEX.vvvv ModRM:r/m (r) NA

C Full ModRM:reg (w) EVEX.vvvv ModRM:r/m (r) NA

Operation

MIN(SRC1, SRC2)
{
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC1 < SRC2) THEN DEST ← SRC1;
    ELSE DEST ← SRC2;
    FI;
}

VMINPS (EVEX encoded version)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN
                    DEST[i+31:i] ← MIN(SRC1[i+31:i], SRC2[31:0])
                ELSE
                    DEST[i+31:i] ← MIN(SRC1[i+31:i], SRC2[i+31:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0
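The relationship between VL and KL, and the final zeroing of DEST[MAXVL-1:VL], can be modeled with a fixed 16-lane array. This is a sketch under the assumption that MAXVL is 512 bits; `vminps_vl_model` and `MAX_LANES` are hypothetical names, and masking and broadcast are omitted.

```c
#include <assert.h>

#define MAX_LANES 16 /* assumption: MAXVL = 512 bits, 32-bit lanes */

static float min_lane_ps(float src1, float src2) {
    return (src1 < src2) ? src1 : src2;
}

/* vl is 128, 256, or 512 and KL = vl / 32. Lanes below KL receive
   the minimum; lanes from KL up to MAX_LANES are zeroed, modeling
   DEST[MAXVL-1:VL] <- 0. Returns KL. */
static int vminps_vl_model(float dest[MAX_LANES], const float *src1,
                           const float *src2, int vl) {
    assert(vl == 128 || vl == 256 || vl == 512);
    int kl = vl / 32;
    for (int j = 0; j < kl; j++) {
        dest[j] = min_lane_ps(src1[j], src2[j]);
    }
    for (int j = kl; j < MAX_LANES; j++) {
        dest[j] = 0.0f;
    }
    return kl;
}
```

The second loop is what distinguishes the VEX and EVEX forms from the legacy SSE form, which leaves the upper bits of the destination unmodified.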

VMINPS (VEX.256 encoded version)
DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0])
DEST[63:32] ← MIN(SRC1[63:32], SRC2[63:32])
DEST[95:64] ← MIN(SRC1[95:64], SRC2[95:64])
DEST[127:96] ← MIN(SRC1[127:96], SRC2[127:96])
DEST[159:128] ← MIN(SRC1[159:128], SRC2[159:128])
DEST[191:160] ← MIN(SRC1[191:160], SRC2[191:160])
DEST[223:192] ← MIN(SRC1[223:192], SRC2[223:192])
DEST[255:224] ← MIN(SRC1[255:224], SRC2[255:224])
DEST[MAXVL-1:256] ← 0

VMINPS (VEX.128 encoded version)
DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0])
DEST[63:32] ← MIN(SRC1[63:32], SRC2[63:32])
DEST[95:64] ← MIN(SRC1[95:64], SRC2[95:64])
DEST[127:96] ← MIN(SRC1[127:96], SRC2[127:96])
DEST[MAXVL-1:128] ← 0

MINPS (128-bit Legacy SSE version)

DEST[31:0] MIN(SRC1[31:0], SRC2[31:0])

DEST[63:32] MIN(SRC1[63:32], SRC2[63:32])

DEST[95:64] MIN(SRC1[95:64], SRC2[95:64])

DEST[127:96] MIN(SRC1[127:96], SRC2[127:96])

DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VMINPS __m512 _mm512_min_ps( __m512 a, __m512 b);

VMINPS __m512 _mm512_mask_min_ps(__m512 s, __mmask16 k, __m512 a, __m512 b);

VMINPS __m512 _mm512_maskz_min_ps( __mmask16 k, __m512 a, __m512 b);

VMINPS __m512 _mm512_min_round_ps( __m512 a, __m512 b, int);

VMINPS __m512 _mm512_mask_min_round_ps(__m512 s, __mmask16 k, __m512 a, __m512 b, int);

VMINPS __m512 _mm512_maskz_min_round_ps( __mmask16 k, __m512 a, __m512 b, int);

VMINPS __m256 _mm256_mask_min_ps(__m256 s, __mmask8 k, __m256 a, __m256 b);

VMINPS __m256 _mm256_maskz_min_ps( __mmask8 k, __m256 a, __m256 b);

VMINPS __m128 _mm_mask_min_ps(__m128 s, __mmask8 k, __m128 a, __m128 b);

VMINPS __m128 _mm_maskz_min_ps( __mmask8 k, __m128 a, __m128 b);

VMINPS __m256 _mm256_min_ps (__m256 a, __m256 b);

MINPS __m128 _mm_min_ps (__m128 a, __m128 b);

SIMD Floating-Point Exceptions

Invalid (including QNaN Source Operand), Denormal

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 2.

EVEX-encoded instruction, see Exceptions Type E2.

MINSD—Return Minimum Scalar Double-Precision Floating-Point Value

Instruction Operand Encoding

Description

Compares the low double-precision floating-point values in the first source operand and the second source operand, and returns the minimum value to the low quadword of the destination operand. When the second source operand is a memory operand, only 64 bits are accessed.

If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If a value in the second source operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid floating-point value, is written to the result. If the NaN source operand (from either the first or second source) must be returned instead, the action of MINSD can be emulated using a sequence of instructions, such as a comparison followed by AND, ANDN, and OR.

The second source operand can be an XMM register or a 64-bit memory location. The first source and destination

operands are XMM registers.

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAXVL-1:64) of the

corresponding destination register remain unchanged.

VEX.128 and EVEX encoded version: Bits (127:64) of the XMM register destination are copied from corresponding

bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.

EVEX encoded version: The low quadword element of the destination operand is updated according to the

writemask.

Software should ensure VMINSD is encoded with VEX.L=0. Encoding VMINSD with VEX.L=1 may result in unpredictable behavior across different processor generations.

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description

F2 0F 5D /r | MINSD xmm1, xmm2/m64 | A | V/V | SSE2 | Return the minimum scalar double-precision floating-point value between xmm2/m64 and xmm1.

VEX.LIG.F2.0F.WIG 5D /r | VMINSD xmm1, xmm2, xmm3/m64 | B | V/V | AVX | Return the minimum scalar double-precision floating-point value between xmm3/m64 and xmm2.

EVEX.LIG.F2.0F.W1 5D /r | VMINSD xmm1 {k1}{z}, xmm2, xmm3/m64{sae} | C | V/V | AVX512F | Return the minimum scalar double-precision floating-point value between xmm3/m64 and xmm2.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (r, w) ModRM:r/m (r) NA NA

B NA ModRM:reg (w) VEX.vvvv ModRM:r/m (r) NA

C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv ModRM:r/m (r) NA

Operation

MIN(SRC1, SRC2)
{
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC1 < SRC2) THEN DEST ← SRC1;
    ELSE DEST ← SRC2;
    FI;
}

VMINSD (EVEX encoded version)
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0
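For the scalar EVEX form only mask bit k1[0] matters, and the upper quadword always comes from the first source regardless of the mask. The C sketch below models bits 127:0 only; the struct `xmm2d` and the function `vminsd_masked_model` are hypothetical names.

```c
#include <assert.h>

/* Hypothetical model of the low 128 bits of a vector register. */
typedef struct { double lane[2]; } xmm2d;

/* k0:      value of writemask bit k1[0] (0 or 1)
   zeroing: nonzero models {z}, zero models merging-masking
   dest:    previous destination contents, used when merging */
static xmm2d vminsd_masked_model(xmm2d dest, xmm2d src1, xmm2d src2,
                                 int k0, int zeroing) {
    assert(k0 == 0 || k0 == 1);
    if (k0) {
        dest.lane[0] = (src1.lane[0] < src2.lane[0]) ? src1.lane[0]
                                                     : src2.lane[0];
    } else if (zeroing) {
        dest.lane[0] = 0.0;
    }
    /* merging-masking with k0 = 0: low quadword keeps its old value */
    dest.lane[1] = src1.lane[1]; /* upper quadword is never masked */
    return dest;
}
```

Even when the mask bit is clear, the destination is still written: its upper quadword is replaced by that of the first source.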

VMINSD (VEX.128 encoded version)
DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0

MINSD (128-bit Legacy SSE version)
DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0])
DEST[MAXVL-1:64] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VMINSD __m128d _mm_min_round_sd(__m128d a, __m128d b, int);

VMINSD __m128d _mm_mask_min_round_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, int);

VMINSD __m128d _mm_maskz_min_round_sd( __mmask8 k, __m128d a, __m128d b, int);

MINSD __m128d _mm_min_sd(__m128d a, __m128d b)

SIMD Floating-Point Exceptions

Invalid (including QNaN Source Operand), Denormal

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 3.

EVEX-encoded instruction, see Exceptions Type E3.

MINSS—Return Minimum Scalar Single-Precision Floating-Point Value

Instruction Operand Encoding

Description

Compares the low single-precision floating-point values in the first source operand and the second source operand

and returns the minimum value to the low doubleword of the destination operand.

If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If

a value in the second operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a QNaN

version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid floating-point value, is written to the result. If the NaN in either source operand must be returned instead, the action of MINSS can be emulated using a sequence of instructions, such as a comparison followed by AND, ANDN, and OR.

The second source operand can be an XMM register or a 32-bit memory location. The first source and destination

operands are XMM registers.

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAXVL-1:32) of the corresponding destination register remain unchanged.

VEX.128 and EVEX encoded version: The first source operand is an xmm register encoded by (E)VEX.vvvv. Bits

(127:32) of the XMM register destination are copied from corresponding bits in the first source operand. Bits

(MAXVL-1:128) of the destination register are zeroed.

EVEX encoded version: The low doubleword element of the destination operand is updated according to the

writemask.

Software should ensure VMINSS is encoded with VEX.L=0. Encoding VMINSS with VEX.L=1 may encounter unpre-

dictable behavior across different processor generations.

Opcode/Instruction Op/En 64/32 bit Mode Support CPUID Feature Flag Description

F3 0F 5D /r MINSS xmm1, xmm2/m32 A V/V SSE Return the minimum scalar single-precision floating-point value between xmm2/m32 and xmm1.

VEX.LIG.F3.0F.WIG 5D /r VMINSS xmm1, xmm2, xmm3/m32 B V/V AVX Return the minimum scalar single-precision floating-point value between xmm3/m32 and xmm2.

EVEX.LIG.F3.0F.W0 5D /r VMINSS xmm1 {k1}{z}, xmm2, xmm3/m32{sae} C V/V AVX512F Return the minimum scalar single-precision floating-point value between xmm3/m32 and xmm2.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (r, w) ModRM:r/m (r) NA NA

B NA ModRM:reg (w) VEX.vvvv ModRM:r/m (r) NA

C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv ModRM:r/m (r) NA


Operation

MIN(SRC1, SRC2)
{
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
        ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
        ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
        ELSE IF (SRC1 < SRC2) THEN DEST ← SRC1;
        ELSE DEST ← SRC2;
    FI;
}
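The zero and NaN rules in the Description reduce to a single ordered comparison, which a short C model can show. This is a sketch under stated assumptions: minss_rule and min_nan_from_either are hypothetical helper names, the code models only the returned value (not exception flags or the exact SNaN bit pattern), and it assumes IEEE 754 comparison semantics (no fast-math options).

```c
#include <assert.h>
#include <math.h>

/* Model of MIN(SRC1, SRC2) for MINSS: SRC1 is returned only when the
 * ordered comparison SRC1 < SRC2 is true. The comparison is false when
 * either input is a NaN and when both inputs are zeros of either sign,
 * so SRC2 is returned in all of those cases. */
static float minss_rule(float src1, float src2) {
    return (src1 < src2) ? src1 : src2;
}

/* Variant that returns the NaN from either operand, the behavior the
 * Description says must be emulated with extra instructions. */
static float min_nan_from_either(float src1, float src2) {
    if (isnan(src1)) {
        return src1;               /* NaN in the first source wins */
    }
    return minss_rule(src1, src2); /* NaN in the second source is returned anyway */
}
```

Because the result depends on operand order whenever a NaN or two zeros are involved, swapping the arguments of the instruction can change the result.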

VMINSS (EVEX encoded version)

IF k1[0] or *no writemask*
    THEN DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0

VMINSS (VEX.128 encoded version)

DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0])
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0

MINSS (128-bit Legacy SSE version)

DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0])
DEST[MAXVL-1:32] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VMINSS __m128 _mm_min_round_ss( __m128 a, __m128 b, int);

VMINSS __m128 _mm_mask_min_round_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, int);

VMINSS __m128 _mm_maskz_min_round_ss( __mmask8 k, __m128 a, __m128 b, int);

MINSS __m128 _mm_min_ss(__m128 a, __m128 b)

SIMD Floating-Point Exceptions

Invalid (Including QNaN Source Operand), Denormal

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 2.

EVEX-encoded instruction, see Exceptions Type E2.

MONITOR—Set Up Monitor Address


Instruction Operand Encoding

Description

The MONITOR instruction arms address monitoring hardware using an address specified in EAX (the address range

that the monitoring hardware checks for store operations can be determined by using CPUID). A store to an

address within the specified address range triggers the monitoring hardware. The state of monitor hardware is

used by MWAIT.

The address is specified in RAX/EAX/AX and the size is based on the effective address size of the encoded instruc-

tion. By default, the DS segment is used to create a linear address that is monitored. Segment overrides can be

used.

ECX and EDX communicate additional information to MONITOR. ECX specifies optional extensions. EDX specifies optional hints; it does not change the architectural behavior of the instruction. For the Pentium 4 processor (family 15, model 3), no extensions or hints are defined. The processor ignores undefined hints in EDX; undefined extensions in ECX raise a general-protection fault.

The address range must use memory of the write-back type. Only write-back memory will correctly trigger the

monitoring hardware. Additional information on determining what address range to use in order to prevent false

wake-ups is described in Chapter 8, “Multiple-Processor Management” of the Intel® 64 and IA-32 Architectures

Software Developer’s Manual, Volume 3A.

The MONITOR instruction is ordered as a load operation with respect to other memory transactions. The instruction

is subject to the permission checking and faults associated with a byte load. Like a load, MONITOR sets the A-bit

but not the D-bit in page tables.

CPUID.01H:ECX.MONITOR[bit 3] indicates the availability of MONITOR and MWAIT in the processor. When set,

MONITOR may be executed only at privilege level 0 (use at any other privilege level results in an invalid-opcode

exception). The operating system or system BIOS may disable this instruction by using the IA32_MISC_ENABLE

MSR; disabling MONITOR clears the CPUID feature flag and causes execution to generate an invalid-opcode excep-

tion.
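The availability and privilege rules above can be summarized as a small decision function. This is a sketch only: monitor_available, monitor_check, and the monitor_result type are hypothetical names, the raw value of CPUID.01H:ECX is passed in by the caller rather than read from the processor, and the check order (#UD before #GP) is an assumption. Segment-limit and page-fault checks are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPUID.01H:ECX.MONITOR is bit 3. */
#define CPUID_01H_ECX_MONITOR (UINT32_C(1) << 3)

typedef enum { MONITOR_OK, MONITOR_UD, MONITOR_GP } monitor_result;

/* True when the feature flag reports MONITOR/MWAIT support. */
static bool monitor_available(uint32_t cpuid_01h_ecx) {
    return (cpuid_01h_ecx & CPUID_01H_ECX_MONITOR) != 0;
}

/* Predicts the outcome of MONITOR from the feature flag, the current
 * privilege level, and the extensions value supplied in ECX. */
static monitor_result monitor_check(uint32_t cpuid_01h_ecx, unsigned cpl,
                                    uint32_t ecx_extensions) {
    if (!monitor_available(cpuid_01h_ecx) || cpl != 0) {
        return MONITOR_UD;   /* invalid-opcode exception */
    }
    if (ecx_extensions != 0) {
        return MONITOR_GP;   /* undefined extension requested */
    }
    return MONITOR_OK;
}
```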

The instruction’s operation is the same in non-64-bit modes and 64-bit mode.

Operation

MONITOR sets up an address range for the monitor hardware using the content of EAX (RAX in 64-bit mode) as an

effective address and puts the monitor hardware in armed state. Always use memory of the write-back caching

type. A store to the specified address range will trigger the monitor hardware. The contents of ECX and EDX are used to communicate other information to the monitor hardware.

Intel C/C++ Compiler Intrinsic Equivalent

MONITOR: void _mm_monitor(void const *p, unsigned extensions,unsigned hints)

Numeric Exceptions

None

Opcode Instruction Op/En 64-Bit Mode Compat/Leg Mode Description

0F 01 C8 MONITOR ZO Valid Valid Sets up a linear address range to be monitored by hardware and activates the monitor. The address range should be a write-back memory caching type. The address is DS:RAX/EAX/AX.

Op/En Operand 1 Operand 2 Operand 3 Operand 4

ZO NA NA NA NA


Protected Mode Exceptions

#GP(0) If the value in EAX is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register is used to access memory and it contains a NULL segment

selector.

If ECX ≠ 0.

#SS(0) If the value in EAX is outside the SS segment limit.

#PF(fault-code) For a page fault.

#UD If CPUID.01H:ECX.MONITOR[bit 3] = 0.

If current privilege level is not 0.

Real Address Mode Exceptions

#GP If the CS, DS, ES, FS, or GS register is used to access memory and the value in EAX is outside

of the effective address space from 0 to FFFFH.

If ECX ≠ 0.

#SS If the SS register is used to access memory and the value in EAX is outside of the effective

address space from 0 to FFFFH.

#UD If CPUID.01H:ECX.MONITOR[bit 3] = 0.

Virtual 8086 Mode Exceptions

#UD The MONITOR instruction is not recognized in virtual-8086 mode (even if

CPUID.01H:ECX.MONITOR[bit 3] = 1).

Compatibility Mode Exceptions

Same exceptions as in protected mode.

64-Bit Mode Exceptions

#GP(0) If the linear address of the operand in the CS, DS, ES, FS, or GS segment is in a non-canonical

form.

If RCX ≠ 0.

#SS(0) If the SS register is used to access memory and the value in EAX is in a non-canonical form.

#PF(fault-code) For a page fault.

#UD If the current privilege level is not 0.

If CPUID.01H:ECX.MONITOR[bit 3] = 0.

MOV—Move


Opcode Instruction Op/En 64-Bit Mode Compat/Leg Mode Description

88 /r MOV r/m8,r8 MR Valid Valid Move r8 to r/m8.

REX + 88 /r MOV r/m8***,r8*** MR Valid N.E. Move r8 to r/m8.

89 /r MOV r/m16,r16 MR Valid Valid Move r16 to r/m16.

89 /r MOV r/m32,r32 MR Valid Valid Move r32 to r/m32.

REX.W + 89 /r MOV r/m64,r64 MR Valid N.E. Move r64 to r/m64.

8A /r MOV r8,r/m8 RM Valid Valid Move r/m8 to r8.

REX + 8A /r MOV r8***,r/m8*** RM Valid N.E. Move r/m8 to r8.

8B /r MOV r16,r/m16 RM Valid Valid Move r/m16 to r16.

8B /r MOV r32,r/m32 RM Valid Valid Move r/m32 to r32.

REX.W + 8B /r MOV r64,r/m64 RM Valid N.E. Move r/m64 to r64.

8C /r MOV r/m16,Sreg** MR Valid Valid Move segment register to r/m16.

8C /r MOV r16/r32/m16, Sreg** MR Valid Valid Move zero extended 16-bit segment register to r16/r32/m16.

REX.W + 8C /r MOV r64/m16, Sreg** MR Valid Valid Move zero extended 16-bit segment register to r64/m16.

8E /r MOV Sreg,r/m16** RM Valid Valid Move r/m16 to segment register.

REX.W + 8E /r MOV Sreg,r/m64** RM Valid Valid Move lower 16 bits of r/m64 to segment register.

A0 MOV AL,moffs8* FD Valid Valid Move byte at (seg:offset) to AL.

REX.W + A0 MOV AL,moffs8* FD Valid N.E. Move byte at (offset) to AL.

A1 MOV AX,moffs16* FD Valid Valid Move word at (seg:offset) to AX.

A1 MOV EAX,moffs32* FD Valid Valid Move doubleword at (seg:offset) to EAX.

REX.W + A1 MOV RAX,moffs64* FD Valid N.E. Move quadword at (offset) to RAX.

A2 MOV moffs8,AL TD Valid Valid Move AL to (seg:offset).

REX.W + A2 MOV moffs8***,AL TD Valid N.E. Move AL to (offset).

A3 MOV moffs16*,AX TD Valid Valid Move AX to (seg:offset).

A3 MOV moffs32*,EAX TD Valid Valid Move EAX to (seg:offset).

REX.W + A3 MOV moffs64*,RAX TD Valid N.E. Move RAX to (offset).

B0+ rb ib MOV r8, imm8 OI Valid Valid Move imm8 to r8.

REX + B0+ rb ib MOV r8***, imm8 OI Valid N.E. Move imm8 to r8.

B8+ rw iw MOV r16, imm16 OI Valid Valid Move imm16 to r16.

B8+ rd id MOV r32, imm32 OI Valid Valid Move imm32 to r32.

REX.W + B8+ rd io MOV r64, imm64 OI Valid N.E. Move imm64 to r64.

C6 /0 ib MOV r/m8, imm8 MI Valid Valid Move imm8 to r/m8.

REX + C6 /0 ib MOV r/m8***, imm8 MI Valid N.E. Move imm8 to r/m8.

C7 /0 iw MOV r/m16, imm16 MI Valid Valid Move imm16 to r/m16.

C7 /0 id MOV r/m32, imm32 MI Valid Valid Move imm32 to r/m32.

REX.W + C7 /0 id MOV r/m64, imm32 MI Valid N.E. Move imm32 sign extended to 64-bits to r/m64.


Instruction Operand Encoding

Description

Copies the second operand (source operand) to the first operand (destination operand). The source operand can be an immediate value, general-purpose register, segment register, or memory location; the destination operand can be a general-purpose register, segment register, or memory location. Both operands must be the same size, which can be a byte, a word, a doubleword, or a quadword.

The MOV instruction cannot be used to load the CS register. Attempting to do so results in an invalid opcode excep-

tion (#UD). To load the CS register, use the far JMP, CALL, or RET instruction.

If the destination operand is a segment register (DS, ES, FS, GS, or SS), the source operand must be a valid

segment selector. In protected mode, moving a segment selector into a segment register automatically causes the

segment descriptor information associated with that segment selector to be loaded into the hidden (shadow) part

of the segment register. While loading this information, the segment selector and segment descriptor information

is validated (see the “Operation” algorithm below). The segment descriptor data is obtained from the GDT or LDT

entry for the specified segment selector.

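A NULL segment selector (values 0000-0003) can be loaded into the DS, ES, FS, and GS registers without causing a protection exception. However, any subsequent attempt to reference a segment whose corresponding segment register is loaded with a NULL selector causes a general-protection exception (#GP), and no memory reference occurs.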

Loading the SS register with a MOV instruction suppresses or inhibits some debug exceptions and inhibits inter-

rupts on the following instruction boundary. (The inhibition ends after delivery of an exception or the execution of

the next instruction.) This behavior allows a stack pointer to be loaded into the ESP register with the next instruc-

tion (MOV ESP, stack-pointer value) before an event can be delivered. See Section 6.8.3, “Masking Exceptions

and Interrupts When Switching Stacks,” in Intel® 64 and IA-32 Architectures Software Developer’s Manual,

Volume 3A. Intel recommends that software use the LSS instruction to load the SS register and ESP together.

When executing MOV Reg, Sreg, the processor copies the content of Sreg to the 16 least significant bits of the

general-purpose register. The upper bits of the destination register are zero for most IA-32 processors (Pentium

Pro processors and later) and all Intel 64 processors, with the exception that bits 31:16 are undefined for Intel

Quark X1000 processors, Pentium and earlier processors.

In 64-bit mode, the instruction’s default operation size is 32 bits. Use of the REX.R prefix permits access to addi-

tional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the

beginning of this section for encoding data and limits.
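As an illustration of how the REX prefix combines with the B8+rd forms from the summary chart, the following sketch emits the bytes for MOV r32, imm32 and MOV r64, imm64. The function name encode_mov_reg_imm is hypothetical, the code assumes 64-bit mode (REX prefixes are not available elsewhere), and it relies on the fact that for opcode+rd encodings the register number is extended by REX.B.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes the encoding of MOV r32, imm32 (B8+rd id) or, when wide is
 * nonzero, MOV r64, imm64 (REX.W + B8+rd io) into out.
 * reg is the register number 0..15; out must hold at least 10 bytes.
 * Returns the number of bytes written. */
static size_t encode_mov_reg_imm(uint8_t *out, unsigned reg,
                                 uint64_t imm, int wide) {
    assert(reg < 16);
    size_t n = 0;
    uint8_t rex = 0x40;
    if (wide)     rex |= 0x08;   /* REX.W: 64-bit operand size */
    if (reg >= 8) rex |= 0x01;   /* REX.B: extends the opcode register field */
    if (rex != 0x40) {
        out[n++] = rex;          /* emit a prefix only when a bit is needed */
    }
    out[n++] = (uint8_t)(0xB8 + (reg & 7));
    size_t imm_bytes = wide ? 8 : 4;
    for (size_t i = 0; i < imm_bytes; i++) {
        out[n++] = (uint8_t)(imm >> (8 * i));   /* little-endian immediate */
    }
    return n;
}
```

For example, register 0 with wide set yields the byte 48 followed by B8 and eight immediate bytes, which is the 10-byte form of MOV RAX, imm64.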

NOTES:

*The moffs8, moffs16, moffs32 and moffs64 operands specify a simple offset relative to the segment base, where 8, 16, 32 and 64

refer to the size of the data. The address-size attribute of the instruction determines the size of the offset, either 16, 32 or 64

bits.

** In 32-bit mode, the assembler may insert the 16-bit operand-size prefix with this instruction (see the following “Description” sec-

tion for further information).

***In 64-bit mode, r/m8 can not be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.

Op/En Operand 1 Operand 2 Operand 3 Operand 4

MR ModRM:r/m (w) ModRM:reg (r) NA NA

RM ModRM:reg (w) ModRM:r/m (r) NA NA

FD AL/AX/EAX/RAX Moffs NA NA

TD Moffs (w) AL/AX/EAX/RAX NA NA

OI opcode + rd (w) imm8/16/32/64 NA NA

MI ModRM:r/m (w) imm8/16/32/64 NA NA


Operation

DEST ← SRC;

Loading a segment register while in protected mode results in special checks and actions, as described in the

following listing. These checks are performed on the segment selector and the segment descriptor to which it

points.

IF SS is loaded
    THEN
        IF segment selector is NULL
            THEN #GP(0); FI;
        IF segment selector index is outside descriptor table limits
        OR segment selector's RPL ≠ CPL
        OR segment is not a writable data segment
        OR DPL ≠ CPL
            THEN #GP(selector); FI;
        IF segment not marked present
            THEN #SS(selector);
            ELSE
                SS ← segment selector;
                SS ← segment descriptor; FI;
FI;

IF DS, ES, FS, or GS is loaded with non-NULL selector
    THEN
        IF segment selector index is outside descriptor table limits
        OR segment is not a data or readable code segment
        OR ((segment is a data or nonconforming code segment) AND ((RPL > DPL) or (CPL > DPL)))
            THEN #GP(selector); FI;
        IF segment not marked present
            THEN #NP(selector);
            ELSE
                SegmentRegister ← segment selector;
                SegmentRegister ← segment descriptor; FI;
FI;

IF DS, ES, FS, or GS is loaded with NULL selector
    THEN
        SegmentRegister ← segment selector;
        SegmentRegister ← segment descriptor;
FI;
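The SS branch of the listing above can be expressed as a small C function, which makes the order of the checks explicit. This is a sketch, not processor behavior in full: seg_desc_info, ss_load_result, selector_is_null, and check_ss_load are hypothetical names, and the descriptor lookup is assumed to have been done by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { LOAD_OK, FAULT_GP0, FAULT_GP_SEL, FAULT_SS_SEL } ss_load_result;

/* Facts about the descriptor the selector points to (assumed inputs). */
typedef struct {
    bool in_table_limits;   /* selector index is within the GDT/LDT limit */
    bool writable_data;     /* descriptor is a writable data segment */
    bool present;           /* descriptor is marked present */
    unsigned dpl;           /* descriptor privilege level */
} seg_desc_info;

/* A NULL selector has index 0 and TI = 0; the RPL bits may hold any
 * value, giving the range 0000H-0003H. */
static bool selector_is_null(uint16_t sel) {
    return (sel & 0xFFFCu) == 0;
}

/* Applies the protected-mode checks for a MOV to SS in listing order. */
static ss_load_result check_ss_load(uint16_t sel, seg_desc_info d,
                                    unsigned cpl) {
    if (selector_is_null(sel)) {
        return FAULT_GP0;
    }
    unsigned rpl = sel & 3u;
    if (!d.in_table_limits || rpl != cpl || !d.writable_data || d.dpl != cpl) {
        return FAULT_GP_SEL;
    }
    if (!d.present) {
        return FAULT_SS_SEL;
    }
    return LOAD_OK;
}
```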

Flags Affected

None


Protected Mode Exceptions

#GP(0) If attempt is made to load SS register with NULL segment selector.

If the destination operand is in a non-writable segment.

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register contains a NULL segment selector.

#GP(selector) If segment selector index is outside descriptor table limits.

If the SS register is being loaded and the segment selector's RPL and the segment descriptor’s

DPL are not equal to the CPL.

If the SS register is being loaded and the segment pointed to is a

non-writable data segment.

If the DS, ES, FS, or GS register is being loaded and the segment pointed to is not a data or

readable code segment.

If the DS, ES, FS, or GS register is being loaded and the segment pointed to is a data or

nonconforming code segment, and either the RPL or the CPL is greater than the DPL.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#SS(selector) If the SS register is being loaded and the segment pointed to is marked not present.

#NP If the DS, ES, FS, or GS register is being loaded and the segment pointed to is marked not

present.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the

current privilege level is 3.

#UD If attempt is made to load the CS register.

If the LOCK prefix is used.

Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If attempt is made to load the CS register.

If the LOCK prefix is used.

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made.

#UD If attempt is made to load the CS register.

If the LOCK prefix is used.

Compatibility Mode Exceptions

Same exceptions as in protected mode.


64-Bit Mode Exceptions

#GP(0) If the memory address is in a non-canonical form.

If an attempt is made to load SS register with NULL segment selector when CPL = 3.

If an attempt is made to load SS register with NULL segment selector when CPL < 3 and CPL

≠ RPL.

#GP(selector) If segment selector index is outside descriptor table limits.

If the memory access to the descriptor table is non-canonical.

If the SS register is being loaded and the segment selector's RPL and the segment descriptor’s

DPL are not equal to the CPL.

If the SS register is being loaded and the segment pointed to is a nonwritable data segment.

If the DS, ES, FS, or GS register is being loaded and the segment pointed to is not a data or

readable code segment.

If the DS, ES, FS, or GS register is being loaded and the segment pointed to is a data or

nonconforming code segment, but both the RPL and the CPL are greater than the DPL.

#SS(0) If the stack address is in a non-canonical form.

#SS(selector) If the SS register is being loaded and the segment pointed to is marked not present.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the

current privilege level is 3.

#UD If attempt is made to load the CS register.

If the LOCK prefix is used.

MOV—Move to/from Control Registers


Instruction Operand Encoding

Description

Moves the contents of a control register (CR0, CR2, CR3, CR4, or CR8) to a general-purpose register or the

contents of a general purpose register to a control register. The operand size for these instructions is always 32 bits

in non-64-bit modes, regardless of the operand-size attribute. (See “Control Registers” in Chapter 2 of the Intel®

64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for a detailed description of the flags and

fields in the control registers.) This instruction can be executed only when the current privilege level is 0.

At the opcode level, the reg field within the ModR/M byte specifies which of the control registers is loaded or read.

The 2 bits in the mod field are ignored. The r/m field specifies the general-purpose register loaded or read.

Attempts to reference CR1, CR5, CR6, CR7, and CR9–CR15 result in undefined opcode (#UD) exceptions.
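To make the ModR/M layout concrete, the following sketch builds the byte sequence for a 64-bit-mode MOV to or from a control register. The function name encode_mov_cr is hypothetical. Because the mod field is ignored, the sketch sets it to 11B, which is an assumption about assembler convention rather than a requirement. The REX.R and REX.B usage follows the opcode table for this instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes MOV CRn, r64 (0F 22 /r) when to_cr is nonzero, otherwise
 * MOV r64, CRn (0F 20 /r). cr must be 0, 2, 3, 4, or 8; gpr is 0..15.
 * out must hold at least 4 bytes. Returns the number of bytes written. */
static size_t encode_mov_cr(uint8_t *out, unsigned cr, unsigned gpr,
                            int to_cr) {
    assert(cr == 0 || cr == 2 || cr == 3 || cr == 4 || cr == 8);
    assert(gpr < 16);
    size_t n = 0;
    uint8_t rex = 0x40;
    if (cr == 8)  rex |= 0x04;   /* REX.R selects CR8 via the reg field */
    if (gpr >= 8) rex |= 0x01;   /* REX.B selects R8-R15 via the r/m field */
    if (rex != 0x40) {
        out[n++] = rex;
    }
    out[n++] = 0x0F;
    out[n++] = to_cr ? 0x22 : 0x20;
    /* mod = 11B (ignored by the processor), reg = control register,
     * r/m = general-purpose register. */
    out[n++] = (uint8_t)(0xC0 | ((cr & 7) << 3) | (gpr & 7));
    return n;
}
```

Calling encode_mov_cr with cr = 3, gpr = 0, and to_cr set produces 0F 22 D8, the encoding of MOV CR3, RAX.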

When loading control registers, programs should not attempt to change the reserved bits; that is, always set

reserved bits to the value previously read. An attempt to change CR4's reserved bits will cause a general protection

fault. Reserved bits in CR0 and CR3 remain clear after any load of those registers; attempts to set them have no

impact. On Pentium 4, Intel Xeon and P6 family processors, CR0.ET remains set after any load of CR0; attempts to

clear this bit have no impact.

In certain cases, these instructions have the side effect of invalidating entries in the TLBs and the paging-structure

caches. See Section 4.10.4.1, “Operations that Invalidate TLBs and Paging-Structure Caches,” in the Intel® 64 and

IA-32 Architectures Software Developer’s Manual, Volume 3A for details.

The following side effects are implementation-specific for the Pentium 4, Intel Xeon, and P6 processor family: when

modifying PE or PG in register CR0, or PSE or PAE in register CR4, all TLB entries are flushed, including global

entries. Software should not depend on this functionality in all Intel 64 or IA-32 processors.

In 64-bit mode, the instruction’s default operation size is 64 bits. The REX.R prefix must be used to access CR8. Use of REX.B permits access to additional registers (R8-R15). Use of the REX.W prefix or 66H prefix is ignored. Use of the REX.R prefix to specify a register other than CR8 causes an invalid-opcode exception. See the summary chart at the beginning of this section for encoding data and limits.

Opcode/Instruction Op/En 64-Bit Mode Compat/Leg Mode Description

0F 20 /r MOV r32, CR0–CR7 MR N.E. Valid Move control register to r32.

0F 20 /r MOV r64, CR0–CR7 MR Valid N.E. Move extended control register to r64.

REX.R + 0F 20 /0 MOV r64, CR8 MR Valid N.E. Move extended CR8 to r64 (see Note 1).

0F 22 /r MOV CR0–CR7, r32 RM N.E. Valid Move r32 to control register.

0F 22 /r MOV CR0–CR7, r64 RM Valid N.E. Move r64 to extended control register.

REX.R + 0F 22 /0 MOV CR8, r64 RM Valid N.E. Move r64 to extended CR8 (see Note 1).

NOTE:

1. MOV CR* instructions, except for MOV CR8, are serializing instructions. MOV CR8 is not architecturally defined as a serializing instruction. For more information, see Chapter 8 in Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

Op/En Operand 1 Operand 2 Operand 3 Operand 4

MR ModRM:r/m (w) ModRM:reg (r) NA NA

RM ModRM:reg (w) ModRM:r/m (r) NA NA

If CR4.PCIDE = 1, bit 63 of the source operand to MOV to CR3 determines whether the instruction invalidates

entries in the TLBs and the paging-structure caches (see Section 4.10.4.1, “Operations that Invalidate TLBs and

Paging-Structure Caches,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A). The

instruction does not modify bit 63 of CR3, which is reserved and always 0.
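A kernel that uses PCIDs typically composes the value written to CR3 from three parts. The sketch below illustrates this under assumptions drawn from Volume 3A rather than from this section: the PCID occupies CR3[11:0] when CR4.PCIDE = 1, and setting bit 63 of the source operand means the instruction is not required to invalidate entries. The names make_cr3_pcid, cr3_after_write, and CR3_NOFLUSH are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 63 of the MOV-to-CR3 source operand (meaningful when CR4.PCIDE = 1). */
#define CR3_NOFLUSH (UINT64_C(1) << 63)

/* Builds a MOV-to-CR3 source operand from a 4-KByte-aligned physical
 * address of the top-level paging structure, a 12-bit PCID, and a flag
 * asking the processor to keep cached translations. */
static uint64_t make_cr3_pcid(uint64_t table_phys, uint16_t pcid,
                              bool keep_tlb) {
    assert((table_phys & 0xFFFu) == 0);
    assert(pcid < 0x1000);
    uint64_t value = table_phys | pcid;
    if (keep_tlb) {
        value |= CR3_NOFLUSH;
    }
    return value;
}

/* Models the register contents after the write: bit 63 is not stored. */
static uint64_t cr3_after_write(uint64_t source_operand) {
    return source_operand & ~CR3_NOFLUSH;
}
```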

See “Changes to Instruction Behavior in VMX Non-Root Operation” in Chapter 25 of the Intel® 64 and IA-32 Archi-

tectures Software Developer’s Manual, Volume 3C, for more information about the behavior of this instruction in

VMX non-root operation.

Operation

DEST ← SRC;

Flags Affected

The OF, SF, ZF, AF, PF, and CF flags are undefined.

Protected Mode Exceptions

#GP(0) If the current privilege level is not 0.

If an attempt is made to write invalid bit combinations in CR0 (such as setting the PG flag to 1

when the PE flag is set to 0, or setting the CD flag to 0 when the NW flag is set to 1).

If an attempt is made to write a 1 to any reserved bit in CR4.

If an attempt is made to write 1 to CR4.PCIDE.

If any of the reserved bits are set in the page-directory pointers table (PDPT) and the loading

of a control register causes the PDPT to be loaded into the processor.

#UD If the LOCK prefix is used.

If an attempt is made to access CR1, CR5, CR6, or CR7.

Real-Address Mode Exceptions

#GP If an attempt is made to write a 1 to any reserved bit in CR4.

If an attempt is made to write 1 to CR4.PCIDE.

If an attempt is made to write invalid bit combinations in CR0 (such as setting the PG flag to 1

when the PE flag is set to 0).

#UD If the LOCK prefix is used.

If an attempt is made to access CR1, CR5, CR6, or CR7.

Virtual-8086 Mode Exceptions

#GP(0) These instructions cannot be executed in virtual-8086 mode.

Compatibility Mode Exceptions

#GP(0) If the current privilege level is not 0.

If an attempt is made to write invalid bit combinations in CR0 (such as setting the PG flag to 1

when the PE flag is set to 0, or setting the CD flag to 0 when the NW flag is set to 1).

If an attempt is made to change CR4.PCIDE from 0 to 1 while CR3[11:0] ≠ 000H.

If an attempt is made to clear CR0.PG[bit 31] while CR4.PCIDE = 1.

If an attempt is made to write a 1 to any reserved bit in CR3.

If an attempt is made to leave IA-32e mode by clearing CR4.PAE[bit 5].

#UD If the LOCK prefix is used.

If an attempt is made to access CR1, CR5, CR6, or CR7.


64-Bit Mode Exceptions

#GP(0) If the current privilege level is not 0.

If an attempt is made to write invalid bit combinations in CR0 (such as setting the PG flag to 1

when the PE flag is set to 0, or setting the CD flag to 0 when the NW flag is set to 1).

If an attempt is made to change CR4.PCIDE from 0 to 1 while CR3[11:0] ≠ 000H.

If an attempt is made to clear CR0.PG[bit 31].

If an attempt is made to write a 1 to any reserved bit in CR4.

If an attempt is made to write a 1 to any reserved bit in CR8.

If an attempt is made to write a 1 to any reserved bit in CR3.

If an attempt is made to leave IA-32e mode by clearing CR4.PAE[bit 5].

#UD If the LOCK prefix is used.

If an attempt is made to access CR1, CR5, CR6, or CR7.

If the REX.R prefix is used to specify a register other than CR8.

MOV—Move to/from Debug Registers


Instruction Operand Encoding

Description

Moves the contents of a debug register (DR0, DR1, DR2, DR3, DR4, DR5, DR6, or DR7) to a general-purpose register or vice versa. The operand size for these instructions is always 32 bits in non-64-bit modes, regardless of the operand-size attribute. (See Section 17.2, “Debug Registers”, of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for a detailed description of the flags and fields in the debug registers.)

The instructions must be executed at privilege level 0 or in real-address mode.

When the debug extension (DE) flag in register CR4 is clear, these instructions operate on debug registers in a

manner that is compatible with Intel386 and Intel486 processors. In this mode, references to DR4 and DR5 refer

to DR6 and DR7, respectively. When the DE flag in CR4 is set, attempts to reference DR4 and DR5 result in an

undefined opcode (#UD) exception. (The CR4 register was added to the IA-32 Architecture beginning with the

Pentium processor.)

At the opcode level, the reg field within the ModR/M byte specifies which of the debug registers is loaded or read.

The two bits in the mod field are ignored. The r/m field specifies the general-purpose register loaded or read.

In 64-bit mode, the instruction’s default operation size is 64 bits. Use of the REX.B prefix permits access to addi-

tional registers (R8–R15). Use of the REX.W or 66H prefix is ignored. Use of the REX.R prefix causes an invalid-

opcode exception. See the summary chart at the beginning of this section for encoding data and limits.

Operation

IF ((DE = 1) and (SRC or DEST = DR4 or DR5))
    THEN
        #UD;
    ELSE
        DEST ← SRC;
FI;
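The DR4/DR5 handling described above can be modeled as a register-number translation. This is a minimal sketch with a hypothetical function name, resolve_debug_reg, that returns -1 to stand for the #UD case. Privilege and DR7.GD checks are not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Maps the debug register number encoded in the instruction (0..7) to
 * the register actually accessed. With CR4.DE clear, DR4 and DR5 alias
 * DR6 and DR7; with CR4.DE set, referencing them raises #UD, reported
 * here as -1. */
static int resolve_debug_reg(unsigned dr, bool cr4_de) {
    assert(dr < 8);
    if (dr == 4 || dr == 5) {
        if (cr4_de) {
            return -1;          /* #UD */
        }
        return (int)dr + 2;     /* DR4 -> DR6, DR5 -> DR7 */
    }
    return (int)dr;
}
```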

Flags Affected

The OF, SF, ZF, AF, PF, and CF flags are undefined.

Opcode/Instruction Op/En 64-Bit Mode Compat/Leg Mode Description

0F 21 /r MOV r32, DR0–DR7 MR N.E. Valid Move debug register to r32.

0F 21 /r MOV r64, DR0–DR7 MR Valid N.E. Move extended debug register to r64.

0F 23 /r MOV DR0–DR7, r32 RM N.E. Valid Move r32 to debug register.

0F 23 /r MOV DR0–DR7, r64 RM Valid N.E. Move r64 to extended debug register.

Op/En Operand 1 Operand 2 Operand 3 Operand 4

MR ModRM:r/m (w) ModRM:reg (r) NA NA

RM ModRM:reg (w) ModRM:r/m (r) NA NA


Protected Mode Exceptions

#GP(0) If the current privilege level is not 0.

#UD If CR4.DE[bit 3] = 1 (debug extensions) and a MOV instruction is executed involving DR4 or

DR5.

If the LOCK prefix is used.

#DB If any debug register is accessed while the DR7.GD[bit 13] = 1.

Real-Address Mode Exceptions

#UD If CR4.DE[bit 3] = 1 (debug extensions) and a MOV instruction is executed involving DR4 or

DR5.

If the LOCK prefix is used.

#DB If any debug register is accessed while the DR7.GD[bit 13] = 1.

Virtual-8086 Mode Exceptions

#GP(0) The debug registers cannot be loaded or read when in virtual-8086 mode.

Compatibility Mode Exceptions

Same exceptions as in protected mode.

64-Bit Mode Exceptions

#GP(0) If the current privilege level is not 0.

If an attempt is made to write a 1 to any of bits 63:32 in DR6.

If an attempt is made to write a 1 to any of bits 63:32 in DR7.

#UD If CR4.DE[bit 3] = 1 (debug extensions) and a MOV instruction is executed involving DR4 or

DR5.

If the LOCK prefix is used.

If the REX.R prefix is used.

#DB If any debug register is accessed while the DR7.GD[bit 13] = 1.

MOVAPD—Move Aligned Packed Double-Precision Floating-Point Values


Instruction Operand Encoding

Opcode/Instruction Op/En 64/32 bit Mode Support CPUID Feature Flag Description

66 0F 28 /r MOVAPD xmm1, xmm2/m128 A V/V SSE2 Move aligned packed double-precision floating-point values from xmm2/mem to xmm1.

66 0F 29 /r MOVAPD xmm2/m128, xmm1 B V/V SSE2 Move aligned packed double-precision floating-point values from xmm1 to xmm2/mem.

VEX.128.66.0F.WIG 28 /r VMOVAPD xmm1, xmm2/m128 A V/V AVX Move aligned packed double-precision floating-point values from xmm2/mem to xmm1.

VEX.128.66.0F.WIG 29 /r VMOVAPD xmm2/m128, xmm1 B V/V AVX Move aligned packed double-precision floating-point values from xmm1 to xmm2/mem.

VEX.256.66.0F.WIG 28 /r VMOVAPD ymm1, ymm2/m256 A V/V AVX Move aligned packed double-precision floating-point values from ymm2/mem to ymm1.

VEX.256.66.0F.WIG 29 /r VMOVAPD ymm2/m256, ymm1 B V/V AVX Move aligned packed double-precision floating-point values from ymm1 to ymm2/mem.

EVEX.128.66.0F.W1 28 /r VMOVAPD xmm1 {k1}{z}, xmm2/m128 C V/V AVX512VL AVX512F Move aligned packed double-precision floating-point values from xmm2/m128 to xmm1 using writemask k1.

EVEX.256.66.0F.W1 28 /r VMOVAPD ymm1 {k1}{z}, ymm2/m256 C V/V AVX512VL AVX512F Move aligned packed double-precision floating-point values from ymm2/m256 to ymm1 using writemask k1.

EVEX.512.66.0F.W1 28 /r VMOVAPD zmm1 {k1}{z}, zmm2/m512 C V/V AVX512F Move aligned packed double-precision floating-point values from zmm2/m512 to zmm1 using writemask k1.

EVEX.128.66.0F.W1 29 /r VMOVAPD xmm2/m128 {k1}{z}, xmm1 D V/V AVX512VL AVX512F Move aligned packed double-precision floating-point values from xmm1 to xmm2/m128 using writemask k1.

EVEX.256.66.0F.W1 29 /r VMOVAPD ymm2/m256 {k1}{z}, ymm1 D V/V AVX512VL AVX512F Move aligned packed double-precision floating-point values from ymm1 to ymm2/m256 using writemask k1.

EVEX.512.66.0F.W1 29 /r VMOVAPD zmm2/m512 {k1}{z}, zmm1 D V/V AVX512F Move aligned packed double-precision floating-point values from zmm1 to zmm2/m512 using writemask k1.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (w) ModRM:r/m (r) NA NA

B NA ModRM:r/m (w) ModRM:reg (r) NA NA

C Full Mem ModRM:reg (w) ModRM:r/m (r) NA NA

D Full Mem ModRM:r/m (w) ModRM:reg (r) NA NA


Description

Moves 2, 4 or 8 double-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM, YMM or ZMM register from a 128-bit, 256-bit or 512-bit memory location, to store the contents of an XMM, YMM or ZMM register into a 128-bit, 256-bit or 512-bit memory location, or to move data between two XMM, two YMM or two ZMM registers.

When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte (128-bit

versions), 32-byte (256-bit version) or 64-byte (EVEX.512 encoded version) boundary or a general-protection

exception (#GP) will be generated. For EVEX encoded versions, the operand must be aligned to the size of the

memory operand. To move double-precision floating-point values to and from unaligned memory locations, use the

VMOVUPD instruction.
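The alignment rule above can be sketched as a simple address check. This is a minimal illustration in portable C, not part of the manual; the function name `movapd_operand_is_aligned` is hypothetical, and the check only models the condition under which #GP is avoided.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true when addr is a multiple of the memory
   operand size in bytes (16, 32 or 64 for the 128-, 256- and 512-bit forms).
   A false result corresponds to the case in which the aligned move faults. */
bool movapd_operand_is_aligned(uintptr_t addr, size_t operand_bytes)
{
    assert(operand_bytes == 16 || operand_bytes == 32 || operand_bytes == 64);
    return (addr % operand_bytes) == 0;
}
```

For example, address 0x1010 satisfies the 16-byte requirement of the 128-bit forms but not the 32-byte requirement of the 256-bit forms.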

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.

EVEX.512 encoded version:

Moves 512 bits of packed double-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a ZMM register from a 512-bit float64 memory location, to store the contents of a ZMM register into a 512-bit float64 memory location, or to move data between two ZMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a 64-byte boundary or a general-protection exception (#GP) will be generated. To move double-precision floating-point values to and from unaligned memory locations, use the VMOVUPD instruction.

VEX.256 and EVEX.256 encoded versions:

Moves 256 bits of packed double-precision floating-point values from the source operand (second operand) to the

destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory

location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM

registers. When the source or destination operand is a memory operand, the operand must be aligned on a 32-byte

boundary or a general-protection exception (#GP) will be generated. To move double-precision floating-point

values to and from unaligned memory locations, use the VMOVUPD instruction.

128-bit versions:

Moves 128 bits of packed double-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated. To move double-precision floating-point values to and from unaligned memory locations, use the VMOVUPD instruction.

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding ZMM destination register remain

unchanged.

(E)VEX.128 encoded version: Bits (MAXVL-1:128) of the destination ZMM register are zeroed.
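The difference between the legacy SSE and (E)VEX.128 treatment of the upper destination bits can be sketched as follows. This is an illustrative model only: it assumes MAXVL = 512, and the type `vreg_t` and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAXVL_QWORDS 8  /* assumption: MAXVL = 512 bits = 8 quadwords */

/* Hypothetical model of a vector register as 8 quadwords, q[0] lowest. */
typedef struct { uint64_t q[MAXVL_QWORDS]; } vreg_t;

/* Legacy SSE MOVAPD, register form: bits MAXVL-1:128 stay unchanged. */
void model_movapd_legacy128(vreg_t *dest, const vreg_t *src)
{
    assert(dest != NULL && src != NULL);
    dest->q[0] = src->q[0];
    dest->q[1] = src->q[1];
}

/* (E)VEX.128 VMOVAPD, register form without writemask:
   bits MAXVL-1:128 of the destination are zeroed. */
void model_vmovapd_vex128(vreg_t *dest, const vreg_t *src)
{
    assert(dest != NULL && src != NULL);
    dest->q[0] = src->q[0];
    dest->q[1] = src->q[1];
    memset(&dest->q[2], 0, (MAXVL_QWORDS - 2) * sizeof dest->q[0]);
}
```

Only the handling of `q[2]` through `q[7]` differs between the two functions, which mirrors the two statements above.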

Operation

VMOVAPD (EVEX encoded versions, register-copy form)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VMOVAPD (EVEX encoded versions, store-form)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE *DEST[i+63:i] remains unchanged* ; merging-masking
    FI;
ENDFOR;

VMOVAPD (EVEX encoded versions, load-form)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0
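The masked load and register-copy forms above can be modelled in portable C as shown below. This is a sketch under stated assumptions: MAXVL is taken to be 512 bits, registers are represented as arrays of eight quadwords, and the function name `model_vmovapd_masked` and its parameters are hypothetical. The "no writemask" case corresponds to passing a mask with all KL low bits set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAXVL_QWORDS 8  /* assumption: MAXVL = 512 bits */

/* dest, src: vector registers modelled as 8 quadwords (element 0 lowest).
   k1: writemask bits; kl: element count (2, 4 or 8 for VL = 128, 256, 512).
   zeroing: true for {z} (zeroing-masking), false for merging-masking. */
void model_vmovapd_masked(uint64_t dest[MAXVL_QWORDS],
                          const uint64_t src[MAXVL_QWORDS],
                          uint8_t k1, int kl, bool zeroing)
{
    assert(kl == 2 || kl == 4 || kl == 8);
    for (int j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u) {
            dest[j] = src[j];       /* selected element is copied */
        } else if (zeroing) {
            dest[j] = 0;            /* zeroing-masking */
        }                           /* merging-masking: element unchanged */
    }
    for (int j = kl; j < MAXVL_QWORDS; j++) {
        dest[j] = 0;                /* DEST[MAXVL-1:VL] <- 0 */
    }
}
```

With `k1 = 0x05` and `kl = 4`, elements 0 and 2 are copied; elements 1 and 3 are either preserved or cleared depending on `zeroing`, and everything above bit 255 is cleared in both cases.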

VMOVAPD (VEX.256 encoded version, load- and register-copy form)
DEST[255:0] ← SRC[255:0]
DEST[MAXVL-1:256] ← 0

VMOVAPD (VEX.256 encoded version, store-form)
DEST[255:0] ← SRC[255:0]

VMOVAPD (VEX.128 encoded version, load- and register-copy form)
DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] ← 0

MOVAPD (128-bit load- and register-copy form, Legacy SSE version)
DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] (Unmodified)

(V)MOVAPD (128-bit store-form version)
DEST[127:0] ← SRC[127:0]


Intel C/C++ Compiler Intrinsic Equivalent

VMOVAPD __m512d _mm512_load_pd( void * m);

VMOVAPD __m512d _mm512_mask_load_pd(__m512d s, __mmask8 k, void * m);

VMOVAPD __m512d _mm512_maskz_load_pd( __mmask8 k, void * m);

VMOVAPD void _mm512_store_pd( void * d, __m512d a);

VMOVAPD void _mm512_mask_store_pd( void * d, __mmask8 k, __m512d a);

VMOVAPD __m256d _mm256_mask_load_pd(__m256d s, __mmask8 k, void * m);

VMOVAPD __m256d _mm256_maskz_load_pd( __mmask8 k, void * m);

VMOVAPD void _mm256_mask_store_pd( void * d, __mmask8 k, __m256d a);

VMOVAPD __m128d _mm_mask_load_pd(__m128d s, __mmask8 k, void * m);

VMOVAPD __m128d _mm_maskz_load_pd( __mmask8 k, void * m);

VMOVAPD void _mm_mask_store_pd( void * d, __mmask8 k, __m128d a);

MOVAPD __m256d _mm256_load_pd (double * p);

MOVAPD void _mm256_store_pd(double * p, __m256d a);

MOVAPD __m128d _mm_load_pd (double * p);

MOVAPD void _mm_store_pd(double * p, __m128d a);

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type1.SSE2;

EVEX-encoded instruction, see Exceptions Type E1.

#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.


MOVAPS—Move Aligned Packed Single-Precision Floating-Point Values

Instruction Operand Encoding

Description

Moves 4, 8 or 16 single-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM, YMM or ZMM register from a 128-bit, 256-bit or 512-bit memory location, to store the contents of an XMM, YMM or ZMM register into a 128-bit, 256-bit or 512-bit memory location, or to move data between two XMM, two YMM or two ZMM registers.

When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte (128-bit version), 32-byte (VEX.256 encoded version) or 64-byte (EVEX.512 encoded version) boundary or a general-protection exception (#GP) will be generated. For EVEX.512 encoded versions, the operand must be aligned to the size of the memory operand. To move single-precision floating-point values to and from unaligned memory locations, use the VMOVUPS instruction.

Opcode/Instruction  Op/En  64/32 bit Mode Support  CPUID Feature Flag  Description

NP 0F 28 /r

MOVAPS xmm1, xmm2/m128

A V/V SSE Move aligned packed single-precision floating-point

values from xmm2/mem to xmm1.

NP 0F 29 /r

MOVAPS xmm2/m128, xmm1

B V/V SSE Move aligned packed single-precision floating-point

values from xmm1 to xmm2/mem.

VEX.128.0F.WIG 28 /r

VMOVAPS xmm1, xmm2/m128

A V/V AVX Move aligned packed single-precision floating-point

values from xmm2/mem to xmm1.

VEX.128.0F.WIG 29 /r

VMOVAPS xmm2/m128, xmm1

B V/V AVX Move aligned packed single-precision floating-point

values from xmm1 to xmm2/mem.

VEX.256.0F.WIG 28 /r

VMOVAPS ymm1, ymm2/m256

A V/V AVX Move aligned packed single-precision floating-point

values from ymm2/mem to ymm1.

VEX.256.0F.WIG 29 /r

VMOVAPS ymm2/m256, ymm1

B V/V AVX Move aligned packed single-precision floating-point

values from ymm1 to ymm2/mem.

EVEX.128.0F.W0 28 /r

VMOVAPS xmm1 {k1}{z}, xmm2/m128

C V/V AVX512VL

AVX512F

Move aligned packed single-precision floating-point

values from xmm2/m128 to xmm1 using

writemask k1.

EVEX.256.0F.W0 28 /r

VMOVAPS ymm1 {k1}{z}, ymm2/m256

C V/V AVX512VL

AVX512F

Move aligned packed single-precision floating-point

values from ymm2/m256 to ymm1 using

writemask k1.

EVEX.512.0F.W0 28 /r

VMOVAPS zmm1 {k1}{z}, zmm2/m512

C V/V AVX512F Move aligned packed single-precision floating-point

values from zmm2/m512 to zmm1 using

writemask k1.

EVEX.128.0F.W0 29 /r

VMOVAPS xmm2/m128 {k1}{z}, xmm1

D V/V AVX512VL

AVX512F

Move aligned packed single-precision floating-point

values from xmm1 to xmm2/m128 using

writemask k1.

EVEX.256.0F.W0 29 /r

VMOVAPS ymm2/m256 {k1}{z}, ymm1

D V/V AVX512VL

AVX512F

Move aligned packed single-precision floating-point

values from ymm1 to ymm2/m256 using

writemask k1.

EVEX.512.0F.W0 29 /r

VMOVAPS zmm2/m512 {k1}{z}, zmm1

D V/V AVX512F Move aligned packed single-precision floating-point

values from zmm1 to zmm2/m512 using

writemask k1.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (w) ModRM:r/m (r) NA NA

B NA ModRM:r/m (w) ModRM:reg (r) NA NA

C Full Mem ModRM:reg (w) ModRM:r/m (r) NA NA

D Full Mem ModRM:r/m (w) ModRM:reg (r) NA NA


Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.

EVEX.512 encoded version:

Moves 512 bits of packed single-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a ZMM register from a 512-bit float32 memory location, to store the contents of a ZMM register into a 512-bit float32 memory location, or to move data between two ZMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a 64-byte boundary or a general-protection exception (#GP) will be generated. To move single-precision floating-point values to and from unaligned memory locations, use the VMOVUPS instruction.

VEX.256 and EVEX.256 encoded version:

Moves 256 bits of packed single-precision floating-point values from the source operand (second operand) to the

destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory

location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM

registers. When the source or destination operand is a memory operand, the operand must be aligned on a 32-byte

boundary or a general-protection exception (#GP) will be generated.

128-bit versions:

Moves 128 bits of packed single-precision floating-point values from the source operand (second operand) to the

destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory

location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two

XMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a

16-byte boundary or a general-protection exception (#GP) will be generated. To move single-precision floating-

point values to and from unaligned memory locations, use the VMOVUPS instruction.

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding ZMM destination register remain

unchanged.

(E)VEX.128 encoded version: Bits (MAXVL-1:128) of the destination ZMM register are zeroed.

Operation

VMOVAPS (EVEX encoded versions, register-copy form)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVAPS (EVEX encoded versions, store-form)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
    FI;
ENDFOR;
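The store-form differs from the load and register-copy forms in that unselected elements are always left untouched and nothing beyond VL is written. A minimal portable C model of that behavior follows; the function name `model_vmovaps_masked_store` is hypothetical, and memory is represented as a plain array of 32-bit elements.

```c
#include <assert.h>
#include <stdint.h>

/* mem: destination memory modelled as kl 32-bit elements.
   src: source register elements (element 0 lowest).
   k1: writemask; kl: element count (4, 8 or 16 for VL = 128, 256, 512). */
void model_vmovaps_masked_store(uint32_t *mem, const uint32_t *src,
                                uint16_t k1, int kl)
{
    assert(kl == 4 || kl == 8 || kl == 16);
    for (int j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u) {
            mem[j] = src[j];   /* only selected elements are stored */
        }
        /* unselected elements of the memory destination remain unchanged */
    }
}
```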


VMOVAPS (EVEX encoded versions, load-form)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVAPS (VEX.256 encoded version, load- and register-copy form)
DEST[255:0] ← SRC[255:0]
DEST[MAXVL-1:256] ← 0

VMOVAPS (VEX.256 encoded version, store-form)
DEST[255:0] ← SRC[255:0]

VMOVAPS (VEX.128 encoded version, load- and register-copy form)
DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] ← 0

MOVAPS (128-bit load- and register-copy form, Legacy SSE version)
DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] (Unmodified)

(V)MOVAPS (128-bit store-form version)
DEST[127:0] ← SRC[127:0]

Intel C/C++ Compiler Intrinsic Equivalent

VMOVAPS __m512 _mm512_load_ps( void * m);

VMOVAPS __m512 _mm512_mask_load_ps(__m512 s, __mmask16 k, void * m);

VMOVAPS __m512 _mm512_maskz_load_ps( __mmask16 k, void * m);

VMOVAPS void _mm512_store_ps( void * d, __m512 a);

VMOVAPS void _mm512_mask_store_ps( void * d, __mmask16 k, __m512 a);

VMOVAPS __m256 _mm256_mask_load_ps(__m256 a, __mmask8 k, void * s);

VMOVAPS __m256 _mm256_maskz_load_ps( __mmask8 k, void * s);

VMOVAPS void _mm256_mask_store_ps( void * d, __mmask8 k, __m256 a);

VMOVAPS __m128 _mm_mask_load_ps(__m128 a, __mmask8 k, void * s);

VMOVAPS __m128 _mm_maskz_load_ps( __mmask8 k, void * s);

VMOVAPS void _mm_mask_store_ps( void * d, __mmask8 k, __m128 a);

MOVAPS __m256 _mm256_load_ps (float * p);

MOVAPS void _mm256_store_ps(float * p, __m256 a);

MOVAPS __m128 _mm_load_ps (float * p);

MOVAPS void _mm_store_ps(float * p, __m128 a);

SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type1.SSE; additionally

#UD If VEX.vvvv != 1111B.

EVEX-encoded instruction, see Exceptions Type E1.


MOVBE—Move Data After Swapping Bytes

Instruction Operand Encoding

Description

Performs a byte swap operation on the data copied from the second operand (source operand) and stores the result in the first operand (destination operand). The source operand can be a general-purpose register or a memory location; the destination operand can be a general-purpose register or a memory location; however, both operands cannot be registers, and only one operand can be a memory location. Both operands must be the same size, which can be a word, a doubleword or a quadword.

The MOVBE instruction is provided for swapping the bytes on a read from memory or on a write to memory, thus providing support for converting little-endian values to big-endian format and vice versa.

In 64-bit mode, the instruction's default operation size is 32 bits. Use of the REX.R prefix permits access to additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.

Operation

TEMP ← SRC
IF ( OperandSize = 16)
    THEN
        DEST[7:0] ← TEMP[15:8];
        DEST[15:8] ← TEMP[7:0];
    ELSE IF ( OperandSize = 32)
        DEST[7:0] ← TEMP[31:24];
        DEST[15:8] ← TEMP[23:16];
        DEST[23:16] ← TEMP[15:8];
        DEST[31:24] ← TEMP[7:0];
    ELSE IF ( OperandSize = 64)
        DEST[7:0] ← TEMP[63:56];
        DEST[15:8] ← TEMP[55:48];
        DEST[23:16] ← TEMP[47:40];
        DEST[31:24] ← TEMP[39:32];
        DEST[39:32] ← TEMP[31:24];
        DEST[47:40] ← TEMP[23:16];
        DEST[55:48] ← TEMP[15:8];
        DEST[63:56] ← TEMP[7:0];
FI;
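The byte reversal described by the Operation section can be sketched in portable C as a single loop over the operand's bytes. This is an illustrative model rather than the instruction itself; the function name `model_movbe` and its parameters are hypothetical, and the operand is carried in the low bits of a 64-bit value.

```c
#include <assert.h>
#include <stdint.h>

/* src: source operand in the low operand_size bits.
   operand_size: operand size in bits (16, 32 or 64).
   Returns the byte-reversed value in the low operand_size bits. */
uint64_t model_movbe(uint64_t src, int operand_size)
{
    assert(operand_size == 16 || operand_size == 32 || operand_size == 64);
    int nbytes = operand_size / 8;
    uint64_t dest = 0;
    for (int i = 0; i < nbytes; i++) {
        /* Take the lowest remaining source byte and shift it in from the
           right, so source byte 0 ends up as the most significant byte. */
        dest = (dest << 8) | (src & 0xFFu);
        src >>= 8;
    }
    return dest;
}
```

For a 32-bit operand, `model_movbe(0x11223344, 32)` yields `0x44332211`, matching the four assignments in the 32-bit branch above.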

Opcode  Instruction  Op/En  64-Bit Mode  Compat/Leg Mode  Description
0F 38 F0 /r  MOVBE r16, m16  RM  Valid  Valid  Reverse byte order in m16 and move to r16.
0F 38 F0 /r  MOVBE r32, m32  RM  Valid  Valid  Reverse byte order in m32 and move to r32.
REX.W + 0F 38 F0 /r  MOVBE r64, m64  RM  Valid  N.E.  Reverse byte order in m64 and move to r64.
0F 38 F1 /r  MOVBE m16, r16  MR  Valid  Valid  Reverse byte order in r16 and move to m16.
0F 38 F1 /r  MOVBE m32, r32  MR  Valid  Valid  Reverse byte order in r32 and move to m32.
REX.W + 0F 38 F1 /r  MOVBE m64, r64  MR  Valid  N.E.  Reverse byte order in r64 and move to m64.

Op/En Operand 1 Operand 2 Operand 3 Operand 4

RM ModRM:reg (w) ModRM:r/m (r) NA NA

MR ModRM:r/m (w) ModRM:reg (r) NA NA


Flags Affected

None

Protected Mode Exceptions

#GP(0) If the destination operand is in a non-writable segment.

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register contains a NULL segment selector.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the

current privilege level is 3.

#UD If CPUID.01H:ECX.MOVBE[bit 22] = 0.

If the LOCK prefix is used.

If REP (F3H) prefix is used.

Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If CPUID.01H:ECX.MOVBE[bit 22] = 0.

If the LOCK prefix is used.

If REP (F3H) prefix is used.

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the

current privilege level is 3.

#UD If CPUID.01H:ECX.MOVBE[bit 22] = 0.

If the LOCK prefix is used.

If REP (F3H) prefix is used.

If REPNE (F2H) prefix is used and CPUID.01H:ECX.SSE4_2[bit 20] = 0.

Compatibility Mode Exceptions

Same exceptions as in protected mode.

64-Bit Mode Exceptions

#GP(0) If the memory address is in a non-canonical form.

#SS(0) If the stack address is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the

current privilege level is 3.

#UD If CPUID.01H:ECX.MOVBE[bit 22] = 0.

If the LOCK prefix is used.

If REP (F3H) prefix is used.


MOVD/MOVQ—Move Doubleword/Move Quadword

Opcode/Instruction  Op/En  64/32-bit Mode  CPUID Feature Flag  Description
NP 0F 6E /r  MOVD mm, r/m32  A  V/V  MMX  Move doubleword from r/m32 to mm.
NP REX.W + 0F 6E /r  MOVQ mm, r/m64  A  V/N.E.  MMX  Move quadword from r/m64 to mm.
NP 0F 7E /r  MOVD r/m32, mm  B  V/V  MMX  Move doubleword from mm to r/m32.
NP REX.W + 0F 7E /r  MOVQ r/m64, mm  B  V/N.E.  MMX  Move quadword from mm to r/m64.
66 0F 6E /r  MOVD xmm, r/m32  A  V/V  SSE2  Move doubleword from r/m32 to xmm.
66 REX.W 0F 6E /r  MOVQ xmm, r/m64  A  V/N.E.  SSE2  Move quadword from r/m64 to xmm.
66 0F 7E /r  MOVD r/m32, xmm  B  V/V  SSE2  Move doubleword from xmm register to r/m32.
66 REX.W 0F 7E /r  MOVQ r/m64, xmm  B  V/N.E.  SSE2  Move quadword from xmm register to r/m64.
VEX.128.66.0F.W0 6E /r  VMOVD xmm1, r32/m32  A  V/V  AVX  Move doubleword from r/m32 to xmm1.
VEX.128.66.0F.W1 6E /r  VMOVQ xmm1, r64/m64  A  V/N.E.¹  AVX  Move quadword from r/m64 to xmm1.
VEX.128.66.0F.W0 7E /r  VMOVD r32/m32, xmm1  B  V/V  AVX  Move doubleword from xmm1 register to r/m32.
VEX.128.66.0F.W1 7E /r  VMOVQ r64/m64, xmm1  B  V/N.E.¹  AVX  Move quadword from xmm1 register to r/m64.
EVEX.128.66.0F.W0 6E /r  VMOVD xmm1, r32/m32  C  V/V  AVX512F  Move doubleword from r/m32 to xmm1.
EVEX.128.66.0F.W1 6E /r  VMOVQ xmm1, r64/m64  C  V/N.E.¹  AVX512F  Move quadword from r/m64 to xmm1.
EVEX.128.66.0F.W0 7E /r  VMOVD r32/m32, xmm1  D  V/V  AVX512F  Move doubleword from xmm1 register to r/m32.
EVEX.128.66.0F.W1 7E /r  VMOVQ r64/m64, xmm1  D  V/N.E.¹  AVX512F  Move quadword from xmm1 register to r/m64.

NOTES:
1. For this specific instruction, VEX.W/EVEX.W in non-64 bit is ignored; the instruction behaves as if the W0 version is used.


Instruction Operand Encoding

Description

Copies a doubleword from the source operand (second operand) to the destination operand (first operand). The

source and destination operands can be general-purpose registers, MMX technology registers, XMM registers, or

32-bit memory locations. This instruction can be used to move a doubleword to and from the low doubleword of an

MMX technology register and a general-purpose register or a 32-bit memory location, or to and from the low

doubleword of an XMM register and a general-purpose register or a 32-bit memory location. The instruction cannot

be used to transfer data between MMX technology registers, between XMM registers, between general-purpose

registers, or between memory locations.

When the destination operand is an MMX technology register, the source operand is written to the low doubleword

of the register, and the register is zero-extended to 64 bits. When the destination operand is an XMM register, the

source operand is written to the low doubleword of the register, and the register is zero-extended to 128 bits.
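The zero-extension behavior described above can be sketched with a small portable C model. The type `xmm_model_t` and the three function names are hypothetical; an MMX register is modelled as a `uint64_t` and an XMM register as four doublewords, with `d[0]` holding bits 31:0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a 128-bit XMM register; d[0] is the low doubleword. */
typedef struct { uint32_t d[4]; } xmm_model_t;

/* MOVD mm, r/m32: low doubleword written, register zero-extended to 64 bits. */
uint64_t model_movd_to_mmx(uint32_t src)
{
    return (uint64_t)src;
}

/* MOVD xmm, r/m32: low doubleword written, bits 127:32 cleared. */
void model_movd_to_xmm(xmm_model_t *dest, uint32_t src)
{
    assert(dest != NULL);
    dest->d[0] = src;
    dest->d[1] = 0;
    dest->d[2] = 0;
    dest->d[3] = 0;
}

/* MOVD r/m32, xmm: only the low doubleword of the source is transferred. */
uint32_t model_movd_from_xmm(const xmm_model_t *src)
{
    assert(src != NULL);
    return src->d[0];
}
```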

In 64-bit mode, the instruction’s default operation size is 32 bits. Use of the REX.R prefix permits access to additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.

MOVD/Q with XMM destination:

Moves a dword/qword integer from the source operand and stores it in the low 32/64 bits of the destination XMM register.

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding YMM destination register remain unchanged.

Qword operation requires the use of REX.W=1.

VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed. Qword operation requires the

use of VEX.W=1.

EVEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed. Qword operation requires

the use of EVEX.W=1.

MOVD/Q with 32/64 reg/mem destination:

Stores the low dword/qword of the source XMM register to 32/64-bit memory location or general-purpose register.

Qword operation requires the use of REX.W=1, VEX.W=1, or EVEX.W=1.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.

If VMOVD or VMOVQ is encoded with VEX.L = 1, an attempt to execute the instruction will cause a #UD exception.

Operation

MOVD (when destination operand is MMX technology register)

DEST[31:0] ← SRC;

DEST[63:32] ← 00000000H;

MOVD (when destination operand is XMM register)

DEST[31:0] ← SRC;

DEST[127:32] ← 000000000000000000000000H;

DEST[MAXVL-1:128] (Unmodified)

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (w) ModRM:r/m (r) NA NA

B NA ModRM:r/m (w) ModRM:reg (r) NA NA

C Tuple1 Scalar ModRM:reg (w) ModRM:r/m (r) NA NA

D Tuple1 Scalar ModRM:r/m (w) ModRM:reg (r) NA NA


MOVD (when source operand is MMX technology or XMM register)
DEST ← SRC[31:0];

VMOVD (VEX-encoded version when destination is an XMM register)
DEST[31:0] ← SRC[31:0]
DEST[MAXVL-1:32] ← 0

MOVQ (when destination operand is XMM register)
DEST[63:0] ← SRC[63:0];
DEST[127:64] ← 0000000000000000H;
DEST[MAXVL-1:128] (Unmodified)

MOVQ (when destination operand is r/m64)
DEST[63:0] ← SRC[63:0];

MOVQ (when source operand is XMM register or r/m64)
DEST ← SRC[63:0];

VMOVQ (VEX-encoded version when destination is an XMM register)
DEST[63:0] ← SRC[63:0]
DEST[MAXVL-1:64] ← 0

VMOVD (EVEX-encoded version when destination is an XMM register)
DEST[31:0] ← SRC[31:0]
DEST[MAXVL-1:32] ← 0

VMOVQ (EVEX-encoded version when destination is an XMM register)
DEST[63:0] ← SRC[63:0]
DEST[MAXVL-1:64] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

MOVD: __m64 _mm_cvtsi32_si64 (int i )

MOVD: int _mm_cvtsi64_si32 ( __m64 m )

MOVD: __m128i _mm_cvtsi32_si128 (int a)

MOVD: int _mm_cvtsi128_si32 ( __m128i a)

MOVQ: __int64 _mm_cvtsi128_si64(__m128i);

MOVQ: __m128i _mm_cvtsi64_si128(__int64);

VMOVD __m128i _mm_cvtsi32_si128( int);

VMOVD int _mm_cvtsi128_si32( __m128i );

VMOVQ __m128i _mm_cvtsi64_si128 (__int64);

VMOVQ __int64 _mm_cvtsi128_si64(__m128i );

VMOVQ __m128i _mm_loadl_epi64( __m128i * s);

VMOVQ void _mm_storel_epi64( __m128i * d, __m128i s);

Flags Affected

None

SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 5.

EVEX-encoded instruction, see Exceptions Type E9NF.

#UD If VEX.L = 1.

If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.


MOVDDUP—Replicate Double FP Values

Instruction Operand Encoding

Description

For 256-bit or higher versions: Duplicates each even-indexed double-precision floating-point value from the source operand (the second operand) into an adjacent pair and stores the result to the destination operand (the first operand).

For 128-bit versions: Duplicates the low double-precision floating-point value from the source operand (the second operand) and stores the result to the destination operand (the first operand).

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register are unchanged. The source operand is an XMM register or a 64-bit memory location.

VEX.128 and EVEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed. The source operand is an XMM register or a 64-bit memory location. The destination is updated conditionally under the writemask for EVEX version.

VEX.256 and EVEX.256 encoded version: Bits (MAXVL-1:256) of the destination register are zeroed. The source operand is a YMM register or a 256-bit memory location. The destination is updated conditionally under the writemask for EVEX version.

EVEX.512 encoded version: The destination is updated according to the writemask. The source operand is a ZMM register or a 512-bit memory location.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.

Opcode/Instruction  Op/En  64/32 bit Mode Support  CPUID Feature Flag  Description

F2 0F 12 /r

MOVDDUP xmm1, xmm2/m64

A V/V SSE3 Move double-precision floating-point value from

xmm2/m64 and duplicate into xmm1.

VEX.128.F2.0F.WIG 12 /r

VMOVDDUP xmm1, xmm2/m64

A V/V AVX Move double-precision floating-point value from

xmm2/m64 and duplicate into xmm1.

VEX.256.F2.0F.WIG 12 /r

VMOVDDUP ymm1, ymm2/m256

A V/V AVX Move even index double-precision floating-point

values from ymm2/mem and duplicate each element

into ymm1.

EVEX.128.F2.0F.W1 12 /r

VMOVDDUP xmm1 {k1}{z},

xmm2/m64

B V/V AVX512VL

AVX512F

Move double-precision floating-point value from

xmm2/m64 and duplicate each element into xmm1

subject to writemask k1.

EVEX.256.F2.0F.W1 12 /r

VMOVDDUP ymm1 {k1}{z},

ymm2/m256

B V/V AVX512VL

AVX512F

Move even index double-precision floating-point

values from ymm2/m256 and duplicate each element

into ymm1 subject to writemask k1.

EVEX.512.F2.0F.W1 12 /r

VMOVDDUP zmm1 {k1}{z},

zmm2/m512

B V/V AVX512F Move even index double-precision floating-point

values from zmm2/m512 and duplicate each element

into zmm1 subject to writemask k1.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (w) ModRM:r/m (r) NA NA

B MOVDDUP ModRM:reg (w) ModRM:r/m (r) NA NA


Operation

VMOVDDUP (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)

TMP_SRC[63:0] ← SRC[63:0]
TMP_SRC[127:64] ← SRC[63:0]
IF VL >= 256
    TMP_SRC[191:128] ← SRC[191:128]
    TMP_SRC[255:192] ← SRC[191:128]
FI;
IF VL >= 512
    TMP_SRC[319:256] ← SRC[319:256]
    TMP_SRC[383:320] ← SRC[319:256]
    TMP_SRC[447:384] ← SRC[447:384]
    TMP_SRC[511:448] ← SRC[447:384]
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_SRC[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVDDUP (VEX.256 encoded version)

DEST[63:0] ← SRC[63:0]
DEST[127:64] ← SRC[63:0]
DEST[191:128] ← SRC[191:128]
DEST[255:192] ← SRC[191:128]
DEST[MAXVL-1:256] ← 0

VMOVDDUP (VEX.128 encoded version)
DEST[63:0] ← SRC[63:0]
DEST[127:64] ← SRC[63:0]
DEST[MAXVL-1:128] ← 0

Figure 4-2. VMOVDDUP Operation (SRC elements, high to low: X3 X2 X1 X0; DEST elements, high to low: X2 X2 X0 X0)


MOVDDUP (128-bit Legacy SSE version)

DEST[63:0] ← SRC[63:0]
DEST[127:64] ← SRC[63:0]
DEST[MAXVL-1:128] (Unmodified)
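The duplication pattern shared by all of the forms above (before any writemask or upper-bit handling is applied) can be sketched in portable C. This is an illustrative model only; the function name `model_movddup` is hypothetical, and each double-precision value is carried as its 64-bit pattern because the instruction copies bits without interpreting them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* dest, src: arrays of n 64-bit elements, element 0 lowest.
   n: element count (2, 4 or 8 for the 128-, 256- and 512-bit forms).
   Each even-indexed source element is written to the even position and
   to the odd position directly above it. */
void model_movddup(uint64_t *dest, const uint64_t *src, size_t n)
{
    assert(n == 2 || n == 4 || n == 8);
    for (size_t j = 0; j + 1 < n; j += 2) {
        uint64_t even = src[j];  /* read first so dest may alias src */
        dest[j] = even;
        dest[j + 1] = even;
    }
}
```

With a four-element source X3 X2 X1 X0, the result is X2 X2 X0 X0, which is the arrangement shown in Figure 4-2.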

Intel C/C++ Compiler Intrinsic Equivalent

VMOVDDUP __m512d _mm512_movedup_pd( __m512d a);

VMOVDDUP __m512d _mm512_mask_movedup_pd(__m512d s, __mmask8 k, __m512d a);

VMOVDDUP __m512d _mm512_maskz_movedup_pd( __mmask8 k, __m512d a);

VMOVDDUP __m256d _mm256_mask_movedup_pd(__m256d s, __mmask8 k, __m256d a);

VMOVDDUP __m256d _mm256_maskz_movedup_pd( __mmask8 k, __m256d a);

VMOVDDUP __m128d _mm_mask_movedup_pd(__m128d s, __mmask8 k, __m128d a);

VMOVDDUP __m128d _mm_maskz_movedup_pd( __mmask8 k, __m128d a);

MOVDDUP __m256d _mm256_movedup_pd (__m256d a);

MOVDDUP __m128d _mm_movedup_pd (__m128d a);

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 5;

EVEX-encoded instruction, see Exceptions Type E5NF.

#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.


MOVDQA,VMOVDQA32/64—Move Aligned Packed Integer Values

Opcode/Instruction  Op/En  64/32 bit Mode Support  CPUID Feature Flag  Description

66 0F 6F /r

MOVDQA xmm1, xmm2/m128

A V/V SSE2 Move aligned packed integer values from

xmm2/mem to xmm1.

66 0F 7F /r

MOVDQA xmm2/m128, xmm1

B V/V SSE2 Move aligned packed integer values from xmm1

to xmm2/mem.

VEX.128.66.0F.WIG 6F /r

VMOVDQA xmm1, xmm2/m128

A V/V AVX Move aligned packed integer values from

xmm2/mem to xmm1.

VEX.128.66.0F.WIG 7F /r

VMOVDQA xmm2/m128, xmm1

B V/V AVX Move aligned packed integer values from xmm1

to xmm2/mem.

VEX.256.66.0F.WIG 6F /r

VMOVDQA ymm1, ymm2/m256

A V/V AVX Move aligned packed integer values from

ymm2/mem to ymm1.

VEX.256.66.0F.WIG 7F /r

VMOVDQA ymm2/m256, ymm1

B V/V AVX Move aligned packed integer values from ymm1

to ymm2/mem.

EVEX.128.66.0F.W0 6F /r

VMOVDQA32 xmm1 {k1}{z},

xmm2/m128

C V/V AVX512VL

AVX512F

Move aligned packed doubleword integer values

from xmm2/m128 to xmm1 using writemask

k1.

EVEX.256.66.0F.W0 6F /r

VMOVDQA32 ymm1 {k1}{z},

ymm2/m256

C V/V AVX512VL

AVX512F

Move aligned packed doubleword integer values

from ymm2/m256 to ymm1 using writemask

k1.

EVEX.512.66.0F.W0 6F /r

VMOVDQA32 zmm1 {k1}{z},

zmm2/m512

C V/V AVX512F Move aligned packed doubleword integer values

from zmm2/m512 to zmm1 using writemask k1.

EVEX.128.66.0F.W0 7F /r

VMOVDQA32 xmm2/m128 {k1}{z},

xmm1

D V/V AVX512VL

AVX512F

Move aligned packed doubleword integer values

from xmm1 to xmm2/m128 using writemask

k1.

EVEX.256.66.0F.W0 7F /r

VMOVDQA32 ymm2/m256 {k1}{z},

ymm1

D V/V AVX512VL

AVX512F

Move aligned packed doubleword integer values

from ymm1 to ymm2/m256 using writemask

k1.

EVEX.512.66.0F.W0 7F /r

VMOVDQA32 zmm2/m512 {k1}{z},

zmm1

D V/V AVX512F Move aligned packed doubleword integer values

from zmm1 to zmm2/m512 using writemask k1.

EVEX.128.66.0F.W1 6F /r

VMOVDQA64 xmm1 {k1}{z},

xmm2/m128

C V/V AVX512VL

AVX512F

Move aligned quadword integer values from

xmm2/m128 to xmm1 using writemask k1.

EVEX.256.66.0F.W1 6F /r

VMOVDQA64 ymm1 {k1}{z},

ymm2/m256

C V/V AVX512VL

AVX512F

Move aligned quadword integer values from

ymm2/m256 to ymm1 using writemask k1.

EVEX.512.66.0F.W1 6F /r

VMOVDQA64 zmm1 {k1}{z},

zmm2/m512

C V/V AVX512F Move aligned packed quadword integer values

from zmm2/m512 to zmm1 using writemask k1.

EVEX.128.66.0F.W1 7F /r

VMOVDQA64 xmm2/m128 {k1}{z},

xmm1

D V/V AVX512VL

AVX512F

Move aligned packed quadword integer values

from xmm1 to xmm2/m128 using writemask

k1.

EVEX.256.66.0F.W1 7F /r

VMOVDQA64 ymm2/m256 {k1}{z},

ymm1

D V/V AVX512VL

AVX512F

Move aligned packed quadword integer values

from ymm1 to ymm2/m256 using writemask

k1.

EVEX.512.66.0F.W1 7F /r

VMOVDQA64 zmm2/m512 {k1}{z},

zmm1

D V/V AVX512F Move aligned packed quadword integer values

from zmm1 to zmm2/m512 using writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
C | Full Mem | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
D | Full Mem | ModRM:r/m (w) | ModRM:reg (r) | NA | NA

Description

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise the instruction will #UD.

EVEX encoded versions:

Moves 128, 256 or 512 bits of packed doubleword/quadword integer values from the source operand (the second operand) to the destination operand (the first operand). This instruction can be used to load a vector register from an int32/int64 memory location, to store the contents of a vector register into an int32/int64 memory location, or to move data between two ZMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a 16 (EVEX.128)/32 (EVEX.256)/64 (EVEX.512)-byte boundary or a general-protection exception (#GP) will be generated. To move integer data to and from unaligned memory locations, use the VMOVDQU instruction.

The destination operand is updated at 32-bit (VMOVDQA32) or 64-bit (VMOVDQA64) granularity according to the writemask.

VEX.256 encoded version:

Moves 256 bits of packed integer values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers.

When the source or destination operand is a memory operand, the operand must be aligned on a 32-byte boundary or a general-protection exception (#GP) will be generated. To move integer data to and from unaligned memory locations, use the VMOVDQU instruction. Bits (MAXVL-1:256) of the destination register are zeroed.

128-bit versions:

Moves 128 bits of packed integer values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.

When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated. To move integer data to and from unaligned memory locations, use the VMOVDQU instruction.

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding ZMM destination register remain unchanged.

VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed.
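The alignment requirement for memory operands (16, 32, or 64 bytes for 128-, 256-, or 512-bit moves) can be illustrated with a small portable C model. This is only a sketch: the helper names `required_alignment` and `aligned_move_ok` are hypothetical and not part of any Intel API, and the function merely predicts whether an address satisfies the rule; it does not execute the instruction or raise #GP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: alignment in bytes that a MOVDQA-family memory
   operand needs for a vector length of 128, 256 or 512 bits.
   Returns 0 for any other length. */
static size_t required_alignment(unsigned vl_bits)
{
    switch (vl_bits) {
    case 128: return 16;
    case 256: return 32;
    case 512: return 64;
    default:  return 0;
    }
}

/* Hypothetical helper: returns 1 if an aligned move of vl_bits at addr
   satisfies the alignment rule, 0 if the real instruction would fault. */
static int aligned_move_ok(uintptr_t addr, unsigned vl_bits)
{
    size_t align = required_alignment(vl_bits);
    assert(align != 0);   /* only 128, 256 and 512 are modeled */
    return (addr % align) == 0;
}
```

An address such as 0x1010 passes the 128-bit check but fails the 256-bit one, which is the case where an unaligned move instruction would be needed instead.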


Operation

VMOVDQA32 (EVEX encoded versions, register-copy form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVDQA32 (EVEX encoded versions, store-form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
    FI;
ENDFOR;

VMOVDQA32 (EVEX encoded versions, load-form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0
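The difference between merging-masking and zeroing-masking in the register-copy and load forms above may be easier to see in a scalar C model. The following is a minimal sketch, assuming a hypothetical helper `masked_move32` that treats a vector as an array of 32-bit elements; it does not model the zeroing of bits above VL or the alignment check.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the VMOVDQA32 register-copy/load pseudocode.
   kl is the element count (4, 8 or 16); bit j of k1 selects element j.
   zeroing != 0 models {z}; zeroing == 0 models merging-masking. */
static void masked_move32(uint32_t *dest, const uint32_t *src,
                          uint16_t k1, int kl, int zeroing)
{
    assert(kl == 4 || kl == 8 || kl == 16);
    for (int j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u) {
            dest[j] = src[j];      /* mask bit set: copy the element */
        } else if (zeroing) {
            dest[j] = 0;           /* zeroing-masking */
        }
        /* otherwise merging-masking: dest[j] is left unchanged */
    }
}
```

With a mask of 0101b over four elements, elements 0 and 2 are copied in both modes; elements 1 and 3 keep their old values under merging-masking and become 0 under zeroing-masking.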

VMOVDQA64 (EVEX encoded versions, register-copy form)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVDQA64 (EVEX encoded versions, store-form)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE *DEST[i+63:i] remains unchanged* ; merging-masking
    FI;
ENDFOR;

VMOVDQA64 (EVEX encoded versions, load-form)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVDQA (VEX.256 encoded version, load- and register-copy form)
DEST[255:0] ← SRC[255:0]
DEST[MAXVL-1:256] ← 0

VMOVDQA (VEX.256 encoded version, store-form)
DEST[255:0] ← SRC[255:0]

VMOVDQA (VEX.128 encoded version)
DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] ← 0

MOVDQA (128-bit load- and register-copy form, Legacy SSE version)
DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] (Unmodified)

(V)MOVDQA (128-bit store-form version)
DEST[127:0] ← SRC[127:0]

Intel C/C++ Compiler Intrinsic Equivalent

VMOVDQA32 __m512i _mm512_load_epi32( void * sa);

VMOVDQA32 __m512i _mm512_mask_load_epi32(__m512i s, __mmask16 k, void * sa);

VMOVDQA32 __m512i _mm512_maskz_load_epi32( __mmask16 k, void * sa);

VMOVDQA32 void _mm512_store_epi32(void * d, __m512i a);

VMOVDQA32 void _mm512_mask_store_epi32(void * d, __mmask16 k, __m512i a);

VMOVDQA32 __m256i _mm256_mask_load_epi32(__m256i s, __mmask8 k, void * sa);

VMOVDQA32 __m256i _mm256_maskz_load_epi32( __mmask8 k, void * sa);

VMOVDQA32 void _mm256_store_epi32(void * d, __m256i a);

VMOVDQA32 void _mm256_mask_store_epi32(void * d, __mmask8 k, __m256i a);

VMOVDQA32 __m128i _mm_mask_load_epi32(__m128i s, __mmask8 k, void * sa);

VMOVDQA32 __m128i _mm_maskz_load_epi32( __mmask8 k, void * sa);

VMOVDQA32 void _mm_store_epi32(void * d, __m128i a);

VMOVDQA32 void _mm_mask_store_epi32(void * d, __mmask8 k, __m128i a);

VMOVDQA64 __m512i _mm512_load_epi64( void * sa);

VMOVDQA64 __m512i _mm512_mask_load_epi64(__m512i s, __mmask8 k, void * sa);

VMOVDQA64 __m512i _mm512_maskz_load_epi64( __mmask8 k, void * sa);

VMOVDQA64 void _mm512_store_epi64(void * d, __m512i a);

VMOVDQA64 void _mm512_mask_store_epi64(void * d, __mmask8 k, __m512i a);

VMOVDQA64 __m256i _mm256_mask_load_epi64(__m256i s, __mmask8 k, void * sa);

VMOVDQA64 __m256i _mm256_maskz_load_epi64( __mmask8 k, void * sa);

VMOVDQA64 void _mm256_store_epi64(void * d, __m256i a);

VMOVDQA64 void _mm256_mask_store_epi64(void * d, __mmask8 k, __m256i a);

VMOVDQA64 __m128i _mm_mask_load_epi64(__m128i s, __mmask8 k, void * sa);

VMOVDQA64 __m128i _mm_maskz_load_epi64( __mmask8 k, void * sa);

VMOVDQA64 void _mm_store_epi64(void * d, __m128i a);

VMOVDQA64 void _mm_mask_store_epi64(void * d, __mmask8 k, __m128i a);

VMOVDQA __m256i _mm256_load_si256 (__m256i * p);

VMOVDQA void _mm256_store_si256(__m256i *p, __m256i a);

MOVDQA __m128i _mm_load_si128 (__m128i * p);

MOVDQA void _mm_store_si128(__m128i *p, __m128i a);

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 1.SSE2;

EVEX-encoded instruction, see Exceptions Type E1.

#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.

MOVDQU,VMOVDQU8/16/32/64—Move Unaligned Packed Integer Values


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F 6F /r MOVDQU xmm1, xmm2/m128 | A | V/V | SSE2 | Move unaligned packed integer values from xmm2/m128 to xmm1.
F3 0F 7F /r MOVDQU xmm2/m128, xmm1 | B | V/V | SSE2 | Move unaligned packed integer values from xmm1 to xmm2/m128.
VEX.128.F3.0F.WIG 6F /r VMOVDQU xmm1, xmm2/m128 | A | V/V | AVX | Move unaligned packed integer values from xmm2/m128 to xmm1.
VEX.128.F3.0F.WIG 7F /r VMOVDQU xmm2/m128, xmm1 | B | V/V | AVX | Move unaligned packed integer values from xmm1 to xmm2/m128.
VEX.256.F3.0F.WIG 6F /r VMOVDQU ymm1, ymm2/m256 | A | V/V | AVX | Move unaligned packed integer values from ymm2/m256 to ymm1.
VEX.256.F3.0F.WIG 7F /r VMOVDQU ymm2/m256, ymm1 | B | V/V | AVX | Move unaligned packed integer values from ymm1 to ymm2/m256.
EVEX.128.F2.0F.W0 6F /r VMOVDQU8 xmm1 {k1}{z}, xmm2/m128 | C | V/V | AVX512VL AVX512BW | Move unaligned packed byte integer values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.F2.0F.W0 6F /r VMOVDQU8 ymm1 {k1}{z}, ymm2/m256 | C | V/V | AVX512VL AVX512BW | Move unaligned packed byte integer values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.F2.0F.W0 6F /r VMOVDQU8 zmm1 {k1}{z}, zmm2/m512 | C | V/V | AVX512BW | Move unaligned packed byte integer values from zmm2/m512 to zmm1 using writemask k1.
EVEX.128.F2.0F.W0 7F /r VMOVDQU8 xmm2/m128 {k1}{z}, xmm1 | D | V/V | AVX512VL AVX512BW | Move unaligned packed byte integer values from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.F2.0F.W0 7F /r VMOVDQU8 ymm2/m256 {k1}{z}, ymm1 | D | V/V | AVX512VL AVX512BW | Move unaligned packed byte integer values from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.F2.0F.W0 7F /r VMOVDQU8 zmm2/m512 {k1}{z}, zmm1 | D | V/V | AVX512BW | Move unaligned packed byte integer values from zmm1 to zmm2/m512 using writemask k1.
EVEX.128.F2.0F.W1 6F /r VMOVDQU16 xmm1 {k1}{z}, xmm2/m128 | C | V/V | AVX512VL AVX512BW | Move unaligned packed word integer values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.F2.0F.W1 6F /r VMOVDQU16 ymm1 {k1}{z}, ymm2/m256 | C | V/V | AVX512VL AVX512BW | Move unaligned packed word integer values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.F2.0F.W1 6F /r VMOVDQU16 zmm1 {k1}{z}, zmm2/m512 | C | V/V | AVX512BW | Move unaligned packed word integer values from zmm2/m512 to zmm1 using writemask k1.
EVEX.128.F2.0F.W1 7F /r VMOVDQU16 xmm2/m128 {k1}{z}, xmm1 | D | V/V | AVX512VL AVX512BW | Move unaligned packed word integer values from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.F2.0F.W1 7F /r VMOVDQU16 ymm2/m256 {k1}{z}, ymm1 | D | V/V | AVX512VL AVX512BW | Move unaligned packed word integer values from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.F2.0F.W1 7F /r VMOVDQU16 zmm2/m512 {k1}{z}, zmm1 | D | V/V | AVX512BW | Move unaligned packed word integer values from zmm1 to zmm2/m512 using writemask k1.
EVEX.128.F3.0F.W0 6F /r VMOVDQU32 xmm1 {k1}{z}, xmm2/m128 | C | V/V | AVX512VL AVX512F | Move unaligned packed doubleword integer values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.F3.0F.W0 6F /r VMOVDQU32 ymm1 {k1}{z}, ymm2/m256 | C | V/V | AVX512VL AVX512F | Move unaligned packed doubleword integer values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.F3.0F.W0 6F /r VMOVDQU32 zmm1 {k1}{z}, zmm2/m512 | C | V/V | AVX512F | Move unaligned packed doubleword integer values from zmm2/m512 to zmm1 using writemask k1.
EVEX.128.F3.0F.W0 7F /r VMOVDQU32 xmm2/m128 {k1}{z}, xmm1 | D | V/V | AVX512VL AVX512F | Move unaligned packed doubleword integer values from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.F3.0F.W0 7F /r VMOVDQU32 ymm2/m256 {k1}{z}, ymm1 | D | V/V | AVX512VL AVX512F | Move unaligned packed doubleword integer values from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.F3.0F.W0 7F /r VMOVDQU32 zmm2/m512 {k1}{z}, zmm1 | D | V/V | AVX512F | Move unaligned packed doubleword integer values from zmm1 to zmm2/m512 using writemask k1.
EVEX.128.F3.0F.W1 6F /r VMOVDQU64 xmm1 {k1}{z}, xmm2/m128 | C | V/V | AVX512VL AVX512F | Move unaligned packed quadword integer values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.F3.0F.W1 6F /r VMOVDQU64 ymm1 {k1}{z}, ymm2/m256 | C | V/V | AVX512VL AVX512F | Move unaligned packed quadword integer values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.F3.0F.W1 6F /r VMOVDQU64 zmm1 {k1}{z}, zmm2/m512 | C | V/V | AVX512F | Move unaligned packed quadword integer values from zmm2/m512 to zmm1 using writemask k1.
EVEX.128.F3.0F.W1 7F /r VMOVDQU64 xmm2/m128 {k1}{z}, xmm1 | D | V/V | AVX512VL AVX512F | Move unaligned packed quadword integer values from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.F3.0F.W1 7F /r VMOVDQU64 ymm2/m256 {k1}{z}, ymm1 | D | V/V | AVX512VL AVX512F | Move unaligned packed quadword integer values from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.F3.0F.W1 7F /r VMOVDQU64 zmm2/m512 {k1}{z}, zmm1 | D | V/V | AVX512F | Move unaligned packed quadword integer values from zmm1 to zmm2/m512 using writemask k1.

Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
C | Full Mem | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
D | Full Mem | ModRM:r/m (w) | ModRM:reg (r) | NA | NA

Description

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise the instruction will #UD.

EVEX encoded versions:

Moves 128, 256 or 512 bits of packed byte/word/doubleword/quadword integer values from the source operand (the second operand) to the destination operand (first operand). This instruction can be used to load a vector register from a memory location, to store the contents of a vector register into a memory location, or to move data between two vector registers.

The destination operand is updated at 8-bit (VMOVDQU8), 16-bit (VMOVDQU16), 32-bit (VMOVDQU32), or 64-bit (VMOVDQU64) granularity according to the writemask.

VEX.256 encoded version:

Moves 256 bits of packed integer values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers.

Bits (MAXVL-1:256) of the destination register are zeroed.

128-bit versions:

Moves 128 bits of packed integer values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.

When the source or destination operand is a memory operand, the operand may be unaligned to any alignment without causing a general-protection exception (#GP) to be generated.

VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed.

Operation

VMOVDQU8 (EVEX encoded versions, register-copy form)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask*
        THEN DEST[i+7:i] ← SRC[i+7:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+7:i] remains unchanged*
                ELSE DEST[i+7:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVDQU8 (EVEX encoded versions, store-form)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask*
        THEN DEST[i+7:i] ← SRC[i+7:i]
        ELSE *DEST[i+7:i] remains unchanged* ; merging-masking
    FI;
ENDFOR;

VMOVDQU8 (EVEX encoded versions, load-form)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask*
        THEN DEST[i+7:i] ← SRC[i+7:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+7:i] remains unchanged*
                ELSE DEST[i+7:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0
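The store-form pseudocode differs from the load and register-copy forms in that masked-off elements of the destination are always left untouched. A minimal C sketch of the byte-granular case follows; the helper name `masked_store8` is hypothetical, and the model covers only the per-byte selection, not memory faults or ordering.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the VMOVDQU8 store-form pseudocode: byte j of src
   is written to mem only when bit j of k1 is set; every other byte of mem
   is left unchanged. kl is the byte count (16, 32 or 64). */
static void masked_store8(uint8_t *mem, const uint8_t *src,
                          uint64_t k1, size_t kl)
{
    assert(kl == 16 || kl == 32 || kl == 64);
    for (size_t j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u)
            mem[j] = src[j];
    }
}
```

A 64-bit mask is used so that the same helper can describe the 512-bit form, where KL is 64 and each mask bit controls one byte.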

VMOVDQU16 (EVEX encoded versions, register-copy form)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← SRC[i+15:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE DEST[i+15:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVDQU16 (EVEX encoded versions, store-form)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← SRC[i+15:i]
        ELSE *DEST[i+15:i] remains unchanged* ; merging-masking
    FI;
ENDFOR;

VMOVDQU16 (EVEX encoded versions, load-form)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← SRC[i+15:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE DEST[i+15:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVDQU32 (EVEX encoded versions, register-copy form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVDQU32 (EVEX encoded versions, store-form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
    FI;
ENDFOR;

VMOVDQU32 (EVEX encoded versions, load-form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVDQU64 (EVEX encoded versions, register-copy form)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVDQU64 (EVEX encoded versions, store-form)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE *DEST[i+63:i] remains unchanged* ; merging-masking
    FI;
ENDFOR;

VMOVDQU64 (EVEX encoded versions, load-form)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVDQU (VEX.256 encoded version, load- and register-copy form)
DEST[255:0] ← SRC[255:0]
DEST[MAXVL-1:256] ← 0

VMOVDQU (VEX.256 encoded version, store-form)
DEST[255:0] ← SRC[255:0]

VMOVDQU (VEX.128 encoded version)
DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] ← 0

MOVDQU (128-bit load- and register-copy form, Legacy SSE version)
DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] (Unmodified)

(V)MOVDQU (128-bit store-form version)
DEST[127:0] ← SRC[127:0]

Intel C/C++ Compiler Intrinsic Equivalent

VMOVDQU16 __m512i _mm512_mask_loadu_epi16(__m512i s, __mmask32 k, void * sa);

VMOVDQU16 __m512i _mm512_maskz_loadu_epi16( __mmask32 k, void * sa);

VMOVDQU16 void _mm512_mask_storeu_epi16(void * d, __mmask32 k, __m512i a);

VMOVDQU16 __m256i _mm256_mask_loadu_epi16(__m256i s, __mmask16 k, void * sa);

VMOVDQU16 __m256i _mm256_maskz_loadu_epi16( __mmask16 k, void * sa);

VMOVDQU16 void _mm256_mask_storeu_epi16(void * d, __mmask16 k, __m256i a);

VMOVDQU16 __m128i _mm_mask_loadu_epi16(__m128i s, __mmask8 k, void * sa);

VMOVDQU16 __m128i _mm_maskz_loadu_epi16( __mmask8 k, void * sa);

VMOVDQU16 void _mm_mask_storeu_epi16(void * d, __mmask8 k, __m128i a);

VMOVDQU32 __m512i _mm512_loadu_epi32( void * sa);

VMOVDQU32 __m512i _mm512_mask_loadu_epi32(__m512i s, __mmask16 k, void * sa);

VMOVDQU32 __m512i _mm512_maskz_loadu_epi32( __mmask16 k, void * sa);

VMOVDQU32 void _mm512_storeu_epi32(void * d, __m512i a);

VMOVDQU32 void _mm512_mask_storeu_epi32(void * d, __mmask16 k, __m512i a);

VMOVDQU32 __m256i _mm256_mask_loadu_epi32(__m256i s, __mmask8 k, void * sa);

VMOVDQU32 __m256i _mm256_maskz_loadu_epi32( __mmask8 k, void * sa);

VMOVDQU32 void _mm256_storeu_epi32(void * d, __m256i a);

VMOVDQU32 void _mm256_mask_storeu_epi32(void * d, __mmask8 k, __m256i a);

VMOVDQU32 __m128i _mm_mask_loadu_epi32(__m128i s, __mmask8 k, void * sa);

VMOVDQU32 __m128i _mm_maskz_loadu_epi32( __mmask8 k, void * sa);


VMOVDQU32 void _mm_storeu_epi32(void * d, __m128i a);

VMOVDQU32 void _mm_mask_storeu_epi32(void * d, __mmask8 k, __m128i a);

VMOVDQU64 __m512i _mm512_loadu_epi64( void * sa);

VMOVDQU64 __m512i _mm512_mask_loadu_epi64(__m512i s, __mmask8 k, void * sa);

VMOVDQU64 __m512i _mm512_maskz_loadu_epi64( __mmask8 k, void * sa);

VMOVDQU64 void _mm512_storeu_epi64(void * d, __m512i a);

VMOVDQU64 void _mm512_mask_storeu_epi64(void * d, __mmask8 k, __m512i a);

VMOVDQU64 __m256i _mm256_mask_loadu_epi64(__m256i s, __mmask8 k, void * sa);

VMOVDQU64 __m256i _mm256_maskz_loadu_epi64( __mmask8 k, void * sa);

VMOVDQU64 void _mm256_storeu_epi64(void * d, __m256i a);

VMOVDQU64 void _mm256_mask_storeu_epi64(void * d, __mmask8 k, __m256i a);

VMOVDQU64 __m128i _mm_mask_loadu_epi64(__m128i s, __mmask8 k, void * sa);

VMOVDQU64 __m128i _mm_maskz_loadu_epi64( __mmask8 k, void * sa);

VMOVDQU64 void _mm_storeu_epi64(void * d, __m128i a);

VMOVDQU64 void _mm_mask_storeu_epi64(void * d, __mmask8 k, __m128i a);

VMOVDQU8 __m512i _mm512_mask_loadu_epi8(__m512i s, __mmask64 k, void * sa);

VMOVDQU8 __m512i _mm512_maskz_loadu_epi8( __mmask64 k, void * sa);

VMOVDQU8 void _mm512_mask_storeu_epi8(void * d, __mmask64 k, __m512i a);

VMOVDQU8 __m256i _mm256_mask_loadu_epi8(__m256i s, __mmask32 k, void * sa);

VMOVDQU8 __m256i _mm256_maskz_loadu_epi8( __mmask32 k, void * sa);

VMOVDQU8 void _mm256_mask_storeu_epi8(void * d, __mmask32 k, __m256i a);

VMOVDQU8 __m128i _mm_mask_loadu_epi8(__m128i s, __mmask16 k, void * sa);

VMOVDQU8 __m128i _mm_maskz_loadu_epi8( __mmask16 k, void * sa);

VMOVDQU8 void _mm_mask_storeu_epi8(void * d, __mmask16 k, __m128i a);

VMOVDQU __m256i _mm256_loadu_si256 (__m256i * p);

VMOVDQU void _mm256_storeu_si256(__m256i *p, __m256i a);

MOVDQU __m128i _mm_loadu_si128 (__m128i * p);

MOVDQU void _mm_storeu_si128(__m128i *p, __m128i a);

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4;

EVEX-encoded instruction, see Exceptions Type E4.nb.

#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.

MOVDQ2Q—Move Quadword from XMM to MMX Technology Register


Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
F2 0F D6 /r | MOVDQ2Q mm, xmm | RM | Valid | Valid | Move low quadword from xmm to mmx

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Moves the low quadword from the source operand (second operand) to the destination operand (first operand). The source operand is an XMM register and the destination operand is an MMX technology register.

This instruction causes a transition from x87 FPU to MMX technology operation (that is, the x87 FPU top-of-stack pointer is set to 0 and the x87 FPU tag word is set to all 0s [valid]). If this instruction is executed while an x87 FPU floating-point exception is pending, the exception is handled before the MOVDQ2Q instruction is executed.

In 64-bit mode, use of the REX.R prefix permits this instruction to access additional registers (XMM8-XMM15).

Operation

DEST ← SRC[63:0];

Intel C/C++ Compiler Intrinsic Equivalent

MOVDQ2Q: __m64 _mm_movepi64_pi64 ( __m128i a)

SIMD Floating-Point Exceptions

None.

Protected Mode Exceptions

#NM If CR0.TS[bit 3] = 1.
#UD If CR0.EM[bit 2] = 1.
    If CR4.OSFXSR[bit 9] = 0.
    If CPUID.01H:EDX.SSE2[bit 26] = 0.
    If the LOCK prefix is used.
#MF If there is a pending x87 FPU exception.

Real-Address Mode Exceptions

Same exceptions as in protected mode.

Virtual-8086 Mode Exceptions

Same exceptions as in protected mode.

Compatibility Mode Exceptions

Same exceptions as in protected mode.

64-Bit Mode Exceptions

Same exceptions as in protected mode.

MOVHLPS—Move Packed Single-Precision Floating-Point Values High to Low


Instruction Operand Encoding (see footnote 1)

Description

This instruction cannot be used for memory to register moves.

128-bit two-argument form:

Moves two packed single-precision floating-point values from the high quadword of the second XMM argument (second operand) to the low quadword of the first XMM register (first argument). The quadword at bits 127:64 of the destination operand is left unchanged. Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.

128-bit and EVEX three-argument form:

Moves two packed single-precision floating-point values from the high quadword of the third XMM argument (third operand) to the low quadword of the destination (first operand). Copies the high quadword from the second XMM argument (second operand) to the high quadword of the destination (first operand). Bits (MAXVL-1:128) of the corresponding destination register are zeroed.

An attempt to execute VMOVHLPS encoded with VEX.L or EVEX.L’L = 1 will cause an #UD exception.

Operation

MOVHLPS (128-bit two-argument form)
DEST[63:0] ← SRC[127:64]
DEST[MAXVL-1:64] (Unmodified)

VMOVHLPS (128-bit three-argument form - VEX & EVEX)
DEST[63:0] ← SRC2[127:64]
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0
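The two forms above can be compared side by side in a small C model. This is a sketch under stated assumptions: the type `xmm_model` and the helpers `movhlps2` and `vmovhlps3` are hypothetical names, a 128-bit register is represented as two 64-bit halves, and bits above 127 are not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register model: q[0] holds bits 63:0 and q[1] holds bits
   127:64. Each half carries two packed single-precision values. */
typedef struct { uint64_t q[2]; } xmm_model;

/* MOVHLPS xmm1, xmm2 (two-argument form): only the low quadword of the
   destination changes; its high quadword is preserved. */
static void movhlps2(xmm_model *dest, const xmm_model *src)
{
    dest->q[0] = src->q[1];
}

/* VMOVHLPS xmm1, xmm2, xmm3 (three-argument form): the low quadword comes
   from the high half of src2, the high quadword from the high half of src1. */
static xmm_model vmovhlps3(const xmm_model *src1, const xmm_model *src2)
{
    xmm_model r;
    r.q[0] = src2->q[1];
    r.q[1] = src1->q[1];
    return r;
}
```

When the destination of the three-argument form is the same register as its first source, the low 128 bits end up identical to what the two-argument form produces.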

Intel C/C++ Compiler Intrinsic Equivalent

MOVHLPS __m128 _mm_movehl_ps(__m128 a, __m128 b)

SIMD Floating-Point Exceptions

None

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 12 /r MOVHLPS xmm1, xmm2 | RM | V/V | SSE | Move two packed single-precision floating-point values from high quadword of xmm2 to low quadword of xmm1.
VEX.128.0F.WIG 12 /r VMOVHLPS xmm1, xmm2, xmm3 | RVM | V/V | AVX | Merge two packed single-precision floating-point values from high quadword of xmm3 and high quadword of xmm2.
EVEX.128.0F.W0 12 /r VMOVHLPS xmm1, xmm2, xmm3 | RVM | V/V | AVX512F | Merge two packed single-precision floating-point values from high quadword of xmm3 and high quadword of xmm2.

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | vvvv (r) | ModRM:r/m (r) | NA

1. ModRM.MOD = 011B required


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 7; additionally

#UD If VEX.L = 1.

EVEX-encoded instruction, see Exceptions Type E7NM.128.

MOVHPD—Move High Packed Double-Precision Floating-Point Value


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 16 /r MOVHPD xmm1, m64 | A | V/V | SSE2 | Move double-precision floating-point value from m64 to high quadword of xmm1.
VEX.128.66.0F.WIG 16 /r VMOVHPD xmm2, xmm1, m64 | B | V/V | AVX | Merge double-precision floating-point value from m64 and the low quadword of xmm1.
EVEX.128.66.0F.W1 16 /r VMOVHPD xmm2, xmm1, m64 | D | V/V | AVX512F | Merge double-precision floating-point value from m64 and the low quadword of xmm1.
66 0F 17 /r MOVHPD m64, xmm1 | C | V/V | SSE2 | Move double-precision floating-point value from high quadword of xmm1 to m64.
VEX.128.66.0F.WIG 17 /r VMOVHPD m64, xmm1 | C | V/V | AVX | Move double-precision floating-point value from high quadword of xmm1 to m64.
EVEX.128.66.0F.W1 17 /r VMOVHPD m64, xmm1 | E | V/V | AVX512F | Move double-precision floating-point value from high quadword of xmm1 to m64.

Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA
C | NA | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
D | Tuple1 Scalar | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA
E | Tuple1 Scalar | ModRM:r/m (w) | ModRM:reg (r) | NA | NA

Description

This instruction cannot be used for register to register or memory to memory moves.

128-bit Legacy SSE load:

Moves a double-precision floating-point value from the source 64-bit memory operand and stores it in the high 64 bits of the destination XMM register. The lower 64 bits of the XMM register are preserved. Bits (MAXVL-1:128) of the corresponding destination register are preserved.

VEX.128 & EVEX encoded load:

Loads a double-precision floating-point value from the source 64-bit memory operand (the third operand) and stores it in the upper 64 bits of the destination XMM register (first operand). The low 64 bits from the first source operand (second operand) are copied to the low 64 bits of the destination. Bits (MAXVL-1:128) of the corresponding destination register are zeroed.

128-bit store:

Stores a double-precision floating-point value from the high 64 bits of the XMM register source (second operand) to the 64-bit memory location (first operand).

Note: VMOVHPD (store) (VEX.128.66.0F 17 /r) is legal and has the same behavior as the existing 66 0F 17 store. For VMOVHPD (store), VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise the instruction will #UD.

An attempt to execute VMOVHPD encoded with VEX.L or EVEX.L’L = 1 will cause an #UD exception.


Operation

MOVHPD (128-bit Legacy SSE load)
DEST[63:0] (Unmodified)
DEST[127:64] ← SRC[63:0]
DEST[MAXVL-1:128] (Unmodified)

VMOVHPD (VEX.128 & EVEX encoded load)
DEST[63:0] ← SRC1[63:0]
DEST[127:64] ← SRC2[63:0]
DEST[MAXVL-1:128] ← 0

VMOVHPD (store)
DEST[63:0] ← SRC[127:64]
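The three operations above can be sketched in portable C as follows. The type `xmm_pd_model` and the helper names are hypothetical; the model represents the low 128 bits of a register as two `double` elements and copies values rather than raw bit patterns, so it is an approximation of the data movement only and does not cover bits above 127 or any exception behavior.

```c
#include <assert.h>

/* Hypothetical register model: d[0] holds bits 63:0, d[1] holds bits 127:64. */
typedef struct { double d[2]; } xmm_pd_model;

/* MOVHPD xmm1, m64 (legacy SSE load): only the high element is replaced. */
static void movhpd_load(xmm_pd_model *dest, const double *m64)
{
    dest->d[1] = *m64;
}

/* VMOVHPD xmm2, xmm1, m64 (VEX/EVEX load): the low element comes from the
   first source register, the high element from memory. */
static xmm_pd_model vmovhpd_load(const xmm_pd_model *src1, const double *m64)
{
    xmm_pd_model r;
    r.d[0] = src1->d[0];
    r.d[1] = *m64;
    return r;
}

/* (V)MOVHPD m64, xmm1 (store): the high element is written to memory. */
static void movhpd_store(double *m64, const xmm_pd_model *src)
{
    *m64 = src->d[1];
}
```

The legacy load modifies its destination in place, while the VEX/EVEX load builds the result from a separate source register, which is why the sketch gives the latter a return value.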

Intel C/C++ Compiler Intrinsic Equivalent

MOVHPD __m128d _mm_loadh_pd ( __m128d a, double *p)

MOVHPD void _mm_storeh_pd (double *p, __m128d a)
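The difference between the legacy and the VEX/EVEX load forms is easiest to see on a register wider than 128 bits. The following is a minimal C sketch of the Operation pseudocode above, not a description of any hardware implementation. The `vreg_t` type (four 64-bit lanes, so MAXVL is assumed to be 256 here) and all function names are hypothetical and exist only for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical register image: four 64-bit lanes (MAXVL assumed to be 256). */
typedef struct { uint64_t q[4]; } vreg_t;

/* MOVHPD xmm1, m64 (legacy SSE load): only bits 127:64 change. */
static void movhpd_load_legacy(vreg_t *dest, const double *m64)
{
    memcpy(&dest->q[1], m64, sizeof(uint64_t));
}

/* VMOVHPD xmm2, xmm1, m64: low quadword from src1, high quadword from
 * memory, bits MAXVL-1:128 zeroed. */
static void vmovhpd_load(vreg_t *dest, const vreg_t *src1, const double *m64)
{
    uint64_t hi;
    uint64_t lo = src1->q[0];   /* read first: dest may alias src1 */
    memcpy(&hi, m64, sizeof hi);
    dest->q[0] = lo;
    dest->q[1] = hi;
    dest->q[2] = 0;
    dest->q[3] = 0;
}

/* (V)MOVHPD m64, xmm1 (store): writes bits 127:64 of the source to memory. */
static void movhpd_store(double *m64, const vreg_t *src)
{
    memcpy(m64, &src->q[1], sizeof(uint64_t));
}
```

In this model the legacy load leaves lanes 2 and 3 untouched, while the VEX/EVEX load clears them; the store behaves identically for all encodings.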

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 5; additionally

#UD If VEX.L = 1.

EVEX-encoded instruction, see Exceptions Type E9NF.

MOVHPS—Move High Packed Single-Precision Floating-Point Values


Instruction Operand Encoding

Description

This instruction cannot be used for register to register or memory to memory moves.

128-bit Legacy SSE load:

Moves two packed single-precision floating-point values from the source 64-bit memory operand and stores them in the high 64 bits of the destination XMM register. The lower 64 bits of the XMM register are preserved. Bits (MAXVL-1:128) of the corresponding destination register are preserved.

VEX.128 & EVEX encoded load:

Loads two single-precision floating-point values from the source 64-bit memory operand (the third operand) and stores them in the upper 64 bits of the destination XMM register (first operand). The low 64 bits from the first source operand (the second operand) are copied to the lower 64 bits of the destination. Bits (MAXVL-1:128) of the corresponding destination register are zeroed.

128-bit store:

Stores two packed single-precision floating-point values from the high 64-bits of the XMM register source (second

operand) to the 64-bit memory location (first operand).

Note: VMOVHPS (store) (VEX.128.0F 17 /r) is legal and has the same behavior as the existing 0F 17 store. For VMOVHPS (store), VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise the instruction will #UD.

If VMOVHPS is encoded with VEX.L = 1 or EVEX.L’L = 1, an attempt to execute the instruction will cause a #UD exception.

Opcode | Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
NP 0F 16 /r | MOVHPS xmm1, m64 | A | V/V | SSE | Move two packed single-precision floating-point values from m64 to high quadword of xmm1.
VEX.128.0F.WIG 16 /r | VMOVHPS xmm2, xmm1, m64 | B | V/V | AVX | Merge two packed single-precision floating-point values from m64 and the low quadword of xmm1.
EVEX.128.0F.W0 16 /r | VMOVHPS xmm2, xmm1, m64 | D | V/V | AVX512F | Merge two packed single-precision floating-point values from m64 and the low quadword of xmm1.
NP 0F 17 /r | MOVHPS m64, xmm1 | C | V/V | SSE | Move two packed single-precision floating-point values from high quadword of xmm1 to m64.
VEX.128.0F.WIG 17 /r | VMOVHPS m64, xmm1 | C | V/V | AVX | Move two packed single-precision floating-point values from high quadword of xmm1 to m64.
EVEX.128.0F.W0 17 /r | VMOVHPS m64, xmm1 | E | V/V | AVX512F | Move two packed single-precision floating-point values from high quadword of xmm1 to m64.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (r, w) ModRM:r/m (r) NA NA

B NA ModRM:reg (w) VEX.vvvv ModRM:r/m (r) NA

C NA ModRM:r/m (w) ModRM:reg (r) NA NA

D Tuple2 ModRM:reg (w) EVEX.vvvv ModRM:r/m (r) NA

E Tuple2 ModRM:r/m (w) ModRM:reg (r) NA NA


Operation

MOVHPS (128-bit Legacy SSE load)
DEST[63:0] (Unmodified)
DEST[127:64] ← SRC[63:0]
DEST[MAXVL-1:128] (Unmodified)

VMOVHPS (VEX.128 and EVEX encoded load)
DEST[63:0] ← SRC1[63:0]
DEST[127:64] ← SRC2[63:0]
DEST[MAXVL-1:128] ← 0

VMOVHPS (store)
DEST[63:0] ← SRC[127:64]

Intel C/C++ Compiler Intrinsic Equivalent

MOVHPS __m128 _mm_loadh_pi ( __m128 a, __m64 *p)

MOVHPS void _mm_storeh_pi (__m64 *p, __m128 a)
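The lane movement performed by the two intrinsics listed above can be sketched in portable C. This is only a model of the documented semantics under the assumption that a 128-bit register is viewed as four `float` lanes; `ps128_t`, `loadh_pi_model` and `storeh_pi_model` are hypothetical names, not part of any Intel header.

```c
#include <string.h>

/* Hypothetical stand-in for a 128-bit XMM register holding four floats. */
typedef struct { float f[4]; } ps128_t;

/* Models _mm_loadh_pi(a, p): lanes 0-1 are kept from a,
 * lanes 2-3 are read from the 8 bytes at p. */
static ps128_t loadh_pi_model(ps128_t a, const void *p)
{
    memcpy(&a.f[2], p, 2 * sizeof(float));
    return a;
}

/* Models _mm_storeh_pi(p, a): writes lanes 2-3 of a to the 8 bytes at p. */
static void storeh_pi_model(void *p, ps128_t a)
{
    memcpy(p, &a.f[2], 2 * sizeof(float));
}
```

The memory operand is always 64 bits wide, which is why both helpers copy exactly two floats.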

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 5; additionally

#UD If VEX.L = 1.

EVEX-encoded instruction, see Exceptions Type E9NF.

MOVLHPS—Move Packed Single-Precision Floating-Point Values Low to High


Instruction Operand Encoding¹

Description

This instruction cannot be used for memory to register moves.

128-bit two-argument form:

Moves two packed single-precision floating-point values from the low quadword of the second XMM argument (second operand) to the high quadword of the first XMM register (first argument). The low quadword of the destination operand is left unchanged. Bits (MAXVL-1:128) of the corresponding destination register are unmodified.

128-bit three-argument forms:

Moves two packed single-precision floating-point values from the low quadword of the third XMM argument (third

operand) to the high quadword of the destination (first operand). Copies the low quadword from the second XMM

argument (second operand) to the low quadword of the destination (first operand). Bits (MAXVL-1:128) of the

corresponding destination register are zeroed.

If VMOVLHPS is encoded with VEX.L = 1 or EVEX.L’L = 1, an attempt to execute the instruction will cause a #UD exception.

Operation

MOVLHPS (128-bit two-argument form)
DEST[63:0] (Unmodified)
DEST[127:64] ← SRC[63:0]
DEST[MAXVL-1:128] (Unmodified)

VMOVLHPS (128-bit three-argument form - VEX & EVEX)
DEST[63:0] ← SRC1[63:0]
DEST[127:64] ← SRC2[63:0]
DEST[MAXVL-1:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

MOVLHPS __m128 _mm_movelh_ps(__m128 a, __m128 b)
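As a rough illustration of the three-argument form, the sketch below models the result of `_mm_movelh_ps(a, b)` on four `float` lanes. It is a portable approximation of the Operation section, assuming the register is viewed as an array of floats; `lanes4_t` and `movelh_ps_model` are hypothetical names.

```c
/* Hypothetical stand-in for a 128-bit XMM register holding four floats. */
typedef struct { float f[4]; } lanes4_t;

/* Models _mm_movelh_ps(a, b): the low quadword of the result is the low
 * quadword of a, the high quadword is the low quadword of b. */
static lanes4_t movelh_ps_model(lanes4_t a, lanes4_t b)
{
    lanes4_t r;
    r.f[0] = a.f[0];
    r.f[1] = a.f[1];
    r.f[2] = b.f[0];
    r.f[3] = b.f[1];
    return r;
}
```

Passing the same value for both arguments duplicates the low quadword into the high quadword, which corresponds to the two-argument form with identical source and destination registers.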

SIMD Floating-Point Exceptions

None

Opcode | Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
NP 0F 16 /r | MOVLHPS xmm1, xmm2 | RM | V/V | SSE | Move two packed single-precision floating-point values from low quadword of xmm2 to high quadword of xmm1.
VEX.128.0F.WIG 16 /r | VMOVLHPS xmm1, xmm2, xmm3 | RVM | V/V | AVX | Merge two packed single-precision floating-point values from low quadword of xmm3 and low quadword of xmm2.
EVEX.128.0F.W0 16 /r | VMOVLHPS xmm1, xmm2, xmm3 | RVM | V/V | AVX512F | Merge two packed single-precision floating-point values from low quadword of xmm3 and low quadword of xmm2.

Op/En Operand 1 Operand 2 Operand 3 Operand 4

RM ModRM:reg (w) ModRM:r/m (r) NA NA

RVM ModRM:reg (w) vvvv (r) ModRM:r/m (r) NA

1. ModRM.MOD = 011B required


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 7; additionally

#UD If VEX.L = 1.

EVEX-encoded instruction, see Exceptions Type E7NM.128.

MOVLPD—Move Low Packed Double-Precision Floating-Point Value


Instruction Operand Encoding

Description

This instruction cannot be used for register to register or memory to memory moves.

128-bit Legacy SSE load:

Moves a double-precision floating-point value from the source 64-bit memory operand and stores it in the low 64 bits of the destination XMM register. The upper 64 bits of the XMM register are preserved. Bits (MAXVL-1:128) of the corresponding destination register are preserved.

VEX.128 & EVEX encoded load:

Loads a double-precision floating-point value from the source 64-bit memory operand (third operand), merges it

with the upper 64-bits of the first source XMM register (second operand), and stores it in the low 128-bits of the

destination XMM register (first operand). Bits (MAXVL-1:128) of the corresponding destination register are zeroed.

128-bit store:

Stores a double-precision floating-point value from the low 64-bits of the XMM register source (second operand) to

the 64-bit memory location (first operand).

Note: VMOVLPD (store) (VEX.128.66.0F 13 /r) is legal and has the same behavior as the existing 66 0F 13 store.

For VMOVLPD (store), VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise the instruction will #UD.

If VMOVLPD is encoded with VEX.L = 1 or EVEX.L’L = 1, an attempt to execute the instruction will cause a #UD exception.

Operation

MOVLPD (128-bit Legacy SSE load)
DEST[63:0] ← SRC[63:0]
DEST[MAXVL-1:64] (Unmodified)

Opcode | Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
66 0F 12 /r | MOVLPD xmm1, m64 | A | V/V | SSE2 | Move double-precision floating-point value from m64 to low quadword of xmm1.
VEX.128.66.0F.WIG 12 /r | VMOVLPD xmm2, xmm1, m64 | B | V/V | AVX | Merge double-precision floating-point value from m64 and the high quadword of xmm1.
EVEX.128.66.0F.W1 12 /r | VMOVLPD xmm2, xmm1, m64 | D | V/V | AVX512F | Merge double-precision floating-point value from m64 and the high quadword of xmm1.
66 0F 13 /r | MOVLPD m64, xmm1 | C | V/V | SSE2 | Move double-precision floating-point value from low quadword of xmm1 to m64.
VEX.128.66.0F.WIG 13 /r | VMOVLPD m64, xmm1 | C | V/V | AVX | Move double-precision floating-point value from low quadword of xmm1 to m64.
EVEX.128.66.0F.W1 13 /r | VMOVLPD m64, xmm1 | E | V/V | AVX512F | Move double-precision floating-point value from low quadword of xmm1 to m64.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (r, w) ModRM:r/m (r) NA NA

B NA ModRM:reg (w) VEX.vvvv ModRM:r/m (r) NA

C NA ModRM:r/m (w) ModRM:reg (r) NA NA

D Tuple1 Scalar ModRM:reg (w) EVEX.vvvv ModRM:r/m (r) NA

E Tuple1 Scalar ModRM:r/m (w) ModRM:reg (r) NA NA


VMOVLPD (VEX.128 & EVEX encoded load)
DEST[63:0] ← SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0

VMOVLPD (store)
DEST[63:0] ← SRC[63:0]

Intel C/C++ Compiler Intrinsic Equivalent

MOVLPD __m128d _mm_loadl_pd ( __m128d a, double *p)

MOVLPD void _mm_storel_pd (double *p, __m128d a)
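One common use of the half-register loads is to assemble a register from two doubles that are not adjacent in memory. The sketch below models that pattern in portable C, pairing a model of `_mm_loadl_pd` with a model of `_mm_loadh_pd` (the MOVHPD counterpart). The type `pd128_t`, the `*_model` helpers and `gather2` are hypothetical names introduced for illustration only.

```c
#include <stddef.h>

/* Hypothetical stand-in for a 128-bit XMM register holding two doubles. */
typedef struct { double d[2]; } pd128_t;

/* Models _mm_loadl_pd(a, p): lane 0 from memory, lane 1 kept from a. */
static pd128_t loadl_pd_model(pd128_t a, const double *p)
{
    a.d[0] = *p;
    return a;
}

/* Models _mm_loadh_pd(a, p): lane 1 from memory, lane 0 kept from a. */
static pd128_t loadh_pd_model(pd128_t a, const double *p)
{
    a.d[1] = *p;
    return a;
}

/* Models _mm_storel_pd(p, a): writes lane 0 of a to memory. */
static void storel_pd_model(double *p, pd128_t a)
{
    *p = a.d[0];
}

/* Collects base[0] and base[stride] into one register image. */
static pd128_t gather2(const double *base, size_t stride)
{
    pd128_t v = {{0.0, 0.0}};
    v = loadl_pd_model(v, base);
    v = loadh_pd_model(v, base + stride);
    return v;
}
```

With a stride of 2, `gather2` picks every other element, for example the real parts of two interleaved complex numbers.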

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 5; additionally

#UD If VEX.L = 1.

EVEX-encoded instruction, see Exceptions Type E9NF.

MOVLPS—Move Low Packed Single-Precision Floating-Point Values


Instruction Operand Encoding

Description

This instruction cannot be used for register to register or memory to memory moves.

128-bit Legacy SSE load:

Moves two packed single-precision floating-point values from the source 64-bit memory operand and stores them in the low 64 bits of the destination XMM register. The upper 64 bits of the XMM register are preserved. Bits (MAXVL-1:128) of the corresponding destination register are preserved.

VEX.128 & EVEX encoded load:

Loads two packed single-precision floating-point values from the source 64-bit memory operand (the third operand), merges them with the upper 64 bits of the first source operand (the second operand), and stores them in the low 128 bits of the destination register (the first operand). Bits (MAXVL-1:128) of the corresponding destination register are zeroed.

128-bit store:

Stores two packed single-precision floating-point values from the low 64 bits of the XMM register source (second operand) to the 64-bit memory location (first operand).

Note: VMOVLPS (store) (VEX.128.0F 13 /r) is legal and has the same behavior as the existing 0F 13 store. For VMOVLPS (store), VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise the instruction will #UD.

If VMOVLPS is encoded with VEX.L = 1 or EVEX.L’L = 1, an attempt to execute the instruction will cause a #UD exception.

Opcode | Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
NP 0F 12 /r | MOVLPS xmm1, m64 | A | V/V | SSE | Move two packed single-precision floating-point values from m64 to low quadword of xmm1.
VEX.128.0F.WIG 12 /r | VMOVLPS xmm2, xmm1, m64 | B | V/V | AVX | Merge two packed single-precision floating-point values from m64 and the high quadword of xmm1.
EVEX.128.0F.W0 12 /r | VMOVLPS xmm2, xmm1, m64 | D | V/V | AVX512F | Merge two packed single-precision floating-point values from m64 and the high quadword of xmm1.
0F 13 /r | MOVLPS m64, xmm1 | C | V/V | SSE | Move two packed single-precision floating-point values from low quadword of xmm1 to m64.
VEX.128.0F.WIG 13 /r | VMOVLPS m64, xmm1 | C | V/V | AVX | Move two packed single-precision floating-point values from low quadword of xmm1 to m64.
EVEX.128.0F.W0 13 /r | VMOVLPS m64, xmm1 | E | V/V | AVX512F | Move two packed single-precision floating-point values from low quadword of xmm1 to m64.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (r, w) ModRM:r/m (r) NA NA

B NA ModRM:reg (w) VEX.vvvv ModRM:r/m (r) NA

C NA ModRM:r/m (w) ModRM:reg (r) NA NA

D Tuple2 ModRM:reg (w) EVEX.vvvv ModRM:r/m (r) NA

E Tuple2 ModRM:r/m (w) ModRM:reg (r) NA NA


Operation

MOVLPS (128-bit Legacy SSE load)
DEST[63:0] ← SRC[63:0]
DEST[MAXVL-1:64] (Unmodified)

VMOVLPS (VEX.128 & EVEX encoded load)
DEST[63:0] ← SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0

VMOVLPS (store)
DEST[63:0] ← SRC[63:0]

Intel C/C++ Compiler Intrinsic Equivalent

MOVLPS __m128 _mm_loadl_pi ( __m128 a, __m64 *p)

MOVLPS void _mm_storel_pi (__m64 *p, __m128 a)

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 5; additionally

#UD If VEX.L = 1.

EVEX-encoded instruction, see Exceptions Type E9NF.

MOVMSKPD—Extract Packed Double-Precision Floating-Point Sign Mask


Instruction Operand Encoding

Description

Extracts the sign bits from the packed double-precision floating-point values in the source operand (second operand), formats them into a 2-bit mask, and stores the mask in the destination operand (first operand). The source operand is an XMM register, and the destination operand is a general-purpose register. The mask is stored in the 2 low-order bits of the destination operand. The upper bits of the destination operand beyond the mask are filled with zeros.

In 64-bit mode, the instruction can access additional registers (XMM8-XMM15, R8-R15) when used with a REX.R

prefix. The default operand size is 64-bit in 64-bit mode.

128-bit versions: The source operand is an XMM register. The destination operand is a general-purpose register.

VEX.256 encoded version: The source operand is a YMM register. The destination operand is a general-purpose register.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b; otherwise the instruction will #UD.

Operation

(V)MOVMSKPD (128-bit versions)
DEST[0] ← SRC[63]
DEST[1] ← SRC[127]
IF DEST = r32
    THEN DEST[31:2] ← 0;
    ELSE DEST[63:2] ← 0;

VMOVMSKPD (VEX.256 encoded version)
DEST[0] ← SRC[63]
DEST[1] ← SRC[127]
DEST[2] ← SRC[191]
DEST[3] ← SRC[255]
IF DEST = r32
    THEN DEST[31:4] ← 0;
    ELSE DEST[63:4] ← 0;

Opcode | Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
66 0F 50 /r | MOVMSKPD reg, xmm | RM | V/V | SSE2 | Extract 2-bit sign mask from xmm and store in reg. The upper bits of r32 or r64 are filled with zeros.
VEX.128.66.0F.WIG 50 /r | VMOVMSKPD reg, xmm2 | RM | V/V | AVX | Extract 2-bit sign mask from xmm2 and store in reg. The upper bits of r32 or r64 are zeroed.
VEX.256.66.0F.WIG 50 /r | VMOVMSKPD reg, ymm2 | RM | V/V | AVX | Extract 4-bit sign mask from ymm2 and store in reg. The upper bits of r32 or r64 are zeroed.

Op/En Operand 1 Operand 2 Operand 3 Operand 4

RM ModRM:reg (w) ModRM:r/m (r) NA NA


Intel C/C++ Compiler Intrinsic Equivalent

MOVMSKPD: int _mm_movemask_pd (__m128d a)

VMOVMSKPD: int _mm256_movemask_pd (__m256d a)
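The mask construction in the Operation section can be reproduced in portable C by reading bit 63 of each element. The following is a sketch under the assumption that `double` is the 64-bit IEEE 754 format; `sign_bit_f64` and `movmskpd_model` are hypothetical helper names, not intrinsics.

```c
#include <stdint.h>
#include <string.h>

/* Returns the sign bit (bit 63) of a double as 0 or 1. */
static unsigned sign_bit_f64(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (unsigned)(bits >> 63);
}

/* Models (V)MOVMSKPD: n is 2 for the 128-bit forms and 4 for VEX.256.
 * Bit i of the result is the sign bit of element i; all other bits are 0. */
static int movmskpd_model(const double *src, int n)
{
    int mask = 0;
    for (int i = 0; i < n; i++)
        mask |= (int)sign_bit_f64(src[i]) << i;
    return mask;
}
```

Because only the sign bit is inspected, negative zero and negative NaN values also set their mask bit; the instruction performs no floating-point comparison.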

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 7; additionally

#UD If VEX.vvvv ≠ 1111B.

MOVMSKPS—Extract Packed Single-Precision Floating-Point Sign Mask


Instruction Operand Encoding¹

Description

Extracts the sign bits from the packed single-precision floating-point values in the source operand (second

operand), formats them into a 4- or 8-bit mask, and stores the mask in the destination operand (first operand). The

source operand is an XMM or YMM register, and the destination operand is a general-purpose register. The mask is

stored in the 4 or 8 low-order bits of the destination operand. The upper bits of the destination operand beyond the

mask are filled with zeros.

In 64-bit mode, the instruction can access additional registers (XMM8-XMM15, R8-R15) when used with a REX.R

prefix. The default operand size is 64-bit in 64-bit mode.

128-bit versions: The source operand is an XMM register. The destination operand is a general-purpose register.

VEX.256 encoded version: The source operand is a YMM register. The destination operand is a general-purpose register.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b; otherwise the instruction will #UD.

Operation

DEST[0] ← SRC[31];

DEST[1] ← SRC[63];

DEST[2] ← SRC[95];

DEST[3] ← SRC[127];

IF DEST = r32

THEN DEST[31:4] ← ZeroExtend;

ELSE DEST[63:4] ← ZeroExtend;

FI;

Opcode | Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
NP 0F 50 /r | MOVMSKPS reg, xmm | RM | V/V | SSE | Extract 4-bit sign mask from xmm and store in reg. The upper bits of r32 or r64 are filled with zeros.
VEX.128.0F.WIG 50 /r | VMOVMSKPS reg, xmm2 | RM | V/V | AVX | Extract 4-bit sign mask from xmm2 and store in reg. The upper bits of r32 or r64 are zeroed.
VEX.256.0F.WIG 50 /r | VMOVMSKPS reg, ymm2 | RM | V/V | AVX | Extract 8-bit sign mask from ymm2 and store in reg. The upper bits of r32 or r64 are zeroed.

Op/En Operand 1 Operand 2 Operand 3 Operand 4

RM ModRM:reg (w) ModRM:r/m (r) NA NA

1. ModRM.MOD = 011B required


(V)MOVMSKPS (128-bit version)
DEST[0] ← SRC[31]
DEST[1] ← SRC[63]
DEST[2] ← SRC[95]
DEST[3] ← SRC[127]
IF DEST = r32
    THEN DEST[31:4] ← 0;
    ELSE DEST[63:4] ← 0;

VMOVMSKPS (VEX.256 encoded version)
DEST[0] ← SRC[31]
DEST[1] ← SRC[63]
DEST[2] ← SRC[95]
DEST[3] ← SRC[127]
DEST[4] ← SRC[159]
DEST[5] ← SRC[191]
DEST[6] ← SRC[223]
DEST[7] ← SRC[255]
IF DEST = r32
    THEN DEST[31:8] ← 0;
    ELSE DEST[63:8] ← 0;

Intel C/C++ Compiler Intrinsic Equivalent

int _mm_movemask_ps(__m128 a)

int _mm256_movemask_ps(__m256 a)
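A typical reason to extract the mask is to branch on whether any or all lanes have their sign bit set. The sketch below models the mask for 4 or 8 single-precision lanes in portable C and builds two such predicates on top of it. It assumes `float` is the 32-bit IEEE 754 format; `movmskps_model`, `any_sign_set` and `all_signs_set` are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* Models (V)MOVMSKPS: lanes is 4 (128-bit forms) or 8 (VEX.256).
 * Bit i of the result is bit 31 of element i; upper bits are 0. */
static unsigned movmskps_model(const float *src, unsigned lanes)
{
    unsigned mask = 0;
    for (unsigned i = 0; i < lanes; i++) {
        uint32_t bits;
        memcpy(&bits, &src[i], sizeof bits);
        mask |= (unsigned)(bits >> 31) << i;
    }
    return mask;
}

/* Nonzero if at least one lane has its sign bit set. */
static int any_sign_set(const float *src, unsigned lanes)
{
    return movmskps_model(src, lanes) != 0;
}

/* Nonzero if every lane has its sign bit set. */
static int all_signs_set(const float *src, unsigned lanes)
{
    return movmskps_model(src, lanes) == (1u << lanes) - 1u;
}
```

The same predicates are often applied to the result of a packed compare, where each lane is all ones or all zeros, so the sign bit alone carries the comparison outcome.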

SIMD Floating-Point Exceptions

None.

Other Exceptions

See Exceptions Type 7; additionally

#UD If VEX.vvvv ≠ 1111B.

MOVNTDQA—Load Double Quadword Non-Temporal Aligned Hint


Instruction Operand Encoding¹

Description

MOVNTDQA loads a double quadword from the source operand (second operand) to the destination operand (first

operand) using a non-temporal hint if the memory source is WC (write combining) memory type. For WC memory

type, the nontemporal hint may be implemented by loading a temporary internal buffer with the equivalent of an

aligned cache line without filling this data to the cache. Any memory-type aliased lines in the cache will be snooped

and flushed. Subsequent MOVNTDQA reads to unread portions of the WC cache line will receive data from the

temporary internal buffer if data is available. The temporary internal buffer may be flushed by the processor at any

time for any reason, for example:

• A load operation other than a MOVNTDQA which references memory already resident in a temporary internal

buffer.

• A non-WC reference to memory already resident in a temporary internal buffer.

• Interleaving of reads and writes to a single temporary internal buffer.

• Repeated (V)MOVNTDQA loads of a particular 16-byte item in a streaming line.

• Certain micro-architectural conditions including resource shortages, detection of a mis-speculation condition, and various fault conditions.

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when reading the

data from memory. Using this protocol, the processor

does not read the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into

the cache hierarchy. The memory type of the region being read can override the non-temporal hint, if the memory

address specified for the non-temporal read is not a WC memory region. Information on non-temporal reads and

writes can be found in “Caching of Temporal vs. Non-Temporal Data” in Chapter 10 in the Intel® 64 and IA-32

Architecture Software Developer’s Manual, Volume 3A.

Opcode | Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
66 0F 38 2A /r | MOVNTDQA xmm1, m128 | A | V/V | SSE4_1 | Move double quadword from m128 to xmm1 using non-temporal hint if WC memory type.
VEX.128.66.0F38.WIG 2A /r | VMOVNTDQA xmm1, m128 | A | V/V | AVX | Move double quadword from m128 to xmm using non-temporal hint if WC memory type.
VEX.256.66.0F38.WIG 2A /r | VMOVNTDQA ymm1, m256 | A | V/V | AVX2 | Move 256-bit data from m256 to ymm using non-temporal hint if WC memory type.
EVEX.128.66.0F38.W0 2A /r | VMOVNTDQA xmm1, m128 | B | V/V | AVX512VL AVX512F | Move 128-bit data from m128 to xmm using non-temporal hint if WC memory type.
EVEX.256.66.0F38.W0 2A /r | VMOVNTDQA ymm1, m256 | B | V/V | AVX512VL AVX512F | Move 256-bit data from m256 to ymm using non-temporal hint if WC memory type.
EVEX.512.66.0F38.W0 2A /r | VMOVNTDQA zmm1, m512 | B | V/V | AVX512F | Move 512-bit data from m512 to zmm using non-temporal hint if WC memory type.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4
A NA ModRM:reg (w) ModRM:r/m (r) NA NA
B Full Mem ModRM:reg (w) ModRM:r/m (r) NA NA
1. ModRM.MOD != 011B

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with a MFENCE instruction should be used in conjunction with MOVNTDQA instructions if multiple processors might use different memory types for the referenced memory locations or to synchronize reads of a processor with writes by other agents in the system. A processor’s implementation of the streaming load hint does not override the effective memory type, but the implementation of the hint is processor dependent. For example, a processor implementation may choose to ignore the hint and process the instruction as a normal MOVDQA for any memory type. Alternatively, another implementation may optimize cache reads generated by MOVNTDQA on WB memory type to reduce cache evictions.

The 128-bit (V)MOVNTDQA addresses must be 16-byte aligned or the instruction will cause a #GP.

The 256-bit VMOVNTDQA addresses must be 32-byte aligned or the instruction will cause a #GP.

The 512-bit VMOVNTDQA addresses must be 64-byte aligned or the instruction will cause a #GP.
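Since a misaligned operand raises #GP, callers sometimes check the address before choosing a streaming-load path. The helper below is a minimal sketch of the three alignment rules stated above; `movntdqa_operand_aligned` is a hypothetical name and the function does not issue the instruction itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if addr satisfies the alignment required for a (V)MOVNTDQA
 * memory operand of operand_bytes bytes (16, 32 or 64), and 0 otherwise.
 * Any other operand size is rejected. */
static int movntdqa_operand_aligned(uintptr_t addr, size_t operand_bytes)
{
    if (operand_bytes != 16 && operand_bytes != 32 && operand_bytes != 64)
        return 0;
    /* The required alignment equals the operand size, a power of two. */
    return (addr & (uintptr_t)(operand_bytes - 1)) == 0;
}
```

A pointer can be checked by casting it first, for example `movntdqa_operand_aligned((uintptr_t)p, 32)` before a 256-bit load.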

Operation

MOVNTDQA (128-bit Legacy SSE form)
DEST ← SRC
DEST[MAXVL-1:128] (Unmodified)

VMOVNTDQA (VEX.128 and EVEX.128 encoded form)
DEST ← SRC
DEST[MAXVL-1:128] ← 0

VMOVNTDQA (VEX.256 and EVEX.256 encoded forms)
DEST[255:0] ← SRC[255:0]
DEST[MAXVL-1:256] ← 0

VMOVNTDQA (EVEX.512 encoded form)
DEST[511:0] ← SRC[511:0]
DEST[MAXVL-1:512] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VMOVNTDQA __m512i _mm512_stream_load_si512(__m512i const* p);

MOVNTDQA __m128i _mm_stream_load_si128 (const __m128i *p);

VMOVNTDQA __m256i _mm256_stream_load_si256 (__m256i const* p);

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 1;

EVEX-encoded instruction, see Exceptions Type E1NF.

#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

MOVNTDQ—Store Packed Integers Using Non-Temporal Hint


Instruction Operand Encoding¹

Description

Moves the packed integers in the source operand (second operand) to the destination operand (first operand) using a non-temporal hint to prevent caching of the data during the write to memory. The source operand is an XMM register, YMM register or ZMM register, which is assumed to contain integer data (packed bytes, words, doublewords, or quadwords). The destination operand is a 128-bit, 256-bit or 512-bit memory location. The memory operand must be aligned on a 16-byte (128-bit version), 32-byte (VEX.256 encoded version) or 64-byte (512-bit version) boundary; otherwise a general-protection exception (#GP) will be generated.

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the

data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it

fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being

written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an

uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see

“Caching of Temporal vs. Non-Temporal Data” in Chapter 10 in the IA-32 Intel Architecture Software Developer’s

Manual, Volume 1.

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with VMOVNTDQ instructions if multiple processors might use different memory types to read/write the destination memory locations.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, VEX.L must be 0; otherwise instructions will

#UD.

Operation

VMOVNTDQ (EVEX encoded versions)
VL = 128, 256, 512
DEST[VL-1:0] ← SRC[VL-1:0]
DEST[MAXVL-1:VL] ← 0

Opcode | Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
66 0F E7 /r | MOVNTDQ m128, xmm1 | A | V/V | SSE2 | Move packed integer values in xmm1 to m128 using non-temporal hint.
VEX.128.66.0F.WIG E7 /r | VMOVNTDQ m128, xmm1 | A | V/V | AVX | Move packed integer values in xmm1 to m128 using non-temporal hint.
VEX.256.66.0F.WIG E7 /r | VMOVNTDQ m256, ymm1 | A | V/V | AVX | Move packed integer values in ymm1 to m256 using non-temporal hint.
EVEX.128.66.0F.W0 E7 /r | VMOVNTDQ m128, xmm1 | B | V/V | AVX512VL AVX512F | Move packed integer values in xmm1 to m128 using non-temporal hint.
EVEX.256.66.0F.W0 E7 /r | VMOVNTDQ m256, ymm1 | B | V/V | AVX512VL AVX512F | Move packed integer values in ymm1 to m256 using non-temporal hint.
EVEX.512.66.0F.W0 E7 /r | VMOVNTDQ m512, zmm1 | B | V/V | AVX512F | Move packed integer values in zmm1 to m512 using non-temporal hint.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:r/m (w) ModRM:reg (r) NA NA

B Full Mem ModRM:r/m (w) ModRM:reg (r) NA NA

1. ModRM.MOD != 011B


MOVNTDQ (Legacy and VEX versions)
DEST ← SRC

Intel C/C++ Compiler Intrinsic Equivalent

VMOVNTDQ void _mm512_stream_si512(void * p, __m512i a);

VMOVNTDQ void _mm256_stream_si256 (__m256i * p, __m256i a);

MOVNTDQ void _mm_stream_si128 (__m128i * p, __m128i a);

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 1.SSE2;

EVEX-encoded instruction, see Exceptions Type E1NF.

#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

MOVNTI—Store Doubleword Using Non-Temporal Hint


Instruction Operand Encoding

Description

Moves the doubleword integer in the source operand (second operand) to the destination operand (first operand)

using a non-temporal hint to minimize cache pollution during the write to memory. The source operand is a

general-purpose register. The destination operand is a 32-bit memory location.

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the

data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it

fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being

written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an

uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see

“Caching of Temporal vs. Non-Temporal Data” in Chapter 10 in the Intel® 64 and IA-32 Architectures Software

Developer’s Manual, Volume 1.

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with

the SFENCE or MFENCE instruction should be used in conjunction with MOVNTI instructions if multiple processors

might use different memory types to read/write the destination memory locations.

In 64-bit mode, the instruction’s default operation size is 32 bits. Use of the REX.R prefix permits access to additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.
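To make the role of the REX bits concrete, the sketch below assembles the bytes of `0F C3 /r` for the simplest addressing form, a destination of `[base]` with no displacement, in 64-bit mode. It is an illustrative encoder written under several assumptions rather than a complete assembler: register numbers follow the usual 0-15 numbering (0 = RAX/EAX, 7 = RDI/EDI, 8 = R8), REX.B is used to extend the base register (a general REX rule not spelled out in this section), and base registers that require a SIB byte or a displacement are rejected. `encode_movnti` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes MOVNTI [base], src with a plain register-indirect destination.
 * Returns the number of bytes written to out (3 or 4), or 0 if the
 * operands are out of range or base is RSP, RBP, R12 or R13, whose
 * ModRM.r/m codes select SIB or displacement forms not handled here. */
static size_t encode_movnti(uint8_t out[4], unsigned base, unsigned src,
                            int is_64bit)
{
    if (base > 15 || src > 15)
        return 0;
    if ((base & 7) == 4 || (base & 7) == 5)
        return 0;

    size_t n = 0;
    uint8_t rex = 0x40;
    if (is_64bit) rex |= 0x08;   /* REX.W: promote operation to 64 bits */
    if (src & 8)  rex |= 0x04;   /* REX.R: extends ModRM.reg (source)   */
    if (base & 8) rex |= 0x01;   /* REX.B: extends ModRM.r/m (base)     */
    if (rex != 0x40)
        out[n++] = rex;          /* emit REX only when a bit is needed  */

    out[n++] = 0x0F;
    out[n++] = 0xC3;
    /* ModRM: mod = 00 (memory, no displacement), reg = src, r/m = base */
    out[n++] = (uint8_t)(((src & 7) << 3) | (base & 7));
    return n;
}
```

Under these assumptions, storing EAX to `[RDI]` yields the three bytes `0F C3 07`, and the 64-bit form storing RAX adds a `48` prefix in front.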

Operation

DEST ← SRC;

Intel C/C++ Compiler Intrinsic Equivalent

MOVNTI: void _mm_stream_si32 (int *p, int a)

MOVNTI: void _mm_stream_si64(__int64 *p, __int64 a)

SIMD Floating-Point Exceptions

None.

Protected Mode Exceptions

#GP(0) For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments.

#SS(0) For an illegal address in the SS segment.

#PF(fault-code) For a page fault.

#UD If CPUID.01H:EDX.SSE2[bit 26] = 0.

If the LOCK prefix is used.

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
NP 0F C3 /r | MOVNTI m32, r32 | MR | Valid | Valid | Move doubleword from r32 to m32 using non-temporal hint.
NP REX.W + 0F C3 /r | MOVNTI m64, r64 | MR | Valid | N.E. | Move quadword from r64 to m64 using non-temporal hint.

Op/En Operand 1 Operand 2 Operand 3 Operand 4

MR ModRM:r/m (w) ModRM:reg (r) NA NA


Real-Address Mode Exceptions

#GP If any part of the operand lies outside the effective address space from 0 to FFFFH.

#UD If CPUID.01H:EDX.SSE2[bit 26] = 0.

If the LOCK prefix is used.

Virtual-8086 Mode Exceptions

Same exceptions as in real address mode.

#PF(fault-code) For a page fault.

Compatibility Mode Exceptions

Same exceptions as in protected mode.

64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the memory address is in a non-canonical form.

#PF(fault-code) For a page fault.

#UD If CPUID.01H:EDX.SSE2[bit 26] = 0.

If the LOCK prefix is used.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the

current privilege level is 3.

MOVNTPD—Store Packed Double-Precision Floating-Point Values Using Non-Temporal Hint


Instruction Operand Encoding¹

Description

Moves the packed double-precision floating-point values in the source operand (second operand) to the destination operand (first operand) using a non-temporal hint to prevent caching of the data during the write to memory. The source operand is an XMM register, YMM register or ZMM register, which is assumed to contain packed double-precision floating-point data. The destination operand is a 128-bit, 256-bit or 512-bit memory location. The memory operand must be aligned on a 16-byte (128-bit version), 32-byte (VEX.256 encoded version) or 64-byte (EVEX.512 encoded version) boundary; otherwise a general-protection exception (#GP) will be generated.

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the

data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it

fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being

written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an

uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see

“Caching of Temporal vs. Non-Temporal Data” in Chapter 10 in the IA-32 Intel Architecture Software Developer’s

Manual, Volume 1.

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with

the SFENCE or MFENCE instruction should be used in conjunction with MOVNTPD instructions if multiple processors

might use different memory types to read/write the destination memory locations.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, VEX.L must be 0; otherwise instructions will

#UD.
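The alignment and fencing requirements described above can be illustrated with a short sketch. The helper name stream_copy_pd, the even-count restriction, and the portable fallback path are assumptions made for this example; only _mm_stream_pd, _mm_loadu_pd, and _mm_sfence are real intrinsics. This is a minimal illustration, not a tuned copy routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_loadu_pd, _mm_stream_pd, _mm_sfence */
#endif

/* Hypothetical helper: copy n doubles (n even) into a 16-byte-aligned
   destination, using MOVNTPD-style non-temporal stores when SSE2 is
   available at compile time. */
static void stream_copy_pd(double *dst, const double *src, size_t n)
{
    /* The 128-bit form requires a 16-byte-aligned memory operand. */
    assert(((uintptr_t)dst % 16) == 0);
    assert(n % 2 == 0);
#if defined(__SSE2__)
    for (size_t i = 0; i < n; i += 2) {
        __m128d v = _mm_loadu_pd(src + i);  /* ordinary load, no alignment needed */
        _mm_stream_pd(dst + i, v);          /* non-temporal store of two doubles */
    }
    _mm_sfence();  /* order the weakly-ordered stores before later stores */
#else
    for (size_t i = 0; i < n; i++)          /* assumed fallback: plain stores */
        dst[i] = src[i];
#endif
}
```

A caller would typically use this for large buffers that will not be read again soon; for small or soon-reused data, ordinary stores are usually the better choice.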

Operation

VMOVNTPD (EVEX encoded versions)
VL = 128, 256, 512
DEST[VL-1:0] ← SRC[VL-1:0]
DEST[MAXVL-1:VL] ← 0

Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
66 0F 2B /r MOVNTPD m128, xmm1 | A | V/V | SSE2 | Move packed double-precision values in xmm1 to m128 using non-temporal hint.
VEX.128.66.0F.WIG 2B /r VMOVNTPD m128, xmm1 | A | V/V | AVX | Move packed double-precision values in xmm1 to m128 using non-temporal hint.
VEX.256.66.0F.WIG 2B /r VMOVNTPD m256, ymm1 | A | V/V | AVX | Move packed double-precision values in ymm1 to m256 using non-temporal hint.
EVEX.128.66.0F.W1 2B /r VMOVNTPD m128, xmm1 | B | V/V | AVX512VL AVX512F | Move packed double-precision values in xmm1 to m128 using non-temporal hint.
EVEX.256.66.0F.W1 2B /r VMOVNTPD m256, ymm1 | B | V/V | AVX512VL AVX512F | Move packed double-precision values in ymm1 to m256 using non-temporal hint.
EVEX.512.66.0F.W1 2B /r VMOVNTPD m512, zmm1 | B | V/V | AVX512F | Move packed double-precision values in zmm1 to m512 using non-temporal hint.

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
B | Full Mem | ModRM:r/m (w) | ModRM:reg (r) | NA | NA

1. ModRM.MOD != 011B


MOVNTPD (Legacy and VEX versions)
DEST ← SRC

Intel C/C++ Compiler Intrinsic Equivalent

VMOVNTPD void _mm512_stream_pd(double * p, __m512d a);

VMOVNTPD void _mm256_stream_pd (double * p, __m256d a);

MOVNTPD void _mm_stream_pd (double * p, __m128d a);

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type1.SSE2;

EVEX-encoded instruction, see Exceptions Type E1NF.

#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.


MOVNTPS—Store Packed Single-Precision Floating-Point Values Using Non-Temporal Hint

Instruction Operand Encoding1

Description

Moves the packed single-precision floating-point values in the source operand (second operand) to the destination operand (first operand) using a non-temporal hint to prevent caching of the data during the write to memory. The source operand is an XMM register, YMM register or ZMM register, which is assumed to contain packed single-precision floating-point data. The destination operand is a 128-bit, 256-bit or 512-bit memory location. The memory operand must be aligned on a 16-byte (128-bit version), 32-byte (VEX.256 encoded version) or 64-byte (EVEX.512 encoded version) boundary; otherwise a general-protection exception (#GP) will be generated.

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see “Caching of Temporal vs. Non-Temporal Data” in Chapter 10 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MOVNTPS instructions if multiple processors might use different memory types to read/write the destination memory locations.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise instructions will #UD.

Operation

VMOVNTPS (EVEX encoded versions)
VL = 128, 256, 512
DEST[VL-1:0] ← SRC[VL-1:0]
DEST[MAXVL-1:VL] ← 0

Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
NP 0F 2B /r MOVNTPS m128, xmm1 | A | V/V | SSE | Move packed single-precision values xmm1 to mem using non-temporal hint.
VEX.128.0F.WIG 2B /r VMOVNTPS m128, xmm1 | A | V/V | AVX | Move packed single-precision values xmm1 to mem using non-temporal hint.
VEX.256.0F.WIG 2B /r VMOVNTPS m256, ymm1 | A | V/V | AVX | Move packed single-precision values ymm1 to mem using non-temporal hint.
EVEX.128.0F.W0 2B /r VMOVNTPS m128, xmm1 | B | V/V | AVX512VL AVX512F | Move packed single-precision values in xmm1 to m128 using non-temporal hint.
EVEX.256.0F.W0 2B /r VMOVNTPS m256, ymm1 | B | V/V | AVX512VL AVX512F | Move packed single-precision values in ymm1 to m256 using non-temporal hint.
EVEX.512.0F.W0 2B /r VMOVNTPS m512, zmm1 | B | V/V | AVX512F | Move packed single-precision values in zmm1 to m512 using non-temporal hint.

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
B | Full Mem | ModRM:r/m (w) | ModRM:reg (r) | NA | NA

1. ModRM.MOD != 011B


MOVNTPS
DEST ← SRC

Intel C/C++ Compiler Intrinsic Equivalent

VMOVNTPS void _mm512_stream_ps(float * p, __m512 a);

MOVNTPS void _mm_stream_ps (float * p, __m128 a);

VMOVNTPS void _mm256_stream_ps (float * p, __m256 a);

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type1.SSE; additionally

EVEX-encoded instruction, see Exceptions Type E1NF.

#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.


MOVNTQ—Store of Quadword Using Non-Temporal Hint

Instruction Operand Encoding

Description

Moves the quadword in the source operand (second operand) to the destination operand (first operand) using a non-temporal hint to minimize cache pollution during the write to memory. The source operand is an MMX technology register, which is assumed to contain packed integer data (packed bytes, words, or doublewords). The destination operand is a 64-bit memory location.

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see “Caching of Temporal vs. Non-Temporal Data” in Chapter 10 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MOVNTQ instructions if multiple processors might use different memory types to read/write the destination memory locations.

This instruction’s operation is the same in non-64-bit modes and 64-bit mode.

Operation

DEST ← SRC;

Intel C/C++ Compiler Intrinsic Equivalent

MOVNTQ: void _mm_stream_pi(__m64 * p, __m64 a)

SIMD Floating-Point Exceptions

None.

Other Exceptions

See Table 22-8, “Exception Conditions for Legacy SIMD/MMX Instructions without FP Exception,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
NP 0F E7 /r | MOVNTQ m64, mm | MR | Valid | Valid | Move quadword from mm to m64 using non-temporal hint.

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA


MOVQ—Move Quadword

Instruction Operand Encoding

Description

Copies a quadword from the source operand (second operand) to the destination operand (first operand). The source and destination operands can be MMX technology registers, XMM registers, or 64-bit memory locations. This instruction can be used to move a quadword between two MMX technology registers or between an MMX technology register and a 64-bit memory location, or to move data between two XMM registers or between an XMM register and a 64-bit memory location. The instruction cannot be used to transfer data between memory locations.

When the source operand is an XMM register, the low quadword is moved; when the destination operand is an XMM register, the quadword is stored to the low quadword of the register, and the high quadword is cleared to all 0s.

In 64-bit mode and if not encoded using VEX/EVEX, use of the REX prefix in the form of REX.R permits this instruction to access additional registers (XMM8-XMM15).

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

If VMOVQ is encoded with VEX.L = 1, an attempt to execute the instruction will cause a #UD exception.

Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
NP 0F 6F /r MOVQ mm, mm/m64 | A | V/V | MMX | Move quadword from mm/m64 to mm.
NP 0F 7F /r MOVQ mm/m64, mm | B | V/V | MMX | Move quadword from mm to mm/m64.
F3 0F 7E /r MOVQ xmm1, xmm2/m64 | A | V/V | SSE2 | Move quadword from xmm2/mem64 to xmm1.
VEX.128.F3.0F.WIG 7E /r VMOVQ xmm1, xmm2/m64 | A | V/V | AVX | Move quadword from xmm2 to xmm1.
EVEX.128.F3.0F.W1 7E /r VMOVQ xmm1, xmm2/m64 | C | V/V | AVX512F | Move quadword from xmm2/m64 to xmm1.
66 0F D6 /r MOVQ xmm2/m64, xmm1 | B | V/V | SSE2 | Move quadword from xmm1 to xmm2/mem64.
VEX.128.66.0F.WIG D6 /r VMOVQ xmm1/m64, xmm2 | B | V/V | AVX | Move quadword from xmm2 register to xmm1/m64.
EVEX.128.66.0F.W1 D6 /r VMOVQ xmm1/m64, xmm2 | D | V/V | AVX512F | Move quadword from xmm2 register to xmm1/m64.

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
C | Tuple1 Scalar | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
D | Tuple1 Scalar | ModRM:r/m (w) | ModRM:reg (r) | NA | NA


Operation

MOVQ instruction when operating on MMX technology registers and memory locations

DEST ← SRC;

MOVQ instruction when source and destination operands are XMM registers

DEST[63:0] ← SRC[63:0];

DEST[127:64] ← 0000000000000000H;

MOVQ instruction when source operand is XMM register and destination

operand is memory location:

DEST ← SRC[63:0];

MOVQ instruction when source operand is memory location and destination

operand is XMM register:

DEST[63:0] ← SRC;

DEST[127:64] ← 0000000000000000H;

VMOVQ (VEX.128.F3.0F 7E) with XMM register source and destination

DEST[63:0] ← SRC[63:0]

DEST[MAXVL-1:64] ← 0

VMOVQ (VEX.128.66.0F D6) with XMM register source and destination

DEST[63:0] ← SRC[63:0]

DEST[MAXVL-1:64] ← 0

VMOVQ (7E - EVEX encoded version) with XMM register source and destination
DEST[63:0] ← SRC[63:0]
DEST[MAXVL-1:64] ← 0

VMOVQ (D6 - EVEX encoded version) with XMM register source and destination
DEST[63:0] ← SRC[63:0]
DEST[MAXVL-1:64] ← 0

VMOVQ (7E) with memory source
DEST[63:0] ← SRC[63:0]
DEST[MAXVL-1:64] ← 0

VMOVQ (7E - EVEX encoded version) with memory source
DEST[63:0] ← SRC[63:0]
DEST[MAXVL-1:64] ← 0

VMOVQ (D6) with memory dest

DEST[63:0] ← SRC2[63:0]
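The register and memory cases in the Operation listing can be mirrored in a small software model. The type xmm_model and the functions movq_load and movq_store are hypothetical names introduced for this sketch; it models only the legacy 128-bit behavior (low quadword copied, bits 127:64 cleared) and says nothing about bits above 127.

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit XMM register as two 64-bit lanes. */
typedef struct {
    uint64_t lo;  /* bits 63:0   */
    uint64_t hi;  /* bits 127:64 */
} xmm_model;

/* Models MOVQ xmm1, xmm2/m64: the low quadword is copied and
   bits 127:64 of the destination are cleared. */
static xmm_model movq_load(uint64_t src_low)
{
    xmm_model dest;
    dest.lo = src_low;
    dest.hi = 0;
    return dest;
}

/* Models MOVQ m64, xmm1: only the low quadword reaches memory. */
static uint64_t movq_store(xmm_model src)
{
    return src.lo;
}
```

For a register-to-register move, passing the source register's lo field to movq_load reproduces the DEST[127:64] ← 0 step shown above.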

Flags Affected

None.

Intel C/C++ Compiler Intrinsic Equivalent

VMOVQ __m128i _mm_loadu_si64( void * s);

VMOVQ void _mm_storeu_si64( void * d, __m128i s);

MOVQ __m128i _mm_move_epi64(__m128i a)


SIMD Floating-Point Exceptions

None

Other Exceptions

See Table 22-8, “Exception Conditions for Legacy SIMD/MMX Instructions without FP Exception,” in the Intel® 64

and IA-32 Architectures Software Developer’s Manual, Volume 3B.


MOVQ2DQ—Move Quadword from MMX Technology to XMM Register

Instruction Operand Encoding

Description

Moves the quadword from the source operand (second operand) to the low quadword of the destination operand (first operand). The source operand is an MMX technology register and the destination operand is an XMM register.

This instruction causes a transition from x87 FPU to MMX technology operation (that is, the x87 FPU top-of-stack pointer is set to 0 and the x87 FPU tag word is set to all 0s [valid]). If this instruction is executed while an x87 FPU floating-point exception is pending, the exception is handled before the MOVQ2DQ instruction is executed.

In 64-bit mode, use of the REX.R prefix permits this instruction to access additional registers (XMM8-XMM15).

Operation

DEST[63:0] ← SRC[63:0];

DEST[127:64] ← 0000000000000000H;

Intel C/C++ Compiler Intrinsic Equivalent

MOVQ2DQ: __m128i _mm_movpi64_epi64 ( __m64 a)

SIMD Floating-Point Exceptions

None.

Protected Mode Exceptions

#NM If CR0.TS[bit 3] = 1.

#UD If CR0.EM[bit 2] = 1.

If CR4.OSFXSR[bit 9] = 0.

If CPUID.01H:EDX.SSE2[bit 26] = 0.

If the LOCK prefix is used.

#MF If there is a pending x87 FPU exception.

Real-Address Mode Exceptions

Same exceptions as in protected mode.

Virtual-8086 Mode Exceptions

Same exceptions as in protected mode.

Compatibility Mode Exceptions

Same exceptions as in protected mode.

64-Bit Mode Exceptions

Same exceptions as in protected mode.

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
F3 0F D6 /r | MOVQ2DQ xmm, mm | RM | Valid | Valid | Move quadword from mmx to low quadword of xmm.

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA


MOVS/MOVSB/MOVSW/MOVSD/MOVSQ—Move Data from String to String

Instruction Operand Encoding

Description

Moves the byte, word, or doubleword specified with the second operand (source operand) to the location specified with the first operand (destination operand). Both the source and destination operands are located in memory. The address of the source operand is read from the DS:ESI or the DS:SI registers (depending on the address-size attribute of the instruction, 32 or 16, respectively). The address of the destination operand is read from the ES:EDI or the ES:DI registers (again depending on the address-size attribute of the instruction). The DS segment may be overridden with a segment override prefix, but the ES segment cannot be overridden.

At the assembly-code level, two forms of this instruction are allowed: the “explicit-operands” form and the “no-operands” form. The explicit-operands form (specified with the MOVS mnemonic) allows the source and destination operands to be specified explicitly. Here, the source and destination operands should be symbols that indicate the size and location of the source value and the destination, respectively. This explicit-operands form is provided to allow documentation; however, note that the documentation provided by this form can be misleading. That is, the source and destination operand symbols must specify the correct type (size) of the operands (bytes, words, or doublewords), but they do not have to specify the correct location. The locations of the source and destination operands are always specified by the DS:(E)SI and ES:(E)DI registers, which must be loaded correctly before the move string instruction is executed.

The no-operands form provides “short forms” of the byte, word, and doubleword versions of the MOVS instructions. Here also DS:(E)SI and ES:(E)DI are assumed to be the source and destination operands, respectively. The size of the source and destination operands is selected with the mnemonic: MOVSB (byte move), MOVSW (word move), or MOVSD (doubleword move).

After the move operation, the (E)SI and (E)DI registers are incremented or decremented automatically according to the setting of the DF flag in the EFLAGS register. (If the DF flag is 0, the (E)SI and (E)DI registers are incremented; if the DF flag is 1, the (E)SI and (E)DI registers are decremented.) The registers are incremented or decremented by 1 for byte operations, by 2 for word operations, or by 4 for doubleword operations.

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
A4 | MOVS m8, m8 | ZO | Valid | Valid | For legacy mode, move byte from address DS:(E)SI to ES:(E)DI. For 64-bit mode move byte from address (R|E)SI to (R|E)DI.
A5 | MOVS m16, m16 | ZO | Valid | Valid | For legacy mode, move word from address DS:(E)SI to ES:(E)DI. For 64-bit mode move word at address (R|E)SI to (R|E)DI.
A5 | MOVS m32, m32 | ZO | Valid | Valid | For legacy mode, move dword from address DS:(E)SI to ES:(E)DI. For 64-bit mode move dword from address (R|E)SI to (R|E)DI.
REX.W + A5 | MOVS m64, m64 | ZO | Valid | N.E. | Move qword from address (R|E)SI to (R|E)DI.
A4 | MOVSB | ZO | Valid | Valid | For legacy mode, move byte from address DS:(E)SI to ES:(E)DI. For 64-bit mode move byte from address (R|E)SI to (R|E)DI.
A5 | MOVSW | ZO | Valid | Valid | For legacy mode, move word from address DS:(E)SI to ES:(E)DI. For 64-bit mode move word at address (R|E)SI to (R|E)DI.
A5 | MOVSD | ZO | Valid | Valid | For legacy mode, move dword from address DS:(E)SI to ES:(E)DI. For 64-bit mode move dword from address (R|E)SI to (R|E)DI.
REX.W + A5 | MOVSQ | ZO | Valid | N.E. | Move qword from address (R|E)SI to (R|E)DI.

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
ZO | NA | NA | NA | NA

NOTE

To improve performance, more recent processors support modifications to the processor’s operation during the string store operations initiated with MOVS and MOVSB. See Section 7.3.9.3 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 for additional information on fast-string operation.

The MOVS, MOVSB, MOVSW, and MOVSD instructions can be preceded by the REP prefix (see “REP/REPE/REPZ/REPNE/REPNZ—Repeat String Operation Prefix” for a description of the REP prefix) for block moves of ECX bytes, words, or doublewords.

In 64-bit mode, the instruction’s default address size is 64 bits; a 32-bit address size is supported using the prefix 67H. The 64-bit addresses are specified by RSI and RDI; 32-bit addresses are specified by ESI and EDI. Use of the REX.W prefix promotes doubleword operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.

Operation

DEST ← SRC;

Non-64-bit Mode:
IF (Byte move)
    THEN IF DF = 0
        THEN
            (E)SI ← (E)SI + 1;
            (E)DI ← (E)DI + 1;
        ELSE
            (E)SI ← (E)SI - 1;
            (E)DI ← (E)DI - 1;
        FI;
ELSE IF (Word move)
    THEN IF DF = 0
        THEN
            (E)SI ← (E)SI + 2;
            (E)DI ← (E)DI + 2;
        ELSE
            (E)SI ← (E)SI - 2;
            (E)DI ← (E)DI - 2;
        FI;
ELSE IF (Doubleword move)
    THEN IF DF = 0
        THEN
            (E)SI ← (E)SI + 4;
            (E)DI ← (E)DI + 4;
        ELSE
            (E)SI ← (E)SI - 4;
            (E)DI ← (E)DI - 4;
        FI;
FI;

64-bit Mode:
IF (Byte move)
    THEN IF DF = 0
        THEN
            (R|E)SI ← (R|E)SI + 1;
            (R|E)DI ← (R|E)DI + 1;
        ELSE
            (R|E)SI ← (R|E)SI - 1;
            (R|E)DI ← (R|E)DI - 1;
        FI;
ELSE IF (Word move)
    THEN IF DF = 0
        THEN
            (R|E)SI ← (R|E)SI + 2;
            (R|E)DI ← (R|E)DI + 2;
        ELSE
            (R|E)SI ← (R|E)SI - 2;
            (R|E)DI ← (R|E)DI - 2;
        FI;
ELSE IF (Doubleword move)
    THEN IF DF = 0
        THEN
            (R|E)SI ← (R|E)SI + 4;
            (R|E)DI ← (R|E)DI + 4;
        ELSE
            (R|E)SI ← (R|E)SI - 4;
            (R|E)DI ← (R|E)DI - 4;
        FI;
ELSE IF (Quadword move)
    THEN IF DF = 0
        THEN
            (R|E)SI ← (R|E)SI + 8;
            (R|E)DI ← (R|E)DI + 8;
        ELSE
            (R|E)SI ← (R|E)SI - 8;
            (R|E)DI ← (R|E)DI - 8;
        FI;
FI;
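The index updates above, combined with a REP prefix, can be sketched as a small software model. The structure movs_state and the function rep_movs are hypothetical names for this example, and the model assumes a flat byte array in place of segmented or paged memory; it ignores faults, address-size wraparound, and fast-string behavior.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the state used by REP MOVS in a flat address space. */
typedef struct {
    uint8_t *mem;  /* flat memory image */
    size_t   si;   /* source index, models (R|E)SI */
    size_t   di;   /* destination index, models (R|E)DI */
    size_t   cx;   /* repeat count, models (R|E)CX */
    int      df;   /* direction flag: 0 = increment, 1 = decrement */
} movs_state;

/* Models REP MOVS with an element width of 1, 2, 4, or 8 bytes. */
static void rep_movs(movs_state *s, size_t width)
{
    while (s->cx != 0) {
        /* DEST <- SRC for one element */
        memmove(s->mem + s->di, s->mem + s->si, width);
        if (s->df == 0) {          /* DF = 0: step forward */
            s->si += width;
            s->di += width;
        } else {                   /* DF = 1: step backward */
            s->si -= width;
            s->di -= width;
        }
        s->cx--;
    }
}
```

With df set to 1 the indexes must start at the last element of each block, which is why backward copies are normally set up with the highest address rather than the lowest.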

Flags Affected

None

Protected Mode Exceptions

#GP(0) If the destination is located in a non-writable segment.

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register contains a NULL segment selector.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used.

Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If the LOCK prefix is used.


Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made.

#UD If the LOCK prefix is used.

Compatibility Mode Exceptions

Same exceptions as in protected mode.

64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the memory address is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used.


MOVSD—Move or Merge Scalar Double-Precision Floating-Point Value

Instruction Operand Encoding

Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
F2 0F 10 /r MOVSD xmm1, xmm2 | A | V/V | SSE2 | Move scalar double-precision floating-point value from xmm2 to xmm1 register.
F2 0F 10 /r MOVSD xmm1, m64 | A | V/V | SSE2 | Load scalar double-precision floating-point value from m64 to xmm1 register.
F2 0F 11 /r MOVSD xmm1/m64, xmm2 | C | V/V | SSE2 | Move scalar double-precision floating-point value from xmm2 register to xmm1/m64.
VEX.LIG.F2.0F.WIG 10 /r VMOVSD xmm1, xmm2, xmm3 | B | V/V | AVX | Merge scalar double-precision floating-point value from xmm2 and xmm3 to xmm1 register.
VEX.LIG.F2.0F.WIG 10 /r VMOVSD xmm1, m64 | D | V/V | AVX | Load scalar double-precision floating-point value from m64 to xmm1 register.
VEX.LIG.F2.0F.WIG 11 /r VMOVSD xmm1, xmm2, xmm3 | E | V/V | AVX | Merge scalar double-precision floating-point value from xmm2 and xmm3 registers to xmm1.
VEX.LIG.F2.0F.WIG 11 /r VMOVSD m64, xmm1 | C | V/V | AVX | Store scalar double-precision floating-point value from xmm1 register to m64.
EVEX.LIG.F2.0F.W1 10 /r VMOVSD xmm1 {k1}{z}, xmm2, xmm3 | B | V/V | AVX512F | Merge scalar double-precision floating-point value from xmm2 and xmm3 registers to xmm1 under writemask k1.
EVEX.LIG.F2.0F.W1 10 /r VMOVSD xmm1 {k1}{z}, m64 | F | V/V | AVX512F | Load scalar double-precision floating-point value from m64 to xmm1 register under writemask k1.
EVEX.LIG.F2.0F.W1 11 /r VMOVSD xmm1 {k1}{z}, xmm2, xmm3 | E | V/V | AVX512F | Merge scalar double-precision floating-point value from xmm2 and xmm3 registers to xmm1 under writemask k1.
EVEX.LIG.F2.0F.W1 11 /r VMOVSD m64 {k1}, xmm1 | G | V/V | AVX512F | Store scalar double-precision floating-point value from xmm1 register to m64 under writemask k1.

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
C | NA | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
D | NA | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
E | NA | ModRM:r/m (w) | vvvv (r) | ModRM:reg (r) | NA
F | Tuple1 Scalar | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
G | Tuple1 Scalar | ModRM:r/m (w) | ModRM:reg (r) | NA | NA


Description

Moves a scalar double-precision floating-point value from the source operand (second operand) to the destination operand (first operand). The source and destination operands can be XMM registers or 64-bit memory locations. This instruction can be used to move a double-precision floating-point value to and from the low quadword of an XMM register and a 64-bit memory location, or to move a double-precision floating-point value between the low quadwords of two XMM registers. The instruction cannot be used to transfer data between memory locations.

Legacy version: When the source and destination operands are XMM registers, bits (MAXVL-1:64) of the destination operand remain unchanged. When the source operand is a memory location and the destination operand is an XMM register, the quadword at bits 127:64 of the destination operand is cleared to all 0s, and bits (MAXVL-1:128) of the destination operand remain unchanged.

VEX and EVEX encoded register-register syntax: Moves a scalar double-precision floating-point value from the second source operand (the third operand) to the low quadword element of the destination operand (the first operand). Bits 127:64 of the destination operand are copied from the first source operand (the second operand). Bits (MAXVL-1:128) of the corresponding destination register are zeroed.

VEX and EVEX encoded memory load syntax: When the source operand is a memory location and the destination operand is an XMM register, bits (MAXVL-1:64) of the destination operand are cleared to all 0s.

EVEX encoded versions: The low quadword of the destination is updated according to the writemask.

Note: For VMOVSD (memory store and load forms), VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise the instruction will #UD.

Operation

VMOVSD (EVEX.LIG.F2.0F 10 /r: VMOVSD xmm1, m64 with support for 32 registers)
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← SRC[63:0]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE DEST[63:0] ← 0 ; zeroing-masking
        FI;
FI;
DEST[MAXVL-1:64] ← 0

VMOVSD (EVEX.LIG.F2.0F 11 /r: VMOVSD m64, xmm1 with support for 32 registers)
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← SRC[63:0]
    ELSE *DEST[63:0] remains unchanged* ; merging-masking
FI;

VMOVSD (EVEX.LIG.F2.0F 11 /r: VMOVSD xmm1, xmm2, xmm3)
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← SRC2[63:0]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE DEST[63:0] ← 0 ; zeroing-masking
        FI;
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0


MOVSD (128-bit Legacy SSE version: MOVSD XMM1, XMM2)
DEST[63:0] ← SRC[63:0]
DEST[MAXVL-1:64] (Unmodified)

VMOVSD (VEX.128.F2.0F 11 /r: VMOVSD xmm1, xmm2, xmm3)
DEST[63:0] ← SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0

VMOVSD (VEX.128.F2.0F 10 /r: VMOVSD xmm1, xmm2, xmm3)
DEST[63:0] ← SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0

VMOVSD (VEX.128.F2.0F 10 /r: VMOVSD xmm1, m64)
DEST[63:0] ← SRC[63:0]
DEST[MAXVL-1:64] ← 0

MOVSD/VMOVSD (128-bit versions: MOVSD m64, xmm1 or VMOVSD m64, xmm1)
DEST[63:0] ← SRC[63:0]

MOVSD (128-bit Legacy SSE version: MOVSD XMM1, m64)
DEST[63:0] ← SRC[63:0]
DEST[127:64] ← 0
DEST[MAXVL-1:128] (Unmodified)
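The masked register-register merge can be expressed as a short software model. The type xmm_pd and the function vmovsd_rr are hypothetical names for this sketch; the model covers only bits 127:0 of the EVEX register-register form and treats the "no writemask" case as k1_bit0 = 1.

```c
/* Hypothetical model of the low 128 bits of an XMM register
   holding two double-precision values. */
typedef struct {
    double lo;  /* bits 63:0   */
    double hi;  /* bits 127:64 */
} xmm_pd;

/* Models VMOVSD xmm1 {k1}{z}, xmm2, xmm3.
   dest    : previous value of xmm1 (used only under merging-masking)
   src1    : xmm2, supplies bits 127:64
   src2    : xmm3, supplies bits 63:0 when the mask bit is set
   k1_bit0 : bit 0 of the writemask (1 when no writemask is used)
   zeroing : nonzero for zeroing-masking, zero for merging-masking */
static xmm_pd vmovsd_rr(xmm_pd dest, xmm_pd src1, xmm_pd src2,
                        int k1_bit0, int zeroing)
{
    xmm_pd r;
    if (k1_bit0)
        r.lo = src2.lo;      /* DEST[63:0] <- SRC2[63:0] */
    else if (zeroing)
        r.lo = 0.0;          /* zeroing-masking */
    else
        r.lo = dest.lo;      /* merging-masking: low quadword unchanged */
    r.hi = src1.hi;          /* DEST[127:64] <- SRC1[127:64] */
    return r;
}
```

The upper quadword always comes from the first source regardless of the mask, which is what distinguishes this form from the legacy register-register MOVSD.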

Intel C/C++ Compiler Intrinsic Equivalent

VMOVSD __m128d _mm_mask_load_sd(__m128d s, __mmask8 k, double * p);

VMOVSD __m128d _mm_maskz_load_sd( __mmask8 k, double * p);

VMOVSD __m128d _mm_mask_move_sd(__m128d sh, __mmask8 k, __m128d sl, __m128d a);

VMOVSD __m128d _mm_maskz_move_sd( __mmask8 k, __m128d s, __m128d a);

VMOVSD void _mm_mask_store_sd(double * p, __mmask8 k, __m128d s);

MOVSD __m128d _mm_load_sd (double *p)

MOVSD void _mm_store_sd (double *p, __m128d a)

MOVSD __m128d _mm_move_sd ( __m128d a, __m128d b)

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 5; additionally

#UD If VEX.vvvv != 1111B.

EVEX-encoded instruction, see Exceptions Type E10.


MOVSHDUP—Replicate Single FP Values

Instruction Operand Encoding

Description

Duplicates odd-indexed single-precision floating-point values from the source operand (the second operand) to adjacent element pairs in the destination operand (the first operand). See Figure 4-3. The source operand is an XMM, YMM or ZMM register or 128, 256 or 512-bit memory location and the destination operand is an XMM, YMM or ZMM register.

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.

VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed.

VEX.256 encoded version: Bits (MAXVL-1:256) of the destination register are zeroed.

EVEX encoded version: The destination operand is updated at 32-bit granularity according to the writemask.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise instructions will #UD.

Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
F3 0F 16 /r MOVSHDUP xmm1, xmm2/m128 | A | V/V | SSE3 | Move odd index single-precision floating-point values from xmm2/mem and duplicate each element into xmm1.
VEX.128.F3.0F.WIG 16 /r VMOVSHDUP xmm1, xmm2/m128 | A | V/V | AVX | Move odd index single-precision floating-point values from xmm2/mem and duplicate each element into xmm1.
VEX.256.F3.0F.WIG 16 /r VMOVSHDUP ymm1, ymm2/m256 | A | V/V | AVX | Move odd index single-precision floating-point values from ymm2/mem and duplicate each element into ymm1.
EVEX.128.F3.0F.W0 16 /r VMOVSHDUP xmm1 {k1}{z}, xmm2/m128 | B | V/V | AVX512VL AVX512F | Move odd index single-precision floating-point values from xmm2/m128 and duplicate each element into xmm1 under writemask.
EVEX.256.F3.0F.W0 16 /r VMOVSHDUP ymm1 {k1}{z}, ymm2/m256 | B | V/V | AVX512VL AVX512F | Move odd index single-precision floating-point values from ymm2/m256 and duplicate each element into ymm1 under writemask.
EVEX.512.F3.0F.W0 16 /r VMOVSHDUP zmm1 {k1}{z}, zmm2/m512 | B | V/V | AVX512F | Move odd index single-precision floating-point values from zmm2/m512 and duplicate each element into zmm1 under writemask.

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
B | Full Mem | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Figure 4-3. MOVSHDUP Operation
SRC:  X7 X6 X5 X4 X3 X2 X1 X0
DEST: X7 X7 X5 X5 X3 X3 X1 X1


Operation

VMOVSHDUP (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
TMP_SRC[31:0] ← SRC[63:32]
TMP_SRC[63:32] ← SRC[63:32]
TMP_SRC[95:64] ← SRC[127:96]
TMP_SRC[127:96] ← SRC[127:96]
IF VL >= 256
    TMP_SRC[159:128] ← SRC[191:160]
    TMP_SRC[191:160] ← SRC[191:160]
    TMP_SRC[223:192] ← SRC[255:224]
    TMP_SRC[255:224] ← SRC[255:224]
FI;
IF VL >= 512
    TMP_SRC[287:256] ← SRC[319:288]
    TMP_SRC[319:288] ← SRC[319:288]
    TMP_SRC[351:320] ← SRC[383:352]
    TMP_SRC[383:352] ← SRC[383:352]
    TMP_SRC[415:384] ← SRC[447:416]
    TMP_SRC[447:416] ← SRC[447:416]
    TMP_SRC[479:448] ← SRC[511:480]
    TMP_SRC[511:480] ← SRC[511:480]
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_SRC[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVSHDUP (VEX.256 encoded version)
DEST[31:0] ← SRC[63:32]
DEST[63:32] ← SRC[63:32]
DEST[95:64] ← SRC[127:96]
DEST[127:96] ← SRC[127:96]
DEST[159:128] ← SRC[191:160]
DEST[191:160] ← SRC[191:160]
DEST[223:192] ← SRC[255:224]
DEST[255:224] ← SRC[255:224]
DEST[MAXVL-1:256] ← 0

VMOVSHDUP (VEX.128 encoded version)
DEST[31:0] ← SRC[63:32]
DEST[63:32] ← SRC[63:32]
DEST[95:64] ← SRC[127:96]
DEST[127:96] ← SRC[127:96]
DEST[MAXVL-1:128] ← 0

MOVSHDUP (128-bit Legacy SSE version)

DEST[31:0] SRC[63:32]

DEST[63:32] SRC[63:32]

DEST[95:64] SRC[127:96]

DEST[127:96] SRC[127:96]

DEST[MAXVL-1:128] (Unmodified)
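
The EVEX pseudocode above works in two steps: it first builds the duplicated temporary, then applies the writemask to each 32-bit element. The C sketch below is a scalar reference model of those two steps only; it is not how the hardware or the compiler intrinsics are implemented. The name movshdup_model and its parameters are hypothetical, the caller is assumed to pass KL = 4, 8, or 16, and the zeroing of DEST[MAXVL-1:VL] is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of VMOVSHDUP (EVEX forms).
 * dst/src hold kl 32-bit elements; bit j of k1 guards element j.
 * zeroing != 0 selects zeroing-masking, otherwise merging-masking.
 * Pass k1 = 0xFFFF to model the "no writemask" case. */
void movshdup_model(float *dst, const float *src, size_t kl,
                    uint16_t k1, int zeroing)
{
    float tmp[16];
    assert(kl == 4 || kl == 8 || kl == 16);

    /* TMP_SRC: each element takes the odd-indexed element of its pair. */
    for (size_t j = 0; j < kl; j++)
        tmp[j] = src[j | 1];

    /* Apply the writemask at 32-bit granularity. */
    for (size_t j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u)
            dst[j] = tmp[j];
        else if (zeroing)
            dst[j] = 0.0f;
        /* merging-masking: dst[j] is left unchanged */
    }
}
```

With a source of {10, 11, 12, 13} and a full mask, the model yields {11, 11, 13, 13}, matching the X1 X1 X3 X3 pattern in Figure 4-3.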

Intel C/C++ Compiler Intrinsic Equivalent

VMOVSHDUP __m512 _mm512_movehdup_ps( __m512 a);

VMOVSHDUP __m512 _mm512_mask_movehdup_ps(__m512 s, __mmask16 k, __m512 a);

VMOVSHDUP __m512 _mm512_maskz_movehdup_ps( __mmask16 k, __m512 a);

VMOVSHDUP __m256 _mm256_mask_movehdup_ps(__m256 s, __mmask8 k, __m256 a);

VMOVSHDUP __m256 _mm256_maskz_movehdup_ps( __mmask8 k, __m256 a);

VMOVSHDUP __m128 _mm_mask_movehdup_ps(__m128 s, __mmask8 k, __m128 a);

VMOVSHDUP __m128 _mm_maskz_movehdup_ps( __mmask8 k, __m128 a);

VMOVSHDUP __m256 _mm256_movehdup_ps (__m256 a);

VMOVSHDUP __m128 _mm_movehdup_ps (__m128 a);

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4;

EVEX-encoded instruction, see Exceptions Type E4NF.nb.

#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.

MOVSLDUP—Replicate Single FP Values

Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
F3 0F 12 /r MOVSLDUP xmm1, xmm2/m128 | A | V/V | SSE3 | Move even index single-precision floating-point values from xmm2/mem and duplicate each element into xmm1.
VEX.128.F3.0F.WIG 12 /r VMOVSLDUP xmm1, xmm2/m128 | A | V/V | AVX | Move even index single-precision floating-point values from xmm2/mem and duplicate each element into xmm1.
VEX.256.F3.0F.WIG 12 /r VMOVSLDUP ymm1, ymm2/m256 | A | V/V | AVX | Move even index single-precision floating-point values from ymm2/mem and duplicate each element into ymm1.
EVEX.128.F3.0F.W0 12 /r VMOVSLDUP xmm1 {k1}{z}, xmm2/m128 | B | V/V | AVX512VL AVX512F | Move even index single-precision floating-point values from xmm2/m128 and duplicate each element into xmm1 under writemask.
EVEX.256.F3.0F.W0 12 /r VMOVSLDUP ymm1 {k1}{z}, ymm2/m256 | B | V/V | AVX512VL AVX512F | Move even index single-precision floating-point values from ymm2/m256 and duplicate each element into ymm1 under writemask.
EVEX.512.F3.0F.W0 12 /r VMOVSLDUP zmm1 {k1}{z}, zmm2/m512 | B | V/V | AVX512F | Move even index single-precision floating-point values from zmm2/m512 and duplicate each element into zmm1 under writemask.

Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
B | Full Mem | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Duplicates even-indexed single-precision floating-point values from the source operand (the second operand). See Figure 4-4. The source operand is an XMM, YMM or ZMM register or 128, 256 or 512-bit memory location and the destination operand is an XMM, YMM or ZMM register.

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.

VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed.

VEX.256 encoded version: Bits (MAXVL-1:256) of the destination register are zeroed.

EVEX encoded version: The destination operand is updated at 32-bit granularity according to the writemask.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.

Figure 4-4. MOVSLDUP Operation

SRC:  X7 X6 X5 X4 X3 X2 X1 X0
DEST: X6 X6 X4 X4 X2 X2 X0 X0

Operation

VMOVSLDUP (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)

TMP_SRC[31:0] ← SRC[31:0]
TMP_SRC[63:32] ← SRC[31:0]
TMP_SRC[95:64] ← SRC[95:64]
TMP_SRC[127:96] ← SRC[95:64]
IF VL >= 256
    TMP_SRC[159:128] ← SRC[159:128]
    TMP_SRC[191:160] ← SRC[159:128]
    TMP_SRC[223:192] ← SRC[223:192]
    TMP_SRC[255:224] ← SRC[223:192]
FI;
IF VL >= 512
    TMP_SRC[287:256] ← SRC[287:256]
    TMP_SRC[319:288] ← SRC[287:256]
    TMP_SRC[351:320] ← SRC[351:320]
    TMP_SRC[383:352] ← SRC[351:320]
    TMP_SRC[415:384] ← SRC[415:384]
    TMP_SRC[447:416] ← SRC[415:384]
    TMP_SRC[479:448] ← SRC[479:448]
    TMP_SRC[511:480] ← SRC[479:448]
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_SRC[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVSLDUP (VEX.256 encoded version)

DEST[31:0] ← SRC[31:0]
DEST[63:32] ← SRC[31:0]
DEST[95:64] ← SRC[95:64]
DEST[127:96] ← SRC[95:64]
DEST[159:128] ← SRC[159:128]
DEST[191:160] ← SRC[159:128]
DEST[223:192] ← SRC[223:192]
DEST[255:224] ← SRC[223:192]
DEST[MAXVL-1:256] ← 0

VMOVSLDUP (VEX.128 encoded version)

DEST[31:0] ← SRC[31:0]
DEST[63:32] ← SRC[31:0]
DEST[95:64] ← SRC[95:64]
DEST[127:96] ← SRC[95:64]
DEST[MAXVL-1:128] ← 0

MOVSLDUP (128-bit Legacy SSE version)

DEST[31:0] SRC[31:0]

DEST[63:32] SRC[31:0]

DEST[95:64] SRC[95:64]

DEST[127:96] SRC[95:64]

DEST[MAXVL-1:128] (Unmodified)
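
The legacy SSE and VEX encodings duplicate the same elements and differ only in what happens to the upper bits of the destination register. The C sketch below models that difference, under the assumption that MAXVL is 512 and that a full register can be represented as 16 floats. The name movsldup_model and its parameters are hypothetical, and the masked EVEX forms are not covered.

```c
#include <assert.h>
#include <stddef.h>

#define MAXVL_ELEMS 16  /* assumes MAXVL = 512 bits, i.e. 16 32-bit elements */

/* Hypothetical model of MOVSLDUP/VMOVSLDUP on a full-width register.
 * vl_elems is 4 (128-bit form) or 8 (256-bit form).
 * legacy_sse != 0 models the legacy SSE encoding: upper elements unchanged.
 * legacy_sse == 0 models the VEX encodings: upper elements zeroed. */
void movsldup_model(float dst[MAXVL_ELEMS], const float *src,
                    size_t vl_elems, int legacy_sse)
{
    float tmp[8];
    assert(vl_elems == 4 || vl_elems == 8);
    assert(!(legacy_sse && vl_elems != 4)); /* legacy SSE form is 128-bit only */

    /* Each element takes the even-indexed element of its pair. */
    for (size_t j = 0; j < vl_elems; j++)
        tmp[j] = src[j & ~(size_t)1];
    for (size_t j = 0; j < vl_elems; j++)
        dst[j] = tmp[j];

    /* VEX encodings zero everything above the vector length. */
    if (!legacy_sse)
        for (size_t j = vl_elems; j < MAXVL_ELEMS; j++)
            dst[j] = 0.0f;
}
```

Calling the model with legacy_sse set leaves elements 4 through 15 of dst as they were, while the VEX variants clear every element above the vector length.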

Intel C/C++ Compiler Intrinsic Equivalent

VMOVSLDUP __m512 _mm512_moveldup_ps( __m512 a);

VMOVSLDUP __m512 _mm512_mask_moveldup_ps(__m512 s, __mmask16 k, __m512 a);

VMOVSLDUP __m512 _mm512_maskz_moveldup_ps( __mmask16 k, __m512 a);

VMOVSLDUP __m256 _mm256_mask_moveldup_ps(__m256 s, __mmask8 k, __m256 a);

VMOVSLDUP __m256 _mm256_maskz_moveldup_ps( __mmask8 k, __m256 a);

VMOVSLDUP __m128 _mm_mask_moveldup_ps(__m128 s, __mmask8 k, __m128 a);

VMOVSLDUP __m128 _mm_maskz_moveldup_ps( __mmask8 k, __m128 a);

VMOVSLDUP __m256 _mm256_moveldup_ps (__m256 a);

VMOVSLDUP __m128 _mm_moveldup_ps (__m128 a);

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4;

EVEX-encoded instruction, see Exceptions Type E4NF.nb.

#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.

MOVSS—Move or Merge Scalar Single-Precision Floating-Point Value

Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
F3 0F 10 /r MOVSS xmm1, xmm2 | A | V/V | SSE | Merge scalar single-precision floating-point value from xmm2 to xmm1 register.
F3 0F 10 /r MOVSS xmm1, m32 | A | V/V | SSE | Load scalar single-precision floating-point value from m32 to xmm1 register.
VEX.LIG.F3.0F.WIG 10 /r VMOVSS xmm1, xmm2, xmm3 | B | V/V | AVX | Merge scalar single-precision floating-point value from xmm2 and xmm3 to xmm1 register.
VEX.LIG.F3.0F.WIG 10 /r VMOVSS xmm1, m32 | D | V/V | AVX | Load scalar single-precision floating-point value from m32 to xmm1 register.
F3 0F 11 /r MOVSS xmm2/m32, xmm1 | C | V/V | SSE | Move scalar single-precision floating-point value from xmm1 register to xmm2/m32.
VEX.LIG.F3.0F.WIG 11 /r VMOVSS xmm1, xmm2, xmm3 | E | V/V | AVX | Move scalar single-precision floating-point value from xmm2 and xmm3 to xmm1 register.
VEX.LIG.F3.0F.WIG 11 /r VMOVSS m32, xmm1 | C | V/V | AVX | Move scalar single-precision floating-point value from xmm1 register to m32.
EVEX.LIG.F3.0F.W0 10 /r VMOVSS xmm1 {k1}{z}, xmm2, xmm3 | B | V/V | AVX512F | Move scalar single-precision floating-point value from xmm2 and xmm3 to xmm1 register under writemask k1.
EVEX.LIG.F3.0F.W0 10 /r VMOVSS xmm1 {k1}{z}, m32 | F | V/V | AVX512F | Move scalar single-precision floating-point values from m32 to xmm1 under writemask k1.
EVEX.LIG.F3.0F.W0 11 /r VMOVSS xmm1 {k1}{z}, xmm2, xmm3 | E | V/V | AVX512F | Move scalar single-precision floating-point value from xmm2 and xmm3 to xmm1 register under writemask k1.
EVEX.LIG.F3.0F.W0 11 /r VMOVSS m32 {k1}, xmm1 | G | V/V | AVX512F | Move scalar single-precision floating-point values from xmm1 to m32 under writemask k1.

Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
C | NA | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
D | NA | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
E | NA | ModRM:r/m (w) | vvvv (r) | ModRM:reg (r) | NA
F | Tuple1 Scalar | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
G | Tuple1 Scalar | ModRM:r/m (w) | ModRM:reg (r) | NA | NA

Description

Moves a scalar single-precision floating-point value from the source operand (second operand) to the destination operand (first operand). The source and destination operands can be XMM registers or 32-bit memory locations. This instruction can be used to move a single-precision floating-point value to and from the low doubleword of an XMM register and a 32-bit memory location, or to move a single-precision floating-point value between the low doublewords of two XMM registers. The instruction cannot be used to transfer data between memory locations.

Legacy version: When the source and destination operands are XMM registers, bits (MAXVL-1:32) of the corresponding destination register are unmodified. When the source operand is a memory location and the destination operand is an XMM register, bits (127:32) of the destination operand are cleared to all 0s, and bits (MAXVL-1:128) of the destination operand remain unchanged.

VEX and EVEX encoded register-register syntax: Moves a scalar single-precision floating-point value from the second source operand (the third operand) to the low doubleword element of the destination operand (the first operand). Bits 127:32 of the destination operand are copied from the first source operand (the second operand). Bits (MAXVL-1:128) of the corresponding destination register are zeroed.

VEX and EVEX encoded memory load syntax: When the source operand is a memory location and the destination operand is an XMM register, bits (MAXVL-1:32) of the destination operand are cleared to all 0s.

EVEX encoded versions: The low doubleword of the destination is updated according to the writemask.

Note: For the memory store form instruction “VMOVSS m32, xmm1”, VEX.vvvv is reserved and must be 1111b; otherwise the instruction will #UD. For the memory store form instruction “VMOVSS m32 {k1}, xmm1”, EVEX.vvvv is reserved and must be 1111b; otherwise the instruction will #UD.

Software should ensure VMOVSS is encoded with VEX.L=0. Encoding VMOVSS with VEX.L=1 may encounter

unpredictable behavior across different processor generations.
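
The register-register and memory-load forms described above treat bits 127:32 of the destination differently. The C sketch below makes that explicit. It models only the low 128 bits of an XMM register as four 32-bit lanes and leaves out writemasks and the bits above 127; the type xmm_model and the function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model: the low 128 bits of an XMM register as four lanes;
 * lane[0] is bits 31:0. */
typedef struct { float lane[4]; } xmm_model;

/* Legacy MOVSS xmm1, xmm2: only the low lane of the destination changes. */
void movss_reg_legacy(xmm_model *dst, const xmm_model *src)
{
    dst->lane[0] = src->lane[0];
}

/* VMOVSS xmm1, xmm2, xmm3: low lane from src2 (third operand),
 * lanes 1..3 from src1 (second operand). */
xmm_model vmovss_reg(const xmm_model *src1, const xmm_model *src2)
{
    xmm_model r = *src1;
    r.lane[0] = src2->lane[0];
    return r;
}

/* MOVSS/VMOVSS xmm1, m32: load the scalar and clear lanes 1..3. */
xmm_model movss_load(const float *m32)
{
    xmm_model r;
    assert(m32 != NULL);
    memset(&r, 0, sizeof r);   /* bits 127:32 become all 0s */
    r.lane[0] = *m32;
    return r;
}
```

The merge forms preserve three lanes from an existing register, whereas the load form always produces zeros in lanes 1 through 3.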

Operation

VMOVSS (EVEX.LIG.F3.0F.W0 10 /r when the source operand is memory and the destination is an XMM register)

IF k1[0] or *no writemask*
    THEN DEST[31:0] ← SRC[31:0]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[MAXVL-1:32] ← 0

VMOVSS (EVEX.LIG.F3.0F.W0 11 /r when the source operand is an XMM register and the destination is memory)

IF k1[0] or *no writemask*
    THEN DEST[31:0] ← SRC[31:0]
    ELSE *DEST[31:0] remains unchanged* ; merging-masking
FI;

VMOVSS (EVEX.LIG.F3.0F.W0 10/11 /r where the source and destination are XMM registers)

IF k1[0] or *no writemask*
    THEN DEST[31:0] ← SRC2[31:0]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0

MOVSS (Legacy SSE version when the source and destination operands are both XMM registers)

DEST[31:0] ← SRC[31:0]
DEST[MAXVL-1:32] (Unmodified)

VMOVSS (VEX.128.F3.0F 11 /r where the destination is an XMM register)

DEST[31:0] ← SRC2[31:0]
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0

VMOVSS (VEX.128.F3.0F 10 /r where the source and destination are XMM registers)

DEST[31:0] ← SRC2[31:0]
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0

VMOVSS (VEX.128.F3.0F 10 /r when the source operand is memory and the destination is an XMM register)

DEST[31:0] ← SRC[31:0]
DEST[MAXVL-1:32] ← 0

MOVSS/VMOVSS (when the source operand is an XMM register and the destination is memory)

DEST[31:0] ← SRC[31:0]

MOVSS (Legacy SSE version when the source operand is memory and the destination is an XMM register)

DEST[31:0] ← SRC[31:0]
DEST[127:32] ← 0
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VMOVSS __m128 _mm_mask_load_ss(__m128 s, __mmask8 k, float * p);

VMOVSS __m128 _mm_maskz_load_ss( __mmask8 k, float * p);

VMOVSS __m128 _mm_mask_move_ss(__m128 sh, __mmask8 k, __m128 sl, __m128 a);

VMOVSS __m128 _mm_maskz_move_ss( __mmask8 k, __m128 s, __m128 a);

VMOVSS void _mm_mask_store_ss(float * p, __mmask8 k, __m128 a);

MOVSS __m128 _mm_load_ss(float * p);

MOVSS void _mm_store_ss(float * p, __m128 a);

MOVSS __m128 _mm_move_ss(__m128 a, __m128 b);

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 5; additionally

#UD If VEX.vvvv != 1111B.

EVEX-encoded instruction, see Exceptions Type E10.

MOVSX/MOVSXD—Move with Sign-Extension

Instruction Operand Encoding

Description

Copies the contents of the source operand (register or memory location) to the destination operand (register) and sign extends the value to 16 or 32 bits (see Figure 7-6 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1). The size of the converted value depends on the operand-size attribute.

In 64-bit mode, the instruction’s default operation size is 32 bits. Use of the REX.R prefix permits access to additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.

Operation

DEST ← SignExtend(SRC);
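
SignExtend copies the source's most significant bit into every higher bit of the destination. The C sketch below illustrates that for 8-, 16-, and 32-bit sources widened to 64 bits. The helper names sign_extend and byte_to_dword are hypothetical. The final conversion from an unsigned to a signed type is implementation-defined in C; the sketch assumes the usual two's complement behavior.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: sign-extend the low `bits` bits of `value` to 64 bits,
 * mirroring DEST <- SignExtend(SRC). */
int64_t sign_extend(uint64_t value, unsigned bits)
{
    assert(bits >= 1 && bits <= 63);
    uint64_t sign = (uint64_t)1 << (bits - 1);   /* position of the source sign bit */
    uint64_t low  = value & ((sign << 1) - 1);   /* keep only the source bits */
    return (int64_t)((low ^ sign) - sign);       /* replicate the sign bit upward */
}

/* The same result expressed through C integer conversions
 * (byte source, doubleword destination). */
int32_t byte_to_dword(uint8_t b)
{
    return (int32_t)(int8_t)b;
}
```

For example, the byte 0x80 extends to -128 and the byte 0x7F extends to 127.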

Flags Affected

None.

Protected Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register contains a NULL segment selector.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the

current privilege level is 3.

#UD If the LOCK prefix is used.

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
0F BE /r | MOVSX r16, r/m8 | RM | Valid | Valid | Move byte to word with sign-extension.
0F BE /r | MOVSX r32, r/m8 | RM | Valid | Valid | Move byte to doubleword with sign-extension.
REX.W + 0F BE /r | MOVSX r64, r/m8 | RM | Valid | N.E. | Move byte to quadword with sign-extension.
0F BF /r | MOVSX r32, r/m16 | RM | Valid | Valid | Move word to doubleword, with sign-extension.
REX.W + 0F BF /r | MOVSX r64, r/m16 | RM | Valid | N.E. | Move word to quadword with sign-extension.
63 /r* | MOVSXD r16, r/m16 | RM | Valid | Valid | Move word to word with sign-extension.
63 /r* | MOVSXD r32, r/m32 | RM | Valid | Valid | Move doubleword to doubleword with sign-extension.
REX.W + 63 /r | MOVSXD r64, r/m32 | RM | Valid | N.E. | Move doubleword to quadword with sign-extension.

NOTES:
* The use of MOVSXD without REX.W in 64-bit mode is discouraged. Regular MOV should be used instead of using MOVSXD without REX.W.

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If the LOCK prefix is used.

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#UD If the LOCK prefix is used.

Compatibility Mode Exceptions

Same exceptions as in protected mode.

64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the memory address is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the

current privilege level is 3.

#UD If the LOCK prefix is used.

MOVUPD—Move Unaligned Packed Double-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
66 0F 10 /r MOVUPD xmm1, xmm2/m128 | A | V/V | SSE2 | Move unaligned packed double-precision floating-point from xmm2/mem to xmm1.
66 0F 11 /r MOVUPD xmm2/m128, xmm1 | B | V/V | SSE2 | Move unaligned packed double-precision floating-point from xmm1 to xmm2/mem.
VEX.128.66.0F.WIG 10 /r VMOVUPD xmm1, xmm2/m128 | A | V/V | AVX | Move unaligned packed double-precision floating-point from xmm2/mem to xmm1.
VEX.128.66.0F.WIG 11 /r VMOVUPD xmm2/m128, xmm1 | B | V/V | AVX | Move unaligned packed double-precision floating-point from xmm1 to xmm2/mem.
VEX.256.66.0F.WIG 10 /r VMOVUPD ymm1, ymm2/m256 | A | V/V | AVX | Move unaligned packed double-precision floating-point from ymm2/mem to ymm1.
VEX.256.66.0F.WIG 11 /r VMOVUPD ymm2/m256, ymm1 | B | V/V | AVX | Move unaligned packed double-precision floating-point from ymm1 to ymm2/mem.
EVEX.128.66.0F.W1 10 /r VMOVUPD xmm1 {k1}{z}, xmm2/m128 | C | V/V | AVX512VL AVX512F | Move unaligned packed double-precision floating-point from xmm2/m128 to xmm1 using writemask k1.
EVEX.128.66.0F.W1 11 /r VMOVUPD xmm2/m128 {k1}{z}, xmm1 | D | V/V | AVX512VL AVX512F | Move unaligned packed double-precision floating-point from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.66.0F.W1 10 /r VMOVUPD ymm1 {k1}{z}, ymm2/m256 | C | V/V | AVX512VL AVX512F | Move unaligned packed double-precision floating-point from ymm2/m256 to ymm1 using writemask k1.
EVEX.256.66.0F.W1 11 /r VMOVUPD ymm2/m256 {k1}{z}, ymm1 | D | V/V | AVX512VL AVX512F | Move unaligned packed double-precision floating-point from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.66.0F.W1 10 /r VMOVUPD zmm1 {k1}{z}, zmm2/m512 | C | V/V | AVX512F | Move unaligned packed double-precision floating-point values from zmm2/m512 to zmm1 using writemask k1.
EVEX.512.66.0F.W1 11 /r VMOVUPD zmm2/m512 {k1}{z}, zmm1 | D | V/V | AVX512F | Move unaligned packed double-precision floating-point values from zmm1 to zmm2/m512 using writemask k1.

Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
C | Full Mem | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
D | Full Mem | ModRM:r/m (w) | ModRM:reg (r) | NA | NA

Description

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise instructions will #UD.

EVEX.512 encoded version:

Moves 512 bits of packed double-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a ZMM register from a float64 memory location, or to store the contents of a ZMM register into memory. The destination operand is updated according to the writemask.

VEX.256 encoded version:

Moves 256 bits of packed double-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers. Bits (MAXVL-1:256) of the destination register are zeroed.

128-bit versions:

Moves 128 bits of packed double-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.

When the source or destination operand is a memory operand, the operand may be unaligned on a 16-byte boundary without causing a general-protection exception (#GP) to be generated.

VEX.128 and EVEX.128 encoded versions: Bits (MAXVL-1:128) of the destination register are zeroed.

Operation

VMOVUPD (EVEX encoded versions, register-copy form)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVUPD (EVEX encoded versions, store-form)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE *DEST[i+63:i] remains unchanged* ; merging-masking
    FI;
ENDFOR;

VMOVUPD (EVEX encoded versions, load-form)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVUPD (VEX.256 encoded version, load- and register-copy form)

DEST[255:0] ← SRC[255:0]
DEST[MAXVL-1:256] ← 0

VMOVUPD (VEX.256 encoded version, store-form)

DEST[255:0] ← SRC[255:0]

VMOVUPD (VEX.128 encoded version)

DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] ← 0

MOVUPD (128-bit load- and register-copy form, Legacy SSE version)

DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] (Unmodified)

(V)MOVUPD (128-bit store-form version)

DEST[127:0] ← SRC[127:0]

Intel C/C++ Compiler Intrinsic Equivalent

VMOVUPD __m512d _mm512_loadu_pd( void * s);

VMOVUPD __m512d _mm512_mask_loadu_pd(__m512d a, __mmask8 k, void * s);

VMOVUPD __m512d _mm512_maskz_loadu_pd( __mmask8 k, void * s);

VMOVUPD void _mm512_storeu_pd( void * d, __m512d a);

VMOVUPD void _mm512_mask_storeu_pd( void * d, __mmask8 k, __m512d a);

VMOVUPD __m256d _mm256_mask_loadu_pd(__m256d s, __mmask8 k, void * m);

VMOVUPD __m256d _mm256_maskz_loadu_pd( __mmask8 k, void * m);

VMOVUPD void _mm256_mask_storeu_pd( void * d, __mmask8 k, __m256d a);

VMOVUPD __m128d _mm_mask_loadu_pd(__m128d s, __mmask8 k, void * m);

VMOVUPD __m128d _mm_maskz_loadu_pd( __mmask8 k, void * m);

VMOVUPD void _mm_mask_storeu_pd( void * d, __mmask8 k, __m128d a);

MOVUPD __m256d _mm256_loadu_pd (double * p);

MOVUPD void _mm256_storeu_pd( double *p, __m256d a);

MOVUPD __m128d _mm_loadu_pd (double * p);

MOVUPD void _mm_storeu_pd( double *p, __m128d a);
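
The masked load and store intrinsics listed above follow the EVEX load-form and store-form pseudocode. The C sketch below is a scalar model of the per-element masking only; it does not use the intrinsics and does not model the zeroing of bits above VL. The function names movupd_mask_load and movupd_mask_store are hypothetical, and the caller is assumed to pass KL = 2, 4, or 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the EVEX load form: elements whose mask bit is
 * clear are kept (merging-masking) or set to zero (zeroing-masking). */
void movupd_mask_load(double *dst, const double *mem, size_t kl,
                      uint8_t k1, int zeroing)
{
    assert(kl == 2 || kl == 4 || kl == 8);
    for (size_t j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u)
            dst[j] = mem[j];
        else if (zeroing)
            dst[j] = 0.0;
        /* merging-masking: dst[j] is left unchanged */
    }
}

/* Hypothetical model of the EVEX store form: memory elements whose mask
 * bit is clear are not written (the store-form pseudocode only merges). */
void movupd_mask_store(double *mem, const double *src, size_t kl, uint8_t k1)
{
    assert(kl == 2 || kl == 4 || kl == 8);
    for (size_t j = 0; j < kl; j++)
        if ((k1 >> j) & 1u)
            mem[j] = src[j];
}
```

The store model takes no zeroing argument because the store-form pseudocode defines only merging behavior for masked-off elements.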

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.

Note treatment of #AC varies; additionally

#UD If VEX.vvvv != 1111B.

EVEX-encoded instruction, see Exceptions Type E4.nb.

MOVUPS—Move Unaligned Packed Single-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
NP 0F 10 /r MOVUPS xmm1, xmm2/m128 | A | V/V | SSE | Move unaligned packed single-precision floating-point from xmm2/mem to xmm1.
NP 0F 11 /r MOVUPS xmm2/m128, xmm1 | B | V/V | SSE | Move unaligned packed single-precision floating-point from xmm1 to xmm2/mem.
VEX.128.0F.WIG 10 /r VMOVUPS xmm1, xmm2/m128 | A | V/V | AVX | Move unaligned packed single-precision floating-point from xmm2/mem to xmm1.
VEX.128.0F.WIG 11 /r VMOVUPS xmm2/m128, xmm1 | B | V/V | AVX | Move unaligned packed single-precision floating-point from xmm1 to xmm2/mem.
VEX.256.0F.WIG 10 /r VMOVUPS ymm1, ymm2/m256 | A | V/V | AVX | Move unaligned packed single-precision floating-point from ymm2/mem to ymm1.
VEX.256.0F.WIG 11 /r VMOVUPS ymm2/m256, ymm1 | B | V/V | AVX | Move unaligned packed single-precision floating-point from ymm1 to ymm2/mem.
EVEX.128.0F.W0 10 /r VMOVUPS xmm1 {k1}{z}, xmm2/m128 | C | V/V | AVX512VL AVX512F | Move unaligned packed single-precision floating-point values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.0F.W0 10 /r VMOVUPS ymm1 {k1}{z}, ymm2/m256 | C | V/V | AVX512VL AVX512F | Move unaligned packed single-precision floating-point values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.0F.W0 10 /r VMOVUPS zmm1 {k1}{z}, zmm2/m512 | C | V/V | AVX512F | Move unaligned packed single-precision floating-point values from zmm2/m512 to zmm1 using writemask k1.
EVEX.128.0F.W0 11 /r VMOVUPS xmm2/m128 {k1}{z}, xmm1 | D | V/V | AVX512VL AVX512F | Move unaligned packed single-precision floating-point values from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.0F.W0 11 /r VMOVUPS ymm2/m256 {k1}{z}, ymm1 | D | V/V | AVX512VL AVX512F | Move unaligned packed single-precision floating-point values from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.0F.W0 11 /r VMOVUPS zmm2/m512 {k1}{z}, zmm1 | D | V/V | AVX512F | Move unaligned packed single-precision floating-point values from zmm1 to zmm2/m512 using writemask k1.

Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
C | Full Mem | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
D | Full Mem | ModRM:r/m (w) | ModRM:reg (r) | NA | NA

Description

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise instructions will #UD.

EVEX.512 encoded version:

Moves 512 bits of packed single-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a ZMM register from a 512-bit float32 memory location, or to store the contents of a ZMM register into memory. The destination operand is updated according to the writemask.

VEX.256 and EVEX.256 encoded versions:

Moves 256 bits of packed single-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers. Bits (MAXVL-1:256) of the destination register are zeroed.

128-bit versions:

Moves 128 bits of packed single-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.

When the source or destination operand is a memory operand, the operand may be unaligned without causing a general-protection exception (#GP) to be generated.

VEX.128 and EVEX.128 encoded versions: Bits (MAXVL-1:128) of the destination register are zeroed.

Operation

VMOVUPS (EVEX encoded versions, register-copy form)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVUPS (EVEX encoded versions, store-form)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
    FI;
ENDFOR;

VMOVUPS (EVEX encoded versions, load-form)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVUPS (VEX.256 encoded version, load- and register-copy form)

DEST[255:0] ← SRC[255:0]
DEST[MAXVL-1:256] ← 0

VMOVUPS (VEX.256 encoded version, store-form)

DEST[255:0] ← SRC[255:0]

VMOVUPS (VEX.128 encoded version)

DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] ← 0

MOVUPS (128-bit load- and register-copy form, Legacy SSE version)

DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] (Unmodified)

(V)MOVUPS (128-bit store-form version)

DEST[127:0] ← SRC[127:0]

Intel C/C++ Compiler Intrinsic Equivalent

VMOVUPS __m512 _mm512_loadu_ps( void * s);

VMOVUPS __m512 _mm512_mask_loadu_ps(__m512 a, __mmask16 k, void * s);

VMOVUPS __m512 _mm512_maskz_loadu_ps( __mmask16 k, void * s);

VMOVUPS void _mm512_storeu_ps( void * d, __m512 a);

VMOVUPS void _mm512_mask_storeu_ps( void * d, __mmask16 k, __m512 a);

VMOVUPS __m256 _mm256_mask_loadu_ps(__m256 a, __mmask8 k, void * s);

VMOVUPS __m256 _mm256_maskz_loadu_ps( __mmask8 k, void * s);

VMOVUPS void _mm256_mask_storeu_ps( void * d, __mmask8 k, __m256 a);

VMOVUPS __m128 _mm_mask_loadu_ps(__m128 a, __mmask8 k, void * s);

VMOVUPS __m128 _mm_maskz_loadu_ps( __mmask8 k, void * s);

VMOVUPS void _mm_mask_storeu_ps( void * d, __mmask8 k, __m128 a);

MOVUPS __m256 _mm256_loadu_ps ( float * p);

MOVUPS void _mm256_storeu_ps( float *p, __m256 a);

MOVUPS __m128 _mm_loadu_ps ( float * p);

MOVUPS void _mm_storeu_ps( float *p, __m128 a);
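
The loadu and storeu intrinsics above accept addresses with no alignment requirement, which is the property that distinguishes MOVUPS from the aligned moves. The C sketch below shows the same idea portably with memcpy instead of the intrinsics, so it builds on any target. The helper names loadu4 and storeu4 are hypothetical, and nothing here guarantees that a compiler emits MOVUPS for these copies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: read four floats from an address that may be
 * unaligned. memcpy is the portable C way to express such an access. */
void loadu4(float out[4], const void *p)
{
    assert(p != NULL);
    memcpy(out, p, 4 * sizeof(float));
}

/* Hypothetical helper: write four floats to an address that may be unaligned. */
void storeu4(void *p, const float in[4])
{
    assert(p != NULL);
    memcpy(p, in, 4 * sizeof(float));
}
```

Storing to and reloading from an odd byte offset within a byte buffer returns the original four values, which is the round trip an unaligned 128-bit move performs.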

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.

Note treatment of #AC varies;

EVEX-encoded instruction, see Exceptions Type E4.nb.

#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.

MOVZX—Move with Zero-Extend

Instruction Operand Encoding

Description

Copies the contents of the source operand (register or memory location) to the destination operand (register) and zero extends the value. The size of the converted value depends on the operand-size attribute.

In 64-bit mode, the instruction’s default operation size is 32 bits. Use of the REX.R prefix permits access to additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64-bit operands. See the summary chart at the beginning of this section for encoding data and limits.

Operation

DEST ← ZeroExtend(SRC);
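
ZeroExtend fills every destination bit above the source width with 0, regardless of the source's most significant bit. The C sketch below illustrates this for widening to 64 bits. The helper names zero_extend and byte_to_dword_zx are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: zero-extend the low `bits` bits of `value` to 64 bits,
 * mirroring DEST <- ZeroExtend(SRC). */
uint64_t zero_extend(uint64_t value, unsigned bits)
{
    assert(bits >= 1 && bits <= 63);
    return value & (((uint64_t)1 << bits) - 1);  /* clear everything above the source */
}

/* The same effect expressed through an unsigned integer conversion
 * (byte source, doubleword destination). */
uint32_t byte_to_dword_zx(uint8_t b)
{
    return (uint32_t)b;
}
```

For example, the byte 0xF0 zero-extends to 240, whereas sign extension of the same byte would give -16.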

Flags Affected

None.

Protected Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register contains a NULL segment selector.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the

current privilege level is 3.

#UD If the LOCK prefix is used.

Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If the LOCK prefix is used.

Opcode             Instruction        Op/En  64-Bit Mode  Compat/Leg Mode  Description
0F B6 /r           MOVZX r16, r/m8    RM     Valid        Valid            Move byte to word with zero-extension.
0F B6 /r           MOVZX r32, r/m8    RM     Valid        Valid            Move byte to doubleword, zero-extension.
REX.W + 0F B6 /r   MOVZX r64, r/m8*   RM     Valid        N.E.             Move byte to quadword, zero-extension.
0F B7 /r           MOVZX r32, r/m16   RM     Valid        Valid            Move word to doubleword, zero-extension.
REX.W + 0F B7 /r   MOVZX r64, r/m16   RM     Valid        N.E.             Move word to quadword, zero-extension.

NOTES:
* In 64-bit mode, r/m8 cannot be encoded to access the following byte registers if the REX prefix is used: AH, BH, CH, DH.

Op/En  Operand 1      Operand 2      Operand 3  Operand 4
RM     ModRM:reg (w)  ModRM:r/m (r)  NA         NA


Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made.

#UD If the LOCK prefix is used.

Compatibility Mode Exceptions

Same exceptions as in protected mode.

64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the memory address is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the

current privilege level is 3.

#UD If the LOCK prefix is used.

MPSADBW — Compute Multiple Packed Sums of Absolute Difference


Instruction Operand Encoding

Description

(V)MPSADBW calculates packed word results of sum-absolute-difference (SAD) of unsigned bytes from two blocks of 32-bit dword elements, using two select fields in the immediate byte to select the offsets of the two blocks within the first source operand and the second source operand. Packed SAD word results are calculated within each 128-bit lane. Each SAD word result is calculated between a stationary block_2 (whose offset within the second source operand is selected by a two bit select control, multiplied by 32 bits) and a sliding block_1 at consecutive byte-granular positions within the first source operand. The offset of the first 32-bit block of block_1 is selectable using a one bit select control, multiplied by 32 bits.

128-bit Legacy SSE version: Imm8[1:0]*32 specifies the bit offset of block_2 within the second source operand. Imm8[2]*32 specifies the initial bit offset of block_1 within the first source operand. The first source operand and destination operand are the same. The first source and destination operands are XMM registers. The second source operand is either an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM destination register remain unchanged. Bits 7:3 of the immediate byte are ignored.

VEX.128 encoded version: Imm8[1:0]*32 specifies the bit offset of block_2 within the second source operand. Imm8[2]*32 specifies the initial bit offset of block_1 within the first source operand. The first source and destination operands are XMM registers. The second source operand is either an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM register are zeroed. Bits 7:3 of the immediate byte are ignored.

VEX.256 encoded version: The sum-absolute-difference (SAD) operation is repeated 8 times for VMPSADBW between the same block_2 (fixed offset within the second source operand) and a variable block_1 (offset is shifted by 8 bits for each SAD operation) in the first source operand. Each 16-bit result of eight SAD operations between block_2 and block_1 is written to the respective word in the lower 128 bits of the destination operand.

Additionally, VMPSADBW performs another eight SAD operations on block_4 of the second source operand and block_3 of the first source operand. (Imm8[4:3]*32 + 128) specifies the bit offset of block_4 within the second source operand. (Imm8[5]*32 + 128) specifies the initial bit offset of block_3 within the first source operand. Each 16-bit result of eight SAD operations between block_4 and block_3 is written to the respective word in the upper 128 bits of the destination operand.

Opcode                        Instruction                           Op/En  64/32-bit Mode  CPUID Feature Flag  Description
66 0F 3A 42 /r ib             MPSADBW xmm1, xmm2/m128, imm8         RMI    V/V             SSE4_1              Sums absolute 8-bit integer difference of adjacent groups of 4 byte integers in xmm1 and xmm2/m128 and writes the results in xmm1. Starting offsets within xmm1 and xmm2/m128 are determined by imm8.
VEX.128.66.0F3A.WIG 42 /r ib  VMPSADBW xmm1, xmm2, xmm3/m128, imm8  RVMI   V/V             AVX                 Sums absolute 8-bit integer difference of adjacent groups of 4 byte integers in xmm2 and xmm3/m128 and writes the results in xmm1. Starting offsets within xmm2 and xmm3/m128 are determined by imm8.
VEX.256.66.0F3A.WIG 42 /r ib  VMPSADBW ymm1, ymm2, ymm3/m256, imm8  RVMI   V/V             AVX2                Sums absolute 8-bit integer difference of adjacent groups of 4 byte integers in ymm2 and ymm3/m256 and writes the results in ymm1. Starting offsets within ymm2 and ymm3/m256 are determined by imm8.

Op/En  Operand 1         Operand 2      Operand 3      Operand 4
RMI    ModRM:reg (r, w)  ModRM:r/m (r)  imm8           NA
RVMI   ModRM:reg (w)     VEX.vvvv (r)   ModRM:r/m (r)  imm8


The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register. Bits 7:6 of the immediate byte are ignored.

Note: If VMPSADBW is encoded with VEX.L = 1, an attempt to execute the instruction will cause a #UD exception.

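Figure 4-5. 256-bit VMPSADBW Operation

[Figure not reproduced. For each 128-bit lane, it shows Src1 and Src2 feeding the "Abs. Diff." and "Sum" stages, whose results are written to Destination. Lower lane (bits 127:0): Src1 offset Imm[2]*32, Src2 offset Imm[1:0]*32. Upper lane (bits 255:128): Src1 offset Imm[5]*32+128, Src2 offset Imm[4:3]*32+128.]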


Operation

VMPSADBW (VEX.256 encoded version)
BLK2_OFFSET ← imm8[1:0]*32
BLK1_OFFSET ← imm8[2]*32
SRC1_BYTE0 ← SRC1[BLK1_OFFSET+7:BLK1_OFFSET]
SRC1_BYTE1 ← SRC1[BLK1_OFFSET+15:BLK1_OFFSET+8]
SRC1_BYTE2 ← SRC1[BLK1_OFFSET+23:BLK1_OFFSET+16]
SRC1_BYTE3 ← SRC1[BLK1_OFFSET+31:BLK1_OFFSET+24]
SRC1_BYTE4 ← SRC1[BLK1_OFFSET+39:BLK1_OFFSET+32]
SRC1_BYTE5 ← SRC1[BLK1_OFFSET+47:BLK1_OFFSET+40]
SRC1_BYTE6 ← SRC1[BLK1_OFFSET+55:BLK1_OFFSET+48]
SRC1_BYTE7 ← SRC1[BLK1_OFFSET+63:BLK1_OFFSET+56]
SRC1_BYTE8 ← SRC1[BLK1_OFFSET+71:BLK1_OFFSET+64]
SRC1_BYTE9 ← SRC1[BLK1_OFFSET+79:BLK1_OFFSET+72]
SRC1_BYTE10 ← SRC1[BLK1_OFFSET+87:BLK1_OFFSET+80]
SRC2_BYTE0 ← SRC2[BLK2_OFFSET+7:BLK2_OFFSET]
SRC2_BYTE1 ← SRC2[BLK2_OFFSET+15:BLK2_OFFSET+8]
SRC2_BYTE2 ← SRC2[BLK2_OFFSET+23:BLK2_OFFSET+16]
SRC2_BYTE3 ← SRC2[BLK2_OFFSET+31:BLK2_OFFSET+24]

TEMP0 ← ABS(SRC1_BYTE0 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE1 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE2 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE3 - SRC2_BYTE3)
DEST[15:0] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE1 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE2 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE3 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE4 - SRC2_BYTE3)
DEST[31:16] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE2 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE3 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE4 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE5 - SRC2_BYTE3)
DEST[47:32] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE3 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE4 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE5 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE6 - SRC2_BYTE3)
DEST[63:48] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE4 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE5 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE6 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE7 - SRC2_BYTE3)
DEST[79:64] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE5 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE6 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE7 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE8 - SRC2_BYTE3)
DEST[95:80] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE6 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE7 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE8 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE9 - SRC2_BYTE3)
DEST[111:96] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE7 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE8 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE9 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE10 - SRC2_BYTE3)
DEST[127:112] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

BLK2_OFFSET ← imm8[4:3]*32 + 128
BLK1_OFFSET ← imm8[5]*32 + 128
SRC1_BYTE0 ← SRC1[BLK1_OFFSET+7:BLK1_OFFSET]
SRC1_BYTE1 ← SRC1[BLK1_OFFSET+15:BLK1_OFFSET+8]
SRC1_BYTE2 ← SRC1[BLK1_OFFSET+23:BLK1_OFFSET+16]
SRC1_BYTE3 ← SRC1[BLK1_OFFSET+31:BLK1_OFFSET+24]
SRC1_BYTE4 ← SRC1[BLK1_OFFSET+39:BLK1_OFFSET+32]
SRC1_BYTE5 ← SRC1[BLK1_OFFSET+47:BLK1_OFFSET+40]
SRC1_BYTE6 ← SRC1[BLK1_OFFSET+55:BLK1_OFFSET+48]
SRC1_BYTE7 ← SRC1[BLK1_OFFSET+63:BLK1_OFFSET+56]
SRC1_BYTE8 ← SRC1[BLK1_OFFSET+71:BLK1_OFFSET+64]
SRC1_BYTE9 ← SRC1[BLK1_OFFSET+79:BLK1_OFFSET+72]
SRC1_BYTE10 ← SRC1[BLK1_OFFSET+87:BLK1_OFFSET+80]
SRC2_BYTE0 ← SRC2[BLK2_OFFSET+7:BLK2_OFFSET]
SRC2_BYTE1 ← SRC2[BLK2_OFFSET+15:BLK2_OFFSET+8]
SRC2_BYTE2 ← SRC2[BLK2_OFFSET+23:BLK2_OFFSET+16]
SRC2_BYTE3 ← SRC2[BLK2_OFFSET+31:BLK2_OFFSET+24]

TEMP0 ← ABS(SRC1_BYTE0 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE1 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE2 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE3 - SRC2_BYTE3)
DEST[143:128] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE1 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE2 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE3 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE4 - SRC2_BYTE3)
DEST[159:144] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE2 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE3 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE4 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE5 - SRC2_BYTE3)
DEST[175:160] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE3 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE4 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE5 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE6 - SRC2_BYTE3)
DEST[191:176] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE4 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE5 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE6 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE7 - SRC2_BYTE3)
DEST[207:192] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE5 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE6 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE7 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE8 - SRC2_BYTE3)
DEST[223:208] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE6 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE7 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE8 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE9 - SRC2_BYTE3)
DEST[239:224] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE7 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE8 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE9 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE10 - SRC2_BYTE3)
DEST[255:240] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

VMPSADBW (VEX.128 encoded version)
BLK2_OFFSET ← imm8[1:0]*32
BLK1_OFFSET ← imm8[2]*32
SRC1_BYTE0 ← SRC1[BLK1_OFFSET+7:BLK1_OFFSET]
SRC1_BYTE1 ← SRC1[BLK1_OFFSET+15:BLK1_OFFSET+8]
SRC1_BYTE2 ← SRC1[BLK1_OFFSET+23:BLK1_OFFSET+16]
SRC1_BYTE3 ← SRC1[BLK1_OFFSET+31:BLK1_OFFSET+24]
SRC1_BYTE4 ← SRC1[BLK1_OFFSET+39:BLK1_OFFSET+32]
SRC1_BYTE5 ← SRC1[BLK1_OFFSET+47:BLK1_OFFSET+40]
SRC1_BYTE6 ← SRC1[BLK1_OFFSET+55:BLK1_OFFSET+48]
SRC1_BYTE7 ← SRC1[BLK1_OFFSET+63:BLK1_OFFSET+56]
SRC1_BYTE8 ← SRC1[BLK1_OFFSET+71:BLK1_OFFSET+64]
SRC1_BYTE9 ← SRC1[BLK1_OFFSET+79:BLK1_OFFSET+72]
SRC1_BYTE10 ← SRC1[BLK1_OFFSET+87:BLK1_OFFSET+80]
SRC2_BYTE0 ← SRC2[BLK2_OFFSET+7:BLK2_OFFSET]
SRC2_BYTE1 ← SRC2[BLK2_OFFSET+15:BLK2_OFFSET+8]
SRC2_BYTE2 ← SRC2[BLK2_OFFSET+23:BLK2_OFFSET+16]
SRC2_BYTE3 ← SRC2[BLK2_OFFSET+31:BLK2_OFFSET+24]

TEMP0 ← ABS(SRC1_BYTE0 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE1 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE2 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE3 - SRC2_BYTE3)
DEST[15:0] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE1 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE2 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE3 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE4 - SRC2_BYTE3)
DEST[31:16] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE2 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE3 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE4 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE5 - SRC2_BYTE3)
DEST[47:32] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE3 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE4 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE5 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE6 - SRC2_BYTE3)
DEST[63:48] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE4 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE5 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE6 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE7 - SRC2_BYTE3)
DEST[79:64] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE5 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE6 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE7 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE8 - SRC2_BYTE3)
DEST[95:80] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE6 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE7 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE8 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE9 - SRC2_BYTE3)
DEST[111:96] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE7 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE8 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE9 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE10 - SRC2_BYTE3)
DEST[127:112] ← TEMP0 + TEMP1 + TEMP2 + TEMP3
DEST[MAXVL-1:128] ← 0


MPSADBW (128-bit Legacy SSE version)
SRC_OFFSET ← imm8[1:0]*32
DEST_OFFSET ← imm8[2]*32
DEST_BYTE0 ← DEST[DEST_OFFSET+7:DEST_OFFSET]
DEST_BYTE1 ← DEST[DEST_OFFSET+15:DEST_OFFSET+8]
DEST_BYTE2 ← DEST[DEST_OFFSET+23:DEST_OFFSET+16]
DEST_BYTE3 ← DEST[DEST_OFFSET+31:DEST_OFFSET+24]
DEST_BYTE4 ← DEST[DEST_OFFSET+39:DEST_OFFSET+32]
DEST_BYTE5 ← DEST[DEST_OFFSET+47:DEST_OFFSET+40]
DEST_BYTE6 ← DEST[DEST_OFFSET+55:DEST_OFFSET+48]
DEST_BYTE7 ← DEST[DEST_OFFSET+63:DEST_OFFSET+56]
DEST_BYTE8 ← DEST[DEST_OFFSET+71:DEST_OFFSET+64]
DEST_BYTE9 ← DEST[DEST_OFFSET+79:DEST_OFFSET+72]
DEST_BYTE10 ← DEST[DEST_OFFSET+87:DEST_OFFSET+80]
SRC_BYTE0 ← SRC[SRC_OFFSET+7:SRC_OFFSET]
SRC_BYTE1 ← SRC[SRC_OFFSET+15:SRC_OFFSET+8]
SRC_BYTE2 ← SRC[SRC_OFFSET+23:SRC_OFFSET+16]
SRC_BYTE3 ← SRC[SRC_OFFSET+31:SRC_OFFSET+24]

TEMP0 ← ABS(DEST_BYTE0 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE1 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE2 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE3 - SRC_BYTE3)
DEST[15:0] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE1 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE2 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE3 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE4 - SRC_BYTE3)
DEST[31:16] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE2 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE3 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE4 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE5 - SRC_BYTE3)
DEST[47:32] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE3 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE4 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE5 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE6 - SRC_BYTE3)
DEST[63:48] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE4 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE5 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE6 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE7 - SRC_BYTE3)
DEST[79:64] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE5 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE6 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE7 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE8 - SRC_BYTE3)
DEST[95:80] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE6 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE7 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE8 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE9 - SRC_BYTE3)
DEST[111:96] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE7 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE8 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE9 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE10 - SRC_BYTE3)
DEST[127:112] ← TEMP0 + TEMP1 + TEMP2 + TEMP3
DEST[MAXVL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

(V)MPSADBW: __m128i _mm_mpsadbw_epu8 (__m128i s1, __m128i s2, const int mask);

VMPSADBW: __m256i _mm256_mpsadbw_epu8 (__m256i s1, __m256i s2, const int mask);
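For illustration only, the 128-bit operation can also be modeled in portable C without intrinsics. This is a sketch under stated assumptions, not Intel reference code: the function name mpsadbw128_model and its array-based interface are hypothetical, element 0 of each byte array stands for bits 7:0 of the corresponding operand, and dest[i] stands for result word i.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of one 128-bit (V)MPSADBW lane (hypothetical helper).
   src1: first source operand, 16 bytes (sliding block_1).
   src2: second source operand, 16 bytes (stationary block_2).
   imm8[1:0] selects the 4-byte block in src2, imm8[2] the start in src1. */
static void mpsadbw128_model(uint16_t dest[8], const uint8_t src1[16],
                             const uint8_t src2[16], unsigned imm8)
{
    unsigned blk2 = (imm8 & 3u) * 4u;         /* byte offset: imm8[1:0]*32 bits */
    unsigned blk1 = ((imm8 >> 2) & 1u) * 4u;  /* byte offset: imm8[2]*32 bits   */

    for (unsigned i = 0; i < 8; i++) {        /* eight sliding positions */
        unsigned sum = 0;
        for (unsigned j = 0; j < 4; j++) {    /* four bytes per SAD */
            int diff = (int)src1[blk1 + i + j] - (int)src2[blk2 + j];
            sum += (unsigned)abs(diff);
        }
        dest[i] = (uint16_t)sum;              /* at most 4*255, fits in a word */
    }
}
```

The highest byte read from src1 is at index 4 + 7 + 3 = 14, which matches SRC1_BYTE10 with BLK1_OFFSET = 32 in the operation description, so the model never reads past the 16-byte operand.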

Flags Affected

None

Other Exceptions

See Exceptions Type 4; additionally

#UD If VEX.L = 1.

MUL—Unsigned Multiply


Instruction Operand Encoding

Description

Performs an unsigned multiplication of the first operand (destination operand) and the second operand (source operand) and stores the result in the destination operand. The destination operand is an implied operand located in register AL, AX, EAX, or RAX (depending on the size of the operand); the source operand is located in a general-purpose register or a memory location. The action of this instruction and the location of the result depend on the opcode and the operand size as shown in Table 4-9.

The result is stored in register AX, register pair DX:AX, register pair EDX:EAX, or register pair RDX:RAX (depending on the operand size), with the high-order bits of the product contained in register AH, DX, EDX, or RDX, respectively. If the high-order bits of the product are 0, the CF and OF flags are cleared; otherwise, the flags are set.

In 64-bit mode, the instruction’s default operation size is 32 bits. Use of the REX.R prefix permits access to additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits.

See the summary chart at the beginning of this section for encoding data and limits.

Opcode          Instruction  Op/En  64-Bit Mode  Compat/Leg Mode  Description
F6 /4           MUL r/m8     M      Valid        Valid            Unsigned multiply (AX ← AL ∗ r/m8).
REX + F6 /4     MUL r/m8*    M      Valid        N.E.             Unsigned multiply (AX ← AL ∗ r/m8).
F7 /4           MUL r/m16    M      Valid        Valid            Unsigned multiply (DX:AX ← AX ∗ r/m16).
F7 /4           MUL r/m32    M      Valid        Valid            Unsigned multiply (EDX:EAX ← EAX ∗ r/m32).
REX.W + F7 /4   MUL r/m64    M      Valid        N.E.             Unsigned multiply (RDX:RAX ← RAX ∗ r/m64).

NOTES:
* In 64-bit mode, r/m8 cannot be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.

Op/En  Operand 1      Operand 2  Operand 3  Operand 4
M      ModRM:r/m (r)  NA         NA         NA

Table 4-9. MUL Results

Operand Size  Source 1  Source 2  Destination
Byte          AL        r/m8      AX
Word          AX        r/m16     DX:AX
Doubleword    EAX       r/m32     EDX:EAX
Quadword      RAX       r/m64     RDX:RAX


Operation

IF (Byte operation)
    THEN
        AX ← AL ∗ SRC;
    ELSE (* Word, doubleword, or quadword operation *)
        IF OperandSize = 16
            THEN
                DX:AX ← AX ∗ SRC;
            ELSE IF OperandSize = 32
                THEN EDX:EAX ← EAX ∗ SRC;
                ELSE (* OperandSize = 64 *)
                    RDX:RAX ← RAX ∗ SRC;
            FI;
        FI;
FI;

Flags Affected

The OF and CF flags are set to 0 if the upper half of the result is 0; otherwise, they are set to 1. The SF, ZF, AF, and

PF flags are undefined.
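The doubleword case and its flag rule can be sketched in portable C as follows. This is a minimal model for illustration, not a description of the hardware: the type mul32_result and the function mul32_model are hypothetical names, and a single boolean stands for both CF and OF because the two flags always receive the same value here. The undefined flags (SF, ZF, AF, PF) are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Result of a modeled 32-bit MUL (hypothetical type for this sketch). */
typedef struct {
    uint32_t eax;    /* low-order 32 bits of the product  */
    uint32_t edx;    /* high-order 32 bits of the product */
    bool     cf_of;  /* value written to both CF and OF   */
} mul32_result;

/* Models EDX:EAX <- EAX * SRC for the doubleword form of MUL. */
static mul32_result mul32_model(uint32_t eax, uint32_t src)
{
    uint64_t product = (uint64_t)eax * (uint64_t)src;  /* full 64-bit product */
    mul32_result r;
    r.eax = (uint32_t)product;
    r.edx = (uint32_t)(product >> 32);
    r.cf_of = (r.edx != 0);  /* set only if the upper half is nonzero */
    return r;
}
```

For example, multiplying 0xFFFFFFFF by 2 yields EDX:EAX = 0x00000001:0xFFFFFFFE with CF and OF set, while 3 times 4 yields EDX = 0 and both flags cleared.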

Protected Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register contains a NULL segment selector.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the

current privilege level is 3.

#UD If the LOCK prefix is used.

Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If the LOCK prefix is used.

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made.

#UD If the LOCK prefix is used.

Compatibility Mode Exceptions

Same exceptions as in protected mode.

64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the memory address is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the

current privilege level is 3.

MULPD—Multiply Packed Double-Precision Floating-Point Values


Instruction Operand Encoding

Description

Multiplies packed double-precision floating-point values from the first source operand with corresponding values in the second source operand, and stores the packed double-precision floating-point results in the destination operand.

EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register. Bits (MAXVL-1:256) of the corresponding destination ZMM register are zeroed.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding destination register are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding destination ZMM register are unmodified.

Opcode                   Instruction                                       Op/En  64/32-bit Mode Support  CPUID Feature Flag  Description
66 0F 59 /r              MULPD xmm1, xmm2/m128                             A      V/V                     SSE2                Multiply packed double-precision floating-point values in xmm2/m128 with xmm1 and store result in xmm1.
VEX.128.66.0F.WIG 59 /r  VMULPD xmm1, xmm2, xmm3/m128                      B      V/V                     AVX                 Multiply packed double-precision floating-point values in xmm3/m128 with xmm2 and store result in xmm1.
VEX.256.66.0F.WIG 59 /r  VMULPD ymm1, ymm2, ymm3/m256                      B      V/V                     AVX                 Multiply packed double-precision floating-point values in ymm3/m256 with ymm2 and store result in ymm1.
EVEX.128.66.0F.W1 59 /r  VMULPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst      C      V/V                     AVX512VL AVX512F    Multiply packed double-precision floating-point values from xmm3/m128/m64bcst to xmm2 and store result in xmm1.
EVEX.256.66.0F.W1 59 /r  VMULPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst      C      V/V                     AVX512VL AVX512F    Multiply packed double-precision floating-point values from ymm3/m256/m64bcst to ymm2 and store result in ymm1.
EVEX.512.66.0F.W1 59 /r  VMULPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}  C      V/V                     AVX512F             Multiply packed double-precision floating-point values in zmm3/m512/m64bcst with zmm2 and store result in zmm1.

Op/En  Tuple Type  Operand 1         Operand 2      Operand 3      Operand 4
A      NA          ModRM:reg (r, w)  ModRM:r/m (r)  NA             NA
B      NA          ModRM:reg (w)     VEX.vvvv (r)   ModRM:r/m (r)  NA
C      Full        ModRM:reg (w)     EVEX.vvvv (r)  ModRM:r/m (r)  NA


Operation

VMULPD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1) AND SRC2 *is a register*
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN
                    DEST[i+63:i] ← SRC1[i+63:i] * SRC2[63:0]
                ELSE
                    DEST[i+63:i] ← SRC1[i+63:i] * SRC2[i+63:i]
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMULPD (VEX.256 encoded version)
DEST[63:0] ← SRC1[63:0] * SRC2[63:0]
DEST[127:64] ← SRC1[127:64] * SRC2[127:64]
DEST[191:128] ← SRC1[191:128] * SRC2[191:128]
DEST[255:192] ← SRC1[255:192] * SRC2[255:192]
DEST[MAXVL-1:256] ← 0

VMULPD (VEX.128 encoded version)
DEST[63:0] ← SRC1[63:0] * SRC2[63:0]
DEST[127:64] ← SRC1[127:64] * SRC2[127:64]
DEST[MAXVL-1:128] ← 0

MULPD (128-bit Legacy SSE version)
DEST[63:0] ← DEST[63:0] * SRC[63:0]
DEST[127:64] ← DEST[127:64] * SRC[127:64]
DEST[MAXVL-1:128] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent

VMULPD __m512d _mm512_mul_pd( __m512d a, __m512d b);

VMULPD __m512d _mm512_mask_mul_pd(__m512d s, __mmask8 k, __m512d a, __m512d b);

VMULPD __m512d _mm512_maskz_mul_pd( __mmask8 k, __m512d a, __m512d b);

VMULPD __m512d _mm512_mul_round_pd( __m512d a, __m512d b, int);

VMULPD __m512d _mm512_mask_mul_round_pd(__m512d s, __mmask8 k, __m512d a, __m512d b, int);

VMULPD __m512d _mm512_maskz_mul_round_pd( __mmask8 k, __m512d a, __m512d b, int);

VMULPD __m256d _mm256_mul_pd (__m256d a, __m256d b);

MULPD __m128d _mm_mul_pd (__m128d a, __m128d b);
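The per-element masking and broadcast behavior of the EVEX encoded form can be sketched in portable C. This is an illustrative model under stated assumptions, not a substitute for the intrinsics above: the function name vmulpd_model and its parameters are hypothetical, the rounding-mode selection (SET_RM) is not modeled, and the zeroing of bits MAXVL-1:VL is outside the scope of the arrays used here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Scalar model of the EVEX VMULPD element loop (hypothetical helper).
   kl:        number of 64-bit elements (2, 4, or 8).
   k1:        writemask; bit j controls element j. Pass 0xFF for "no writemask".
   zeroing:   true for zeroing-masking, false for merging-masking.
   broadcast: true models EVEX.b = 1 with a memory source (m64bcst),
              in which case src2[0] is used for every element. */
static void vmulpd_model(double *dest, const double *src1, const double *src2,
                         unsigned kl, uint8_t k1, bool zeroing, bool broadcast)
{
    for (unsigned j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u) {
            dest[j] = src1[j] * (broadcast ? src2[0] : src2[j]);
        } else if (zeroing) {
            dest[j] = 0.0;
        }
        /* merging-masking: dest[j] is left unchanged */
    }
}
```

With merging-masking, elements whose mask bit is clear keep their previous destination value; with zeroing-masking they become 0.0.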

SIMD Floating-Point Exceptions

Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 2.

EVEX-encoded instruction, see Exceptions Type E2.

MULPS—Multiply Packed Single-Precision Floating-Point Values


Instruction Operand Encoding

Description

Multiplies the packed single-precision floating-point values from the first source operand with the corresponding values in the second source operand, and stores the packed single-precision floating-point results in the destination operand.

EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register. Bits (MAXVL-1:256) of the corresponding destination ZMM register are zeroed.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding destination register are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding destination ZMM register are unmodified.

Opcode                Instruction                                        Op/En  64/32-bit Mode Support  CPUID Feature Flag  Description
NP 0F 59 /r           MULPS xmm1, xmm2/m128                              A      V/V                     SSE                 Multiply packed single-precision floating-point values in xmm2/m128 with xmm1 and store result in xmm1.
VEX.128.0F.WIG 59 /r  VMULPS xmm1, xmm2, xmm3/m128                       B      V/V                     AVX                 Multiply packed single-precision floating-point values in xmm3/m128 with xmm2 and store result in xmm1.
VEX.256.0F.WIG 59 /r  VMULPS ymm1, ymm2, ymm3/m256                       B      V/V                     AVX                 Multiply packed single-precision floating-point values in ymm3/m256 with ymm2 and store result in ymm1.
EVEX.128.0F.W0 59 /r  VMULPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst       C      V/V                     AVX512VL AVX512F    Multiply packed single-precision floating-point values from xmm3/m128/m32bcst to xmm2 and store result in xmm1.
EVEX.256.0F.W0 59 /r  VMULPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst       C      V/V                     AVX512VL AVX512F    Multiply packed single-precision floating-point values from ymm3/m256/m32bcst to ymm2 and store result in ymm1.
EVEX.512.0F.W0 59 /r  VMULPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst {er}  C      V/V                     AVX512F             Multiply packed single-precision floating-point values in zmm3/m512/m32bcst with zmm2 and store result in zmm1.

Op/En  Tuple Type  Operand 1         Operand 2      Operand 3      Operand 4
A      NA          ModRM:reg (r, w)  ModRM:r/m (r)  NA             NA
B      NA          ModRM:reg (w)     VEX.vvvv (r)   ModRM:r/m (r)  NA
C      Full        ModRM:reg (w)     EVEX.vvvv (r)  ModRM:r/m (r)  NA


Operation

VMULPS (EVEX encoded version)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1) AND SRC2 *is a register*
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN
                    DEST[i+31:i] ← SRC1[i+31:i] * SRC2[31:0]
                ELSE
                    DEST[i+31:i] ← SRC1[i+31:i] * SRC2[i+31:i]
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMULPS (VEX.256 encoded version)
DEST[31:0] ← SRC1[31:0] * SRC2[31:0]
DEST[63:32] ← SRC1[63:32] * SRC2[63:32]
DEST[95:64] ← SRC1[95:64] * SRC2[95:64]
DEST[127:96] ← SRC1[127:96] * SRC2[127:96]
DEST[159:128] ← SRC1[159:128] * SRC2[159:128]
DEST[191:160] ← SRC1[191:160] * SRC2[191:160]
DEST[223:192] ← SRC1[223:192] * SRC2[223:192]
DEST[255:224] ← SRC1[255:224] * SRC2[255:224]
DEST[MAXVL-1:256] ← 0

VMULPS (VEX.128 encoded version)
DEST[31:0] ← SRC1[31:0] * SRC2[31:0]
DEST[63:32] ← SRC1[63:32] * SRC2[63:32]
DEST[95:64] ← SRC1[95:64] * SRC2[95:64]
DEST[127:96] ← SRC1[127:96] * SRC2[127:96]
DEST[MAXVL-1:128] ← 0

MULPS (128-bit Legacy SSE version)
DEST[31:0] ← SRC1[31:0] * SRC2[31:0]
DEST[63:32] ← SRC1[63:32] * SRC2[63:32]
DEST[95:64] ← SRC1[95:64] * SRC2[95:64]
DEST[127:96] ← SRC1[127:96] * SRC2[127:96]
DEST[MAXVL-1:128] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent

VMULPS __m512 _mm512_mul_ps( __m512 a, __m512 b);

VMULPS __m512 _mm512_mask_mul_ps(__m512 s, __mmask16 k, __m512 a, __m512 b);

VMULPS __m512 _mm512_maskz_mul_ps(__mmask16 k, __m512 a, __m512 b);

VMULPS __m512 _mm512_mul_round_ps( __m512 a, __m512 b, int);

VMULPS __m512 _mm512_mask_mul_round_ps(__m512 s, __mmask16 k, __m512 a, __m512 b, int);

VMULPS __m512 _mm512_maskz_mul_round_ps(__mmask16 k, __m512 a, __m512 b, int);

VMULPS __m256 _mm256_mask_mul_ps(__m256 s, __mmask8 k, __m256 a, __m256 b);

VMULPS __m256 _mm256_maskz_mul_ps(__mmask8 k, __m256 a, __m256 b);

VMULPS __m128 _mm_mask_mul_ps(__m128 s, __mmask8 k, __m128 a, __m128 b);

VMULPS __m128 _mm_maskz_mul_ps(__mmask8 k, __m128 a, __m128 b);

VMULPS __m256 _mm256_mul_ps (__m256 a, __m256 b);

MULPS __m128 _mm_mul_ps (__m128 a, __m128 b);

SIMD Floating-Point Exceptions

Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 2.

EVEX-encoded instruction, see Exceptions Type E2.

MULSD—Multiply Scalar Double-Precision Floating-Point Value


Instruction Operand Encoding

Description

Multiplies the low double-precision floating-point value in the second source operand by the low double-precision floating-point value in the first source operand, and stores the double-precision floating-point result in the destination operand. The second source operand can be an XMM register or a 64-bit memory location. The first source operand and the destination operand are XMM registers.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (MAXVL-1:64) of the corresponding destination register remain unchanged.

VEX.128 and EVEX encoded version: The quadword at bits 127:64 of the destination operand is copied from the same bits of the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.

EVEX encoded version: The low quadword element of the destination operand is updated according to the writemask.

Software should ensure VMULSD is encoded with VEX.L=0. Encoding VMULSD with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Opcode/Instruction Op/En 64/32 bit Mode Support CPUID Feature Flag Description

F2 0F 59 /r MULSD xmm1, xmm2/m64 A V/V SSE2 Multiply the low double-precision floating-point value in xmm2/m64 by low double-precision floating-point value in xmm1.

VEX.LIG.F2.0F.WIG 59 /r VMULSD xmm1, xmm2, xmm3/m64 B V/V AVX Multiply the low double-precision floating-point value in xmm3/m64 by low double-precision floating-point value in xmm2.

EVEX.LIG.F2.0F.W1 59 /r VMULSD xmm1 {k1}{z}, xmm2, xmm3/m64 {er} C V/V AVX512F Multiply the low double-precision floating-point value in xmm3/m64 by low double-precision floating-point value in xmm2.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (r, w) ModRM:r/m (r) NA NA

B NA ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) NA

C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA

Operation

VMULSD (EVEX encoded version)

IF (EVEX.b = 1) AND SRC2 *is a register*
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← SRC1[63:0] * SRC2[63:0]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0

VMULSD (VEX.128 encoded version)

DEST[63:0] ← SRC1[63:0] * SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0

MULSD (128-bit Legacy SSE version)

DEST[63:0] ← DEST[63:0] * SRC[63:0]
DEST[MAXVL-1:64] (Unmodified)
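
The EVEX-encoded operation above can be modeled in portable C. The following is only a sketch under stated assumptions: `vec128d` and `vmulsd_model` are hypothetical names introduced for illustration, the struct stands in for the low 128 bits of an XMM register, and rounding control and floating-point exceptions are ignored.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the low 128 bits of an XMM register. */
typedef struct { double lane[2]; } vec128d;

/* Models VMULSD xmm1 {k1}{z}, xmm2, xmm3 (EVEX form, default rounding).
   dest    : prior contents of xmm1
   k0      : bit 0 of the writemask (pass true for "no writemask")
   zeroing : true for zeroing-masking, false for merging-masking */
static vec128d vmulsd_model(vec128d dest, bool k0, bool zeroing,
                            vec128d src1, vec128d src2)
{
    vec128d r;
    if (k0)
        r.lane[0] = src1.lane[0] * src2.lane[0];
    else if (zeroing)
        r.lane[0] = 0.0;
    else
        r.lane[0] = dest.lane[0];      /* merging: low element unchanged */
    r.lane[1] = src1.lane[1];          /* DEST[127:64] <- SRC1[127:64] */
    return r;
}
```

With the mask bit clear, the model returns either the old low element (merging) or zero (zeroing), while the upper quadword always comes from the first source operand.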

Intel C/C++ Compiler Intrinsic Equivalent

VMULSD __m128d _mm_mask_mul_sd(__m128d s, __mmask8 k, __m128d a, __m128d b);

VMULSD __m128d _mm_maskz_mul_sd( __mmask8 k, __m128d a, __m128d b);

VMULSD __m128d _mm_mul_round_sd( __m128d a, __m128d b, int);

VMULSD __m128d _mm_mask_mul_round_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, int);

VMULSD __m128d _mm_maskz_mul_round_sd( __mmask8 k, __m128d a, __m128d b, int);

MULSD __m128d _mm_mul_sd (__m128d a, __m128d b);

SIMD Floating-Point Exceptions

Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 3.

EVEX-encoded instruction, see Exceptions Type E3.

MULSS—Multiply Scalar Single-Precision Floating-Point Values

Instruction Operand Encoding

Description

Multiplies the low single-precision floating-point value from the second source operand by the low single-precision floating-point value in the first source operand, and stores the single-precision floating-point result in the destination operand. The second source operand can be an XMM register or a 32-bit memory location. The first source operand and the destination operand are XMM registers.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (MAXVL-1:32) of the corresponding destination register remain unchanged.

VEX.128 and EVEX encoded version: The first source operand is an XMM register encoded by VEX.vvvv. The three high-order doublewords of the destination operand are copied from the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.

EVEX encoded version: The low doubleword element of the destination operand is updated according to the writemask.

Software should ensure VMULSS is encoded with VEX.L=0. Encoding VMULSS with VEX.L=1 may result in unpredictable behavior across different processor generations.

Opcode/Instruction Op/En 64/32 bit Mode Support CPUID Feature Flag Description

F3 0F 59 /r MULSS xmm1, xmm2/m32 A V/V SSE Multiply the low single-precision floating-point value in xmm2/m32 by the low single-precision floating-point value in xmm1.

VEX.LIG.F3.0F.WIG 59 /r VMULSS xmm1, xmm2, xmm3/m32 B V/V AVX Multiply the low single-precision floating-point value in xmm3/m32 by the low single-precision floating-point value in xmm2.

EVEX.LIG.F3.0F.W0 59 /r VMULSS xmm1 {k1}{z}, xmm2, xmm3/m32 {er} C V/V AVX512F Multiply the low single-precision floating-point value in xmm3/m32 by the low single-precision floating-point value in xmm2.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (r, w) ModRM:r/m (r) NA NA

B NA ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) NA

C Tuple1 Scalar ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA

Operation

VMULSS (EVEX encoded version)

IF (EVEX.b = 1) AND SRC2 *is a register*
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← SRC1[31:0] * SRC2[31:0]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0

VMULSS (VEX.128 encoded version)

DEST[31:0] ← SRC1[31:0] * SRC2[31:0]
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0

MULSS (128-bit Legacy SSE version)

DEST[31:0] ← DEST[31:0] * SRC[31:0]
DEST[MAXVL-1:32] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VMULSS __m128 _mm_mask_mul_ss(__m128 s, __mmask8 k, __m128 a, __m128 b);

VMULSS __m128 _mm_maskz_mul_ss( __mmask8 k, __m128 a, __m128 b);

VMULSS __m128 _mm_mul_round_ss( __m128 a, __m128 b, int);

VMULSS __m128 _mm_mask_mul_round_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, int);

VMULSS __m128 _mm_maskz_mul_round_ss( __mmask8 k, __m128 a, __m128 b, int);

MULSS __m128 _mm_mul_ss(__m128 a, __m128 b);

SIMD Floating-Point Exceptions

Underflow, Overflow, Invalid, Precision, Denormal

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 3.

EVEX-encoded instruction, see Exceptions Type E3.

MULX — Unsigned Multiply Without Affecting Flags

Instruction Operand Encoding

Description

Performs an unsigned multiplication of the implicit source operand (EDX/RDX) and the specified source operand (the third operand). The low half of the result is stored in the second destination operand (second operand) and the high half in the first destination operand (first operand), without reading or writing the arithmetic flags. This lets software interleave add-with-carry operations and multiplications efficiently.

If the first and second operands are identical, that register will contain the high half of the multiplication result.

This instruction is not supported in real mode and virtual-8086 mode. The operand size is always 32 bits if not in 64-bit mode. In 64-bit mode, a 64-bit operand size requires VEX.W1. VEX.W1 is ignored in non-64-bit modes. An attempt to execute this instruction with VEX.L not equal to 0 will cause #UD.

Operation

// DEST1: ModRM:reg
// DEST2: VEX.vvvv
IF (OperandSize = 32)
    SRC1 ← EDX;
    DEST2 ← (SRC1*SRC2)[31:0];
    DEST1 ← (SRC1*SRC2)[63:32];
ELSE IF (OperandSize = 64)
    SRC1 ← RDX;
    DEST2 ← (SRC1*SRC2)[63:0];
    DEST1 ← (SRC1*SRC2)[127:64];
FI;
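
The split of the product into low and high halves can be sketched in C. This is an illustrative model rather than the compiler intrinsic: `mulx32_model` and `mulx64_model` are hypothetical names, and the 64-bit version assumes the `unsigned __int128` extension available in GCC and Clang.

```c
#include <stdint.h>

/* 32-bit form: returns the low half (DEST2) and stores the
   high half (DEST1) through hi. edx models the implicit source. */
static uint32_t mulx32_model(uint32_t edx, uint32_t src2, uint32_t *hi)
{
    uint64_t product = (uint64_t)edx * src2;
    *hi = (uint32_t)(product >> 32);
    return (uint32_t)product;
}

/* 64-bit form. unsigned __int128 is a GCC/Clang extension, not ISO C. */
static uint64_t mulx64_model(uint64_t rdx, uint64_t src2, uint64_t *hi)
{
    unsigned __int128 product = (unsigned __int128)rdx * src2;
    *hi = (uint64_t)(product >> 64);
    return (uint64_t)product;
}
```

Neither function touches any carry state, which mirrors the point of the instruction: a carry chain built from separate add-with-carry steps is not disturbed by the multiplications in between.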

Flags Affected

None

Intel C/C++ Compiler Intrinsic Equivalent

Auto-generated from high-level language when possible.

unsigned int _mulx_u32(unsigned int a, unsigned int b, unsigned int * hi);

unsigned __int64 _mulx_u64(unsigned __int64 a, unsigned __int64 b, unsigned __int64 * hi);

SIMD Floating-Point Exceptions

None

Opcode/Instruction Op/En 64/32-bit Mode CPUID Feature Flag Description

VEX.LZ.F2.0F38.W0 F6 /r MULX r32a, r32b, r/m32 RVM V/V BMI2 Unsigned multiply of r/m32 with EDX without affecting arithmetic flags.

VEX.LZ.F2.0F38.W1 F6 /r MULX r64a, r64b, r/m64 RVM V/N.E. BMI2 Unsigned multiply of r/m64 with RDX without affecting arithmetic flags.

Op/En Operand 1 Operand 2 Operand 3 Operand 4

RVM ModRM:reg (w) VEX.vvvv (w) ModRM:r/m (r) RDX/EDX is implied 64/32 bits source

Other Exceptions

See Section 2.5.1, “Exception Conditions for VEX-Encoded GPR Instructions”, Table 2-29.

MWAIT—Monitor Wait

Instruction Operand Encoding

Description

The MWAIT instruction provides hints to allow the processor to enter an implementation-dependent optimized state. There are two principal targeted usages: address-range monitor and advanced power management. Both usages of MWAIT require the use of the MONITOR instruction.

CPUID.01H:ECX.MONITOR[bit 3] indicates the availability of MONITOR and MWAIT in the processor. When set, MWAIT may be executed only at privilege level 0 (use at any other privilege level results in an invalid-opcode exception). The operating system or system BIOS may disable this instruction by using the IA32_MISC_ENABLE MSR; disabling MWAIT clears the CPUID feature flag and causes execution to generate an invalid-opcode exception.

This instruction’s operation is the same in non-64-bit modes and 64-bit mode.

ECX specifies optional extensions for the MWAIT instruction. EAX may contain hints such as the preferred optimized state the processor should enter. The first processors to implement MWAIT supported only the zero value for EAX and ECX. Later processors allowed setting ECX[0] to enable masked interrupts as break events for MWAIT (see below). Software can use the CPUID instruction to determine the extensions and hints supported by the processor.

MWAIT for Address Range Monitoring

For address-range monitoring, the MWAIT instruction operates with the MONITOR instruction. The two instructions allow the definition of an address at which to wait (MONITOR) and an implementation-dependent-optimized operation to commence at the wait address (MWAIT). The execution of MWAIT is a hint to the processor that it can enter an implementation-dependent-optimized state while waiting for an event or a store operation to the address range armed by MONITOR.

The following cause the processor to exit the implementation-dependent-optimized state: a store to the address range armed by the MONITOR instruction, an NMI or SMI, a debug exception, a machine check exception, the BINIT# signal, the INIT# signal, and the RESET# signal. Other implementation-dependent events may also cause the processor to exit the implementation-dependent-optimized state.

In addition, an external interrupt causes the processor to exit the implementation-dependent-optimized state either (1) if the interrupt would be delivered to software (e.g., as it would be if HLT had been executed instead of MWAIT); or (2) if ECX[0] = 1. Software can execute MWAIT with ECX[0] = 1 only if CPUID.05H:ECX[bit 1] = 1. (Implementation-specific conditions may result in an interrupt causing the processor to exit the implementation-dependent-optimized state even if interrupts are masked and ECX[0] = 0.)

Following exit from the implementation-dependent-optimized state, control passes to the instruction following the MWAIT instruction. A pending interrupt that is not masked (including an NMI or an SMI) may be delivered before execution of that instruction. Unlike the HLT instruction, the MWAIT instruction does not support a restart at the MWAIT instruction following the handling of an SMI.

If the preceding MONITOR instruction did not successfully arm an address range or if the MONITOR instruction has not been executed prior to executing MWAIT, then the processor will not enter the implementation-dependent-optimized state. Execution will resume at the instruction following the MWAIT.

Opcode Instruction Op/En 64-Bit Mode Compat/Leg Mode Description

0F 01 C9 MWAIT ZO Valid Valid A hint that allows the processor to stop instruction execution and enter an implementation-dependent optimized state until occurrence of a class of events.

Op/En Operand 1 Operand 2 Operand 3 Operand 4

ZO NA NA NA NA

MWAIT for Power Management

MWAIT accepts a hint and optional extension to the processor that it can enter a specified target C state while waiting for an event or a store operation to the address range armed by MONITOR. Support for MWAIT extensions for power management is indicated by CPUID.05H:ECX[bit 0] reporting 1.

EAX and ECX are used to communicate the additional information to the MWAIT instruction, such as the kind of optimized state the processor should enter. ECX specifies optional extensions for the MWAIT instruction. EAX may contain hints such as the preferred optimized state the processor should enter. Implementation-specific conditions may cause a processor to ignore the hint and enter a different optimized state. Future processor implementations may implement several optimized “waiting” states and will select among those states based on the hint argument.

Table 4-10 describes the meaning of the ECX register and Table 4-11 the meaning of the EAX register for MWAIT extensions.

Note that if MWAIT is used to enter any of the C-states that are numerically higher than C1, a store to the address range armed by the MONITOR instruction will cause the processor to exit MWAIT only if the store was originated by other processor agents. A store from a non-processor agent might not cause the processor to exit MWAIT in such cases.

For additional details of MWAIT extensions, see Chapter 14, “Power and Thermal Management,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

Operation

(* MWAIT takes the argument in EAX as a hint extension and is architected to take the argument in ECX as an instruction extension MWAIT EAX, ECX *)
{
    WHILE ( (“Monitor Hardware is in armed state”)) {
        implementation_dependent_optimized_state(EAX, ECX); }
    Set the state of Monitor Hardware as triggered;
}

Intel C/C++ Compiler Intrinsic Equivalent

MWAIT: void _mm_mwait(unsigned extensions, unsigned hints)

Table 4-10. MWAIT Extension Register (ECX)

Bits Description

0 Treat interrupts as break events even if masked (e.g., even if EFLAGS.IF=0). May be set only if CPUID.05H:ECX[bit 1] = 1.

31:1 Reserved

Table 4-11. MWAIT Hints Register (EAX)

Bits Description

3:0 Sub C-state within a C-state, indicated by bits [7:4]

7:4 Target C-state*. Value of 0 means C1; 1 means C2 and so on. Value of 01111B means C0.

31:8 Reserved

* Note: Target C states for MWAIT extensions are processor-specific C-states, not ACPI C-states.
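
The register layouts in Tables 4-10 and 4-11 can be packed with a pair of small helpers. This is a sketch only: `mwait_eax_hint` and `mwait_ecx_ext` are hypothetical names, the caller is assumed to pass a target C-state in the range 0 to 15, and whether a given processor honors a particular hint is outside the scope of the code.

```c
#include <stdint.h>

/* Builds the EAX hint value.
   target_cstate: 0 for C0, 1 for C1, 2 for C2, ... (processor-specific
                  C-states, not ACPI C-states); assumed to be 0..15.
   sub_state:     sub C-state within the target C-state (bits 3:0). */
static uint32_t mwait_eax_hint(unsigned target_cstate, unsigned sub_state)
{
    /* Field encoding: C1 -> 0, C2 -> 1, ..., and C0 -> 0xF (wraps from -1). */
    uint32_t field = (target_cstate - 1u) & 0xFu;
    return (field << 4) | (sub_state & 0xFu);
}

/* Builds the ECX extension value. Bit 0 asks for masked interrupts to be
   treated as break events; bits 31:1 are reserved and stay zero. */
static uint32_t mwait_ecx_ext(int interrupts_break_when_masked)
{
    return interrupts_break_when_masked ? 1u : 0u;
}
```

Bit 0 of the extension value should be requested only after confirming CPUID.05H:ECX[bit 1] = 1, as the table states.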

Example

The MONITOR/MWAIT instruction pair must be coded in the same loop because execution of the MWAIT instruction will trigger the monitor hardware. It is not proper usage to execute MONITOR once and then execute MWAIT in a loop. Setting up MONITOR without executing MWAIT has no adverse effects.

Typically the MONITOR/MWAIT pair is used in a sequence, such as:

EAX = Logical Address(Trigger)
ECX = 0 (* Extensions *)
EDX = 0 (* Hints *)
IF ( !trigger_store_happened ) {
    MONITOR EAX, ECX, EDX
    IF ( !trigger_store_happened ) {
        MWAIT EAX, ECX
    }
}

The second check catches a triggering store that lands between the first check of the trigger and the execution of the MONITOR instruction; without it, that store would go unnoticed. Typical usage of MONITOR and MWAIT would have the above code sequence within a loop.
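
The control flow of that sequence, placed within a loop, might look like the following C sketch. It is an assumption-laden illustration: `wait_ops`, `wait_for_store`, and the `demo_*` stand-ins are hypothetical names, and the two callbacks take the place of MONITOR and MWAIT so that the loop can be exercised outside privilege level 0. In ring-0 code the callbacks would wrap the real instructions instead.

```c
#include <stdbool.h>

/* Hypothetical hooks standing in for the MONITOR and MWAIT instructions. */
typedef struct {
    void (*arm_monitor)(const volatile void *addr);
    void (*wait)(void);
} wait_ops;

/* Re-arms the monitor on every iteration and re-checks the trigger
   after arming, so a store that lands before MWAIT is not missed. */
static void wait_for_store(const volatile bool *trigger, const wait_ops *ops)
{
    while (!*trigger) {
        ops->arm_monitor(trigger);   /* MONITOR: arm the address range */
        if (!*trigger)               /* second check, after arming */
            ops->wait();             /* MWAIT: wait for a store or break event */
    }
}

/* Stand-ins used only to exercise the loop in user mode. */
static volatile bool demo_trigger;
static int demo_arm_calls, demo_wait_calls;

static void demo_arm(const volatile void *addr)
{
    (void)addr;
    demo_arm_calls++;
}

static void demo_wait(void)
{
    demo_wait_calls++;
    demo_trigger = true;             /* simulate the triggering store */
}
```

Because MWAIT can return for reasons other than the monitored store, the outer `while` re-tests the trigger rather than assuming the store happened.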

Numeric Exceptions

None

Protected Mode Exceptions

#GP(0) If ECX[31:1] ≠ 0.

If ECX[0] = 1 and CPUID.05H:ECX[bit 1] = 0.

#UD If CPUID.01H:ECX.MONITOR[bit 3] = 0.

If current privilege level is not 0.

Real Address Mode Exceptions

#GP If ECX[31:1] ≠ 0.

If ECX[0] = 1 and CPUID.05H:ECX[bit 1] = 0.

#UD If CPUID.01H:ECX.MONITOR[bit 3] = 0.

Virtual 8086 Mode Exceptions

#UD The MWAIT instruction is not recognized in virtual-8086 mode (even if

CPUID.01H:ECX.MONITOR[bit 3] = 1).

Compatibility Mode Exceptions

Same exceptions as in protected mode.

64-Bit Mode Exceptions

#GP(0) If RCX[63:1] ≠ 0.

If RCX[0] = 1 and CPUID.05H:ECX[bit 1] = 0.

#UD If the current privilege level is not 0.

If CPUID.01H:ECX.MONITOR[bit 3] = 0.

NEG—Two's Complement Negation

Instruction Operand Encoding

Description

Replaces the value of the operand (the destination operand) with its two's complement. (This operation is equivalent to subtracting the operand from 0.) The destination operand is located in a general-purpose register or a memory location.

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically.

In 64-bit mode, the instruction’s default operation size is 32 bits. Using a REX prefix in the form of REX.R permits access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.

Operation

IF DEST = 0

THEN CF ← 0;

ELSE CF ← 1;

FI;

DEST ← [– (DEST)]

Flags Affected

The CF flag is set to 0 if the source operand is 0; otherwise it is set to 1. The OF, SF, ZF, AF, and PF flags are set according to the result.
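
As a rough illustration of the operation and its main flag results, the 32-bit form can be modeled in C. The names `neg32_out` and `neg32_model` are hypothetical, and the sketch leaves out AF and PF for brevity.

```c
#include <stdint.h>
#include <stdbool.h>

/* Result and a subset of the flags produced by a 32-bit NEG. */
typedef struct { uint32_t result; bool cf, zf, sf, of; } neg32_out;

static neg32_out neg32_model(uint32_t dest)
{
    neg32_out o;
    o.result = 0u - dest;              /* two's complement: 0 - DEST */
    o.cf = (dest != 0);                /* CF is 0 only when the operand is 0 */
    o.zf = (o.result == 0);
    o.sf = (o.result >> 31) != 0;
    o.of = (dest == 0x80000000u);      /* the most negative value has no positive counterpart */
    return o;
}
```

Negating 80000000H returns the same bit pattern, which is why that single input is the overflow case in this model.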

Protected Mode Exceptions

#GP(0) If the destination is located in a non-writable segment.

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register contains a NULL segment selector.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the

current privilege level is 3.

#UD If the LOCK prefix is used but the destination is not a memory operand.

Opcode Instruction Op/En 64-Bit Mode Compat/Leg Mode Description

F6 /3 NEG r/m8 M Valid Valid Two's complement negate r/m8.

REX + F6 /3 NEG r/m8* M Valid N.E. Two's complement negate r/m8.

F7 /3 NEG r/m16 M Valid Valid Two's complement negate r/m16.

F7 /3 NEG r/m32 M Valid Valid Two's complement negate r/m32.

REX.W + F7 /3 NEG r/m64 M Valid N.E. Two's complement negate r/m64.

NOTES:

* In 64-bit mode, r/m8 can not be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.

Op/En Operand 1 Operand 2 Operand 3 Operand 4

M ModRM:r/m (r, w) NA NA NA

Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If the LOCK prefix is used but the destination is not a memory operand.

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made.

#UD If the LOCK prefix is used but the destination is not a memory operand.

Compatibility Mode Exceptions

Same as for protected mode exceptions.

64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the memory address is in a non-canonical form.

#PF(fault-code) For a page fault.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the

current privilege level is 3.

#UD If the LOCK prefix is used but the destination is not a memory operand.

NOP—No Operation

Instruction Operand Encoding

Description

This instruction performs no operation. It is a one-byte or multi-byte NOP that takes up space in the instruction stream but does not impact machine context, except for the EIP register.

The multi-byte form of NOP is available on processors with model encoding:

• CPUID.01H.EAX[Bits 11:8] = 0110B or 1111B

The multi-byte NOP instruction does not alter the content of a register and will not issue a memory operation. The instruction’s operation is the same in non-64-bit modes and 64-bit mode.

Operation

The one-byte NOP instruction is an alias mnemonic for the XCHG (E)AX, (E)AX instruction.

The multi-byte NOP instruction performs no operation on supported processors and generates an undefined-opcode exception on processors that do not support the multi-byte NOP instruction.

The memory operand form of the instruction allows software to create a byte sequence of “no operation” as one instruction. For situations where multiple-byte NOPs are needed, the recommended operations (32-bit mode and 64-bit mode) are listed in Table 4-12.

Flags Affected

None

Exceptions (All Operating Modes)

#UD If the LOCK prefix is used.

Opcode Instruction Op/En 64-Bit Mode Compat/Leg Mode Description

NP 90 NOP ZO Valid Valid One byte no-operation instruction.

NP 0F 1F /0 NOP r/m16 M Valid Valid Multi-byte no-operation instruction.

NP 0F 1F /0 NOP r/m32 M Valid Valid Multi-byte no-operation instruction.

Op/En Operand 1 Operand 2 Operand 3 Operand 4

ZO NA NA NA NA

M ModRM:r/m (r) NA NA NA

Table 4-12. Recommended Multi-Byte Sequence of NOP Instruction

Length Assembly Byte Sequence

2 bytes 66 NOP 66 90H

3 bytes NOP DWORD ptr [EAX] 0F 1F 00H

4 bytes NOP DWORD ptr [EAX + 00H] 0F 1F 40 00H

5 bytes NOP DWORD ptr [EAX + EAX*1 + 00H] 0F 1F 44 00 00H

6 bytes 66 NOP DWORD ptr [EAX + EAX*1 + 00H] 66 0F 1F 44 00 00H

7 bytes NOP DWORD ptr [EAX + 00000000H] 0F 1F 80 00 00 00 00H

8 bytes NOP DWORD ptr [EAX + EAX*1 + 00000000H] 0F 1F 84 00 00 00 00 00H

9 bytes 66 NOP DWORD ptr [EAX + EAX*1 + 00000000H] 66 0F 1F 84 00 00 00 00 00H
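
A code generator that pads to an alignment boundary could draw on Table 4-12 as in the C sketch below. The names `nop_seq` and `fill_nops` are hypothetical, and the greedy policy of emitting the longest sequence first is an assumption of this example, not a recommendation taken from the table. The one-byte entry is the 90H opcode from the NOP opcode table.

```c
#include <stddef.h>
#include <string.h>

/* nop_seq[n] holds the n-byte sequence (index 0 is unused). */
static const unsigned char nop_seq[10][9] = {
    {0},
    {0x90},
    {0x66, 0x90},
    {0x0F, 0x1F, 0x00},
    {0x0F, 0x1F, 0x40, 0x00},
    {0x0F, 0x1F, 0x44, 0x00, 0x00},
    {0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00},
    {0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00},
    {0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
    {0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
};

/* Fills buf[0..len) with NOP instructions, longest sequence first. */
static void fill_nops(unsigned char *buf, size_t len)
{
    while (len > 0) {
        size_t n = len < 9 ? len : 9;
        memcpy(buf, nop_seq[n], n);
        buf += n;
        len -= n;
    }
}
```

For a 12-byte gap this emits the 9-byte sequence followed by the 3-byte sequence. Code that must also run on processors without multi-byte NOP support would need a different fallback, since those processors raise an undefined-opcode exception for 0F 1F.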

NOT—One's Complement Negation

Instruction Operand Encoding

Description

Performs a bitwise NOT operation (each 1 is set to 0, and each 0 is set to 1) on the destination operand and stores the result in the destination operand location. The destination operand can be a register or a memory location.

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically.

In 64-bit mode, the instruction’s default operation size is 32 bits. Using a REX prefix in the form of REX.R permits access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.

Operation

DEST ← NOT DEST;

Flags Affected

None

Protected Mode Exceptions

#GP(0) If the destination operand points to a non-writable segment.

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register contains a NULL segment selector.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the

current privilege level is 3.

#UD If the LOCK prefix is used but the destination is not a memory operand.

Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If the LOCK prefix is used but the destination is not a memory operand.

Opcode Instruction Op/En 64-Bit Mode Compat/Leg Mode Description

F6 /2 NOT r/m8 M Valid Valid Reverse each bit of r/m8.

REX + F6 /2 NOT r/m8* M Valid N.E. Reverse each bit of r/m8.

F7 /2 NOT r/m16 M Valid Valid Reverse each bit of r/m16.

F7 /2 NOT r/m32 M Valid Valid Reverse each bit of r/m32.

REX.W + F7 /2 NOT r/m64 M Valid N.E. Reverse each bit of r/m64.

NOTES:

* In 64-bit mode, r/m8 can not be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.

Op/En Operand 1 Operand 2 Operand 3 Operand 4

M ModRM:r/m (r, w) NA NA NA

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made.

#UD If the LOCK prefix is used but the destination is not a memory operand.

Compatibility Mode Exceptions

Same as for protected mode exceptions.

64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the memory address is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the

current privilege level is 3.

#UD If the LOCK prefix is used but the destination is not a memory operand.

OR—Logical Inclusive OR

Instruction Operand Encoding

Description

Performs a bitwise inclusive OR operation between the destination (first) and source (second) operands and stores the result in the destination operand location. The source operand can be an immediate, a register, or a memory location; the destination operand can be a register or a memory location. (However, two memory operands cannot be used in one instruction.) Each bit of the result of the OR instruction is set to 0 if both corresponding bits of the first and second operands are 0; otherwise, each bit is set to 1.

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically.

Opcode Instruction Op/En 64-Bit Mode Compat/Leg Mode Description

0C ib OR AL, imm8 I Valid Valid AL OR imm8.

0D iw OR AX, imm16 I Valid Valid AX OR imm16.

0D id OR EAX, imm32 I Valid Valid EAX OR imm32.

REX.W + 0D id OR RAX, imm32 I Valid N.E. RAX OR imm32 (sign-extended).

80 /1 ib OR r/m8, imm8 MI Valid Valid r/m8 OR imm8.

REX + 80 /1 ib OR r/m8*, imm8 MI Valid N.E. r/m8 OR imm8.

81 /1 iw OR r/m16, imm16 MI Valid Valid r/m16 OR imm16.

81 /1 id OR r/m32, imm32 MI Valid Valid r/m32 OR imm32.

REX.W + 81 /1 id OR r/m64, imm32 MI Valid N.E. r/m64 OR imm32 (sign-extended).

83 /1 ib OR r/m16, imm8 MI Valid Valid r/m16 OR imm8 (sign-extended).

83 /1 ib OR r/m32, imm8 MI Valid Valid r/m32 OR imm8 (sign-extended).

REX.W + 83 /1 ib OR r/m64, imm8 MI Valid N.E. r/m64 OR imm8 (sign-extended).

08 /r OR r/m8, r8 MR Valid Valid r/m8 OR r8.

REX + 08 /r OR r/m8*, r8* MR Valid N.E. r/m8 OR r8.

09 /r OR r/m16, r16 MR Valid Valid r/m16 OR r16.

09 /r OR r/m32, r32 MR Valid Valid r/m32 OR r32.

REX.W + 09 /r OR r/m64, r64 MR Valid N.E. r/m64 OR r64.

0A /r OR r8, r/m8 RM Valid Valid r8 OR r/m8.

REX + 0A /r OR r8*, r/m8* RM Valid N.E. r8 OR r/m8.

0B /r OR r16, r/m16 RM Valid Valid r16 OR r/m16.

0B /r OR r32, r/m32 RM Valid Valid r32 OR r/m32.

REX.W + 0B /r OR r64, r/m64 RM Valid N.E. r64 OR r/m64.

NOTES:

* In 64-bit mode, r/m8 can not be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.

Op/En Operand 1 Operand 2 Operand 3 Operand 4

I AL/AX/EAX/RAX imm8/16/32 NA NA

MI ModRM:r/m (r, w) imm8/16/32 NA NA

MR ModRM:r/m (r, w) ModRM:reg (r) NA NA

RM ModRM:reg (r, w) ModRM:r/m (r) NA NA

In 64-bit mode, the instruction’s default operation size is 32 bits. Using a REX prefix in the form of REX.R permits access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.

Operation

DEST ← DEST OR SRC;

Flags Affected

The OF and CF flags are cleared; the SF, ZF, and PF flags are set according to the result. The state of the AF flag is

undefined.
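
The sign extension of the immediate in the REX.W forms and the resulting flags can be illustrated with a short C model. This is a sketch with hypothetical names (`or64_out`, `or64_imm32_model`). It models OR r/m64, imm32 only and omits AF, whose state is undefined.

```c
#include <stdint.h>
#include <stdbool.h>

/* Result and defined flags of a 64-bit OR. */
typedef struct { uint64_t result; bool cf, of, zf, sf, pf; } or64_out;

/* Models OR r/m64, imm32: the immediate is sign-extended to 64 bits. */
static or64_out or64_imm32_model(uint64_t dest, int32_t imm32)
{
    or64_out o;
    unsigned low;

    o.result = dest | (uint64_t)(int64_t)imm32;
    o.cf = false;                       /* CF and OF are always cleared */
    o.of = false;
    o.zf = (o.result == 0);
    o.sf = (o.result >> 63) != 0;

    /* PF is set when the low byte holds an even number of 1 bits. */
    low = (unsigned)(o.result & 0xFFu);
    low ^= low >> 4;
    low ^= low >> 2;
    low ^= low >> 1;
    o.pf = (low & 1u) == 0;
    return o;
}
```

An immediate with bit 31 set therefore sets the upper 32 bits of the destination as well, which is easy to overlook when a mask is written as a 32-bit constant.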

Protected Mode Exceptions

#GP(0) If the destination operand points to a non-writable segment.

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register contains a NULL segment selector.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the

current privilege level is 3.

#UD If the LOCK prefix is used but the destination is not a memory operand.

Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If the LOCK prefix is used but the destination is not a memory operand.

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made.

#UD If the LOCK prefix is used but the destination is not a memory operand.

Compatibility Mode Exceptions

Same as for protected mode exceptions.

64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the memory address is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the

current privilege level is 3.

#UD If the LOCK prefix is used but the destination is not a memory operand.

ORPD—Bitwise Logical OR of Packed Double Precision Floating-Point Values

Instruction Operand Encoding

Description

Performs a bitwise logical OR of the two, four or eight packed double-precision floating-point values from the first source operand and the second source operand, and stores the result in the destination operand.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are zeroed.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.

Opcode/Instruction Op/En 64/32 bit Mode Support CPUID Feature Flag Description

66 0F 56 /r ORPD xmm1, xmm2/m128 A V/V SSE2 Return the bitwise logical OR of packed double-precision floating-point values in xmm1 and xmm2/mem.

VEX.128.66.0F 56 /r VORPD xmm1, xmm2, xmm3/m128 B V/V AVX Return the bitwise logical OR of packed double-precision floating-point values in xmm2 and xmm3/mem.

VEX.256.66.0F 56 /r VORPD ymm1, ymm2, ymm3/m256 B V/V AVX Return the bitwise logical OR of packed double-precision floating-point values in ymm2 and ymm3/mem.

EVEX.128.66.0F.W1 56 /r VORPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst C V/V AVX512VL AVX512DQ Return the bitwise logical OR of packed double-precision floating-point values in xmm2 and xmm3/m128/m64bcst subject to writemask k1.

EVEX.256.66.0F.W1 56 /r VORPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst C V/V AVX512VL AVX512DQ Return the bitwise logical OR of packed double-precision floating-point values in ymm2 and ymm3/m256/m64bcst subject to writemask k1.

EVEX.512.66.0F.W1 56 /r VORPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst C V/V AVX512DQ Return the bitwise logical OR of packed double-precision floating-point values in zmm2 and zmm3/m512/m64bcst subject to writemask k1.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (r, w) ModRM:r/m (r) NA NA

B NA ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) NA

C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA


Operation

VORPD (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b == 1) AND (SRC2 *is memory*)
                THEN
                    DEST[i+63:i] ← SRC1[i+63:i] BITWISE OR SRC2[63:0]
                ELSE
                    DEST[i+63:i] ← SRC1[i+63:i] BITWISE OR SRC2[i+63:i]
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VORPD (VEX.256 encoded version)

DEST[63:0] ← SRC1[63:0] BITWISE OR SRC2[63:0]
DEST[127:64] ← SRC1[127:64] BITWISE OR SRC2[127:64]
DEST[191:128] ← SRC1[191:128] BITWISE OR SRC2[191:128]
DEST[255:192] ← SRC1[255:192] BITWISE OR SRC2[255:192]
DEST[MAXVL-1:256] ← 0

VORPD (VEX.128 encoded version)

DEST[63:0] ← SRC1[63:0] BITWISE OR SRC2[63:0]
DEST[127:64] ← SRC1[127:64] BITWISE OR SRC2[127:64]
DEST[MAXVL-1:128] ← 0

ORPD (128-bit Legacy SSE version)

DEST[63:0] ← DEST[63:0] BITWISE OR SRC[63:0]
DEST[127:64] ← DEST[127:64] BITWISE OR SRC[127:64]
DEST[MAXVL-1:128] (Unmodified)
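The EVEX lane loop in the Operation section can be modelled in portable C. The following is a minimal sketch, not Intel code: the function name vorpd_model and its parameters are hypothetical, each lane is treated as a raw 64-bit pattern, the "no writemask" case is approximated by passing a mask with all lane bits set, and the final zeroing of DEST[MAXVL-1:VL] is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an Intel API): models the EVEX VORPD lane loop
 * on raw 64-bit lane patterns. kl is the lane count KL (2, 4 or 8). */
static void vorpd_model(uint64_t *dest, const uint64_t *src1,
                        const uint64_t *src2, size_t kl, uint8_t k1,
                        bool zeroing, bool broadcast)
{
    assert(kl == 2 || kl == 4 || kl == 8);
    for (size_t j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u) {
            /* With embedded broadcast, every lane uses element 0 of SRC2. */
            uint64_t b = broadcast ? src2[0] : src2[j];
            dest[j] = src1[j] | b;
        } else if (zeroing) {
            dest[j] = 0;   /* zeroing-masking */
        }
        /* merging-masking: the masked-off lane is left unchanged */
    }
}
```

Calling the helper with k1 = 0x1 and merging-masking on two lanes updates lane 0 only; lane 1 keeps its previous contents.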

Intel C/C++ Compiler Intrinsic Equivalent

VORPD __m512d _mm512_or_pd ( __m512d a, __m512d b);

VORPD __m512d _mm512_mask_or_pd ( __m512d s, __mmask8 k, __m512d a, __m512d b);

VORPD __m512d _mm512_maskz_or_pd (__mmask8 k, __m512d a, __m512d b);

VORPD __m256d _mm256_mask_or_pd (__m256d s, __mmask8 k, __m256d a, __m256d b);

VORPD __m256d _mm256_maskz_or_pd (__mmask8 k, __m256d a, __m256d b);

VORPD __m128d _mm_mask_or_pd ( __m128d s, __mmask8 k, __m128d a, __m128d b);

VORPD __m128d _mm_maskz_or_pd (__mmask8 k, __m128d a, __m128d b);

VORPD __m256d _mm256_or_pd (__m256d a, __m256d b);

ORPD __m128d _mm_or_pd (__m128d a, __m128d b);


SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.

EVEX-encoded instruction, see Exceptions Type E4.

ORPS—Bitwise Logical OR of Packed Single Precision Floating-Point Values


Instruction Operand Encoding

Description

Performs a bitwise logical OR of the four, eight or sixteen packed single-precision floating-point values from the first source operand and the second source operand, and stores the result in the destination operand.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcast from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register

or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the

corresponding ZMM register destination are zeroed.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.

Opcode/Instruction Op/En 64/32 bit Mode Support CPUID Feature Flag Description

NP 0F 56 /r ORPS xmm1, xmm2/m128 A V/V SSE Return the bitwise logical OR of packed single-precision floating-point values in xmm1 and xmm2/mem.

VEX.128.0F 56 /r VORPS xmm1, xmm2, xmm3/m128 B V/V AVX Return the bitwise logical OR of packed single-precision floating-point values in xmm2 and xmm3/mem.

VEX.256.0F 56 /r VORPS ymm1, ymm2, ymm3/m256 B V/V AVX Return the bitwise logical OR of packed single-precision floating-point values in ymm2 and ymm3/mem.

EVEX.128.0F.W0 56 /r VORPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst C V/V AVX512VL AVX512DQ Return the bitwise logical OR of packed single-precision floating-point values in xmm2 and xmm3/m128/m32bcst subject to writemask k1.

EVEX.256.0F.W0 56 /r VORPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst C V/V AVX512VL AVX512DQ Return the bitwise logical OR of packed single-precision floating-point values in ymm2 and ymm3/m256/m32bcst subject to writemask k1.

EVEX.512.0F.W0 56 /r VORPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst C V/V AVX512DQ Return the bitwise logical OR of packed single-precision floating-point values in zmm2 and zmm3/m512/m32bcst subject to writemask k1.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (r, w) ModRM:r/m (r) NA NA

B NA ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) NA

C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA


Operation

VORPS (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b == 1) AND (SRC2 *is memory*)
                THEN
                    DEST[i+31:i] ← SRC1[i+31:i] BITWISE OR SRC2[31:0]
                ELSE
                    DEST[i+31:i] ← SRC1[i+31:i] BITWISE OR SRC2[i+31:i]
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VORPS (VEX.256 encoded version)

DEST[31:0] ← SRC1[31:0] BITWISE OR SRC2[31:0]
DEST[63:32] ← SRC1[63:32] BITWISE OR SRC2[63:32]
DEST[95:64] ← SRC1[95:64] BITWISE OR SRC2[95:64]
DEST[127:96] ← SRC1[127:96] BITWISE OR SRC2[127:96]
DEST[159:128] ← SRC1[159:128] BITWISE OR SRC2[159:128]
DEST[191:160] ← SRC1[191:160] BITWISE OR SRC2[191:160]
DEST[223:192] ← SRC1[223:192] BITWISE OR SRC2[223:192]
DEST[255:224] ← SRC1[255:224] BITWISE OR SRC2[255:224]
DEST[MAXVL-1:256] ← 0

VORPS (VEX.128 encoded version)

DEST[31:0] ← SRC1[31:0] BITWISE OR SRC2[31:0]
DEST[63:32] ← SRC1[63:32] BITWISE OR SRC2[63:32]
DEST[95:64] ← SRC1[95:64] BITWISE OR SRC2[95:64]
DEST[127:96] ← SRC1[127:96] BITWISE OR SRC2[127:96]
DEST[MAXVL-1:128] ← 0

ORPS (128-bit Legacy SSE version)

DEST[31:0] ← SRC1[31:0] BITWISE OR SRC2[31:0]
DEST[63:32] ← SRC1[63:32] BITWISE OR SRC2[63:32]
DEST[95:64] ← SRC1[95:64] BITWISE OR SRC2[95:64]
DEST[127:96] ← SRC1[127:96] BITWISE OR SRC2[127:96]
DEST[MAXVL-1:128] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent

VORPS __m512 _mm512_or_ps ( __m512 a, __m512 b);

VORPS __m512 _mm512_mask_or_ps ( __m512 s, __mmask16 k, __m512 a, __m512 b);

VORPS __m512 _mm512_maskz_or_ps (__mmask16 k, __m512 a, __m512 b);

VORPS __m256 _mm256_mask_or_ps (__m256 s, __mmask8 k, __m256 a, __m256 b);

VORPS __m256 _mm256_maskz_or_ps (__mmask8 k, __m256 a, __m256 b);

VORPS __m128 _mm_mask_or_ps ( __m128 s, __mmask8 k, __m128 a, __m128 b);

VORPS __m128 _mm_maskz_or_ps (__mmask8 k, __m128 a, __m128 b);

VORPS __m256 _mm256_or_ps (__m256 a, __m256 b);

ORPS __m128 _mm_or_ps (__m128 a, __m128 b);
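Because the OR acts on the raw bit patterns of the floating-point values, one common use is forcing selected bits, such as the sign bit. The sketch below illustrates this for a single 32-bit lane in portable C rather than with the intrinsics listed above. The helper names or_ps_lane and neg_abs are hypothetical, and the code assumes that float is an IEEE 754 binary32 value, so that -0.0f has only the sign bit set.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Hypothetical helper: bitwise OR of two float bit patterns, i.e. what a
 * single 32-bit lane of ORPS computes. */
static float or_ps_lane(float a, float b)
{
    uint32_t ua, ub;
    float r;
    memcpy(&ua, &a, sizeof ua);   /* reinterpret the bits without aliasing issues */
    memcpy(&ub, &b, sizeof ub);
    ua |= ub;
    memcpy(&r, &ua, sizeof r);
    return r;
}

/* Assumption: IEEE 754 binary32. ORing with -0.0f sets the sign bit,
 * which yields -|x|. */
static float neg_abs(float x)
{
    return or_ps_lane(x, -0.0f);
}
```

With these definitions, neg_abs(1.5f) and neg_abs(-1.5f) both evaluate to -1.5f, since only the sign bit is affected.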

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.

EVEX-encoded instruction, see Exceptions Type E4.

OUT—Output to Port


Instruction Operand Encoding

Description

Copies the value from the second operand (source operand) to the I/O port specified with the destination operand

(first operand). The source operand can be register AL, AX, or EAX, depending on the size of the port being

accessed (8, 16, or 32 bits, respectively); the destination operand can be a byte-immediate or the DX register.

Using a byte immediate allows I/O port addresses 0 to 255 to be accessed; using the DX register as the destination

operand allows I/O ports from 0 to 65,535 to be accessed.

The size of the I/O port being accessed is determined by the opcode for an 8-bit I/O port or by the operand-size

attribute of the instruction for a 16- or 32-bit I/O port.

At the machine code level, I/O instructions are shorter when accessing 8-bit I/O ports. Here, the upper eight bits

of the port address will be 0.

This instruction is only useful for accessing I/O ports located in the processor’s I/O address space. See Chapter 18,

“Input/Output,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for more infor-

mation on accessing I/O ports in the I/O address space.

This instruction’s operation is the same in non-64-bit modes and 64-bit mode.

IA-32 Architecture Compatibility

After executing an OUT instruction, the Pentium® processor ensures that the EWBE# pin has been sampled active

before it begins to execute the next instruction. (Note that the instruction can be prefetched if EWBE# is not active,

but it will not be executed until the EWBE# pin is sampled active.) Only the Pentium processor family has the

EWBE# pin.

Opcode* Instruction Op/En 64-Bit Mode Compat/Leg Mode Description

E6 ib OUT imm8, AL I Valid Valid Output byte in AL to I/O port address imm8.

E7 ib OUT imm8, AX I Valid Valid Output word in AX to I/O port address imm8.

E7 ib OUT imm8, EAX I Valid Valid Output doubleword in EAX to I/O port address imm8.

EE OUT DX, AL ZO Valid Valid Output byte in AL to I/O port address in DX.

EF OUT DX, AX ZO Valid Valid Output word in AX to I/O port address in DX.

EF OUT DX, EAX ZO Valid Valid Output doubleword in EAX to I/O port address in DX.

NOTES:

* See the IA-32 Architecture Compatibility section.

Op/En Operand 1 Operand 2 Operand 3 Operand 4

I imm8 NA NA NA

ZO NA NA NA NA


Operation

IF ((PE = 1) and ((CPL > IOPL) or (VM = 1)))
    THEN (* Protected mode with CPL > IOPL or virtual-8086 mode *)
        IF (Any I/O Permission Bit for I/O port being accessed = 1)
            THEN (* I/O operation is not allowed *)
                #GP(0);
            ELSE (* I/O operation is allowed *)
                DEST ← SRC; (* Writes to selected I/O port *)
        FI;
    ELSE (* Real Mode or Protected Mode with CPL ≤ IOPL *)
        DEST ← SRC; (* Writes to selected I/O port *)
FI;
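The permission check in the pseudocode above can be sketched as a small C predicate. This is an illustrative model only: struct io_ctx, out_allowed and all field names are hypothetical, the bitmap is represented as a plain byte array with one bit per port, and treating ports that lie beyond the end of the bitmap as denied is an assumption of this sketch. A multi-byte access is modelled as testing one bit per byte-wide port it covers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the state consulted by the OUT pseudocode. */
struct io_ctx {
    bool pe;                /* CR0.PE */
    bool vm;                /* EFLAGS.VM */
    unsigned cpl;           /* current privilege level, 0..3 */
    unsigned iopl;          /* I/O privilege level, 0..3 */
    const uint8_t *bitmap;  /* I/O permission bitmap, one bit per port */
    size_t bitmap_bits;     /* number of ports the bitmap covers */
};

/* Returns true if the write proceeds, false where the pseudocode
 * raises #GP(0). width is the access size in bytes (1, 2 or 4). */
static bool out_allowed(const struct io_ctx *c, uint16_t port, unsigned width)
{
    assert(width == 1 || width == 2 || width == 4);
    if (!(c->pe && (c->cpl > c->iopl || c->vm)))
        return true;  /* real mode, or protected mode with CPL <= IOPL */
    for (unsigned i = 0; i < width; i++) {
        size_t p = (size_t)port + i;
        /* Assumption: ports past the end of the bitmap are denied. */
        if (p >= c->bitmap_bits || ((c->bitmap[p / 8] >> (p % 8)) & 1u))
            return false;
    }
    return true;
}
```

In this model a word write to port 7 is refused when the bit for port 8 is set, because the access covers both ports.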

Flags Affected

None

Protected Mode Exceptions

#GP(0) If the CPL is greater than (has less privilege than) the I/O privilege level (IOPL) and any of the corresponding I/O permission bits in the TSS for the I/O port being accessed is 1.

#UD If the LOCK prefix is used.

Real-Address Mode Exceptions

#UD If the LOCK prefix is used.

Virtual-8086 Mode Exceptions

#GP(0) If any of the I/O permission bits in the TSS for the I/O port being accessed is 1.

#PF(fault-code) If a page fault occurs.

#UD If the LOCK prefix is used.

Compatibility Mode Exceptions

Same as protected mode exceptions.

64-Bit Mode Exceptions

Same as protected mode exceptions.

OUTS/OUTSB/OUTSW/OUTSD—Output String to Port


Instruction Operand Encoding

Description

Copies data from the source operand (second operand) to the I/O port specified with the destination operand (first

operand). The source operand is a memory location, the address of which is read from either the DS:SI, DS:ESI or

the RSI registers (depending on the address-size attribute of the instruction, 16, 32 or 64, respectively). (The DS

segment may be overridden with a segment override prefix.) The destination operand is an I/O port address (from

0 to 65,535) that is read from the DX register. The size of the I/O port being accessed (that is, the size of the source

and destination operands) is determined by the opcode for an 8-bit I/O port or by the operand-size attribute of the

instruction for a 16- or 32-bit I/O port.

At the assembly-code level, two forms of this instruction are allowed: the “explicit-operands” form and the “no-

operands” form. The explicit-operands form (specified with the OUTS mnemonic) allows the source and destination

operands to be specified explicitly. Here, the source operand should be a symbol that indicates the size of the I/O

port and the source address, and the destination operand must be DX. This explicit-operands form is provided to

allow documentation; however, note that the documentation provided by this form can be misleading. That is, the

source operand symbol must specify the correct type (size) of the operand (byte, word, or doubleword), but it does

not have to specify the correct location. The location is always specified by the DS:(E)SI or RSI registers, which

must be loaded correctly before the OUTS instruction is executed.

The no-operands form provides “short forms” of the byte, word, and doubleword versions of the OUTS instructions.

Here also DS:(E)SI is assumed to be the source operand and DX is assumed to be the destination operand. The size

of the I/O port is specified with the choice of mnemonic: OUTSB (byte), OUTSW (word), or OUTSD (doubleword).

After the byte, word, or doubleword is transferred from the memory location to the I/O port, the SI/ESI/RSI register is incremented or decremented automatically according to the setting of the DF flag in the EFLAGS register. (If the DF flag is 0, the SI/ESI/RSI register is incremented; if the DF flag is 1, the SI/ESI/RSI register is decremented.) The SI/ESI/RSI register is incremented or decremented by 1 for byte operations, by 2 for word operations, and by 4 for doubleword operations.

Opcode* Instruction Op/En 64-Bit Mode Compat/Leg Mode Description

6E OUTS DX, m8 ZO Valid Valid Output byte from memory location specified in DS:(E)SI or RSI to I/O port specified in DX**.

6F OUTS DX, m16 ZO Valid Valid Output word from memory location specified in DS:(E)SI or RSI to I/O port specified in DX**.

6F OUTS DX, m32 ZO Valid Valid Output doubleword from memory location specified in DS:(E)SI or RSI to I/O port specified in DX**.

6E OUTSB ZO Valid Valid Output byte from memory location specified in DS:(E)SI or RSI to I/O port specified in DX**.

6F OUTSW ZO Valid Valid Output word from memory location specified in DS:(E)SI or RSI to I/O port specified in DX**.

6F OUTSD ZO Valid Valid Output doubleword from memory location specified in DS:(E)SI or RSI to I/O port specified in DX**.

NOTES:

* See IA-32 Architecture Compatibility section below.

** In 64-bit mode, only 64-bit (RSI) and 32-bit (ESI) address sizes are supported. In non-64-bit mode, only 32-bit (ESI) and 16-bit (SI)

address sizes are supported.

Op/En Operand 1 Operand 2 Operand 3 Operand 4

ZO NA NA NA NA


The OUTS, OUTSB, OUTSW, and OUTSD instructions can be preceded by the REP prefix for block output of ECX

bytes, words, or doublewords. See “REP/REPE/REPZ /REPNE/REPNZ—Repeat String Operation Prefix” in this

chapter for a description of the REP prefix. This instruction is only useful for accessing I/O ports located in the

processor’s I/O address space. See Chapter 18, “Input/Output,” in the Intel® 64 and IA-32 Architectures Software

Developer’s Manual, Volume 1, for more information on accessing I/O ports in the I/O address space.

In 64-bit mode, the default operand size is 32 bits; operand size is not promoted by the use of REX.W. In 64-bit

mode, the default address size is 64 bits, and the 64-bit address is specified using RSI by default. A 32-bit address using

ESI is supported using the prefix 67H, but a 16-bit address is not supported in 64-bit mode.

IA-32 Architecture Compatibility

After executing an OUTS, OUTSB, OUTSW, or OUTSD instruction, the Pentium processor ensures that the EWBE#

pin has been sampled active before it begins to execute the next instruction. (Note that the instruction can be

prefetched if EWBE# is not active, but it will not be executed until the EWBE# pin is sampled active.) Only the

Pentium processor family has the EWBE# pin.

For the Pentium 4, Intel® Xeon®, and P6 processor family, upon execution of an OUTS, OUTSB, OUTSW, or OUTSD

instruction, the processor will not execute the next instruction until the data phase of the transaction is complete.

Operation

IF ((PE = 1) and ((CPL > IOPL) or (VM = 1)))
    THEN (* Protected mode with CPL > IOPL or virtual-8086 mode *)
        IF (Any I/O Permission Bit for I/O port being accessed = 1)
            THEN (* I/O operation is not allowed *)
                #GP(0);
            ELSE (* I/O operation is allowed *)
                DEST ← SRC; (* Writes to I/O port *)
        FI;
    ELSE (* Real Mode or Protected Mode or 64-Bit Mode with CPL ≤ IOPL *)
        DEST ← SRC; (* Writes to I/O port *)
FI;

Byte transfer:
IF 64-bit mode
    THEN
        IF 64-Bit Address Size
            THEN
                IF DF = 0
                    THEN RSI ← RSI + 1;
                    ELSE RSI ← RSI – 1;
                FI;
            ELSE (* 32-Bit Address Size *)
                IF DF = 0
                    THEN ESI ← ESI + 1;
                    ELSE ESI ← ESI – 1;
                FI;
        FI;
    ELSE
        IF DF = 0
            THEN (E)SI ← (E)SI + 1;
            ELSE (E)SI ← (E)SI – 1;
        FI;
FI;

Word transfer:
IF 64-bit mode
    THEN
        IF 64-Bit Address Size
            THEN
                IF DF = 0
                    THEN RSI ← RSI + 2;
                    ELSE RSI ← RSI – 2;
                FI;
            ELSE (* 32-Bit Address Size *)
                IF DF = 0
                    THEN ESI ← ESI + 2;
                    ELSE ESI ← ESI – 2;
                FI;
        FI;
    ELSE
        IF DF = 0
            THEN (E)SI ← (E)SI + 2;
            ELSE (E)SI ← (E)SI – 2;
        FI;
FI;

Doubleword transfer:
IF 64-bit mode
    THEN
        IF 64-Bit Address Size
            THEN
                IF DF = 0
                    THEN RSI ← RSI + 4;
                    ELSE RSI ← RSI – 4;
                FI;
            ELSE (* 32-Bit Address Size *)
                IF DF = 0
                    THEN ESI ← ESI + 4;
                    ELSE ESI ← ESI – 4;
                FI;
        FI;
    ELSE
        IF DF = 0
            THEN (E)SI ← (E)SI + 4;
            ELSE (E)SI ← (E)SI – 4;
        FI;
FI;
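The three transfer cases above differ only in the step size, so the index update can be summarised in one function. The following is a hedged sketch: outs_next_index and its parameters are hypothetical names, only the address-size-wide register value (SI, ESI or RSI) is modelled, and the effect on any upper bits of the full register is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: value of the source index after one OUTS iteration.
 * addr_bits is 16 (SI), 32 (ESI) or 64 (RSI); size is the transfer size in
 * bytes (1, 2 or 4); df is the direction flag. */
static uint64_t outs_next_index(uint64_t index, unsigned addr_bits,
                                unsigned size, bool df)
{
    assert(addr_bits == 16 || addr_bits == 32 || addr_bits == 64);
    assert(size == 1 || size == 2 || size == 4);
    uint64_t mask = (addr_bits == 64) ? UINT64_MAX
                                      : ((UINT64_C(1) << addr_bits) - 1);
    /* Unsigned arithmetic wraps, then the result is truncated to the
     * address size so that, for example, SI wraps from FFFFH to 0000H. */
    uint64_t next = df ? index - size : index + size;
    return next & mask;
}
```

For a REP-prefixed instruction the same update would simply be applied once per iteration; the count register and the port write itself are not part of this sketch.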

Flags Affected

None


Protected Mode Exceptions

#GP(0) If the CPL is greater than (has less privilege than) the I/O privilege level (IOPL) and any of the corresponding I/O permission bits in the TSS for the I/O port being accessed is 1.

If a memory operand effective address is outside the limit of the CS, DS, ES, FS, or GS

segment.

If the segment register contains a NULL segment selector.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the

current privilege level is 3.

#UD If the LOCK prefix is used.

Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If the LOCK prefix is used.

Virtual-8086 Mode Exceptions

#GP(0) If any of the I/O permission bits in the TSS for the I/O port being accessed is 1.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made.

#UD If the LOCK prefix is used.

Compatibility Mode Exceptions

Same as for protected mode exceptions.

64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the CPL is greater than (has less privilege than) the I/O privilege level (IOPL) and any of the corresponding I/O permission bits in the TSS for the I/O port being accessed is 1.

If the memory address is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the

current privilege level is 3.

#UD If the LOCK prefix is used.

PABSB/PABSW/PABSD/PABSQ — Packed Absolute Value


Opcode/Instruction Op/En 64/32 bit Mode Support CPUID Feature Flag Description

NP 0F 38 1C /r¹ PABSB mm1, mm2/m64 A V/V SSSE3 Compute the absolute value of bytes in mm2/m64 and store UNSIGNED result in mm1.

66 0F 38 1C /r PABSB xmm1, xmm2/m128 A V/V SSSE3 Compute the absolute value of bytes in xmm2/m128 and store UNSIGNED result in xmm1.

NP 0F 38 1D /r¹ PABSW mm1, mm2/m64 A V/V SSSE3 Compute the absolute value of 16-bit integers in mm2/m64 and store UNSIGNED result in mm1.

66 0F 38 1D /r PABSW xmm1, xmm2/m128 A V/V SSSE3 Compute the absolute value of 16-bit integers in xmm2/m128 and store UNSIGNED result in xmm1.

NP 0F 38 1E /r¹ PABSD mm1, mm2/m64 A V/V SSSE3 Compute the absolute value of 32-bit integers in mm2/m64 and store UNSIGNED result in mm1.

66 0F 38 1E /r PABSD xmm1, xmm2/m128 A V/V SSSE3 Compute the absolute value of 32-bit integers in xmm2/m128 and store UNSIGNED result in xmm1.

VEX.128.66.0F38.WIG 1C /r VPABSB xmm1, xmm2/m128 A V/V AVX Compute the absolute value of bytes in xmm2/m128 and store UNSIGNED result in xmm1.

VEX.128.66.0F38.WIG 1D /r VPABSW xmm1, xmm2/m128 A V/V AVX Compute the absolute value of 16-bit integers in xmm2/m128 and store UNSIGNED result in xmm1.

VEX.128.66.0F38.WIG 1E /r VPABSD xmm1, xmm2/m128 A V/V AVX Compute the absolute value of 32-bit integers in xmm2/m128 and store UNSIGNED result in xmm1.

VEX.256.66.0F38.WIG 1C /r VPABSB ymm1, ymm2/m256 A V/V AVX2 Compute the absolute value of bytes in ymm2/m256 and store UNSIGNED result in ymm1.

VEX.256.66.0F38.WIG 1D /r VPABSW ymm1, ymm2/m256 A V/V AVX2 Compute the absolute value of 16-bit integers in ymm2/m256 and store UNSIGNED result in ymm1.

VEX.256.66.0F38.WIG 1E /r VPABSD ymm1, ymm2/m256 A V/V AVX2 Compute the absolute value of 32-bit integers in ymm2/m256 and store UNSIGNED result in ymm1.

EVEX.128.66.0F38.WIG 1C /r VPABSB xmm1 {k1}{z}, xmm2/m128 B V/V AVX512VL AVX512BW Compute the absolute value of bytes in xmm2/m128 and store UNSIGNED result in xmm1 using writemask k1.

EVEX.256.66.0F38.WIG 1C /r VPABSB ymm1 {k1}{z}, ymm2/m256 B V/V AVX512VL AVX512BW Compute the absolute value of bytes in ymm2/m256 and store UNSIGNED result in ymm1 using writemask k1.

EVEX.512.66.0F38.WIG 1C /r VPABSB zmm1 {k1}{z}, zmm2/m512 B V/V AVX512BW Compute the absolute value of bytes in zmm2/m512 and store UNSIGNED result in zmm1 using writemask k1.

EVEX.128.66.0F38.WIG 1D /r VPABSW xmm1 {k1}{z}, xmm2/m128 B V/V AVX512VL AVX512BW Compute the absolute value of 16-bit integers in xmm2/m128 and store UNSIGNED result in xmm1 using writemask k1.


Instruction Operand Encoding

Description

PABSB/W/D computes the absolute value of each data element of the source operand (the second operand) and

stores the UNSIGNED results in the destination operand (the first operand). PABSB operates on signed bytes,

PABSW operates on signed 16-bit words, and PABSD operates on signed 32-bit integers.
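The emphasis on UNSIGNED results matters for the most negative input: the absolute value of the byte -128 does not fit in a signed byte, but it does fit in an unsigned one as 128 (80H). The following minimal sketch models PABSB on an array of byte lanes in portable C. The name pabsb_model is hypothetical and the code is an illustration of the semantics described above, not Intel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper modelling PABSB on n byte lanes. The source lanes are
 * signed bytes; the destination lanes hold the unsigned absolute values. */
static void pabsb_model(uint8_t *dest, const int8_t *src, size_t n)
{
    assert(dest != NULL && src != NULL);
    for (size_t i = 0; i < n; i++) {
        int v = src[i];   /* widen first so that -(-128) is representable */
        dest[i] = (uint8_t)(v < 0 ? -v : v);
    }
}
```

With inputs -128, -1, 0 and 127 the modelled destination lanes hold 128, 1, 0 and 127. The same reasoning applies to the word, doubleword and quadword forms with their respective minimum values.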

EVEX encoded VPABSD/Q: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location,

or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. The destination operand is a

ZMM/YMM/XMM register updated according to the writemask.

EVEX encoded VPABSB/W: The source operand is a ZMM/YMM/XMM register, or a 512/256/128-bit memory loca-

tion. The destination operand is a ZMM/YMM/XMM register updated according to the writemask.

VEX.256 encoded versions: The source operand is a YMM register or a 256-bit memory location. The destination

operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding register destination are zeroed.

VEX.128 encoded versions: The source operand is an XMM register or 128-bit memory location. The destination

operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding register destination are zeroed.

EVEX.256.66.0F38.WIG 1D /r VPABSW ymm1 {k1}{z}, ymm2/m256 B V/V AVX512VL AVX512BW Compute the absolute value of 16-bit integers in ymm2/m256 and store UNSIGNED result in ymm1 using writemask k1.

EVEX.512.66.0F38.WIG 1D /r VPABSW zmm1 {k1}{z}, zmm2/m512 B V/V AVX512BW Compute the absolute value of 16-bit integers in zmm2/m512 and store UNSIGNED result in zmm1 using writemask k1.

EVEX.128.66.0F38.W0 1E /r VPABSD xmm1 {k1}{z}, xmm2/m128/m32bcst C V/V AVX512VL AVX512F Compute the absolute value of 32-bit integers in xmm2/m128/m32bcst and store UNSIGNED result in xmm1 using writemask k1.

EVEX.256.66.0F38.W0 1E /r VPABSD ymm1 {k1}{z}, ymm2/m256/m32bcst C V/V AVX512VL AVX512F Compute the absolute value of 32-bit integers in ymm2/m256/m32bcst and store UNSIGNED result in ymm1 using writemask k1.

EVEX.512.66.0F38.W0 1E /r VPABSD zmm1 {k1}{z}, zmm2/m512/m32bcst C V/V AVX512F Compute the absolute value of 32-bit integers in zmm2/m512/m32bcst and store UNSIGNED result in zmm1 using writemask k1.

EVEX.128.66.0F38.W1 1F /r VPABSQ xmm1 {k1}{z}, xmm2/m128/m64bcst C V/V AVX512VL AVX512F Compute the absolute value of 64-bit integers in xmm2/m128/m64bcst and store UNSIGNED result in xmm1 using writemask k1.

EVEX.256.66.0F38.W1 1F /r VPABSQ ymm1 {k1}{z}, ymm2/m256/m64bcst C V/V AVX512VL AVX512F Compute the absolute value of 64-bit integers in ymm2/m256/m64bcst and store UNSIGNED result in ymm1 using writemask k1.

EVEX.512.66.0F38.W1 1F /r VPABSQ zmm1 {k1}{z}, zmm2/m512/m64bcst C V/V AVX512F Compute the absolute value of 64-bit integers in zmm2/m512/m64bcst and store UNSIGNED result in zmm1 using writemask k1.

NOTES:

1. See note in Section 2.4, “AVX and SSE Instruction Exception Specification” in the Intel® 64 and IA-32 Architectures Software

Developer’s Manual, Volume 2A and Section 22.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers”

in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (w) ModRM:r/m (r) NA NA

B Full Mem ModRM:reg (w) ModRM:r/m (r) NA NA

C Full ModRM:reg (w) ModRM:r/m (r) NA NA


128-bit Legacy SSE version: The source operand can be an XMM register or a 128-bit memory location. The destination is an XMM register. The upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.

VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise instructions will #UD.

Operation

PABSB with 128 bit operands:
    Unsigned DEST[7:0] ← ABS(SRC[7:0])
    Repeat operation for 2nd through 15th bytes
    Unsigned DEST[127:120] ← ABS(SRC[127:120])

VPABSB with 128 bit operands:
    Unsigned DEST[7:0] ← ABS(SRC[7:0])
    Repeat operation for 2nd through 15th bytes
    Unsigned DEST[127:120] ← ABS(SRC[127:120])

VPABSB with 256 bit operands:
    Unsigned DEST[7:0] ← ABS(SRC[7:0])
    Repeat operation for 2nd through 31st bytes
    Unsigned DEST[255:248] ← ABS(SRC[255:248])

VPABSB (EVEX encoded versions)

(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask*
        THEN
            Unsigned DEST[i+7:i] ← ABS(SRC[i+7:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+7:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+7:i] ← 0
            FI;
    FI;
ENDFOR;
DEST[MAXVL-1:VL] ← 0

PABSW with 128 bit operands:
    Unsigned DEST[15:0] ← ABS(SRC[15:0])
    Repeat operation for 2nd through 7th 16-bit words
    Unsigned DEST[127:112] ← ABS(SRC[127:112])

VPABSW with 128 bit operands:
    Unsigned DEST[15:0] ← ABS(SRC[15:0])
    Repeat operation for 2nd through 7th 16-bit words
    Unsigned DEST[127:112] ← ABS(SRC[127:112])

VPABSW with 256 bit operands:
    Unsigned DEST[15:0] ← ABS(SRC[15:0])
    Repeat operation for 2nd through 15th 16-bit words
    Unsigned DEST[255:240] ← ABS(SRC[255:240])


VPABSW (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN
            Unsigned DEST[i+15:i] ← ABS(SRC[i+15:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+15:i] ← 0
            FI;
    FI;
ENDFOR;
DEST[MAXVL-1:VL] ← 0

PABSD with 128 bit operands:
    Unsigned DEST[31:0] ← ABS(SRC[31:0])
    Repeat operation for 2nd through 3rd 32-bit double words
    Unsigned DEST[127:96] ← ABS(SRC[127:96])

VPABSD with 128 bit operands:
    Unsigned DEST[31:0] ← ABS(SRC[31:0])
    Repeat operation for 2nd through 3rd 32-bit double words
    Unsigned DEST[127:96] ← ABS(SRC[127:96])

VPABSD with 256 bit operands:
    Unsigned DEST[31:0] ← ABS(SRC[31:0])
    Repeat operation for 2nd through 7th 32-bit double words
    Unsigned DEST[255:224] ← ABS(SRC[255:224])

VPABSD (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC *is memory*)
                THEN
                    Unsigned DEST[i+31:i] ← ABS(SRC[31:0])
                ELSE
                    Unsigned DEST[i+31:i] ← ABS(SRC[i+31:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI;
    FI;
ENDFOR;
DEST[MAXVL-1:VL] ← 0


VPABSQ (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC *is memory*)
                THEN
                    Unsigned DEST[i+63:i] ← ABS(SRC[63:0])
                ELSE
                    Unsigned DEST[i+63:i] ← ABS(SRC[i+63:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI;
    FI;
ENDFOR;
DEST[MAXVL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalents

VPABSB __m512i _mm512_abs_epi8 ( __m512i a)

VPABSW __m512i _mm512_abs_epi16 ( __m512i a)

VPABSB __m512i _mm512_mask_abs_epi8 ( __m512i s, __mmask64 m, __m512i a)

VPABSW __m512i _mm512_mask_abs_epi16 ( __m512i s, __mmask32 m, __m512i a)

VPABSB __m512i _mm512_maskz_abs_epi8 (__mmask64 m, __m512i a)

VPABSW __m512i _mm512_maskz_abs_epi16 (__mmask32 m, __m512i a)

VPABSB __m256i _mm256_mask_abs_epi8 (__m256i s, __mmask32 m, __m256i a)

VPABSW __m256i _mm256_mask_abs_epi16 (__m256i s, __mmask16 m, __m256i a)

VPABSB __m256i _mm256_maskz_abs_epi8 (__mmask32 m, __m256i a)

VPABSW __m256i _mm256_maskz_abs_epi16 (__mmask16 m, __m256i a)

VPABSB __m128i _mm_mask_abs_epi8 (__m128i s, __mmask16 m, __m128i a)

VPABSW __m128i _mm_mask_abs_epi16 (__m128i s, __mmask8 m, __m128i a)

VPABSB __m128i _mm_maskz_abs_epi8 (__mmask16 m, __m128i a)

VPABSW __m128i _mm_maskz_abs_epi16 (__mmask8 m, __m128i a)

VPABSD __m256i _mm256_mask_abs_epi32(__m256i s, __mmask8 k, __m256i a);

VPABSD __m256i _mm256_maskz_abs_epi32( __mmask8 k, __m256i a);

VPABSD __m128i _mm_mask_abs_epi32(__m128i s, __mmask8 k, __m128i a);

VPABSD __m128i _mm_maskz_abs_epi32( __mmask8 k, __m128i a);

VPABSD __m512i _mm512_abs_epi32( __m512i a);

VPABSD __m512i _mm512_mask_abs_epi32(__m512i s, __mmask16 k, __m512i a);

VPABSD __m512i _mm512_maskz_abs_epi32( __mmask16 k, __m512i a);

VPABSQ __m512i _mm512_abs_epi64( __m512i a);

VPABSQ __m512i _mm512_mask_abs_epi64(__m512i s, __mmask8 k, __m512i a);

VPABSQ __m512i _mm512_maskz_abs_epi64( __mmask8 k, __m512i a);

VPABSQ __m256i _mm256_mask_abs_epi64(__m256i s, __mmask8 k, __m256i a);

VPABSQ __m256i _mm256_maskz_abs_epi64( __mmask8 k, __m256i a);

VPABSQ __m128i _mm_mask_abs_epi64(__m128i s, __mmask8 k, __m128i a);

VPABSQ __m128i _mm_maskz_abs_epi64( __mmask8 k, __m128i a);

PABSB __m128i _mm_abs_epi8 (__m128i a)

VPABSB __m128i _mm_abs_epi8 (__m128i a)

PABSB/PABSW/PABSD/PABSQ — Packed Absolute Value

INSTRUCTION SET REFERENCE, M-U

Vol. 2B 4-185

VPABSB __m256i _mm256_abs_epi8 (__m256i a)

PABSW __m128i _mm_abs_epi16 (__m128i a)

VPABSW __m128i _mm_abs_epi16 (__m128i a)

VPABSW __m256i _mm256_abs_epi16 (__m256i a)

PABSD __m128i _mm_abs_epi32 (__m128i a)

VPABSD __m128i _mm_abs_epi32 (__m128i a)

VPABSD __m256i _mm256_abs_epi32 (__m256i a)

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.

EVEX-encoded VPABSD/Q, see Exceptions Type E4.

EVEX-encoded VPABSB/W, see Exceptions Type E4.nb.


PACKSSWB/PACKSSDW—Pack with Signed Saturation

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 63 /r¹ PACKSSWB mm1, mm2/m64 | A | V/V | MMX | Converts 4 packed signed word integers from mm1 and from mm2/m64 into 8 packed signed byte integers in mm1 using signed saturation.
66 0F 63 /r PACKSSWB xmm1, xmm2/m128 | A | V/V | SSE2 | Converts 8 packed signed word integers from xmm1 and from xmm2/m128 into 16 packed signed byte integers in xmm1 using signed saturation.
NP 0F 6B /r¹ PACKSSDW mm1, mm2/m64 | A | V/V | MMX | Converts 2 packed signed doubleword integers from mm1 and from mm2/m64 into 4 packed signed word integers in mm1 using signed saturation.
66 0F 6B /r PACKSSDW xmm1, xmm2/m128 | A | V/V | SSE2 | Converts 4 packed signed doubleword integers from xmm1 and from xmm2/m128 into 8 packed signed word integers in xmm1 using signed saturation.
VEX.128.66.0F.WIG 63 /r VPACKSSWB xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Converts 8 packed signed word integers from xmm2 and from xmm3/m128 into 16 packed signed byte integers in xmm1 using signed saturation.
VEX.128.66.0F.WIG 6B /r VPACKSSDW xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Converts 4 packed signed doubleword integers from xmm2 and from xmm3/m128 into 8 packed signed word integers in xmm1 using signed saturation.
VEX.256.66.0F.WIG 63 /r VPACKSSWB ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Converts 16 packed signed word integers from ymm2 and from ymm3/m256 into 32 packed signed byte integers in ymm1 using signed saturation.
VEX.256.66.0F.WIG 6B /r VPACKSSDW ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Converts 8 packed signed doubleword integers from ymm2 and from ymm3/m256 into 16 packed signed word integers in ymm1 using signed saturation.
EVEX.128.66.0F.WIG 63 /r VPACKSSWB xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | AVX512VL AVX512BW | Converts packed signed word integers from xmm2 and from xmm3/m128 into packed signed byte integers in xmm1 using signed saturation under writemask k1.
EVEX.256.66.0F.WIG 63 /r VPACKSSWB ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | AVX512VL AVX512BW | Converts packed signed word integers from ymm2 and from ymm3/m256 into packed signed byte integers in ymm1 using signed saturation under writemask k1.
EVEX.512.66.0F.WIG 63 /r VPACKSSWB zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW | Converts packed signed word integers from zmm2 and from zmm3/m512 into packed signed byte integers in zmm1 using signed saturation under writemask k1.
EVEX.128.66.0F.W0 6B /r VPACKSSDW xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | D | V/V | AVX512VL AVX512BW | Converts packed signed doubleword integers from xmm2 and from xmm3/m128/m32bcst into packed signed word integers in xmm1 using signed saturation under writemask k1.
EVEX.256.66.0F.W0 6B /r VPACKSSDW ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | D | V/V | AVX512VL AVX512BW | Converts packed signed doubleword integers from ymm2 and from ymm3/m256/m32bcst into packed signed word integers in ymm1 using signed saturation under writemask k1.
EVEX.512.66.0F.W0 6B /r VPACKSSDW zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | D | V/V | AVX512BW | Converts packed signed doubleword integers from zmm2 and from zmm3/m512/m32bcst into packed signed word integers in zmm1 using signed saturation under writemask k1.

NOTES:

1. See note in Section 2.4, “AVX and SSE Instruction Exception Specification” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A and Section 22.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
C | Full Mem | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA
D | Full | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description

Converts packed signed word integers into packed signed byte integers (PACKSSWB) or converts packed signed doubleword integers into packed signed word integers (PACKSSDW), using saturation to handle overflow conditions. See Figure 4-6 for an example of the packing operation.

Figure 4-6. Operation of the PACKSSDW Instruction Using 64-bit Operands

[Figure: a 64-bit SRC and a 64-bit DEST are packed into the 64-bit DEST as the words D’ C’ B’ A’.]

PACKSSWB converts packed signed word integers in the first and second source operands into packed signed byte integers using signed saturation to handle overflow conditions beyond the range of signed byte integers. If the signed word value is beyond the range of a signed byte value (i.e., greater than 7FH or less than 80H), the saturated signed byte integer value of 7FH or 80H, respectively, is stored in the destination. PACKSSDW converts packed signed doubleword integers in the first and second source operands into packed signed word integers using signed saturation to handle overflow conditions beyond 7FFFH and 8000H.

EVEX encoded PACKSSWB: The first source operand is a ZMM/YMM/XMM register. The second source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM register, updated conditionally under the writemask k1.

EVEX encoded PACKSSDW: The first source operand is a ZMM/YMM/XMM register. The second source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register, updated conditionally under the writemask k1.


VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are zeroed.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.

128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register, and the upper bits (MAXVL-1:128) of the corresponding ZMM destination register are unmodified.
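The signed saturation described above can be illustrated with a small scalar sketch in C. The function names below are hypothetical stand-ins for the manual's SaturateSignedWordToSignedByte and SaturateSignedDwordToSignedWord operations, and `packsswb128_model` is an assumed helper that models only the 128-bit data movement (first source into the low 8 bytes, second source into the high 8 bytes), not the instruction's register semantics.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: clamp a signed word into the signed byte range. */
static int8_t saturate_word_to_byte(int16_t w)
{
    if (w > INT8_MAX) return INT8_MAX;   /* greater than 7FH -> 7FH */
    if (w < INT8_MIN) return INT8_MIN;   /* less than 80H (-128) -> 80H */
    return (int8_t)w;
}

/* Hypothetical helper: clamp a signed doubleword into the signed word range. */
static int16_t saturate_dword_to_word(int32_t d)
{
    if (d > INT16_MAX) return INT16_MAX; /* greater than 7FFFH -> 7FFFH */
    if (d < INT16_MIN) return INT16_MIN; /* less than 8000H -> 8000H */
    return (int16_t)d;
}

/* Sketch of 128-bit PACKSSWB: out[0..7] come from the first source a,
   out[8..15] from the second source b. */
static void packsswb128_model(int8_t out[16], const int16_t a[8],
                              const int16_t b[8])
{
    assert(out && a && b);
    for (int e = 0; e < 8; e++) {
        out[e]     = saturate_word_to_byte(a[e]);
        out[8 + e] = saturate_word_to_byte(b[e]);
    }
}
```

In the legacy SSE form the first source and the destination are the same register; the sketch uses a separate output array only to keep the element types distinct.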

Operation

PACKSSWB instruction (128-bit Legacy SSE version)

DEST[7:0] ← SaturateSignedWordToSignedByte (DEST[15:0]);
DEST[15:8] ← SaturateSignedWordToSignedByte (DEST[31:16]);
DEST[23:16] ← SaturateSignedWordToSignedByte (DEST[47:32]);
DEST[31:24] ← SaturateSignedWordToSignedByte (DEST[63:48]);
DEST[39:32] ← SaturateSignedWordToSignedByte (DEST[79:64]);
DEST[47:40] ← SaturateSignedWordToSignedByte (DEST[95:80]);
DEST[55:48] ← SaturateSignedWordToSignedByte (DEST[111:96]);
DEST[63:56] ← SaturateSignedWordToSignedByte (DEST[127:112]);
DEST[71:64] ← SaturateSignedWordToSignedByte (SRC[15:0]);
DEST[79:72] ← SaturateSignedWordToSignedByte (SRC[31:16]);
DEST[87:80] ← SaturateSignedWordToSignedByte (SRC[47:32]);
DEST[95:88] ← SaturateSignedWordToSignedByte (SRC[63:48]);
DEST[103:96] ← SaturateSignedWordToSignedByte (SRC[79:64]);
DEST[111:104] ← SaturateSignedWordToSignedByte (SRC[95:80]);
DEST[119:112] ← SaturateSignedWordToSignedByte (SRC[111:96]);
DEST[127:120] ← SaturateSignedWordToSignedByte (SRC[127:112]);
DEST[MAXVL-1:128] (Unmodified)

PACKSSDW instruction (128-bit Legacy SSE version)

DEST[15:0] ← SaturateSignedDwordToSignedWord (DEST[31:0]);
DEST[31:16] ← SaturateSignedDwordToSignedWord (DEST[63:32]);
DEST[47:32] ← SaturateSignedDwordToSignedWord (DEST[95:64]);
DEST[63:48] ← SaturateSignedDwordToSignedWord (DEST[127:96]);
DEST[79:64] ← SaturateSignedDwordToSignedWord (SRC[31:0]);
DEST[95:80] ← SaturateSignedDwordToSignedWord (SRC[63:32]);
DEST[111:96] ← SaturateSignedDwordToSignedWord (SRC[95:64]);
DEST[127:112] ← SaturateSignedDwordToSignedWord (SRC[127:96]);

DEST[MAXVL-1:128] (Unmodified)


VPACKSSWB instruction (VEX.128 encoded version)

DEST[7:0] ← SaturateSignedWordToSignedByte (SRC1[15:0]);
DEST[15:8] ← SaturateSignedWordToSignedByte (SRC1[31:16]);
DEST[23:16] ← SaturateSignedWordToSignedByte (SRC1[47:32]);
DEST[31:24] ← SaturateSignedWordToSignedByte (SRC1[63:48]);
DEST[39:32] ← SaturateSignedWordToSignedByte (SRC1[79:64]);
DEST[47:40] ← SaturateSignedWordToSignedByte (SRC1[95:80]);
DEST[55:48] ← SaturateSignedWordToSignedByte (SRC1[111:96]);
DEST[63:56] ← SaturateSignedWordToSignedByte (SRC1[127:112]);
DEST[71:64] ← SaturateSignedWordToSignedByte (SRC2[15:0]);
DEST[79:72] ← SaturateSignedWordToSignedByte (SRC2[31:16]);
DEST[87:80] ← SaturateSignedWordToSignedByte (SRC2[47:32]);
DEST[95:88] ← SaturateSignedWordToSignedByte (SRC2[63:48]);
DEST[103:96] ← SaturateSignedWordToSignedByte (SRC2[79:64]);
DEST[111:104] ← SaturateSignedWordToSignedByte (SRC2[95:80]);
DEST[119:112] ← SaturateSignedWordToSignedByte (SRC2[111:96]);
DEST[127:120] ← SaturateSignedWordToSignedByte (SRC2[127:112]);
DEST[MAXVL-1:128] ← 0;

VPACKSSDW instruction (VEX.128 encoded version)

DEST[15:0] ← SaturateSignedDwordToSignedWord (SRC1[31:0]);
DEST[31:16] ← SaturateSignedDwordToSignedWord (SRC1[63:32]);
DEST[47:32] ← SaturateSignedDwordToSignedWord (SRC1[95:64]);
DEST[63:48] ← SaturateSignedDwordToSignedWord (SRC1[127:96]);
DEST[79:64] ← SaturateSignedDwordToSignedWord (SRC2[31:0]);
DEST[95:80] ← SaturateSignedDwordToSignedWord (SRC2[63:32]);
DEST[111:96] ← SaturateSignedDwordToSignedWord (SRC2[95:64]);
DEST[127:112] ← SaturateSignedDwordToSignedWord (SRC2[127:96]);
DEST[MAXVL-1:128] ← 0;

VPACKSSWB instruction (VEX.256 encoded version)

DEST[7:0] ← SaturateSignedWordToSignedByte (SRC1[15:0]);
DEST[15:8] ← SaturateSignedWordToSignedByte (SRC1[31:16]);
DEST[23:16] ← SaturateSignedWordToSignedByte (SRC1[47:32]);
DEST[31:24] ← SaturateSignedWordToSignedByte (SRC1[63:48]);
DEST[39:32] ← SaturateSignedWordToSignedByte (SRC1[79:64]);
DEST[47:40] ← SaturateSignedWordToSignedByte (SRC1[95:80]);
DEST[55:48] ← SaturateSignedWordToSignedByte (SRC1[111:96]);
DEST[63:56] ← SaturateSignedWordToSignedByte (SRC1[127:112]);
DEST[71:64] ← SaturateSignedWordToSignedByte (SRC2[15:0]);
DEST[79:72] ← SaturateSignedWordToSignedByte (SRC2[31:16]);
DEST[87:80] ← SaturateSignedWordToSignedByte (SRC2[47:32]);
DEST[95:88] ← SaturateSignedWordToSignedByte (SRC2[63:48]);
DEST[103:96] ← SaturateSignedWordToSignedByte (SRC2[79:64]);
DEST[111:104] ← SaturateSignedWordToSignedByte (SRC2[95:80]);
DEST[119:112] ← SaturateSignedWordToSignedByte (SRC2[111:96]);
DEST[127:120] ← SaturateSignedWordToSignedByte (SRC2[127:112]);
DEST[135:128] ← SaturateSignedWordToSignedByte (SRC1[143:128]);
DEST[143:136] ← SaturateSignedWordToSignedByte (SRC1[159:144]);
DEST[151:144] ← SaturateSignedWordToSignedByte (SRC1[175:160]);
DEST[159:152] ← SaturateSignedWordToSignedByte (SRC1[191:176]);
DEST[167:160] ← SaturateSignedWordToSignedByte (SRC1[207:192]);
DEST[175:168] ← SaturateSignedWordToSignedByte (SRC1[223:208]);
DEST[183:176] ← SaturateSignedWordToSignedByte (SRC1[239:224]);
DEST[191:184] ← SaturateSignedWordToSignedByte (SRC1[255:240]);
DEST[199:192] ← SaturateSignedWordToSignedByte (SRC2[143:128]);
DEST[207:200] ← SaturateSignedWordToSignedByte (SRC2[159:144]);
DEST[215:208] ← SaturateSignedWordToSignedByte (SRC2[175:160]);
DEST[223:216] ← SaturateSignedWordToSignedByte (SRC2[191:176]);
DEST[231:224] ← SaturateSignedWordToSignedByte (SRC2[207:192]);
DEST[239:232] ← SaturateSignedWordToSignedByte (SRC2[223:208]);
DEST[247:240] ← SaturateSignedWordToSignedByte (SRC2[239:224]);
DEST[255:248] ← SaturateSignedWordToSignedByte (SRC2[255:240]);
DEST[MAXVL-1:256] ← 0;

VPACKSSDW instruction (VEX.256 encoded version)

DEST[15:0] ← SaturateSignedDwordToSignedWord (SRC1[31:0]);
DEST[31:16] ← SaturateSignedDwordToSignedWord (SRC1[63:32]);
DEST[47:32] ← SaturateSignedDwordToSignedWord (SRC1[95:64]);
DEST[63:48] ← SaturateSignedDwordToSignedWord (SRC1[127:96]);
DEST[79:64] ← SaturateSignedDwordToSignedWord (SRC2[31:0]);
DEST[95:80] ← SaturateSignedDwordToSignedWord (SRC2[63:32]);
DEST[111:96] ← SaturateSignedDwordToSignedWord (SRC2[95:64]);
DEST[127:112] ← SaturateSignedDwordToSignedWord (SRC2[127:96]);
DEST[143:128] ← SaturateSignedDwordToSignedWord (SRC1[159:128]);
DEST[159:144] ← SaturateSignedDwordToSignedWord (SRC1[191:160]);
DEST[175:160] ← SaturateSignedDwordToSignedWord (SRC1[223:192]);
DEST[191:176] ← SaturateSignedDwordToSignedWord (SRC1[255:224]);
DEST[207:192] ← SaturateSignedDwordToSignedWord (SRC2[159:128]);
DEST[223:208] ← SaturateSignedDwordToSignedWord (SRC2[191:160]);
DEST[239:224] ← SaturateSignedDwordToSignedWord (SRC2[223:192]);
DEST[255:240] ← SaturateSignedDwordToSignedWord (SRC2[255:224]);
DEST[MAXVL-1:256] ← 0;
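The VEX.256 listings above show that the wider forms do not concatenate the two sources end to end: each 128-bit lane of the destination takes its low half from the same lane of SRC1 and its high half from the same lane of SRC2. A minimal scalar sketch of that interleaving for VPACKSSDW follows; `vpackssdw_lanes_model` and `sat_dword_to_word` are hypothetical names, and the zeroing of DEST[MAXVL-1:256] is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: signed saturation from doubleword to word. */
static int16_t sat_dword_to_word(int32_t d)
{
    if (d > INT16_MAX) return INT16_MAX;
    if (d < INT16_MIN) return INT16_MIN;
    return (int16_t)d;
}

/* lanes = 1 models the 128-bit form, lanes = 2 the 256-bit form.
   Each 128-bit lane holds 4 doublewords per source and 8 words in dest. */
static void vpackssdw_lanes_model(int16_t *dest, const int32_t *src1,
                                  const int32_t *src2, int lanes)
{
    assert(lanes >= 1 && lanes <= 4);
    for (int l = 0; l < lanes; l++) {
        for (int e = 0; e < 4; e++) {
            /* low 64 bits of the lane come from SRC1 ... */
            dest[8 * l + e]     = sat_dword_to_word(src1[4 * l + e]);
            /* ... high 64 bits of the lane come from SRC2 */
            dest[8 * l + 4 + e] = sat_dword_to_word(src2[4 * l + e]);
        }
    }
}
```

For `lanes = 2`, word 8 of the result comes from doubleword 4 of `src1` and word 12 from doubleword 4 of `src2`, matching DEST[143:128] and DEST[207:192] in the listing.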

VPACKSSWB (EVEX encoded versions)

(KL, VL) = (16, 128), (32, 256), (64, 512)

TMP_DEST[7:0] ← SaturateSignedWordToSignedByte (SRC1[15:0]);
TMP_DEST[15:8] ← SaturateSignedWordToSignedByte (SRC1[31:16]);
TMP_DEST[23:16] ← SaturateSignedWordToSignedByte (SRC1[47:32]);
TMP_DEST[31:24] ← SaturateSignedWordToSignedByte (SRC1[63:48]);
TMP_DEST[39:32] ← SaturateSignedWordToSignedByte (SRC1[79:64]);
TMP_DEST[47:40] ← SaturateSignedWordToSignedByte (SRC1[95:80]);
TMP_DEST[55:48] ← SaturateSignedWordToSignedByte (SRC1[111:96]);
TMP_DEST[63:56] ← SaturateSignedWordToSignedByte (SRC1[127:112]);
TMP_DEST[71:64] ← SaturateSignedWordToSignedByte (SRC2[15:0]);
TMP_DEST[79:72] ← SaturateSignedWordToSignedByte (SRC2[31:16]);
TMP_DEST[87:80] ← SaturateSignedWordToSignedByte (SRC2[47:32]);
TMP_DEST[95:88] ← SaturateSignedWordToSignedByte (SRC2[63:48]);
TMP_DEST[103:96] ← SaturateSignedWordToSignedByte (SRC2[79:64]);
TMP_DEST[111:104] ← SaturateSignedWordToSignedByte (SRC2[95:80]);
TMP_DEST[119:112] ← SaturateSignedWordToSignedByte (SRC2[111:96]);
TMP_DEST[127:120] ← SaturateSignedWordToSignedByte (SRC2[127:112]);
IF VL >= 256
    TMP_DEST[135:128] ← SaturateSignedWordToSignedByte (SRC1[143:128]);
    TMP_DEST[143:136] ← SaturateSignedWordToSignedByte (SRC1[159:144]);
    TMP_DEST[151:144] ← SaturateSignedWordToSignedByte (SRC1[175:160]);
    TMP_DEST[159:152] ← SaturateSignedWordToSignedByte (SRC1[191:176]);
    TMP_DEST[167:160] ← SaturateSignedWordToSignedByte (SRC1[207:192]);
    TMP_DEST[175:168] ← SaturateSignedWordToSignedByte (SRC1[223:208]);
    TMP_DEST[183:176] ← SaturateSignedWordToSignedByte (SRC1[239:224]);
    TMP_DEST[191:184] ← SaturateSignedWordToSignedByte (SRC1[255:240]);
    TMP_DEST[199:192] ← SaturateSignedWordToSignedByte (SRC2[143:128]);
    TMP_DEST[207:200] ← SaturateSignedWordToSignedByte (SRC2[159:144]);
    TMP_DEST[215:208] ← SaturateSignedWordToSignedByte (SRC2[175:160]);
    TMP_DEST[223:216] ← SaturateSignedWordToSignedByte (SRC2[191:176]);
    TMP_DEST[231:224] ← SaturateSignedWordToSignedByte (SRC2[207:192]);
    TMP_DEST[239:232] ← SaturateSignedWordToSignedByte (SRC2[223:208]);
    TMP_DEST[247:240] ← SaturateSignedWordToSignedByte (SRC2[239:224]);
    TMP_DEST[255:248] ← SaturateSignedWordToSignedByte (SRC2[255:240]);
FI;
IF VL >= 512
    TMP_DEST[263:256] ← SaturateSignedWordToSignedByte (SRC1[271:256]);
    TMP_DEST[271:264] ← SaturateSignedWordToSignedByte (SRC1[287:272]);
    TMP_DEST[279:272] ← SaturateSignedWordToSignedByte (SRC1[303:288]);
    TMP_DEST[287:280] ← SaturateSignedWordToSignedByte (SRC1[319:304]);
    TMP_DEST[295:288] ← SaturateSignedWordToSignedByte (SRC1[335:320]);
    TMP_DEST[303:296] ← SaturateSignedWordToSignedByte (SRC1[351:336]);
    TMP_DEST[311:304] ← SaturateSignedWordToSignedByte (SRC1[367:352]);
    TMP_DEST[319:312] ← SaturateSignedWordToSignedByte (SRC1[383:368]);
    TMP_DEST[327:320] ← SaturateSignedWordToSignedByte (SRC2[271:256]);
    TMP_DEST[335:328] ← SaturateSignedWordToSignedByte (SRC2[287:272]);
    TMP_DEST[343:336] ← SaturateSignedWordToSignedByte (SRC2[303:288]);
    TMP_DEST[351:344] ← SaturateSignedWordToSignedByte (SRC2[319:304]);
    TMP_DEST[359:352] ← SaturateSignedWordToSignedByte (SRC2[335:320]);
    TMP_DEST[367:360] ← SaturateSignedWordToSignedByte (SRC2[351:336]);
    TMP_DEST[375:368] ← SaturateSignedWordToSignedByte (SRC2[367:352]);
    TMP_DEST[383:376] ← SaturateSignedWordToSignedByte (SRC2[383:368]);
    TMP_DEST[391:384] ← SaturateSignedWordToSignedByte (SRC1[399:384]);
    TMP_DEST[399:392] ← SaturateSignedWordToSignedByte (SRC1[415:400]);
    TMP_DEST[407:400] ← SaturateSignedWordToSignedByte (SRC1[431:416]);
    TMP_DEST[415:408] ← SaturateSignedWordToSignedByte (SRC1[447:432]);
    TMP_DEST[423:416] ← SaturateSignedWordToSignedByte (SRC1[463:448]);
    TMP_DEST[431:424] ← SaturateSignedWordToSignedByte (SRC1[479:464]);
    TMP_DEST[439:432] ← SaturateSignedWordToSignedByte (SRC1[495:480]);
    TMP_DEST[447:440] ← SaturateSignedWordToSignedByte (SRC1[511:496]);
    TMP_DEST[455:448] ← SaturateSignedWordToSignedByte (SRC2[399:384]);
    TMP_DEST[463:456] ← SaturateSignedWordToSignedByte (SRC2[415:400]);
    TMP_DEST[471:464] ← SaturateSignedWordToSignedByte (SRC2[431:416]);
    TMP_DEST[479:472] ← SaturateSignedWordToSignedByte (SRC2[447:432]);
    TMP_DEST[487:480] ← SaturateSignedWordToSignedByte (SRC2[463:448]);
    TMP_DEST[495:488] ← SaturateSignedWordToSignedByte (SRC2[479:464]);
    TMP_DEST[503:496] ← SaturateSignedWordToSignedByte (SRC2[495:480]);
    TMP_DEST[511:504] ← SaturateSignedWordToSignedByte (SRC2[511:496]);
FI;
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask*
        THEN
            DEST[i+7:i] ← TMP_DEST[i+7:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+7:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+7:i] ← 0
            FI
    FI;
ENDFOR;
DEST[MAXVL-1:VL] ← 0

VPACKSSDW (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j ← 0 TO ((KL/2) - 1)
    i ← j * 32
    IF (EVEX.b == 1) AND (SRC2 *is memory*)
        THEN
            TMP_SRC2[i+31:i] ← SRC2[31:0]
        ELSE
            TMP_SRC2[i+31:i] ← SRC2[i+31:i]
    FI;
ENDFOR;
TMP_DEST[15:0] ← SaturateSignedDwordToSignedWord (SRC1[31:0]);
TMP_DEST[31:16] ← SaturateSignedDwordToSignedWord (SRC1[63:32]);
TMP_DEST[47:32] ← SaturateSignedDwordToSignedWord (SRC1[95:64]);
TMP_DEST[63:48] ← SaturateSignedDwordToSignedWord (SRC1[127:96]);
TMP_DEST[79:64] ← SaturateSignedDwordToSignedWord (TMP_SRC2[31:0]);
TMP_DEST[95:80] ← SaturateSignedDwordToSignedWord (TMP_SRC2[63:32]);
TMP_DEST[111:96] ← SaturateSignedDwordToSignedWord (TMP_SRC2[95:64]);
TMP_DEST[127:112] ← SaturateSignedDwordToSignedWord (TMP_SRC2[127:96]);
IF VL >= 256
    TMP_DEST[143:128] ← SaturateSignedDwordToSignedWord (SRC1[159:128]);
    TMP_DEST[159:144] ← SaturateSignedDwordToSignedWord (SRC1[191:160]);
    TMP_DEST[175:160] ← SaturateSignedDwordToSignedWord (SRC1[223:192]);
    TMP_DEST[191:176] ← SaturateSignedDwordToSignedWord (SRC1[255:224]);
    TMP_DEST[207:192] ← SaturateSignedDwordToSignedWord (TMP_SRC2[159:128]);
    TMP_DEST[223:208] ← SaturateSignedDwordToSignedWord (TMP_SRC2[191:160]);
    TMP_DEST[239:224] ← SaturateSignedDwordToSignedWord (TMP_SRC2[223:192]);
    TMP_DEST[255:240] ← SaturateSignedDwordToSignedWord (TMP_SRC2[255:224]);
FI;
IF VL >= 512
    TMP_DEST[271:256] ← SaturateSignedDwordToSignedWord (SRC1[287:256]);
    TMP_DEST[287:272] ← SaturateSignedDwordToSignedWord (SRC1[319:288]);
    TMP_DEST[303:288] ← SaturateSignedDwordToSignedWord (SRC1[351:320]);
    TMP_DEST[319:304] ← SaturateSignedDwordToSignedWord (SRC1[383:352]);
    TMP_DEST[335:320] ← SaturateSignedDwordToSignedWord (TMP_SRC2[287:256]);
    TMP_DEST[351:336] ← SaturateSignedDwordToSignedWord (TMP_SRC2[319:288]);
    TMP_DEST[367:352] ← SaturateSignedDwordToSignedWord (TMP_SRC2[351:320]);
    TMP_DEST[383:368] ← SaturateSignedDwordToSignedWord (TMP_SRC2[383:352]);
    TMP_DEST[399:384] ← SaturateSignedDwordToSignedWord (SRC1[415:384]);
    TMP_DEST[415:400] ← SaturateSignedDwordToSignedWord (SRC1[447:416]);
    TMP_DEST[431:416] ← SaturateSignedDwordToSignedWord (SRC1[479:448]);
    TMP_DEST[447:432] ← SaturateSignedDwordToSignedWord (SRC1[511:480]);
    TMP_DEST[463:448] ← SaturateSignedDwordToSignedWord (TMP_SRC2[415:384]);
    TMP_DEST[479:464] ← SaturateSignedDwordToSignedWord (TMP_SRC2[447:416]);
    TMP_DEST[495:480] ← SaturateSignedDwordToSignedWord (TMP_SRC2[479:448]);
    TMP_DEST[511:496] ← SaturateSignedDwordToSignedWord (TMP_SRC2[511:480]);
FI;
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← TMP_DEST[i+15:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+15:i] ← 0
            FI
    FI;
ENDFOR;
DEST[MAXVL-1:VL] ← 0
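The three stages of the EVEX-encoded VPACKSSDW operation above (optional 32-bit broadcast of the second source, per-lane packing into a temporary, then per-word masking) can be sketched in scalar C. This is an illustrative model under stated assumptions: `vpackssdw_evex_model`, `sat_dword_to_word`, and `mask_mode` are hypothetical names, `bcst` stands in for "EVEX.b = 1 with a memory source", no writemask is modeled as an all-ones `k1`, and the zeroing of DEST[MAXVL-1:VL] is omitted.

```c
#include <assert.h>
#include <stdint.h>

enum mask_mode { MERGING, ZEROING };

/* Hypothetical helper: signed saturation from doubleword to word. */
static int16_t sat_dword_to_word(int32_t d)
{
    if (d > INT16_MAX) return INT16_MAX;
    if (d < INT16_MIN) return INT16_MIN;
    return (int16_t)d;
}

/* vl is 128, 256 or 512; the destination has KL = vl / 16 word lanes. */
static void vpackssdw_evex_model(int16_t *dest, const int32_t *src1,
                                 const int32_t *src2, int vl, uint32_t k1,
                                 enum mask_mode mode, int bcst)
{
    assert(vl == 128 || vl == 256 || vl == 512);
    int kl = vl / 16;
    int32_t tmp_src2[16];
    int16_t tmp_dest[32];

    /* Stage 1: broadcast SRC2[31:0] to every doubleword, or copy SRC2. */
    for (int j = 0; j < kl / 2; j++)
        tmp_src2[j] = bcst ? src2[0] : src2[j];

    /* Stage 2: pack within each 128-bit lane into TMP_DEST. */
    for (int l = 0; l < vl / 128; l++) {
        for (int e = 0; e < 4; e++) {
            tmp_dest[8 * l + e]     = sat_dword_to_word(src1[4 * l + e]);
            tmp_dest[8 * l + 4 + e] = sat_dword_to_word(tmp_src2[4 * l + e]);
        }
    }

    /* Stage 3: apply the writemask one result word at a time. */
    for (int j = 0; j < kl; j++) {
        if ((k1 >> j) & 1)
            dest[j] = tmp_dest[j];
        else if (mode == ZEROING)
            dest[j] = 0;              /* merging: dest[j] unchanged */
    }
}
```

Note that the mask granularity is the 16-bit result element, so a 512-bit operation consumes 32 mask bits even though each source holds only 16 doublewords.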

Intel C/C++ Compiler Intrinsic Equivalents

VPACKSSDW __m512i _mm512_packs_epi32(__m512i m1, __m512i m2);
VPACKSSDW __m512i _mm512_mask_packs_epi32(__m512i s, __mmask32 k, __m512i m1, __m512i m2);
VPACKSSDW __m512i _mm512_maskz_packs_epi32(__mmask32 k, __m512i m1, __m512i m2);
VPACKSSDW __m256i _mm256_mask_packs_epi32(__m256i s, __mmask16 k, __m256i m1, __m256i m2);
VPACKSSDW __m256i _mm256_maskz_packs_epi32(__mmask16 k, __m256i m1, __m256i m2);
VPACKSSDW __m128i _mm_mask_packs_epi32(__m128i s, __mmask8 k, __m128i m1, __m128i m2);
VPACKSSDW __m128i _mm_maskz_packs_epi32(__mmask8 k, __m128i m1, __m128i m2);
VPACKSSWB __m512i _mm512_packs_epi16(__m512i m1, __m512i m2);
VPACKSSWB __m512i _mm512_mask_packs_epi16(__m512i s, __mmask64 k, __m512i m1, __m512i m2);
VPACKSSWB __m512i _mm512_maskz_packs_epi16(__mmask64 k, __m512i m1, __m512i m2);
VPACKSSWB __m256i _mm256_mask_packs_epi16(__m256i s, __mmask32 k, __m256i m1, __m256i m2);
VPACKSSWB __m256i _mm256_maskz_packs_epi16(__mmask32 k, __m256i m1, __m256i m2);
VPACKSSWB __m128i _mm_mask_packs_epi16(__m128i s, __mmask16 k, __m128i m1, __m128i m2);
VPACKSSWB __m128i _mm_maskz_packs_epi16(__mmask16 k, __m128i m1, __m128i m2);
PACKSSWB __m128i _mm_packs_epi16(__m128i m1, __m128i m2);
PACKSSDW __m128i _mm_packs_epi32(__m128i m1, __m128i m2);
VPACKSSWB __m256i _mm256_packs_epi16(__m256i m1, __m256i m2);
VPACKSSDW __m256i _mm256_packs_epi32(__m256i m1, __m256i m2);

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.

EVEX-encoded VPACKSSDW, see Exceptions Type E4NF.

EVEX-encoded VPACKSSWB, see Exceptions Type E4NF.nb.


PACKUSDW—Pack with Unsigned Saturation

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 38 2B /r PACKUSDW xmm1, xmm2/m128 | A | V/V | SSE4_1 | Convert 4 packed signed doubleword integers from xmm1 and 4 packed signed doubleword integers from xmm2/m128 into 8 packed unsigned word integers in xmm1 using unsigned saturation.
VEX.128.66.0F38 2B /r VPACKUSDW xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Convert 4 packed signed doubleword integers from xmm2 and 4 packed signed doubleword integers from xmm3/m128 into 8 packed unsigned word integers in xmm1 using unsigned saturation.
VEX.256.66.0F38 2B /r VPACKUSDW ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Convert 8 packed signed doubleword integers from ymm2 and 8 packed signed doubleword integers from ymm3/m256 into 16 packed unsigned word integers in ymm1 using unsigned saturation.
EVEX.128.66.0F38.W0 2B /r VPACKUSDW xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst | C | V/V | AVX512VL AVX512BW | Convert packed signed doubleword integers from xmm2 and packed signed doubleword integers from xmm3/m128/m32bcst into packed unsigned word integers in xmm1 using unsigned saturation under writemask k1.
EVEX.256.66.0F38.W0 2B /r VPACKUSDW ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst | C | V/V | AVX512VL AVX512BW | Convert packed signed doubleword integers from ymm2 and packed signed doubleword integers from ymm3/m256/m32bcst into packed unsigned word integers in ymm1 using unsigned saturation under writemask k1.
EVEX.512.66.0F38.W0 2B /r VPACKUSDW zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst | C | V/V | AVX512BW | Convert packed signed doubleword integers from zmm2 and packed signed doubleword integers from zmm3/m512/m32bcst into packed unsigned word integers in zmm1 using unsigned saturation under writemask k1.

Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
C | Full | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description

Converts packed signed doubleword integers in the first and second source operands into packed unsigned word integers using unsigned saturation to handle overflow conditions. If the signed doubleword value is beyond the range of an unsigned word (that is, greater than FFFFH or less than 0000H), the saturated unsigned word integer value of FFFFH or 0000H, respectively, is stored in the destination.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register, updated conditionally under the writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are zeroed.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.

128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register, and the upper bits (MAXVL-1:128) of the corresponding destination register are unmodified.


Operation

PACKUSDW (Legacy SSE instruction)

TMP[15:0] ← (DEST[31:0] < 0) ? 0 : DEST[15:0];
DEST[15:0] ← (DEST[31:0] > FFFFH) ? FFFFH : TMP[15:0];
TMP[31:16] ← (DEST[63:32] < 0) ? 0 : DEST[47:32];
DEST[31:16] ← (DEST[63:32] > FFFFH) ? FFFFH : TMP[31:16];
TMP[47:32] ← (DEST[95:64] < 0) ? 0 : DEST[79:64];
DEST[47:32] ← (DEST[95:64] > FFFFH) ? FFFFH : TMP[47:32];
TMP[63:48] ← (DEST[127:96] < 0) ? 0 : DEST[111:96];
DEST[63:48] ← (DEST[127:96] > FFFFH) ? FFFFH : TMP[63:48];
TMP[79:64] ← (SRC[31:0] < 0) ? 0 : SRC[15:0];
DEST[79:64] ← (SRC[31:0] > FFFFH) ? FFFFH : TMP[79:64];
TMP[95:80] ← (SRC[63:32] < 0) ? 0 : SRC[47:32];
DEST[95:80] ← (SRC[63:32] > FFFFH) ? FFFFH : TMP[95:80];
TMP[111:96] ← (SRC[95:64] < 0) ? 0 : SRC[79:64];
DEST[111:96] ← (SRC[95:64] > FFFFH) ? FFFFH : TMP[111:96];
TMP[127:112] ← (SRC[127:96] < 0) ? 0 : SRC[111:96];
DEST[127:112] ← (SRC[127:96] > FFFFH) ? FFFFH : TMP[127:112];

DEST[MAXVL-1:128] (Unmodified)
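The two-step TMP/DEST pattern in the legacy PACKUSDW listing above (first clamp negative inputs to 0, then clamp inputs above FFFFH to FFFFH) amounts to an unsigned saturation of each signed doubleword. A minimal scalar sketch in C, with the hypothetical names `saturate_dword_to_uword` and `packusdw128_model`, and with the unmodified upper destination bits left out of the model:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: clamp a signed doubleword into 0000H..FFFFH.
   Both comparisons are signed, as in the pseudocode. */
static uint16_t saturate_dword_to_uword(int32_t d)
{
    if (d < 0)      return 0;        /* less than 0000H -> 0000H */
    if (d > 0xFFFF) return 0xFFFF;   /* greater than FFFFH -> FFFFH */
    return (uint16_t)d;
}

/* Sketch of 128-bit PACKUSDW: out[0..3] come from the first source a
   (the destination register in the legacy form), out[4..7] from b. */
static void packusdw128_model(uint16_t out[8], const int32_t a[4],
                              const int32_t b[4])
{
    assert(out && a && b);
    for (int e = 0; e < 4; e++) {
        out[e]     = saturate_dword_to_uword(a[e]);
        out[4 + e] = saturate_dword_to_uword(b[e]);
    }
}
```

Because the input is interpreted as signed, a doubleword such as 80000000H saturates to 0000H rather than FFFFH.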

VPACKUSDW (VEX.128 encoded version)

TMP[15:0] ← (SRC1[31:0] < 0) ? 0 : SRC1[15:0];
DEST[15:0] ← (SRC1[31:0] > FFFFH) ? FFFFH : TMP[15:0];
TMP[31:16] ← (SRC1[63:32] < 0) ? 0 : SRC1[47:32];
DEST[31:16] ← (SRC1[63:32] > FFFFH) ? FFFFH : TMP[31:16];
TMP[47:32] ← (SRC1[95:64] < 0) ? 0 : SRC1[79:64];
DEST[47:32] ← (SRC1[95:64] > FFFFH) ? FFFFH : TMP[47:32];
TMP[63:48] ← (SRC1[127:96] < 0) ? 0 : SRC1[111:96];
DEST[63:48] ← (SRC1[127:96] > FFFFH) ? FFFFH : TMP[63:48];
TMP[79:64] ← (SRC2[31:0] < 0) ? 0 : SRC2[15:0];
DEST[79:64] ← (SRC2[31:0] > FFFFH) ? FFFFH : TMP[79:64];
TMP[95:80] ← (SRC2[63:32] < 0) ? 0 : SRC2[47:32];
DEST[95:80] ← (SRC2[63:32] > FFFFH) ? FFFFH : TMP[95:80];
TMP[111:96] ← (SRC2[95:64] < 0) ? 0 : SRC2[79:64];
DEST[111:96] ← (SRC2[95:64] > FFFFH) ? FFFFH : TMP[111:96];
TMP[127:112] ← (SRC2[127:96] < 0) ? 0 : SRC2[111:96];
DEST[127:112] ← (SRC2[127:96] > FFFFH) ? FFFFH : TMP[127:112];
DEST[MAXVL-1:128] ← 0;

VPACKUSDW (VEX.256 encoded version)

TMP[15:0] ← (SRC1[31:0] < 0) ? 0 : SRC1[15:0];
DEST[15:0] ← (SRC1[31:0] > FFFFH) ? FFFFH : TMP[15:0];
TMP[31:16] ← (SRC1[63:32] < 0) ? 0 : SRC1[47:32];
DEST[31:16] ← (SRC1[63:32] > FFFFH) ? FFFFH : TMP[31:16];
TMP[47:32] ← (SRC1[95:64] < 0) ? 0 : SRC1[79:64];
DEST[47:32] ← (SRC1[95:64] > FFFFH) ? FFFFH : TMP[47:32];
TMP[63:48] ← (SRC1[127:96] < 0) ? 0 : SRC1[111:96];
DEST[63:48] ← (SRC1[127:96] > FFFFH) ? FFFFH : TMP[63:48];
TMP[79:64] ← (SRC2[31:0] < 0) ? 0 : SRC2[15:0];
DEST[79:64] ← (SRC2[31:0] > FFFFH) ? FFFFH : TMP[79:64];
TMP[95:80] ← (SRC2[63:32] < 0) ? 0 : SRC2[47:32];
DEST[95:80] ← (SRC2[63:32] > FFFFH) ? FFFFH : TMP[95:80];
TMP[111:96] ← (SRC2[95:64] < 0) ? 0 : SRC2[79:64];
DEST[111:96] ← (SRC2[95:64] > FFFFH) ? FFFFH : TMP[111:96];
TMP[127:112] ← (SRC2[127:96] < 0) ? 0 : SRC2[111:96];
DEST[127:112] ← (SRC2[127:96] > FFFFH) ? FFFFH : TMP[127:112];
TMP[143:128] ← (SRC1[159:128] < 0) ? 0 : SRC1[143:128];
DEST[143:128] ← (SRC1[159:128] > FFFFH) ? FFFFH : TMP[143:128];
TMP[159:144] ← (SRC1[191:160] < 0) ? 0 : SRC1[175:160];
DEST[159:144] ← (SRC1[191:160] > FFFFH) ? FFFFH : TMP[159:144];
TMP[175:160] ← (SRC1[223:192] < 0) ? 0 : SRC1[207:192];
DEST[175:160] ← (SRC1[223:192] > FFFFH) ? FFFFH : TMP[175:160];
TMP[191:176] ← (SRC1[255:224] < 0) ? 0 : SRC1[239:224];
DEST[191:176] ← (SRC1[255:224] > FFFFH) ? FFFFH : TMP[191:176];
TMP[207:192] ← (SRC2[159:128] < 0) ? 0 : SRC2[143:128];
DEST[207:192] ← (SRC2[159:128] > FFFFH) ? FFFFH : TMP[207:192];
TMP[223:208] ← (SRC2[191:160] < 0) ? 0 : SRC2[175:160];
DEST[223:208] ← (SRC2[191:160] > FFFFH) ? FFFFH : TMP[223:208];
TMP[239:224] ← (SRC2[223:192] < 0) ? 0 : SRC2[207:192];
DEST[239:224] ← (SRC2[223:192] > FFFFH) ? FFFFH : TMP[239:224];
TMP[255:240] ← (SRC2[255:224] < 0) ? 0 : SRC2[239:224];
DEST[255:240] ← (SRC2[255:224] > FFFFH) ? FFFFH : TMP[255:240];
DEST[MAXVL-1:256] ← 0;

VPACKUSDW (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO ((KL/2) - 1)
    i ← j * 32
    IF (EVEX.b == 1) AND (SRC2 *is memory*)
        THEN TMP_SRC2[i+31:i] ← SRC2[31:0]
        ELSE TMP_SRC2[i+31:i] ← SRC2[i+31:i]
    FI;
ENDFOR;
TMP[15:0] ← (SRC1[31:0] < 0) ? 0 : SRC1[15:0];
TMP_DEST[15:0] ← (SRC1[31:0] > FFFFH) ? FFFFH : TMP[15:0];
TMP[31:16] ← (SRC1[63:32] < 0) ? 0 : SRC1[47:32];
TMP_DEST[31:16] ← (SRC1[63:32] > FFFFH) ? FFFFH : TMP[31:16];
TMP[47:32] ← (SRC1[95:64] < 0) ? 0 : SRC1[79:64];
TMP_DEST[47:32] ← (SRC1[95:64] > FFFFH) ? FFFFH : TMP[47:32];
TMP[63:48] ← (SRC1[127:96] < 0) ? 0 : SRC1[111:96];
TMP_DEST[63:48] ← (SRC1[127:96] > FFFFH) ? FFFFH : TMP[63:48];
TMP[79:64] ← (TMP_SRC2[31:0] < 0) ? 0 : TMP_SRC2[15:0];
TMP_DEST[79:64] ← (TMP_SRC2[31:0] > FFFFH) ? FFFFH : TMP[79:64];
TMP[95:80] ← (TMP_SRC2[63:32] < 0) ? 0 : TMP_SRC2[47:32];
TMP_DEST[95:80] ← (TMP_SRC2[63:32] > FFFFH) ? FFFFH : TMP[95:80];
TMP[111:96] ← (TMP_SRC2[95:64] < 0) ? 0 : TMP_SRC2[79:64];
TMP_DEST[111:96] ← (TMP_SRC2[95:64] > FFFFH) ? FFFFH : TMP[111:96];
TMP[127:112] ← (TMP_SRC2[127:96] < 0) ? 0 : TMP_SRC2[111:96];
TMP_DEST[127:112] ← (TMP_SRC2[127:96] > FFFFH) ? FFFFH : TMP[127:112];
IF VL >= 256
    TMP[143:128] ← (SRC1[159:128] < 0) ? 0 : SRC1[143:128];
    TMP_DEST[143:128] ← (SRC1[159:128] > FFFFH) ? FFFFH : TMP[143:128];
    TMP[159:144] ← (SRC1[191:160] < 0) ? 0 : SRC1[175:160];
    TMP_DEST[159:144] ← (SRC1[191:160] > FFFFH) ? FFFFH : TMP[159:144];
    TMP[175:160] ← (SRC1[223:192] < 0) ? 0 : SRC1[207:192];
    TMP_DEST[175:160] ← (SRC1[223:192] > FFFFH) ? FFFFH : TMP[175:160];
    TMP[191:176] ← (SRC1[255:224] < 0) ? 0 : SRC1[239:224];
    TMP_DEST[191:176] ← (SRC1[255:224] > FFFFH) ? FFFFH : TMP[191:176];
    TMP[207:192] ← (TMP_SRC2[159:128] < 0) ? 0 : TMP_SRC2[143:128];
    TMP_DEST[207:192] ← (TMP_SRC2[159:128] > FFFFH) ? FFFFH : TMP[207:192];
    TMP[223:208] ← (TMP_SRC2[191:160] < 0) ? 0 : TMP_SRC2[175:160];
    TMP_DEST[223:208] ← (TMP_SRC2[191:160] > FFFFH) ? FFFFH : TMP[223:208];
    TMP[239:224] ← (TMP_SRC2[223:192] < 0) ? 0 : TMP_SRC2[207:192];
    TMP_DEST[239:224] ← (TMP_SRC2[223:192] > FFFFH) ? FFFFH : TMP[239:224];
    TMP[255:240] ← (TMP_SRC2[255:224] < 0) ? 0 : TMP_SRC2[239:224];
    TMP_DEST[255:240] ← (TMP_SRC2[255:224] > FFFFH) ? FFFFH : TMP[255:240];
FI;
IF VL >= 512
    TMP[271:256] ← (SRC1[287:256] < 0) ? 0 : SRC1[271:256];
    TMP_DEST[271:256] ← (SRC1[287:256] > FFFFH) ? FFFFH : TMP[271:256];
    TMP[287:272] ← (SRC1[319:288] < 0) ? 0 : SRC1[303:288];
    TMP_DEST[287:272] ← (SRC1[319:288] > FFFFH) ? FFFFH : TMP[287:272];
    TMP[303:288] ← (SRC1[351:320] < 0) ? 0 : SRC1[335:320];
    TMP_DEST[303:288] ← (SRC1[351:320] > FFFFH) ? FFFFH : TMP[303:288];
    TMP[319:304] ← (SRC1[383:352] < 0) ? 0 : SRC1[367:352];
    TMP_DEST[319:304] ← (SRC1[383:352] > FFFFH) ? FFFFH : TMP[319:304];
    TMP[335:320] ← (TMP_SRC2[287:256] < 0) ? 0 : TMP_SRC2[271:256];
    TMP_DEST[335:320] ← (TMP_SRC2[287:256] > FFFFH) ? FFFFH : TMP[335:320];
    TMP[351:336] ← (TMP_SRC2[319:288] < 0) ? 0 : TMP_SRC2[303:288];
    TMP_DEST[351:336] ← (TMP_SRC2[319:288] > FFFFH) ? FFFFH : TMP[351:336];
    TMP[367:352] ← (TMP_SRC2[351:320] < 0) ? 0 : TMP_SRC2[335:320];
    TMP_DEST[367:352] ← (TMP_SRC2[351:320] > FFFFH) ? FFFFH : TMP[367:352];
    TMP[383:368] ← (TMP_SRC2[383:352] < 0) ? 0 : TMP_SRC2[367:352];
    TMP_DEST[383:368] ← (TMP_SRC2[383:352] > FFFFH) ? FFFFH : TMP[383:368];
    TMP[399:384] ← (SRC1[415:384] < 0) ? 0 : SRC1[399:384];
    TMP_DEST[399:384] ← (SRC1[415:384] > FFFFH) ? FFFFH : TMP[399:384];
    TMP[415:400] ← (SRC1[447:416] < 0) ? 0 : SRC1[431:416];
    TMP_DEST[415:400] ← (SRC1[447:416] > FFFFH) ? FFFFH : TMP[415:400];
    TMP[431:416] ← (SRC1[479:448] < 0) ? 0 : SRC1[463:448];
    TMP_DEST[431:416] ← (SRC1[479:448] > FFFFH) ? FFFFH : TMP[431:416];
    TMP[447:432] ← (SRC1[511:480] < 0) ? 0 : SRC1[495:480];
    TMP_DEST[447:432] ← (SRC1[511:480] > FFFFH) ? FFFFH : TMP[447:432];
    TMP[463:448] ← (TMP_SRC2[415:384] < 0) ? 0 : TMP_SRC2[399:384];
    TMP_DEST[463:448] ← (TMP_SRC2[415:384] > FFFFH) ? FFFFH : TMP[463:448];
    TMP[479:464] ← (TMP_SRC2[447:416] < 0) ? 0 : TMP_SRC2[431:416];
    TMP_DEST[479:464] ← (TMP_SRC2[447:416] > FFFFH) ? FFFFH : TMP[479:464];
    TMP[495:480] ← (TMP_SRC2[479:448] < 0) ? 0 : TMP_SRC2[463:448];
    TMP_DEST[495:480] ← (TMP_SRC2[479:448] > FFFFH) ? FFFFH : TMP[495:480];
    TMP[511:496] ← (TMP_SRC2[511:480] < 0) ? 0 : TMP_SRC2[495:480];
    TMP_DEST[511:496] ← (TMP_SRC2[511:480] > FFFFH) ? FFFFH : TMP[511:496];
FI;
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← TMP_DEST[i+15:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+15:i] ← 0
            FI
    FI;
ENDFOR;
DEST[MAXVL-1:VL] ← 0
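
The unrolled pseudocode above follows one regular pattern: each 128-bit lane receives four saturated words from the first source followed by four from the second. The following is a minimal scalar sketch of that pattern in portable C, not the hardware definition. The names saturate_dword_to_uword and packusdw_model are hypothetical, the model ignores writemasking and embedded broadcast, and it assumes the destination array does not alias the sources.

```c
#include <assert.h>
#include <stdint.h>

/* Saturate one signed doubleword to an unsigned word:
   negative values become 0, values above FFFFH become FFFFH,
   anything else keeps its low 16 bits. */
static uint16_t saturate_dword_to_uword(int32_t v)
{
    if (v < 0) return 0;
    if (v > 0xFFFF) return 0xFFFF;
    return (uint16_t)v;
}

/* Hypothetical scalar model of (V)PACKUSDW without masking.
   vl_bits is 128, 256 or 512. src1 and src2 each hold vl_bits/32
   doublewords; dest receives vl_bits/16 words. Every 128-bit lane is
   packed independently: four words from src1, then four from src2. */
static void packusdw_model(uint16_t *dest, const int32_t *src1,
                           const int32_t *src2, unsigned vl_bits)
{
    assert(vl_bits == 128 || vl_bits == 256 || vl_bits == 512);
    unsigned lanes = vl_bits / 128;
    for (unsigned lane = 0; lane < lanes; lane++) {
        for (unsigned k = 0; k < 4; k++) {
            dest[lane * 8 + k]     = saturate_dword_to_uword(src1[lane * 4 + k]);
            dest[lane * 8 + 4 + k] = saturate_dword_to_uword(src2[lane * 4 + k]);
        }
    }
}
```

With vl_bits set to 256, words 8 through 11 of the result come from doublewords 4 through 7 of the first source, which is why a 256-bit pack is not the same as packing two whole vectors end to end.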

Intel C/C++ Compiler Intrinsic Equivalents

VPACKUSDW __m512i _mm512_packus_epi32(__m512i m1, __m512i m2);
VPACKUSDW __m512i _mm512_mask_packus_epi32(__m512i s, __mmask32 k, __m512i m1, __m512i m2);
VPACKUSDW __m512i _mm512_maskz_packus_epi32(__mmask32 k, __m512i m1, __m512i m2);
VPACKUSDW __m256i _mm256_mask_packus_epi32(__m256i s, __mmask16 k, __m256i m1, __m256i m2);
VPACKUSDW __m256i _mm256_maskz_packus_epi32(__mmask16 k, __m256i m1, __m256i m2);
VPACKUSDW __m128i _mm_mask_packus_epi32(__m128i s, __mmask8 k, __m128i m1, __m128i m2);
VPACKUSDW __m128i _mm_maskz_packus_epi32(__mmask8 k, __m128i m1, __m128i m2);
PACKUSDW __m128i _mm_packus_epi32(__m128i m1, __m128i m2);
VPACKUSDW __m256i _mm256_packus_epi32(__m256i m1, __m256i m2);

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.

EVEX-encoded instruction, see Exceptions Type E4NF.

PACKUSWB—Pack with Unsigned Saturation

Opcode/Instruction  Op/En  64/32 bit Mode Support  CPUID Feature Flag  Description

NP 0F 67 /r (note 1)
PACKUSWB mm, mm/m64
A V/V MMX Converts 4 signed word integers from mm and 4 signed word integers from mm/m64 into 8 unsigned byte integers in mm using unsigned saturation.

66 0F 67 /r
PACKUSWB xmm1, xmm2/m128
A V/V SSE2 Converts 8 signed word integers from xmm1 and 8 signed word integers from xmm2/m128 into 16 unsigned byte integers in xmm1 using unsigned saturation.

VEX.128.66.0F.WIG 67 /r
VPACKUSWB xmm1, xmm2, xmm3/m128
B V/V AVX Converts 8 signed word integers from xmm2 and 8 signed word integers from xmm3/m128 into 16 unsigned byte integers in xmm1 using unsigned saturation.

VEX.256.66.0F.WIG 67 /r
VPACKUSWB ymm1, ymm2, ymm3/m256
B V/V AVX2 Converts 16 signed word integers from ymm2 and 16 signed word integers from ymm3/m256 into 32 unsigned byte integers in ymm1 using unsigned saturation.

EVEX.128.66.0F.WIG 67 /r
VPACKUSWB xmm1{k1}{z}, xmm2, xmm3/m128
C V/V AVX512VL AVX512BW Converts signed word integers from xmm2 and signed word integers from xmm3/m128 into unsigned byte integers in xmm1 using unsigned saturation under writemask k1.

EVEX.256.66.0F.WIG 67 /r
VPACKUSWB ymm1{k1}{z}, ymm2, ymm3/m256
C V/V AVX512VL AVX512BW Converts signed word integers from ymm2 and signed word integers from ymm3/m256 into unsigned byte integers in ymm1 using unsigned saturation under writemask k1.

EVEX.512.66.0F.WIG 67 /r
VPACKUSWB zmm1{k1}{z}, zmm2, zmm3/m512
C V/V AVX512BW Converts signed word integers from zmm2 and signed word integers from zmm3/m512 into unsigned byte integers in zmm1 using unsigned saturation under writemask k1.

NOTES:

1. See note in Section 2.4, “AVX and SSE Instruction Exception Specification” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A and Section 22.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

Instruction Operand Encoding

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (r, w) ModRM:r/m (r) NA NA

B NA ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) NA

C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA

Description

Converts 4, 8, 16 or 32 signed word integers from the destination operand (first operand) and 4, 8, 16 or 32 signed word integers from the source operand (second operand) into 8, 16, 32 or 64 unsigned byte integers and stores the result in the destination operand. (See Figure 4-6 for an example of the packing operation.) If a signed word integer value is beyond the range of an unsigned byte integer (that is, greater than FFH or less than 00H), the saturated unsigned byte integer value of FFH or 00H, respectively, is stored in the destination.

EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM register or a 512-bit memory location. The destination operand is a ZMM register.

VEX.256 and EVEX.256 encoded versions: The first source operand is a YMM register. The second source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are zeroed.

VEX.128 and EVEX.128 encoded versions: The first source operand is an XMM register. The second source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding register destination are zeroed.

128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.
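
As an illustration of the saturation rule described above, here is a small scalar sketch in portable C. It is not the manual's definition: the names saturate_word_to_ubyte and packuswb128_model are hypothetical, and the model covers only the 128-bit form (eight words from each source, sixteen result bytes). For the legacy SSE encoding, the first source would be the old contents of the destination register.

```c
#include <stdint.h>

/* Saturate one signed word to an unsigned byte:
   values below 00H become 00H, values above FFH become FFH. */
static uint8_t saturate_word_to_ubyte(int16_t w)
{
    if (w < 0) return 0x00;
    if (w > 0xFF) return 0xFF;
    return (uint8_t)w;
}

/* Hypothetical scalar model of the 128-bit pack: the low eight result
   bytes come from the first source, the high eight from the second. */
static void packuswb128_model(uint8_t out[16], const int16_t first[8],
                              const int16_t second[8])
{
    for (int k = 0; k < 8; k++) {
        out[k]     = saturate_word_to_ubyte(first[k]);
        out[8 + k] = saturate_word_to_ubyte(second[k]);
    }
}
```

Note that the inputs are interpreted as signed words, so a word such as 8000H (which is -32768) saturates to 00H rather than FFH.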

Operation

PACKUSWB (with 64-bit operands)

DEST[7:0] ← SaturateSignedWordToUnsignedByte DEST[15:0];

DEST[15:8] ← SaturateSignedWordToUnsignedByte DEST[31:16];

DEST[23:16] ← SaturateSignedWordToUnsignedByte DEST[47:32];

DEST[31:24] ← SaturateSignedWordToUnsignedByte DEST[63:48];

DEST[39:32] ← SaturateSignedWordToUnsignedByte SRC[15:0];

DEST[47:40] ← SaturateSignedWordToUnsignedByte SRC[31:16];

DEST[55:48] ← SaturateSignedWordToUnsignedByte SRC[47:32];

DEST[63:56] ← SaturateSignedWordToUnsignedByte SRC[63:48];

PACKUSWB (Legacy SSE instruction)

DEST[7:0] ← SaturateSignedWordToUnsignedByte (DEST[15:0]);
DEST[15:8] ← SaturateSignedWordToUnsignedByte (DEST[31:16]);
DEST[23:16] ← SaturateSignedWordToUnsignedByte (DEST[47:32]);
DEST[31:24] ← SaturateSignedWordToUnsignedByte (DEST[63:48]);
DEST[39:32] ← SaturateSignedWordToUnsignedByte (DEST[79:64]);
DEST[47:40] ← SaturateSignedWordToUnsignedByte (DEST[95:80]);
DEST[55:48] ← SaturateSignedWordToUnsignedByte (DEST[111:96]);
DEST[63:56] ← SaturateSignedWordToUnsignedByte (DEST[127:112]);
DEST[71:64] ← SaturateSignedWordToUnsignedByte (SRC[15:0]);
DEST[79:72] ← SaturateSignedWordToUnsignedByte (SRC[31:16]);
DEST[87:80] ← SaturateSignedWordToUnsignedByte (SRC[47:32]);
DEST[95:88] ← SaturateSignedWordToUnsignedByte (SRC[63:48]);
DEST[103:96] ← SaturateSignedWordToUnsignedByte (SRC[79:64]);
DEST[111:104] ← SaturateSignedWordToUnsignedByte (SRC[95:80]);
DEST[119:112] ← SaturateSignedWordToUnsignedByte (SRC[111:96]);
DEST[127:120] ← SaturateSignedWordToUnsignedByte (SRC[127:112]);

VPACKUSWB (VEX.128 encoded version)
DEST[7:0] ← SaturateSignedWordToUnsignedByte (SRC1[15:0]);
DEST[15:8] ← SaturateSignedWordToUnsignedByte (SRC1[31:16]);
DEST[23:16] ← SaturateSignedWordToUnsignedByte (SRC1[47:32]);
DEST[31:24] ← SaturateSignedWordToUnsignedByte (SRC1[63:48]);
DEST[39:32] ← SaturateSignedWordToUnsignedByte (SRC1[79:64]);
DEST[47:40] ← SaturateSignedWordToUnsignedByte (SRC1[95:80]);
DEST[55:48] ← SaturateSignedWordToUnsignedByte (SRC1[111:96]);
DEST[63:56] ← SaturateSignedWordToUnsignedByte (SRC1[127:112]);
DEST[71:64] ← SaturateSignedWordToUnsignedByte (SRC2[15:0]);
DEST[79:72] ← SaturateSignedWordToUnsignedByte (SRC2[31:16]);
DEST[87:80] ← SaturateSignedWordToUnsignedByte (SRC2[47:32]);
DEST[95:88] ← SaturateSignedWordToUnsignedByte (SRC2[63:48]);
DEST[103:96] ← SaturateSignedWordToUnsignedByte (SRC2[79:64]);
DEST[111:104] ← SaturateSignedWordToUnsignedByte (SRC2[95:80]);
DEST[119:112] ← SaturateSignedWordToUnsignedByte (SRC2[111:96]);
DEST[127:120] ← SaturateSignedWordToUnsignedByte (SRC2[127:112]);
DEST[MAXVL-1:128] ← 0;

VPACKUSWB (VEX.256 encoded version)

DEST[7:0] ← SaturateSignedWordToUnsignedByte (SRC1[15:0]);
DEST[15:8] ← SaturateSignedWordToUnsignedByte (SRC1[31:16]);
DEST[23:16] ← SaturateSignedWordToUnsignedByte (SRC1[47:32]);
DEST[31:24] ← SaturateSignedWordToUnsignedByte (SRC1[63:48]);
DEST[39:32] ← SaturateSignedWordToUnsignedByte (SRC1[79:64]);
DEST[47:40] ← SaturateSignedWordToUnsignedByte (SRC1[95:80]);
DEST[55:48] ← SaturateSignedWordToUnsignedByte (SRC1[111:96]);
DEST[63:56] ← SaturateSignedWordToUnsignedByte (SRC1[127:112]);
DEST[71:64] ← SaturateSignedWordToUnsignedByte (SRC2[15:0]);
DEST[79:72] ← SaturateSignedWordToUnsignedByte (SRC2[31:16]);
DEST[87:80] ← SaturateSignedWordToUnsignedByte (SRC2[47:32]);
DEST[95:88] ← SaturateSignedWordToUnsignedByte (SRC2[63:48]);
DEST[103:96] ← SaturateSignedWordToUnsignedByte (SRC2[79:64]);
DEST[111:104] ← SaturateSignedWordToUnsignedByte (SRC2[95:80]);
DEST[119:112] ← SaturateSignedWordToUnsignedByte (SRC2[111:96]);
DEST[127:120] ← SaturateSignedWordToUnsignedByte (SRC2[127:112]);
DEST[135:128] ← SaturateSignedWordToUnsignedByte (SRC1[143:128]);
DEST[143:136] ← SaturateSignedWordToUnsignedByte (SRC1[159:144]);
DEST[151:144] ← SaturateSignedWordToUnsignedByte (SRC1[175:160]);
DEST[159:152] ← SaturateSignedWordToUnsignedByte (SRC1[191:176]);
DEST[167:160] ← SaturateSignedWordToUnsignedByte (SRC1[207:192]);
DEST[175:168] ← SaturateSignedWordToUnsignedByte (SRC1[223:208]);
DEST[183:176] ← SaturateSignedWordToUnsignedByte (SRC1[239:224]);
DEST[191:184] ← SaturateSignedWordToUnsignedByte (SRC1[255:240]);
DEST[199:192] ← SaturateSignedWordToUnsignedByte (SRC2[143:128]);
DEST[207:200] ← SaturateSignedWordToUnsignedByte (SRC2[159:144]);
DEST[215:208] ← SaturateSignedWordToUnsignedByte (SRC2[175:160]);
DEST[223:216] ← SaturateSignedWordToUnsignedByte (SRC2[191:176]);
DEST[231:224] ← SaturateSignedWordToUnsignedByte (SRC2[207:192]);
DEST[239:232] ← SaturateSignedWordToUnsignedByte (SRC2[223:208]);
DEST[247:240] ← SaturateSignedWordToUnsignedByte (SRC2[239:224]);
DEST[255:248] ← SaturateSignedWordToUnsignedByte (SRC2[255:240]);
DEST[MAXVL-1:256] ← 0;

VPACKUSWB (EVEX encoded versions)

(KL, VL) = (16, 128), (32, 256), (64, 512)

TMP_DEST[7:0] ← SaturateSignedWordToUnsignedByte (SRC1[15:0]);
TMP_DEST[15:8] ← SaturateSignedWordToUnsignedByte (SRC1[31:16]);
TMP_DEST[23:16] ← SaturateSignedWordToUnsignedByte (SRC1[47:32]);
TMP_DEST[31:24] ← SaturateSignedWordToUnsignedByte (SRC1[63:48]);
TMP_DEST[39:32] ← SaturateSignedWordToUnsignedByte (SRC1[79:64]);
TMP_DEST[47:40] ← SaturateSignedWordToUnsignedByte (SRC1[95:80]);
TMP_DEST[55:48] ← SaturateSignedWordToUnsignedByte (SRC1[111:96]);
TMP_DEST[63:56] ← SaturateSignedWordToUnsignedByte (SRC1[127:112]);
TMP_DEST[71:64] ← SaturateSignedWordToUnsignedByte (SRC2[15:0]);
TMP_DEST[79:72] ← SaturateSignedWordToUnsignedByte (SRC2[31:16]);
TMP_DEST[87:80] ← SaturateSignedWordToUnsignedByte (SRC2[47:32]);
TMP_DEST[95:88] ← SaturateSignedWordToUnsignedByte (SRC2[63:48]);
TMP_DEST[103:96] ← SaturateSignedWordToUnsignedByte (SRC2[79:64]);
TMP_DEST[111:104] ← SaturateSignedWordToUnsignedByte (SRC2[95:80]);
TMP_DEST[119:112] ← SaturateSignedWordToUnsignedByte (SRC2[111:96]);
TMP_DEST[127:120] ← SaturateSignedWordToUnsignedByte (SRC2[127:112]);
IF VL >= 256
    TMP_DEST[135:128] ← SaturateSignedWordToUnsignedByte (SRC1[143:128]);
    TMP_DEST[143:136] ← SaturateSignedWordToUnsignedByte (SRC1[159:144]);
    TMP_DEST[151:144] ← SaturateSignedWordToUnsignedByte (SRC1[175:160]);
    TMP_DEST[159:152] ← SaturateSignedWordToUnsignedByte (SRC1[191:176]);
    TMP_DEST[167:160] ← SaturateSignedWordToUnsignedByte (SRC1[207:192]);
    TMP_DEST[175:168] ← SaturateSignedWordToUnsignedByte (SRC1[223:208]);
    TMP_DEST[183:176] ← SaturateSignedWordToUnsignedByte (SRC1[239:224]);
    TMP_DEST[191:184] ← SaturateSignedWordToUnsignedByte (SRC1[255:240]);
    TMP_DEST[199:192] ← SaturateSignedWordToUnsignedByte (SRC2[143:128]);
    TMP_DEST[207:200] ← SaturateSignedWordToUnsignedByte (SRC2[159:144]);
    TMP_DEST[215:208] ← SaturateSignedWordToUnsignedByte (SRC2[175:160]);
    TMP_DEST[223:216] ← SaturateSignedWordToUnsignedByte (SRC2[191:176]);
    TMP_DEST[231:224] ← SaturateSignedWordToUnsignedByte (SRC2[207:192]);
    TMP_DEST[239:232] ← SaturateSignedWordToUnsignedByte (SRC2[223:208]);
    TMP_DEST[247:240] ← SaturateSignedWordToUnsignedByte (SRC2[239:224]);
    TMP_DEST[255:248] ← SaturateSignedWordToUnsignedByte (SRC2[255:240]);
FI;
IF VL >= 512
    TMP_DEST[263:256] ← SaturateSignedWordToUnsignedByte (SRC1[271:256]);
    TMP_DEST[271:264] ← SaturateSignedWordToUnsignedByte (SRC1[287:272]);
    TMP_DEST[279:272] ← SaturateSignedWordToUnsignedByte (SRC1[303:288]);
    TMP_DEST[287:280] ← SaturateSignedWordToUnsignedByte (SRC1[319:304]);
    TMP_DEST[295:288] ← SaturateSignedWordToUnsignedByte (SRC1[335:320]);
    TMP_DEST[303:296] ← SaturateSignedWordToUnsignedByte (SRC1[351:336]);
    TMP_DEST[311:304] ← SaturateSignedWordToUnsignedByte (SRC1[367:352]);
    TMP_DEST[319:312] ← SaturateSignedWordToUnsignedByte (SRC1[383:368]);
    TMP_DEST[327:320] ← SaturateSignedWordToUnsignedByte (SRC2[271:256]);
    TMP_DEST[335:328] ← SaturateSignedWordToUnsignedByte (SRC2[287:272]);
    TMP_DEST[343:336] ← SaturateSignedWordToUnsignedByte (SRC2[303:288]);
    TMP_DEST[351:344] ← SaturateSignedWordToUnsignedByte (SRC2[319:304]);
    TMP_DEST[359:352] ← SaturateSignedWordToUnsignedByte (SRC2[335:320]);
    TMP_DEST[367:360] ← SaturateSignedWordToUnsignedByte (SRC2[351:336]);
    TMP_DEST[375:368] ← SaturateSignedWordToUnsignedByte (SRC2[367:352]);
    TMP_DEST[383:376] ← SaturateSignedWordToUnsignedByte (SRC2[383:368]);
    TMP_DEST[391:384] ← SaturateSignedWordToUnsignedByte (SRC1[399:384]);
    TMP_DEST[399:392] ← SaturateSignedWordToUnsignedByte (SRC1[415:400]);
    TMP_DEST[407:400] ← SaturateSignedWordToUnsignedByte (SRC1[431:416]);
    TMP_DEST[415:408] ← SaturateSignedWordToUnsignedByte (SRC1[447:432]);
    TMP_DEST[423:416] ← SaturateSignedWordToUnsignedByte (SRC1[463:448]);
    TMP_DEST[431:424] ← SaturateSignedWordToUnsignedByte (SRC1[479:464]);
    TMP_DEST[439:432] ← SaturateSignedWordToUnsignedByte (SRC1[495:480]);
    TMP_DEST[447:440] ← SaturateSignedWordToUnsignedByte (SRC1[511:496]);
    TMP_DEST[455:448] ← SaturateSignedWordToUnsignedByte (SRC2[399:384]);
    TMP_DEST[463:456] ← SaturateSignedWordToUnsignedByte (SRC2[415:400]);
    TMP_DEST[471:464] ← SaturateSignedWordToUnsignedByte (SRC2[431:416]);
    TMP_DEST[479:472] ← SaturateSignedWordToUnsignedByte (SRC2[447:432]);
    TMP_DEST[487:480] ← SaturateSignedWordToUnsignedByte (SRC2[463:448]);
    TMP_DEST[495:488] ← SaturateSignedWordToUnsignedByte (SRC2[479:464]);
    TMP_DEST[503:496] ← SaturateSignedWordToUnsignedByte (SRC2[495:480]);
    TMP_DEST[511:504] ← SaturateSignedWordToUnsignedByte (SRC2[511:496]);
FI;
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask*
        THEN DEST[i+7:i] ← TMP_DEST[i+7:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+7:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+7:i] ← 0
            FI
    FI;
ENDFOR;
DEST[MAXVL-1:VL] ← 0
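
The final masking loop above can be read as a per-byte select between the freshly computed result, the old destination, and zero. The sketch below models only that step in portable C. The function name apply_byte_writemask and its parameters are hypothetical; tmp stands for TMP_DEST, dest for the prior destination contents, and kl for the number of byte elements (16, 32 or 64).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the EVEX byte-granular writemask step.
   Bit j of k1 selects element j. A set bit copies the computed byte;
   a clear bit either zeroes the byte (zeroing-masking) or leaves the
   old destination byte in place (merging-masking). */
static void apply_byte_writemask(uint8_t *dest, const uint8_t *tmp,
                                 uint64_t k1, unsigned kl, int zeroing)
{
    assert(kl == 16 || kl == 32 || kl == 64);
    for (unsigned j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u) {
            dest[j] = tmp[j];
        } else if (zeroing) {
            dest[j] = 0;
        }
        /* merging-masking with a clear bit: dest[j] is left unchanged */
    }
}
```

Passing an all-ones mask reproduces the *no writemask* case, since every element then takes the computed value.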

Intel C/C++ Compiler Intrinsic Equivalents

VPACKUSWB __m512i _mm512_packus_epi16(__m512i m1, __m512i m2);
VPACKUSWB __m512i _mm512_mask_packus_epi16(__m512i s, __mmask64 k, __m512i m1, __m512i m2);
VPACKUSWB __m512i _mm512_maskz_packus_epi16(__mmask64 k, __m512i m1, __m512i m2);
VPACKUSWB __m256i _mm256_mask_packus_epi16(__m256i s, __mmask32 k, __m256i m1, __m256i m2);
VPACKUSWB __m256i _mm256_maskz_packus_epi16(__mmask32 k, __m256i m1, __m256i m2);
VPACKUSWB __m128i _mm_mask_packus_epi16(__m128i s, __mmask16 k, __m128i m1, __m128i m2);
VPACKUSWB __m128i _mm_maskz_packus_epi16(__mmask16 k, __m128i m1, __m128i m2);
PACKUSWB __m64 _mm_packs_pu16(__m64 m1, __m64 m2);
(V)PACKUSWB __m128i _mm_packus_epi16(__m128i m1, __m128i m2);
VPACKUSWB __m256i _mm256_packus_epi16(__m256i m1, __m256i m2);

Flags Affected

None

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.

EVEX-encoded instruction, see Exceptions Type E4NF.nb.

PADDB/PADDW/PADDD/PADDQ—Add Packed Integers

Opcode/Instruction  Op/En  64/32 bit Mode Support  CPUID Feature Flag  Description

NP 0F FC /r (note 1)
PADDB mm, mm/m64
A V/V MMX Add packed byte integers from mm/m64 and mm.

NP 0F FD /r (note 1)
PADDW mm, mm/m64
A V/V MMX Add packed word integers from mm/m64 and mm.

NP 0F FE /r (note 1)
PADDD mm, mm/m64
A V/V MMX Add packed doubleword integers from mm/m64 and mm.

NP 0F D4 /r (note 1)
PADDQ mm, mm/m64
A V/V MMX Add packed quadword integers from mm/m64 and mm.

66 0F FC /r
PADDB xmm1, xmm2/m128
A V/V SSE2 Add packed byte integers from xmm2/m128 and xmm1.

66 0F FD /r
PADDW xmm1, xmm2/m128
A V/V SSE2 Add packed word integers from xmm2/m128 and xmm1.

66 0F FE /r
PADDD xmm1, xmm2/m128
A V/V SSE2 Add packed doubleword integers from xmm2/m128 and xmm1.

66 0F D4 /r
PADDQ xmm1, xmm2/m128
A V/V SSE2 Add packed quadword integers from xmm2/m128 and xmm1.

VEX.128.66.0F.WIG FC /r
VPADDB xmm1, xmm2, xmm3/m128
B V/V AVX Add packed byte integers from xmm2 and xmm3/m128 and store in xmm1.

VEX.128.66.0F.WIG FD /r
VPADDW xmm1, xmm2, xmm3/m128
B V/V AVX Add packed word integers from xmm2 and xmm3/m128 and store in xmm1.

VEX.128.66.0F.WIG FE /r
VPADDD xmm1, xmm2, xmm3/m128
B V/V AVX Add packed doubleword integers from xmm2 and xmm3/m128 and store in xmm1.

VEX.128.66.0F.WIG D4 /r
VPADDQ xmm1, xmm2, xmm3/m128
B V/V AVX Add packed quadword integers from xmm2 and xmm3/m128 and store in xmm1.

VEX.256.66.0F.WIG FC /r
VPADDB ymm1, ymm2, ymm3/m256
B V/V AVX2 Add packed byte integers from ymm2 and ymm3/m256 and store in ymm1.

VEX.256.66.0F.WIG FD /r
VPADDW ymm1, ymm2, ymm3/m256
B V/V AVX2 Add packed word integers from ymm2 and ymm3/m256 and store in ymm1.

VEX.256.66.0F.WIG FE /r
VPADDD ymm1, ymm2, ymm3/m256
B V/V AVX2 Add packed doubleword integers from ymm2 and ymm3/m256 and store in ymm1.

VEX.256.66.0F.WIG D4 /r
VPADDQ ymm1, ymm2, ymm3/m256
B V/V AVX2 Add packed quadword integers from ymm2 and ymm3/m256 and store in ymm1.

EVEX.128.66.0F.WIG FC /r
VPADDB xmm1 {k1}{z}, xmm2, xmm3/m128
C V/V AVX512VL AVX512BW Add packed byte integers from xmm2 and xmm3/m128 and store in xmm1 using writemask k1.

EVEX.128.66.0F.WIG FD /r
VPADDW xmm1 {k1}{z}, xmm2, xmm3/m128
C V/V AVX512VL AVX512BW Add packed word integers from xmm2 and xmm3/m128 and store in xmm1 using writemask k1.

EVEX.128.66.0F.W0 FE /r
VPADDD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
D V/V AVX512VL AVX512F Add packed doubleword integers from xmm2 and xmm3/m128/m32bcst and store in xmm1 using writemask k1.

EVEX.128.66.0F.W1 D4 /r
VPADDQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
D V/V AVX512VL AVX512F Add packed quadword integers from xmm2 and xmm3/m128/m64bcst and store in xmm1 using writemask k1.

EVEX.256.66.0F.WIG FC /r
VPADDB ymm1 {k1}{z}, ymm2, ymm3/m256
C V/V AVX512VL AVX512BW Add packed byte integers from ymm2 and ymm3/m256 and store in ymm1 using writemask k1.

EVEX.256.66.0F.WIG FD /r
VPADDW ymm1 {k1}{z}, ymm2, ymm3/m256
C V/V AVX512VL AVX512BW Add packed word integers from ymm2 and ymm3/m256 and store in ymm1 using writemask k1.

EVEX.256.66.0F.W0 FE /r
VPADDD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
D V/V AVX512VL AVX512F Add packed doubleword integers from ymm2 and ymm3/m256/m32bcst and store in ymm1 using writemask k1.

EVEX.256.66.0F.W1 D4 /r
VPADDQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
D V/V AVX512VL AVX512F Add packed quadword integers from ymm2 and ymm3/m256/m64bcst and store in ymm1 using writemask k1.

EVEX.512.66.0F.WIG FC /r
VPADDB zmm1 {k1}{z}, zmm2, zmm3/m512
C V/V AVX512BW Add packed byte integers from zmm2 and zmm3/m512 and store in zmm1 using writemask k1.

EVEX.512.66.0F.WIG FD /r
VPADDW zmm1 {k1}{z}, zmm2, zmm3/m512
C V/V AVX512BW Add packed word integers from zmm2 and zmm3/m512 and store in zmm1 using writemask k1.

EVEX.512.66.0F.W0 FE /r
VPADDD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst
D V/V AVX512F Add packed doubleword integers from zmm2 and zmm3/m512/m32bcst and store in zmm1 using writemask k1.

EVEX.512.66.0F.W1 D4 /r
VPADDQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst
D V/V AVX512F Add packed quadword integers from zmm2 and zmm3/m512/m64bcst and store in zmm1 using writemask k1.

NOTES:

1. See note in Section 2.4, “AVX and SSE Instruction Exception Specification” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A and Section 22.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

Instruction Operand Encoding

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (r, w) ModRM:r/m (r) NA NA

B NA ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) NA

C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA

D Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA

Description

Performs a SIMD add of the packed integers from the source operand (second operand) and the destination operand (first operand), and stores the packed integer results in the destination operand. See Figure 9-4 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a SIMD operation. Overflow is handled with wraparound, as described in the following paragraphs.

The PADDB and VPADDB instructions add packed byte integers from the first source operand and second source operand and store the packed integer results in the destination operand. When an individual result is too large to be represented in 8 bits (overflow), the result is wrapped around and the low 8 bits are written to the destination operand (that is, the carry is ignored).

The PADDW and VPADDW instructions add packed word integers from the first source operand and second source operand and store the packed integer results in the destination operand. When an individual result is too large to be represented in 16 bits (overflow), the result is wrapped around and the low 16 bits are written to the destination operand (that is, the carry is ignored).

The PADDD and VPADDD instructions add packed doubleword integers from the first source operand and second source operand and store the packed integer results in the destination operand. When an individual result is too large to be represented in 32 bits (overflow), the result is wrapped around and the low 32 bits are written to the destination operand (that is, the carry is ignored).

The PADDQ and VPADDQ instructions add packed quadword integers from the first source operand and second source operand and store the packed integer results in the destination operand. When a quadword result is too large to be represented in 64 bits (overflow), the result is wrapped around and the low 64 bits are written to the destination operand (that is, the carry is ignored).

Note that the (V)PADDB, (V)PADDW, (V)PADDD and (V)PADDQ instructions can operate on either unsigned or signed (two's complement notation) packed integers; however, they do not set bits in the EFLAGS register to indicate overflow and/or a carry. To prevent undetected overflow conditions, software must control the ranges of values operated on.
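
Because no flag is set, any overflow check has to be done in software on the operands or results. The sketch below models a single byte element in portable C to show the wraparound and two common ways to detect it. The function names are hypothetical, and a real implementation would apply the same comparisons across whole vectors rather than one element at a time.

```c
#include <stdint.h>

/* One byte element of PADDB: the sum wraps modulo 256 and the
   carry out of bit 7 is discarded. */
static uint8_t paddb_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Unsigned view: a carry occurred if the wrapped sum is smaller
   than one of the operands. */
static int unsigned_byte_add_carries(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b) < a;
}

/* Signed (two's complement) view: overflow occurred if both operands
   have the same sign and the wrapped sum has the opposite sign. */
static int signed_byte_add_overflows(int8_t a, int8_t b)
{
    uint8_t ua = (uint8_t)a;
    uint8_t ub = (uint8_t)b;
    uint8_t r = (uint8_t)(ua + ub);
    return ((ua ^ r) & (ub ^ r) & 0x80) != 0;
}
```

For example, adding the bytes 200 and 100 yields 44 in the destination, and the same bit pattern arithmetic gives 80H when 7FH and 01H are added as signed values.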

EVEX encoded VPADDD/Q: The first source operand is a ZMM/YMM/XMM register. The second source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. The destination operand is a ZMM/YMM/XMM register updated according to the writemask.

EVEX encoded VPADDB/W: The first source operand is a ZMM/YMM/XMM register. The second source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM register updated according to the writemask.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the destination are cleared.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.

128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.

Operation

PADDB (with 64-bit operands)
DEST[7:0] ← DEST[7:0] + SRC[7:0];
(* Repeat add operation for 2nd through 7th byte *)
DEST[63:56] ← DEST[63:56] + SRC[63:56];

PADDW (with 64-bit operands)
DEST[15:0] ← DEST[15:0] + SRC[15:0];
(* Repeat add operation for 2nd and 3rd word *)
DEST[63:48] ← DEST[63:48] + SRC[63:48];

PADDD (with 64-bit operands)
DEST[31:0] ← DEST[31:0] + SRC[31:0];
DEST[63:32] ← DEST[63:32] + SRC[63:32];

PADDQ (with 64-bit operands)
DEST[63:0] ← DEST[63:0] + SRC[63:0];

PADDB (Legacy SSE instruction)
DEST[7:0] ← DEST[7:0] + SRC[7:0];
(* Repeat add operation for 2nd through 15th byte *)
DEST[127:120] ← DEST[127:120] + SRC[127:120];
DEST[MAXVL-1:128] (Unmodified)

PADDW (Legacy SSE instruction)
DEST[15:0] ← DEST[15:0] + SRC[15:0];
(* Repeat add operation for 2nd through 7th word *)
DEST[127:112] ← DEST[127:112] + SRC[127:112];
DEST[MAXVL-1:128] (Unmodified)

PADDD (Legacy SSE instruction)
DEST[31:0] ← DEST[31:0] + SRC[31:0];
(* Repeat add operation for 2nd and 3rd doubleword *)
DEST[127:96] ← DEST[127:96] + SRC[127:96];
DEST[MAXVL-1:128] (Unmodified)

PADDQ (Legacy SSE instruction)
DEST[63:0] ← DEST[63:0] + SRC[63:0];
DEST[127:64] ← DEST[127:64] + SRC[127:64];
DEST[MAXVL-1:128] (Unmodified)

VPADDB (VEX.128 encoded instruction)
DEST[7:0] ← SRC1[7:0] + SRC2[7:0];
(* Repeat add operation for 2nd through 15th byte *)
DEST[127:120] ← SRC1[127:120] + SRC2[127:120];
DEST[MAXVL-1:128] ← 0;

VPADDW (VEX.128 encoded instruction)
DEST[15:0] ← SRC1[15:0] + SRC2[15:0];
(* Repeat add operation for 2nd through 7th word *)
DEST[127:112] ← SRC1[127:112] + SRC2[127:112];
DEST[MAXVL-1:128] ← 0;

VPADDD (VEX.128 encoded instruction)
DEST[31:0] ← SRC1[31:0] + SRC2[31:0];
(* Repeat add operation for 2nd and 3rd doubleword *)
DEST[127:96] ← SRC1[127:96] + SRC2[127:96];
DEST[MAXVL-1:128] ← 0;

VPADDQ (VEX.128 encoded instruction)
DEST[63:0] ← SRC1[63:0] + SRC2[63:0];
DEST[127:64] ← SRC1[127:64] + SRC2[127:64];
DEST[MAXVL-1:128] ← 0;

VPADDB (VEX.256 encoded instruction)
DEST[7:0] ← SRC1[7:0] + SRC2[7:0];
(* Repeat add operation for 2nd through 31st byte *)
DEST[255:248] ← SRC1[255:248] + SRC2[255:248];
DEST[MAXVL-1:256] ← 0;

VPADDW (VEX.256 encoded instruction)
DEST[15:0] ← SRC1[15:0] + SRC2[15:0];
(* Repeat add operation for 2nd through 15th word *)
DEST[255:240] ← SRC1[255:240] + SRC2[255:240];
DEST[MAXVL-1:256] ← 0;

VPADDD (VEX.256 encoded instruction)
DEST[31:0] ← SRC1[31:0] + SRC2[31:0];
(* Repeat add operation for 2nd through 7th doubleword *)
DEST[255:224] ← SRC1[255:224] + SRC2[255:224];
DEST[MAXVL-1:256] ← 0;

VPADDQ (VEX.256 encoded instruction)
DEST[63:0] ← SRC1[63:0] + SRC2[63:0];
DEST[127:64] ← SRC1[127:64] + SRC2[127:64];
DEST[191:128] ← SRC1[191:128] + SRC2[191:128];
DEST[255:192] ← SRC1[255:192] + SRC2[255:192];
DEST[MAXVL-1:256] ← 0;

VPADDB (EVEX encoded versions)

(KL, VL) = (16, 128), (32, 256), (64, 512)

FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask*
        THEN DEST[i+7:i] ← SRC1[i+7:i] + SRC2[i+7:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+7:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+7:i] ← 0
            FI
    FI;
ENDFOR;
DEST[MAXVL-1:VL] ← 0

VPADDW (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← SRC1[i+15:i] + SRC2[i+15:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+15:i] ← 0
            FI
    FI;
ENDFOR;
DEST[MAXVL-1:VL] ← 0

VPADDD (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1

    i ← j * 32

    IF k1[j] OR *no writemask*

        THEN

            IF (EVEX.b = 1) AND (SRC2 *is memory*)

                THEN DEST[i+31:i] ← SRC1[i+31:i] + SRC2[31:0]

                ELSE DEST[i+31:i] ← SRC1[i+31:i] + SRC2[i+31:i]

            FI;

        ELSE

            IF *merging-masking* ; merging-masking

                THEN *DEST[i+31:i] remains unchanged*

                ELSE *zeroing-masking* ; zeroing-masking

                    DEST[i+31:i] ← 0

            FI

    FI;

ENDFOR;

DEST[MAXVL-1:VL] ← 0


VPADDQ (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1

    i ← j * 64

    IF k1[j] OR *no writemask*

        THEN

            IF (EVEX.b = 1) AND (SRC2 *is memory*)

                THEN DEST[i+63:i] ← SRC1[i+63:i] + SRC2[63:0]

                ELSE DEST[i+63:i] ← SRC1[i+63:i] + SRC2[i+63:i]

            FI;

        ELSE

            IF *merging-masking* ; merging-masking

                THEN *DEST[i+63:i] remains unchanged*

                ELSE *zeroing-masking* ; zeroing-masking

                    DEST[i+63:i] ← 0

            FI

    FI;

ENDFOR;

DEST[MAXVL-1:VL] ← 0
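The embedded-broadcast branch (EVEX.b = 1 with a memory source) reuses element 0 of the second source for every lane. A rough scalar sketch in C of the unmasked quadword data path is shown here; `vpaddq_bcst` and its `broadcast` flag are hypothetical names chosen for illustration, and masking is left out to keep the broadcast behavior visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical scalar model of the unmasked EVEX VPADDQ data path.
   When broadcast is true (EVEX.b = 1 with a memory source), element 0
   of src2 is added to every lane; otherwise lanes pair up one to one. */
static void vpaddq_bcst(uint64_t *dest, const uint64_t *src1,
                        const uint64_t *src2, bool broadcast, unsigned kl)
{
    assert(kl == 2 || kl == 4 || kl == 8);
    for (unsigned j = 0; j < kl; j++) {
        uint64_t rhs = broadcast ? src2[0] : src2[j];
        dest[j] = src1[j] + rhs;   /* unsigned add wraps modulo 2^64 */
    }
}
```

In the broadcast case only the first 64-bit element of the memory operand is read, which is why the pseudocode indexes SRC2[63:0] regardless of the lane.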

Intel C/C++ Compiler Intrinsic Equivalents

VPADDB __m512i _mm512_add_epi8 (__m512i a, __m512i b)

VPADDW __m512i _mm512_add_epi16 (__m512i a, __m512i b)

VPADDB __m512i _mm512_mask_add_epi8 (__m512i s, __mmask64 m, __m512i a, __m512i b)

VPADDW __m512i _mm512_mask_add_epi16 (__m512i s, __mmask32 m, __m512i a, __m512i b)

VPADDB __m512i _mm512_maskz_add_epi8 (__mmask64 m, __m512i a, __m512i b)

VPADDW __m512i _mm512_maskz_add_epi16 (__mmask32 m, __m512i a, __m512i b)

VPADDB __m256i _mm256_mask_add_epi8 (__m256i s, __mmask32 m, __m256i a, __m256i b)

VPADDW __m256i _mm256_mask_add_epi16 (__m256i s, __mmask16 m, __m256i a, __m256i b)

VPADDB __m256i _mm256_maskz_add_epi8 (__mmask32 m, __m256i a, __m256i b)

VPADDW __m256i _mm256_maskz_add_epi16 (__mmask16 m, __m256i a, __m256i b)

VPADDB __m128i _mm_mask_add_epi8 (__m128i s, __mmask16 m, __m128i a, __m128i b)

VPADDW __m128i _mm_mask_add_epi16 (__m128i s, __mmask8 m, __m128i a, __m128i b)

VPADDB __m128i _mm_maskz_add_epi8 (__mmask16 m, __m128i a, __m128i b)

VPADDW __m128i _mm_maskz_add_epi16 (__mmask8 m, __m128i a, __m128i b)

VPADDD __m512i _mm512_add_epi32( __m512i a, __m512i b);

VPADDD __m512i _mm512_mask_add_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b);

VPADDD __m512i _mm512_maskz_add_epi32( __mmask16 k, __m512i a, __m512i b);

VPADDD __m256i _mm256_mask_add_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b);

VPADDD __m256i _mm256_maskz_add_epi32( __mmask8 k, __m256i a, __m256i b);

VPADDD __m128i _mm_mask_add_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b);

VPADDD __m128i _mm_maskz_add_epi32( __mmask8 k, __m128i a, __m128i b);

VPADDQ __m512i _mm512_add_epi64( __m512i a, __m512i b);

VPADDQ __m512i _mm512_mask_add_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b);

VPADDQ __m512i _mm512_maskz_add_epi64( __mmask8 k, __m512i a, __m512i b);

VPADDQ __m256i _mm256_mask_add_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b);

VPADDQ __m256i _mm256_maskz_add_epi64( __mmask8 k, __m256i a, __m256i b);

VPADDQ __m128i _mm_mask_add_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b);

VPADDQ __m128i _mm_maskz_add_epi64( __mmask8 k, __m128i a, __m128i b);

PADDB __m128i _mm_add_epi8 (__m128i a, __m128i b);

PADDW __m128i _mm_add_epi16 ( __m128i a, __m128i b);

PADDD __m128i _mm_add_epi32 ( __m128i a, __m128i b);

PADDQ __m128i _mm_add_epi64 ( __m128i a, __m128i b);


VPADDB __m256i _mm256_add_epi8 (__m256i a, __m256i b);

VPADDW __m256i _mm256_add_epi16 ( __m256i a, __m256i b);

VPADDD __m256i _mm256_add_epi32 ( __m256i a, __m256i b);

VPADDQ __m256i _mm256_add_epi64 ( __m256i a, __m256i b);

PADDB __m64 _mm_add_pi8(__m64 m1, __m64 m2)

PADDW __m64 _mm_add_pi16(__m64 m1, __m64 m2)

PADDD __m64 _mm_add_pi32(__m64 m1, __m64 m2)

PADDQ __m64 _mm_add_si64(__m64 m1, __m64 m2)

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.

EVEX-encoded VPADDD/Q, see Exceptions Type E4.

EVEX-encoded VPADDB/W, see Exceptions Type E4.nb.


PADDSB/PADDSW—Add Packed Signed Integers with Signed Saturation

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F EC /r¹ | PADDSB mm, mm/m64 | A | V/V | MMX | Add packed signed byte integers from mm/m64 and mm and saturate the results.
66 0F EC /r | PADDSB xmm1, xmm2/m128 | A | V/V | SSE2 | Add packed signed byte integers from xmm2/m128 and xmm1 and saturate the results.
NP 0F ED /r¹ | PADDSW mm, mm/m64 | A | V/V | MMX | Add packed signed word integers from mm/m64 and mm and saturate the results.
66 0F ED /r | PADDSW xmm1, xmm2/m128 | A | V/V | SSE2 | Add packed signed word integers from xmm2/m128 and xmm1 and saturate the results.
VEX.128.66.0F.WIG EC /r | VPADDSB xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Add packed signed byte integers from xmm3/m128 and xmm2 and saturate the results.
VEX.128.66.0F.WIG ED /r | VPADDSW xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Add packed signed word integers from xmm3/m128 and xmm2 and saturate the results.
VEX.256.66.0F.WIG EC /r | VPADDSB ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Add packed signed byte integers from ymm2 and ymm3/m256 and store the saturated results in ymm1.
VEX.256.66.0F.WIG ED /r | VPADDSW ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Add packed signed word integers from ymm2 and ymm3/m256 and store the saturated results in ymm1.
EVEX.128.66.0F.WIG EC /r | VPADDSB xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | AVX512VL AVX512BW | Add packed signed byte integers from xmm2 and xmm3/m128 and store the saturated results in xmm1 under writemask k1.
EVEX.256.66.0F.WIG EC /r | VPADDSB ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | AVX512VL AVX512BW | Add packed signed byte integers from ymm2 and ymm3/m256 and store the saturated results in ymm1 under writemask k1.
EVEX.512.66.0F.WIG EC /r | VPADDSB zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW | Add packed signed byte integers from zmm2 and zmm3/m512 and store the saturated results in zmm1 under writemask k1.
EVEX.128.66.0F.WIG ED /r | VPADDSW xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | AVX512VL AVX512BW | Add packed signed word integers from xmm2 and xmm3/m128 and store the saturated results in xmm1 under writemask k1.
EVEX.256.66.0F.WIG ED /r | VPADDSW ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | AVX512VL AVX512BW | Add packed signed word integers from ymm2 and ymm3/m256 and store the saturated results in ymm1 under writemask k1.
EVEX.512.66.0F.WIG ED /r | VPADDSW zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW | Add packed signed word integers from zmm2 and zmm3/m512 and store the saturated results in zmm1 under writemask k1.

NOTES:

1. See note in Section 2.4, “AVX and SSE Instruction Exception Specification” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A and Section 22.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
C | Full Mem | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description

Performs a SIMD add of the packed signed integers from the source operand (second operand) and the destination operand (first operand), and stores the packed integer results in the destination operand. See Figure 9-4 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a SIMD operation. Overflow is handled with signed saturation, as described in the following paragraphs.

(V)PADDSB performs a SIMD add of the packed signed integers with saturation from the first source operand and second source operand and stores the packed integer results in the destination operand. When an individual byte result is beyond the range of a signed byte integer (that is, greater than 7FH or less than 80H), the saturated value of 7FH or 80H, respectively, is written to the destination operand.

(V)PADDSW performs a SIMD add of the packed signed word integers with saturation from the first source operand and second source operand and stores the packed integer results in the destination operand. When an individual word result is beyond the range of a signed word integer (that is, greater than 7FFFH or less than 8000H), the saturated value of 7FFFH or 8000H, respectively, is written to the destination operand.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand is a ZMM/YMM/XMM register or a memory location. The destination operand is a ZMM/YMM/XMM register.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding register destination are zeroed.

128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.

Operation

PADDSB (with 64-bit operands)

DEST[7:0] ← SaturateToSignedByte(DEST[7:0] + SRC[7:0]);

(* Repeat add operation for 2nd through 7th bytes *)

DEST[63:56] ← SaturateToSignedByte(DEST[63:56] + SRC[63:56]);

PADDSB (with 128-bit operands)

DEST[7:0] ← SaturateToSignedByte (DEST[7:0] + SRC[7:0]);

(* Repeat add operation for 2nd through 15th bytes *)

DEST[127:120] ← SaturateToSignedByte (DEST[127:120] + SRC[127:120]);

VPADDSB (VEX.128 encoded version)

DEST[7:0] ← SaturateToSignedByte (SRC1[7:0] + SRC2[7:0]);

(* Repeat add operation for 2nd through 15th bytes *)

DEST[127:120] ← SaturateToSignedByte (SRC1[127:120] + SRC2[127:120]);

DEST[MAXVL-1:128] ← 0

VPADDSB (VEX.256 encoded version)

DEST[7:0] ← SaturateToSignedByte (SRC1[7:0] + SRC2[7:0]);

(* Repeat add operation for 2nd through 31st bytes *)

DEST[255:248] ← SaturateToSignedByte (SRC1[255:248] + SRC2[255:248]);

VPADDSB (EVEX encoded versions)

(KL, VL) = (16, 128), (32, 256), (64, 512)

FOR j ← 0 TO KL-1

    i ← j * 8

    IF k1[j] OR *no writemask*

        THEN DEST[i+7:i] ← SaturateToSignedByte (SRC1[i+7:i] + SRC2[i+7:i])

        ELSE

            IF *merging-masking* ; merging-masking

                THEN *DEST[i+7:i] remains unchanged*

                ELSE *zeroing-masking* ; zeroing-masking

                    DEST[i+7:i] ← 0

            FI

    FI;

ENDFOR;

DEST[MAXVL-1:VL] ← 0

PADDSW (with 64-bit operands)

DEST[15:0] ← SaturateToSignedWord(DEST[15:0] + SRC[15:0] );

(* Repeat add operation for 2nd and 3rd words *)

DEST[63:48] ← SaturateToSignedWord(DEST[63:48] + SRC[63:48] );

PADDSW (with 128-bit operands)

DEST[15:0] ← SaturateToSignedWord (DEST[15:0] + SRC[15:0]);

(* Repeat add operation for 2nd through 7th words *)

DEST[127:112] ← SaturateToSignedWord (DEST[127:112] + SRC[127:112]);

VPADDSW (VEX.128 encoded version)

DEST[15:0] ← SaturateToSignedWord (SRC1[15:0] + SRC2[15:0]);

(* Repeat add operation for 2nd through 7th words *)

DEST[127:112] ← SaturateToSignedWord (SRC1[127:112] + SRC2[127:112]);

DEST[MAXVL-1:128] ← 0

VPADDSW (VEX.256 encoded version)

DEST[15:0] ← SaturateToSignedWord (SRC1[15:0] + SRC2[15:0]);

(* Repeat add operation for 2nd through 15th words *)

DEST[255:240] ← SaturateToSignedWord (SRC1[255:240] + SRC2[255:240])

VPADDSW (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j ← 0 TO KL-1

    i ← j * 16

    IF k1[j] OR *no writemask*

        THEN DEST[i+15:i] ← SaturateToSignedWord (SRC1[i+15:i] + SRC2[i+15:i])

        ELSE

            IF *merging-masking* ; merging-masking

                THEN *DEST[i+15:i] remains unchanged*

                ELSE *zeroing-masking* ; zeroing-masking

                    DEST[i+15:i] ← 0

            FI

    FI;

ENDFOR;

DEST[MAXVL-1:VL] ← 0
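The SaturateToSignedByte and SaturateToSignedWord steps used throughout the pseudocode above can be sketched in scalar C as follows. The function names are hypothetical stand-ins for the pseudocode operators, and the sketch assumes the sum is formed in a wider type before clamping, which is one straightforward way to model the behavior.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a wide intermediate sum to the signed byte range 80H..7FH. */
static int8_t saturate_to_signed_byte(int32_t v)
{
    if (v > INT8_MAX) return INT8_MAX;   /* 7FH */
    if (v < INT8_MIN) return INT8_MIN;   /* 80H */
    return (int8_t)v;
}

/* Clamp a wide intermediate sum to the signed word range 8000H..7FFFH. */
static int16_t saturate_to_signed_word(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX; /* 7FFFH */
    if (v < INT16_MIN) return INT16_MIN; /* 8000H */
    return (int16_t)v;
}

/* Hypothetical scalar models of a single PADDSB and PADDSW lane. */
static int8_t paddsb_lane(int8_t a, int8_t b)
{
    return saturate_to_signed_byte((int32_t)a + (int32_t)b);
}

static int16_t paddsw_lane(int16_t a, int16_t b)
{
    return saturate_to_signed_word((int32_t)a + (int32_t)b);
}
```

For instance, adding 100 and 100 as signed bytes gives 127 (7FH) rather than the wrapped value, and adding -100 and -100 gives -128 (80H).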


Intel C/C++ Compiler Intrinsic Equivalents

PADDSB: __m64 _mm_adds_pi8(__m64 m1, __m64 m2)

(V)PADDSB: __m128i _mm_adds_epi8 ( __m128i a, __m128i b)

VPADDSB: __m256i _mm256_adds_epi8 ( __m256i a, __m256i b)

PADDSW: __m64 _mm_adds_pi16(__m64 m1, __m64 m2)

(V)PADDSW: __m128i _mm_adds_epi16 ( __m128i a, __m128i b)

VPADDSW: __m256i _mm256_adds_epi16 ( __m256i a, __m256i b)

VPADDSB: __m512i _mm512_adds_epi8 (__m512i a, __m512i b)

VPADDSW: __m512i _mm512_adds_epi16 (__m512i a, __m512i b)

VPADDSB: __m512i _mm512_mask_adds_epi8 (__m512i s, __mmask64 m, __m512i a, __m512i b)

VPADDSW: __m512i _mm512_mask_adds_epi16 (__m512i s, __mmask32 m, __m512i a, __m512i b)

VPADDSB: __m512i _mm512_maskz_adds_epi8 (__mmask64 m, __m512i a, __m512i b)

VPADDSW: __m512i _mm512_maskz_adds_epi16 (__mmask32 m, __m512i a, __m512i b)

VPADDSB: __m256i _mm256_mask_adds_epi8 (__m256i s, __mmask32 m, __m256i a, __m256i b)

VPADDSW: __m256i _mm256_mask_adds_epi16 (__m256i s, __mmask16 m, __m256i a, __m256i b)

VPADDSB: __m256i _mm256_maskz_adds_epi8 (__mmask32 m, __m256i a, __m256i b)

VPADDSW: __m256i _mm256_maskz_adds_epi16 (__mmask16 m, __m256i a, __m256i b)

VPADDSB: __m128i _mm_mask_adds_epi8 (__m128i s, __mmask16 m, __m128i a, __m128i b)

VPADDSW: __m128i _mm_mask_adds_epi16 (__m128i s, __mmask8 m, __m128i a, __m128i b)

VPADDSB: __m128i _mm_maskz_adds_epi8 (__mmask16 m, __m128i a, __m128i b)

VPADDSW: __m128i _mm_maskz_adds_epi16 (__mmask8 m, __m128i a, __m128i b)

Flags Affected

None.

SIMD Floating-Point Exceptions

None.

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.

EVEX-encoded instruction, see Exceptions Type E4.nb.


PADDUSB/PADDUSW—Add Packed Unsigned Integers with Unsigned Saturation

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F DC /r¹ | PADDUSB mm, mm/m64 | A | V/V | MMX | Add packed unsigned byte integers from mm/m64 and mm and saturate the results.
66 0F DC /r | PADDUSB xmm1, xmm2/m128 | A | V/V | SSE2 | Add packed unsigned byte integers from xmm2/m128 and xmm1 and saturate the results.
NP 0F DD /r¹ | PADDUSW mm, mm/m64 | A | V/V | MMX | Add packed unsigned word integers from mm/m64 and mm and saturate the results.
66 0F DD /r | PADDUSW xmm1, xmm2/m128 | A | V/V | SSE2 | Add packed unsigned word integers from xmm2/m128 to xmm1 and saturate the results.
VEX.128.66.0F.WIG DC /r | VPADDUSB xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Add packed unsigned byte integers from xmm3/m128 to xmm2 and saturate the results.
VEX.128.66.0F.WIG DD /r | VPADDUSW xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Add packed unsigned word integers from xmm3/m128 to xmm2 and saturate the results.
VEX.256.66.0F.WIG DC /r | VPADDUSB ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Add packed unsigned byte integers from ymm2 and ymm3/m256 and store the saturated results in ymm1.
VEX.256.66.0F.WIG DD /r | VPADDUSW ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Add packed unsigned word integers from ymm2 and ymm3/m256 and store the saturated results in ymm1.
EVEX.128.66.0F.WIG DC /r | VPADDUSB xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | AVX512VL AVX512BW | Add packed unsigned byte integers from xmm2 and xmm3/m128 and store the saturated results in xmm1 under writemask k1.
EVEX.256.66.0F.WIG DC /r | VPADDUSB ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | AVX512VL AVX512BW | Add packed unsigned byte integers from ymm2 and ymm3/m256 and store the saturated results in ymm1 under writemask k1.
EVEX.512.66.0F.WIG DC /r | VPADDUSB zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW | Add packed unsigned byte integers from zmm2 and zmm3/m512 and store the saturated results in zmm1 under writemask k1.
EVEX.128.66.0F.WIG DD /r | VPADDUSW xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | AVX512VL AVX512BW | Add packed unsigned word integers from xmm2 and xmm3/m128 and store the saturated results in xmm1 under writemask k1.
EVEX.256.66.0F.WIG DD /r | VPADDUSW ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | AVX512VL AVX512BW | Add packed unsigned word integers from ymm2 and ymm3/m256 and store the saturated results in ymm1 under writemask k1.
EVEX.512.66.0F.WIG DD /r | VPADDUSW zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW | Add packed unsigned word integers from zmm2 and zmm3/m512 and store the saturated results in zmm1 under writemask k1.

NOTES:

1. See note in Section 2.4, “AVX and SSE Instruction Exception Specification” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A and Section 22.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
C | Full Mem | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description

Performs a SIMD add of the packed unsigned integers from the source operand (second operand) and the destination operand (first operand), and stores the packed integer results in the destination operand. See Figure 9-4 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a SIMD operation. Overflow is handled with unsigned saturation, as described in the following paragraphs.

(V)PADDUSB performs a SIMD add of the packed unsigned integers with saturation from the first source operand and second source operand and stores the packed integer results in the destination operand. When an individual byte result is beyond the range of an unsigned byte integer (that is, greater than FFH), the saturated value of FFH is written to the destination operand.

(V)PADDUSW performs a SIMD add of the packed unsigned word integers with saturation from the first source operand and second source operand and stores the packed integer results in the destination operand. When an individual word result is beyond the range of an unsigned word integer (that is, greater than FFFFH), the saturated value of FFFFH is written to the destination operand.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination is a ZMM/YMM/XMM register.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding destination register are zeroed.

128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.

Operation

PADDUSB (with 64-bit operands)

DEST[7:0] ← SaturateToUnsignedByte(DEST[7:0] + SRC[7:0]);

(* Repeat add operation for 2nd through 7th bytes *)

DEST[63:56] ← SaturateToUnsignedByte(DEST[63:56] + SRC[63:56]);

PADDUSB (with 128-bit operands)

DEST[7:0] ← SaturateToUnsignedByte (DEST[7:0] + SRC[7:0]);

(* Repeat add operation for 2nd through 15th bytes *)

DEST[127:120] ← SaturateToUnsignedByte (DEST[127:120] + SRC[127:120]);

VPADDUSB (VEX.128 encoded version)

DEST[7:0] ← SaturateToUnsignedByte (SRC1[7:0] + SRC2[7:0]);

(* Repeat add operation for 2nd through 15th bytes *)

DEST[127:120] ← SaturateToUnsignedByte (SRC1[127:120] + SRC2[127:120]);

DEST[MAXVL-1:128] ← 0

VPADDUSB (VEX.256 encoded version)

DEST[7:0] ← SaturateToUnsignedByte (SRC1[7:0] + SRC2[7:0]);

(* Repeat add operation for 2nd through 31st bytes *)

DEST[255:248] ← SaturateToUnsignedByte (SRC1[255:248] + SRC2[255:248]);

PADDUSW (with 64-bit operands)

DEST[15:0] ← SaturateToUnsignedWord(DEST[15:0] + SRC[15:0] );

(* Repeat add operation for 2nd and 3rd words *)

DEST[63:48] ← SaturateToUnsignedWord(DEST[63:48] + SRC[63:48] );

PADDUSW (with 128-bit operands)

DEST[15:0] ← SaturateToUnsignedWord (DEST[15:0] + SRC[15:0]);

(* Repeat add operation for 2nd through 7th words *)

DEST[127:112] ← SaturateToUnsignedWord (DEST[127:112] + SRC[127:112]);

VPADDUSW (VEX.128 encoded version)

DEST[15:0] ← SaturateToUnsignedWord (SRC1[15:0] + SRC2[15:0]);

(* Repeat add operation for 2nd through 7th words *)

DEST[127:112] ← SaturateToUnsignedWord (SRC1[127:112] + SRC2[127:112]);

DEST[MAXVL-1:128] ← 0

VPADDUSW (VEX.256 encoded version)

DEST[15:0] ← SaturateToUnsignedWord (SRC1[15:0] + SRC2[15:0]);

(* Repeat add operation for 2nd through 15th words *)

DEST[255:240] ← SaturateToUnsignedWord (SRC1[255:240] + SRC2[255:240])

VPADDUSB (EVEX encoded versions)

(KL, VL) = (16, 128), (32, 256), (64, 512)

FOR j ← 0 TO KL-1

    i ← j * 8

    IF k1[j] OR *no writemask*

        THEN DEST[i+7:i] ← SaturateToUnsignedByte (SRC1[i+7:i] + SRC2[i+7:i])

        ELSE

            IF *merging-masking* ; merging-masking

                THEN *DEST[i+7:i] remains unchanged*

                ELSE *zeroing-masking* ; zeroing-masking

                    DEST[i+7:i] ← 0

            FI

    FI;

ENDFOR;

DEST[MAXVL-1:VL] ← 0


VPADDUSW (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j ← 0 TO KL-1

    i ← j * 16

    IF k1[j] OR *no writemask*

        THEN DEST[i+15:i] ← SaturateToUnsignedWord (SRC1[i+15:i] + SRC2[i+15:i])

        ELSE

            IF *merging-masking* ; merging-masking

                THEN *DEST[i+15:i] remains unchanged*

                ELSE *zeroing-masking* ; zeroing-masking

                    DEST[i+15:i] ← 0

            FI

    FI;

ENDFOR;

DEST[MAXVL-1:VL] ← 0
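Unsigned saturation differs from the signed case in that only the upper bound can be exceeded by an add. A small scalar sketch in C of SaturateToUnsignedByte and SaturateToUnsignedWord applied to one lane is given below; the function names `paddusb_lane` and `paddusw_lane` are hypothetical and the sum is assumed to be computed in a wider unsigned type before clamping.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one PADDUSB lane: sums above FFH clamp to FFH. */
static uint8_t paddusb_lane(uint8_t a, uint8_t b)
{
    uint32_t sum = (uint32_t)a + (uint32_t)b;
    return sum > 0xFFu ? (uint8_t)0xFFu : (uint8_t)sum;
}

/* Hypothetical scalar model of one PADDUSW lane: sums above FFFFH clamp to FFFFH. */
static uint16_t paddusw_lane(uint16_t a, uint16_t b)
{
    uint32_t sum = (uint32_t)a + (uint32_t)b;
    return sum > 0xFFFFu ? (uint16_t)0xFFFFu : (uint16_t)sum;
}
```

Compared with the wrapping PADDB/PADDW forms, 200 + 100 in a byte lane produces 255 (FFH) here rather than 44.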

Intel C/C++ Compiler Intrinsic Equivalents

PADDUSB: __m64 _mm_adds_pu8(__m64 m1, __m64 m2)

PADDUSW: __m64 _mm_adds_pu16(__m64 m1, __m64 m2)

(V)PADDUSB: __m128i _mm_adds_epu8 ( __m128i a, __m128i b)

(V)PADDUSW: __m128i _mm_adds_epu16 ( __m128i a, __m128i b)

VPADDUSB: __m256i _mm256_adds_epu8 ( __m256i a, __m256i b)

VPADDUSW: __m256i _mm256_adds_epu16 ( __m256i a, __m256i b)

VPADDUSB: __m512i _mm512_adds_epu8 (__m512i a, __m512i b)

VPADDUSW: __m512i _mm512_adds_epu16 (__m512i a, __m512i b)

VPADDUSB: __m512i _mm512_mask_adds_epu8 (__m512i s, __mmask64 m, __m512i a, __m512i b)

VPADDUSW: __m512i _mm512_mask_adds_epu16 (__m512i s, __mmask32 m, __m512i a, __m512i b)

VPADDUSB: __m512i _mm512_maskz_adds_epu8 (__mmask64 m, __m512i a, __m512i b)

VPADDUSW: __m512i _mm512_maskz_adds_epu16 (__mmask32 m, __m512i a, __m512i b)

VPADDUSB: __m256i _mm256_mask_adds_epu8 (__m256i s, __mmask32 m, __m256i a, __m256i b)

VPADDUSW: __m256i _mm256_mask_adds_epu16 (__m256i s, __mmask16 m, __m256i a, __m256i b)

VPADDUSB: __m256i _mm256_maskz_adds_epu8 (__mmask32 m, __m256i a, __m256i b)

VPADDUSW: __m256i _mm256_maskz_adds_epu16 (__mmask16 m, __m256i a, __m256i b)

VPADDUSB: __m128i _mm_mask_adds_epu8 (__m128i s, __mmask16 m, __m128i a, __m128i b)

VPADDUSW: __m128i _mm_mask_adds_epu16 (__m128i s, __mmask8 m, __m128i a, __m128i b)

VPADDUSB: __m128i _mm_maskz_adds_epu8 (__mmask16 m, __m128i a, __m128i b)

VPADDUSW: __m128i _mm_maskz_adds_epu16 (__mmask8 m, __m128i a, __m128i b)

Flags Affected

None.

Numeric Exceptions

None.

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.

EVEX-encoded instruction, see Exceptions Type E4.nb.


PALIGNR — Packed Align Right

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 3A 0F /r ib¹ | PALIGNR mm1, mm2/m64, imm8 | A | V/V | SSSE3 | Concatenate destination and source operands, extract byte-aligned result shifted to the right by constant value in imm8 into mm1.
66 0F 3A 0F /r ib | PALIGNR xmm1, xmm2/m128, imm8 | A | V/V | SSSE3 | Concatenate destination and source operands, extract byte-aligned result shifted to the right by constant value in imm8 into xmm1.
VEX.128.66.0F3A.WIG 0F /r ib | VPALIGNR xmm1, xmm2, xmm3/m128, imm8 | B | V/V | AVX | Concatenate xmm2 and xmm3/m128, extract byte-aligned result shifted to the right by constant value in imm8, and store the result in xmm1.
VEX.256.66.0F3A.WIG 0F /r ib | VPALIGNR ymm1, ymm2, ymm3/m256, imm8 | B | V/V | AVX2 | Concatenate pairs of 16 bytes in ymm2 and ymm3/m256 into 32-byte intermediate results, extract a byte-aligned, 16-byte result shifted to the right by the constant value in imm8 from each intermediate result, and store the two 16-byte results in ymm1.
EVEX.128.66.0F3A.WIG 0F /r ib | VPALIGNR xmm1 {k1}{z}, xmm2, xmm3/m128, imm8 | C | V/V | AVX512VL AVX512BW | Concatenate xmm2 and xmm3/m128 into a 32-byte intermediate result, extract byte-aligned result shifted to the right by constant value in imm8, and store the result in xmm1.
EVEX.256.66.0F3A.WIG 0F /r ib | VPALIGNR ymm1 {k1}{z}, ymm2, ymm3/m256, imm8 | C | V/V | AVX512VL AVX512BW | Concatenate pairs of 16 bytes in ymm2 and ymm3/m256 into 32-byte intermediate results, extract a byte-aligned, 16-byte result shifted to the right by the constant value in imm8 from each intermediate result, and store the two 16-byte results in ymm1.
EVEX.512.66.0F3A.WIG 0F /r ib | VPALIGNR zmm1 {k1}{z}, zmm2, zmm3/m512, imm8 | C | V/V | AVX512BW | Concatenate pairs of 16 bytes in zmm2 and zmm3/m512 into 32-byte intermediate results, extract a byte-aligned, 16-byte result shifted to the right by the constant value in imm8 from each intermediate result, and store the four 16-byte results in zmm1.

NOTES:

1. See note in Section 2.4, “AVX and SSE Instruction Exception Specification” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A and Section 22.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | imm8 | NA
B | NA | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | imm8
C | Full Mem | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | imm8

Description

(V)PALIGNR concatenates the destination operand (the first operand) and the source operand (the second operand) into an intermediate composite, shifts the composite at byte granularity to the right by a constant immediate, and extracts the right-aligned result into the destination. The first and the second operands can be an MMX, XMM or a YMM register. The immediate value is considered unsigned. Immediate shift counts larger than 2L, twice the operand length in bytes (i.e. 32 for 128-bit operands, or 16 for 64-bit operands), produce a zero result. Both operands can be MMX registers, XMM registers or YMM registers. When the source operand is a 128-bit memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated.

In 64-bit mode and not encoded by VEX/EVEX prefix, use the REX prefix to access additional registers.

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding YMM destination register remain unchanged.

EVEX.512 encoded version: The first source operand is a ZMM register and contains four 16-byte blocks. The second source operand is a ZMM register or a 512-bit memory location containing four 16-byte blocks. The destination operand is a ZMM register and contains four 16-byte results. The imm8[7:0] is the common shift count used for each of the four successive 16-byte block sources. The low 16-byte blocks of the two source operands produce the low 16-byte result of the destination operand, the high 16-byte blocks of the two source operands produce the high 16-byte result of the destination operand, and so on for the blocks in the middle.

VEX.256 and EVEX.256 encoded versions: The first source operand is a YMM register and contains two 16-byte blocks. The second source operand is a YMM register or a 256-bit memory location containing two 16-byte blocks. The destination operand is a YMM register and contains two 16-byte results. The imm8[7:0] is the common shift count used for the two lower 16-byte block sources and the two upper 16-byte block sources. The low 16-byte blocks of the two source operands produce the low 16-byte result of the destination operand, and the high 16-byte blocks of the two source operands produce the high 16-byte result of the destination operand. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are zeroed.

VEX.128 and EVEX.128 encoded versions: The first source operand is an XMM register. The second source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.

Concatenation is done with 128-bit data in the first and second source operand for both 128-bit and 256-bit instructions. The high 128 bits of the intermediate composite 256-bit result come from the 128-bit data of the first source operand; the low 128 bits of the intermediate result come from the 128-bit data of the second source operand.

Note: VEX.L must be 0, otherwise the instruction will #UD.

Figure 4-7. 256-bit VPALIGN Instruction Operation

[Figure not reproduced. Its labels are SRC1[255:128], SRC2[255:128], SRC1[127:0] and SRC2[127:0], a shift amount of Imm8[7:0]*8, and the results DEST[255:128] and DEST[127:0].]

Operation

PALIGNR (with 64-bit operands)

temp1[127:0] = CONCATENATE(DEST,SRC)>>(imm8*8)

DEST[63:0] = temp1[63:0]

PALIGNR (with 128-bit operands)

temp1[255:0] ← ((DEST[127:0] << 128) OR SRC[127:0])>>(imm8*8);

DEST[127:0] ← temp1[127:0]

DEST[MAXVL-1:128] (Unmodified)

VPALIGNR (VEX.128 encoded version)

temp1[255:0] ← ((SRC1[127:0] << 128) OR SRC2[127:0])>>(imm8*8);

DEST[127:0] ← temp1[127:0]

DEST[MAXVL-1:128] ← 0
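The concatenate, shift and extract steps for a single 128-bit lane can be sketched in scalar C as follows. This is an illustrative model only: `palignr128` is a hypothetical name, operands are modeled as little-endian byte arrays (index 0 holds bits 7:0), and the zeroing or preservation of upper register bits is not modeled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of one 128-bit (V)PALIGNR lane.
   src1 supplies the high half of the 32-byte composite, src2 the low half. */
static void palignr128(uint8_t dest[16], const uint8_t src1[16],
                       const uint8_t src2[16], uint8_t imm8)
{
    uint8_t composite[32];
    uint8_t out[16];

    assert(dest != NULL && src1 != NULL && src2 != NULL);
    memcpy(composite, src2, 16);        /* bits 127:0   */
    memcpy(composite + 16, src1, 16);   /* bits 255:128 */

    for (unsigned i = 0; i < 16; i++) {
        unsigned from = i + imm8;       /* right shift by imm8 bytes */
        out[i] = from < 32 ? composite[from] : 0;
    }
    /* Copy last so dest may alias src1, as in the legacy SSE form. */
    memcpy(dest, out, 16);
}
```

With imm8 = 4, the result holds bytes 4 through 15 of the second source followed by bytes 0 through 3 of the first source; once imm8 reaches 32 every result byte is zero.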

VPALIGNR (VEX.256 encoded version)

temp1[255:0] ← ((SRC1[127:0] << 128) OR SRC2[127:0])>>(imm8[7:0]*8);

DEST[127:0] ← temp1[127:0]

temp1[255:0] ← ((SRC1[255:128] << 128) OR SRC2[255:128])>>(imm8[7:0]*8);

DEST[MAXVL-1:128] ← temp1[127:0]

VPALIGNR (EVEX encoded versions)

(KL, VL) = (16, 128), (32, 256), (64, 512)

FOR l ← 0 TO VL-1 with increments of 128

    temp1[255:0] ← ((SRC1[l+127:l] << 128) OR SRC2[l+127:l])>>(imm8[7:0]*8);

    TMP_DEST[l+127:l] ← temp1[127:0]

ENDFOR;

FOR j ← 0 TO KL-1

    i ← j * 8

    IF k1[j] OR *no writemask*

        THEN DEST[i+7:i] ← TMP_DEST[i+7:i]

        ELSE

            IF *merging-masking* ; merging-masking

                THEN *DEST[i+7:i] remains unchanged*

                ELSE *zeroing-masking* ; zeroing-masking

                    DEST[i+7:i] ← 0

            FI

    FI;

ENDFOR;

DEST[MAXVL-1:VL] ← 0
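The two loops of the EVEX form (per-lane alignment into a temporary, then byte-granular masking) can be sketched in scalar C. The sketch assumes little-endian byte arrays; `vpalignr_masked` and `vl_bytes` (VL divided by 8) are hypothetical names, and zeroing of DEST[MAXVL-1:VL] is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical scalar model of EVEX-encoded VPALIGNR.
   Each 16-byte lane is aligned on its own with the shared count imm8,
   then the byte-granular writemask k1 is applied to the temporary. */
static void vpalignr_masked(uint8_t *dest, const uint8_t *src1,
                            const uint8_t *src2, uint8_t imm8,
                            uint64_t k1, bool zeroing, unsigned vl_bytes)
{
    uint8_t tmp_dest[64];

    assert(vl_bytes == 16 || vl_bytes == 32 || vl_bytes == 64);

    for (unsigned l = 0; l < vl_bytes; l += 16) {
        for (unsigned i = 0; i < 16; i++) {
            unsigned from = i + imm8;   /* index into the 32-byte composite */
            if (from < 16)
                tmp_dest[l + i] = src2[l + from];        /* low half: SRC2 lane */
            else if (from < 32)
                tmp_dest[l + i] = src1[l + from - 16];   /* high half: SRC1 lane */
            else
                tmp_dest[l + i] = 0;                     /* shifted out */
        }
    }

    for (unsigned j = 0; j < vl_bytes; j++) {
        if ((k1 >> j) & 1u)
            dest[j] = tmp_dest[j];
        else if (zeroing)
            dest[j] = 0;
        /* merging-masking: dest[j] keeps its previous value */
    }
}
```

Because the alignment never reads across a 16-byte lane boundary, bytes shifted out of one lane are not carried into the next lane; that is the main difference from a full-width byte shift.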

Intel C/C++ Compiler Intrinsic Equivalents

PALIGNR: __m64 _mm_alignr_pi8 (__m64 a, __m64 b, int n)

(V)PALIGNR: __m128i _mm_alignr_epi8 (__m128i a, __m128i b, int n)

VPALIGNR: __m256i _mm256_alignr_epi8 (__m256i a, __m256i b, const int n)

VPALIGNR __m512i _mm512_alignr_epi8 (__m512i a, __m512i b, const int n)

VPALIGNR __m512i _mm512_mask_alignr_epi8 (__m512i s, __mmask64 m, __m512i a, __m512i b, const int n)

VPALIGNR __m512i _mm512_maskz_alignr_epi8 ( __mmask64 m, __m512i a, __m512i b, const int n)

VPALIGNR __m256i _mm256_mask_alignr_epi8 (__m256i s, __mmask32 m, __m256i a, __m256i b, const int n)

VPALIGNR __m256i _mm256_maskz_alignr_epi8 (__mmask32 m, __m256i a, __m256i b, const int n)

VPALIGNR __m128i _mm_mask_alignr_epi8 (__m128i s, __mmask16 m, __m128i a, __m128i b, const int n)

VPALIGNR __m128i _mm_maskz_alignr_epi8 (__mmask16 m, __m128i a, __m128i b, const int n)

SIMD Floating-Point Exceptions

None.


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.

EVEX-encoded instruction, see Exceptions Type E4NF.nb.


PAND—Logical AND

Instruction Operand Encoding

Description

Performs a bitwise logical AND operation on the first source operand and second source operand and stores the

result in the destination operand. Each bit of the result is set to 1 if the corresponding bits of the first and second
operands are both 1; otherwise it is set to 0.

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to

access additional registers (XMM8-XMM15).

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F DB /r¹ | PAND mm, mm/m64 | A | V/V | MMX | Bitwise AND mm/m64 and mm.
66 0F DB /r | PAND xmm1, xmm2/m128 | A | V/V | SSE2 | Bitwise AND of xmm2/m128 and xmm1.
VEX.128.66.0F.WIG DB /r | VPAND xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Bitwise AND of xmm3/m128 and xmm2.
VEX.256.66.0F.WIG DB /r | VPAND ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Bitwise AND of ymm2, and ymm3/m256 and store result in ymm1.
EVEX.128.66.0F.W0 DB /r | VPANDD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | C | V/V | AVX512VL AVX512F | Bitwise AND of packed doubleword integers in xmm2 and xmm3/m128/m32bcst and store result in xmm1 using writemask k1.
EVEX.256.66.0F.W0 DB /r | VPANDD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | C | V/V | AVX512VL AVX512F | Bitwise AND of packed doubleword integers in ymm2 and ymm3/m256/m32bcst and store result in ymm1 using writemask k1.
EVEX.512.66.0F.W0 DB /r | VPANDD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | C | V/V | AVX512F | Bitwise AND of packed doubleword integers in zmm2 and zmm3/m512/m32bcst and store result in zmm1 using writemask k1.
EVEX.128.66.0F.W1 DB /r | VPANDQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | C | V/V | AVX512VL AVX512F | Bitwise AND of packed quadword integers in xmm2 and xmm3/m128/m64bcst and store result in xmm1 using writemask k1.
EVEX.256.66.0F.W1 DB /r | VPANDQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | C | V/V | AVX512VL AVX512F | Bitwise AND of packed quadword integers in ymm2 and ymm3/m256/m64bcst and store result in ymm1 using writemask k1.
EVEX.512.66.0F.W1 DB /r | VPANDQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | C | V/V | AVX512F | Bitwise AND of packed quadword integers in zmm2 and zmm3/m512/m64bcst and store result in zmm1 using writemask k1.

NOTES:

1. See note in Section 2.4, “AVX and SSE Instruction Exception Specification” in the Intel® 64 and IA-32 Architectures Software

Developer’s Manual, Volume 2A and Section 22.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers”

in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (r, w) ModRM:r/m (r) NA NA

B NA ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) NA

C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA


Legacy SSE instructions: The source operand can be an MMX technology register or a 64-bit memory location. The
destination operand is an MMX technology register.

128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM
register or a 128-bit memory location. The destination is not distinct from the first source XMM register, and the
upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
32/64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with
writemask k1 at 32/64-bit granularity.

VEX.256 encoded versions: The first source operand is a YMM register. The second source operand is a YMM
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of
the corresponding ZMM register destination are zeroed.

VEX.128 encoded versions: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.

Operation

PAND (64-bit operand)

DEST ← DEST AND SRC

PAND (128-bit Legacy SSE version)
DEST ← DEST AND SRC
DEST[MAXVL-1:128] (Unmodified)

VPAND (VEX.128 encoded version)
DEST ← SRC1 AND SRC2
DEST[MAXVL-1:128] ← 0

VPAND (VEX.256 encoded instruction)
DEST[255:0] ← (SRC1[255:0] AND SRC2[255:0])
DEST[MAXVL-1:256] ← 0

VPANDD (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[i+31:i] ← SRC1[i+31:i] BITWISE AND SRC2[31:0]
                ELSE DEST[i+31:i] ← SRC1[i+31:i] BITWISE AND SRC2[i+31:i]
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0
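The interaction of the writemask, merging versus zeroing, and the embedded broadcast in the VPANDD loop above can be sketched in Python. This is an illustrative model under stated assumptions, not Intel code: the function name `vpandd` and its parameters are hypothetical, each register is modelled as a list of KL 32-bit integers, and a broadcast memory operand is modelled by reading only element 0 of `src2`.

```python
def vpandd(dest, src1, src2, k1=None, zeroing=False, broadcast=False):
    """Model the EVEX VPANDD loop on lists of 32-bit elements.

    k1 is an integer bit mask (bit j controls element j) or None for
    "no writemask". broadcast=True models EVEX.b = 1 with a memory source.
    """
    out = list(dest)
    for j in range(len(src1)):
        if k1 is None or (k1 >> j) & 1:
            # Broadcast reuses SRC2[31:0] for every element.
            operand = src2[0] if broadcast else src2[j]
            out[j] = src1[j] & operand & 0xFFFFFFFF
        elif zeroing:
            out[j] = 0  # zeroing-masking
        # otherwise merging-masking: out[j] keeps the old DEST value
    return out
```

With `k1 = 0b0101`, only elements 0 and 2 are computed; elements 1 and 3 either keep the previous destination value (merging) or become 0 (zeroing).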


VPANDQ (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[i+63:i] ← SRC1[i+63:i] BITWISE AND SRC2[63:0]
                ELSE DEST[i+63:i] ← SRC1[i+63:i] BITWISE AND SRC2[i+63:i]
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalents

VPANDD __m512i _mm512_and_epi32( __m512i a, __m512i b);
VPANDD __m512i _mm512_mask_and_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b);
VPANDD __m512i _mm512_maskz_and_epi32( __mmask16 k, __m512i a, __m512i b);
VPANDQ __m512i _mm512_and_epi64( __m512i a, __m512i b);
VPANDQ __m512i _mm512_mask_and_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPANDQ __m512i _mm512_maskz_and_epi64( __mmask8 k, __m512i a, __m512i b);
VPANDD __m256i _mm256_mask_and_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPANDD __m256i _mm256_maskz_and_epi32( __mmask8 k, __m256i a, __m256i b);
VPANDD __m128i _mm_mask_and_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPANDD __m128i _mm_maskz_and_epi32( __mmask8 k, __m128i a, __m128i b);
VPANDQ __m256i _mm256_mask_and_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPANDQ __m256i _mm256_maskz_and_epi64( __mmask8 k, __m256i a, __m256i b);
VPANDQ __m128i _mm_mask_and_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPANDQ __m128i _mm_maskz_and_epi64( __mmask8 k, __m128i a, __m128i b);
PAND: __m64 _mm_and_si64 (__m64 m1, __m64 m2)
(V)PAND: __m128i _mm_and_si128 ( __m128i a, __m128i b)
VPAND: __m256i _mm256_and_si256 ( __m256i a, __m256i b)

Flags Affected

None.

Numeric Exceptions

None.

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.

EVEX-encoded instruction, see Exceptions Type E4.


PANDN—Logical AND NOT

Instruction Operand Encoding

Description

Performs a bitwise logical NOT operation on the first source operand, then performs a bitwise AND with the second
source operand and stores the result in the destination operand. Each bit of the result is set to 1 if the
corresponding bit in the first operand is 0 and the corresponding bit in the second operand is 1; otherwise it is set to 0.

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to

access additional registers (XMM8-XMM15).

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F DF /r¹ | PANDN mm, mm/m64 | A | V/V | MMX | Bitwise AND NOT of mm/m64 and mm.
66 0F DF /r | PANDN xmm1, xmm2/m128 | A | V/V | SSE2 | Bitwise AND NOT of xmm2/m128 and xmm1.
VEX.128.66.0F.WIG DF /r | VPANDN xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Bitwise AND NOT of xmm3/m128 and xmm2.
VEX.256.66.0F.WIG DF /r | VPANDN ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Bitwise AND NOT of ymm2, and ymm3/m256 and store result in ymm1.
EVEX.128.66.0F.W0 DF /r | VPANDND xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | C | V/V | AVX512VL AVX512F | Bitwise AND NOT of packed doubleword integers in xmm2 and xmm3/m128/m32bcst and store result in xmm1 using writemask k1.
EVEX.256.66.0F.W0 DF /r | VPANDND ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | C | V/V | AVX512VL AVX512F | Bitwise AND NOT of packed doubleword integers in ymm2 and ymm3/m256/m32bcst and store result in ymm1 using writemask k1.
EVEX.512.66.0F.W0 DF /r | VPANDND zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | C | V/V | AVX512F | Bitwise AND NOT of packed doubleword integers in zmm2 and zmm3/m512/m32bcst and store result in zmm1 using writemask k1.
EVEX.128.66.0F.W1 DF /r | VPANDNQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | C | V/V | AVX512VL AVX512F | Bitwise AND NOT of packed quadword integers in xmm2 and xmm3/m128/m64bcst and store result in xmm1 using writemask k1.
EVEX.256.66.0F.W1 DF /r | VPANDNQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | C | V/V | AVX512VL AVX512F | Bitwise AND NOT of packed quadword integers in ymm2 and ymm3/m256/m64bcst and store result in ymm1 using writemask k1.
EVEX.512.66.0F.W1 DF /r | VPANDNQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | C | V/V | AVX512F | Bitwise AND NOT of packed quadword integers in zmm2 and zmm3/m512/m64bcst and store result in zmm1 using writemask k1.

NOTES:

1. See note in Section 2.4, “AVX and SSE Instruction Exception Specification” in the Intel® 64 and IA-32 Architectures Software

Developer’s Manual, Volume 2A and Section 22.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers”

in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (r, w) ModRM:r/m (r) NA NA

B NA ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) NA

C Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA


Legacy SSE instructions: The source operand can be an MMX technology register or a 64-bit memory location. The
destination operand is an MMX technology register.

128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM
register or a 128-bit memory location. The destination is not distinct from the first source XMM register, and the
upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
32/64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with
writemask k1 at 32/64-bit granularity.

VEX.256 encoded versions: The first source operand is a YMM register. The second source operand is a YMM
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of
the corresponding ZMM register destination are zeroed.

VEX.128 encoded versions: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.

Operation

PANDN (64-bit operand)

DEST ← NOT(DEST) AND SRC

PANDN (128-bit Legacy SSE version)
DEST ← NOT(DEST) AND SRC
DEST[MAXVL-1:128] (Unmodified)

VPANDN (VEX.128 encoded version)
DEST ← NOT(SRC1) AND SRC2
DEST[MAXVL-1:128] ← 0

VPANDN (VEX.256 encoded instruction)
DEST[255:0] ← ((NOT SRC1[255:0]) AND SRC2[255:0])
DEST[MAXVL-1:256] ← 0

VPANDND (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[i+31:i] ← ((NOT SRC1[i+31:i]) AND SRC2[31:0])
                ELSE DEST[i+31:i] ← ((NOT SRC1[i+31:i]) AND SRC2[i+31:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VPANDNQ (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[i+63:i] ← ((NOT SRC1[i+63:i]) AND SRC2[63:0])
                ELSE DEST[i+63:i] ← ((NOT SRC1[i+63:i]) AND SRC2[i+63:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI;
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0
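Because the NOT applies to the first source operand only, operand order matters for PANDN in a way it does not for PAND. The following Python sketch models the unmasked operation on plain integers. It is illustrative only: the function name `pandn` and the `width` parameter are hypothetical.

```python
def pandn(first: int, second: int, width: int = 128) -> int:
    """Return (NOT first) AND second, truncated to `width` bits."""
    mask = (1 << width) - 1
    return (~first & mask) & (second & mask)
```

A common use is clearing selected bits: `pandn(bits_to_clear, value)` keeps only the bits of `value` that are not set in `bits_to_clear`. Swapping the two arguments generally gives a different result.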

Intel C/C++ Compiler Intrinsic Equivalents

VPANDND __m512i _mm512_andnot_epi32( __m512i a, __m512i b);

VPANDND __m512i _mm512_mask_andnot_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b);

VPANDND __m512i _mm512_maskz_andnot_epi32( __mmask16 k, __m512i a, __m512i b);

VPANDND __m256i _mm256_mask_andnot_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b);

VPANDND __m256i _mm256_maskz_andnot_epi32( __mmask8 k, __m256i a, __m256i b);

VPANDND __m128i _mm_mask_andnot_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b);

VPANDND __m128i _mm_maskz_andnot_epi32( __mmask8 k, __m128i a, __m128i b);

VPANDNQ __m512i _mm512_andnot_epi64( __m512i a, __m512i b);

VPANDNQ __m512i _mm512_mask_andnot_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b);

VPANDNQ __m512i _mm512_maskz_andnot_epi64( __mmask8 k, __m512i a, __m512i b);

VPANDNQ __m256i _mm256_mask_andnot_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b);

VPANDNQ __m256i _mm256_maskz_andnot_epi64( __mmask8 k, __m256i a, __m256i b);

VPANDNQ __m128i _mm_mask_andnot_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b);

VPANDNQ __m128i _mm_maskz_andnot_epi64( __mmask8 k, __m128i a, __m128i b);

PANDN: __m64 _mm_andnot_si64 (__m64 m1, __m64 m2)

(V)PANDN:__m128i _mm_andnot_si128 ( __m128i a, __m128i b)

VPANDN: __m256i _mm256_andnot_si256 ( __m256i a, __m256i b)

Flags Affected

None.

Numeric Exceptions

None.

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.

EVEX-encoded instruction, see Exceptions Type E4.


PAUSE—Spin Loop Hint

Instruction Operand Encoding

Description

Improves the performance of spin-wait loops. When executing a “spin-wait loop,” a processor suffers a severe
performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE

instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint

to avoid the memory order violation in most situations, which greatly improves processor performance. For this

reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops.

An additional function of the PAUSE instruction is to reduce the power consumed by a processor while executing a

spin loop. A processor can execute a spin-wait loop extremely quickly, causing the processor to consume a lot of

power while it waits for the resource it is spinning on to become available. Inserting a PAUSE instruction in a
spin-wait loop greatly reduces the processor’s power consumption.

This instruction was introduced in the Pentium 4 processors, but is backward compatible with all IA-32 processors.

In earlier IA-32 processors, the PAUSE instruction operates like a NOP instruction. The Pentium 4 and Intel Xeon

processors implement the PAUSE instruction as a delay. The delay is finite and can be zero for some processors.

This instruction does not change the architectural state of the processor (that is, it performs essentially a delaying

no-op operation).

This instruction’s operation is the same in non-64-bit modes and 64-bit mode.

Operation

Execute_Next_Instruction(DELAY);

Numeric Exceptions

None.

Exceptions (All Operating Modes)

#UD If the LOCK prefix is used.

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
F3 90 | PAUSE | ZO | Valid | Valid | Gives hint to processor that improves performance of spin-wait loops.

Op/En Operand 1 Operand 2 Operand 3 Operand 4

ZO NA NA NA NA


PAVGB/PAVGW—Average Packed Integers

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F E0 /r¹ | PAVGB mm1, mm2/m64 | A | V/V | SSE | Average packed unsigned byte integers from mm2/m64 and mm1 with rounding.
66 0F E0 /r | PAVGB xmm1, xmm2/m128 | A | V/V | SSE2 | Average packed unsigned byte integers from xmm2/m128 and xmm1 with rounding.
NP 0F E3 /r¹ | PAVGW mm1, mm2/m64 | A | V/V | SSE | Average packed unsigned word integers from mm2/m64 and mm1 with rounding.
66 0F E3 /r | PAVGW xmm1, xmm2/m128 | A | V/V | SSE2 | Average packed unsigned word integers from xmm2/m128 and xmm1 with rounding.
VEX.128.66.0F.WIG E0 /r | VPAVGB xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Average packed unsigned byte integers from xmm3/m128 and xmm2 with rounding.
VEX.128.66.0F.WIG E3 /r | VPAVGW xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Average packed unsigned word integers from xmm3/m128 and xmm2 with rounding.
VEX.256.66.0F.WIG E0 /r | VPAVGB ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Average packed unsigned byte integers from ymm2, and ymm3/m256 with rounding and store to ymm1.
VEX.256.66.0F.WIG E3 /r | VPAVGW ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Average packed unsigned word integers from ymm2, ymm3/m256 with rounding to ymm1.
EVEX.128.66.0F.WIG E0 /r | VPAVGB xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | AVX512VL AVX512BW | Average packed unsigned byte integers from xmm2, and xmm3/m128 with rounding and store to xmm1 under writemask k1.
EVEX.256.66.0F.WIG E0 /r | VPAVGB ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | AVX512VL AVX512BW | Average packed unsigned byte integers from ymm2, and ymm3/m256 with rounding and store to ymm1 under writemask k1.
EVEX.512.66.0F.WIG E0 /r | VPAVGB zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW | Average packed unsigned byte integers from zmm2, and zmm3/m512 with rounding and store to zmm1 under writemask k1.
EVEX.128.66.0F.WIG E3 /r | VPAVGW xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | AVX512VL AVX512BW | Average packed unsigned word integers from xmm2, xmm3/m128 with rounding to xmm1 under writemask k1.
EVEX.256.66.0F.WIG E3 /r | VPAVGW ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | AVX512VL AVX512BW | Average packed unsigned word integers from ymm2, ymm3/m256 with rounding to ymm1 under writemask k1.
EVEX.512.66.0F.WIG E3 /r | VPAVGW zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW | Average packed unsigned word integers from zmm2, zmm3/m512 with rounding to zmm1 under writemask k1.

NOTES:

1. See note in Section 2.4, “AVX and SSE Instruction Exception Specification” in the Intel® 64 and IA-32 Architectures Software

Developer’s Manual, Volume 2A and Section 22.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers”

in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.


Instruction Operand Encoding

Description

Performs a SIMD average of the packed unsigned integers from the source operand (second operand) and the

destination operand (first operand), and stores the results in the destination operand. For each corresponding pair

of data elements in the first and second operands, the elements are added together, a 1 is added to the temporary

sum, and that result is shifted right one bit position.

The (V)PAVGB instruction operates on packed unsigned bytes and the (V)PAVGW instruction operates on packed

unsigned words.

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to

access additional registers (XMM8-XMM15).

Legacy SSE instructions: The source operand can be an MMX technology register or a 64-bit memory location. The

destination operand can be an MMX technology register.

128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM
register or a 128-bit memory location. The destination is not distinct from the first source XMM register, and the
upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.

EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM
register or a 512-bit memory location. The destination operand is a ZMM register.

VEX.256 and EVEX.256 encoded versions: The first source operand is a YMM register. The second source operand
is a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 and EVEX.128 encoded versions: The first source operand is an XMM register. The second source operand
is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits
(MAXVL-1:128) of the corresponding register destination are zeroed.

Operation

PAVGB (with 64-bit operands)
DEST[7:0] ← (SRC[7:0] + DEST[7:0] + 1) >> 1; (* Temp sum before shifting is 9 bits *)
(* Repeat operation performed for bytes 2 through 7 *)
DEST[63:56] ← (SRC[63:56] + DEST[63:56] + 1) >> 1;

PAVGW (with 64-bit operands)
DEST[15:0] ← (SRC[15:0] + DEST[15:0] + 1) >> 1; (* Temp sum before shifting is 17 bits *)
(* Repeat operation performed for words 2 and 3 *)
DEST[63:48] ← (SRC[63:48] + DEST[63:48] + 1) >> 1;

PAVGB (with 128-bit operands)
DEST[7:0] ← (SRC[7:0] + DEST[7:0] + 1) >> 1; (* Temp sum before shifting is 9 bits *)
(* Repeat operation performed for bytes 2 through 15 *)
DEST[127:120] ← (SRC[127:120] + DEST[127:120] + 1) >> 1;

PAVGW (with 128-bit operands)
DEST[15:0] ← (SRC[15:0] + DEST[15:0] + 1) >> 1; (* Temp sum before shifting is 17 bits *)
(* Repeat operation performed for words 2 through 7 *)
DEST[127:112] ← (SRC[127:112] + DEST[127:112] + 1) >> 1;

Op/En Tuple Type Operand 1 Operand 2 Operand 3 Operand 4

A NA ModRM:reg (r, w) ModRM:r/m (r) NA NA

B NA ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) NA

C Full Mem ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA


VPAVGB (VEX.128 encoded version)

DEST[7:0] ← (SRC1[7:0] + SRC2[7:0] + 1) >> 1;
(* Repeat operation performed for bytes 2 through 15 *)
DEST[127:120] ← (SRC1[127:120] + SRC2[127:120] + 1) >> 1
DEST[MAXVL-1:128] ← 0

VPAVGW (VEX.128 encoded version)
DEST[15:0] ← (SRC1[15:0] + SRC2[15:0] + 1) >> 1;
(* Repeat operation performed for 16-bit words 2 through 7 *)
DEST[127:112] ← (SRC1[127:112] + SRC2[127:112] + 1) >> 1
DEST[MAXVL-1:128] ← 0

VPAVGB (VEX.256 encoded instruction)
DEST[7:0] ← (SRC1[7:0] + SRC2[7:0] + 1) >> 1; (* Temp sum before shifting is 9 bits *)
(* Repeat operation performed for bytes 2 through 31 *)
DEST[255:248] ← (SRC1[255:248] + SRC2[255:248] + 1) >> 1;

VPAVGW (VEX.256 encoded instruction)
DEST[15:0] ← (SRC1[15:0] + SRC2[15:0] + 1) >> 1; (* Temp sum before shifting is 17 bits *)
(* Repeat operation performed for words 2 through 15 *)
DEST[255:240] ← (SRC1[255:240] + SRC2[255:240] + 1) >> 1;

VPAVGB (EVEX encoded versions)

(KL, VL) = (16, 128), (32, 256), (64, 512)

FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask*
        THEN DEST[i+7:i] ← (SRC1[i+7:i] + SRC2[i+7:i] + 1) >> 1; (* Temp sum before shifting is 9 bits *)
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+7:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+7:i] ← 0
            FI
    FI;
ENDFOR;
DEST[MAXVL-1:VL] ← 0

VPAVGW (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← (SRC1[i+15:i] + SRC2[i+15:i] + 1) >> 1; (* Temp sum before shifting is 17 bits *)
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+15:i] ← 0
            FI
    FI;
ENDFOR;
DEST[MAXVL-1:VL] ← 0
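The rounding average used throughout the pseudocode above, (a + b + 1) >> 1 computed in a temporary one bit wider than the element, can be checked with a short Python sketch. This is an illustrative model only: the names `pavg` and `pavgb` are hypothetical, and Python integers stand in for the 9-bit or 17-bit temporary sum.

```python
def pavg(a: int, b: int, bits: int = 8) -> int:
    """Rounded-up average of two unsigned elements of the given width."""
    limit = (1 << bits) - 1
    if not (0 <= a <= limit and 0 <= b <= limit):
        raise ValueError("operands must fit in the element width")
    # The sum needs bits + 1 bits; Python integers never overflow,
    # so the extra carry bit is kept before the shift.
    return (a + b + 1) >> 1


def pavgb(x: bytes, y: bytes) -> bytes:
    """Element-wise rounding average of two equal-length byte vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return bytes(pavg(p, q) for p, q in zip(x, y))
```

The added 1 means ties round up: averaging 0 and 1 gives 1, not 0. The result always fits in the element width, so no saturation is needed.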


Intel C/C++ Compiler Intrinsic Equivalents

VPAVGB __m512i _mm512_avg_epu8( __m512i a, __m512i b);

VPAVGW __m512i _mm512_avg_epu16( __m512i a, __m512i b);

VPAVGB __m512i _mm512_mask_avg_epu8(__m512i s, __mmask64 m, __m512i a, __m512i b);

VPAVGW __m512i _mm512_mask_avg_epu16(__m512i s, __mmask32 m, __m512i a, __m512i b);

VPAVGB __m512i _mm512_maskz_avg_epu8( __mmask64 m, __m512i a, __m512i b);

VPAVGW __m512i _mm512_maskz_avg_epu16( __mmask32 m, __m512i a, __m512i b);

VPAVGB __m256i _mm256_mask_avg_epu8(__m256i s, __mmask32 m, __m256i a, __m256i b);

VPAVGW __m256i _mm256_mask_avg_epu16(__m256i s, __mmask16 m, __m256i a, __m256i b);

VPAVGB __m256i _mm256_maskz_avg_epu8( __mmask32 m, __m256i a, __m256i b);

VPAVGW __m256i _mm256_maskz_avg_epu16( __mmask16 m, __m256i a, __m256i b);

VPAVGB __m128i _mm_mask_avg_epu8(__m128i s, __mmask16 m, __m128i a, __m128i b);

VPAVGW __m128i _mm_mask_avg_epu16(__m128i s, __mmask8 m, __m128i a, __m128i b);

VPAVGB __m128i _mm_maskz_avg_epu8( __mmask16 m, __m128i a, __m128i b);

VPAVGW __m128i _mm_maskz_avg_epu16( __mmask8 m, __m128i a, __m128i b);

PAVGB: __m64 _mm_avg_pu8 (__m64 a, __m64 b)

PAVGW: __m64 _mm_avg_pu16 (__m64 a, __m64 b)

(V)PAVGB: __m128i _mm_avg_epu8 ( __m128i a, __m128i b)

(V)PAVGW: __m128i _mm_avg_epu16 ( __m128i a, __m128i b)

VPAVGB: __m256i _mm256_avg_epu8 ( __m256i a, __m256i b)

VPAVGW: __m256i _mm256_avg_epu16 ( __m256i a, __m256i b)

Flags Affected

None.

Numeric Exceptions

None.

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.

EVEX-encoded instruction, see Exceptions Type E4.nb.


PBLENDVB — Variable Blend Packed Bytes

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 38 10 /r | PBLENDVB xmm1, xmm2/m128, <XMM0> | RM | V/V | SSE4_1 | Select byte values from xmm1 and xmm2/m128 from mask specified in the high bit of each byte in XMM0 and store the values into xmm1.
VEX.128.66.0F3A.W0 4C /r /is4 | VPBLENDVB xmm1, xmm2, xmm3/m128, xmm4 | RVMR | V/V | AVX | Select byte values from xmm2 and xmm3/m128 using mask bits in the specified mask register, xmm4, and store the values into xmm1.
VEX.256.66.0F3A.W0 4C /r /is4 | VPBLENDVB ymm1, ymm2, ymm3/m256, ymm4 | RVMR | V/V | AVX2 | Select byte values from ymm2 and ymm3/m256 from mask specified in the high bit of each byte in ymm4 and store the values into ymm1.

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | <XMM0> | NA
RVMR | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | imm8[7:4]

Description

Conditionally copies byte elements from the source operand (second operand) to the destination operand (first
operand) depending on mask bits defined in the implicit third register argument, XMM0. The mask bits are the most
significant bit in each byte element of the XMM0 register.

If a mask bit is “1”, then the corresponding byte element in the source operand is copied to the destination;
otherwise the byte element in the destination operand is left unchanged.

The register assignment of the implicit third operand is defined to be the architectural register XMM0.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (MAXVL-1:128)
of the corresponding YMM destination register remain unchanged. The mask register operand is implicitly defined
to be the architectural register XMM0. An attempt to execute PBLENDVB with a VEX prefix will cause #UD.

VEX.128 encoded version: The first source operand and the destination operand are XMM registers. The second
source operand is an XMM register or 128-bit memory location. The mask operand is the third source register, and
encoded in bits[7:4] of the immediate byte (imm8). The bits[3:0] of imm8 are ignored. In 32-bit mode, imm8[7] is
ignored. The upper bits (MAXVL-1:128) of the corresponding YMM register (destination register) are zeroed. VEX.L
must be 0, otherwise the instruction will #UD. VEX.W must be 0, otherwise the instruction will #UD.

VEX.256 encoded version: The first source operand and the destination operand are YMM registers. The second
source operand is a YMM register or 256-bit memory location. The third source register is a YMM register and
encoded in bits[7:4] of the immediate byte (imm8). The bits[3:0] of imm8 are ignored. In 32-bit mode, imm8[7] is
ignored.

VPBLENDVB permits the mask to be any XMM or YMM register. In contrast, PBLENDVB treats XMM0 implicitly as the
mask and does not support non-destructive destination operation. An attempt to execute PBLENDVB encoded with a
VEX prefix will cause a #UD exception.

Operation

PBLENDVB (128-bit Legacy SSE version)
MASK ← XMM0
IF (MASK[7] = 1) THEN DEST[7:0] ← SRC[7:0]
ELSE DEST[7:0] ← DEST[7:0];
IF (MASK[15] = 1) THEN DEST[15:8] ← SRC[15:8]
ELSE DEST[15:8] ← DEST[15:8];
IF (MASK[23] = 1) THEN DEST[23:16] ← SRC[23:16]
ELSE DEST[23:16] ← DEST[23:16];
IF (MASK[31] = 1) THEN DEST[31:24] ← SRC[31:24]
ELSE DEST[31:24] ← DEST[31:24];
IF (MASK[39] = 1) THEN DEST[39:32] ← SRC[39:32]
ELSE DEST[39:32] ← DEST[39:32];
IF (MASK[47] = 1) THEN DEST[47:40] ← SRC[47:40]
ELSE DEST[47:40] ← DEST[47:40];
IF (MASK[55] = 1) THEN DEST[55:48] ← SRC[55:48]
ELSE DEST[55:48] ← DEST[55:48];
IF (MASK[63] = 1) THEN DEST[63:56] ← SRC[63:56]
ELSE DEST[63:56] ← DEST[63:56];
IF (MASK[71] = 1) THEN DEST[71:64] ← SRC[71:64]
ELSE DEST[71:64] ← DEST[71:64];
IF (MASK[79] = 1) THEN DEST[79:72] ← SRC[79:72]
ELSE DEST[79:72] ← DEST[79:72];
IF (MASK[87] = 1) THEN DEST[87:80] ← SRC[87:80]
ELSE DEST[87:80] ← DEST[87:80];
IF (MASK[95] = 1) THEN DEST[95:88] ← SRC[95:88]
ELSE DEST[95:88] ← DEST[95:88];
IF (MASK[103] = 1) THEN DEST[103:96] ← SRC[103:96]
ELSE DEST[103:96] ← DEST[103:96];
IF (MASK[111] = 1) THEN DEST[111:104] ← SRC[111:104]
ELSE DEST[111:104] ← DEST[111:104];
IF (MASK[119] = 1) THEN DEST[119:112] ← SRC[119:112]
ELSE DEST[119:112] ← DEST[119:112];
IF (MASK[127] = 1) THEN DEST[127:120] ← SRC[127:120]
ELSE DEST[127:120] ← DEST[127:120];
DEST[MAXVL-1:128] (Unmodified)

VPBLENDVB (VEX.128 encoded version)

MASK ← SRC3
IF (MASK[7] = 1) THEN DEST[7:0] ← SRC2[7:0]
ELSE DEST[7:0] ← SRC1[7:0];
IF (MASK[15] = 1) THEN DEST[15:8] ← SRC2[15:8]
ELSE DEST[15:8] ← SRC1[15:8];
IF (MASK[23] = 1) THEN DEST[23:16] ← SRC2[23:16]
ELSE DEST[23:16] ← SRC1[23:16];
IF (MASK[31] = 1) THEN DEST[31:24] ← SRC2[31:24]
ELSE DEST[31:24] ← SRC1[31:24];
IF (MASK[39] = 1) THEN DEST[39:32] ← SRC2[39:32]
ELSE DEST[39:32] ← SRC1[39:32];
IF (MASK[47] = 1) THEN DEST[47:40] ← SRC2[47:40]
ELSE DEST[47:40] ← SRC1[47:40];
IF (MASK[55] = 1) THEN DEST[55:48] ← SRC2[55:48]
ELSE DEST[55:48] ← SRC1[55:48];
IF (MASK[63] = 1) THEN DEST[63:56] ← SRC2[63:56]
ELSE DEST[63:56] ← SRC1[63:56];
IF (MASK[71] = 1) THEN DEST[71:64] ← SRC2[71:64]
ELSE DEST[71:64] ← SRC1[71:64];
IF (MASK[79] = 1) THEN DEST[79:72] ← SRC2[79:72]
ELSE DEST[79:72] ← SRC1[79:72];
IF (MASK[87] = 1) THEN DEST[87:80] ← SRC2[87:80]
ELSE DEST[87:80] ← SRC1[87:80];
IF (MASK[95] = 1) THEN DEST[95:88] ← SRC2[95:88]
ELSE DEST[95:88] ← SRC1[95:88];
IF (MASK[103] = 1) THEN DEST[103:96] ← SRC2[103:96]
ELSE DEST[103:96] ← SRC1[103:96];
IF (MASK[111] = 1) THEN DEST[111:104] ← SRC2[111:104]
ELSE DEST[111:104] ← SRC1[111:104];
IF (MASK[119] = 1) THEN DEST[119:112] ← SRC2[119:112]
ELSE DEST[119:112] ← SRC1[119:112];
IF (MASK[127] = 1) THEN DEST[127:120] ← SRC2[127:120]
ELSE DEST[127:120] ← SRC1[127:120];
DEST[MAXVL-1:128] ← 0
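The unrolled per-byte selection above reduces to one rule: for each byte position, take the second source byte when the most significant bit of the corresponding mask byte is 1, otherwise take the first source byte. A minimal Python sketch of that rule follows. It is illustrative only: the function name `vpblendvb` is hypothetical, and registers are modelled as equal-length byte strings with byte 0 least significant.

```python
def vpblendvb(src1: bytes, src2: bytes, mask: bytes) -> bytes:
    """Per-byte select: src2 byte where the mask byte's top bit is set, else src1 byte."""
    if not (len(src1) == len(src2) == len(mask)):
        raise ValueError("all operands must have the same length")
    # Only bit 7 of each mask byte is examined; the low seven bits are ignored.
    return bytes(s2 if m & 0x80 else s1 for s1, s2, m in zip(src1, src2, mask))
```

The legacy PBLENDVB form corresponds to calling this model with the destination as `src1` and XMM0 as `mask`, so unselected bytes keep their previous destination value.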

VPBLENDVB (VEX.256 encoded version)

MASK ← SRC3
IF (MASK[7] == 1) THEN DEST[7:0] ← SRC2[7:0]
    ELSE DEST[7:0] ← SRC1[7:0];
IF (MASK[15] == 1) THEN DEST[15:8] ← SRC2[15:8]
    ELSE DEST[15:8] ← SRC1[15:8];
IF (MASK[23] == 1) THEN DEST[23:16] ← SRC2[23:16]
    ELSE DEST[23:16] ← SRC1[23:16];
IF (MASK[31] == 1) THEN DEST[31:24] ← SRC2[31:24]
    ELSE DEST[31:24] ← SRC1[31:24];
IF (MASK[39] == 1) THEN DEST[39:32] ← SRC2[39:32]
    ELSE DEST[39:32] ← SRC1[39:32];
IF (MASK[47] == 1) THEN DEST[47:40] ← SRC2[47:40]
    ELSE DEST[47:40] ← SRC1[47:40];
IF (MASK[55] == 1) THEN DEST[55:48] ← SRC2[55:48]
    ELSE DEST[55:48] ← SRC1[55:48];
IF (MASK[63] == 1) THEN DEST[63:56] ← SRC2[63:56]
    ELSE DEST[63:56] ← SRC1[63:56];
IF (MASK[71] == 1) THEN DEST[71:64] ← SRC2[71:64]
    ELSE DEST[71:64] ← SRC1[71:64];
IF (MASK[79] == 1) THEN DEST[79:72] ← SRC2[79:72]
    ELSE DEST[79:72] ← SRC1[79:72];
IF (MASK[87] == 1) THEN DEST[87:80] ← SRC2[87:80]
    ELSE DEST[87:80] ← SRC1[87:80];
IF (MASK[95] == 1) THEN DEST[95:88] ← SRC2[95:88]
    ELSE DEST[95:88] ← SRC1[95:88];
IF (MASK[103] == 1) THEN DEST[103:96] ← SRC2[103:96]
    ELSE DEST[103:96] ← SRC1[103:96];
IF (MASK[111] == 1) THEN DEST[111:104] ← SRC2[111:104]
    ELSE DEST[111:104] ← SRC1[111:104];
IF (MASK[119] == 1) THEN DEST[119:112] ← SRC2[119:112]
    ELSE DEST[119:112] ← SRC1[119:112];
IF (MASK[127] == 1) THEN DEST[127:120] ← SRC2[127:120]
    ELSE DEST[127:120] ← SRC1[127:120];
IF (MASK[135] == 1) THEN DEST[135:128] ← SRC2[135:128]
    ELSE DEST[135:128] ← SRC1[135:128];
IF (MASK[143] == 1) THEN DEST[143:136] ← SRC2[143:136]
    ELSE DEST[143:136] ← SRC1[143:136];
IF (MASK[151] == 1) THEN DEST[151:144] ← SRC2[151:144]
    ELSE DEST[151:144] ← SRC1[151:144];
IF (MASK[159] == 1) THEN DEST[159:152] ← SRC2[159:152]
    ELSE DEST[159:152] ← SRC1[159:152];
IF (MASK[167] == 1) THEN DEST[167:160] ← SRC2[167:160]
    ELSE DEST[167:160] ← SRC1[167:160];
IF (MASK[175] == 1) THEN DEST[175:168] ← SRC2[175:168]
    ELSE DEST[175:168] ← SRC1[175:168];
IF (MASK[183] == 1) THEN DEST[183:176] ← SRC2[183:176]
    ELSE DEST[183:176] ← SRC1[183:176];
IF (MASK[191] == 1) THEN DEST[191:184] ← SRC2[191:184]
    ELSE DEST[191:184] ← SRC1[191:184];
IF (MASK[199] == 1) THEN DEST[199:192] ← SRC2[199:192]
    ELSE DEST[199:192] ← SRC1[199:192];
IF (MASK[207] == 1) THEN DEST[207:200] ← SRC2[207:200]
    ELSE DEST[207:200] ← SRC1[207:200];
IF (MASK[215] == 1) THEN DEST[215:208] ← SRC2[215:208]
    ELSE DEST[215:208] ← SRC1[215:208];
IF (MASK[223] == 1) THEN DEST[223:216] ← SRC2[223:216]
    ELSE DEST[223:216] ← SRC1[223:216];
IF (MASK[231] == 1) THEN DEST[231:224] ← SRC2[231:224]
    ELSE DEST[231:224] ← SRC1[231:224];
IF (MASK[239] == 1) THEN DEST[239:232] ← SRC2[239:232]
    ELSE DEST[239:232] ← SRC1[239:232];
IF (MASK[247] == 1) THEN DEST[247:240] ← SRC2[247:240]
    ELSE DEST[247:240] ← SRC1[247:240];
IF (MASK[255] == 1) THEN DEST[255:248] ← SRC2[255:248]
    ELSE DEST[255:248] ← SRC1[255:248];

Intel C/C++ Compiler Intrinsic Equivalent

(V)PBLENDVB: __m128i _mm_blendv_epi8 (__m128i v1, __m128i v2, __m128i mask);

VPBLENDVB: __m256i _mm256_blendv_epi8 (__m256i v1, __m256i v2, __m256i mask);
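
The per-byte IF/ELSE listing in the Operation section above reduces to one rule: the most significant bit of each mask byte selects between SRC2 and SRC1. The following is a minimal scalar sketch of that rule in portable C, not the instruction itself; the function name `blendv_bytes_model` and its array-based interface are hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the variable byte blend (hypothetical helper).
 * For byte position i, MASK[8*i+7] is the top bit of mask[i]:
 *   set   -> the result byte comes from src2
 *   clear -> the result byte comes from src1
 * n is 16 for the 128-bit forms and 32 for the VEX.256 form. */
static void blendv_bytes_model(uint8_t *dest, const uint8_t *src1,
                               const uint8_t *src2, const uint8_t *mask,
                               size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dest[i] = (mask[i] & 0x80) ? src2[i] : src1[i];
    }
}
```

Only bit 7 of each mask byte matters in this model; a mask byte of 7FH still selects the SRC1 byte. Under the intrinsic prototypes listed above, v1 plays the role of SRC1 and v2 the role of SRC2.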

Flags Affected

None.

SIMD Floating-Point Exceptions

None.

Other Exceptions

See Exceptions Type 4; additionally

#UD If VEX.W = 1.


PBLENDW — Blend Packed Words

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 3A 0E /r ib PBLENDW xmm1, xmm2/m128, imm8 | RMI | V/V | SSE4_1 | Select words from xmm1 and xmm2/m128 from mask specified in imm8 and store the values into xmm1.
VEX.128.66.0F3A.WIG 0E /r ib VPBLENDW xmm1, xmm2, xmm3/m128, imm8 | RVMI | V/V | AVX | Select words from xmm2 and xmm3/m128 from mask specified in imm8 and store the values into xmm1.
VEX.256.66.0F3A.WIG 0E /r ib VPBLENDW ymm1, ymm2, ymm3/m256, imm8 | RVMI | V/V | AVX2 | Select words from ymm2 and ymm3/m256 from mask specified in imm8 and store the values into ymm1.

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RMI | ModRM:reg (r, w) | ModRM:r/m (r) | imm8 | NA
RVMI | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | imm8

Description

Words from the source operand (second operand) are conditionally written to the destination operand (first operand) depending on bits in the immediate operand (third operand). The immediate bits (bits 7:0) form a mask that determines whether the corresponding word in the destination is copied from the source. If a bit in the mask, corresponding to a word, is “1”, then the word is copied, else the word element in the destination operand is unchanged.

128-bit Legacy SSE version: The second source operand can be an XMM register or a 128-bit memory location. The first source and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: The second source operand can be an XMM register or a 128-bit memory location. The first source and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM register are zeroed.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register.

Operation

PBLENDW (128-bit Legacy SSE version)
IF (imm8[0] = 1) THEN DEST[15:0] ← SRC[15:0]
    ELSE DEST[15:0] ← DEST[15:0]
IF (imm8[1] = 1) THEN DEST[31:16] ← SRC[31:16]
    ELSE DEST[31:16] ← DEST[31:16]
IF (imm8[2] = 1) THEN DEST[47:32] ← SRC[47:32]
    ELSE DEST[47:32] ← DEST[47:32]
IF (imm8[3] = 1) THEN DEST[63:48] ← SRC[63:48]
    ELSE DEST[63:48] ← DEST[63:48]
IF (imm8[4] = 1) THEN DEST[79:64] ← SRC[79:64]
    ELSE DEST[79:64] ← DEST[79:64]
IF (imm8[5] = 1) THEN DEST[95:80] ← SRC[95:80]
    ELSE DEST[95:80] ← DEST[95:80]
IF (imm8[6] = 1) THEN DEST[111:96] ← SRC[111:96]
    ELSE DEST[111:96] ← DEST[111:96]
IF (imm8[7] = 1) THEN DEST[127:112] ← SRC[127:112]
    ELSE DEST[127:112] ← DEST[127:112]

VPBLENDW (VEX.128 encoded version)
IF (imm8[0] = 1) THEN DEST[15:0] ← SRC2[15:0]
    ELSE DEST[15:0] ← SRC1[15:0]
IF (imm8[1] = 1) THEN DEST[31:16] ← SRC2[31:16]
    ELSE DEST[31:16] ← SRC1[31:16]
IF (imm8[2] = 1) THEN DEST[47:32] ← SRC2[47:32]
    ELSE DEST[47:32] ← SRC1[47:32]
IF (imm8[3] = 1) THEN DEST[63:48] ← SRC2[63:48]
    ELSE DEST[63:48] ← SRC1[63:48]
IF (imm8[4] = 1) THEN DEST[79:64] ← SRC2[79:64]
    ELSE DEST[79:64] ← SRC1[79:64]
IF (imm8[5] = 1) THEN DEST[95:80] ← SRC2[95:80]
    ELSE DEST[95:80] ← SRC1[95:80]
IF (imm8[6] = 1) THEN DEST[111:96] ← SRC2[111:96]
    ELSE DEST[111:96] ← SRC1[111:96]
IF (imm8[7] = 1) THEN DEST[127:112] ← SRC2[127:112]
    ELSE DEST[127:112] ← SRC1[127:112]
DEST[MAXVL-1:128] ← 0

VPBLENDW (VEX.256 encoded version)
IF (imm8[0] == 1) THEN DEST[15:0] ← SRC2[15:0]
    ELSE DEST[15:0] ← SRC1[15:0]
IF (imm8[1] == 1) THEN DEST[31:16] ← SRC2[31:16]
    ELSE DEST[31:16] ← SRC1[31:16]
IF (imm8[2] == 1) THEN DEST[47:32] ← SRC2[47:32]
    ELSE DEST[47:32] ← SRC1[47:32]
IF (imm8[3] == 1) THEN DEST[63:48] ← SRC2[63:48]
    ELSE DEST[63:48] ← SRC1[63:48]
IF (imm8[4] == 1) THEN DEST[79:64] ← SRC2[79:64]
    ELSE DEST[79:64] ← SRC1[79:64]
IF (imm8[5] == 1) THEN DEST[95:80] ← SRC2[95:80]
    ELSE DEST[95:80] ← SRC1[95:80]
IF (imm8[6] == 1) THEN DEST[111:96] ← SRC2[111:96]
    ELSE DEST[111:96] ← SRC1[111:96]
IF (imm8[7] == 1) THEN DEST[127:112] ← SRC2[127:112]
    ELSE DEST[127:112] ← SRC1[127:112]
IF (imm8[0] == 1) THEN DEST[143:128] ← SRC2[143:128]
    ELSE DEST[143:128] ← SRC1[143:128]
IF (imm8[1] == 1) THEN DEST[159:144] ← SRC2[159:144]
    ELSE DEST[159:144] ← SRC1[159:144]
IF (imm8[2] == 1) THEN DEST[175:160] ← SRC2[175:160]
    ELSE DEST[175:160] ← SRC1[175:160]
IF (imm8[3] == 1) THEN DEST[191:176] ← SRC2[191:176]
    ELSE DEST[191:176] ← SRC1[191:176]
IF (imm8[4] == 1) THEN DEST[207:192] ← SRC2[207:192]
    ELSE DEST[207:192] ← SRC1[207:192]
IF (imm8[5] == 1) THEN DEST[223:208] ← SRC2[223:208]
    ELSE DEST[223:208] ← SRC1[223:208]
IF (imm8[6] == 1) THEN DEST[239:224] ← SRC2[239:224]
    ELSE DEST[239:224] ← SRC1[239:224]
IF (imm8[7] == 1) THEN DEST[255:240] ← SRC2[255:240]
    ELSE DEST[255:240] ← SRC1[255:240]


Intel C/C++ Compiler Intrinsic Equivalent

(V)PBLENDW: __m128i _mm_blend_epi16 (__m128i v1, __m128i v2, const int mask);

VPBLENDW: __m256i _mm256_blend_epi16 (__m256i v1, __m256i v2, const int mask)
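
The word blend listed in the Operation section can be summarized as: bit (i mod 8) of imm8 selects the source of word i, so the VEX.256 form applies the same 8-bit mask to each 128-bit half. Below is a minimal scalar sketch of that behaviour in portable C; `blend_words_model` and its parameters are hypothetical names used for illustration only, and the code models the VEX forms (separate SRC1 and DEST).

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the immediate-controlled word blend (hypothetical helper).
 * nwords is 8 for the 128-bit forms and 16 for the VEX.256 form.
 * Bit (i % 8) of imm8 controls word i, so words 8..15 reuse bits 0..7. */
static void blend_words_model(uint16_t *dest, const uint16_t *src1,
                              const uint16_t *src2, size_t nwords,
                              uint8_t imm8)
{
    for (size_t i = 0; i < nwords; i++) {
        dest[i] = ((imm8 >> (i % 8)) & 1u) ? src2[i] : src1[i];
    }
}
```

The legacy SSE form behaves like this model with `src1` and `dest` referring to the same register.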

Flags Affected

None.

SIMD Floating-Point Exceptions

None.

Other Exceptions

See Exceptions Type 4; additionally

#UD If VEX.L = 1 and AVX2 = 0.


PCLMULQDQ — Carry-Less Multiplication Quadword

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 3A 44 /r ib PCLMULQDQ xmm1, xmm2/m128, imm8 | RMI | V/V | PCLMULQDQ | Carry-less multiplication of one quadword of xmm1 by one quadword of xmm2/m128, stores the 128-bit result in xmm1. The immediate is used to determine which quadwords of xmm1 and xmm2/m128 should be used.
VEX.128.66.0F3A.WIG 44 /r ib VPCLMULQDQ xmm1, xmm2, xmm3/m128, imm8 | RVMI | V/V | Both PCLMULQDQ and AVX flags | Carry-less multiplication of one quadword of xmm2 by one quadword of xmm3/m128, stores the 128-bit result in xmm1. The immediate is used to determine which quadwords of xmm2 and xmm3/m128 should be used.

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RMI | ModRM:reg (r, w) | ModRM:r/m (r) | imm8 | NA
RVMI | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | imm8

Description

Performs a carry-less multiplication of two quadwords, selected from the first source and second source operand according to the value of the immediate byte. Bits 4 and 0 are used to select which 64-bit half of each operand to use according to Table 4-13; other bits of the immediate byte are ignored.

Table 4-13. PCLMULQDQ Quadword Selection of Immediate Byte

Imm[4] | Imm[0] | PCLMULQDQ Operation
0 | 0 | CL_MUL( SRC2¹[63:0], SRC1[63:0] )
0 | 1 | CL_MUL( SRC2[63:0], SRC1[127:64] )
1 | 0 | CL_MUL( SRC2[127:64], SRC1[63:0] )
1 | 1 | CL_MUL( SRC2[127:64], SRC1[127:64] )

NOTES:
1. SRC2 denotes the second source operand, which can be a register or memory; SRC1 denotes the first source and destination operand.

The first source operand and the destination operand are the same and must be an XMM register. The second source operand can be an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM destination register remain unchanged.

Compilers and assemblers may implement the following pseudo-op syntax to simplify programming and emit the required encoding for Imm8.

Table 4-14. Pseudo-Op and PCLMULQDQ Implementation

Pseudo-Op | Imm8 Encoding
PCLMULLQLQDQ xmm1, xmm2 | 0000_0000B
PCLMULHQLQDQ xmm1, xmm2 | 0000_0001B
PCLMULLQHQDQ xmm1, xmm2 | 0001_0000B
PCLMULHQHQDQ xmm1, xmm2 | 0001_0001B


Operation

PCLMULQDQ
IF (Imm8[0] = 0 )
    THEN
        TEMP1 ← SRC1 [63:0];
    ELSE
        TEMP1 ← SRC1 [127:64];
IF (Imm8[4] = 0 )
    THEN
        TEMP2 ← SRC2 [63:0];
    ELSE
        TEMP2 ← SRC2 [127:64];
For i = 0 to 63 {
    TmpB [ i ] ← (TEMP1[ 0 ] and TEMP2[ i ]);
    For j = 1 to i {
        TmpB [ i ] ← TmpB [ i ] xor (TEMP1[ j ] and TEMP2[ i - j ])
    }
    DEST[ i ] ← TmpB[ i ];
}
For i = 64 to 126 {
    TmpB [ i ] ← 0;
    For j = i - 63 to 63 {
        TmpB [ i ] ← TmpB [ i ] xor (TEMP1[ j ] and TEMP2[ i - j ])
    }
    DEST[ i ] ← TmpB[ i ];
}
DEST[127] ← 0;
DEST[MAXVL-1:128] (Unmodified)

VPCLMULQDQ
IF (Imm8[0] = 0 )
    THEN
        TEMP1 ← SRC1 [63:0];
    ELSE
        TEMP1 ← SRC1 [127:64];
IF (Imm8[4] = 0 )
    THEN
        TEMP2 ← SRC2 [63:0];
    ELSE
        TEMP2 ← SRC2 [127:64];
For i = 0 to 63 {
    TmpB [ i ] ← (TEMP1[ 0 ] and TEMP2[ i ]);
    For j = 1 to i {
        TmpB [ i ] ← TmpB [ i ] xor (TEMP1[ j ] and TEMP2[ i - j ])
    }
    DEST[ i ] ← TmpB[ i ];
}
For i = 64 to 126 {
    TmpB [ i ] ← 0;
    For j = i - 63 to 63 {
        TmpB [ i ] ← TmpB [ i ] xor (TEMP1[ j ] and TEMP2[ i - j ])
    }
    DEST[ i ] ← TmpB[ i ];
}
DEST[MAXVL-1:127] ← 0;

Intel C/C++ Compiler Intrinsic Equivalent

(V)PCLMULQDQ: __m128i _mm_clmulepi64_si128 (__m128i, __m128i, const int)
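
The bit-by-bit loops in the Operation section compute a polynomial product over GF(2): partial products are combined with XOR rather than addition, so no carries propagate. The sketch below is one possible shift-and-XOR formulation in portable C, together with the imm8 half selection of Table 4-13. The type `u128_model` and the function names are hypothetical helpers introduced for this illustration; they are not part of any Intel API.

```c
#include <stdint.h>

/* Hypothetical 128-bit container: lo = bits 63:0, hi = bits 127:64. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_model;

/* Carry-less product of two 64-bit values (at most 127 significant bits). */
static u128_model clmul64_model(uint64_t a, uint64_t b)
{
    u128_model r = {0, 0};
    for (int j = 0; j < 64; j++) {
        if ((a >> j) & 1u) {
            r.lo ^= b << j;               /* low part of (b << j)  */
            if (j != 0) {
                r.hi ^= b >> (64 - j);    /* bits shifted past 63  */
            }
        }
    }
    return r;
}

/* imm8[0] selects the half of SRC1, imm8[4] selects the half of SRC2;
 * all other immediate bits are ignored. */
static u128_model pclmulqdq_model(u128_model src1, u128_model src2,
                                  uint8_t imm8)
{
    uint64_t t1 = (imm8 & 0x01) ? src1.hi : src1.lo;
    uint64_t t2 = (imm8 & 0x10) ? src2.hi : src2.lo;
    return clmul64_model(t1, t2);
}
```

For example, 3 carry-less-multiplied by 3 gives 5, since (x + 1)(x + 1) = x^2 + 1 when coefficients are taken modulo 2. Bit 127 of the result is always zero in this model, which agrees with the `DEST[127] ← 0` step of the pseudocode.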

SIMD Floating-Point Exceptions

None.

Other Exceptions

See Exceptions Type 4, additionally

#UD If VEX.L = 1.


PCMPEQB/PCMPEQW/PCMPEQD — Compare Packed Data for Equal

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 74 /r¹ PCMPEQB mm, mm/m64 | A | V/V | MMX | Compare packed bytes in mm/m64 and mm for equality.
66 0F 74 /r PCMPEQB xmm1, xmm2/m128 | A | V/V | SSE2 | Compare packed bytes in xmm2/m128 and xmm1 for equality.
NP 0F 75 /r¹ PCMPEQW mm, mm/m64 | A | V/V | MMX | Compare packed words in mm/m64 and mm for equality.
66 0F 75 /r PCMPEQW xmm1, xmm2/m128 | A | V/V | SSE2 | Compare packed words in xmm2/m128 and xmm1 for equality.
NP 0F 76 /r¹ PCMPEQD mm, mm/m64 | A | V/V | MMX | Compare packed doublewords in mm/m64 and mm for equality.
66 0F 76 /r PCMPEQD xmm1, xmm2/m128 | A | V/V | SSE2 | Compare packed doublewords in xmm2/m128 and xmm1 for equality.
VEX.128.66.0F.WIG 74 /r VPCMPEQB xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Compare packed bytes in xmm3/m128 and xmm2 for equality.
VEX.128.66.0F.WIG 75 /r VPCMPEQW xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Compare packed words in xmm3/m128 and xmm2 for equality.
VEX.128.66.0F.WIG 76 /r VPCMPEQD xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Compare packed doublewords in xmm3/m128 and xmm2 for equality.
VEX.256.66.0F.WIG 74 /r VPCMPEQB ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Compare packed bytes in ymm3/m256 and ymm2 for equality.
VEX.256.66.0F.WIG 75 /r VPCMPEQW ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Compare packed words in ymm3/m256 and ymm2 for equality.
VEX.256.66.0F.WIG 76 /r VPCMPEQD ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Compare packed doublewords in ymm3/m256 and ymm2 for equality.
EVEX.128.66.0F.W0 76 /r VPCMPEQD k1 {k2}, xmm2, xmm3/m128/m32bcst | C | V/V | AVX512VL AVX512F | Compare Equal between int32 vector xmm2 and int32 vector xmm3/m128/m32bcst, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask.
EVEX.256.66.0F.W0 76 /r VPCMPEQD k1 {k2}, ymm2, ymm3/m256/m32bcst | C | V/V | AVX512VL AVX512F | Compare Equal between int32 vector ymm2 and int32 vector ymm3/m256/m32bcst, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask.
EVEX.512.66.0F.W0 76 /r VPCMPEQD k1 {k2}, zmm2, zmm3/m512/m32bcst | C | V/V | AVX512F | Compare Equal between int32 vectors in zmm2 and zmm3/m512/m32bcst, and set destination k1 according to the comparison results under writemask k2.
EVEX.128.66.0F.WIG 74 /r VPCMPEQB k1 {k2}, xmm2, xmm3/m128 | D | V/V | AVX512VL AVX512BW | Compare packed bytes in xmm3/m128 and xmm2 for equality and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask.
EVEX.256.66.0F.WIG 74 /r VPCMPEQB k1 {k2}, ymm2, ymm3/m256 | D | V/V | AVX512VL AVX512BW | Compare packed bytes in ymm3/m256 and ymm2 for equality and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask.
EVEX.512.66.0F.WIG 74 /r VPCMPEQB k1 {k2}, zmm2, zmm3/m512 | D | V/V | AVX512BW | Compare packed bytes in zmm3/m512 and zmm2 for equality and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask.
EVEX.128.66.0F.WIG 75 /r VPCMPEQW k1 {k2}, xmm2, xmm3/m128 | D | V/V | AVX512VL AVX512BW | Compare packed words in xmm3/m128 and xmm2 for equality and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask.
EVEX.256.66.0F.WIG 75 /r VPCMPEQW k1 {k2}, ymm2, ymm3/m256 | D | V/V | AVX512VL AVX512BW | Compare packed words in ymm3/m256 and ymm2 for equality and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask.
EVEX.512.66.0F.WIG 75 /r VPCMPEQW k1 {k2}, zmm2, zmm3/m512 | D | V/V | AVX512BW | Compare packed words in zmm3/m512 and zmm2 for equality and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask.

NOTES:
1. See note in Section 2.4, “AVX and SSE Instruction Exception Specification” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A and Section 22.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
C | Full | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA
D | Full Mem | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description

Performs a SIMD compare for equality of the packed bytes, words, or doublewords in the destination operand (first operand) and the source operand (second operand). If a pair of data elements is equal, the corresponding data element in the destination operand is set to all 1s; otherwise, it is set to all 0s.

The (V)PCMPEQB instruction compares the corresponding bytes in the destination and source operands; the (V)PCMPEQW instruction compares the corresponding words in the destination and source operands; and the (V)PCMPEQD instruction compares the corresponding doublewords in the destination and source operands.

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to access additional registers (XMM8-XMM15).

Legacy SSE instructions: The source operand can be an MMX technology register or a 64-bit memory location. The destination operand can be an MMX technology register.

128-bit Legacy SSE version: The second source operand can be an XMM register or a 128-bit memory location. The first source and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: The second source operand can be an XMM register or a 128-bit memory location. The first source and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM register are zeroed.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register.

EVEX encoded VPCMPEQD: The first source operand (second operand) is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand (first operand) is a mask register updated according to the writemask k2.

EVEX encoded VPCMPEQB/W: The first source operand (second operand) is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand (first operand) is a mask register updated according to the writemask k2.

Operation

PCMPEQB (with 64-bit operands)
IF DEST[7:0] = SRC[7:0]
    THEN DEST[7:0] ← FFH;
    ELSE DEST[7:0] ← 0; FI;
(* Continue comparison of 2nd through 7th bytes in DEST and SRC *)
IF DEST[63:56] = SRC[63:56]
    THEN DEST[63:56] ← FFH;
    ELSE DEST[63:56] ← 0; FI;

COMPARE_BYTES_EQUAL (SRC1, SRC2)
IF SRC1[7:0] = SRC2[7:0]
    THEN DEST[7:0] ← FFH;
    ELSE DEST[7:0] ← 0; FI;
(* Continue comparison of 2nd through 15th bytes in SRC1 and SRC2 *)
IF SRC1[127:120] = SRC2[127:120]
    THEN DEST[127:120] ← FFH;
    ELSE DEST[127:120] ← 0; FI;

COMPARE_WORDS_EQUAL (SRC1, SRC2)
IF SRC1[15:0] = SRC2[15:0]
    THEN DEST[15:0] ← FFFFH;
    ELSE DEST[15:0] ← 0; FI;
(* Continue comparison of 2nd through 7th 16-bit words in SRC1 and SRC2 *)
IF SRC1[127:112] = SRC2[127:112]
    THEN DEST[127:112] ← FFFFH;
    ELSE DEST[127:112] ← 0; FI;

COMPARE_DWORDS_EQUAL (SRC1, SRC2)
IF SRC1[31:0] = SRC2[31:0]
    THEN DEST[31:0] ← FFFFFFFFH;
    ELSE DEST[31:0] ← 0; FI;
(* Continue comparison of 2nd through 3rd 32-bit dwords in SRC1 and SRC2 *)
IF SRC1[127:96] = SRC2[127:96]
    THEN DEST[127:96] ← FFFFFFFFH;
    ELSE DEST[127:96] ← 0; FI;

PCMPEQB (with 128-bit operands)
DEST[127:0] ← COMPARE_BYTES_EQUAL(DEST[127:0],SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)


VPCMPEQB (VEX.128 encoded version)
DEST[127:0] ← COMPARE_BYTES_EQUAL(SRC1[127:0],SRC2[127:0])
DEST[MAXVL-1:128] ← 0

VPCMPEQB (VEX.256 encoded version)
DEST[127:0] ← COMPARE_BYTES_EQUAL(SRC1[127:0],SRC2[127:0])
DEST[255:128] ← COMPARE_BYTES_EQUAL(SRC1[255:128],SRC2[255:128])
DEST[MAXVL-1:256] ← 0

VPCMPEQB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k2[j] OR *no writemask*
        THEN
            /* signed comparison */
            CMP ← SRC1[i+7:i] == SRC2[i+7:i];
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0

PCMPEQW (with 64-bit operands)
IF DEST[15:0] = SRC[15:0]
    THEN DEST[15:0] ← FFFFH;
    ELSE DEST[15:0] ← 0; FI;
(* Continue comparison of 2nd and 3rd words in DEST and SRC *)
IF DEST[63:48] = SRC[63:48]
    THEN DEST[63:48] ← FFFFH;
    ELSE DEST[63:48] ← 0; FI;

PCMPEQW (with 128-bit operands)
DEST[127:0] ← COMPARE_WORDS_EQUAL(DEST[127:0],SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)

VPCMPEQW (VEX.128 encoded version)
DEST[127:0] ← COMPARE_WORDS_EQUAL(SRC1[127:0],SRC2[127:0])
DEST[MAXVL-1:128] ← 0

VPCMPEQW (VEX.256 encoded version)
DEST[127:0] ← COMPARE_WORDS_EQUAL(SRC1[127:0],SRC2[127:0])
DEST[255:128] ← COMPARE_WORDS_EQUAL(SRC1[255:128],SRC2[255:128])
DEST[MAXVL-1:256] ← 0


VPCMPEQW (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k2[j] OR *no writemask*
        THEN
            /* signed comparison */
            CMP ← SRC1[i+15:i] == SRC2[i+15:i];
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0

PCMPEQD (with 64-bit operands)
IF DEST[31:0] = SRC[31:0]
    THEN DEST[31:0] ← FFFFFFFFH;
    ELSE DEST[31:0] ← 0; FI;
IF DEST[63:32] = SRC[63:32]
    THEN DEST[63:32] ← FFFFFFFFH;
    ELSE DEST[63:32] ← 0; FI;

PCMPEQD (with 128-bit operands)
DEST[127:0] ← COMPARE_DWORDS_EQUAL(DEST[127:0],SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)

VPCMPEQD (VEX.128 encoded version)
DEST[127:0] ← COMPARE_DWORDS_EQUAL(SRC1[127:0],SRC2[127:0])
DEST[MAXVL-1:128] ← 0

VPCMPEQD (VEX.256 encoded version)
DEST[127:0] ← COMPARE_DWORDS_EQUAL(SRC1[127:0],SRC2[127:0])
DEST[255:128] ← COMPARE_DWORDS_EQUAL(SRC1[255:128],SRC2[255:128])
DEST[MAXVL-1:256] ← 0

VPCMPEQD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k2[j] OR *no writemask*
        THEN
            /* signed comparison */
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN CMP ← SRC1[i+31:i] = SRC2[31:0];
                ELSE CMP ← SRC1[i+31:i] = SRC2[i+31:i];
            FI;
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0


Intel C/C++ Compiler Intrinsic Equivalents

VPCMPEQB __mmask64 _mm512_cmpeq_epi8_mask(__m512i a, __m512i b);

VPCMPEQB __mmask64 _mm512_mask_cmpeq_epi8_mask(__mmask64 k, __m512i a, __m512i b);

VPCMPEQB __mmask32 _mm256_cmpeq_epi8_mask(__m256i a, __m256i b);

VPCMPEQB __mmask32 _mm256_mask_cmpeq_epi8_mask(__mmask32 k, __m256i a, __m256i b);

VPCMPEQB __mmask16 _mm_cmpeq_epi8_mask(__m128i a, __m128i b);

VPCMPEQB __mmask16 _mm_mask_cmpeq_epi8_mask(__mmask16 k, __m128i a, __m128i b);

VPCMPEQW __mmask32 _mm512_cmpeq_epi16_mask(__m512i a, __m512i b);

VPCMPEQW __mmask32 _mm512_mask_cmpeq_epi16_mask(__mmask32 k, __m512i a, __m512i b);

VPCMPEQW __mmask16 _mm256_cmpeq_epi16_mask(__m256i a, __m256i b);

VPCMPEQW __mmask16 _mm256_mask_cmpeq_epi16_mask(__mmask16 k, __m256i a, __m256i b);

VPCMPEQW __mmask8 _mm_cmpeq_epi16_mask(__m128i a, __m128i b);

VPCMPEQW __mmask8 _mm_mask_cmpeq_epi16_mask(__mmask8 k, __m128i a, __m128i b);

VPCMPEQD __mmask16 _mm512_cmpeq_epi32_mask( __m512i a, __m512i b);

VPCMPEQD __mmask16 _mm512_mask_cmpeq_epi32_mask(__mmask16 k, __m512i a, __m512i b);

VPCMPEQD __mmask8 _mm256_cmpeq_epi32_mask(__m256i a, __m256i b);

VPCMPEQD __mmask8 _mm256_mask_cmpeq_epi32_mask(__mmask8 k, __m256i a, __m256i b);

VPCMPEQD __mmask8 _mm_cmpeq_epi32_mask(__m128i a, __m128i b);

VPCMPEQD __mmask8 _mm_mask_cmpeq_epi32_mask(__mmask8 k, __m128i a, __m128i b);

PCMPEQB: __m64 _mm_cmpeq_pi8 (__m64 m1, __m64 m2)

PCMPEQW: __m64 _mm_cmpeq_pi16 (__m64 m1, __m64 m2)

PCMPEQD: __m64 _mm_cmpeq_pi32 (__m64 m1, __m64 m2)

(V)PCMPEQB: __m128i _mm_cmpeq_epi8 ( __m128i a, __m128i b)

(V)PCMPEQW: __m128i _mm_cmpeq_epi16 ( __m128i a, __m128i b)

(V)PCMPEQD: __m128i _mm_cmpeq_epi32 ( __m128i a, __m128i b)

VPCMPEQB: __m256i _mm256_cmpeq_epi8 ( __m256i a, __m256i b)

VPCMPEQW: __m256i _mm256_cmpeq_epi16 ( __m256i a, __m256i b)

VPCMPEQD: __m256i _mm256_cmpeq_epi32 ( __m256i a, __m256i b)
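
The intrinsics above fall into two families: those returning a vector (each element all 1s or all 0s) and those returning a mask (one bit per element, optionally gated by a writemask). The following portable C sketch models both result shapes for byte elements. The function names and array-based interface are hypothetical and exist only for this illustration; passing an all-ones `k2` is assumed here to stand in for the "no writemask" case.

```c
#include <stddef.h>
#include <stdint.h>

/* Vector-result model: each result byte is FFH on equality, 00H otherwise.
 * n is 8 (MMX), 16 (XMM) or 32 (YMM). */
static void cmpeq_bytes_model(uint8_t *dest, const uint8_t *src1,
                              const uint8_t *src2, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dest[i] = (src1[i] == src2[i]) ? 0xFF : 0x00;
    }
}

/* Mask-result model (EVEX forms): bit j of the returned mask is 1 only if
 * writemask bit k2[j] is set and element j compares equal; all other bits,
 * including those at positions >= n, are 0. n is 16, 32 or 64. */
static uint64_t cmpeq_bytes_mask_model(const uint8_t *src1,
                                       const uint8_t *src2,
                                       size_t n, uint64_t k2)
{
    uint64_t k1 = 0;
    for (size_t j = 0; j < n && j < 64; j++) {
        if (((k2 >> j) & 1u) && src1[j] == src2[j]) {
            k1 |= (uint64_t)1 << j;
        }
    }
    return k1;
}
```

The word and doubleword variants follow the same pattern with wider elements and correspondingly fewer mask bits.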

Flags Affected

None.

SIMD Floating-Point Exceptions

None.

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.

EVEX-encoded VPCMPEQD, see Exceptions Type E4.

EVEX-encoded VPCMPEQB/W, see Exceptions Type E4.nb.


PCMPEQQ — Compare Packed Qword Data for Equal

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 38 29 /r PCMPEQQ xmm1, xmm2/m128 | A | V/V | SSE4_1 | Compare packed qwords in xmm2/m128 and xmm1 for equality.
VEX.128.66.0F38.WIG 29 /r VPCMPEQQ xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Compare packed quadwords in xmm3/m128 and xmm2 for equality.
VEX.256.66.0F38.WIG 29 /r VPCMPEQQ ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Compare packed quadwords in ymm3/m256 and ymm2 for equality.
EVEX.128.66.0F38.W1 29 /r VPCMPEQQ k1 {k2}, xmm2, xmm3/m128/m64bcst | C | V/V | AVX512VL AVX512F | Compare Equal between int64 vector xmm2 and int64 vector xmm3/m128/m64bcst, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask.
EVEX.256.66.0F38.W1 29 /r VPCMPEQQ k1 {k2}, ymm2, ymm3/m256/m64bcst | C | V/V | AVX512VL AVX512F | Compare Equal between int64 vector ymm2 and int64 vector ymm3/m256/m64bcst, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask.
EVEX.512.66.0F38.W1 29 /r VPCMPEQQ k1 {k2}, zmm2, zmm3/m512/m64bcst | C | V/V | AVX512F | Compare Equal between int64 vector zmm2 and int64 vector zmm3/m512/m64bcst, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask.

Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
C | Full | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description

Performs a SIMD compare for equality of the packed quadwords in the destination operand (first operand) and the source operand (second operand). If a pair of data elements is equal, the corresponding data element in the destination is set to all 1s; otherwise, it is set to 0s.

128-bit Legacy SSE version: The second source operand can be an XMM register or a 128-bit memory location. The first source and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: The second source operand can be an XMM register or a 128-bit memory location. The first source and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM register are zeroed.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register.

EVEX encoded VPCMPEQQ: The first source operand (second operand) is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand (first operand) is a mask register updated according to the writemask k2.


Operation

PCMPEQQ (with 128-bit operands)
IF (DEST[63:0] = SRC[63:0])
    THEN DEST[63:0] ← FFFFFFFFFFFFFFFFH;
    ELSE DEST[63:0] ← 0; FI;
IF (DEST[127:64] = SRC[127:64])
    THEN DEST[127:64] ← FFFFFFFFFFFFFFFFH;
    ELSE DEST[127:64] ← 0; FI;
DEST[MAXVL-1:128] (Unmodified)

COMPARE_QWORDS_EQUAL (SRC1, SRC2)
IF SRC1[63:0] = SRC2[63:0]
    THEN DEST[63:0] ← FFFFFFFFFFFFFFFFH;
    ELSE DEST[63:0] ← 0; FI;
IF SRC1[127:64] = SRC2[127:64]
    THEN DEST[127:64] ← FFFFFFFFFFFFFFFFH;
    ELSE DEST[127:64] ← 0; FI;

VPCMPEQQ (VEX.128 encoded version)
DEST[127:0] ← COMPARE_QWORDS_EQUAL(SRC1,SRC2)
DEST[MAXVL-1:128] ← 0

VPCMPEQQ (VEX.256 encoded version)
DEST[127:0] ← COMPARE_QWORDS_EQUAL(SRC1[127:0],SRC2[127:0])
DEST[255:128] ← COMPARE_QWORDS_EQUAL(SRC1[255:128],SRC2[255:128])
DEST[MAXVL-1:256] ← 0

VPCMPEQQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k2[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN CMP ← SRC1[i+63:i] = SRC2[63:0];
                ELSE CMP ← SRC1[i+63:i] = SRC2[i+63:i];
            FI;
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

VPCMPEQQ __mmask8 _mm512_cmpeq_epi64_mask( __m512i a, __m512i b);

VPCMPEQQ __mmask8 _mm512_mask_cmpeq_epi64_mask(__mmask8 k, __m512i a, __m512i b);

VPCMPEQQ __mmask8 _mm256_cmpeq_epi64_mask( __m256i a, __m256i b);

VPCMPEQQ __mmask8 _mm256_mask_cmpeq_epi64_mask(__mmask8 k, __m256i a, __m256i b);

VPCMPEQQ __mmask8 _mm_cmpeq_epi64_mask( __m128i a, __m128i b);

VPCMPEQQ __mmask8 _mm_mask_cmpeq_epi64_mask(__mmask8 k, __m128i a, __m128i b);

(V)PCMPEQQ: __m128i _mm_cmpeq_epi64(__m128i a, __m128i b);

VPCMPEQQ: __m256i _mm256_cmpeq_epi64( __m256i a, __m256i b);
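
The EVEX pseudocode for VPCMPEQQ adds one twist over the byte and word forms: with embedded broadcast (EVEX.b = 1 and a memory source), every element of SRC1 is compared against the single 64-bit value SRC2[63:0]. A minimal portable C sketch of that mask computation follows. The function `cmpeq_qwords_mask_model` and its `broadcast` flag are hypothetical names for illustration, and an all-ones `k2` is assumed to represent the "no writemask" case.

```c
#include <stddef.h>
#include <stdint.h>

/* Mask-result model of the EVEX quadword compare (hypothetical helper).
 * kl is the element count: 2, 4 or 8 for 128/256/512-bit vectors.
 * If broadcast is nonzero, src2[0] is compared with every src1 element,
 * modelling the m64bcst source form. Mask bits at positions >= kl stay 0. */
static uint8_t cmpeq_qwords_mask_model(const uint64_t *src1,
                                       const uint64_t *src2,
                                       size_t kl, uint8_t k2,
                                       int broadcast)
{
    uint8_t k1 = 0;
    for (size_t j = 0; j < kl && j < 8; j++) {
        uint64_t rhs = broadcast ? src2[0] : src2[j];
        if (((k2 >> j) & 1u) && src1[j] == rhs) {
            k1 |= (uint8_t)(1u << j);
        }
    }
    return k1;
}
```

Because the result is zeroing-masked, an element whose writemask bit is clear always yields 0 in the destination mask, regardless of the comparison outcome.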

Flags Affected

None.

SIMD Floating-Point Exceptions

None.

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.

EVEX-encoded VPCMPEQQ, see Exceptions Type E4.


PCMPESTRI — Packed Compare Explicit Length Strings, Return Index

Instruction Operand Encoding

Description

The instruction compares and processes data from two string fragments based on the encoded value in the Imm8

Control Byte (see Section 4.1, “Imm8 Control Byte Operation for PCMPESTRI / PCMPESTRM / PCMPISTRI / PCMP-

ISTRM”), and generates an index stored to the count register (ECX).

Each string fragment is represented by two values. The first value is an xmm (or possibly m128 for the second

operand) which contains the data elements of the string (byte or word data). The second value is stored in an input

length register. The input length register is EAX/RAX (for xmm1) or EDX/RDX (for xmm2/m128). The length repre-

sents the number of bytes/words which are valid for the respective xmm/m128 data.

The length of each input is interpreted as being the absolute-value of the value in the length register. The absolute-

value computation saturates to 16 (for bytes) and 8 (for words), based on the value of imm8[bit3] when the value

in the length register is greater than 16 (8) or less than -16 (-8).

The comparison and aggregation operations are performed according to the encoded value of Imm8 bit fields (see

Section 4.1). The index of the first (or last, according to imm8[6]) set bit of IntRes2 (see Section 4.1.4) is returned

in ECX. If no bits are set in IntRes2, ECX is set to 16 (8).

Note that the Arithmetic Flags are written in a non-standard manner in order to supply the most relevant informa-

tion:

CFlag – Reset if IntRes2 is equal to zero, set otherwise

ZFlag – Set if absolute-value of EDX is < 16 (8), reset otherwise

SFlag – Set if absolute-value of EAX is < 16 (8), reset otherwise

OFlag – IntRes2[0]

AFlag – Reset

PFlag – Reset
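
The flag rules listed above can be restated as a small C model. This is a sketch under stated assumptions: IntRes2 and the raw length register values are supplied by the caller, and the struct pcmpestr_flags type and the model_flags function are hypothetical names introduced only for this example.

```c
#include <stdint.h>

/* Hypothetical container for the six arithmetic flags. */
struct pcmpestr_flags { int cf, zf, sf, of, af, pf; };

/* Hypothetical model of the flag results for a given IntRes2 and the raw
 * values of the two length registers (eax for xmm1, edx for xmm2/m128).
 * is_word != 0 means word elements (limit 8), otherwise bytes (limit 16). */
struct pcmpestr_flags model_flags(uint16_t intres2, int64_t eax,
                                  int64_t edx, int is_word)
{
    const int64_t n = is_word ? 8 : 16;
    struct pcmpestr_flags f;

    f.cf = (intres2 != 0);          /* CFlag: set unless IntRes2 is zero   */
    f.zf = (edx > -n && edx < n);   /* ZFlag: |EDX| < 16 (8)               */
    f.sf = (eax > -n && eax < n);   /* SFlag: |EAX| < 16 (8)               */
    f.of = intres2 & 1;             /* OFlag: IntRes2[0]                   */
    f.af = 0;                       /* AFlag: reset                        */
    f.pf = 0;                       /* PFlag: reset                        */
    return f;
}
```

The range test -n < x < n is used in place of an explicit absolute value so that the model stays well defined for the most negative register value.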

Effective Operand Size

Intel C/C++ Compiler Intrinsic Equivalent For Returning Index

int _mm_cmpestri (__m128i a, int la, __m128i b, int lb, const int mode);
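
As an illustration of the intrinsic above, the following sketch looks for the first byte of a short text fragment that matches any byte of a small character set. The helper name first_match_index is hypothetical. The sketch assumes both lengths are in the range 0 to 16, uses the _SIDD_* mode constants from <nmmintrin.h>, and falls back to a scalar loop when the compiler is not targeting SSE4.2 (for example, when built without -msse4.2 on GCC or Clang).

```c
#include <string.h>
#if defined(__SSE4_2__)
#include <nmmintrin.h>  /* SSE4.2 intrinsics, including _mm_cmpestri */
#endif

/* Hypothetical helper: returns the index of the first byte in
 * text[0..text_len) that equals any byte in set[0..set_len),
 * or 16 if there is none. Both lengths must be in the range 0..16. */
int first_match_index(const char *set, int set_len,
                      const char *text, int text_len)
{
#if defined(__SSE4_2__)
    /* Copy into zero-padded 16-byte buffers so the loads never read
     * past the end of the caller's data. */
    unsigned char a_buf[16] = {0}, b_buf[16] = {0};
    memcpy(a_buf, set, (size_t)set_len);
    memcpy(b_buf, text, (size_t)text_len);

    __m128i a = _mm_loadu_si128((const __m128i *)a_buf);
    __m128i b = _mm_loadu_si128((const __m128i *)b_buf);

    /* Unsigned bytes, "equal any" aggregation, least significant index. */
    return _mm_cmpestri(a, set_len, b, text_len,
                        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
                        _SIDD_LEAST_SIGNIFICANT);
#else
    /* Scalar fallback with the same result for lengths in 0..16. */
    for (int j = 0; j < text_len; ++j)
        if (memchr(set, text[j], (size_t)set_len) != NULL)
            return j;
    return 16;
#endif
}
```

For instance, first_match_index("aeiou", 5, "xyzebc", 6) returns 3, and a text with no matching byte returns 16, mirroring the "no bits set in IntRes2" case for byte data.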

Opcode/Instruction                 Op/En   64/32 bit Mode Support   CPUID Feature Flag   Description

66 0F 3A 61 /r imm8                RMI     V/V                      SSE4_2               Perform a packed comparison of string data with explicit lengths,
PCMPESTRI xmm1, xmm2/m128, imm8                                                          generating an index, and storing the result in ECX.

VEX.128.66.0F3A 61 /r ib           RMI     V/V                      AVX                  Perform a packed comparison of string data with explicit lengths,
VPCMPESTRI xmm1, xmm2/m128, imm8                                                         generating an index, and storing the result in ECX.

Op/En   Operand 1       Operand 2       Operand 3   Operand 4

RMI     ModRM:reg (r)   ModRM:r/m (r)   imm8        NA

Operating mode/size   Operand 1   Operand 2   Length 1   Length 2   Result

16 bit                xmm         xmm/m128    EAX        EDX        ECX

32 bit                xmm         xmm/m128    EAX        EDX        ECX

64 bit                xmm         xmm/m128    EAX        EDX        ECX

64 bit + REX.W        xmm         xmm/m128    RAX        RDX        ECX

Intel C/C++ Compiler Intrinsics For Reading EFlag Results

int _mm_cmpestra (__m128i a, int la, __m128i b, int lb, const int mode);

int _mm_cmpestrc (__m128i a, int la, __m128i b, int lb, const int mode);

int _mm_cmpestro (__m128i a, int la, __m128i b, int lb, const int mode);

int _mm_cmpestrs (__m128i a, int la, __m128i b, int lb, const int mode);

int _mm_cmpestrz (__m128i a, int la, __m128i b, int lb, const int mode);

SIMD Floating-Point Exceptions

None.

Other Exceptions

See Exceptions Type 4; additionally, this instruction does not cause #GP if the memory operand is not aligned to a 16-byte boundary, and

#UD     If VEX.L = 1.
        If VEX.vvvv ≠ 1111B.

PCMPESTRM — Packed Compare Explicit Length Strings, Return Mask

Instruction Operand Encoding

Description

The instruction compares data from two string fragments based on the encoded value in the imm8 control byte (see Section 4.1, “Imm8 Control Byte Operation for PCMPESTRI / PCMPESTRM / PCMPISTRI / PCMPISTRM”), and generates a mask stored to XMM0.