Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2C: Instruction Set Reference, V-Z
NOTE: The Intel® 64 and IA-32 Architectures Software Developer's Manual consists of nine volumes: Basic Architecture, Order Number 253665; Instruction Set Reference A-L, Order Number 253666; Instruction Set Reference M-U, Order Number 253667; Instruction Set Reference V-Z, Order Number 326018; Instruction Set Reference, Order Number 334569; System Programming Guide, Part 1, Order Number 253668; System Programming Guide, Part 2, Order Number 253669; System Programming Guide, Part 3, Order Number 326019; System Programming Guide, Part 4, Order Number 332831. Refer to all nine volumes when evaluating your design needs.

Order Number: 326018-060US, September 2016

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure. Intel does not assume any liability for lost or stolen data or systems or any damages resulting from such losses. You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting http://www.intel.com/design/literature.htm.

Intel, the Intel logo, Intel Atom, Intel Core, Intel SpeedStep, MMX, Pentium, VTune, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Copyright © 1997-2016, Intel Corporation. All Rights Reserved.

CHAPTER 5
INSTRUCTION SET REFERENCE, V-Z

5.1 TERNARY BIT VECTOR LOGIC TABLE

The VPTERNLOGD/VPTERNLOGQ instructions operate on dword/qword elements, taking three bit vectors of the respective input data elements to form a set of 32/64 indices, where each 3-bit value provides an index into an 8-bit lookup table represented by the imm8 byte of the instruction. The 256 possible values of the imm8 byte are organized as a 16x16 boolean logic table. The 16 rows of the table use the lower 4 bits of imm8 as the row index; the 16 columns are referenced by imm8[7:4]. The 16 columns of the table are presented in two halves: Table 5-1 shows the 8 columns for column index values 0:7, and Table 5-2 shows the 8 columns corresponding to column index values 8:15. This section presents the two halves of the 256-entry table using a shorthand notation representing simple or compound boolean logic expressions with three input bit source data.
The three input bit source data are denoted with the capital letters A, B, and C, where A represents a bit from the first source operand (also the destination operand), and B and C represent a bit from the second and third source operands. Each map entry takes the form of a logic expression consisting of one or more component expressions. Each component expression consists of either a unary or binary boolean operator and associated operands. Each binary boolean operator is expressed in lowercase letters, with the operands concatenated after the logic operator. The unary operator 'not' is expressed using '!'. Additionally, the conditional expression "A?B:C" expresses a result returning B if A is set, and returning C otherwise.

A binary boolean operator is followed by two operands, e.g., andAB. For a compound binary expression that contains commutative components comprising the same logic operator, the second logic operator is omitted and the three operands are concatenated in sequence, e.g., andABC. When the second operand of the first binary boolean expression comes from the result of another boolean expression, the second boolean expression is concatenated after the uppercase operand of the first logic expression, e.g., norBnandAC. When the result is independent of an operand, that operand is omitted from the logic expression, e.g., zeros or norCB. The 3-input expression "majorABC" returns 0 if two or more input bits are 0, and 1 if two or more input bits are 1. The 3-input expression "minorABC" returns 1 if two or more input bits are 0, and 0 if two or more input bits are 1.

The building-block bit logic functions used in Table 5-1 and Table 5-2 include:
• Constants: TRUE (1), FALSE (0);
• Unary function: Not (!);
• Binary functions: and, nand, or, nor, xor, xnor;
• Conditional function: Select (?:);
• Tertiary functions: major, minor.

Table 5-1. Low 8 columns of the 16x16 Map of VPTERNLOG Boolean Logic Operations

Imm[3:0] | 0H | 1H | 2H | 3H | 4H | 5H | 6H | 7H
00H | FALSE | andAnorBC | norBnandAC | andA!B | norCnandBA | andA!C | andAxorBC | andAnandBC
01H | norABC | norCB | norBxorAC | A?!B:norBC | norCxorBA | A?!C:norBC | A?xorBC:norBC | A?nandBC:norBC
02H | andCnorBA | norBxnorAC | andC!B | norBnorAC | C?norBA:andBA | C?norBA:A | C?!B:andBA | C?!B:A
03H | norBA | norBandAC | C?!B:norBA | !B | C?norBA:xnorBA | A?!C:!B | A?xorBC:!B | A?nandBC:!B
04H | andBnorAC | norCxnorBA | B?norAC:andAC | B?norAC:A | andB!C | norCnorBA | B?!C:andAC | B?!C:A
05H | norCA | norCandBA | B?norAC:xnorAC | A?!B:!C | B?!C:norAC | !C | A?xorBC:!C | A?nandBC:!C
06H | norAxnorBC | A?norBC:xorBC | B?norAC:C | xorBorAC | C?norBA:B | xorCorBA | xorCB | B?!C:orAC
07H | norAandBC | minorABC | C?!B:!A | nandBorAC | B?!C:!A | nandCorBA | A?xorBC:nandBC | nandCB
08H | norAnandBC | A?norBC:andBC | andCxorBA | A?!B:andBC | andBxorAC | A?!C:andBC | A?xorBC:andBC | xorAandBC
09H | norAxorBC | A?norBC:xnorBC | C?xorBA:norBA | A?!B:xnorBC | B?xorAC:norAC | A?!C:xnorBC | xnorABC | A?nandBC:xnorBC
0AH | andC!A | A?norBC:C | andCnandBA | A?!B:C | C?!A:andBA | xorCA | xorCandBA | A?nandBC:C
0BH | C?!A:norBA | C?!A:!B | C?nandBA:norBA | C?nandBA:!B | B?xorAC:!A | B?xorAC:nandAC | C?nandBA:xnorBA | nandBxnorAC
0CH | andB!A | A?norBC:B | B?!A:andAC | xorBA | andBnandAC | A?!C:B | xorBandAC | A?nandBC:B
0DH | B?!A:norAC | B?!A:!C | B?!A:xnorAC | C?xorBA:nandBA | B?nandAC:norAC | B?nandAC:!C | B?nandAC:xnorAC | nandCxnorBA
0EH | norAnorBC | xorAorBC | B?!A:C | A?!B:orBC | C?!A:B | A?!C:orBC | B?nandAC:C | A?nandBC:orBC
0FH | !A | nandAorBC | C?nandBA:!A | nandBA | B?nandAC:!A | nandCA | nandAxnorBC | nandABC

Table 5-2 shows the half of the 256-entry map corresponding to column index values 8:15.
Table 5-2. High 8 columns of the 16x16 Map of VPTERNLOG Boolean Logic Operations

Imm[3:0] | 08H | 09H | 0AH | 0BH | 0CH | 0DH | 0EH | 0FH
00H | andABC | andAxnorBC | andCA | B?andAC:A | andBA | C?andBA:A | andAorBC | A
01H | A?andBC:norBC | B?andAC:!C | A?C:norBC | C?A:!B | A?B:norBC | B?A:!C | xnorAorBC | orAnorBC
02H | andCxnorBA | B?andAC:xorAC | B?andAC:C | B?andAC:orAC | C?xnorBA:andBA | B?A:xorAC | B?A:C | B?A:orAC
03H | A?andBC:!B | xnorBandAC | A?C:!B | nandBnandAC | xnorBA | B?A:nandAC | A?orBC:!B | orA!B
04H | andBxnorAC | C?andBA:xorBA | B?xnorAC:andAC | B?xnorAC:A | C?andBA:B | C?andBA:orBA | C?A:B | C?A:orBA
05H | A?andBC:!C | xnorCandBA | xnorCA | C?A:nandBA | A?B:!C | nandCnandBA | A?orBC:!C | orA!C
06H | A?andBC:xorBC | xorABC | A?C:xorBC | B?xnorAC:orAC | A?B:xorBC | C?xnorBA:orBA | A?orBC:xorBC | orAxorBC
07H | xnorAandBC | A?xnorBC:nandBC | A?C:nandBC | nandBxorAC | A?B:nandBC | nandCxorBA | A?orBC:nandBC | orAnandBC
08H | andCB | A?xnorBC:andBC | andCorAB | B?C:A | andBorAC | C?B:A | majorABC | orAandBC
09H | B?C:norAC | xnorCB | xnorCorBA | C?orBA:!B | xnorBorAC | B?orAC:!C | A?orBC:xnorBC | orAxnorBC
0AH | A?andBC:C | A?xnorBC:C | C | B?C:orAC | A?B:C | B?orAC:xorAC | orCandBA | orCA
0BH | B?C:!A | B?C:nandAC | orCnorBA | orC!B | B?orAC:!A | B?orAC:nandAC | orCxnorBA | nandBnorAC
0CH | A?andBC:B | A?xnorBC:B | A?C:B | C?orBA:xorBA | B | C?B:orBA | orBandAC | orBA
0DH | C?B:!A | C?B:nandBA | C?orBA:!A | C?orBA:nandBA | orBnorAC | orB!C | orBxnorAC | nandCnorBA
0EH | A?andBC:orBC | A?xnorBC:orBC | A?C:orBC | orCxorBA | A?B:orBC | orBxorAC | orCB | orABC
0FH | nandAnandBC | nandAxorBC | orC!A | orCnandBA | orB!A | orBnandAC | nandAnorBC | TRUE

Table 5-1 and Table 5-2 translate each possible value of the imm8 byte to a Boolean expression. These tables can also be used by software to translate Boolean expressions to the numerical constant that forms the imm8 value needed to construct the VPTERNLOG syntax. There is a unique set of three byte constants (F0H, CCH, AAH) that can be used for this purpose as input operands in conjunction with the Boolean expressions defined in those tables. The reverse mapping can be expressed as:

Result_imm8 = Table_Lookup_Entry(0F0H, 0CCH, 0AAH)

where Table_Lookup_Entry is the Boolean expression defined in Table 5-1 and Table 5-2.
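As an illustration of the reverse mapping (the following sketch and its helper names, vpternlog_imm8 and orAandBC, are illustrative, not part of the manual), evaluating any three-input Boolean expression on the constants F0H, CCH, AAH yields the imm8 that makes VPTERNLOG compute that expression bit by bit:

    #include <stdint.h>
    #include <stdio.h>

    /* Evaluate a three-input Boolean expression on the byte constants
       A = F0H, B = CCH, C = AAH; the 8-bit result is the imm8 encoding
       of that expression for VPTERNLOGD/VPTERNLOGQ. */
    static uint8_t vpternlog_imm8(uint8_t (*expr)(uint8_t, uint8_t, uint8_t)) {
        return expr(0xF0, 0xCC, 0xAA);
    }

    /* Example expression from Table 5-2: orAandBC = A | (B & C). */
    static uint8_t orAandBC(uint8_t a, uint8_t b, uint8_t c) {
        return (uint8_t)(a | (b & c));
    }

    int main(void) {
        printf("imm8 = 0x%02X\n", vpternlog_imm8(orAandBC));  /* prints 0xF8 */
        return 0;
    }

The printed value 0xF8 agrees with Table 5-2: orAandBC appears at row 08H, column 0FH.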
5.2 INSTRUCTIONS (V-Z)

Chapter 5 continues an alphabetical discussion of Intel® 64 and IA-32 instructions (V-Z). See also: Chapter 3, "Instruction Set Reference, A-L," in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2A, and Chapter 4, "Instruction Set Reference, M-U," in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2B.

VALIGND/VALIGNQ—Align Doubleword/Quadword Vectors

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.66.0F3A.W0 03 /r ib VALIGND xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst, imm8 | FV | V/V | AVX512VL AVX512F | Shift right and merge vectors xmm2 and xmm3/m128/m32bcst with double-word granularity using imm8 as the number of elements to shift, and store the final result in xmm1, under writemask.
EVEX.NDS.128.66.0F3A.W1 03 /r ib VALIGNQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst, imm8 | FV | V/V | AVX512VL AVX512F | Shift right and merge vectors xmm2 and xmm3/m128/m64bcst with quad-word granularity using imm8 as the number of elements to shift, and store the final result in xmm1, under writemask.
EVEX.NDS.256.66.0F3A.W0 03 /r ib VALIGND ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst, imm8 | FV | V/V | AVX512VL AVX512F | Shift right and merge vectors ymm2 and ymm3/m256/m32bcst with double-word granularity using imm8 as the number of elements to shift, and store the final result in ymm1, under writemask.
EVEX.NDS.256.66.0F3A.W1 03 /r ib VALIGNQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst, imm8 | FV | V/V | AVX512VL AVX512F | Shift right and merge vectors ymm2 and ymm3/m256/m64bcst with quad-word granularity using imm8 as the number of elements to shift, and store the final result in ymm1, under writemask.
EVEX.NDS.512.66.0F3A.W0 03 /r ib VALIGND zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst, imm8 | FV | V/V | AVX512F | Shift right and merge vectors zmm2 and zmm3/m512/m32bcst with double-word granularity using imm8 as the number of elements to shift, and store the final result in zmm1, under writemask.
EVEX.NDS.512.66.0F3A.W1 03 /r ib VALIGNQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst, imm8 | FV | V/V | AVX512F | Shift right and merge vectors zmm2 and zmm3/m512/m64bcst with quad-word granularity using imm8 as the number of elements to shift, and store the final result in zmm1, under writemask.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA

Description

Concatenates and shifts right doubleword/quadword elements of the first source operand (the second operand) and the second source operand (the third operand) into a 1024/512/256-bit intermediate vector. The low 512/256/128 bits of the intermediate vector are written to the destination operand (the first operand) using the writemask k1. The destination and first source operands are ZMM/YMM/XMM registers. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcast from a 32/64-bit memory location.

This instruction is writemasked, so only those elements with the corresponding bit set in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with the corresponding bit clear in k1 retain their previous values (merging-masking) or are set to 0 (zeroing-masking).

Operation

VALIGND (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (SRC2 *is memory*) AND (EVEX.b = 1)
    THEN
        FOR j ← 0 TO KL-1
            i ← j * 32
            src[i+31:i] ← SRC2[31:0]
        ENDFOR;
    ELSE src ← SRC2
FI
; Concatenate sources
tmp[VL-1:0] ← src[VL-1:0]
tmp[2VL-1:VL] ← SRC1[VL-1:0]
; Shift right doubleword elements
IF VL = 128
    THEN SHIFT ← imm8[1:0]
    ELSE
        IF VL = 256
            THEN SHIFT ← imm8[2:0]
            ELSE SHIFT ← imm8[3:0]
        FI
FI;
tmp[2VL-1:0] ← tmp[2VL-1:0] >> (32*SHIFT)
; Apply writemask
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← tmp[i+31:i]
        ELSE
            IF *merging-masking*  ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0  ; zeroing-masking
            FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0

VALIGNQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (SRC2 *is memory*) AND (EVEX.b = 1)
    THEN
        FOR j ← 0 TO KL-1
            i ← j * 64
            src[i+63:i] ← SRC2[63:0]
        ENDFOR;
    ELSE src ← SRC2
FI
; Concatenate sources
tmp[VL-1:0] ← src[VL-1:0]
tmp[2VL-1:VL] ← SRC1[VL-1:0]
; Shift right quadword elements
IF VL = 128
    THEN SHIFT ← imm8[0]
    ELSE
        IF VL = 256
            THEN SHIFT ← imm8[1:0]
            ELSE SHIFT ← imm8[2:0]
        FI
FI;
tmp[2VL-1:0] ← tmp[2VL-1:0] >> (64*SHIFT)
; Apply writemask
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← tmp[i+63:i]
        ELSE
            IF *merging-masking*  ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0  ; zeroing-masking
            FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0
Intel C/C++ Compiler Intrinsic Equivalent

VALIGND __m512i _mm512_alignr_epi32( __m512i a, __m512i b, int cnt);
VALIGND __m512i _mm512_mask_alignr_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b, int cnt);
VALIGND __m512i _mm512_maskz_alignr_epi32( __mmask16 k, __m512i a, __m512i b, int cnt);
VALIGND __m256i _mm256_mask_alignr_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b, int cnt);
VALIGND __m256i _mm256_maskz_alignr_epi32( __mmask8 k, __m256i a, __m256i b, int cnt);
VALIGND __m128i _mm_mask_alignr_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b, int cnt);
VALIGND __m128i _mm_maskz_alignr_epi32( __mmask8 k, __m128i a, __m128i b, int cnt);
VALIGNQ __m512i _mm512_alignr_epi64( __m512i a, __m512i b, int cnt);
VALIGNQ __m512i _mm512_mask_alignr_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b, int cnt);
VALIGNQ __m512i _mm512_maskz_alignr_epi64( __mmask8 k, __m512i a, __m512i b, int cnt);
VALIGNQ __m256i _mm256_mask_alignr_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b, int cnt);
VALIGNQ __m256i _mm256_maskz_alignr_epi64( __mmask8 k, __m256i a, __m256i b, int cnt);
VALIGNQ __m128i _mm_mask_alignr_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b, int cnt);
VALIGNQ __m128i _mm_maskz_alignr_epi64( __mmask8 k, __m128i a, __m128i b, int cnt);

Exceptions

See Exceptions Type E4NF.
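For illustration (not part of the manual's text), a minimal C sketch using the _mm512_alignr_epi32 intrinsic listed above, assuming a GCC/Clang-style build with -mavx512f:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        /* a holds elements 16..31, b holds elements 0..15 (element 0 last). */
        __m512i a = _mm512_set_epi32(31,30,29,28,27,26,25,24,
                                     23,22,21,20,19,18,17,16);
        __m512i b = _mm512_set_epi32(15,14,13,12,11,10, 9, 8,
                                      7, 6, 5, 4, 3, 2, 1, 0);
        /* Concatenate a:b (a in the high half) and shift right by 4 dword
           elements, as the VALIGND operation above describes. */
        __m512i r = _mm512_alignr_epi32(a, b, 4);
        int out[16];
        _mm512_storeu_si512(out, r);
        for (int i = 0; i < 16; i++) printf("%d ", out[i]);  /* 4 5 ... 19 */
        printf("\n");
        return 0;
    }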
VBLENDMPD/VBLENDMPS—Blend Float64/Float32 Vectors Using an OpMask Control

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.66.0F38.W1 65 /r VBLENDMPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Blend double-precision vector xmm2 and double-precision vector xmm3/m128/m64bcst and store the result in xmm1, under control mask.
EVEX.NDS.256.66.0F38.W1 65 /r VBLENDMPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Blend double-precision vector ymm2 and double-precision vector ymm3/m256/m64bcst and store the result in ymm1, under control mask.
EVEX.NDS.512.66.0F38.W1 65 /r VBLENDMPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | FV | V/V | AVX512F | Blend double-precision vector zmm2 and double-precision vector zmm3/m512/m64bcst and store the result in zmm1, under control mask.
EVEX.NDS.128.66.0F38.W0 65 /r VBLENDMPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Blend single-precision vector xmm2 and single-precision vector xmm3/m128/m32bcst and store the result in xmm1, under control mask.
EVEX.NDS.256.66.0F38.W0 65 /r VBLENDMPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Blend single-precision vector ymm2 and single-precision vector ymm3/m256/m32bcst and store the result in ymm1, under control mask.
EVEX.NDS.512.66.0F38.W0 65 /r VBLENDMPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | FV | V/V | AVX512F | Blend single-precision vector zmm2 and single-precision vector zmm3/m512/m32bcst using k1 as select control and store the result in zmm1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA

Description

Performs an element-by-element blending between float64/float32 elements in the first source operand (the second operand) and the elements in the second source operand (the third operand), using an opmask register as select control. The blended result is written to the destination register. The destination and first source operands are ZMM/YMM/XMM registers. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcast from a 64-bit memory location.

The opmask register is not used as a writemask for this instruction. Instead, the mask is used as an element selector: every element of the destination is conditionally selected between the first source and the second source using the value of the related mask bit (0 for the first source operand, 1 for the second source operand). If EVEX.z is set, the elements with a corresponding mask bit value of 0 in the destination operand are zeroed.

Operation

VBLENDMPD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no controlmask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[i+63:i] ← SRC2[63:0]
                ELSE DEST[i+63:i] ← SRC2[i+63:i]
            FI;
        ELSE
            IF *merging-masking*  ; merging-masking
                THEN DEST[i+63:i] ← SRC1[i+63:i]
                ELSE DEST[i+63:i] ← 0  ; zeroing-masking
            FI;
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VBLENDMPS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no controlmask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[i+31:i] ← SRC2[31:0]
                ELSE DEST[i+31:i] ← SRC2[i+31:i]
            FI;
        ELSE
            IF *merging-masking*  ; merging-masking
                THEN DEST[i+31:i] ← SRC1[i+31:i]
                ELSE DEST[i+31:i] ← 0  ; zeroing-masking
            FI;
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VBLENDMPD __m512d _mm512_mask_blend_pd(__mmask8 k, __m512d a, __m512d b);
VBLENDMPD __m256d _mm256_mask_blend_pd(__mmask8 k, __m256d a, __m256d b);
VBLENDMPD __m128d _mm_mask_blend_pd(__mmask8 k, __m128d a, __m128d b);
VBLENDMPS __m512 _mm512_mask_blend_ps(__mmask16 k, __m512 a, __m512 b);
VBLENDMPS __m256 _mm256_mask_blend_ps(__mmask8 k, __m256 a, __m256 b);
VBLENDMPS __m128 _mm_mask_blend_ps(__mmask8 k, __m128 a, __m128 b);

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type E4.
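A short C sketch of the element-selector behavior, using the _mm512_mask_blend_pd intrinsic listed above (an illustration, not part of the manual; assumes -mavx512f):

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m512d a = _mm512_set1_pd(1.0);
        __m512d b = _mm512_set1_pd(2.0);
        __mmask8 k = 0xF0;          /* bit = 1 selects from b, 0 from a */
        __m512d r = _mm512_mask_blend_pd(k, a, b);
        double out[8];
        _mm512_storeu_pd(out, r);
        for (int i = 0; i < 8; i++) printf("%.1f ", out[i]);
        printf("\n");                /* 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0 */
        return 0;
    }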
VBROADCAST—Load with Broadcast Floating-Point Data

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F38.W0 18 /r VBROADCASTSS xmm1, m32 | RM | V/V | AVX | Broadcast single-precision floating-point element in mem to four locations in xmm1.
VEX.256.66.0F38.W0 18 /r VBROADCASTSS ymm1, m32 | RM | V/V | AVX | Broadcast single-precision floating-point element in mem to eight locations in ymm1.
VEX.256.66.0F38.W0 19 /r VBROADCASTSD ymm1, m64 | RM | V/V | AVX | Broadcast double-precision floating-point element in mem to four locations in ymm1.
VEX.256.66.0F38.W0 1A /r VBROADCASTF128 ymm1, m128 | RM | V/V | AVX | Broadcast 128 bits of floating-point data in mem to the low and high 128 bits in ymm1.
EVEX.256.66.0F38.W1 19 /r VBROADCASTSD ymm1 {k1}{z}, xmm2/m64 | T1S | V/V | AVX512VL AVX512F | Broadcast low double-precision floating-point element in xmm2/m64 to four locations in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 19 /r VBROADCASTSD zmm1 {k1}{z}, xmm2/m64 | T1S | V/V | AVX512F | Broadcast low double-precision floating-point element in xmm2/m64 to eight locations in zmm1 using writemask k1.
EVEX.256.66.0F38.W0 19 /r VBROADCASTF32X2 ymm1 {k1}{z}, xmm2/m64 | T2 | V/V | AVX512VL AVX512DQ | Broadcast two single-precision floating-point elements in xmm2/m64 to locations in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 19 /r VBROADCASTF32X2 zmm1 {k1}{z}, xmm2/m64 | T2 | V/V | AVX512DQ | Broadcast two single-precision floating-point elements in xmm2/m64 to locations in zmm1 using writemask k1.
EVEX.128.66.0F38.W0 18 /r VBROADCASTSS xmm1 {k1}{z}, xmm2/m32 | T1S | V/V | AVX512VL AVX512F | Broadcast low single-precision floating-point element in xmm2/m32 to all locations in xmm1 using writemask k1.
EVEX.256.66.0F38.W0 18 /r VBROADCASTSS ymm1 {k1}{z}, xmm2/m32 | T1S | V/V | AVX512VL AVX512F | Broadcast low single-precision floating-point element in xmm2/m32 to all locations in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 18 /r VBROADCASTSS zmm1 {k1}{z}, xmm2/m32 | T1S | V/V | AVX512F | Broadcast low single-precision floating-point element in xmm2/m32 to all locations in zmm1 using writemask k1.
EVEX.256.66.0F38.W0 1A /r VBROADCASTF32X4 ymm1 {k1}{z}, m128 | T4 | V/V | AVX512VL AVX512F | Broadcast 128 bits of 4 single-precision floating-point data in mem to locations in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 1A /r VBROADCASTF32X4 zmm1 {k1}{z}, m128 | T4 | V/V | AVX512F | Broadcast 128 bits of 4 single-precision floating-point data in mem to locations in zmm1 using writemask k1.
EVEX.256.66.0F38.W1 1A /r VBROADCASTF64X2 ymm1 {k1}{z}, m128 | T2 | V/V | AVX512VL AVX512DQ | Broadcast 128 bits of 2 double-precision floating-point data in mem to locations in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 1A /r VBROADCASTF64X2 zmm1 {k1}{z}, m128 | T2 | V/V | AVX512DQ | Broadcast 128 bits of 2 double-precision floating-point data in mem to locations in zmm1 using writemask k1.
EVEX.512.66.0F38.W0 1B /r VBROADCASTF32X8 zmm1 {k1}{z}, m256 | T8 | V/V | AVX512DQ | Broadcast 256 bits of 8 single-precision floating-point data in mem to locations in zmm1 using writemask k1.
EVEX.512.66.0F38.W1 1B /r VBROADCASTF64X4 zmm1 {k1}{z}, m256 | T4 | V/V | AVX512F | Broadcast 256 bits of 4 double-precision floating-point data in mem to locations in zmm1 using writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
T1S, T2, T4, T8 | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

VBROADCASTSD/VBROADCASTSS/VBROADCASTF128 load floating-point values as one tuple from the source operand (second operand) in memory and broadcast to all elements of the destination operand (first operand).

VEX.256-encoded versions: The destination operand is a YMM register. The source operand is either a 32-bit, 64-bit, or 128-bit memory location. Register source encodings are reserved and will #UD. Bits (MAX_VL-1:256) of the destination register are zeroed.

EVEX-encoded versions: The destination operand is a ZMM/YMM/XMM register, updated according to the writemask k1. The source operand is either a 32-bit or 64-bit memory location or the low doubleword/quadword element of an XMM register.

VBROADCASTF32X2/VBROADCASTF32X4/VBROADCASTF64X2/VBROADCASTF32X8/VBROADCASTF64X4 load floating-point values as tuples from the source operand (the second operand) in memory or a register and broadcast to all elements of the destination operand (the first operand). The destination operand is a YMM/ZMM register updated according to the writemask k1. The source operand is either a register or a 64-bit/128-bit/256-bit memory location.
VBROADCASTSD and VBROADCASTF128, F32x4, and F64x2 are supported only as 256-bit and 512-bit wide versions. VBROADCASTSS is supported in 128-bit, 256-bit, and 512-bit wide versions. F32x8 and F64x4 are supported only as 512-bit wide versions. VBROADCASTF32X2/VBROADCASTF32X4/VBROADCASTF32X8 have 32-bit granularity. VBROADCASTF64X2 and VBROADCASTF64X4 have 64-bit granularity.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD. If VBROADCASTSD or VBROADCASTF128 is encoded with VEX.L = 0, an attempt to execute the instruction will cause an #UD exception.

Figure 5-1. VBROADCASTSS Operation (VEX.256 encoded version)
Figure 5-2. VBROADCASTSS Operation (VEX.128-bit version)
Figure 5-3. VBROADCASTSD Operation (VEX.256-bit version)
Figure 5-4. VBROADCASTF128 Operation (VEX.256-bit version)
Figure 5-5. VBROADCASTF64X4 Operation (512-bit version with writemask all 1s)

Operation

VBROADCASTSS (128 bit version VEX and legacy)
temp ← SRC[31:0]
DEST[31:0] ← temp
DEST[63:32] ← temp
DEST[95:64] ← temp
DEST[127:96] ← temp
DEST[MAX_VL-1:128] ← 0

VBROADCASTSS (VEX.256 encoded version)
temp ← SRC[31:0]
DEST[31:0] ← temp
DEST[63:32] ← temp
DEST[95:64] ← temp
DEST[127:96] ← temp
DEST[159:128] ← temp
DEST[191:160] ← temp
DEST[223:192] ← temp
DEST[255:224] ← temp
DEST[MAX_VL-1:256] ← 0

VBROADCASTSS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[31:0]
        ELSE
            IF *merging-masking*  ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0  ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VBROADCASTSD (VEX.256 encoded version)
temp ← SRC[63:0]
DEST[63:0] ← temp
DEST[127:64] ← temp
DEST[191:128] ← temp
DEST[255:192] ← temp
DEST[MAX_VL-1:256] ← 0

VBROADCASTSD (EVEX encoded versions)
(KL, VL) = (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[63:0]
        ELSE
            IF *merging-masking*  ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0  ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VBROADCASTF32x2 (EVEX encoded versions)
(KL, VL) = (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    n ← (j mod 2) * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[n+31:n]
        ELSE
            IF *merging-masking*  ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0  ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VBROADCASTF128 (VEX.256 encoded version)
temp ← SRC[127:0]
DEST[127:0] ← temp
DEST[255:128] ← temp
DEST[MAX_VL-1:256] ← 0

VBROADCASTF32X4 (EVEX encoded versions)
(KL, VL) = (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    n ← (j modulo 4) * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[n+31:n]
        ELSE
            IF *merging-masking*  ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0  ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
VBROADCASTF64X2 (EVEX encoded versions)
(KL, VL) = (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    n ← (j modulo 2) * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[n+63:n]
        ELSE
            IF *merging-masking*  ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0  ; zeroing-masking
            FI
    FI;
ENDFOR;

VBROADCASTF32X8 (EVEX.U1.512 encoded version)
FOR j ← 0 TO 15
    i ← j * 32
    n ← (j modulo 8) * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[n+31:n]
        ELSE
            IF *merging-masking*  ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0  ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VBROADCASTF64X4 (EVEX.512 encoded version)
FOR j ← 0 TO 7
    i ← j * 64
    n ← (j modulo 4) * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[n+63:n]
        ELSE
            IF *merging-masking*  ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0  ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VBROADCASTF32x2 __m512 _mm512_broadcast_f32x2( __m128 a);
VBROADCASTF32x2 __m512 _mm512_mask_broadcast_f32x2(__m512 s, __mmask16 k, __m128 a);
VBROADCASTF32x2 __m512 _mm512_maskz_broadcast_f32x2( __mmask16 k, __m128 a);
VBROADCASTF32x2 __m256 _mm256_broadcast_f32x2( __m128 a);
VBROADCASTF32x2 __m256 _mm256_mask_broadcast_f32x2(__m256 s, __mmask8 k, __m128 a);
VBROADCASTF32x2 __m256 _mm256_maskz_broadcast_f32x2( __mmask8 k, __m128 a);
VBROADCASTF32x4 __m512 _mm512_broadcast_f32x4( __m128 a);
VBROADCASTF32x4 __m512 _mm512_mask_broadcast_f32x4(__m512 s, __mmask16 k, __m128 a);
VBROADCASTF32x4 __m512 _mm512_maskz_broadcast_f32x4( __mmask16 k, __m128 a);
VBROADCASTF32x4 __m256 _mm256_broadcast_f32x4( __m128 a);
VBROADCASTF32x4 __m256 _mm256_mask_broadcast_f32x4(__m256 s, __mmask8 k, __m128 a);
VBROADCASTF32x4 __m256 _mm256_maskz_broadcast_f32x4( __mmask8 k, __m128 a);
VBROADCASTF32x8 __m512 _mm512_broadcast_f32x8( __m256 a);
VBROADCASTF32x8 __m512 _mm512_mask_broadcast_f32x8(__m512 s, __mmask16 k, __m256 a);
VBROADCASTF32x8 __m512 _mm512_maskz_broadcast_f32x8( __mmask16 k, __m256 a);
VBROADCASTF64x2 __m512d _mm512_broadcast_f64x2( __m128d a);
VBROADCASTF64x2 __m512d _mm512_mask_broadcast_f64x2(__m512d s, __mmask8 k, __m128d a);
VBROADCASTF64x2 __m512d _mm512_maskz_broadcast_f64x2( __mmask8 k, __m128d a);
VBROADCASTF64x2 __m256d _mm256_broadcast_f64x2( __m128d a);
VBROADCASTF64x2 __m256d _mm256_mask_broadcast_f64x2(__m256d s, __mmask8 k, __m128d a);
VBROADCASTF64x2 __m256d _mm256_maskz_broadcast_f64x2( __mmask8 k, __m128d a);
VBROADCASTF64x4 __m512d _mm512_broadcast_f64x4( __m256d a);
VBROADCASTF64x4 __m512d _mm512_mask_broadcast_f64x4(__m512d s, __mmask8 k, __m256d a);
VBROADCASTF64x4 __m512d _mm512_maskz_broadcast_f64x4( __mmask8 k, __m256d a);
VBROADCASTSD __m512d _mm512_broadcastsd_pd( __m128d a);
VBROADCASTSD __m512d _mm512_mask_broadcastsd_pd(__m512d s, __mmask8 k, __m128d a);
VBROADCASTSD __m512d _mm512_maskz_broadcastsd_pd(__mmask8 k, __m128d a);
VBROADCASTSD __m256d _mm256_broadcastsd_pd(__m128d a);
VBROADCASTSD __m256d _mm256_mask_broadcastsd_pd(__m256d s, __mmask8 k, __m128d a);
VBROADCASTSD __m256d _mm256_maskz_broadcastsd_pd( __mmask8 k, __m128d a);
VBROADCASTSD __m256d _mm256_broadcast_sd(double *a);
VBROADCASTSS __m512 _mm512_broadcastss_ps( __m128 a);
VBROADCASTSS __m512 _mm512_mask_broadcastss_ps(__m512 s, __mmask16 k, __m128 a);
VBROADCASTSS __m512 _mm512_maskz_broadcastss_ps( __mmask16 k, __m128 a);
VBROADCASTSS __m256 _mm256_broadcastss_ps(__m128 a);
VBROADCASTSS __m256 _mm256_mask_broadcast_ss(__m256 s, __mmask8 k, __m128 a);
VBROADCASTSS __m256 _mm256_maskz_broadcast_ss( __mmask8 k, __m128 a);
VBROADCASTSS __m128 _mm_broadcastss_ps(__m128 a);
VBROADCASTSS __m128 _mm_mask_broadcast_ss(__m128 s, __mmask8 k, __m128 a);
VBROADCASTSS __m128 _mm_maskz_broadcast_ss( __mmask8 k, __m128 a);
VBROADCASTSS __m128 _mm_broadcast_ss(float *a);
VBROADCASTSS __m256 _mm256_broadcast_ss(float *a);
VBROADCASTF128 __m256 _mm256_broadcast_ps(__m128 * a);
VBROADCASTF128 __m256d _mm256_broadcast_pd(__m128d * a);

Exceptions

VEX-encoded instructions, see Exceptions Type 6; EVEX-encoded instructions, see Exceptions Type E6.
#UD If VEX.L = 0 for VBROADCASTSD or VBROADCASTF128.
    If EVEX.L'L = 0 for VBROADCASTSD/VBROADCASTF32X2/VBROADCASTF32X4/VBROADCASTF64X2.
    If EVEX.L'L < 10b for VBROADCASTF32X8/VBROADCASTF64X4.
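A brief C sketch of the tuple-broadcast behavior, using the _mm512_broadcast_f32x4 intrinsic listed above (an illustration, not part of the manual; assumes -mavx512f):

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        /* One 4-float tuple, elements {1,2,3,4} from low to high. */
        __m128 t = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        /* Repeat the tuple across all four 128-bit lanes of a ZMM register,
           mirroring VBROADCASTF32X4 with a writemask of all 1s. */
        __m512 r = _mm512_broadcast_f32x4(t);
        float out[16];
        _mm512_storeu_ps(out, r);
        for (int i = 0; i < 16; i++) printf("%.0f ", out[i]);
        printf("\n");                /* 1 2 3 4 repeated four times */
        return 0;
    }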
VPBROADCASTM—Broadcast Mask to Vector Register

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.F3.0F38.W1 2A /r VPBROADCASTMB2Q xmm1, k1 | RM | V/V | AVX512VL AVX512CD | Broadcast low byte value in k1 to two locations in xmm1.
EVEX.256.F3.0F38.W1 2A /r VPBROADCASTMB2Q ymm1, k1 | RM | V/V | AVX512VL AVX512CD | Broadcast low byte value in k1 to four locations in ymm1.
EVEX.512.F3.0F38.W1 2A /r VPBROADCASTMB2Q zmm1, k1 | RM | V/V | AVX512CD | Broadcast low byte value in k1 to eight locations in zmm1.
EVEX.128.F3.0F38.W0 3A /r VPBROADCASTMW2D xmm1, k1 | RM | V/V | AVX512VL AVX512CD | Broadcast low word value in k1 to four locations in xmm1.
EVEX.256.F3.0F38.W0 3A /r VPBROADCASTMW2D ymm1, k1 | RM | V/V | AVX512VL AVX512CD | Broadcast low word value in k1 to eight locations in ymm1.
EVEX.512.F3.0F38.W0 3A /r VPBROADCASTMW2D zmm1, k1 | RM | V/V | AVX512CD | Broadcast low word value in k1 to sixteen locations in zmm1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Broadcasts the zero-extended 64/32 bit value of the low byte/word of the source operand (the second operand) to each 64/32 bit element of the destination operand (the first operand). The source operand is an opmask register. The destination operand is a ZMM register (EVEX.512), YMM register (EVEX.256), or XMM register (EVEX.128). EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation

VPBROADCASTMB2Q
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j*64
    DEST[i+63:i] ← ZeroExtend(SRC[7:0])
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VPBROADCASTMW2D
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j*32
    DEST[i+31:i] ← ZeroExtend(SRC[15:0])
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VPBROADCASTMB2Q __m512i _mm512_broadcastmb_epi64( __mmask8);
VPBROADCASTMW2D __m512i _mm512_broadcastmw_epi32( __mmask16);
VPBROADCASTMB2Q __m256i _mm256_broadcastmb_epi64( __mmask8);
VPBROADCASTMW2D __m256i _mm256_broadcastmw_epi32( __mmask8);
VPBROADCASTMB2Q __m128i _mm_broadcastmb_epi64( __mmask8);
VPBROADCASTMW2D __m128i _mm_broadcastmw_epi32( __mmask8);

SIMD Floating-Point Exceptions

None

Other Exceptions

EVEX-encoded instruction, see Exceptions Type E6NF.

VCOMPRESSPD—Store Sparse Packed Double-Precision Floating-Point Values into Dense Memory

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W1 8A /r VCOMPRESSPD xmm1/m128 {k1}{z}, xmm2 | T1S | V/V | AVX512VL AVX512F | Compress packed double-precision floating-point values from xmm2 to xmm1/m128 using writemask k1.
EVEX.256.66.0F38.W1 8A /r VCOMPRESSPD ymm1/m256 {k1}{z}, ymm2 | T1S | V/V | AVX512VL AVX512F | Compress packed double-precision floating-point values from ymm2 to ymm1/m256 using writemask k1.
EVEX.512.66.0F38.W1 8A /r VCOMPRESSPD zmm1/m512 {k1}{z}, zmm2 | T1S | V/V | AVX512F | Compress packed double-precision floating-point values from zmm2 using control mask k1 to zmm1/m512.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
T1S | ModRM:r/m (w) | ModRM:reg (r) | NA | NA

Description

Compress (store) up to 8 double-precision floating-point values from the source operand (the second operand) as a contiguous vector to the destination operand (the first operand). The source operand is a ZMM/YMM/XMM register; the destination operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location.

The opmask register k1 selects the active elements (a partial vector, or possibly non-contiguous if there are fewer than 8 active elements) from the source operand to compress into a contiguous vector. The contiguous vector is written to the destination starting from the low element of the destination operand.

Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z must be zero.

Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the source operand, the upper bits of the destination register are unmodified if EVEX.z is not set; otherwise the upper bits are zeroed.

EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD. Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element instead of the size of the full vector.

Operation

VCOMPRESSPD (EVEX encoded versions) store form
(KL, VL) = (2, 128), (4, 256), (8, 512)
SIZE ← 64
k ← 0
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            DEST[k+SIZE-1:k] ← SRC[i+63:i]
            k ← k + SIZE
    FI;
ENDFOR
VCOMPRESSPD (EVEX encoded versions) reg-reg form
(KL, VL) = (2, 128), (4, 256), (8, 512)
SIZE ← 64
k ← 0
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            DEST[k+SIZE-1:k] ← SRC[i+63:i]
            k ← k + SIZE
    FI;
ENDFOR
IF *merging-masking*
    THEN *DEST[VL-1:k] remains unchanged*
    ELSE DEST[VL-1:k] ← 0
FI
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VCOMPRESSPD __m512d _mm512_mask_compress_pd( __m512d s, __mmask8 k, __m512d a);
VCOMPRESSPD __m512d _mm512_maskz_compress_pd( __mmask8 k, __m512d a);
VCOMPRESSPD void _mm512_mask_compressstoreu_pd( void * d, __mmask8 k, __m512d a);
VCOMPRESSPD __m256d _mm256_mask_compress_pd( __m256d s, __mmask8 k, __m256d a);
VCOMPRESSPD __m256d _mm256_maskz_compress_pd( __mmask8 k, __m256d a);
VCOMPRESSPD void _mm256_mask_compressstoreu_pd( void * d, __mmask8 k, __m256d a);
VCOMPRESSPD __m128d _mm_mask_compress_pd( __m128d s, __mmask8 k, __m128d a);
VCOMPRESSPD __m128d _mm_maskz_compress_pd( __mmask8 k, __m128d a);
VCOMPRESSPD void _mm_mask_compressstoreu_pd( void * d, __mmask8 k, __m128d a);

SIMD Floating-Point Exceptions

None

Other Exceptions

EVEX-encoded instructions, see Exceptions Type E4.nb.
#UD If EVEX.vvvv != 1111B.
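A short C sketch of the memory-destination form, using the _mm512_mask_compressstoreu_pd intrinsic listed above (an illustration, not part of the manual; assumes -mavx512f):

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        /* v holds {1,2,3,4,5,6,7,8} from low element to high. */
        __m512d v = _mm512_set_pd(8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0);
        double out[8] = {0};
        __mmask8 k = 0xAA;   /* bits 1,3,5,7 set: keep elements 2,4,6,8 */
        /* Only the selected elements are stored, packed contiguously
           starting at the low address, as the store form above describes. */
        _mm512_mask_compressstoreu_pd(out, k, v);
        for (int i = 0; i < 8; i++) printf("%.0f ", out[i]);
        printf("\n");         /* 2 4 6 8 0 0 0 0 */
        return 0;
    }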
VCOMPRESSPS—Store Sparse Packed Single-Precision Floating-Point Values into Dense Memory

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 8A /r VCOMPRESSPS xmm1/m128 {k1}{z}, xmm2 | T1S | V/V | AVX512VL AVX512F | Compress packed single-precision floating-point values from xmm2 to xmm1/m128 using writemask k1.
EVEX.256.66.0F38.W0 8A /r VCOMPRESSPS ymm1/m256 {k1}{z}, ymm2 | T1S | V/V | AVX512VL AVX512F | Compress packed single-precision floating-point values from ymm2 to ymm1/m256 using writemask k1.
EVEX.512.66.0F38.W0 8A /r VCOMPRESSPS zmm1/m512 {k1}{z}, zmm2 | T1S | V/V | AVX512F | Compress packed single-precision floating-point values from zmm2 using control mask k1 to zmm1/m512.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
T1S | ModRM:r/m (w) | ModRM:reg (r) | NA | NA

Description

Compress (store) up to 16 single-precision floating-point values from the source operand (the second operand) to the destination operand (the first operand). The source operand is a ZMM/YMM/XMM register; the destination operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location.

The opmask register k1 selects the active elements (a partial vector, or possibly non-contiguous if there are fewer than 16 active elements) from the source operand to compress into a contiguous vector. The contiguous vector is written to the destination starting from the low element of the destination operand.

Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z must be zero.

Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the source operand, the upper bits of the destination register are unmodified if EVEX.z is not set; otherwise the upper bits are zeroed.

EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD. Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element instead of the size of the full vector.

Operation

VCOMPRESSPS (EVEX encoded versions) store form
(KL, VL) = (4, 128), (8, 256), (16, 512)
SIZE ← 32
k ← 0
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            DEST[k+SIZE-1:k] ← SRC[i+31:i]
            k ← k + SIZE
    FI;
ENDFOR;

VCOMPRESSPS (EVEX encoded versions) reg-reg form
(KL, VL) = (4, 128), (8, 256), (16, 512)
SIZE ← 32
k ← 0
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            DEST[k+SIZE-1:k] ← SRC[i+31:i]
            k ← k + SIZE
    FI;
ENDFOR
IF *merging-masking*
    THEN *DEST[VL-1:k] remains unchanged*
    ELSE DEST[VL-1:k] ← 0
FI
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VCOMPRESSPS __m512 _mm512_mask_compress_ps( __m512 s, __mmask16 k, __m512 a);
VCOMPRESSPS __m512 _mm512_maskz_compress_ps( __mmask16 k, __m512 a);
VCOMPRESSPS void _mm512_mask_compressstoreu_ps( void * d, __mmask16 k, __m512 a);
VCOMPRESSPS __m256 _mm256_mask_compress_ps( __m256 s, __mmask8 k, __m256 a);
VCOMPRESSPS __m256 _mm256_maskz_compress_ps( __mmask8 k, __m256 a);
VCOMPRESSPS void _mm256_mask_compressstoreu_ps( void * d, __mmask8 k, __m256 a);
VCOMPRESSPS __m128 _mm_mask_compress_ps( __m128 s, __mmask8 k, __m128 a);
VCOMPRESSPS __m128 _mm_maskz_compress_ps( __mmask8 k, __m128 a);
VCOMPRESSPS void _mm_mask_compressstoreu_ps( void * d, __mmask8 k, __m128 a);

SIMD Floating-Point Exceptions

None

Other Exceptions

EVEX-encoded instructions, see Exceptions Type E4.nb.
#UD If EVEX.vvvv != 1111B.

VCVTPD2QQ—Convert Packed Double-Precision Floating-Point Values to Packed Quadword Integers

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F.W1 7B /r VCVTPD2QQ xmm1 {k1}{z}, xmm2/m128/m64bcst | FV | V/V | AVX512VL AVX512DQ | Convert two packed double-precision floating-point values from xmm2/m128/m64bcst to two packed quadword integers in xmm1 with writemask k1.
EVEX.256.66.0F.W1 7B /r VCVTPD2QQ ymm1 {k1}{z}, ymm2/m256/m64bcst | FV | V/V | AVX512VL AVX512DQ | Convert four packed double-precision floating-point values from ymm2/m256/m64bcst to four packed quadword integers in ymm1 with writemask k1.
EVEX.512.66.0F.W1 7B /r VCVTPD2QQ zmm1 {k1}{z}, zmm2/m512/m64bcst{er} | FV | V/V | AVX512DQ | Convert eight packed double-precision floating-point values from zmm2/m512/m64bcst to eight packed quadword integers in zmm1 with writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Converts packed double-precision floating-point values in the source operand (second operand) to packed quadword integers in the destination operand (first operand).

EVEX encoded versions: The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (2^(w-1), where w represents the number of bits in the destination format) is returned.

EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Operation

VCVTPD2QQ (EVEX encoded version) when src operand is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_QuadInteger(SRC[i+63:i])
        ELSE
            IF *merging-masking*  ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0  ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VCVTPD2QQ (EVEX encoded version) when src operand is a memory source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_QuadInteger(SRC[63:0])
                ELSE DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_QuadInteger(SRC[i+63:i])
            FI;
        ELSE
            IF *merging-masking*  ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0  ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VCVTPD2QQ __m512i _mm512_cvtpd_epi64( __m512d a);
VCVTPD2QQ __m512i _mm512_mask_cvtpd_epi64( __m512i s, __mmask8 k, __m512d a);
VCVTPD2QQ __m512i _mm512_maskz_cvtpd_epi64( __mmask8 k, __m512d a);
VCVTPD2QQ __m512i _mm512_cvt_roundpd_epi64( __m512d a, int r);
VCVTPD2QQ __m512i _mm512_mask_cvt_roundpd_epi64( __m512i s, __mmask8 k, __m512d a, int r);
VCVTPD2QQ __m512i _mm512_maskz_cvt_roundpd_epi64( __mmask8 k, __m512d a, int r);
VCVTPD2QQ __m256i _mm256_mask_cvtpd_epi64( __m256i s, __mmask8 k, __m256d a);
VCVTPD2QQ __m256i _mm256_maskz_cvtpd_epi64( __mmask8 k, __m256d a);
VCVTPD2QQ __m128i _mm_mask_cvtpd_epi64( __m128i s, __mmask8 k, __m128d a);
VCVTPD2QQ __m128i _mm_maskz_cvtpd_epi64( __mmask8 k, __m128d a);
VCVTPD2QQ __m256i _mm256_cvtpd_epi64 (__m256d src)
VCVTPD2QQ __m128i _mm_cvtpd_epi64 (__m128d src)

SIMD Floating-Point Exceptions

Invalid, Precision

Other Exceptions

EVEX-encoded instructions, see Exceptions Type E2.
#UD If EVEX.vvvv != 1111B.
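A short C sketch using the _mm512_cvtpd_epi64 intrinsic listed above (an illustration, not part of the manual; assumes -mavx512dq and the MXCSR default of round-to-nearest-even):

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        /* Low element last: v = {-0.5, 0.5, -1.5, 1.49, 2.5, -2.5, 7.5, 8.5} */
        __m512d v = _mm512_set_pd(8.5, 7.5, -2.5, 2.5, 1.49, -1.5, 0.5, -0.5);
        __m512i q = _mm512_cvtpd_epi64(v);   /* rounds per MXCSR.RM */
        long long out[8];
        _mm512_storeu_si512(out, q);
        for (int i = 0; i < 8; i++) printf("%lld ", out[i]);
        printf("\n");   /* 0 0 -2 1 2 -2 8 8 with round-to-nearest-even */
        return 0;
    }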
VCVTPD2UDQ—Convert Packed Double-Precision Floating-Point Values to Packed Unsigned Doubleword Integers

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.0F.W1 79 /r VCVTPD2UDQ xmm1 {k1}{z}, xmm2/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Convert two packed double-precision floating-point values in xmm2/m128/m64bcst to two unsigned doubleword integers in xmm1 subject to writemask k1.
EVEX.256.0F.W1 79 /r VCVTPD2UDQ xmm1 {k1}{z}, ymm2/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Convert four packed double-precision floating-point values in ymm2/m256/m64bcst to four unsigned doubleword integers in xmm1 subject to writemask k1.
EVEX.512.0F.W1 79 /r VCVTPD2UDQ ymm1 {k1}{z}, zmm2/m512/m64bcst{er} | FV | V/V | AVX512F | Convert eight packed double-precision floating-point values in zmm2/m512/m64bcst to eight unsigned doubleword integers in ymm1 subject to writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Converts packed double-precision floating-point values in the source operand (the second operand) to packed unsigned doubleword integers in the destination operand (the first operand).

When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w - 1 is returned, where w represents the number of bits in the destination format.

The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcast from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1. The upper bits (MAX_VL-1:256) of the corresponding destination are zeroed.

EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation

VCVTPD2UDQ (EVEX encoded versions) when src2 operand is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    k ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← Convert_Double_Precision_Floating_Point_To_UInteger(SRC[k+63:k])
        ELSE
            IF *merging-masking*  ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0  ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL/2] ← 0

VCVTPD2UDQ (EVEX encoded versions) when src operand is a memory source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    k ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+31:i] ← Convert_Double_Precision_Floating_Point_To_UInteger(SRC[63:0])
                ELSE DEST[i+31:i] ← Convert_Double_Precision_Floating_Point_To_UInteger(SRC[k+63:k])
            FI;
        ELSE
            IF *merging-masking*  ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0  ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL/2] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VCVTPD2UDQ __m256i _mm512_cvtpd_epu32( __m512d a);
VCVTPD2UDQ __m256i _mm512_mask_cvtpd_epu32( __m256i s, __mmask8 k, __m512d a);
VCVTPD2UDQ __m256i _mm512_maskz_cvtpd_epu32( __mmask8 k, __m512d a);
VCVTPD2UDQ __m256i _mm512_cvt_roundpd_epu32( __m512d a, int r);
VCVTPD2UDQ __m256i _mm512_mask_cvt_roundpd_epu32( __m256i s, __mmask8 k, __m512d a, int r);
VCVTPD2UDQ __m256i _mm512_maskz_cvt_roundpd_epu32( __mmask8 k, __m512d a, int r);
VCVTPD2UDQ __m128i _mm256_mask_cvtpd_epu32( __m128i s, __mmask8 k, __m256d a);
VCVTPD2UDQ __m128i _mm256_maskz_cvtpd_epu32( __mmask8 k, __m256d a);
VCVTPD2UDQ __m128i _mm_mask_cvtpd_epu32( __m128i s, __mmask8 k, __m128d a);
VCVTPD2UDQ __m128i _mm_maskz_cvtpd_epu32( __mmask8 k, __m128d a);

SIMD Floating-Point Exceptions

Invalid, Precision

Other Exceptions

EVEX-encoded instructions, see Exceptions Type E2.
#UD If EVEX.vvvv != 1111B.
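A short C sketch of the narrowing conversion, using the _mm512_cvtpd_epu32 intrinsic listed above (an illustration, not part of the manual; assumes -mavx512f):

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        /* Eight doubles {1..8}, low element last in _mm512_set_pd. */
        __m512d v = _mm512_set_pd(8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0);
        /* Eight doubles narrow to eight unsigned dwords in a YMM-sized
           result; an out-of-range input would return 2^32 - 1. */
        __m256i u = _mm512_cvtpd_epu32(v);
        unsigned out[8];
        _mm256_storeu_si256((__m256i*)out, u);
        for (int i = 0; i < 8; i++) printf("%u ", out[i]);
        printf("\n");    /* 1 2 3 4 5 6 7 8 */
        return 0;
    }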
VCVTPD2UQQ—Convert Packed Double-Precision Floating-Point Values to Packed Unsigned Quadword Integers

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F.W1 79 /r VCVTPD2UQQ xmm1 {k1}{z}, xmm2/m128/m64bcst | FV | V/V | AVX512VL AVX512DQ | Convert two packed double-precision floating-point values from xmm2/mem to two packed unsigned quadword integers in xmm1 with writemask k1.
EVEX.256.66.0F.W1 79 /r VCVTPD2UQQ ymm1 {k1}{z}, ymm2/m256/m64bcst | FV | V/V | AVX512VL AVX512DQ | Convert four packed double-precision floating-point values from ymm2/mem to four packed unsigned quadword integers in ymm1 with writemask k1.
EVEX.512.66.0F.W1 79 /r VCVTPD2UQQ zmm1 {k1}{z}, zmm2/m512/m64bcst{er} | FV | V/V | AVX512DQ | Convert eight packed double-precision floating-point values from zmm2/mem to eight packed unsigned quadword integers in zmm1 with writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Converts packed double-precision floating-point values in the source operand (second operand) to packed unsigned quadword integers in the destination operand (first operand).

When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w - 1 is returned, where w represents the number of bits in the destination format.

The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation

VCVTPD2UQQ (EVEX encoded versions) when src operand is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_UQuadInteger(SRC[i+63:i])
        ELSE
            IF *merging-masking*  ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0  ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VCVTPD2UQQ (EVEX encoded versions) when src operand is a memory source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_UQuadInteger(SRC[63:0])
                ELSE DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_UQuadInteger(SRC[i+63:i])
            FI;
        ELSE
            IF *merging-masking*  ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0  ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
Intel C/C++ Compiler Intrinsic Equivalent

VCVTPD2UQQ __m512i _mm512_cvtpd_epu64( __m512d a);
VCVTPD2UQQ __m512i _mm512_mask_cvtpd_epu64( __m512i s, __mmask8 k, __m512d a);
VCVTPD2UQQ __m512i _mm512_maskz_cvtpd_epu64( __mmask8 k, __m512d a);
VCVTPD2UQQ __m512i _mm512_cvt_roundpd_epu64( __m512d a, int r);
VCVTPD2UQQ __m512i _mm512_mask_cvt_roundpd_epu64( __m512i s, __mmask8 k, __m512d a, int r);
VCVTPD2UQQ __m512i _mm512_maskz_cvt_roundpd_epu64( __mmask8 k, __m512d a, int r);
VCVTPD2UQQ __m256i _mm256_mask_cvtpd_epu64( __m256i s, __mmask8 k, __m256d a);
VCVTPD2UQQ __m256i _mm256_maskz_cvtpd_epu64( __mmask8 k, __m256d a);
VCVTPD2UQQ __m128i _mm_mask_cvtpd_epu64( __m128i s, __mmask8 k, __m128d a);
VCVTPD2UQQ __m128i _mm_maskz_cvtpd_epu64( __mmask8 k, __m128d a);
VCVTPD2UQQ __m256i _mm256_cvtpd_epu64 (__m256d src)
VCVTPD2UQQ __m128i _mm_cvtpd_epu64 (__m128d src)

SIMD Floating-Point Exceptions

Invalid, Precision

Other Exceptions

EVEX-encoded instructions, see Exceptions Type E2.
#UD If EVEX.vvvv != 1111B.

VCVTPH2PS—Convert 16-bit FP values to Single-Precision FP values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F38.W0 13 /r VCVTPH2PS xmm1, xmm2/m64 | RM | V/V | F16C | Convert four packed half-precision (16-bit) floating-point values in xmm2/m64 to packed single-precision floating-point values in xmm1.
VEX.256.66.0F38.W0 13 /r VCVTPH2PS ymm1, xmm2/m128 | RM | V/V | F16C | Convert eight packed half-precision (16-bit) floating-point values in xmm2/m128 to packed single-precision floating-point values in ymm1.
EVEX.128.66.0F38.W0 13 /r VCVTPH2PS xmm1 {k1}{z}, xmm2/m64 | HVM | V/V | AVX512VL AVX512F | Convert four packed half-precision (16-bit) floating-point values in xmm2/m64 to packed single-precision floating-point values in xmm1.
EVEX.256.66.0F38.W0 13 /r VCVTPH2PS ymm1 {k1}{z}, xmm2/m128 | HVM | V/V | AVX512VL AVX512F | Convert eight packed half-precision (16-bit) floating-point values in xmm2/m128 to packed single-precision floating-point values in ymm1.
EVEX.512.66.0F38.W0 13 /r VCVTPH2PS zmm1 {k1}{z}, ymm2/m256 {sae} | HVM | V/V | AVX512F | Convert sixteen packed half-precision (16-bit) floating-point values in ymm2/m256 to packed single-precision floating-point values in zmm1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
HVM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Converts packed half-precision (16-bit) floating-point values in the low-order bits of the source operand (the second operand) to packed single-precision floating-point values and writes the converted values into the destination operand (the first operand).

In case of a denormal operand, the correct normal result is returned. MXCSR.DAZ is ignored and is treated as if it were 0. No denormal exception is reported on MXCSR.

VEX.128 version: The source operand is an XMM register or a 64-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128) of the corresponding destination register are zeroed.

VEX.256 version: The source operand is an XMM register or a 128-bit memory location. The destination operand is a YMM register. Bits (MAX_VL-1:256) of the corresponding destination register are zeroed.
EVEX encoded versions: The source operand is a YMM/XMM/XMM (low 64 bits) register or a 256/128/64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
Figure 5-6 illustrates how data is converted from four packed half-precision values (in 64 bits) to four single-precision values (in 128 bits).
Note: VEX.vvvv and EVEX.vvvv are reserved (must be 1111b).

[Figure 5-6. VCVTPH2PS (128-bit Version): each of the four half-precision elements VH3..VH0 in the low 64 bits of xmm2/mem64 is converted to the corresponding single-precision element VS3..VS0 of xmm1.]

Operation
vCvt_h2s(SRC1[15:0])
{
RETURN Cvt_Half_Precision_To_Single_Precision(SRC1[15:0]);
}

VCVTPH2PS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    k ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← vCvt_h2s(SRC[k+15:k])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VCVTPH2PS (VEX.256 encoded version)
DEST[31:0] ← vCvt_h2s(SRC1[15:0]);
DEST[63:32] ← vCvt_h2s(SRC1[31:16]);
DEST[95:64] ← vCvt_h2s(SRC1[47:32]);
DEST[127:96] ← vCvt_h2s(SRC1[63:48]);
DEST[159:128] ← vCvt_h2s(SRC1[79:64]);
DEST[191:160] ← vCvt_h2s(SRC1[95:80]);
DEST[223:192] ← vCvt_h2s(SRC1[111:96]);
DEST[255:224] ← vCvt_h2s(SRC1[127:112]);
DEST[MAX_VL-1:256] ← 0

VCVTPH2PS (VEX.128 encoded version)
DEST[31:0] ← vCvt_h2s(SRC1[15:0]);
DEST[63:32] ← vCvt_h2s(SRC1[31:16]);
DEST[95:64] ← vCvt_h2s(SRC1[47:32]);
DEST[127:96] ← vCvt_h2s(SRC1[63:48]);
DEST[MAX_VL-1:128] ← 0

Flags Affected
None

Intel C/C++ Compiler Intrinsic Equivalent
VCVTPH2PS __m512 _mm512_cvtph_ps( __m256i a);
VCVTPH2PS __m512 _mm512_mask_cvtph_ps(__m512 s, __mmask16 k, __m256i a);
VCVTPH2PS __m512 _mm512_maskz_cvtph_ps(__mmask16 k, __m256i a);
VCVTPH2PS __m512 _mm512_cvt_roundph_ps( __m256i a, int sae);
VCVTPH2PS __m512 _mm512_mask_cvt_roundph_ps(__m512 s, __mmask16 k, __m256i a, int sae);
VCVTPH2PS __m512 _mm512_maskz_cvt_roundph_ps( __mmask16 k, __m256i a, int sae);
VCVTPH2PS __m256 _mm256_mask_cvtph_ps(__m256 s, __mmask8 k, __m128i a);
VCVTPH2PS __m256 _mm256_maskz_cvtph_ps(__mmask8 k, __m128i a);
VCVTPH2PS __m128 _mm_mask_cvtph_ps(__m128 s, __mmask8 k, __m128i a);
VCVTPH2PS __m128 _mm_maskz_cvtph_ps(__mmask8 k, __m128i a);
VCVTPH2PS __m128 _mm_cvtph_ps ( __m128i m1);
VCVTPH2PS __m256 _mm256_cvtph_ps ( __m128i m1);

SIMD Floating-Point Exceptions
Invalid

Other Exceptions
VEX-encoded instructions, see Exceptions Type 11 (do not report #AC); EVEX-encoded instructions, see Exceptions Type E11.
#UD If VEX.W=1.
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
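As a minimal illustration, not from the SDM, the sketch below feeds four raw FP16 bit patterns through the _mm_cvtph_ps form listed above. It assumes an F16C-capable processor and compilation with -mf16c; the bit patterns and main() wrapper are illustrative.

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        /* Four half-precision values in the low 64 bits: 1.0, 2.0, 4.0, 8.0
           (raw FP16 bit patterns); the upper lanes are ignored. */
        __m128i h = _mm_setr_epi16(0x3C00, 0x4000, 0x4400, 0x4800, 0, 0, 0, 0);
        __m128 s = _mm_cvtph_ps(h);
        float out[4];
        _mm_storeu_ps(out, s);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); /* 1 2 4 8 */
        return 0;
    }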
VCVTPS2PH—Convert Single-Precision FP value to 16-bit FP value

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F3A.W0 1D /r ib VCVTPS2PH xmm1/m64, xmm2, imm8 | MRI | V/V | F16C | Convert four packed single-precision floating-point values in xmm2 to packed half-precision (16-bit) floating-point values in xmm1/m64. Imm8 provides rounding controls.
VEX.256.66.0F3A.W0 1D /r ib VCVTPS2PH xmm1/m128, ymm2, imm8 | MRI | V/V | F16C | Convert eight packed single-precision floating-point values in ymm2 to packed half-precision (16-bit) floating-point values in xmm1/m128. Imm8 provides rounding controls.
EVEX.128.66.0F3A.W0 1D /r ib VCVTPS2PH xmm1/m64 {k1}{z}, xmm2, imm8 | HVM | V/V | AVX512VL AVX512F | Convert four packed single-precision floating-point values in xmm2 to packed half-precision (16-bit) floating-point values in xmm1/m64. Imm8 provides rounding controls.
EVEX.256.66.0F3A.W0 1D /r ib VCVTPS2PH xmm1/m128 {k1}{z}, ymm2, imm8 | HVM | V/V | AVX512VL AVX512F | Convert eight packed single-precision floating-point values in ymm2 to packed half-precision (16-bit) floating-point values in xmm1/m128. Imm8 provides rounding controls.
EVEX.512.66.0F3A.W0 1D /r ib VCVTPS2PH ymm1/m256 {k1}{z}, zmm2{sae}, imm8 | HVM | V/V | AVX512F | Convert sixteen packed single-precision floating-point values in zmm2 to packed half-precision (16-bit) floating-point values in ymm1/m256. Imm8 provides rounding controls.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
MRI | ModRM:r/m (w) | ModRM:reg (r) | Imm8 | NA
HVM | ModRM:r/m (w) | ModRM:reg (r) | Imm8 | NA

Description
Convert packed single-precision floating-point values in the source operand to half-precision (16-bit) floating-point values and store to the destination operand. The rounding mode is specified using the immediate field (imm8).
Underflow results (i.e., tiny results) are converted to denormals. MXCSR.FTZ is ignored. If a source element is denormal relative to the input format with DM masked and at least one of PM or UM unmasked, a SIMD exception will be raised with DE, UE, and PE set.

[Figure 5-7. VCVTPS2PH (128-bit Version): each of the four single-precision elements VS3..VS0 of xmm2 is converted to the corresponding half-precision element VH3..VH0 in the low 64 bits of xmm1/mem64.]

The immediate byte defines several bit fields that control the rounding operation. The effect and encoding of the RC field are listed in Table 5-3.

Table 5-3. Immediate Byte Encoding for 16-bit Floating-Point Conversion Instructions
Bits | Field Name/Value | Description | Comment
Imm[1:0] | RC=00B | Round to nearest even | If Imm[2] = 0
Imm[1:0] | RC=01B | Round down | If Imm[2] = 0
Imm[1:0] | RC=10B | Round up | If Imm[2] = 0
Imm[1:0] | RC=11B | Truncate | If Imm[2] = 0
Imm[2] | MS1=0 | Use imm[1:0] for rounding | Ignore MXCSR.RC
Imm[2] | MS1=1 | Use MXCSR.RC for rounding |
Imm[7:3] | Ignored | Ignored by processor |

VEX.128 version: The source operand is an XMM register. The destination operand is an XMM register or 64-bit memory location. If the destination operand is a register, then the upper bits (MAX_VL-1:64) of the corresponding register are zeroed.
VEX.256 version: The source operand is a YMM register. The destination operand is an XMM register or 128-bit memory location. If the destination operand is a register, the upper bits (MAX_VL-1:128) of the corresponding destination register are zeroed.
Note: VEX.vvvv and EVEX.vvvv are reserved (must be 1111b).
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register. The destination operand is a YMM/XMM/XMM (low 64 bits) register or a 256/128/64-bit memory location, conditionally updated with writemask k1. Bits (MAX_VL-1:256/128/64) of the corresponding destination register are zeroed.

Operation
vCvt_s2h(SRC1[31:0])
{
IF Imm[2] = 0
    THEN ; using Imm[1:0] for rounding control, see Table 5-3
        RETURN Cvt_Single_Precision_To_Half_Precision_FP_Imm(SRC1[31:0]);
    ELSE ; using MXCSR.RC for rounding control
        RETURN Cvt_Single_Precision_To_Half_Precision_FP_Mxcsr(SRC1[31:0]);
FI;
}

VCVTPS2PH (EVEX encoded versions) when dest is a register
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← vCvt_s2h(SRC[k+31:k])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+15:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL/2] ← 0

VCVTPS2PH (EVEX encoded versions) when dest is memory
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← vCvt_s2h(SRC[k+31:k])
        ELSE *DEST[i+15:i] remains unchanged* ; merging-masking
    FI;
ENDFOR

VCVTPS2PH (VEX.256 encoded version)
DEST[15:0] ← vCvt_s2h(SRC1[31:0]);
DEST[31:16] ← vCvt_s2h(SRC1[63:32]);
DEST[47:32] ← vCvt_s2h(SRC1[95:64]);
DEST[63:48] ← vCvt_s2h(SRC1[127:96]);
DEST[79:64] ← vCvt_s2h(SRC1[159:128]);
DEST[95:80] ← vCvt_s2h(SRC1[191:160]);
DEST[111:96] ← vCvt_s2h(SRC1[223:192]);
DEST[127:112] ← vCvt_s2h(SRC1[255:224]);
DEST[MAX_VL-1:128] ← 0

VCVTPS2PH (VEX.128 encoded version)
DEST[15:0] ← vCvt_s2h(SRC1[31:0]);
DEST[31:16] ← vCvt_s2h(SRC1[63:32]);
DEST[47:32] ← vCvt_s2h(SRC1[95:64]);
DEST[63:48] ← vCvt_s2h(SRC1[127:96]);
DEST[MAX_VL-1:64] ← 0

Flags Affected
None

Intel C/C++ Compiler Intrinsic Equivalent
VCVTPS2PH __m256i _mm512_cvtps_ph(__m512 a);
VCVTPS2PH __m256i _mm512_mask_cvtps_ph(__m256i s, __mmask16 k, __m512 a);
VCVTPS2PH __m256i _mm512_maskz_cvtps_ph(__mmask16 k, __m512 a);
VCVTPS2PH __m256i _mm512_cvt_roundps_ph(__m512 a, const int imm);
VCVTPS2PH __m256i _mm512_mask_cvt_roundps_ph(__m256i s, __mmask16 k, __m512 a, const int imm);
VCVTPS2PH __m256i _mm512_maskz_cvt_roundps_ph(__mmask16 k, __m512 a, const int imm);
VCVTPS2PH __m128i _mm256_mask_cvtps_ph(__m128i s, __mmask8 k, __m256 a);
VCVTPS2PH __m128i _mm256_maskz_cvtps_ph(__mmask8 k, __m256 a);
VCVTPS2PH __m128i _mm_mask_cvtps_ph(__m128i s, __mmask8 k, __m128 a);
VCVTPS2PH __m128i _mm_maskz_cvtps_ph(__mmask8 k, __m128 a);
VCVTPS2PH __m128i _mm_cvtps_ph ( __m128 m1, const int imm);
VCVTPS2PH __m128i _mm256_cvtps_ph(__m256 m1, const int imm);

SIMD Floating-Point Exceptions
Invalid, Underflow, Overflow, Precision, Denormal (if MXCSR.DAZ=0)

Other Exceptions
VEX-encoded instructions, see Exceptions Type 11 (do not report #AC); EVEX-encoded instructions, see Exceptions Type E11.
#UD If VEX.W=1.
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
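A small round-trip sketch, illustrative rather than from the SDM, showing the imm8 rounding control of Table 5-3: imm8 = 0 sets imm[2] = 0 and RC = 00B (round to nearest even). It assumes F16C support and compilation with -mf16c; the inputs are exactly representable in FP16, so the round trip is lossless.

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m128 s = _mm_setr_ps(1.0f, 2.5f, -0.125f, 65504.0f);
        /* imm8 = 0: imm[2]=0 selects imm[1:0]; RC=00B = round to nearest
           even (Table 5-3).  65504 is the largest finite FP16 value. */
        __m128i h = _mm_cvtps_ph(s, 0);
        __m128 back = _mm_cvtph_ps(h);
        float out[4];
        _mm_storeu_ps(out, back);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        return 0;
    }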
VCVTPS2UDQ—Convert Packed Single-Precision Floating-Point Values to Packed Unsigned Doubleword Integer Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.0F.W0 79 /r VCVTPS2UDQ xmm1 {k1}{z}, xmm2/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Convert four packed single-precision floating-point values from xmm2/m128/m32bcst to four packed unsigned doubleword values in xmm1 subject to writemask k1.
EVEX.256.0F.W0 79 /r VCVTPS2UDQ ymm1 {k1}{z}, ymm2/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Convert eight packed single-precision floating-point values from ymm2/m256/m32bcst to eight packed unsigned doubleword values in ymm1 subject to writemask k1.
EVEX.512.0F.W0 79 /r VCVTPS2UDQ zmm1 {k1}{z}, zmm2/m512/m32bcst{er} | FV | V/V | AVX512F | Convert sixteen packed single-precision floating-point values from zmm2/m512/m32bcst to sixteen packed unsigned doubleword values in zmm1 subject to writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
Converts sixteen packed single-precision floating-point values in the source operand to sixteen unsigned doubleword integers in the destination operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTPS2UDQ (EVEX encoded versions) when src operand is a register
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← Convert_Single_Precision_Floating_Point_To_UInteger(SRC[i+31:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VCVTPS2UDQ (EVEX encoded versions) when src operand is a memory source
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+31:i] ← Convert_Single_Precision_Floating_Point_To_UInteger(SRC[31:0])
                ELSE DEST[i+31:i] ← Convert_Single_Precision_Floating_Point_To_UInteger(SRC[i+31:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
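The Operation above saturates unrepresentable results to 2^w – 1 when the invalid exception is masked. A short sketch, illustrative and not from the SDM, using the _mm512_cvtps_epu32 form listed below; it assumes AVX512F support, -mavx512f, and default MXCSR state (invalid masked, round to nearest even).

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* -1.0f has no unsigned representation, so with #I masked the
           result is 2^32 - 1 = 4294967295. */
        __m512 src = _mm512_setr_ps(1.5f, -1.0f, 4.0e9f, 0.0f,
                                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
        __m512i dst = _mm512_cvtps_epu32(src);
        uint32_t out[16];
        _mm512_storeu_si512(out, dst);
        printf("%u %u %u %u\n", out[0], out[1], out[2], out[3]);
        /* 2 4294967295 4000000000 0 under round-to-nearest-even */
        return 0;
    }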
Intel C/C++ Compiler Intrinsic Equivalent
VCVTPS2UDQ __m512i _mm512_cvtps_epu32( __m512 a);
VCVTPS2UDQ __m512i _mm512_mask_cvtps_epu32( __m512i s, __mmask16 k, __m512 a);
VCVTPS2UDQ __m512i _mm512_maskz_cvtps_epu32( __mmask16 k, __m512 a);
VCVTPS2UDQ __m512i _mm512_cvt_roundps_epu32( __m512 a, int r);
VCVTPS2UDQ __m512i _mm512_mask_cvt_roundps_epu32( __m512i s, __mmask16 k, __m512 a, int r);
VCVTPS2UDQ __m512i _mm512_maskz_cvt_roundps_epu32( __mmask16 k, __m512 a, int r);
VCVTPS2UDQ __m256i _mm256_cvtps_epu32( __m256 a);
VCVTPS2UDQ __m256i _mm256_mask_cvtps_epu32( __m256i s, __mmask8 k, __m256 a);
VCVTPS2UDQ __m256i _mm256_maskz_cvtps_epu32( __mmask8 k, __m256 a);
VCVTPS2UDQ __m128i _mm_cvtps_epu32( __m128 a);
VCVTPS2UDQ __m128i _mm_mask_cvtps_epu32( __m128i s, __mmask8 k, __m128 a);
VCVTPS2UDQ __m128i _mm_maskz_cvtps_epu32( __mmask8 k, __m128 a);

SIMD Floating-Point Exceptions
Invalid, Precision

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E2.
#UD If EVEX.vvvv != 1111B.

VCVTPS2QQ—Convert Packed Single-Precision Floating-Point Values to Packed Signed Quadword Integer Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F.W0 7B /r VCVTPS2QQ xmm1 {k1}{z}, xmm2/m64/m32bcst | HV | V/V | AVX512VL AVX512DQ | Convert two packed single-precision floating-point values from xmm2/m64/m32bcst to two packed signed quadword values in xmm1 subject to writemask k1.
EVEX.256.66.0F.W0 7B /r VCVTPS2QQ ymm1 {k1}{z}, xmm2/m128/m32bcst | HV | V/V | AVX512VL AVX512DQ | Convert four packed single-precision floating-point values from xmm2/m128/m32bcst to four packed signed quadword values in ymm1 subject to writemask k1.
EVEX.512.66.0F.W0 7B /r VCVTPS2QQ zmm1 {k1}{z}, ymm2/m256/m32bcst{er} | HV | V/V | AVX512DQ | Convert eight packed single-precision floating-point values from ymm2/m256/m32bcst to eight packed signed quadword values in zmm1 subject to writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
HV | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
Converts eight packed single-precision floating-point values in the source operand to eight signed quadword integers in the destination operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (2^(w-1), where w represents the number of bits in the destination format) is returned.
The source operand is a YMM/XMM/XMM (low 64 bits) register or a 256/128/64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Operation
VCVTPS2QQ (EVEX encoded versions) when src operand is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_Single_Precision_To_QuadInteger(SRC[k+31:k])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VCVTPS2QQ (EVEX encoded versions) when src operand is a memory source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b == 1)
                THEN DEST[i+63:i] ← Convert_Single_Precision_To_QuadInteger(SRC[31:0])
                ELSE DEST[i+63:i] ← Convert_Single_Precision_To_QuadInteger(SRC[k+31:k])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTPS2QQ __m512i _mm512_cvtps_epi64( __m256 a);
VCVTPS2QQ __m512i _mm512_mask_cvtps_epi64( __m512i s, __mmask8 k, __m256 a);
VCVTPS2QQ __m512i _mm512_maskz_cvtps_epi64( __mmask8 k, __m256 a);
VCVTPS2QQ __m512i _mm512_cvt_roundps_epi64( __m256 a, int r);
VCVTPS2QQ __m512i _mm512_mask_cvt_roundps_epi64( __m512i s, __mmask8 k, __m256 a, int r);
VCVTPS2QQ __m512i _mm512_maskz_cvt_roundps_epi64( __mmask8 k, __m256 a, int r);
VCVTPS2QQ __m256i _mm256_cvtps_epi64( __m128 a);
VCVTPS2QQ __m256i _mm256_mask_cvtps_epi64( __m256i s, __mmask8 k, __m128 a);
VCVTPS2QQ __m256i _mm256_maskz_cvtps_epi64( __mmask8 k, __m128 a);
VCVTPS2QQ __m128i _mm_cvtps_epi64( __m128 a);
VCVTPS2QQ __m128i _mm_mask_cvtps_epi64( __m128i s, __mmask8 k, __m128 a);
VCVTPS2QQ __m128i _mm_maskz_cvtps_epi64( __mmask8 k, __m128 a);

SIMD Floating-Point Exceptions
Invalid, Precision

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E3.
#UD If EVEX.vvvv != 1111B.

VCVTPS2UQQ—Convert Packed Single-Precision Floating-Point Values to Packed Unsigned Quadword Integer Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F.W0 79 /r VCVTPS2UQQ xmm1 {k1}{z}, xmm2/m64/m32bcst | HV | V/V | AVX512VL AVX512DQ | Convert two packed single-precision floating-point values from xmm2/m64/m32bcst to two packed unsigned quadword values in xmm1 subject to writemask k1.
EVEX.256.66.0F.W0 79 /r VCVTPS2UQQ ymm1 {k1}{z}, xmm2/m128/m32bcst | HV | V/V | AVX512VL AVX512DQ | Convert four packed single-precision floating-point values from xmm2/m128/m32bcst to four packed unsigned quadword values in ymm1 subject to writemask k1.
EVEX.512.66.0F.W0 79 /r VCVTPS2UQQ zmm1 {k1}{z}, ymm2/m256/m32bcst{er} | HV | V/V | AVX512DQ | Convert eight packed single-precision floating-point values from ymm2/m256/m32bcst to eight packed unsigned quadword values in zmm1 subject to writemask k1.
Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
HV | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
Converts up to eight packed single-precision floating-point values in the source operand to unsigned quadword integers in the destination operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
The source operand is a YMM/XMM/XMM (low 64 bits) register or a 256/128/64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTPS2UQQ (EVEX encoded versions) when src operand is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_Single_Precision_To_UQuadInteger(SRC[k+31:k])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VCVTPS2UQQ (EVEX encoded versions) when src operand is a memory source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b == 1)
                THEN DEST[i+63:i] ← Convert_Single_Precision_To_UQuadInteger(SRC[31:0])
                ELSE DEST[i+63:i] ← Convert_Single_Precision_To_UQuadInteger(SRC[k+31:k])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTPS2UQQ __m512i _mm512_cvtps_epu64( __m256 a);
VCVTPS2UQQ __m512i _mm512_mask_cvtps_epu64( __m512i s, __mmask8 k, __m256 a);
VCVTPS2UQQ __m512i _mm512_maskz_cvtps_epu64( __mmask8 k, __m256 a);
VCVTPS2UQQ __m512i _mm512_cvt_roundps_epu64( __m256 a, int r);
VCVTPS2UQQ __m512i _mm512_mask_cvt_roundps_epu64( __m512i s, __mmask8 k, __m256 a, int r);
VCVTPS2UQQ __m512i _mm512_maskz_cvt_roundps_epu64( __mmask8 k, __m256 a, int r);
VCVTPS2UQQ __m256i _mm256_cvtps_epu64( __m128 a);
VCVTPS2UQQ __m256i _mm256_mask_cvtps_epu64( __m256i s, __mmask8 k, __m128 a);
VCVTPS2UQQ __m256i _mm256_maskz_cvtps_epu64( __mmask8 k, __m128 a);
VCVTPS2UQQ __m128i _mm_cvtps_epu64( __m128 a);
VCVTPS2UQQ __m128i _mm_mask_cvtps_epu64( __m128i s, __mmask8 k, __m128 a);
VCVTPS2UQQ __m128i _mm_maskz_cvtps_epu64( __mmask8 k, __m128 a);

SIMD Floating-Point Exceptions
Invalid, Precision

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E3.
#UD If EVEX.vvvv != 1111B.
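A brief sketch, illustrative rather than from the SDM, of the 128-bit form listed above: only the two low single-precision elements of the source participate. It assumes AVX512DQ and AVX512VL support (-mavx512dq -mavx512vl) and default MXCSR rounding.

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* The upper two floats (-99.0f) are ignored by the 128-bit form. */
        __m128 src = _mm_setr_ps(3.0f, 4.0e9f, -99.0f, -99.0f);
        __m128i dst = _mm_cvtps_epu64(src);
        uint64_t out[2];
        _mm_storeu_si128((__m128i *)out, dst);
        printf("%llu %llu\n", (unsigned long long)out[0],
               (unsigned long long)out[1]);   /* 3 4000000000 */
        return 0;
    }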
VCVTQQ2PD—Convert Packed Quadword Integers to Packed Double-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.F3.0F.W1 E6 /r VCVTQQ2PD xmm1 {k1}{z}, xmm2/m128/m64bcst | FV | V/V | AVX512VL AVX512DQ | Convert two packed quadword integers from xmm2/m128/m64bcst to packed double-precision floating-point values in xmm1 with writemask k1.
EVEX.256.F3.0F.W1 E6 /r VCVTQQ2PD ymm1 {k1}{z}, ymm2/m256/m64bcst | FV | V/V | AVX512VL AVX512DQ | Convert four packed quadword integers from ymm2/m256/m64bcst to packed double-precision floating-point values in ymm1 with writemask k1.
EVEX.512.F3.0F.W1 E6 /r VCVTQQ2PD zmm1 {k1}{z}, zmm2/m512/m64bcst{er} | FV | V/V | AVX512DQ | Convert eight packed quadword integers from zmm2/m512/m64bcst to eight packed double-precision floating-point values in zmm1 with writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
Converts packed quadword integers in the source operand (second operand) to packed double-precision floating-point values in the destination operand (first operand).
The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTQQ2PD (EVEX encoded versions) when src operand is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_QuadInteger_To_Double_Precision_Floating_Point(SRC[i+63:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VCVTQQ2PD (EVEX encoded versions) when src operand is a memory source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b == 1)
                THEN DEST[i+63:i] ← Convert_QuadInteger_To_Double_Precision_Floating_Point(SRC[63:0])
                ELSE DEST[i+63:i] ← Convert_QuadInteger_To_Double_Precision_Floating_Point(SRC[i+63:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTQQ2PD __m512d _mm512_cvtepi64_pd( __m512i a);
VCVTQQ2PD __m512d _mm512_mask_cvtepi64_pd( __m512d s, __mmask8 k, __m512i a);
VCVTQQ2PD __m512d _mm512_maskz_cvtepi64_pd( __mmask8 k, __m512i a);
VCVTQQ2PD __m512d _mm512_cvt_roundepi64_pd( __m512i a, int r);
VCVTQQ2PD __m512d _mm512_mask_cvt_roundepi64_pd( __m512d s, __mmask8 k, __m512i a, int r);
VCVTQQ2PD __m512d _mm512_maskz_cvt_roundepi64_pd( __mmask8 k, __m512i a, int r);
VCVTQQ2PD __m256d _mm256_mask_cvtepi64_pd( __m256d s, __mmask8 k, __m256i a);
VCVTQQ2PD __m256d _mm256_maskz_cvtepi64_pd( __mmask8 k, __m256i a);
VCVTQQ2PD __m128d _mm_mask_cvtepi64_pd( __m128d s, __mmask8 k, __m128i a);
VCVTQQ2PD __m128d _mm_maskz_cvtepi64_pd( __mmask8 k, __m128i a);

SIMD Floating-Point Exceptions
Precision

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E2.
#UD If EVEX.vvvv != 1111B.
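A short sketch, illustrative and not from the SDM, of _mm512_cvtepi64_pd from the list above: quadword magnitudes up to 2^53 convert exactly, larger ones round per MXCSR.RC and set the Precision flag. It assumes AVX512DQ support and -mavx512dq.

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        /* _mm512_set_epi64 takes elements high to low, so element 0 = -5,
           element 1 = 1, element 2 = 2^53 + 1 (not exactly representable
           in double; rounds to 2^53 under round-to-nearest-even). */
        __m512i q = _mm512_set_epi64(0, 0, 0, 0, 0, (1LL << 53) + 1, 1, -5);
        __m512d d = _mm512_cvtepi64_pd(q);
        double out[8];
        _mm512_storeu_pd(out, d);
        printf("%.1f %.1f %.1f\n", out[0], out[1], out[2]);
        /* -5.0 1.0 9007199254740992.0 */
        return 0;
    }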
VCVTQQ2PS—Convert Packed Quadword Integers to Packed Single-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.0F.W1 5B /r VCVTQQ2PS xmm1 {k1}{z}, xmm2/m128/m64bcst | FV | V/V | AVX512VL AVX512DQ | Convert two packed quadword integers from xmm2/mem to packed single-precision floating-point values in xmm1 with writemask k1.
EVEX.256.0F.W1 5B /r VCVTQQ2PS xmm1 {k1}{z}, ymm2/m256/m64bcst | FV | V/V | AVX512VL AVX512DQ | Convert four packed quadword integers from ymm2/mem to packed single-precision floating-point values in xmm1 with writemask k1.
EVEX.512.0F.W1 5B /r VCVTQQ2PS ymm1 {k1}{z}, zmm2/m512/m64bcst{er} | FV | V/V | AVX512DQ | Convert eight packed quadword integers from zmm2/mem to eight packed single-precision floating-point values in ymm1 with writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
Converts packed quadword integers in the source operand (second operand) to packed single-precision floating-point values in the destination operand (first operand).
The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a YMM/XMM/XMM (lower 64 bits) register conditionally updated with writemask k1.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTQQ2PS (EVEX encoded versions) when src operand is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[k+31:k] ← Convert_QuadInteger_To_Single_Precision_Floating_Point(SRC[i+63:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[k+31:k] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[k+31:k] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL/2] ← 0
VCVTQQ2PS (EVEX encoded versions) when src operand is a memory source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b == 1)
                THEN DEST[k+31:k] ← Convert_QuadInteger_To_Single_Precision_Floating_Point(SRC[63:0])
                ELSE DEST[k+31:k] ← Convert_QuadInteger_To_Single_Precision_Floating_Point(SRC[i+63:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[k+31:k] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[k+31:k] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL/2] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTQQ2PS __m256 _mm512_cvtepi64_ps( __m512i a);
VCVTQQ2PS __m256 _mm512_mask_cvtepi64_ps( __m256 s, __mmask8 k, __m512i a);
VCVTQQ2PS __m256 _mm512_maskz_cvtepi64_ps( __mmask8 k, __m512i a);
VCVTQQ2PS __m256 _mm512_cvt_roundepi64_ps( __m512i a, int r);
VCVTQQ2PS __m256 _mm512_mask_cvt_roundepi64_ps( __m256 s, __mmask8 k, __m512i a, int r);
VCVTQQ2PS __m256 _mm512_maskz_cvt_roundepi64_ps( __mmask8 k, __m512i a, int r);
VCVTQQ2PS __m128 _mm256_cvtepi64_ps( __m256i a);
VCVTQQ2PS __m128 _mm256_mask_cvtepi64_ps( __m128 s, __mmask8 k, __m256i a);
VCVTQQ2PS __m128 _mm256_maskz_cvtepi64_ps( __mmask8 k, __m256i a);
VCVTQQ2PS __m128 _mm_cvtepi64_ps( __m128i a);
VCVTQQ2PS __m128 _mm_mask_cvtepi64_ps( __m128 s, __mmask8 k, __m128i a);
VCVTQQ2PS __m128 _mm_maskz_cvtepi64_ps( __mmask8 k, __m128i a);

SIMD Floating-Point Exceptions
Precision

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E2.
#UD If EVEX.vvvv != 1111B.

VCVTSD2USI—Convert Scalar Double-Precision Floating-Point Value to Unsigned Doubleword Integer

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.LIG.F2.0F.W0 79 /r VCVTSD2USI r32, xmm1/m64{er} | T1F | V/V | AVX512F | Convert one double-precision floating-point value from xmm1/m64 to one unsigned doubleword integer r32.
EVEX.LIG.F2.0F.W1 79 /r VCVTSD2USI r64, xmm1/m64{er} | T1F | V/N.E.1 | AVX512F | Convert one double-precision floating-point value from xmm1/m64 to one unsigned quadword integer zero-extended into r64.
NOTES:
1. EVEX.W1 in non-64 bit is ignored; the instruction behaves as if the W0 version is used.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
T1F | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
Converts a double-precision floating-point value in the source operand (the second operand) to an unsigned doubleword integer in the destination operand (the first operand). The source operand can be an XMM register or a 64-bit memory location. The destination operand is a general-purpose register. When the source operand is an XMM register, the double-precision floating-point value is contained in the low quadword of the register.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
Operation
VCVTSD2USI (EVEX encoded version)
IF (SRC *is register*) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF 64-Bit Mode and OperandSize = 64
    THEN DEST[63:0] ← Convert_Double_Precision_Floating_Point_To_UInteger(SRC[63:0]);
    ELSE DEST[31:0] ← Convert_Double_Precision_Floating_Point_To_UInteger(SRC[63:0]);
FI

Intel C/C++ Compiler Intrinsic Equivalent
VCVTSD2USI unsigned int _mm_cvtsd_u32(__m128d);
VCVTSD2USI unsigned int _mm_cvt_roundsd_u32(__m128d, int r);
VCVTSD2USI unsigned __int64 _mm_cvtsd_u64(__m128d);
VCVTSD2USI unsigned __int64 _mm_cvt_roundsd_u64(__m128d, int r);

SIMD Floating-Point Exceptions
Invalid, Precision

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E3NF.

VCVTSS2USI—Convert Scalar Single-Precision Floating-Point Value to Unsigned Doubleword Integer

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.LIG.F3.0F.W0 79 /r VCVTSS2USI r32, xmm1/m32{er} | T1F | V/V | AVX512F | Convert one single-precision floating-point value from xmm1/m32 to one unsigned doubleword integer in r32.
EVEX.LIG.F3.0F.W1 79 /r VCVTSS2USI r64, xmm1/m32{er} | T1F | V/N.E.1 | AVX512F | Convert one single-precision floating-point value from xmm1/m32 to one unsigned quadword integer in r64.
NOTES:
1. EVEX.W1 in non-64 bit is ignored; the instruction behaves as if the W0 version is used.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
T1F | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
Converts a single-precision floating-point value in the source operand (the second operand) to an unsigned doubleword integer (or unsigned quadword integer if operand size is 64 bits) in the destination operand (the first operand). The source operand can be an XMM register or a memory location. The destination operand is a general-purpose register. When the source operand is an XMM register, the single-precision floating-point value is contained in the low doubleword of the register.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
EVEX.W1 version: promotes the instruction to produce 64-bit data in 64-bit mode.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTSS2USI (EVEX encoded version)
IF (SRC *is register*) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF 64-bit Mode and OperandSize = 64
    THEN DEST[63:0] ← Convert_Single_Precision_Floating_Point_To_UInteger(SRC[31:0]);
    ELSE DEST[31:0] ← Convert_Single_Precision_Floating_Point_To_UInteger(SRC[31:0]);
FI;

Intel C/C++ Compiler Intrinsic Equivalent
VCVTSS2USI unsigned _mm_cvtss_u32( __m128 a);
VCVTSS2USI unsigned _mm_cvt_roundss_u32( __m128 a, int r);
VCVTSS2USI unsigned __int64 _mm_cvtss_u64( __m128 a);
VCVTSS2USI unsigned __int64 _mm_cvt_roundss_u64( __m128 a, int r);

SIMD Floating-Point Exceptions
Invalid, Precision

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E3NF.
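A scalar sketch, illustrative and not from the SDM, using the intrinsics listed above; it assumes AVX512F support and -mavx512f. The second call uses the embedded rounding ({er}) form to truncate regardless of MXCSR.RC.

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m128 a = _mm_set_ss(3.5f);
        unsigned r1 = _mm_cvtss_u32(a);   /* MXCSR.RC default (RNE): 3.5 -> 4 */
        unsigned r2 = _mm_cvt_roundss_u32(a,
                          _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);  /* 3 */
        printf("%u %u\n", r1, r2);
        return 0;
    }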
VCVTTPD2QQ—Convert with Truncation Packed Double-Precision Floating-Point Values to Packed Quadword Integers

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F.W1 7A /r VCVTTPD2QQ xmm1 {k1}{z}, xmm2/m128/m64bcst | FV | V/V | AVX512VL AVX512DQ | Convert two packed double-precision floating-point values from xmm2/m128/m64bcst to two packed quadword integers in xmm1 using truncation with writemask k1.
EVEX.256.66.0F.W1 7A /r VCVTTPD2QQ ymm1 {k1}{z}, ymm2/m256/m64bcst | FV | V/V | AVX512VL AVX512DQ | Convert four packed double-precision floating-point values from ymm2/m256/m64bcst to four packed quadword integers in ymm1 using truncation with writemask k1.
EVEX.512.66.0F.W1 7A /r VCVTTPD2QQ zmm1 {k1}{z}, zmm2/m512/m64bcst{sae} | FV | V/V | AVX512DQ | Convert eight packed double-precision floating-point values from zmm2/m512/m64bcst to eight packed quadword integers in zmm1 using truncation with writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
Converts with truncation packed double-precision floating-point values in the source operand (second operand) to packed quadword integers in the destination operand (first operand).
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (2^(w-1), where w represents the number of bits in the destination format) is returned.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTTPD2QQ (EVEX encoded version) when src operand is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_QuadInteger_Truncate(SRC[i+63:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
VCVTTPD2QQ (EVEX encoded version) when src operand is a memory source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b == 1)
                THEN DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_QuadInteger_Truncate(SRC[63:0])
                ELSE DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_QuadInteger_Truncate(SRC[i+63:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTTPD2QQ __m512i _mm512_cvttpd_epi64( __m512d a);
VCVTTPD2QQ __m512i _mm512_mask_cvttpd_epi64( __m512i s, __mmask8 k, __m512d a);
VCVTTPD2QQ __m512i _mm512_maskz_cvttpd_epi64( __mmask8 k, __m512d a);
VCVTTPD2QQ __m512i _mm512_cvtt_roundpd_epi64( __m512d a, int sae);
VCVTTPD2QQ __m512i _mm512_mask_cvtt_roundpd_epi64( __m512i s, __mmask8 k, __m512d a, int sae);
VCVTTPD2QQ __m512i _mm512_maskz_cvtt_roundpd_epi64( __mmask8 k, __m512d a, int sae);
VCVTTPD2QQ __m256i _mm256_mask_cvttpd_epi64( __m256i s, __mmask8 k, __m256d a);
VCVTTPD2QQ __m256i _mm256_maskz_cvttpd_epi64( __mmask8 k, __m256d a);
VCVTTPD2QQ __m128i _mm_mask_cvttpd_epi64( __m128i s, __mmask8 k, __m128d a);
VCVTTPD2QQ __m128i _mm_maskz_cvttpd_epi64( __mmask8 k, __m128d a);

SIMD Floating-Point Exceptions
Invalid, Precision

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E2.
#UD If EVEX.vvvv != 1111B.

VCVTTPD2UDQ—Convert with Truncation Packed Double-Precision Floating-Point Values to Packed Unsigned Doubleword Integers

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.0F.W1 78 /r VCVTTPD2UDQ xmm1 {k1}{z}, xmm2/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Convert two packed double-precision floating-point values in xmm2/m128/m64bcst to two unsigned doubleword integers in xmm1 using truncation subject to writemask k1.
EVEX.256.0F.W1 78 /r VCVTTPD2UDQ xmm1 {k1}{z}, ymm2/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Convert four packed double-precision floating-point values in ymm2/m256/m64bcst to four unsigned doubleword integers in xmm1 using truncation subject to writemask k1.
EVEX.512.0F.W1 78 /r VCVTTPD2UDQ ymm1 {k1}{z}, zmm2/m512/m64bcst{sae} | FV | V/V | AVX512F | Convert eight packed double-precision floating-point values in zmm2/m512/m64bcst to eight unsigned doubleword integers in ymm1 using truncation subject to writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
Converts with truncation packed double-precision floating-point values in the source operand (the second operand) to packed unsigned doubleword integers in the destination operand (the first operand).
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-bit memory location.
The destination operand is a YMM/XMM/XMM (low 64 bits) register conditionally updated with writemask k1. The upper bits (MAX_VL-1:256) of the corresponding destination are zeroed.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTTPD2UDQ (EVEX encoded versions) when src operand is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    k ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← Convert_Double_Precision_Floating_Point_To_UInteger_Truncate(SRC[k+63:k])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL/2] ← 0

VCVTTPD2UDQ (EVEX encoded versions) when src operand is a memory source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    k ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+31:i] ← Convert_Double_Precision_Floating_Point_To_UInteger_Truncate(SRC[63:0])
                ELSE DEST[i+31:i] ← Convert_Double_Precision_Floating_Point_To_UInteger_Truncate(SRC[k+63:k])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL/2] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTTPD2UDQ __m256i _mm512_cvttpd_epu32( __m512d a);
VCVTTPD2UDQ __m256i _mm512_mask_cvttpd_epu32( __m256i s, __mmask8 k, __m512d a);
VCVTTPD2UDQ __m256i _mm512_maskz_cvttpd_epu32( __mmask8 k, __m512d a);
VCVTTPD2UDQ __m256i _mm512_cvtt_roundpd_epu32( __m512d a, int sae);
VCVTTPD2UDQ __m256i _mm512_mask_cvtt_roundpd_epu32( __m256i s, __mmask8 k, __m512d a, int sae);
VCVTTPD2UDQ __m256i _mm512_maskz_cvtt_roundpd_epu32( __mmask8 k, __m512d a, int sae);
VCVTTPD2UDQ __m128i _mm256_mask_cvttpd_epu32( __m128i s, __mmask8 k, __m256d a);
VCVTTPD2UDQ __m128i _mm256_maskz_cvttpd_epu32( __mmask8 k, __m256d a);
VCVTTPD2UDQ __m128i _mm_mask_cvttpd_epu32( __m128i s, __mmask8 k, __m128d a);
VCVTTPD2UDQ __m128i _mm_maskz_cvttpd_epu32( __mmask8 k, __m128d a);

SIMD Floating-Point Exceptions
Invalid, Precision

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E2.
#UD If EVEX.vvvv != 1111B.

VCVTTPD2UQQ—Convert with Truncation Packed Double-Precision Floating-Point Values to Packed Unsigned Quadword Integers

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F.W1 78 /r VCVTTPD2UQQ xmm1 {k1}{z}, xmm2/m128/m64bcst | FV | V/V | AVX512VL AVX512DQ | Convert two packed double-precision floating-point values from xmm2/m128/m64bcst to two packed unsigned quadword integers in xmm1 using truncation with writemask k1.
EVEX.256.66.0F.W1 78 /r VCVTTPD2UQQ ymm1 {k1}{z}, ymm2/m256/m64bcst | FV | V/V | AVX512VL AVX512DQ | Convert four packed double-precision floating-point values from ymm2/m256/m64bcst to four packed unsigned quadword integers in ymm1 using truncation with writemask k1.
EVEX.512.66.0F.W1 78 /r VCVTTPD2UQQ zmm1 {k1}{z}, zmm2/m512/m64bcst{sae} | FV | V/V | AVX512DQ |
Convert eight packed double-precision floating-point values from zmm2/mem to eight packed unsigned quadword integers in zmm1 using truncation with writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
Converts with truncation packed double-precision floating-point values in the source operand (second operand) to packed unsigned quadword integers in the destination operand (first operand).
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTTPD2UQQ (EVEX encoded versions) when src operand is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_UQuadInteger_Truncate(SRC[i+63:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VCVTTPD2UQQ (EVEX encoded versions) when src operand is a memory source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b == 1)
                THEN DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_UQuadInteger_Truncate(SRC[63:0])
                ELSE DEST[i+63:i] ← Convert_Double_Precision_Floating_Point_To_UQuadInteger_Truncate(SRC[i+63:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTTPD2UQQ _mm[_mask[z]]_cvtt[_round]pd_epu64
VCVTTPD2UQQ __m512i _mm512_cvttpd_epu64( __m512d a);
VCVTTPD2UQQ __m512i _mm512_mask_cvttpd_epu64( __m512i s, __mmask8 k, __m512d a);
VCVTTPD2UQQ __m512i _mm512_maskz_cvttpd_epu64( __mmask8 k, __m512d a);
VCVTTPD2UQQ __m512i _mm512_cvtt_roundpd_epu64( __m512d a, int sae);
VCVTTPD2UQQ __m512i _mm512_mask_cvtt_roundpd_epu64( __m512i s, __mmask8 k, __m512d a, int sae);
VCVTTPD2UQQ __m512i _mm512_maskz_cvtt_roundpd_epu64( __mmask8 k, __m512d a, int sae);
VCVTTPD2UQQ __m256i _mm256_mask_cvttpd_epu64( __m256i s, __mmask8 k, __m256d a);
VCVTTPD2UQQ __m256i _mm256_maskz_cvttpd_epu64( __mmask8 k, __m256d a);
VCVTTPD2UQQ __m128i _mm_mask_cvttpd_epu64( __m128i s, __mmask8 k, __m128d a);
VCVTTPD2UQQ __m128i _mm_maskz_cvttpd_epu64( __mmask8 k, __m128d a);

SIMD Floating-Point Exceptions
Invalid, Precision

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E2.
#UD If EVEX.vvvv != 1111B.
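To make the truncation behavior concrete, here is a minimal sketch, illustrative and not from the SDM, contrasting VCVTTPD2UQQ with the non-truncating VCVTPD2UQQ via the intrinsics listed in this chapter; it assumes AVX512DQ support, -mavx512dq, and default MXCSR (round to nearest even).

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        __m512d a = _mm512_set1_pd(2.9);
        uint64_t t[8], r[8];
        _mm512_storeu_si512(t, _mm512_cvttpd_epu64(a)); /* truncate: 2 */
        _mm512_storeu_si512(r, _mm512_cvtpd_epu64(a));  /* MXCSR RNE: 3 */
        printf("%llu %llu\n", (unsigned long long)t[0],
               (unsigned long long)r[0]);
        return 0;
    }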
VCVTTPS2UDQ—Convert with Truncation Packed Single-Precision Floating-Point Values to Packed Unsigned Doubleword Integer Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.0F.W0 78 /r VCVTTPS2UDQ xmm1 {k1}{z}, xmm2/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Convert four packed single-precision floating-point values from xmm2/m128/m32bcst to four packed unsigned doubleword values in xmm1 using truncation subject to writemask k1.
EVEX.256.0F.W0 78 /r VCVTTPS2UDQ ymm1 {k1}{z}, ymm2/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Convert eight packed single-precision floating-point values from ymm2/m256/m32bcst to eight packed unsigned doubleword values in ymm1 using truncation subject to writemask k1.
EVEX.512.0F.W0 78 /r VCVTTPS2UDQ zmm1 {k1}{z}, zmm2/m512/m32bcst{sae} | FV | V/V | AVX512F | Convert sixteen packed single-precision floating-point values from zmm2/m512/m32bcst to sixteen packed unsigned doubleword values in zmm1 using truncation subject to writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
Converts with truncation packed single-precision floating-point values in the source operand to sixteen unsigned doubleword integers in the destination operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTTPS2UDQ (EVEX encoded versions) when src operand is a register
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← Convert_Single_Precision_Floating_Point_To_UInteger_Truncate(SRC[i+31:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
VCVTTPS2UDQ (EVEX encoded versions) when src operand is a memory source
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+31:i] ← Convert_Single_Precision_Floating_Point_To_UInteger_Truncate(SRC[31:0])
                ELSE DEST[i+31:i] ← Convert_Single_Precision_Floating_Point_To_UInteger_Truncate(SRC[i+31:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTTPS2UDQ __m512i _mm512_cvttps_epu32( __m512 a);
VCVTTPS2UDQ __m512i _mm512_mask_cvttps_epu32( __m512i s, __mmask16 k, __m512 a);
VCVTTPS2UDQ __m512i _mm512_maskz_cvttps_epu32( __mmask16 k, __m512 a);
VCVTTPS2UDQ __m512i _mm512_cvtt_roundps_epu32( __m512 a, int sae);
VCVTTPS2UDQ __m512i _mm512_mask_cvtt_roundps_epu32( __m512i s, __mmask16 k, __m512 a, int sae);
VCVTTPS2UDQ __m512i _mm512_maskz_cvtt_roundps_epu32( __mmask16 k, __m512 a, int sae);
VCVTTPS2UDQ __m256i _mm256_mask_cvttps_epu32( __m256i s, __mmask8 k, __m256 a);
VCVTTPS2UDQ __m256i _mm256_maskz_cvttps_epu32( __mmask8 k, __m256 a);
VCVTTPS2UDQ __m128i _mm_mask_cvttps_epu32( __m128i s, __mmask8 k, __m128 a);
VCVTTPS2UDQ __m128i _mm_maskz_cvttps_epu32( __mmask8 k, __m128 a);

SIMD Floating-Point Exceptions
Invalid, Precision

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E2.
#UD If EVEX.vvvv != 1111B.

VCVTTPS2QQ—Convert with Truncation Packed Single-Precision Floating-Point Values to Packed Signed Quadword Integer Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F.W0 7A /r VCVTTPS2QQ xmm1 {k1}{z}, xmm2/m64/m32bcst | HV | V/V | AVX512VL AVX512DQ | Convert two packed single-precision floating-point values from xmm2/m64/m32bcst to two packed signed quadword values in xmm1 using truncation subject to writemask k1.
EVEX.256.66.0F.W0 7A /r VCVTTPS2QQ ymm1 {k1}{z}, xmm2/m128/m32bcst | HV | V/V | AVX512VL AVX512DQ | Convert four packed single-precision floating-point values from xmm2/m128/m32bcst to four packed signed quadword values in ymm1 using truncation subject to writemask k1.
EVEX.512.66.0F.W0 7A /r VCVTTPS2QQ zmm1 {k1}{z}, ymm2/m256/m32bcst{sae} | HV | V/V | AVX512DQ | Convert eight packed single-precision floating-point values from ymm2/m256/m32bcst to eight packed signed quadword values in zmm1 using truncation subject to writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
HV | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
Converts with truncation packed single-precision floating-point values in the source operand to eight signed quadword integers in the destination operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (2^(w-1), where w represents the number of bits in the destination format) is returned.
EVEX encoded versions: The source operand is a YMM/XMM/XMM (low 64 bits) register or a 256/128/64-bit memory location. The destination operand is a vector register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTTPS2QQ (EVEX encoded versions) when src operand is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_Single_Precision_To_QuadInteger_Truncate(SRC[k+31:k])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VCVTTPS2QQ (EVEX encoded versions) when src operand is a memory source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b == 1)
                THEN DEST[i+63:i] ← Convert_Single_Precision_To_QuadInteger_Truncate(SRC[31:0])
                ELSE DEST[i+63:i] ← Convert_Single_Precision_To_QuadInteger_Truncate(SRC[k+31:k])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTTPS2QQ __m512i _mm512_cvttps_epi64( __m256 a);
VCVTTPS2QQ __m512i _mm512_mask_cvttps_epi64( __m512i s, __mmask8 k, __m256 a);
VCVTTPS2QQ __m512i _mm512_maskz_cvttps_epi64( __mmask8 k, __m256 a);
VCVTTPS2QQ __m512i _mm512_cvtt_roundps_epi64( __m256 a, int sae);
VCVTTPS2QQ __m512i _mm512_mask_cvtt_roundps_epi64( __m512i s, __mmask8 k, __m256 a, int sae);
VCVTTPS2QQ __m512i _mm512_maskz_cvtt_roundps_epi64( __mmask8 k, __m256 a, int sae);
VCVTTPS2QQ __m256i _mm256_mask_cvttps_epi64( __m256i s, __mmask8 k, __m128 a);
VCVTTPS2QQ __m256i _mm256_maskz_cvttps_epi64( __mmask8 k, __m128 a);
VCVTTPS2QQ __m128i _mm_mask_cvttps_epi64( __m128i s, __mmask8 k, __m128 a);
VCVTTPS2QQ __m128i _mm_maskz_cvttps_epi64( __mmask8 k, __m128 a);

SIMD Floating-Point Exceptions
Invalid, Precision

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E3.
#UD If EVEX.vvvv != 1111B.

VCVTTPS2UQQ—Convert with Truncation Packed Single-Precision Floating-Point Values to Packed Unsigned Quadword Integer Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F.W0 78 /r VCVTTPS2UQQ xmm1 {k1}{z}, xmm2/m64/m32bcst | HV | V/V | AVX512VL AVX512DQ | Convert two packed single-precision floating-point values from xmm2/m64/m32bcst to two packed unsigned quadword values in xmm1 using truncation subject to writemask k1.
EVEX.256.66.0F.W0 78 /r VCVTTPS2UQQ ymm1 {k1}{z}, xmm2/m128/m32bcst | HV | V/V | AVX512VL AVX512DQ | Convert four packed single-precision floating-point values from xmm2/m128/m32bcst to four packed unsigned quadword values in ymm1 using truncation subject to writemask k1.
EVEX.512.66.0F.W0 78 /r VCVTTPS2UQQ zmm1 {k1}{z}, ymm2/m256/m32bcst{sae} | HV | V/V | AVX512DQ | Convert eight packed single-precision floating-point values from ymm2/m256/m32bcst to eight packed unsigned quadword values in zmm1 using truncation subject to writemask k1.
Instruction Operand Encoding
Op/En HV: Operand 1 = ModRM:reg (w); Operand 2 = ModRM:r/m (r); Operand 3 = NA; Operand 4 = NA.

Description
Converts with truncation up to eight packed single-precision floating-point values in the source operand to unsigned quadword integers in the destination operand. When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
EVEX encoded versions: The source operand is a YMM/XMM/XMM (low 64 bits) register or a 256/128/64-bit memory location. The destination operand is a vector register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTTPS2UQQ (EVEX encoded versions) when src operand is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_Single_Precision_To_UQuadInteger_Truncate(SRC[k+31:k])
        ELSE IF *merging-masking*    ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0    ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VCVTTPS2UQQ (EVEX encoded versions) when src operand is a memory source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN IF (EVEX.b = 1)
                THEN DEST[i+63:i] ← Convert_Single_Precision_To_UQuadInteger_Truncate(SRC[31:0])
                ELSE DEST[i+63:i] ← Convert_Single_Precision_To_UQuadInteger_Truncate(SRC[k+31:k])
             FI;
        ELSE IF *merging-masking*    ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0    ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTTPS2UQQ _mm[_mask[z]]_cvtt[_round]ps_epu64
VCVTTPS2UQQ __m512i _mm512_cvttps_epu64( __m256 a);
VCVTTPS2UQQ __m512i _mm512_mask_cvttps_epu64( __m512i s, __mmask8 k, __m256 a);
VCVTTPS2UQQ __m512i _mm512_maskz_cvttps_epu64( __mmask8 k, __m256 a);
VCVTTPS2UQQ __m512i _mm512_cvtt_roundps_epu64( __m256 a, int sae);
VCVTTPS2UQQ __m512i _mm512_mask_cvtt_roundps_epu64( __m512i s, __mmask8 k, __m256 a, int sae);
VCVTTPS2UQQ __m512i _mm512_maskz_cvtt_roundps_epu64( __mmask8 k, __m256 a, int sae);
VCVTTPS2UQQ __m256i _mm256_mask_cvttps_epu64( __m256i s, __mmask8 k, __m128 a);
VCVTTPS2UQQ __m256i _mm256_maskz_cvttps_epu64( __mmask8 k, __m128 a);
VCVTTPS2UQQ __m128i _mm_mask_cvttps_epu64( __m128i s, __mmask8 k, __m128 a);
VCVTTPS2UQQ __m128i _mm_maskz_cvttps_epu64( __mmask8 k, __m128 a);

SIMD Floating-Point Exceptions
Invalid, Precision

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E3.
#UD If EVEX.vvvv != 1111B.
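As a short, hedged illustration of zeroing-masking on the unsigned form above (assumes AVX512DQ and -mavx512dq; mask bit j selects element j, exactly as in the Operation loop):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 src = _mm256_setr_ps(0.5f, 1.5f, 2.5f, 3.5f,
                                4.5f, 5.5f, 6.5f, 7.5f);
    /* k1 = 0x55: even elements are converted with truncation,
       odd elements are zeroed (EVEX.z = 1). */
    __m512i q = _mm512_maskz_cvttps_epu64(0x55, src);
    unsigned long long out[8];
    _mm512_storeu_si512((void *)out, q);
    for (int i = 0; i < 8; i++)
        printf("%llu ", out[i]);    /* expected: 0 0 2 0 4 0 6 0 */
    printf("\n");
    return 0;
}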
VCVTTSD2USI—Convert with Truncation Scalar Double-Precision Floating-Point Value to Unsigned Integer

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.LIG.F2.0F.W0 78 /r VCVTTSD2USI r32, xmm1/m64{sae} | T1F | V/V | AVX512F | Convert one double-precision floating-point value from xmm1/m64 to one unsigned doubleword integer in r32 using truncation.
EVEX.LIG.F2.0F.W1 78 /r VCVTTSD2USI r64, xmm1/m64{sae} | T1F | V/N.E.(1) | AVX512F | Convert one double-precision floating-point value from xmm1/m64 to one unsigned quadword integer zero-extended into r64 using truncation.
NOTES:
1. For this specific instruction, EVEX.W in non-64 bit is ignored; the instruction behaves as if the W0 version is used.

Instruction Operand Encoding
Op/En T1F: Operand 1 = ModRM:reg (w); Operand 2 = ModRM:r/m (r); Operand 3 = NA; Operand 4 = NA.

Description
Converts with truncation a double-precision floating-point value in the source operand (the second operand) to an unsigned doubleword integer (or unsigned quadword integer if operand size is 64 bits) in the destination operand (the first operand). The source operand can be an XMM register or a 64-bit memory location. The destination operand is a general-purpose register. When the source operand is an XMM register, the double-precision floating-point value is contained in the low quadword of the register.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
EVEX.W1 version: promotes the instruction to produce 64-bit data in 64-bit mode.

Operation
VCVTTSD2USI (EVEX encoded version)
IF 64-Bit Mode and OperandSize = 64
    THEN DEST[63:0] ← Convert_Double_Precision_Floating_Point_To_UInteger_Truncate(SRC[63:0]);
    ELSE DEST[31:0] ← Convert_Double_Precision_Floating_Point_To_UInteger_Truncate(SRC[63:0]);
FI

Intel C/C++ Compiler Intrinsic Equivalent
VCVTTSD2USI unsigned int _mm_cvttsd_u32(__m128d);
VCVTTSD2USI unsigned int _mm_cvtt_roundsd_u32(__m128d, int sae);
VCVTTSD2USI unsigned __int64 _mm_cvttsd_u64(__m128d);
VCVTTSD2USI unsigned __int64 _mm_cvtt_roundsd_u64(__m128d, int sae);

SIMD Floating-Point Exceptions
Invalid, Precision

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E3NF.

VCVTTSS2USI—Convert with Truncation Scalar Single-Precision Floating-Point Value to Unsigned Integer

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.LIG.F3.0F.W0 78 /r VCVTTSS2USI r32, xmm1/m32{sae} | T1F | V/V | AVX512F | Convert one single-precision floating-point value from xmm1/m32 to one unsigned doubleword integer in r32 using truncation.
EVEX.LIG.F3.0F.W1 78 /r VCVTTSS2USI r64, xmm1/m32{sae} | T1F | V/N.E.(1) | AVX512F | Convert one single-precision floating-point value from xmm1/m32 to one unsigned quadword integer in r64 using truncation.
NOTES:
1. For this specific instruction, EVEX.W in non-64 bit is ignored; the instruction behaves as if the W0 version is used.
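Before the encoding details, a brief, hedged usage sketch of the scalar truncating conversions (VCVTTSD2USI above and VCVTTSS2USI below; assumes AVX512F, and 64-bit mode for the 64-bit form):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128d d = _mm_set_sd(123.9);
    __m128  s = _mm_set_ss(7.99f);
    /* Truncate toward zero into unsigned general-purpose integers. */
    unsigned int       u32 = _mm_cvttsd_u32(d);   /* 123 */
    unsigned long long u64 = _mm_cvttss_u64(s);   /* 7; 64-bit mode only */
    printf("%u %llu\n", u32, u64);
    return 0;
}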
Instruction Operand Encoding
Op/En T1F: Operand 1 = ModRM:reg (w); Operand 2 = ModRM:r/m (r); Operand 3 = NA; Operand 4 = NA.

Description
Converts with truncation a single-precision floating-point value in the source operand (the second operand) to an unsigned doubleword integer (or unsigned quadword integer if operand size is 64 bits) in the destination operand (the first operand). The source operand can be an XMM register or a memory location. The destination operand is a general-purpose register. When the source operand is an XMM register, the single-precision floating-point value is contained in the low doubleword of the register.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
EVEX.W1 version: promotes the instruction to produce 64-bit data in 64-bit mode.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTTSS2USI (EVEX encoded version)
IF 64-bit Mode and OperandSize = 64
    THEN DEST[63:0] ← Convert_Single_Precision_Floating_Point_To_UInteger_Truncate(SRC[31:0]);
    ELSE DEST[31:0] ← Convert_Single_Precision_Floating_Point_To_UInteger_Truncate(SRC[31:0]);
FI;

Intel C/C++ Compiler Intrinsic Equivalent
VCVTTSS2USI unsigned int _mm_cvttss_u32( __m128 a);
VCVTTSS2USI unsigned int _mm_cvtt_roundss_u32( __m128 a, int sae);
VCVTTSS2USI unsigned __int64 _mm_cvttss_u64( __m128 a);
VCVTTSS2USI unsigned __int64 _mm_cvtt_roundss_u64( __m128 a, int sae);

SIMD Floating-Point Exceptions
Invalid, Precision

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E3NF.

VCVTUDQ2PD—Convert Packed Unsigned Doubleword Integers to Packed Double-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.F3.0F.W0 7A /r VCVTUDQ2PD xmm1 {k1}{z}, xmm2/m64/m32bcst | HV | V/V | AVX512VL AVX512F | Convert two packed unsigned doubleword integers from xmm2/m64/m32bcst to packed double-precision floating-point values in xmm1 with writemask k1.
EVEX.256.F3.0F.W0 7A /r VCVTUDQ2PD ymm1 {k1}{z}, xmm2/m128/m32bcst | HV | V/V | AVX512VL AVX512F | Convert four packed unsigned doubleword integers from xmm2/m128/m32bcst to packed double-precision floating-point values in ymm1 with writemask k1.
EVEX.512.F3.0F.W0 7A /r VCVTUDQ2PD zmm1 {k1}{z}, ymm2/m256/m32bcst | HV | V/V | AVX512F | Convert eight packed unsigned doubleword integers from ymm2/m256/m32bcst to eight packed double-precision floating-point values in zmm1 with writemask k1.

Instruction Operand Encoding
Op/En HV: Operand 1 = ModRM:reg (w); Operand 2 = ModRM:r/m (r); Operand 3 = NA; Operand 4 = NA.

Description
Converts packed unsigned doubleword integers in the source operand (second operand) to packed double-precision floating-point values in the destination operand (first operand).
The source operand is a YMM/XMM/XMM (low 64 bits) register, a 256/128/64-bit memory location or a 256/128/64-bit vector broadcasted from a 32-bit memory location.
The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1. An attempt to encode this instruction with EVEX embedded rounding is ignored.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTUDQ2PD (EVEX encoded versions) when src operand is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_UInteger_To_Double_Precision_Floating_Point(SRC[k+31:k])
        ELSE IF *merging-masking*    ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0    ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VCVTUDQ2PD (EVEX encoded versions) when src operand is a memory source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN IF (EVEX.b = 1)
                THEN DEST[i+63:i] ← Convert_UInteger_To_Double_Precision_Floating_Point(SRC[31:0])
                ELSE DEST[i+63:i] ← Convert_UInteger_To_Double_Precision_Floating_Point(SRC[k+31:k])
             FI;
        ELSE IF *merging-masking*    ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0    ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTUDQ2PD __m512d _mm512_cvtepu32_pd( __m256i a);
VCVTUDQ2PD __m512d _mm512_mask_cvtepu32_pd( __m512d s, __mmask8 k, __m256i a);
VCVTUDQ2PD __m512d _mm512_maskz_cvtepu32_pd( __mmask8 k, __m256i a);
VCVTUDQ2PD __m256d _mm256_cvtepu32_pd( __m128i a);
VCVTUDQ2PD __m256d _mm256_mask_cvtepu32_pd( __m256d s, __mmask8 k, __m128i a);
VCVTUDQ2PD __m256d _mm256_maskz_cvtepu32_pd( __mmask8 k, __m128i a);
VCVTUDQ2PD __m128d _mm_cvtepu32_pd( __m128i a);
VCVTUDQ2PD __m128d _mm_mask_cvtepu32_pd( __m128d s, __mmask8 k, __m128i a);
VCVTUDQ2PD __m128d _mm_maskz_cvtepu32_pd( __mmask8 k, __m128i a);

SIMD Floating-Point Exceptions
None

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E5.
#UD If EVEX.vvvv != 1111B.

VCVTUDQ2PS—Convert Packed Unsigned Doubleword Integers to Packed Single-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.F2.0F.W0 7A /r VCVTUDQ2PS xmm1 {k1}{z}, xmm2/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Convert four packed unsigned doubleword integers from xmm2/m128/m32bcst to packed single-precision floating-point values in xmm1 with writemask k1.
EVEX.256.F2.0F.W0 7A /r VCVTUDQ2PS ymm1 {k1}{z}, ymm2/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Convert eight packed unsigned doubleword integers from ymm2/m256/m32bcst to packed single-precision floating-point values in ymm1 with writemask k1.
EVEX.512.F2.0F.W0 7A /r VCVTUDQ2PS zmm1 {k1}{z}, zmm2/m512/m32bcst{er} | FV | V/V | AVX512F | Convert sixteen packed unsigned doubleword integers from zmm2/m512/m32bcst to sixteen packed single-precision floating-point values in zmm1 with writemask k1.

Instruction Operand Encoding
Op/En FV: Operand 1 = ModRM:reg (w); Operand 2 = ModRM:r/m (r); Operand 3 = NA; Operand 4 = NA.

Description
Converts packed unsigned doubleword integers in the source operand (second operand) to single-precision floating-point values in the destination operand (first operand).
The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTUDQ2PS (EVEX encoded version) when src operand is a register
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← Convert_UInteger_To_Single_Precision_Floating_Point(SRC[i+31:i])
        ELSE IF *merging-masking*    ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0    ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VCVTUDQ2PS (EVEX encoded version) when src operand is a memory source
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN IF (EVEX.b = 1)
                THEN DEST[i+31:i] ← Convert_UInteger_To_Single_Precision_Floating_Point(SRC[31:0])
                ELSE DEST[i+31:i] ← Convert_UInteger_To_Single_Precision_Floating_Point(SRC[i+31:i])
             FI;
        ELSE IF *merging-masking*    ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0    ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTUDQ2PS __m512 _mm512_cvtepu32_ps( __m512i a);
VCVTUDQ2PS __m512 _mm512_mask_cvtepu32_ps( __m512 s, __mmask16 k, __m512i a);
VCVTUDQ2PS __m512 _mm512_maskz_cvtepu32_ps( __mmask16 k, __m512i a);
VCVTUDQ2PS __m512 _mm512_cvt_roundepu32_ps( __m512i a, int r);
VCVTUDQ2PS __m512 _mm512_mask_cvt_roundepu32_ps( __m512 s, __mmask16 k, __m512i a, int r);
VCVTUDQ2PS __m512 _mm512_maskz_cvt_roundepu32_ps( __mmask16 k, __m512i a, int r);
VCVTUDQ2PS __m256 _mm256_cvtepu32_ps( __m256i a);
VCVTUDQ2PS __m256 _mm256_mask_cvtepu32_ps( __m256 s, __mmask8 k, __m256i a);
VCVTUDQ2PS __m256 _mm256_maskz_cvtepu32_ps( __mmask8 k, __m256i a);
VCVTUDQ2PS __m128 _mm_cvtepu32_ps( __m128i a);
VCVTUDQ2PS __m128 _mm_mask_cvtepu32_ps( __m128 s, __mmask8 k, __m128i a);
VCVTUDQ2PS __m128 _mm_maskz_cvtepu32_ps( __mmask8 k, __m128i a);

SIMD Floating-Point Exceptions
Precision

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E2.
#UD If EVEX.vvvv != 1111B.

VCVTUQQ2PD—Convert Packed Unsigned Quadword Integers to Packed Double-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.F3.0F.W1 7A /r VCVTUQQ2PD xmm1 {k1}{z}, xmm2/m128/m64bcst | FV | V/V | AVX512VL AVX512DQ | Convert two packed unsigned quadword integers from xmm2/m128/m64bcst to two packed double-precision floating-point values in xmm1 with writemask k1.
EVEX.256.F3.0F.W1 7A /r VCVTUQQ2PD ymm1 {k1}{z}, ymm2/m256/m64bcst | FV | V/V | AVX512VL AVX512DQ | Convert four packed unsigned quadword integers from ymm2/m256/m64bcst to packed double-precision floating-point values in ymm1 with writemask k1.
EVEX.512.F3.0F.W1 7A /r VCVTUQQ2PD zmm1 {k1}{z}, zmm2/m512/m64bcst{er} | FV | V/V | AVX512DQ | Convert eight packed unsigned quadword integers from zmm2/m512/m64bcst to eight packed double-precision floating-point values in zmm1 with writemask k1.
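Before the encoding details of VCVTUQQ2PD, a hedged sketch of why the unsigned conversions in the preceding entries matter: a signed conversion would misinterpret doubleword values with bit 31 set (assumes AVX512F plus AVX512VL, e.g. -mavx512vl):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* 0xFFFFFFFF is -1 under a signed interpretation; the unsigned
       conversion yields 4294967295, rounded to the nearest float. */
    __m128i u = _mm_setr_epi32(0, 1, (int)0x80000000, (int)0xFFFFFFFF);
    __m128 f = _mm_cvtepu32_ps(u);    /* VCVTUDQ2PS */
    float out[4];
    _mm_storeu_ps(out, f);
    printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
    /* expected: 0.0 1.0 2147483648.0 4294967296.0 (the last value is
       inexact and rounds per MXCSR.RC, signaling Precision) */
    return 0;
}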
Instruction Operand Encoding
Op/En FV: Operand 1 = ModRM:reg (w); Operand 2 = ModRM:r/m (r); Operand 3 = NA; Operand 4 = NA.

Description
Converts packed unsigned quadword integers in the source operand (second operand) to packed double-precision floating-point values in the destination operand (first operand).
The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTUQQ2PD (EVEX encoded version) when src operand is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← Convert_UQuadInteger_To_Double_Precision_Floating_Point(SRC[i+63:i])
        ELSE IF *merging-masking*    ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0    ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VCVTUQQ2PD (EVEX encoded version) when src operand is a memory source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN IF (EVEX.b = 1)
                THEN DEST[i+63:i] ← Convert_UQuadInteger_To_Double_Precision_Floating_Point(SRC[63:0])
                ELSE DEST[i+63:i] ← Convert_UQuadInteger_To_Double_Precision_Floating_Point(SRC[i+63:i])
             FI;
        ELSE IF *merging-masking*    ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0    ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTUQQ2PD __m512d _mm512_cvtepu64_pd( __m512i a);
VCVTUQQ2PD __m512d _mm512_mask_cvtepu64_pd( __m512d s, __mmask8 k, __m512i a);
VCVTUQQ2PD __m512d _mm512_maskz_cvtepu64_pd( __mmask8 k, __m512i a);
VCVTUQQ2PD __m512d _mm512_cvt_roundepu64_pd( __m512i a, int r);
VCVTUQQ2PD __m512d _mm512_mask_cvt_roundepu64_pd( __m512d s, __mmask8 k, __m512i a, int r);
VCVTUQQ2PD __m512d _mm512_maskz_cvt_roundepu64_pd( __mmask8 k, __m512i a, int r);
VCVTUQQ2PD __m256d _mm256_cvtepu64_pd( __m256i a);
VCVTUQQ2PD __m256d _mm256_mask_cvtepu64_pd( __m256d s, __mmask8 k, __m256i a);
VCVTUQQ2PD __m256d _mm256_maskz_cvtepu64_pd( __mmask8 k, __m256i a);
VCVTUQQ2PD __m128d _mm_cvtepu64_pd( __m128i a);
VCVTUQQ2PD __m128d _mm_mask_cvtepu64_pd( __m128d s, __mmask8 k, __m128i a);
VCVTUQQ2PD __m128d _mm_maskz_cvtepu64_pd( __mmask8 k, __m128i a);

SIMD Floating-Point Exceptions
Precision

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E2.
#UD If EVEX.vvvv != 1111B.

VCVTUQQ2PS—Convert Packed Unsigned Quadword Integers to Packed Single-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.F2.0F.W1 7A /r VCVTUQQ2PS xmm1 {k1}{z}, xmm2/m128/m64bcst | FV | V/V | AVX512VL AVX512DQ | Convert two packed unsigned quadword integers from xmm2/m128/m64bcst to packed single-precision floating-point values in xmm1 with writemask k1.
EVEX.256.F2.0F.W1 7A /r VCVTUQQ2PS xmm1 {k1}{z}, ymm2/m256/m64bcst | FV | V/V | AVX512VL AVX512DQ | Convert four packed unsigned quadword integers from ymm2/m256/m64bcst to packed single-precision floating-point values in xmm1 with writemask k1.
EVEX.512.F2.0F.W1 7A /r VCVTUQQ2PS ymm1 {k1}{z}, zmm2/m512/m64bcst{er} | FV | V/V | AVX512DQ | Convert eight packed unsigned quadword integers from zmm2/m512/m64bcst to eight packed single-precision floating-point values in ymm1 with writemask k1.

Instruction Operand Encoding
Op/En FV: Operand 1 = ModRM:reg (w); Operand 2 = ModRM:r/m (r); Operand 3 = NA; Operand 4 = NA.

Description
Converts packed unsigned quadword integers in the source operand (second operand) to single-precision floating-point values in the destination operand (first operand).
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a YMM/XMM/XMM (low 64 bits) register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VCVTUQQ2PS (EVEX encoded version) when src operand is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    k ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← Convert_UQuadInteger_To_Single_Precision_Floating_Point(SRC[k+63:k])
        ELSE IF *merging-masking*    ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0    ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL/2] ← 0

VCVTUQQ2PS (EVEX encoded version) when src operand is a memory source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    k ← j * 64
    IF k1[j] OR *no writemask*
        THEN IF (EVEX.b = 1)
                THEN DEST[i+31:i] ← Convert_UQuadInteger_To_Single_Precision_Floating_Point(SRC[63:0])
                ELSE DEST[i+31:i] ← Convert_UQuadInteger_To_Single_Precision_Floating_Point(SRC[k+63:k])
             FI;
        ELSE IF *merging-masking*    ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0    ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL/2] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTUQQ2PS __m256 _mm512_cvtepu64_ps( __m512i a);
VCVTUQQ2PS __m256 _mm512_mask_cvtepu64_ps( __m256 s, __mmask8 k, __m512i a);
VCVTUQQ2PS __m256 _mm512_maskz_cvtepu64_ps( __mmask8 k, __m512i a);
VCVTUQQ2PS __m256 _mm512_cvt_roundepu64_ps( __m512i a, int r);
VCVTUQQ2PS __m256 _mm512_mask_cvt_roundepu64_ps( __m256 s, __mmask8 k, __m512i a, int r);
VCVTUQQ2PS __m256 _mm512_maskz_cvt_roundepu64_ps( __mmask8 k, __m512i a, int r);
VCVTUQQ2PS __m128 _mm256_cvtepu64_ps( __m256i a);
VCVTUQQ2PS __m128 _mm256_mask_cvtepu64_ps( __m128 s, __mmask8 k, __m256i a);
VCVTUQQ2PS __m128 _mm256_maskz_cvtepu64_ps( __mmask8 k, __m256i a);
VCVTUQQ2PS __m128 _mm_cvtepu64_ps( __m128i a);
VCVTUQQ2PS __m128 _mm_mask_cvtepu64_ps( __m128 s, __mmask8 k, __m128i a);
VCVTUQQ2PS __m128 _mm_maskz_cvtepu64_ps( __mmask8 k, __m128i a);

SIMD Floating-Point Exceptions
Precision

Other Exceptions
EVEX-encoded instructions, see Exceptions Type E2.
#UD If EVEX.vvvv != 1111B.
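A hedged usage sketch of the narrowing conversion above (four unsigned quadwords to four single-precision values in an XMM register; assumes AVX512DQ plus AVX512VL):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256i q = _mm256_setr_epi64x(0, 1, 1LL << 40, -1 /* 2^64 - 1 */);
    __m128 f = _mm256_cvtepu64_ps(q);    /* VCVTUQQ2PS */
    float out[4];
    _mm_storeu_ps(out, f);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    /* expected: 0 1 1.09951e+12 1.84467e+19; the last value is inexact
       and rounds per MXCSR.RC, raising Precision. */
    return 0;
}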
VCVTUSI2SD—Convert Unsigned Integer to Scalar Double-Precision Floating-Point Value

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.LIG.F2.0F.W0 7B /r VCVTUSI2SD xmm1, xmm2, r/m32 | T1S | V/V | AVX512F | Convert one unsigned doubleword integer from r/m32 to one double-precision floating-point value in xmm1.
EVEX.NDS.LIG.F2.0F.W1 7B /r VCVTUSI2SD xmm1, xmm2, r/m64{er} | T1S | V/N.E.(1) | AVX512F | Convert one unsigned quadword integer from r/m64 to one double-precision floating-point value in xmm1.
NOTES:
1. For this specific instruction, EVEX.W in non-64 bit is ignored; the instruction behaves as if the W0 version is used.

Instruction Operand Encoding
Op/En T1S: Operand 1 = ModRM:reg (w); Operand 2 = EVEX.vvvv; Operand 3 = ModRM:r/m (r); Operand 4 = NA.

Description
Converts an unsigned doubleword integer (or unsigned quadword integer if operand size is 64 bits) in the second source operand to a double-precision floating-point value in the destination operand. The result is stored in the low quadword of the destination operand. When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register.
The second source operand can be a general-purpose register or a 32/64-bit memory location. The first source and destination operands are XMM registers. Bits (127:64) of the XMM register destination are copied from the corresponding bits in the first source operand. Bits (MAX_VL-1:128) of the destination register are zeroed.
EVEX.W1 version: promotes the instruction to use a 64-bit input value in 64-bit mode.
EVEX.W0 version: an attempt to encode this instruction with EVEX embedded rounding is ignored.

Operation
VCVTUSI2SD (EVEX encoded version)
IF (SRC2 *is register*) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF 64-Bit Mode And OperandSize = 64
    THEN DEST[63:0] ← Convert_UInteger_To_Double_Precision_Floating_Point(SRC2[63:0]);
    ELSE DEST[63:0] ← Convert_UInteger_To_Double_Precision_Floating_Point(SRC2[31:0]);
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAX_VL-1:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTUSI2SD __m128d _mm_cvtu32_sd( __m128d s, unsigned a);
VCVTUSI2SD __m128d _mm_cvtu64_sd( __m128d s, unsigned __int64 a);
VCVTUSI2SD __m128d _mm_cvt_roundu64_sd( __m128d s, unsigned __int64 a, int r);

SIMD Floating-Point Exceptions
Precision

Other Exceptions
See Exceptions Type E3NF if W1, else type E10NF.

VCVTUSI2SS—Convert Unsigned Integer to Scalar Single-Precision Floating-Point Value

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.LIG.F3.0F.W0 7B /r VCVTUSI2SS xmm1, xmm2, r/m32{er} | T1S | V/V | AVX512F | Convert one unsigned doubleword integer from r/m32 to one single-precision floating-point value in xmm1.
EVEX.NDS.LIG.F3.0F.W1 7B /r VCVTUSI2SS xmm1, xmm2, r/m64{er} | T1S | V/N.E.(1) | AVX512F | Convert one unsigned quadword integer from r/m64 to one single-precision floating-point value in xmm1.
NOTES:
1. For this specific instruction, EVEX.W in non-64 bit is ignored; the instruction behaves as if the W0 version is used.
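Before the encoding table, a hedged sketch of the embedded-rounding form of VCVTUSI2SD from the previous entry (assumes AVX512F and 64-bit mode; the W1 form accepts an {er} override in place of MXCSR.RC):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128d hi = _mm_set_sd(0.0);    /* supplies bits 127:64 of the result */
    unsigned long long big = 0xFFFFFFFFFFFFFFFFULL;  /* not exactly representable */
    /* Override the rounding mode for this one conversion: round toward zero. */
    __m128d r = _mm_cvt_roundu64_sd(hi, big,
                                    _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
    printf("%.17g\n", _mm_cvtsd_f64(r));   /* 18446744073709549568 */
    return 0;
}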
Instruction Operand Encoding
Op/En T1S: Operand 1 = ModRM:reg (w); Operand 2 = EVEX.vvvv; Operand 3 = ModRM:r/m (r); Operand 4 = NA.

Description
Converts an unsigned doubleword integer (or unsigned quadword integer if operand size is 64 bits) in the second source operand to a single-precision floating-point value in the destination operand (first operand). The result is stored in the low doubleword of the destination operand. When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits.
The second source operand can be a general-purpose register or a 32/64-bit memory location. The first source and destination operands are XMM registers. Bits (127:32) of the XMM register destination are copied from the corresponding bits in the first source operand. Bits (MAX_VL-1:128) of the destination register are zeroed.
EVEX.W1 version: promotes the instruction to use a 64-bit input value in 64-bit mode.

Operation
VCVTUSI2SS (EVEX encoded version)
IF (SRC2 *is register*) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF 64-Bit Mode And OperandSize = 64
    THEN DEST[31:0] ← Convert_UInteger_To_Single_Precision_Floating_Point(SRC2[63:0]);
    ELSE DEST[31:0] ← Convert_UInteger_To_Single_Precision_Floating_Point(SRC2[31:0]);
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAX_VL-1:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VCVTUSI2SS __m128 _mm_cvtu32_ss( __m128 s, unsigned a);
VCVTUSI2SS __m128 _mm_cvt_roundu32_ss( __m128 s, unsigned a, int r);
VCVTUSI2SS __m128 _mm_cvtu64_ss( __m128 s, unsigned __int64 a);
VCVTUSI2SS __m128 _mm_cvt_roundu64_ss( __m128 s, unsigned __int64 a, int r);

SIMD Floating-Point Exceptions
Precision

Other Exceptions
See Exceptions Type E3NF.

VDBPSADBW—Double Block Packed Sum-Absolute-Differences (SAD) on Unsigned Bytes

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.66.0F3A.W0 42 /r ib VDBPSADBW xmm1 {k1}{z}, xmm2, xmm3/m128, imm8 | FVM | V/V | AVX512VL AVX512BW | Compute packed SAD word results of unsigned bytes in dword block from xmm2 with unsigned bytes of dword blocks transformed from xmm3/m128 using the shuffle controls in imm8. Results are written to xmm1 under the writemask k1.
EVEX.NDS.256.66.0F3A.W0 42 /r ib VDBPSADBW ymm1 {k1}{z}, ymm2, ymm3/m256, imm8 | FVM | V/V | AVX512VL AVX512BW | Compute packed SAD word results of unsigned bytes in dword block from ymm2 with unsigned bytes of dword blocks transformed from ymm3/m256 using the shuffle controls in imm8. Results are written to ymm1 under the writemask k1.
EVEX.NDS.512.66.0F3A.W0 42 /r ib VDBPSADBW zmm1 {k1}{z}, zmm2, zmm3/m512, imm8 | FVM | V/V | AVX512BW | Compute packed SAD word results of unsigned bytes in dword block from zmm2 with unsigned bytes of dword blocks transformed from zmm3/m512 using the shuffle controls in imm8. Results are written to zmm1 under the writemask k1.

Instruction Operand Encoding
Op/En FVM: Operand 1 = ModRM:reg (w); Operand 2 = EVEX.vvvv; Operand 3 = ModRM:r/m (r); Operand 4 = Imm8.

Description
Compute packed SAD (sum of absolute differences) word results of unsigned bytes from two 32-bit dword elements.
Packed SAD word results are calculated in multiples of qword superblocks, producing 4 SAD word results in each 64-bit superblock of the destination register.
Within each superblock of packed word results, the SAD results from two 32-bit dword elements are calculated as follows:
• The lower two word results are each calculated from the SAD operation between a sliding dword element within a qword superblock of an intermediate vector and a stationary dword element in the corresponding qword superblock of the first source operand. The intermediate vector, see "Tmp1" in Figure 5-8, is constructed from the second source operand, with the imm8 byte as shuffle control to select dword elements within a 128-bit lane of the second source operand. The two sliding dword elements in a qword superblock of Tmp1 are located at byte offsets 0 and 1 within the superblock, respectively. The stationary dword element in the qword superblock from the first source operand is located at byte offset 0.
• The next two word results are each calculated from the SAD operation between a sliding dword element within a qword superblock of the intermediate vector Tmp1 and a second stationary dword element in the corresponding qword superblock of the first source operand. The two sliding dword elements in a qword superblock of Tmp1 are located at byte offsets 2 and 3 within the superblock, respectively. The stationary dword element in the qword superblock from the first source operand is located at byte offset 4.
• The intermediate vector is constructed in 128-bit lanes. Within each 128-bit lane, each dword element of the intermediate vector is selected by a two-bit field within the imm8 byte from the corresponding 128 bits of the second source operand. The imm8 byte serves as dword shuffle control within each 128-bit lane of the intermediate vector and the second source operand, similarly to PSHUFD.
The first source operand is a ZMM/YMM/XMM register. The second source operand is a ZMM/YMM/XMM register, or a 512/256/128-bit memory location. The destination operand is conditionally updated based on writemask k1 at 16-bit word granularity.
[Figure 5-8. 64-bit Super Block of SAD Operation in VDBPSADBW: the imm8 two-bit fields select dword elements DW0-DW3 within each 128-bit lane of the second source operand to form Tmp1; within each qword superblock, the sliding dwords of Tmp1 at byte offsets 0-3 are SAD-ed against the stationary dwords of the first source operand at byte offsets 0 and 4 to produce the four destination words.]
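Because the superblock dataflow is easiest to check against scalar code, here is a hedged C reference model of one 64-bit superblock (the function name and byte-array interface are illustrative, not from this manual; the offsets match the Operation section below):

#include <stdint.h>
#include <stdlib.h>

/* One qword superblock of VDBPSADBW: src1 and tmp1 are byte views of the
   corresponding superblocks of the first source and of the shuffled Tmp1. */
static void dbpsadbw_superblock(const uint8_t src1[8], const uint8_t tmp1[8],
                                uint16_t dest[4])
{
    /* Word w compares the Tmp1 sliding dword at byte offset t_off[w]
       with the Src1 stationary dword at byte offset s_off[w]. */
    static const int t_off[4] = {0, 1, 2, 3};
    static const int s_off[4] = {0, 0, 4, 4};
    for (int w = 0; w < 4; w++) {
        uint16_t sad = 0;
        for (int b = 0; b < 4; b++)
            sad += (uint16_t)abs(src1[s_off[w] + b] - tmp1[t_off[w] + b]);
        dest[w] = sad;
    }
}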
Operation
VDBPSADBW (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
Selection of quadruplets:
FOR I ← 0 to VL step 128
    TMP1[I+31:I] ← select (SRC2[I+127:I], imm8[1:0])
    TMP1[I+63:I+32] ← select (SRC2[I+127:I], imm8[3:2])
    TMP1[I+95:I+64] ← select (SRC2[I+127:I], imm8[5:4])
    TMP1[I+127:I+96] ← select (SRC2[I+127:I], imm8[7:6])
ENDFOR
SAD of quadruplets:
FOR I ← 0 to VL step 64
    TMP_DEST[I+15:I] ← ABS(SRC1[I+7:I] - TMP1[I+7:I]) + ABS(SRC1[I+15:I+8] - TMP1[I+15:I+8]) + ABS(SRC1[I+23:I+16] - TMP1[I+23:I+16]) + ABS(SRC1[I+31:I+24] - TMP1[I+31:I+24])
    TMP_DEST[I+31:I+16] ← ABS(SRC1[I+7:I] - TMP1[I+15:I+8]) + ABS(SRC1[I+15:I+8] - TMP1[I+23:I+16]) + ABS(SRC1[I+23:I+16] - TMP1[I+31:I+24]) + ABS(SRC1[I+31:I+24] - TMP1[I+39:I+32])
    TMP_DEST[I+47:I+32] ← ABS(SRC1[I+39:I+32] - TMP1[I+23:I+16]) + ABS(SRC1[I+47:I+40] - TMP1[I+31:I+24]) + ABS(SRC1[I+55:I+48] - TMP1[I+39:I+32]) + ABS(SRC1[I+63:I+56] - TMP1[I+47:I+40])
    TMP_DEST[I+63:I+48] ← ABS(SRC1[I+39:I+32] - TMP1[I+31:I+24]) + ABS(SRC1[I+47:I+40] - TMP1[I+39:I+32]) + ABS(SRC1[I+55:I+48] - TMP1[I+47:I+40]) + ABS(SRC1[I+63:I+56] - TMP1[I+55:I+48])
ENDFOR
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← TMP_DEST[i+15:i]
        ELSE IF *merging-masking*    ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE DEST[i+15:i] ← 0    ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VDBPSADBW __m512i _mm512_dbsad_epu8(__m512i a, __m512i b);
VDBPSADBW __m512i _mm512_mask_dbsad_epu8(__m512i s, __mmask32 m, __m512i a, __m512i b);
VDBPSADBW __m512i _mm512_maskz_dbsad_epu8(__mmask32 m, __m512i a, __m512i b);
VDBPSADBW __m256i _mm256_dbsad_epu8(__m256i a, __m256i b);
VDBPSADBW __m256i _mm256_mask_dbsad_epu8(__m256i s, __mmask16 m, __m256i a, __m256i b);
VDBPSADBW __m256i _mm256_maskz_dbsad_epu8(__mmask16 m, __m256i a, __m256i b);
VDBPSADBW __m128i _mm_dbsad_epu8(__m128i a, __m128i b);
VDBPSADBW __m128i _mm_mask_dbsad_epu8(__m128i s, __mmask8 m, __m128i a, __m128i b);
VDBPSADBW __m128i _mm_maskz_dbsad_epu8(__mmask8 m, __m128i a, __m128i b);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type E4NF.nb.

VEXPANDPD—Load Sparse Packed Double-Precision Floating-Point Values from Dense Memory

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W1 88 /r VEXPANDPD xmm1 {k1}{z}, xmm2/m128 | T1S | V/V | AVX512VL AVX512F | Expand packed double-precision floating-point values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.66.0F38.W1 88 /r VEXPANDPD ymm1 {k1}{z}, ymm2/m256 | T1S | V/V | AVX512VL AVX512F | Expand packed double-precision floating-point values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.66.0F38.W1 88 /r VEXPANDPD zmm1 {k1}{z}, zmm2/m512 | T1S | V/V | AVX512F | Expand packed double-precision floating-point values from zmm2/m512 to zmm1 using writemask k1.
Instruction Operand Encoding
Op/En T1S: Operand 1 = ModRM:reg (w); Operand 2 = ModRM:r/m (r); Operand 3 = NA; Operand 4 = NA.

Description
Expand (load) up to 8/4/2 contiguous double-precision floating-point values of the input vector in the source operand (the second operand) to sparse elements in the destination operand (the first operand) selected by the writemask k1. The destination operand is a ZMM/YMM/XMM register; the source operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The input vector starts from the lowest element in the source operand. The writemask register k1 selects the destination elements (a partial vector or sparse elements if less than 8 elements) to be replaced by the ascending elements in the input vector. Destination elements not selected by the writemask k1 are either unmodified or zeroed, depending on EVEX.z.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element instead of the size of the full vector.

Operation
VEXPANDPD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
k ← 0
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            DEST[i+63:i] ← SRC[k+63:k];
            k ← k + 64
        ELSE IF *merging-masking*    ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0    ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VEXPANDPD __m512d _mm512_mask_expand_pd( __m512d s, __mmask8 k, __m512d a);
VEXPANDPD __m512d _mm512_maskz_expand_pd( __mmask8 k, __m512d a);
VEXPANDPD __m512d _mm512_mask_expandloadu_pd( __m512d s, __mmask8 k, void * a);
VEXPANDPD __m512d _mm512_maskz_expandloadu_pd( __mmask8 k, void * a);
VEXPANDPD __m256d _mm256_mask_expand_pd( __m256d s, __mmask8 k, __m256d a);
VEXPANDPD __m256d _mm256_maskz_expand_pd( __mmask8 k, __m256d a);
VEXPANDPD __m256d _mm256_mask_expandloadu_pd( __m256d s, __mmask8 k, void * a);
VEXPANDPD __m256d _mm256_maskz_expandloadu_pd( __mmask8 k, void * a);
VEXPANDPD __m128d _mm_mask_expand_pd( __m128d s, __mmask8 k, __m128d a);
VEXPANDPD __m128d _mm_maskz_expand_pd( __mmask8 k, __m128d a);
VEXPANDPD __m128d _mm_mask_expandloadu_pd( __m128d s, __mmask8 k, void * a);
VEXPANDPD __m128d _mm_maskz_expandloadu_pd( __mmask8 k, void * a);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type E4.nb.
#UD If EVEX.vvvv != 1111B.

VEXPANDPS—Load Sparse Packed Single-Precision Floating-Point Values from Dense Memory

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 88 /r VEXPANDPS xmm1 {k1}{z}, xmm2/m128 | T1S | V/V | AVX512VL AVX512F | Expand packed single-precision floating-point values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.66.0F38.W0 88 /r VEXPANDPS ymm1 {k1}{z}, ymm2/m256 | T1S | V/V | AVX512VL AVX512F | Expand packed single-precision floating-point values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.66.0F38.W0 88 /r VEXPANDPS zmm1 {k1}{z}, zmm2/m512 | T1S | V/V | AVX512F | Expand packed single-precision floating-point values from zmm2/m512 to zmm1 using writemask k1.
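Before the encoding details of VEXPANDPS, a hedged usage sketch of the expand-load form of VEXPANDPD above (assumes AVX512F; dense memory elements are scattered, in ascending order, into the lanes selected by the mask):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    double dense[4] = {1.0, 2.0, 3.0, 4.0};    /* contiguous input vector */
    /* Mask 0xB2 = 10110010b selects lanes 1, 4, 5 and 7;
       unselected lanes are zeroed (EVEX.z = 1). */
    __m512d v = _mm512_maskz_expandloadu_pd(0xB2, dense);
    double out[8];
    _mm512_storeu_pd(out, v);
    for (int i = 0; i < 8; i++)
        printf("%.0f ", out[i]);    /* expected: 0 1 0 0 2 3 0 4 */
    printf("\n");
    return 0;
}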
Instruction Operand Encoding
Op/En T1S: Operand 1 = ModRM:reg (w); Operand 2 = ModRM:r/m (r); Operand 3 = NA; Operand 4 = NA.

Description
Expand (load) up to 16/8/4 contiguous single-precision floating-point values of the input vector in the source operand (the second operand) to sparse elements of the destination operand (the first operand) selected by the writemask k1. The destination operand is a ZMM/YMM/XMM register; the source operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The input vector starts from the lowest element in the source operand. The writemask k1 selects the destination elements (a partial vector or sparse elements if less than 16 elements) to be replaced by the ascending elements in the input vector. Destination elements not selected by the writemask k1 are either unmodified or zeroed, depending on EVEX.z.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element instead of the size of the full vector.

Operation
VEXPANDPS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
k ← 0
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            DEST[i+31:i] ← SRC[k+31:k];
            k ← k + 32
        ELSE IF *merging-masking*    ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0    ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VEXPANDPS __m512 _mm512_mask_expand_ps( __m512 s, __mmask16 k, __m512 a);
VEXPANDPS __m512 _mm512_maskz_expand_ps( __mmask16 k, __m512 a);
VEXPANDPS __m512 _mm512_mask_expandloadu_ps( __m512 s, __mmask16 k, void * a);
VEXPANDPS __m512 _mm512_maskz_expandloadu_ps( __mmask16 k, void * a);
VEXPANDPS __m256 _mm256_mask_expand_ps( __m256 s, __mmask8 k, __m256 a);
VEXPANDPS __m256 _mm256_maskz_expand_ps( __mmask8 k, __m256 a);
VEXPANDPS __m256 _mm256_mask_expandloadu_ps( __m256 s, __mmask8 k, void * a);
VEXPANDPS __m256 _mm256_maskz_expandloadu_ps( __mmask8 k, void * a);
VEXPANDPS __m128 _mm_mask_expand_ps( __m128 s, __mmask8 k, __m128 a);
VEXPANDPS __m128 _mm_maskz_expand_ps( __mmask8 k, __m128 a);
VEXPANDPS __m128 _mm_mask_expandloadu_ps( __m128 s, __mmask8 k, void * a);
VEXPANDPS __m128 _mm_maskz_expandloadu_ps( __mmask8 k, void * a);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type E4.nb.
#UD If EVEX.vvvv != 1111B.

VERR/VERW—Verify a Segment for Reading or Writing

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
0F 00 /4 | VERR r/m16 | M | Valid | Valid | Set ZF=1 if segment specified with r/m16 can be read.
0F 00 /5 | VERW r/m16 | M | Valid | Valid | Set ZF=1 if segment specified with r/m16 can be written.

Instruction Operand Encoding
Op/En M: Operand 1 = ModRM:r/m (r); Operand 2 = NA; Operand 3 = NA; Operand 4 = NA.

Description
Verifies whether the code or data segment specified with the source operand is readable (VERR) or writable (VERW) from the current privilege level (CPL). The source operand is a 16-bit register or a memory location that contains the segment selector for the segment to be verified. If the segment is accessible and readable (VERR) or writable (VERW), the ZF flag is set; otherwise, the ZF flag is cleared. Code segments are never verified as writable.
This check cannot be performed on system segments.
To set the ZF flag, the following conditions must be met:
• The segment selector is not NULL.
• The selector must denote a descriptor within the bounds of the descriptor table (GDT or LDT).
• The selector must denote the descriptor of a code or data segment (not that of a system segment or gate).
• For the VERR instruction, the segment must be readable.
• For the VERW instruction, the segment must be a writable data segment.
• If the segment is not a conforming code segment, the segment's DPL must be greater than or equal to (have less or the same privilege as) both the CPL and the segment selector's RPL.
The validation performed is the same as is performed when a segment selector is loaded into the DS, ES, FS, or GS register, and the indicated access (read or write) is performed. The segment selector's value cannot result in a protection exception, enabling the software to anticipate possible segment access problems.
This instruction's operation is the same in non-64-bit modes and 64-bit mode. The operand size is fixed at 16 bits.

Operation
IF SRC(Offset) > (GDTR(Limit) or (LDTR(Limit))
    THEN ZF ← 0; FI;
Read segment descriptor;
IF SegmentDescriptor(DescriptorType) = 0 (* System segment *)
    or (SegmentDescriptor(Type) ≠ conforming code segment)
    and (CPL > DPL) or (RPL > DPL)
        THEN ZF ← 0;
        ELSE
            IF ((Instruction = VERR) and (Segment readable))
            or ((Instruction = VERW) and (Segment writable))
                THEN ZF ← 1;
            FI;
FI;

Flags Affected
The ZF flag is set to 1 if the segment is accessible and readable (VERR) or writable (VERW); otherwise, it is set to 0.

Protected Mode Exceptions
The only exceptions generated for these instructions are those related to illegal addressing of the source operand.
#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. If the DS, ES, FS, or GS register is used to access memory and it contains a NULL segment selector.
#SS(0) If a memory operand effective address is outside the SS segment limit.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.
#UD If the LOCK prefix is used.

Real-Address Mode Exceptions
#UD The VERR and VERW instructions are not recognized in real-address mode. If the LOCK prefix is used.

Virtual-8086 Mode Exceptions
#UD The VERR and VERW instructions are not recognized in virtual-8086 mode. If the LOCK prefix is used.

Compatibility Mode Exceptions
Same exceptions as in protected mode.

64-Bit Mode Exceptions
#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.
#UD If the LOCK prefix is used.
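A hedged sketch of using VERR from C through GCC-style inline assembly with a flag output operand (GCC 6 or later; meaningful only in an environment, such as a protected-mode kernel, where segment selectors are in use; this example is illustrative and not part of the reference):

#include <stdint.h>

/* Returns nonzero if the segment selected by sel is readable at the
   current privilege level; VERR sets ZF on success. */
static int segment_readable(uint16_t sel)
{
    int ok;
    __asm__ ("verr %1" : "=@ccz" (ok) : "rm" (sel));
    return ok;
}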
VEXP2PD—Approximation to the Exponential 2^x of Packed Double-Precision Floating-Point Values with Less Than 2^-23 Relative Error

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.512.66.0F38.W1 C8 /r VEXP2PD zmm1 {k1}{z}, zmm2/m512/m64bcst {sae} | FV | V/V | AVX512ER | Computes approximations to the exponential 2^x (with less than 2^-23 of maximum relative error) of the packed double-precision floating-point values from zmm2/m512/m64bcst and stores the floating-point result in zmm1 with writemask k1.

Instruction Operand Encoding
Op/En FV: Operand 1 = ModRM:reg (r, w); Operand 2 = ModRM:r/m (r); Operand 3 = NA; Operand 4 = NA.

Description
Computes the approximate base-2 exponential evaluation of the double-precision floating-point values in the source operand (the second operand) and stores the results to the destination operand (the first operand) using the writemask k1. The approximate base-2 exponential is evaluated with less than 2^-23 of relative error.
Denormal input values are treated as zeros and do not signal #DE, irrespective of MXCSR.DAZ. Denormal results are flushed to zeros and do not signal #UE, irrespective of MXCSR.FZ.
The source operand is a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 64-bit memory location. The destination operand is a ZMM register, conditionally updated using writemask k1.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
A numerically exact implementation of VEXP2xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.

Operation
VEXP2PD
(KL, VL) = (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN IF (EVEX.b = 1) AND (SRC *is memory*)
                THEN DEST[i+63:i] ← EXP2_23_DP(SRC[63:0])
                ELSE DEST[i+63:i] ← EXP2_23_DP(SRC[i+63:i])
             FI;
        ELSE IF *merging-masking*    ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0    ; zeroing-masking
             FI;
    FI;
ENDFOR;

Table 5-4. Special Values Behavior
Source Input | Result | Comments
NaN | QNaN(src) | If (SRC = SNaN) then #I
+∞ | +∞ |
+/-0 | 1.0f |
-∞ | +0.0f | Exact result
Integral value N | 2^(N) | Exact result

Intel C/C++ Compiler Intrinsic Equivalent
VEXP2PD __m512d _mm512_exp2a23_round_pd (__m512d a, int sae);
VEXP2PD __m512d _mm512_mask_exp2a23_round_pd (__m512d a, __mmask8 m, __m512d b, int sae);
VEXP2PD __m512d _mm512_maskz_exp2a23_round_pd ( __mmask8 m, __m512d b, int sae);

SIMD Floating-Point Exceptions
Invalid (if SNaN input), Overflow

Other Exceptions
See Exceptions Type E2.

VEXP2PS—Approximation to the Exponential 2^x of Packed Single-Precision Floating-Point Values with Less Than 2^-23 Relative Error

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.512.66.0F38.W0 C8 /r VEXP2PS zmm1 {k1}{z}, zmm2/m512/m32bcst {sae} | FV | V/V | AVX512ER | Computes approximations to the exponential 2^x (with less than 2^-23 of maximum relative error) of the packed single-precision floating-point values from zmm2/m512/m32bcst and stores the floating-point result in zmm1 with writemask k1.
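Before the encoding details of VEXP2PS, a hedged usage sketch of the VEXP2PD form above (AVX512ER is implemented only on some processors, e.g. the Intel Xeon Phi line, so this is a sketch under that assumption):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512d x = _mm512_setr_pd(-1.0, 0.0, 0.5, 1.0, 2.0, 3.0, 10.0, -3.0);
    /* Approximate 2^x with relative error below 2^-23; integral inputs
       produce exact powers of two per Table 5-4. */
    __m512d y = _mm512_exp2a23_round_pd(x, _MM_FROUND_NO_EXC);
    double out[8];
    _mm512_storeu_pd(out, y);
    for (int i = 0; i < 8; i++)
        printf("%g ", out[i]);  /* ~ 0.5 1 1.41421 2 4 8 1024 0.125 */
    printf("\n");
    return 0;
}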
Instruction Operand Encoding
Op/En FV: Operand 1 = ModRM:reg (r, w); Operand 2 = ModRM:r/m (r); Operand 3 = NA; Operand 4 = NA.

Description
Computes the approximate base-2 exponential evaluation of the single-precision floating-point values in the source operand (the second operand) and stores the results in the destination operand (the first operand) using the writemask k1. The approximate base-2 exponential is evaluated with less than 2^-23 of relative error.
Denormal input values are treated as zeros and do not signal #DE, irrespective of MXCSR.DAZ. Denormal results are flushed to zeros and do not signal #UE, irrespective of MXCSR.FZ.
The source operand is a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM register, conditionally updated using writemask k1.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
A numerically exact implementation of VEXP2xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.

Operation
VEXP2PS
(KL, VL) = (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN IF (EVEX.b = 1) AND (SRC *is memory*)
                THEN DEST[i+31:i] ← EXP2_23_SP(SRC[31:0])
                ELSE DEST[i+31:i] ← EXP2_23_SP(SRC[i+31:i])
             FI;
        ELSE IF *merging-masking*    ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0    ; zeroing-masking
             FI;
    FI;
ENDFOR;

Table 5-5. Special Values Behavior
Source Input | Result | Comments
NaN | QNaN(src) | If (SRC = SNaN) then #I
+∞ | +∞ |
+/-0 | 1.0f |
-∞ | +0.0f | Exact result
Integral value N | 2^(N) | Exact result

Intel C/C++ Compiler Intrinsic Equivalent
VEXP2PS __m512 _mm512_exp2a23_round_ps (__m512 a, int sae);
VEXP2PS __m512 _mm512_mask_exp2a23_round_ps (__m512 a, __mmask16 m, __m512 b, int sae);
VEXP2PS __m512 _mm512_maskz_exp2a23_round_ps (__mmask16 m, __m512 b, int sae);

SIMD Floating-Point Exceptions
Invalid (if SNaN input), Overflow

Other Exceptions
See Exceptions Type E2.

VEXTRACTF128/VEXTRACTF32x4/VEXTRACTF64x2/VEXTRACTF32x8/VEXTRACTF64x4—Extract Packed Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.256.66.0F3A.W0 19 /r ib VEXTRACTF128 xmm1/m128, ymm2, imm8 | RMI | V/V | AVX | Extract 128 bits of packed floating-point values from ymm2 and store results in xmm1/m128.
EVEX.256.66.0F3A.W0 19 /r ib VEXTRACTF32X4 xmm1/m128 {k1}{z}, ymm2, imm8 | T4 | V/V | AVX512VL AVX512F | Extract 128 bits of packed single-precision floating-point values from ymm2 and store results in xmm1/m128 subject to writemask k1.
EVEX.512.66.0F3A.W0 19 /r ib VEXTRACTF32x4 xmm1/m128 {k1}{z}, zmm2, imm8 | T4 | V/V | AVX512F | Extract 128 bits of packed single-precision floating-point values from zmm2 and store results in xmm1/m128 subject to writemask k1.
EVEX.256.66.0F3A.W1 19 /r ib VEXTRACTF64X2 xmm1/m128 {k1}{z}, ymm2, imm8 | T2 | V/V | AVX512VL AVX512DQ | Extract 128 bits of packed double-precision floating-point values from ymm2 and store results in xmm1/m128 subject to writemask k1.
EVEX.512.66.0F3A.W1 19 /r ib VEXTRACTF64X2 xmm1/m128 {k1}{z}, zmm2, imm8 | T2 | V/V | AVX512DQ | Extract 128 bits of packed double-precision floating-point values from zmm2 and store results in xmm1/m128 subject to writemask k1.
EVEX.512.66.0F3A.W0 1B /r ib VEXTRACTF32X8 ymm1/m256 {k1}{z}, zmm2, imm8 | T8 | V/V | AVX512DQ | Extract 256 bits of packed single-precision floating-point values from zmm2 and store results in ymm1/m256 subject to writemask k1.
EVEX.512.66.0F3A.W1 1B /r ib VEXTRACTF64x4 ymm1/m256 {k1}{z}, zmm2, imm8 | T4 | V/V | AVX512F | Extract 256 bits of packed double-precision floating-point values from zmm2 and store results in ymm1/m256 subject to writemask k1.

Instruction Operand Encoding
Op/En RMI: Operand 1 = ModRM:r/m (w); Operand 2 = ModRM:reg (r); Operand 3 = Imm8; Operand 4 = NA.
Op/En T2, T4, T8: Operand 1 = ModRM:r/m (w); Operand 2 = ModRM:reg (r); Operand 3 = Imm8; Operand 4 = NA.

Description
VEXTRACTF128/VEXTRACTF32x4 and VEXTRACTF64x2 extract 128 bits of packed floating-point values (single-precision for VEXTRACTF128/VEXTRACTF32x4, double-precision for VEXTRACTF64x2) from the source operand (the second operand) and store them to the low 128 bits of the destination operand (the first operand). The 128-bit data extraction occurs at a 128-bit granular offset specified by imm8[0] (for a 256-bit source) or imm8[1:0] (for a 512-bit source) as the multiply factor. The destination may be either a vector register or a 128-bit memory location.
VEXTRACTF32x4: The low 128 bits of the destination operand are updated at 32-bit granularity according to the writemask.
VEXTRACTF32x8 and VEXTRACTF64x4 extract 256 bits of packed floating-point values (single-precision for VEXTRACTF32x8, double-precision for VEXTRACTF64x4) from the source operand (second operand) and store them to the low 256 bits of the destination operand (the first operand). The 256-bit data extraction occurs at a 256-bit granular offset specified by imm8[0] as the multiply factor. The destination may be either a vector register or a 256-bit memory location.
VEXTRACTF64x4: The low 256 bits of the destination operand are updated at 64-bit granularity according to the writemask.
VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD. The high 6 bits of the immediate are ignored.
An attempt to execute VEXTRACTF128 encoded with VEX.L = 0 will cause an #UD exception.

Operation
VEXTRACTF32x4 (EVEX encoded versions) when destination is a register
VL = 256, 512
IF VL = 256
    CASE (imm8[0]) OF
        0: TMP_DEST[127:0] ← SRC1[127:0]
        1: TMP_DEST[127:0] ← SRC1[255:128]
    ESAC.
FI;
IF VL = 512
    CASE (imm8[1:0]) OF
        00: TMP_DEST[127:0] ← SRC1[127:0]
        01: TMP_DEST[127:0] ← SRC1[255:128]
        10: TMP_DEST[127:0] ← SRC1[383:256]
        11: TMP_DEST[127:0] ← SRC1[511:384]
    ESAC.
FI;
FOR j ← 0 TO 3
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i]
        ELSE IF *merging-masking*    ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0    ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:128] ← 0

VEXTRACTF32x4 (EVEX encoded versions) when destination is memory
VL = 256, 512
IF VL = 256
    CASE (imm8[0]) OF
        0: TMP_DEST[127:0] ← SRC1[127:0]
        1: TMP_DEST[127:0] ← SRC1[255:128]
    ESAC.
FI;
IF VL = 512
    CASE (imm8[1:0]) OF
        00: TMP_DEST[127:0] ← SRC1[127:0]
        01: TMP_DEST[127:0] ← SRC1[255:128]
        10: TMP_DEST[127:0] ← SRC1[383:256]
        11: TMP_DEST[127:0] ← SRC1[511:384]
    ESAC.
FI; FOR j 0 TO 3 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] TMP_DEST[i+31:i] ELSE *DEST[i+31:i] remains unchanged* FI; ENDFOR ; merging-masking VEXTRACTF128/VEXTRACTF32x4/VEXTRACTF64x2/VEXTRACTF32x8/VEXTRACTF64x4—Extract Packed Floating-Point Values Vol. 2C 5-101 INSTRUCTION SET REFERENCE, V-Z VEXTRACTF64x2 (EVEX encoded versions) when destination is a register VL = 256, 512 IF VL = 256 CASE (imm8[0]) OF 0: TMP_DEST[127:0] SRC1[127:0] 1: TMP_DEST[127:0] SRC1[255:128] ESAC. FI; IF VL = 512 CASE (imm8[1:0]) OF 00: TMP_DEST[127:0] SRC1[127:0] 01: TMP_DEST[127:0] SRC1[255:128] 10: TMP_DEST[127:0] SRC1[383:256] 11: TMP_DEST[127:0] SRC1[511:384] ESAC. FI; FOR j 0 TO 1 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] TMP_DEST[i+63:i] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE *zeroing-masking* ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:128] 0 5-102 Vol. 2C VEXTRACTF128/VEXTRACTF32x4/VEXTRACTF64x2/VEXTRACTF32x8/VEXTRACTF64x4—Extract Packed Floating-Point Values INSTRUCTION SET REFERENCE, V-Z VEXTRACTF64x2 (EVEX encoded versions) when destination is memory VL = 256, 512 IF VL = 256 CASE (imm8[0]) OF 0: TMP_DEST[127:0] SRC1[127:0] 1: TMP_DEST[127:0] SRC1[255:128] ESAC. FI; IF VL = 512 CASE (imm8[1:0]) OF 00: TMP_DEST[127:0] SRC1[127:0] 01: TMP_DEST[127:0] SRC1[255:128] 10: TMP_DEST[127:0] SRC1[383:256] 11: TMP_DEST[127:0] SRC1[511:384] ESAC. FI; FOR j 0 TO 1 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] TMP_DEST[i+63:i] ELSE *DEST[i+63:i] remains unchanged* FI; ENDFOR ; merging-masking VEXTRACTF32x8 (EVEX.U1.512 encoded version) when destination is a register VL = 512 CASE (imm8[0]) OF 0: TMP_DEST[255:0] SRC1[255:0] 1: TMP_DEST[255:0] SRC1[511:256] ESAC. FOR j 0 TO 7 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] TMP_DEST[i+31:i] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE *zeroing-masking* ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:256] 0 VEXTRACTF128/VEXTRACTF32x4/VEXTRACTF64x2/VEXTRACTF32x8/VEXTRACTF64x4—Extract Packed Floating-Point Values Vol. 2C 5-103 INSTRUCTION SET REFERENCE, V-Z VEXTRACTF32x8 (EVEX.U1.512 encoded version) when destination is memory CASE (imm8[0]) OF 0: TMP_DEST[255:0] SRC1[255:0] 1: TMP_DEST[255:0] SRC1[511:256] ESAC. FOR j 0 TO 7 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] TMP_DEST[i+31:i] ELSE *DEST[i+31:i] remains unchanged* FI; ENDFOR ; merging-masking VEXTRACTF64x4 (EVEX.512 encoded version) when destination is a register VL = 512 CASE (imm8[0]) OF 0: TMP_DEST[255:0] SRC1[255:0] 1: TMP_DEST[255:0] SRC1[511:256] ESAC. FOR j 0 TO 3 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] TMP_DEST[i+63:i] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE *zeroing-masking* ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:256] 0 VEXTRACTF64x4 (EVEX.512 encoded version) when destination is memory CASE (imm8[0]) OF 0: TMP_DEST[255:0] SRC1[255:0] 1: TMP_DEST[255:0] SRC1[511:256] ESAC. FOR j 0 TO 3 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] TMP_DEST[i+63:i] ELSE ; merging-masking *DEST[i+63:i] remains unchanged* FI; ENDFOR 5-104 Vol. 2C VEXTRACTF128/VEXTRACTF32x4/VEXTRACTF64x2/VEXTRACTF32x8/VEXTRACTF64x4—Extract Packed Floating-Point Values INSTRUCTION SET REFERENCE, V-Z VEXTRACTF128 (memory destination form) CASE (imm8[0]) OF 0: DEST[127:0] SRC1[127:0] 1: DEST[127:0] SRC1[255:128] ESAC. 
VEXTRACTF128 (register destination form) CASE (imm8[0]) OF 0: DEST[127:0] SRC1[127:0] 1: DEST[127:0] SRC1[255:128] ESAC. DEST[MAX_VL-1:128] 0 Intel C/C++ Compiler Intrinsic Equivalent VEXTRACTF32x4 __m128 _mm512_extractf32x4_ps(__m512 a, const int nidx); VEXTRACTF32x4 __m128 _mm512_mask_extractf32x4_ps(__m128 s, __mmask8 k, __m512 a, const int nidx); VEXTRACTF32x4 __m128 _mm512_maskz_extractf32x4_ps( __mmask8 k, __m512 a, const int nidx); VEXTRACTF32x4 __m128 _mm256_extractf32x4_ps(__m256 a, const int nidx); VEXTRACTF32x4 __m128 _mm256_mask_extractf32x4_ps(__m128 s, __mmask8 k, __m256 a, const int nidx); VEXTRACTF32x4 __m128 _mm256_maskz_extractf32x4_ps( __mmask8 k, __m256 a, const int nidx); VEXTRACTF32x8 __m256 _mm512_extractf32x8_ps(__m512 a, const int nidx); VEXTRACTF32x8 __m256 _mm512_mask_extractf32x8_ps(__m256 s, __mmask8 k, __m512 a, const int nidx); VEXTRACTF32x8 __m256 _mm512_maskz_extractf32x8_ps( __mmask8 k, __m512 a, const int nidx); VEXTRACTF64x2 __m128d _mm512_extractf64x2_pd(__m512d a, const int nidx); VEXTRACTF64x2 __m128d _mm512_mask_extractf64x2_pd(__m128d s, __mmask8 k, __m512d a, const int nidx); VEXTRACTF64x2 __m128d _mm512_maskz_extractf64x2_pd( __mmask8 k, __m512d a, const int nidx); VEXTRACTF64x2 __m128d _mm256_extractf64x2_pd(__m256d a, const int nidx); VEXTRACTF64x2 __m128d _mm256_mask_extractf64x2_pd(__m128d s, __mmask8 k, __m256d a, const int nidx); VEXTRACTF64x2 __m128d _mm256_maskz_extractf64x2_pd( __mmask8 k, __m256d a, const int nidx); VEXTRACTF64x4 __m256d _mm512_extractf64x4_pd( __m512d a, const int nidx); VEXTRACTF64x4 __m256d _mm512_mask_extractf64x4_pd(__m256d s, __mmask8 k, __m512d a, const int nidx); VEXTRACTF64x4 __m256d _mm512_maskz_extractf64x4_pd( __mmask8 k, __m512d a, const int nidx); VEXTRACTF128 __m128 _mm256_extractf128_ps (__m256 a, int offset); VEXTRACTF128 __m128d _mm256_extractf128_pd (__m256d a, int offset); VEXTRACTF128 __m128i_mm256_extractf128_si256(__m256i a, int offset); SIMD Floating-Point Exceptions None Other Exceptions VEX-encoded instructions, see Exceptions Type 6; EVEX-encoded instructions, see Exceptions Type E6NF. #UD IF VEX.L = 0. #UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B. VEXTRACTF128/VEXTRACTF32x4/VEXTRACTF64x2/VEXTRACTF32x8/VEXTRACTF64x4—Extract Packed Floating-Point Values Vol. 2C 5-105 INSTRUCTION SET REFERENCE, V-Z VEXTRACTI128/VEXTRACTI32x4/VEXTRACTI64x2/VEXTRACTI32x8/VEXTRACTI64x4—Extract packed Integer Values Opcode/ Instruction Op / En CPUID Feature Flag AVX2 Description RMI 64/32 bit Mode Support V/V VEX.256.66.0F3A.W0 39 /r ib VEXTRACTI128 xmm1/m128, ymm2, imm8 EVEX.256.66.0F3A.W0 39 /r ib VEXTRACTI32X4 xmm1/m128 {k1}{z}, ymm2, imm8 EVEX.512.66.0F3A.W0 39 /r ib VEXTRACTI32x4 xmm1/m128 {k1}{z}, zmm2, imm8 EVEX.256.66.0F3A.W1 39 /r ib VEXTRACTI64X2 xmm1/m128 {k1}{z}, ymm2, imm8 EVEX.512.66.0F3A.W1 39 /r ib VEXTRACTI64X2 xmm1/m128 {k1}{z}, zmm2, imm8 EVEX.512.66.0F3A.W0 3B /r ib VEXTRACTI32X8 ymm1/m256 {k1}{z}, zmm2, imm8 EVEX.512.66.0F3A.W1 3B /r ib VEXTRACTI64x4 ymm1/m256 {k1}{z}, zmm2, imm8 T4 V/V AVX512VL AVX512F T4 V/V AVX512F T2 V/V AVX512VL AVX512DQ T2 V/V AVX512DQ T8 V/V AVX512DQ T4 V/V AVX512F Extract 128 bits of double-word integer values from ymm2 and store results in xmm1/m128 subject to writemask k1. Extract 128 bits of double-word integer values from zmm2 and store results in xmm1/m128 subject to writemask k1. Extract 128 bits of quad-word integer values from ymm2 and store results in xmm1/m128 subject to writemask k1. 
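A short illustration of the VEX intrinsic form (the function name is illustrative):

#include <immintrin.h>

/* Extract the upper 128-bit lane of a 256-bit vector: imm8[0] = 1
   selects SRC1[255:128], per the operation above. */
__m128 upper_lane(__m256 v)
{
    return _mm256_extractf128_ps(v, 1);
}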
VEXTRACTI128/VEXTRACTI32x4/VEXTRACTI64x2/VEXTRACTI32x8/VEXTRACTI64x4—Extract Packed Integer Values

VEX.256.66.0F3A.W0 39 /r ib VEXTRACTI128 xmm1/m128, ymm2, imm8
    Op/En: RMI; 64/32-bit Mode: V/V; CPUID: AVX2.
    Extract 128 bits of integer data from ymm2 and store results in xmm1/m128.
EVEX.256.66.0F3A.W0 39 /r ib VEXTRACTI32X4 xmm1/m128 {k1}{z}, ymm2, imm8
    Op/En: T4; 64/32-bit Mode: V/V; CPUID: AVX512VL AVX512F.
    Extract 128 bits of double-word integer values from ymm2 and store results in xmm1/m128 subject to writemask k1.
EVEX.512.66.0F3A.W0 39 /r ib VEXTRACTI32x4 xmm1/m128 {k1}{z}, zmm2, imm8
    Op/En: T4; 64/32-bit Mode: V/V; CPUID: AVX512F.
    Extract 128 bits of double-word integer values from zmm2 and store results in xmm1/m128 subject to writemask k1.
EVEX.256.66.0F3A.W1 39 /r ib VEXTRACTI64X2 xmm1/m128 {k1}{z}, ymm2, imm8
    Op/En: T2; 64/32-bit Mode: V/V; CPUID: AVX512VL AVX512DQ.
    Extract 128 bits of quad-word integer values from ymm2 and store results in xmm1/m128 subject to writemask k1.
EVEX.512.66.0F3A.W1 39 /r ib VEXTRACTI64X2 xmm1/m128 {k1}{z}, zmm2, imm8
    Op/En: T2; 64/32-bit Mode: V/V; CPUID: AVX512DQ.
    Extract 128 bits of quad-word integer values from zmm2 and store results in xmm1/m128 subject to writemask k1.
EVEX.512.66.0F3A.W0 3B /r ib VEXTRACTI32X8 ymm1/m256 {k1}{z}, zmm2, imm8
    Op/En: T8; 64/32-bit Mode: V/V; CPUID: AVX512DQ.
    Extract 256 bits of double-word integer values from zmm2 and store results in ymm1/m256 subject to writemask k1.
EVEX.512.66.0F3A.W1 3B /r ib VEXTRACTI64x4 ymm1/m256 {k1}{z}, zmm2, imm8
    Op/En: T4; 64/32-bit Mode: V/V; CPUID: AVX512F.
    Extract 256 bits of quad-word integer values from zmm2 and store results in ymm1/m256 subject to writemask k1.

Instruction Operand Encoding

Op/En       Operand 1      Operand 2      Operand 3  Operand 4
RMI         ModRM:r/m (w)  ModRM:reg (r)  Imm8       NA
T2, T4, T8  ModRM:r/m (w)  ModRM:reg (r)  Imm8       NA

Description

VEXTRACTI128, VEXTRACTI32x4 and VEXTRACTI64x2 extract 128 bits of packed integer values (doubleword elements for VEXTRACTI32x4, quadword elements for VEXTRACTI64x2) from the source operand (the second operand) and store them to the low 128 bits of the destination operand (the first operand). The extraction occurs at a 128-bit granular offset, with imm8[0] (256-bit source) or imm8[1:0] (512-bit source) as the multiply factor. The destination may be either a vector register or a 128-bit memory location.

VEXTRACTI32x4: the low 128 bits of the destination operand are updated at 32-bit granularity according to the writemask.
VEXTRACTI64x2: the low 128 bits of the destination operand are updated at 64-bit granularity according to the writemask.

VEXTRACTI32x8 and VEXTRACTI64x4 extract 256 bits of packed integer values (doubleword elements for VEXTRACTI32x8, quadword elements for VEXTRACTI64x4) from the source operand (the second operand) and store them to the low 256 bits of the destination operand (the first operand). The extraction occurs at a 256-bit granular offset, with imm8[0] as the multiply factor. The destination may be either a vector register or a 256-bit memory location.

VEXTRACTI32x8: the low 256 bits of the destination operand are updated at 32-bit granularity according to the writemask.
VEXTRACTI64x4: the low 256 bits of the destination operand are updated at 64-bit granularity according to the writemask.

VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD. The high 7 bits (6 bits in EVEX.512) of the immediate are ignored. An attempt to execute VEXTRACTI128 encoded with VEX.L = 0 will cause an #UD exception.

Operation

VEXTRACTI32x4 (EVEX encoded versions) when destination is a register
VL = 256, 512
IF VL = 256
    CASE (imm8[0]) OF
        0: TMP_DEST[127:0] ← SRC1[127:0]
        1: TMP_DEST[127:0] ← SRC1[255:128]
    ESAC.
FI;
IF VL = 512
    CASE (imm8[1:0]) OF
        00: TMP_DEST[127:0] ← SRC1[127:0]
        01: TMP_DEST[127:0] ← SRC1[255:128]
        10: TMP_DEST[127:0] ← SRC1[383:256]
        11: TMP_DEST[127:0] ← SRC1[511:384]
    ESAC.
FI;
FOR j ← 0 TO 3
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i]
        ELSE IF *merging-masking*            ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking*       ; zeroing-masking
                    DEST[i+31:i] ← 0
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:128] ← 0

VEXTRACTI32x4 (EVEX encoded versions) when destination is memory
VL = 256, 512
IF VL = 256
    CASE (imm8[0]) OF
        0: TMP_DEST[127:0] ← SRC1[127:0]
        1: TMP_DEST[127:0] ← SRC1[255:128]
    ESAC.
FI;
IF VL = 512
    CASE (imm8[1:0]) OF
        00: TMP_DEST[127:0] ← SRC1[127:0]
        01: TMP_DEST[127:0] ← SRC1[255:128]
        10: TMP_DEST[127:0] ← SRC1[383:256]
        11: TMP_DEST[127:0] ← SRC1[511:384]
    ESAC.
FI;
FOR j ← 0 TO 3
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i]
        ELSE *DEST[i+31:i] remains unchanged*    ; merging-masking
    FI;
ENDFOR

VEXTRACTI64x2 (EVEX encoded versions) when destination is a register
VL = 256, 512
IF VL = 256
    CASE (imm8[0]) OF
        0: TMP_DEST[127:0] ← SRC1[127:0]
        1: TMP_DEST[127:0] ← SRC1[255:128]
    ESAC.
FI;
IF VL = 512
    CASE (imm8[1:0]) OF
        00: TMP_DEST[127:0] ← SRC1[127:0]
        01: TMP_DEST[127:0] ← SRC1[255:128]
        10: TMP_DEST[127:0] ← SRC1[383:256]
        11: TMP_DEST[127:0] ← SRC1[511:384]
    ESAC.
FI;
FOR j ← 0 TO 1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i]
        ELSE IF *merging-masking*            ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE *zeroing-masking*       ; zeroing-masking
                    DEST[i+63:i] ← 0
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:128] ← 0

VEXTRACTI64x2 (EVEX encoded versions) when destination is memory
VL = 256, 512
IF VL = 256
    CASE (imm8[0]) OF
        0: TMP_DEST[127:0] ← SRC1[127:0]
        1: TMP_DEST[127:0] ← SRC1[255:128]
    ESAC.
FI;
IF VL = 512
    CASE (imm8[1:0]) OF
        00: TMP_DEST[127:0] ← SRC1[127:0]
        01: TMP_DEST[127:0] ← SRC1[255:128]
        10: TMP_DEST[127:0] ← SRC1[383:256]
        11: TMP_DEST[127:0] ← SRC1[511:384]
    ESAC.
FI;
FOR j ← 0 TO 1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i]
        ELSE *DEST[i+63:i] remains unchanged*    ; merging-masking
    FI;
ENDFOR

VEXTRACTI32x8 (EVEX.U1.512 encoded version) when destination is a register
VL = 512
CASE (imm8[0]) OF
    0: TMP_DEST[255:0] ← SRC1[255:0]
    1: TMP_DEST[255:0] ← SRC1[511:256]
ESAC.
FOR j ← 0 TO 7
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i]
        ELSE IF *merging-masking*            ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking*       ; zeroing-masking
                    DEST[i+31:i] ← 0
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:256] ← 0

VEXTRACTI32x8 (EVEX.U1.512 encoded version) when destination is memory
CASE (imm8[0]) OF
    0: TMP_DEST[255:0] ← SRC1[255:0]
    1: TMP_DEST[255:0] ← SRC1[511:256]
ESAC.
FOR j ← 0 TO 7
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i]
        ELSE *DEST[i+31:i] remains unchanged*    ; merging-masking
    FI;
ENDFOR

VEXTRACTI64x4 (EVEX.512 encoded version) when destination is a register
VL = 512
CASE (imm8[0]) OF
    0: TMP_DEST[255:0] ← SRC1[255:0]
    1: TMP_DEST[255:0] ← SRC1[511:256]
ESAC.
FOR j ← 0 TO 3
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i]
        ELSE IF *merging-masking*            ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE *zeroing-masking*       ; zeroing-masking
                    DEST[i+63:i] ← 0
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:256] ← 0

VEXTRACTI64x4 (EVEX.512 encoded version) when destination is memory
CASE (imm8[0]) OF
    0: TMP_DEST[255:0] ← SRC1[255:0]
    1: TMP_DEST[255:0] ← SRC1[511:256]
ESAC.
FOR j ← 0 TO 3
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i]
        ELSE *DEST[i+63:i] remains unchanged*    ; merging-masking
    FI;
ENDFOR

VEXTRACTI128 (memory destination form)
CASE (imm8[0]) OF
    0: DEST[127:0] ← SRC1[127:0]
    1: DEST[127:0] ← SRC1[255:128]
ESAC.

VEXTRACTI128 (register destination form)
CASE (imm8[0]) OF
    0: DEST[127:0] ← SRC1[127:0]
    1: DEST[127:0] ← SRC1[255:128]
ESAC.
DEST[MAX_VL-1:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VEXTRACTI32x4 __m128i _mm512_extracti32x4_epi32(__m512i a, const int nidx);
VEXTRACTI32x4 __m128i _mm512_mask_extracti32x4_epi32(__m128i s, __mmask8 k, __m512i a, const int nidx);
VEXTRACTI32x4 __m128i _mm512_maskz_extracti32x4_epi32( __mmask8 k, __m512i a, const int nidx);
VEXTRACTI32x4 __m128i _mm256_extracti32x4_epi32(__m256i a, const int nidx);
VEXTRACTI32x4 __m128i _mm256_mask_extracti32x4_epi32(__m128i s, __mmask8 k, __m256i a, const int nidx);
VEXTRACTI32x4 __m128i _mm256_maskz_extracti32x4_epi32( __mmask8 k, __m256i a, const int nidx);
VEXTRACTI32x8 __m256i _mm512_extracti32x8_epi32(__m512i a, const int nidx);
VEXTRACTI32x8 __m256i _mm512_mask_extracti32x8_epi32(__m256i s, __mmask8 k, __m512i a, const int nidx);
VEXTRACTI32x8 __m256i _mm512_maskz_extracti32x8_epi32( __mmask8 k, __m512i a, const int nidx);
VEXTRACTI64x2 __m128i _mm512_extracti64x2_epi64(__m512i a, const int nidx);
VEXTRACTI64x2 __m128i _mm512_mask_extracti64x2_epi64(__m128i s, __mmask8 k, __m512i a, const int nidx);
VEXTRACTI64x2 __m128i _mm512_maskz_extracti64x2_epi64( __mmask8 k, __m512i a, const int nidx);
VEXTRACTI64x2 __m128i _mm256_extracti64x2_epi64(__m256i a, const int nidx);
VEXTRACTI64x2 __m128i _mm256_mask_extracti64x2_epi64(__m128i s, __mmask8 k, __m256i a, const int nidx);
VEXTRACTI64x2 __m128i _mm256_maskz_extracti64x2_epi64( __mmask8 k, __m256i a, const int nidx);
VEXTRACTI64x4 __m256i _mm512_extracti64x4_epi64(__m512i a, const int nidx);
VEXTRACTI64x4 __m256i _mm512_mask_extracti64x4_epi64(__m256i s, __mmask8 k, __m512i a, const int nidx);
VEXTRACTI64x4 __m256i _mm512_maskz_extracti64x4_epi64( __mmask8 k, __m512i a, const int nidx);
VEXTRACTI128 __m128i _mm256_extracti128_si256(__m256i a, int offset);

SIMD Floating-Point Exceptions

None

Other Exceptions

VEX-encoded instructions, see Exceptions Type 6; EVEX-encoded instructions, see Exceptions Type E6NF.
#UD If VEX.L = 0.
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
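A short illustration of the EVEX intrinsic form (the function name is illustrative):

#include <immintrin.h>

/* Extract the third 128-bit chunk (bits 383:256) of a 512-bit integer
   vector: imm8[1:0] = 2 selects SRC1[383:256], per the operation above. */
__m128i third_chunk(__m512i v)
{
    return _mm512_extracti32x4_epi32(v, 2);
}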
VFIXUPIMMPD—Fix Up Special Packed Float64 Values

EVEX.NDS.128.66.0F3A.W1 54 /r ib VFIXUPIMMPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst, imm8
    Op/En: FV; 64/32-bit Mode: V/V; CPUID: AVX512VL AVX512F.
    Fix up special numbers in float64 vector xmm1, float64 vector xmm2 and int64 vector xmm3/m128/m64bcst and store the result in xmm1, under writemask.
EVEX.NDS.256.66.0F3A.W1 54 /r ib VFIXUPIMMPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst, imm8
    Op/En: FV; 64/32-bit Mode: V/V; CPUID: AVX512VL AVX512F.
    Fix up special numbers in float64 vector ymm1, float64 vector ymm2 and int64 vector ymm3/m256/m64bcst and store the result in ymm1, under writemask.
EVEX.NDS.512.66.0F3A.W1 54 /r ib VFIXUPIMMPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{sae}, imm8
    Op/En: FV; 64/32-bit Mode: V/V; CPUID: AVX512F.
    Fix up elements of float64 vector in zmm2 using int64 vector table in zmm3/m512/m64bcst, combine with preserved elements from zmm1, and store the result in zmm1.

Instruction Operand Encoding

Op/En  Operand 1         Operand 2   Operand 3      Operand 4
FV     ModRM:reg (r, w)  EVEX.vvvv   ModRM:r/m (r)  Imm8

Description

Perform a fix-up of the quadword elements encoded in double-precision floating-point format in the first source operand (the second operand) using a 32-bit, two-level look-up table specified in the corresponding quadword element of the second source operand (the third operand), with exception reporting specifier imm8. The elements that are fixed up are selected by mask bits of 1 specified in the opmask k1. A mask bit of 0 in the opmask k1, or a table response action of 0000b, preserves the corresponding element of the first operand. The fixed-up elements from the first source operand and the preserved elements in the first operand are combined as the final results in the destination operand (the first operand).

The destination and the first source operands are ZMM/YMM/XMM registers. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcast from a 64-bit memory location.

The two-level look-up table performs a fix-up of each double-precision floating-point input in the first source operand by decoding the input data encoding into 8 token types. A response table is defined for each token type that converts the input encoding in the first source operand with one of 16 response actions.

This instruction is specifically intended for use in fixing up the results of arithmetic calculations involving one source so that they match the spec, although it is generally useful for fixing up the results of multiple-instruction sequences to reflect special-number inputs. For example, consider rcp(0). Input 0 to rcp, and you should get INF according to the DX10 spec. However, evaluating rcp via Newton-Raphson, where x = approx(1/0), yields an incorrect result. To deal with this, VFIXUPIMMPD can be used after the N-R reciprocal sequence to set the result to the correct value (i.e., INF when the input is 0).

If MXCSR.DAZ is not set, denormal input elements in the first source operand are considered as normal inputs and do not trigger any fixup nor fault reporting.

Imm8 is used to set the required flags reporting. It supports #ZE and #IE fault reporting (see details below). MXCSR mask bits are ignored and are treated as if all mask bits are set to masked response. If any of the imm8 bits is set and the condition is met for fault reporting, MXCSR.IE or MXCSR.ZE might be updated.

This instruction is writemasked, so only those elements with the corresponding bit set in vector mask register k1 are computed and stored into zmm1. Elements in the destination with the corresponding bit clear in k1 retain their previous values or are set to 0.

Operation

enum TOKEN_TYPE {
    QNAN_TOKEN ← 0,
    SNAN_TOKEN ← 1,
    ZERO_VALUE_TOKEN ← 2,
    POS_ONE_VALUE_TOKEN ← 3,
    NEG_INF_TOKEN ← 4,
    POS_INF_TOKEN ← 5,
    NEG_VALUE_TOKEN ← 6,
    POS_VALUE_TOKEN ← 7
}

FIXUPIMM_DP (dest[63:0], src1[63:0], tbl3[63:0], imm8[7:0]) {
    tsrc[63:0] ← ((src1[62:52] = 0) AND (MXCSR.DAZ = 1)) ? 0.0 : src1[63:0]
    CASE (tsrc[63:0] of TOKEN_TYPE) {
        QNAN_TOKEN: j ← 0;
        SNAN_TOKEN: j ← 1;
        ZERO_VALUE_TOKEN: j ← 2;
        POS_ONE_VALUE_TOKEN: j ← 3;
        NEG_INF_TOKEN: j ← 4;
        POS_INF_TOKEN: j ← 5;
        NEG_VALUE_TOKEN: j ← 6;
        POS_VALUE_TOKEN: j ← 7;
    } ; end source special CASE(tsrc...)
    ; The required response from the src3 table is extracted
    token_response[3:0] ← tbl3[3+4*j:4*j];
    CASE (token_response[3:0]) {
        0000: dest[63:0] ← dest[63:0];          ; preserve content of DEST
        0001: dest[63:0] ← tsrc[63:0];          ; pass through src1 normal input value, denormal as zero
        0010: dest[63:0] ← QNaN(tsrc[63:0]);
        0011: dest[63:0] ← QNAN_Indefinite;
        0100: dest[63:0] ← -INF;
        0101: dest[63:0] ← +INF;
        0110: dest[63:0] ← tsrc.sign ? -INF : +INF;
        0111: dest[63:0] ← -0;
        1000: dest[63:0] ← +0;
        1001: dest[63:0] ← -1;
        1010: dest[63:0] ← +1;
        1011: dest[63:0] ← ½;
        1100: dest[63:0] ← 90.0;
        1101: dest[63:0] ← PI/2;
        1110: dest[63:0] ← MAX_FLOAT;
        1111: dest[63:0] ← -MAX_FLOAT;
    } ; end of token_response CASE
    ; The required fault reporting from imm8 is extracted.
    ; TOKENs are mutually exclusive and TOKEN priority defines the order.
    ; Multiple faults related to a single token can occur simultaneously.
    IF (tsrc[63:0] of TOKEN_TYPE: ZERO_VALUE_TOKEN) AND imm8[0] THEN set #ZE;
    IF (tsrc[63:0] of TOKEN_TYPE: ZERO_VALUE_TOKEN) AND imm8[1] THEN set #IE;
    IF (tsrc[63:0] of TOKEN_TYPE: ONE_VALUE_TOKEN) AND imm8[2] THEN set #ZE;
    IF (tsrc[63:0] of TOKEN_TYPE: ONE_VALUE_TOKEN) AND imm8[3] THEN set #IE;
    IF (tsrc[63:0] of TOKEN_TYPE: SNAN_TOKEN) AND imm8[4] THEN set #IE;
    IF (tsrc[63:0] of TOKEN_TYPE: NEG_INF_TOKEN) AND imm8[5] THEN set #IE;
    IF (tsrc[63:0] of TOKEN_TYPE: NEG_VALUE_TOKEN) AND imm8[6] THEN set #IE;
    IF (tsrc[63:0] of TOKEN_TYPE: POS_INF_TOKEN) AND imm8[7] THEN set #IE;
    ; end fault reporting
    return dest[63:0];
} ; end of FIXUPIMM_DP()

VFIXUPIMMPD
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[i+63:i] ← FIXUPIMM_DP(DEST[i+63:i], SRC1[i+63:i], SRC2[63:0], imm8[7:0])
                ELSE DEST[i+63:i] ← FIXUPIMM_DP(DEST[i+63:i], SRC1[i+63:i], SRC2[i+63:i], imm8[7:0])
             FI;
        ELSE IF *merging-masking*            ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0        ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Immediate Control Description:

Figure 5-9. VFIXUPIMMPD Immediate Control Description

imm8 bit  Input token  Fault reported when the bit is set
7         +INF         #IE
6         -VE          #IE
5         -INF         #IE
4         SNaN         #IE
3         ONE          #IE
2         ONE          #ZE
1         ZERO         #IE
0         ZERO         #ZE

Intel C/C++ Compiler Intrinsic Equivalent

VFIXUPIMMPD __m512d _mm512_fixupimm_pd( __m512d a, __m512i tbl, int imm);
VFIXUPIMMPD __m512d _mm512_mask_fixupimm_pd(__m512d s, __mmask8 k, __m512d a, __m512i tbl, int imm);
VFIXUPIMMPD __m512d _mm512_maskz_fixupimm_pd( __mmask8 k, __m512d a, __m512i tbl, int imm);
VFIXUPIMMPD __m512d _mm512_fixupimm_round_pd( __m512d a, __m512i tbl, int imm, int sae);
VFIXUPIMMPD __m512d _mm512_mask_fixupimm_round_pd(__m512d s, __mmask8 k, __m512d a, __m512i tbl, int imm, int sae);
VFIXUPIMMPD __m512d _mm512_maskz_fixupimm_round_pd( __mmask8 k, __m512d a, __m512i tbl, int imm, int sae);
VFIXUPIMMPD __m256d _mm256_fixupimm_pd( __m256d a, __m256i tbl, int imm);
VFIXUPIMMPD __m256d _mm256_mask_fixupimm_pd(__m256d s, __mmask8 k, __m256d a, __m256i tbl, int imm);
VFIXUPIMMPD __m256d _mm256_maskz_fixupimm_pd( __mmask8 k, __m256d a, __m256i tbl, int imm);
VFIXUPIMMPD __m128d _mm_fixupimm_pd( __m128d a, __m128i tbl, int imm);
VFIXUPIMMPD __m128d _mm_mask_fixupimm_pd(__m128d s, __mmask8 k, __m128d a, __m128i tbl, int imm);
VFIXUPIMMPD __m128d _mm_maskz_fixupimm_pd( __mmask8 k, __m128d a, __m128i tbl, int imm);

SIMD Floating-Point Exceptions

Zero, Invalid

Other Exceptions

See Exceptions Type E2.
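For illustration, a minimal C sketch of table construction follows, using the three-operand prototype listed above (some compiler headers instead take separate destination and source vectors, e.g. _mm512_fixupimm_pd(a, b, tbl, imm)); the function name is illustrative only.

#include <immintrin.h>

/* Each table nibble j holds the response for token type j: 0001b
   passes the input through, 0011b substitutes QNaN indefinite. Here
   only the NEG_VALUE token (j = 6) is remapped, so negative inputs
   become QNaN (e.g. to sanitize a sqrt argument) while all other
   values pass through unchanged. */
__m512d nan_for_negative(__m512d x)
{
    const __m512i tbl = _mm512_set1_epi64(0x13111111);
    return _mm512_fixupimm_pd(x, tbl, 0 /* imm8 = 0: no fault reporting */);
}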
2C 5-115 INSTRUCTION SET REFERENCE, V-Z VFIXUPIMMPS—Fix Up Special Packed Float32 Values Opcode/ Instruction Op / En FV 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512F EVEX.NDS.128.66.0F3A.W0 54 /r VFIXUPIMMPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst, imm8 EVEX.NDS.256.66.0F3A.W0 54 /r VFIXUPIMMPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst, imm8 EVEX.NDS.512.66.0F3A.W0 54 /r ib VFIXUPIMMPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{sae}, imm8 FV V/V AVX512VL AVX512F FV V/V AVX512F Description Fix up special numbers in float32 vector xmm1, float32 vector xmm2 and int32 vector xmm3/m128/m32bcst and store the result in xmm1, under writemask. Fix up special numbers in float32 vector ymm1, float32 vector ymm2 and int32 vector ymm3/m256/m32bcst and store the result in ymm1, under writemask. Fix up elements of float32 vector in zmm2 using int32 vector table in zmm3/m512/m32bcst, combine with preserved elements from zmm1, and store the result in zmm1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 FV ModRM:reg (r, w) EVEX.vvvv ModRM:r/m (r) Imm8 Description Perform fix-up of doubleword elements encoded in single-precision floating-point format in the first source operand (the second operand) using a 32-bit, two-level look-up table specified in the corresponding doubleword element of the second source operand (the third operand) with exception reporting specifier imm8. The elements that are fixed-up are selected by mask bits of 1 specified in the opmask k1. Mask bits of 0 in the opmask k1 or table response action of 0000b preserves the corresponding element of the first operand. The fixed-up elements from the first source operand and the preserved element in the first operand are combined as the final results in the destination operand (the first operand). The destination and the first source operands are ZMM/YMM/XMM registers. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64bit memory location. The two-level look-up table perform a fix-up of each SP FP input data in the first source operand by decoding the input data encoding into 8 token types. A response table is defined for each token type that converts the input encoding in the first source operand with one of 16 response actions. This instruction is specifically intended for use in fixing up the results of arithmetic calculations involving one source so that they match the spec, although it is generally useful for fixing up the results of multiple-instruction sequences to reflect special-number inputs. For example, consider rcp(0). Input 0 to rcp, and you should get INF according to the DX10 spec. However, evaluating rcp via Newton-Raphson, where x=approx(1/0), yields an incorrect result. To deal with this, VFIXUPIMMPS can be used after the N-R reciprocal sequence to set the result to the correct value (i.e. INF when the input is 0). If MXCSR.DAZ is not set, denormal input elements in the first source operand are considered as normal inputs and do not trigger any fixup nor fault reporting. Imm8 is used to set the required flags reporting. It supports #ZE and #IE fault reporting (see details below). MXCSR.DAZ is used and refer to zmm2 only (i.e. zmm1 is not considered as zero in case MXCSR.DAZ is set). MXCSR mask bits are ignored and are treated as if all mask bits are set to masked response). If any of the imm8 bits is set and the condition met for fault reporting, MXCSR.IE or MXCSR.ZE might be updated. 5-116 Vol. 
2C VFIXUPIMMPS—Fix Up Special Packed Float32 Values INSTRUCTION SET REFERENCE, V-Z Operation enum TOKEN_TYPE { QNAN_TOKEN 0, SNAN_TOKEN 1, ZERO_VALUE_TOKEN 2, POS_ONE_VALUE_TOKEN 3, NEG_INF_TOKEN 4, POS_INF_TOKEN 5, NEG_VALUE_TOKEN 6, POS_VALUE_TOKEN 7 } FIXUPIMM_SP ( dest[31:0], src1[31:0],tbl3[31:0], imm8 [7:0]){ tsrc[31:0] ((src1[30:23] = 0) AND (MXCSR.DAZ =1)) ? 0.0 : src1[31:0] CASE(tsrc[63:0] of TOKEN_TYPE) { QNAN_TOKEN: j 0; SNAN_TOKEN: j 1; ZERO_VALUE_TOKEN: j 2; POS_ONE_VALUE_TOKEN: j 3; NEG_INF_TOKEN: j 4; POS_INF_TOKEN: j 5; NEG_VALUE_TOKEN: j 6; POS_VALUE_TOKEN: j 7; } ; end source special CASE(tsrc…) ; The required response from src3 table is extracted token_response[3:0] = tbl3[3+4*j:4*j]; CASE(token_response[3:0]) { 0000: dest[31:0] dest[31:0]; ; preserve content of DEST 0001: dest[31:0] tsrc[31:0]; ; pass through src1 normal input value, denormal as zero 0010: dest[31:0] QNaN(tsrc[31:0]); 0011: dest[31:0] QNAN_Indefinite; 0100: dest[31:0] -INF; 0101: dest[31:0] +INF; 0110: dest[31:0] tsrc.sign? –INF : +INF; 0111: dest[31:0] -0; 1000: dest[31:0] +0; 1001: dest[31:0] -1; 1010: dest[31:0] +1; 1011: dest[31:0] ½; 1100: dest[31:0] 90.0; 1101: dest[31:0] PI/2; 1110: dest[31:0] MAX_FLOAT; 1111: dest[31:0] -MAX_FLOAT; } ; end of token_response CASE VFIXUPIMMPS—Fix Up Special Packed Float32 Values Vol. 2C 5-117 INSTRUCTION SET REFERENCE, V-Z } ; The required fault reporting from imm8 is extracted ; TOKENs are mutually exclusive and TOKENs priority defines the order. ; Multiple faults related to a single token can occur simultaneously. IF (tsrc[31:0] of TOKEN_TYPE: ZERO_VALUE_TOKEN) AND imm8[0] then set #ZE; IF (tsrc[31:0] of TOKEN_TYPE: ZERO_VALUE_TOKEN) AND imm8[1] then set #IE; IF (tsrc[31:0] of TOKEN_TYPE: ONE_VALUE_TOKEN) AND imm8[2] then set #ZE; IF (tsrc[31:0] of TOKEN_TYPE: ONE_VALUE_TOKEN) AND imm8[3] then set #IE; IF (tsrc[31:0] of TOKEN_TYPE: SNAN_TOKEN) AND imm8[4] then set #IE; IF (tsrc[31:0] of TOKEN_TYPE: NEG_INF_TOKEN) AND imm8[5] then set #IE; IF (tsrc[31:0] of TOKEN_TYPE: NEG_VALUE_TOKEN) AND imm8[6] then set #IE; IF (tsrc[31:0] of TOKEN_TYPE: POS_INF_TOKEN) AND imm8[7] then set #IE; ; end fault reporting return dest[31:0]; ; end of FIXUPIMM_SP() VFIXUPIMMPS (EVEX) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN DEST[i+31:i] FIXUPIMM_SP(DEST[i+31:i], SRC1[i+31:i], SRC2[31:0], imm8 [7:0]) ELSE DEST[i+31:i] FIXUPIMM_SP(DEST[i+31:i], SRC1[i+31:i], SRC2[i+31:i], imm8 [7:0]) FI; ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE DEST[i+31:i] 0 ; zeroing-masking FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-118 Vol. 2C VFIXUPIMMPS—Fix Up Special Packed Float32 Values INSTRUCTION SET REFERENCE, V-Z Immediate Control Description: 7 6 5 4 3 2 1 0 + INF - VE - INF SNaN ONE ONE ZERO ZERO #IE #IE #IE #IE #IE #ZE #IE #ZE Figure 5-10. 
VFIXUPIMMPS Immediate Control Description Intel C/C++ Compiler Intrinsic Equivalent VFIXUPIMMPS __m512 _mm512_fixupimm_ps( __m512 a, __m512i tbl, int imm); VFIXUPIMMPS __m512 _mm512_mask_fixupimm_ps(__m512 s, __mmask16 k, __m512 a, __m512i tbl, int imm); VFIXUPIMMPS __m512 _mm512_maskz_fixupimm_ps( __mmask16 k, __m512 a, __m512i tbl, int imm); VFIXUPIMMPS __m512 _mm512_fixupimm_round_ps( __m512 a, __m512i tbl, int imm, int sae); VFIXUPIMMPS __m512 _mm512_mask_fixupimm_round_ps(__m512 s, __mmask16 k, __m512 a, __m512i tbl, int imm, int sae); VFIXUPIMMPS __m512 _mm512_maskz_fixupimm_round_ps( __mmask16 k, __m512 a, __m512i tbl, int imm, int sae); VFIXUPIMMPS __m256 _mm256_fixupimm_ps( __m256 a, __m256i tbl, int imm); VFIXUPIMMPS __m256 _mm256_mask_fixupimm_ps(__m256 s, __mmask8 k, __m256 a, __m256i tbl, int imm); VFIXUPIMMPS __m256 _mm256_maskz_fixupimm_ps( __mmask8 k, __m256 a, __m256i tbl, int imm); VFIXUPIMMPS __m128 _mm_fixupimm_ps( __m128 a, __m128i tbl, int imm); VFIXUPIMMPS __m128 _mm_mask_fixupimm_ps(__m128 s, __mmask8 k, __m128 a, __m128i tbl, int imm); VFIXUPIMMPS __m128 _mm_maskz_fixupimm_ps( __mmask8 k, __m128 a, __m128i tbl, int imm); SIMD Floating-Point Exceptions Zero, Invalid Other Exceptions See Exceptions Type E2. VFIXUPIMMPS—Fix Up Special Packed Float32 Values Vol. 2C 5-119 INSTRUCTION SET REFERENCE, V-Z VFIXUPIMMSD—Fix Up Special Scalar Float64 Value Opcode/ Instruction Op / En EVEX.NDS.LIG.66.0F3A.W1 55 /r ib VFIXUPIMMSD xmm1 {k1}{z}, xmm2, xmm3/m64{sae}, imm8 T1S 64/32 bit Mode Support V/V CPUID Feature Flag AVX512F Description Fix up a float64 number in the low quadword element of xmm2 using scalar int32 table in xmm3/m64 and store the result in xmm1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 T1S ModRM:reg (r, w) EVEX.vvvv ModRM:r/m (r) Imm8 Description Perform a fix-up of the low quadword element encoded in double-precision floating-point format in the first source operand (the second operand) using a 32-bit, two-level look-up table specified in the low quadword element of the second source operand (the third operand) with exception reporting specifier imm8. The element that is fixed-up is selected by mask bit of 1 specified in the opmask k1. Mask bit of 0 in the opmask k1 or table response action of 0000b preserves the corresponding element of the first operand. The fixed-up element from the first source operand or the preserved element in the first operand becomes the low quadword element of the destination operand (the first operand). Bits 127:64 of the destination operand is copied from the corresponding bits of the first source operand. The destination and first source operands are XMM registers. The second source operand can be a XMM register or a 64- bit memory location. The two-level look-up table perform a fix-up of each DP FP input data in the first source operand by decoding the input data encoding into 8 token types. A response table is defined for each token type that converts the input encoding in the first source operand with one of 16 response actions. This instruction is specifically intended for use in fixing up the results of arithmetic calculations involving one source so that they match the spec, although it is generally useful for fixing up the results of multiple-instruction sequences to reflect special-number inputs. For example, consider rcp(0). Input 0 to rcp, and you should get INF according to the DX10 spec. However, evaluating rcp via Newton-Raphson, where x=approx(1/0), yields an incorrect result. 
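As a sketch of the rcp(0) fix-up pattern described above: the masked form separates the computed result (preserved by response 0000b) from the original input being classified. Prototype follows the listing above; compiler headers may differ slightly, and the function names are illustrative.

#include <immintrin.h>

/* Keep the Newton-Raphson reciprocal result for ordinary inputs
   (response 0000b preserves the destination) and substitute ±INF,
   by the sign of the input, when x is ±0 (token j = 2, response
   0110b). Only nibble j = 2 of the table is non-zero. */
__m512 rcp_fixup(__m512 nr_result, __m512 x)
{
    const __m512i tbl = _mm512_set1_epi32(0x00000600);
    return _mm512_mask_fixupimm_ps(nr_result, (__mmask16)0xFFFF,
                                   x, tbl, 0 /* no #ZE/#IE reporting */);
}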
VFIXUPIMMSD—Fix Up Special Scalar Float64 Value

EVEX.NDS.LIG.66.0F3A.W1 55 /r ib VFIXUPIMMSD xmm1 {k1}{z}, xmm2, xmm3/m64{sae}, imm8
    Op/En: T1S; 64/32-bit Mode: V/V; CPUID: AVX512F.
    Fix up a float64 number in the low quadword element of xmm2 using scalar int32 table in xmm3/m64 and store the result in xmm1.

Instruction Operand Encoding

Op/En  Operand 1         Operand 2   Operand 3      Operand 4
T1S    ModRM:reg (r, w)  EVEX.vvvv   ModRM:r/m (r)  Imm8

Description

Perform a fix-up of the low quadword element encoded in double-precision floating-point format in the first source operand (the second operand) using a 32-bit, two-level look-up table specified in the low quadword element of the second source operand (the third operand), with exception reporting specifier imm8. The element that is fixed up is selected by a mask bit of 1 specified in the opmask k1. A mask bit of 0 in the opmask k1, or a table response action of 0000b, preserves the corresponding element of the first operand. The fixed-up element from the first source operand or the preserved element in the first operand becomes the low quadword element of the destination operand (the first operand). Bits 127:64 of the destination operand are copied from the corresponding bits of the first source operand. The destination and first source operands are XMM registers. The second source operand can be an XMM register or a 64-bit memory location.

The two-level look-up table performs a fix-up of each double-precision floating-point input by decoding the input data encoding into 8 token types. A response table is defined for each token type that converts the input encoding in the first source operand with one of 16 response actions.

This instruction is specifically intended for use in fixing up the results of arithmetic calculations involving one source so that they match the spec, although it is generally useful for fixing up the results of multiple-instruction sequences to reflect special-number inputs. For example, consider rcp(0). Input 0 to rcp, and you should get INF according to the DX10 spec. However, evaluating rcp via Newton-Raphson, where x = approx(1/0), yields an incorrect result. To deal with this, VFIXUPIMMSD can be used after the N-R reciprocal sequence to set the result to the correct value (i.e., INF when the input is 0).

If MXCSR.DAZ is not set, denormal input elements in the first source operand are considered as normal inputs and do not trigger any fixup nor fault reporting.

Imm8 is used to set the required flags reporting. It supports #ZE and #IE fault reporting (see details below). MXCSR.DAZ is honored and refers to the source xmm2 only (i.e., xmm1 is not considered as zero in case MXCSR.DAZ is set). MXCSR mask bits are ignored and are treated as if all mask bits are set to masked response. If any of the imm8 bits is set and the condition is met for fault reporting, MXCSR.IE or MXCSR.ZE might be updated.

Operation

enum TOKEN_TYPE {
    QNAN_TOKEN ← 0,
    SNAN_TOKEN ← 1,
    ZERO_VALUE_TOKEN ← 2,
    POS_ONE_VALUE_TOKEN ← 3,
    NEG_INF_TOKEN ← 4,
    POS_INF_TOKEN ← 5,
    NEG_VALUE_TOKEN ← 6,
    POS_VALUE_TOKEN ← 7
}

FIXUPIMM_DP (dest[63:0], src1[63:0], tbl3[63:0], imm8[7:0]) {
    tsrc[63:0] ← ((src1[62:52] = 0) AND (MXCSR.DAZ = 1)) ? 0.0 : src1[63:0]
    CASE (tsrc[63:0] of TOKEN_TYPE) {
        QNAN_TOKEN: j ← 0;
        SNAN_TOKEN: j ← 1;
        ZERO_VALUE_TOKEN: j ← 2;
        POS_ONE_VALUE_TOKEN: j ← 3;
        NEG_INF_TOKEN: j ← 4;
        POS_INF_TOKEN: j ← 5;
        NEG_VALUE_TOKEN: j ← 6;
        POS_VALUE_TOKEN: j ← 7;
    } ; end source special CASE(tsrc...)
    ; The required response from the src3 table is extracted
    token_response[3:0] ← tbl3[3+4*j:4*j];
    CASE (token_response[3:0]) {
        0000: dest[63:0] ← dest[63:0];          ; preserve content of DEST
        0001: dest[63:0] ← tsrc[63:0];          ; pass through src1 normal input value, denormal as zero
        0010: dest[63:0] ← QNaN(tsrc[63:0]);
        0011: dest[63:0] ← QNAN_Indefinite;
        0100: dest[63:0] ← -INF;
        0101: dest[63:0] ← +INF;
        0110: dest[63:0] ← tsrc.sign ? -INF : +INF;
        0111: dest[63:0] ← -0;
        1000: dest[63:0] ← +0;
        1001: dest[63:0] ← -1;
        1010: dest[63:0] ← +1;
        1011: dest[63:0] ← ½;
        1100: dest[63:0] ← 90.0;
        1101: dest[63:0] ← PI/2;
        1110: dest[63:0] ← MAX_FLOAT;
        1111: dest[63:0] ← -MAX_FLOAT;
    } ; end of token_response CASE
    ; The required fault reporting from imm8 is extracted.
    ; TOKENs are mutually exclusive and TOKEN priority defines the order.
    ; Multiple faults related to a single token can occur simultaneously.
    IF (tsrc[63:0] of TOKEN_TYPE: ZERO_VALUE_TOKEN) AND imm8[0] THEN set #ZE;
    IF (tsrc[63:0] of TOKEN_TYPE: ZERO_VALUE_TOKEN) AND imm8[1] THEN set #IE;
    IF (tsrc[63:0] of TOKEN_TYPE: ONE_VALUE_TOKEN) AND imm8[2] THEN set #ZE;
    IF (tsrc[63:0] of TOKEN_TYPE: ONE_VALUE_TOKEN) AND imm8[3] THEN set #IE;
    IF (tsrc[63:0] of TOKEN_TYPE: SNAN_TOKEN) AND imm8[4] THEN set #IE;
    IF (tsrc[63:0] of TOKEN_TYPE: NEG_INF_TOKEN) AND imm8[5] THEN set #IE;
    IF (tsrc[63:0] of TOKEN_TYPE: NEG_VALUE_TOKEN) AND imm8[6] THEN set #IE;
    IF (tsrc[63:0] of TOKEN_TYPE: POS_INF_TOKEN) AND imm8[7] THEN set #IE;
    ; end fault reporting
    return dest[63:0];
} ; end of FIXUPIMM_DP()

VFIXUPIMMSD (EVEX encoded version)
IF k1[0] OR *no writemask*
    THEN DEST[63:0] ← FIXUPIMM_DP(DEST[63:0], SRC1[63:0], SRC2[63:0], imm8[7:0])
    ELSE IF *merging-masking*            ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE DEST[63:0] ← 0          ; zeroing-masking
         FI
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAX_VL-1:128] ← 0

Immediate Control Description:

Figure 5-11. VFIXUPIMMSD Immediate Control Description

imm8 bit  Input token  Fault reported when the bit is set
7         +INF         #IE
6         -VE          #IE
5         -INF         #IE
4         SNaN         #IE
3         ONE          #IE
2         ONE          #ZE
1         ZERO         #IE
0         ZERO         #ZE

Intel C/C++ Compiler Intrinsic Equivalent

VFIXUPIMMSD __m128d _mm_fixupimm_sd( __m128d a, __m128i tbl, int imm);
VFIXUPIMMSD __m128d _mm_mask_fixupimm_sd(__m128d s, __mmask8 k, __m128d a, __m128i tbl, int imm);
VFIXUPIMMSD __m128d _mm_maskz_fixupimm_sd( __mmask8 k, __m128d a, __m128i tbl, int imm);
VFIXUPIMMSD __m128d _mm_fixupimm_round_sd( __m128d a, __m128i tbl, int imm, int sae);
VFIXUPIMMSD __m128d _mm_mask_fixupimm_round_sd(__m128d s, __mmask8 k, __m128d a, __m128i tbl, int imm, int sae);
VFIXUPIMMSD __m128d _mm_maskz_fixupimm_round_sd( __mmask8 k, __m128d a, __m128i tbl, int imm, int sae);

SIMD Floating-Point Exceptions

Zero, Invalid

Other Exceptions

See Exceptions Type E3.

VFIXUPIMMSS—Fix Up Special Scalar Float32 Value

EVEX.NDS.LIG.66.0F3A.W0 55 /r ib VFIXUPIMMSS xmm1 {k1}{z}, xmm2, xmm3/m32{sae}, imm8
    Op/En: T1S; 64/32-bit Mode: V/V; CPUID: AVX512F.
    Fix up a float32 number in the low doubleword element in xmm2 using scalar int32 table in xmm3/m32 and store the result in xmm1.

Instruction Operand Encoding

Op/En  Operand 1         Operand 2   Operand 3      Operand 4
T1S    ModRM:reg (r, w)  EVEX.vvvv   ModRM:r/m (r)  Imm8

Description

Perform a fix-up of the low doubleword element encoded in single-precision floating-point format in the first source operand (the second operand) using a 32-bit, two-level look-up table specified in the low doubleword element of the second source operand (the third operand), with exception reporting specifier imm8. The element that is fixed up is selected by a mask bit of 1 specified in the opmask k1. A mask bit of 0 in the opmask k1, or a table response action of 0000b, preserves the corresponding element of the first operand. The fixed-up element from the first source operand or the preserved element in the first operand becomes the low doubleword element of the destination operand (the first operand). Bits 127:32 of the destination operand are copied from the corresponding bits of the first source operand. The destination and first source operands are XMM registers. The second source operand can be an XMM register or a 32-bit memory location.

The two-level look-up table performs a fix-up of each single-precision floating-point input by decoding the input data encoding into 8 token types. A response table is defined for each token type that converts the input encoding in the first source operand with one of 16 response actions.

This instruction is specifically intended for use in fixing up the results of arithmetic calculations involving one source so that they match the spec, although it is generally useful for fixing up the results of multiple-instruction sequences to reflect special-number inputs. For example, consider rcp(0). Input 0 to rcp, and you should get INF according to the DX10 spec. However, evaluating rcp via Newton-Raphson, where x = approx(1/0), yields an incorrect result. To deal with this, VFIXUPIMMSS can be used after the N-R reciprocal sequence to set the result to the correct value (i.e., INF when the input is 0).

If MXCSR.DAZ is not set, denormal input elements in the first source operand are considered as normal inputs and do not trigger any fixup nor fault reporting.

Imm8 is used to set the required flags reporting. It supports #ZE and #IE fault reporting (see details below). MXCSR.DAZ is honored and refers to the source xmm2 only (i.e., xmm1 is not considered as zero in case MXCSR.DAZ is set). MXCSR mask bits are ignored and are treated as if all mask bits are set to masked response. If any of the imm8 bits is set and the condition is met for fault reporting, MXCSR.IE or MXCSR.ZE might be updated.

Operation

enum TOKEN_TYPE {
    QNAN_TOKEN ← 0,
    SNAN_TOKEN ← 1,
    ZERO_VALUE_TOKEN ← 2,
    POS_ONE_VALUE_TOKEN ← 3,
    NEG_INF_TOKEN ← 4,
    POS_INF_TOKEN ← 5,
    NEG_VALUE_TOKEN ← 6,
    POS_VALUE_TOKEN ← 7
}

FIXUPIMM_SP (dest[31:0], src1[31:0], tbl3[31:0], imm8[7:0]) {
    tsrc[31:0] ← ((src1[30:23] = 0) AND (MXCSR.DAZ = 1)) ? 0.0 : src1[31:0]
    CASE (tsrc[31:0] of TOKEN_TYPE) {
        QNAN_TOKEN: j ← 0;
        SNAN_TOKEN: j ← 1;
        ZERO_VALUE_TOKEN: j ← 2;
        POS_ONE_VALUE_TOKEN: j ← 3;
        NEG_INF_TOKEN: j ← 4;
        POS_INF_TOKEN: j ← 5;
        NEG_VALUE_TOKEN: j ← 6;
        POS_VALUE_TOKEN: j ← 7;
    } ; end source special CASE(tsrc...)
    ; The required response from the src3 table is extracted
    token_response[3:0] ← tbl3[3+4*j:4*j];
    CASE (token_response[3:0]) {
        0000: dest[31:0] ← dest[31:0];          ; preserve content of DEST
        0001: dest[31:0] ← tsrc[31:0];          ; pass through src1 normal input value, denormal as zero
        0010: dest[31:0] ← QNaN(tsrc[31:0]);
        0011: dest[31:0] ← QNAN_Indefinite;
        0100: dest[31:0] ← -INF;
        0101: dest[31:0] ← +INF;
        0110: dest[31:0] ← tsrc.sign ? -INF : +INF;
        0111: dest[31:0] ← -0;
        1000: dest[31:0] ← +0;
        1001: dest[31:0] ← -1;
        1010: dest[31:0] ← +1;
        1011: dest[31:0] ← ½;
        1100: dest[31:0] ← 90.0;
        1101: dest[31:0] ← PI/2;
        1110: dest[31:0] ← MAX_FLOAT;
        1111: dest[31:0] ← -MAX_FLOAT;
    } ; end of token_response CASE
    ; The required fault reporting from imm8 is extracted.
    ; TOKENs are mutually exclusive and TOKEN priority defines the order.
    ; Multiple faults related to a single token can occur simultaneously.
    IF (tsrc[31:0] of TOKEN_TYPE: ZERO_VALUE_TOKEN) AND imm8[0] THEN set #ZE;
    IF (tsrc[31:0] of TOKEN_TYPE: ZERO_VALUE_TOKEN) AND imm8[1] THEN set #IE;
    IF (tsrc[31:0] of TOKEN_TYPE: ONE_VALUE_TOKEN) AND imm8[2] THEN set #ZE;
    IF (tsrc[31:0] of TOKEN_TYPE: ONE_VALUE_TOKEN) AND imm8[3] THEN set #IE;
    IF (tsrc[31:0] of TOKEN_TYPE: SNAN_TOKEN) AND imm8[4] THEN set #IE;
    IF (tsrc[31:0] of TOKEN_TYPE: NEG_INF_TOKEN) AND imm8[5] THEN set #IE;
    IF (tsrc[31:0] of TOKEN_TYPE: NEG_VALUE_TOKEN) AND imm8[6] THEN set #IE;
    IF (tsrc[31:0] of TOKEN_TYPE: POS_INF_TOKEN) AND imm8[7] THEN set #IE;
    ; end fault reporting
    return dest[31:0];
} ; end of FIXUPIMM_SP()

VFIXUPIMMSS (EVEX encoded version)
IF k1[0] OR *no writemask*
    THEN DEST[31:0] ← FIXUPIMM_SP(DEST[31:0], SRC1[31:0], SRC2[31:0], imm8[7:0])
    ELSE IF *merging-masking*            ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE DEST[31:0] ← 0          ; zeroing-masking
         FI
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAX_VL-1:128] ← 0

Immediate Control Description:

Figure 5-12. VFIXUPIMMSS Immediate Control Description

imm8 bit  Input token  Fault reported when the bit is set
7         +INF         #IE
6         -VE          #IE
5         -INF         #IE
4         SNaN         #IE
3         ONE          #IE
2         ONE          #ZE
1         ZERO         #IE
0         ZERO         #ZE

Intel C/C++ Compiler Intrinsic Equivalent

VFIXUPIMMSS __m128 _mm_fixupimm_ss( __m128 a, __m128i tbl, int imm);
VFIXUPIMMSS __m128 _mm_mask_fixupimm_ss(__m128 s, __mmask8 k, __m128 a, __m128i tbl, int imm);
VFIXUPIMMSS __m128 _mm_maskz_fixupimm_ss( __mmask8 k, __m128 a, __m128i tbl, int imm);
VFIXUPIMMSS __m128 _mm_fixupimm_round_ss( __m128 a, __m128i tbl, int imm, int sae);
VFIXUPIMMSS __m128 _mm_mask_fixupimm_round_ss(__m128 s, __mmask8 k, __m128 a, __m128i tbl, int imm, int sae);
VFIXUPIMMSS __m128 _mm_maskz_fixupimm_round_ss( __mmask8 k, __m128 a, __m128i tbl, int imm, int sae);

SIMD Floating-Point Exceptions

Zero, Invalid

Other Exceptions

See Exceptions Type E3.
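For illustration, a minimal scalar sketch follows, using the three-operand prototype listed above (some compiler headers instead take separate destination and source vectors); the function name is illustrative only.

#include <immintrin.h>

/* Quieten a signaling NaN in the low double: table nibble j = 1
   (SNaN token) is set to 0010b, i.e. QNaN(input); all other tokens
   use 0001b (pass through). With imm8 = 0, no #IE is reported. */
__m128d quieten_snan(__m128d x)
{
    const __m128i tbl = _mm_set1_epi64x(0x11111121);
    return _mm_fixupimm_sd(x, tbl, 0);
}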
VFMADD132PD/VFMADD213PD/VFMADD231PD—Fused Multiply-Add of Packed Double-Precision Floating-Point Values

VEX.NDS.128.66.0F38.W1 98 /r VFMADD132PD xmm1, xmm2, xmm3/m128
    Op/En: RVM; 64/32-bit Mode: V/V; CPUID: FMA.
    Multiply packed double-precision floating-point values from xmm1 and xmm3/mem, add to xmm2 and put result in xmm1.
VEX.NDS.128.66.0F38.W1 A8 /r VFMADD213PD xmm1, xmm2, xmm3/m128
    Op/En: RVM; 64/32-bit Mode: V/V; CPUID: FMA.
    Multiply packed double-precision floating-point values from xmm1 and xmm2, add to xmm3/mem and put result in xmm1.
VEX.NDS.128.66.0F38.W1 B8 /r VFMADD231PD xmm1, xmm2, xmm3/m128
    Op/En: RVM; 64/32-bit Mode: V/V; CPUID: FMA.
    Multiply packed double-precision floating-point values from xmm2 and xmm3/mem, add to xmm1 and put result in xmm1.
VEX.NDS.256.66.0F38.W1 98 /r VFMADD132PD ymm1, ymm2, ymm3/m256
    Op/En: RVM; 64/32-bit Mode: V/V; CPUID: FMA.
    Multiply packed double-precision floating-point values from ymm1 and ymm3/mem, add to ymm2 and put result in ymm1.
VEX.NDS.256.66.0F38.W1 A8 /r VFMADD213PD ymm1, ymm2, ymm3/m256
    Op/En: RVM; 64/32-bit Mode: V/V; CPUID: FMA.
    Multiply packed double-precision floating-point values from ymm1 and ymm2, add to ymm3/mem and put result in ymm1.
VEX.NDS.256.66.0F38.W1 B8 /r VFMADD231PD ymm1, ymm2, ymm3/m256
    Op/En: RVM; 64/32-bit Mode: V/V; CPUID: FMA.
    Multiply packed double-precision floating-point values from ymm2 and ymm3/mem, add to ymm1 and put result in ymm1.
EVEX.NDS.128.66.0F38.W1 98 /r VFMADD132PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
    Op/En: FV; 64/32-bit Mode: V/V; CPUID: AVX512VL AVX512F.
    Multiply packed double-precision floating-point values from xmm1 and xmm3/m128/m64bcst, add to xmm2 and put result in xmm1.
EVEX.NDS.128.66.0F38.W1 A8 /r VFMADD213PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
    Op/En: FV; 64/32-bit Mode: V/V; CPUID: AVX512VL AVX512F.
    Multiply packed double-precision floating-point values from xmm1 and xmm2, add to xmm3/m128/m64bcst and put result in xmm1.
EVEX.NDS.128.66.0F38.W1 B8 /r VFMADD231PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
    Op/En: FV; 64/32-bit Mode: V/V; CPUID: AVX512VL AVX512F.
    Multiply packed double-precision floating-point values from xmm2 and xmm3/m128/m64bcst, add to xmm1 and put result in xmm1.
EVEX.NDS.256.66.0F38.W1 98 /r VFMADD132PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
    Op/En: FV; 64/32-bit Mode: V/V; CPUID: AVX512VL AVX512F.
    Multiply packed double-precision floating-point values from ymm1 and ymm3/m256/m64bcst, add to ymm2 and put result in ymm1.
EVEX.NDS.256.66.0F38.W1 A8 /r VFMADD213PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
    Op/En: FV; 64/32-bit Mode: V/V; CPUID: AVX512VL AVX512F.
    Multiply packed double-precision floating-point values from ymm1 and ymm2, add to ymm3/m256/m64bcst and put result in ymm1.
EVEX.NDS.256.66.0F38.W1 B8 /r VFMADD231PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
    Op/En: FV; 64/32-bit Mode: V/V; CPUID: AVX512VL AVX512F.
    Multiply packed double-precision floating-point values from ymm2 and ymm3/m256/m64bcst, add to ymm1 and put result in ymm1.
EVEX.NDS.512.66.0F38.W1 98 /r VFMADD132PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
    Op/En: FV; 64/32-bit Mode: V/V; CPUID: AVX512F.
    Multiply packed double-precision floating-point values from zmm1 and zmm3/m512/m64bcst, add to zmm2 and put result in zmm1.
EVEX.NDS.512.66.0F38.W1 A8 /r VFMADD213PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
    Op/En: FV; 64/32-bit Mode: V/V; CPUID: AVX512F.
    Multiply packed double-precision floating-point values from zmm1 and zmm2, add to zmm3/m512/m64bcst and put result in zmm1.
EVEX.NDS.512.66.0F38.W1 B8 /r VFMADD231PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
    Op/En: FV; 64/32-bit Mode: V/V; CPUID: AVX512F.
    Multiply packed double-precision floating-point values from zmm2 and zmm3/m512/m64bcst, add to zmm1 and put result in zmm1.

Instruction Operand Encoding

Op/En  Operand 1         Operand 2      Operand 3      Operand 4
RVM    ModRM:reg (r, w)  VEX.vvvv (r)   ModRM:r/m (r)  NA
FV     ModRM:reg (r, w)  EVEX.vvvv (r)  ModRM:r/m (r)  NA

Description

Performs a set of SIMD multiply-add computations on packed double-precision floating-point values using three source operands and writes the multiply-add results in the destination operand. The destination operand is also the first source operand. The second operand must be a SIMD register. The third source operand can be a SIMD register or a memory location.

VFMADD132PD: Multiplies the two, four or eight packed double-precision floating-point values from the first source operand with the two, four or eight packed double-precision floating-point values in the third source operand, adds the infinite precision intermediate result to the two, four or eight packed double-precision floating-point values in the second source operand, performs rounding and stores the resulting two, four or eight packed double-precision floating-point values to the destination operand (first source operand).

VFMADD213PD: Multiplies the two, four or eight packed double-precision floating-point values from the second source operand with the two, four or eight packed double-precision floating-point values in the first source operand, adds the infinite precision intermediate result to the two, four or eight packed double-precision floating-point values in the third source operand, performs rounding and stores the resulting two, four or eight packed double-precision floating-point values to the destination operand (first source operand).

VFMADD231PD: Multiplies the two, four or eight packed double-precision floating-point values from the second source operand with the two, four or eight packed double-precision floating-point values in the third source operand, adds the infinite precision intermediate result to the two, four or eight packed double-precision floating-point values in the first source operand, performs rounding and stores the resulting two, four or eight packed double-precision floating-point values to the destination operand (first source operand).

EVEX encoded versions: The destination operand (also first source operand) is a ZMM register and encoded in reg_field. The second source operand is a ZMM register and encoded in EVEX.vvvv. The third source operand is a ZMM register, a 512-bit memory location, or a 512-bit vector broadcast from a 64-bit memory location. The destination operand is conditionally updated with write mask k1.

VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field.

VEX.128 encoded version: The destination operand (also first source operand) is an XMM register and encoded in reg_field. The second source operand is an XMM register and encoded in VEX.vvvv. The third source operand is an XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed.

Operation

In the operations below, "*" and "+" symbols represent multiplication and addition with infinite precision inputs and outputs (no rounding).

VFMADD132PD DEST, SRC2, SRC3 (VEX encoded version)
IF (VEX.128) THEN
    MAXNUM ← 2
ELSEIF (VEX.256)
    MAXNUM ← 4
FI
FOR i = 0 to MAXNUM-1 {
    n ← 64*i;
    DEST[n+63:n] ← RoundFPControl_MXCSR(DEST[n+63:n]*SRC3[n+63:n] + SRC2[n+63:n])
}
IF (VEX.128) THEN
    DEST[MAX_VL-1:128] ← 0
ELSEIF (VEX.256)
    DEST[MAX_VL-1:256] ← 0
FI

VFMADD213PD DEST, SRC2, SRC3 (VEX encoded version)
IF (VEX.128) THEN
    MAXNUM ← 2
ELSEIF (VEX.256)
    MAXNUM ← 4
FI
FOR i = 0 to MAXNUM-1 {
    n ← 64*i;
    DEST[n+63:n] ← RoundFPControl_MXCSR(SRC2[n+63:n]*DEST[n+63:n] + SRC3[n+63:n])
}
IF (VEX.128) THEN
    DEST[MAX_VL-1:128] ← 0
ELSEIF (VEX.256)
    DEST[MAX_VL-1:256] ← 0
FI

VFMADD231PD DEST, SRC2, SRC3 (VEX encoded version)
IF (VEX.128) THEN
    MAXNUM ← 2
ELSEIF (VEX.256)
    MAXNUM ← 4
FI
FOR i = 0 to MAXNUM-1 {
    n ← 64*i;
    DEST[n+63:n] ← RoundFPControl_MXCSR(SRC2[n+63:n]*SRC3[n+63:n] + DEST[n+63:n])
}
IF (VEX.128) THEN
    DEST[MAX_VL-1:128] ← 0
ELSEIF (VEX.256)
    DEST[MAX_VL-1:256] ← 0
FI

VFMADD132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← RoundFPControl(DEST[i+63:i]*SRC3[i+63:i] + SRC2[i+63:i])
        ELSE IF *merging-masking*            ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0        ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFMADD132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN IF (EVEX.b = 1)
                THEN DEST[i+63:i] ← RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[63:0] + SRC2[i+63:i])
                ELSE DEST[i+63:i] ← RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[i+63:i] + SRC2[i+63:i])
             FI;
        ELSE IF *merging-masking*            ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0        ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFMADD213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← RoundFPControl(SRC2[i+63:i]*DEST[i+63:i] + SRC3[i+63:i])
        ELSE IF *merging-masking*            ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0        ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFMADD213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN IF (EVEX.b = 1)
                THEN DEST[i+63:i] ← RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] + SRC3[63:0])
                ELSE DEST[i+63:i] ← RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] + SRC3[i+63:i])
             FI;
        ELSE IF *merging-masking*            ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0        ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFMADD231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← RoundFPControl(SRC2[i+63:i]*SRC3[i+63:i] + DEST[i+63:i])
        ELSE IF *merging-masking*            ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0        ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFMADD231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN IF (EVEX.b = 1)
                THEN DEST[i+63:i] ← RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[63:0] + DEST[i+63:i])
                ELSE DEST[i+63:i] ← RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[i+63:i] + DEST[i+63:i])
             FI;
        ELSE IF *merging-masking*            ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0        ; zeroing-masking
             FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VFMADDxxxPD __m512d _mm512_fmadd_pd(__m512d a, __m512d b, __m512d c);
VFMADDxxxPD __m512d _mm512_fmadd_round_pd(__m512d a, __m512d b, __m512d c, int r);
VFMADDxxxPD __m512d _mm512_mask_fmadd_pd(__m512d a, __mmask8 k, __m512d b, __m512d c);
VFMADDxxxPD __m512d _mm512_maskz_fmadd_pd(__mmask8 k, __m512d a, __m512d b, __m512d c);
VFMADDxxxPD __m512d _mm512_mask3_fmadd_pd(__m512d a, __m512d b, __m512d c, __mmask8 k);
VFMADDxxxPD __m512d _mm512_mask_fmadd_round_pd(__m512d a, __mmask8 k, __m512d b, __m512d c, int r);
VFMADDxxxPD __m512d _mm512_maskz_fmadd_round_pd(__mmask8 k, __m512d a, __m512d b, __m512d c, int r);
VFMADDxxxPD __m512d _mm512_mask3_fmadd_round_pd(__m512d a, __m512d b, __m512d c, __mmask8 k, int r);
VFMADDxxxPD __m256d _mm256_mask_fmadd_pd(__m256d a, __mmask8 k, __m256d b, __m256d c);
VFMADDxxxPD __m256d _mm256_maskz_fmadd_pd(__mmask8 k, __m256d a, __m256d b, __m256d c);
VFMADDxxxPD __m256d _mm256_mask3_fmadd_pd(__m256d a, __m256d b, __m256d c, __mmask8 k);
VFMADDxxxPD __m128d _mm_mask_fmadd_pd(__m128d a, __mmask8 k, __m128d b, __m128d c);
VFMADDxxxPD __m128d _mm_maskz_fmadd_pd(__mmask8 k, __m128d a, __m128d b, __m128d c);
VFMADDxxxPD __m128d _mm_mask3_fmadd_pd(__m128d a, __m128d b, __m128d c, __mmask8 k);
VFMADDxxxPD __m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c);
VFMADDxxxPD __m256d _mm256_fmadd_pd (__m256d a, __m256d b, __m256d c);

SIMD Floating-Point Exceptions

Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions

VEX-encoded instructions, see Exceptions Type 2.
EVEX-encoded instructions, see Exceptions Type E2.
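For illustration, a minimal C sketch follows (the function name is illustrative). The single _mm256_fmadd_pd intrinsic covers all three operand orderings; the compiler typically selects the 132/213/231 form based on which register it reuses as the destination.

#include <immintrin.h>

/* One step of a dot-product accumulation: acc = a*b + acc, computed
   with a single rounding, as in the 231 form above. */
__m256d dot_step(__m256d acc, __m256d a, __m256d b)
{
    return _mm256_fmadd_pd(a, b, acc);
}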
VFMADD132PS/VFMADD213PS/VFMADD231PS—Fused Multiply-Add of Packed Single-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.NDS.128.66.0F38.W0 98 /r VFMADD132PS xmm1, xmm2, xmm3/m128 | RVM | V/V | FMA | Multiply packed single-precision floating-point values from xmm1 and xmm3/mem, add to xmm2 and put result in xmm1.
VEX.NDS.128.66.0F38.W0 A8 /r VFMADD213PS xmm1, xmm2, xmm3/m128 | RVM | V/V | FMA | Multiply packed single-precision floating-point values from xmm1 and xmm2, add to xmm3/mem and put result in xmm1.
VEX.NDS.128.66.0F38.W0 B8 /r VFMADD231PS xmm1, xmm2, xmm3/m128 | RVM | V/V | FMA | Multiply packed single-precision floating-point values from xmm2 and xmm3/mem, add to xmm1 and put result in xmm1.
VEX.NDS.256.66.0F38.W0 98 /r VFMADD132PS ymm1, ymm2, ymm3/m256 | RVM | V/V | FMA | Multiply packed single-precision floating-point values from ymm1 and ymm3/mem, add to ymm2 and put result in ymm1.
VEX.NDS.256.66.0F38.W0 A8 /r VFMADD213PS ymm1, ymm2, ymm3/m256 | RVM | V/V | FMA | Multiply packed single-precision floating-point values from ymm1 and ymm2, add to ymm3/mem and put result in ymm1.
VEX.NDS.256.66.0F38.W0 B8 /r VFMADD231PS ymm1, ymm2, ymm3/m256 | RVM | V/V | FMA | Multiply packed single-precision floating-point values from ymm2 and ymm3/mem, add to ymm1 and put result in ymm1.
EVEX.NDS.128.66.0F38.W0 98 /r VFMADD132PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply packed single-precision floating-point values from xmm1 and xmm3/m128/m32bcst, add to xmm2 and put result in xmm1.
EVEX.NDS.128.66.0F38.W0 A8 /r VFMADD213PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply packed single-precision floating-point values from xmm1 and xmm2, add to xmm3/m128/m32bcst and put result in xmm1.
EVEX.NDS.128.66.0F38.W0 B8 /r VFMADD231PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply packed single-precision floating-point values from xmm2 and xmm3/m128/m32bcst, add to xmm1 and put result in xmm1.
EVEX.NDS.256.66.0F38.W0 98 /r VFMADD132PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply packed single-precision floating-point values from ymm1 and ymm3/m256/m32bcst, add to ymm2 and put result in ymm1.
EVEX.NDS.256.66.0F38.W0 A8 /r VFMADD213PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply packed single-precision floating-point values from ymm1 and ymm2, add to ymm3/m256/m32bcst and put result in ymm1.
EVEX.NDS.256.66.0F38.W0 B8 /r VFMADD231PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply packed single-precision floating-point values from ymm2 and ymm3/m256/m32bcst, add to ymm1 and put result in ymm1.
EVEX.NDS.512.66.0F38.W0 98 /r VFMADD132PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | FV | V/V | AVX512F | Multiply packed single-precision floating-point values from zmm1 and zmm3/m512/m32bcst, add to zmm2 and put result in zmm1.
EVEX.NDS.512.66.0F38.W0 A8 /r VFMADD213PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | FV | V/V | AVX512F | Multiply packed single-precision floating-point values from zmm1 and zmm2, add to zmm3/m512/m32bcst and put result in zmm1.
EVEX.NDS.512.66.0F38.W0 B8 /r VFMADD231PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | FV | V/V | AVX512F | Multiply packed single-precision floating-point values from zmm2 and zmm3/m512/m32bcst, add to zmm1 and put result in zmm1.
VFMADD132PS/VFMADD213PS/VFMADD231PS—Fused Multiply-Add of Packed Single-Precision Floating-Point Values Vol. 2C 5-133 INSTRUCTION SET REFERENCE, V-Z Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RVM ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) NA FV ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) NA Description Performs a set of SIMD multiply-add computation on packed single-precision floating-point values using three source operands and writes the multiply-add results in the destination operand. The destination operand is also the first source operand. The second operand must be a SIMD register. The third source operand can be a SIMD register or a memory location. VFMADD132PS: Multiplies the four, eight or sixteen packed single-precision floating-point values from the first source operand to the four, eight or sixteen packed single-precision floating-point values in the third source operand, adds the infinite precision intermediate result to the four, eight or sixteen packed single-precision floating-point values in the second source operand, performs rounding and stores the resulting four, eight or sixteen packed single-precision floating-point values to the destination operand (first source operand). VFMADD213PS: Multiplies the four, eight or sixteen packed single-precision floating-point values from the second source operand to the four, eight or sixteen packed single-precision floating-point values in the first source operand, adds the infinite precision intermediate result to the four, eight or sixteen packed single-precision floating-point values in the third source operand, performs rounding and stores the resulting the four, eight or sixteen packed single-precision floating-point values to the destination operand (first source operand). VFMADD231PS: Multiplies the four, eight or sixteen packed single-precision floating-point values from the second source operand to the four, eight or sixteen packed single-precision floating-point values in the third source operand, adds the infinite precision intermediate result to the four, eight or sixteen packed single-precision floating-point values in the first source operand, performs rounding and stores the resulting four, eight or sixteen packed single-precision floating-point values to the destination operand (first source operand). EVEX encoded versions: The destination operand (also first source operand) is a ZMM register and encoded in reg_field. The second source operand is a ZMM register and encoded in EVEX.vvvv. The third source operand is a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 32-bit memory location. The destination operand is conditionally updated with write mask k1. VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field. VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed. 5-134 Vol. 
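As an illustration of the m32bcst form (an assumption about typical compiler behavior, not a statement from this manual): when the third source is the same scalar for every lane, a compiler can usually fold a set1 load into the EVEX embedded-broadcast encoding instead of materializing a full vector.

#include <immintrin.h>

/* y[i] = a[i]*x[i] + c for all 16 lanes; c is a single 32-bit value in memory.
   With AVX-512, the broadcast may be encoded as m32bcst on the FMA itself. */
__m512 axpc(__m512 a, __m512 x, const float *c) {
    return _mm512_fmadd_ps(a, x, _mm512_set1_ps(*c));
}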
2C VFMADD132PS/VFMADD213PS/VFMADD231PS—Fused Multiply-Add of Packed Single-Precision Floating-Point Values INSTRUCTION SET REFERENCE, V-Z Operation In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no rounding). VFMADD132PS DEST, SRC2, SRC3 IF (VEX.128) THEN MAXNUM 4 ELSEIF (VEX.256) MAXNUM 8 FI For i = 0 to MAXNUM-1 { n 32*i; DEST[n+31:n] RoundFPControl_MXCSR(DEST[n+31:n]*SRC3[n+31:n] + SRC2[n+31:n]) } IF (VEX.128) THEN DEST[MAX_VL-1:128] 0 ELSEIF (VEX.256) DEST[MAX_VL-1:256] 0 FI VFMADD213PS DEST, SRC2, SRC3 IF (VEX.128) THEN MAXNUM 4 ELSEIF (VEX.256) MAXNUM 8 FI For i = 0 to MAXNUM-1 { n 32*i; DEST[n+31:n] RoundFPControl_MXCSR(SRC2[n+31:n]*DEST[n+31:n] + SRC3[n+31:n]) } IF (VEX.128) THEN DEST[MAX_VL-1:128] 0 ELSEIF (VEX.256) DEST[MAX_VL-1:256] 0 FI VFMADD231PS DEST, SRC2, SRC3 IF (VEX.128) THEN MAXNUM 4 ELSEIF (VEX.256) MAXNUM 8 FI For i = 0 to MAXNUM-1 { n 32*i; DEST[n+31:n] RoundFPControl_MXCSR(SRC2[n+31:n]*SRC3[n+31:n] + DEST[n+31:n]) } IF (VEX.128) THEN DEST[MAX_VL-1:128] 0 ELSEIF (VEX.256) DEST[MAX_VL-1:256] 0 FI VFMADD132PS/VFMADD213PS/VFMADD231PS—Fused Multiply-Add of Packed Single-Precision Floating-Point Values Vol. 2C 5-135 INSTRUCTION SET REFERENCE, V-Z VFMADD132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register) (KL, VL) = (4, 128), (8, 256), (16, 512) IF (VL = 512) AND (EVEX.b = 1) THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] RoundFPControl(DEST[i+31:i]*SRC3[i+31:i] + SRC2[i+31:i]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFMADD132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN IF (EVEX.b = 1) THEN DEST[i+31:i] RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[31:0] + SRC2[i+31:i]) ELSE DEST[i+31:i] RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[i+31:i] + SRC2[i+31:i]) FI; ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-136 Vol. 2C VFMADD132PS/VFMADD213PS/VFMADD231PS—Fused Multiply-Add of Packed Single-Precision Floating-Point Values INSTRUCTION SET REFERENCE, V-Z VFMADD213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register) (KL, VL) = (4, 128), (8, 256), (16, 512) IF (VL = 512) AND (EVEX.b = 1) THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] RoundFPControl(SRC2[i+31:i]*DEST[i+31:i] + SRC3[i+31:i]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFMADD132PS/VFMADD213PS/VFMADD231PS—Fused Multiply-Add of Packed Single-Precision Floating-Point Values Vol. 
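The masking branches in the pseudocode can be restated as a scalar C model (an expository sketch, not manual text); the kl, k1 and zeroing parameters are invented names mirroring KL, the k1 writemask, and the {z} bit above.

#include <math.h>
#include <stdint.h>

/* Scalar model of VFMADD132PS with writemask semantics over kl float lanes. */
void vfmadd132ps_masked(float *dest, const float *src2, const float *src3,
                        uint16_t k1, int kl, int zeroing) {
    for (int j = 0; j < kl; j++) {
        if (k1 & (1u << j))
            dest[j] = fmaf(dest[j], src3[j], src2[j]); /* dest*src3 + src2, one rounding */
        else if (zeroing)
            dest[j] = 0.0f;                            /* {z}: zeroing-masking */
        /* else: merging-masking, lane left unchanged */
    }
}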
2C 5-137 INSTRUCTION SET REFERENCE, V-Z VFMADD213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN IF (EVEX.b = 1) THEN DEST[i+31:i] RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] + SRC3[31:0]) ELSE DEST[i+31:i] RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] + SRC3[i+31:i]) FI; ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFMADD231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register) (KL, VL) = (4, 128), (8, 256), (16, 512) IF (VL = 512) AND (EVEX.b = 1) THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] RoundFPControl(SRC2[i+31:i]*SRC3[i+31:i] + DEST[i+31:i]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-138 Vol. 2C VFMADD132PS/VFMADD213PS/VFMADD231PS—Fused Multiply-Add of Packed Single-Precision Floating-Point Values INSTRUCTION SET REFERENCE, V-Z VFMADD231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN IF (EVEX.b = 1) THEN DEST[i+31:i] RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[31:0] + DEST[i+31:i]) ELSE DEST[i+31:i] RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[i+31:i] + DEST[i+31:i]) FI; ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 Intel C/C++ Compiler Intrinsic Equivalent VFMADDxxxPS __m512 _mm512_fmadd_ps(__m512 a, __m512 b, __m512 c); VFMADDxxxPS __m512 _mm512_fmadd_round_ps(__m512 a, __m512 b, __m512 c, int r); VFMADDxxxPS __m512 _mm512_mask_fmadd_ps(__m512 a, __mmask16 k, __m512 b, __m512 c); VFMADDxxxPS __m512 _mm512_maskz_fmadd_ps(__mmask16 k, __m512 a, __m512 b, __m512 c); VFMADDxxxPS __m512 _mm512_mask3_fmadd_ps(__m512 a, __m512 b, __m512 c, __mmask16 k); VFMADDxxxPS __m512 _mm512_mask_fmadd_round_ps(__m512 a, __mmask16 k, __m512 b, __m512 c, int r); VFMADDxxxPS __m512 _mm512_maskz_fmadd_round_ps(__mmask16 k, __m512 a, __m512 b, __m512 c, int r); VFMADDxxxPS __m512 _mm512_mask3_fmadd_round_ps(__m512 a, __m512 b, __m512 c, __mmask16 k, int r); VFMADDxxxPS __m256 _mm256_mask_fmadd_ps(__m256 a, __mmask8 k, __m256 b, __m256 c); VFMADDxxxPS __m256 _mm256_maskz_fmadd_ps(__mmask8 k, __m256 a, __m256 b, __m256 c); VFMADDxxxPS __m256 _mm256_mask3_fmadd_ps(__m256 a, __m256 b, __m256 c, __mmask8 k); VFMADDxxxPS __m128 _mm_mask_fmadd_ps(__m128 a, __mmask8 k, __m128 b, __m128 c); VFMADDxxxPS __m128 _mm_maskz_fmadd_ps(__mmask8 k, __m128 a, __m128 b, __m128 c); VFMADDxxxPS __m128 _mm_mask3_fmadd_ps(__m128 a, __m128 b, __m128 c, __mmask8 k); VFMADDxxxPS __m128 _mm_fmadd_ps (__m128 a, __m128 b, __m128 c); VFMADDxxxPS __m256 _mm256_fmadd_ps (__m256 a, __m256 b, __m256 c); SIMD Floating-Point Exceptions Overflow, Underflow, Invalid, Precision, Denormal Other Exceptions VEX-encoded instructions, see Exceptions Type 2. EVEX-encoded instructions, see Exceptions Type E2. VFMADD132PS/VFMADD213PS/VFMADD231PS—Fused Multiply-Add of Packed Single-Precision Floating-Point Values Vol. 
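A common use of these instructions (illustrative only) is Horner evaluation of a polynomial, where each step is one fused multiply-add. _mm256_fmadd_ps is taken from the intrinsics list above; the coefficients are placeholders.

#include <immintrin.h>

/* p(x) = c3*x^3 + c2*x^2 + c1*x + c0, evaluated with three FMAs over 8 lanes. */
__m256 poly3(__m256 x) {
    const __m256 c3 = _mm256_set1_ps(0.5f), c2 = _mm256_set1_ps(-1.0f);
    const __m256 c1 = _mm256_set1_ps(2.0f), c0 = _mm256_set1_ps(0.25f);
    __m256 p = _mm256_fmadd_ps(c3, x, c2);  /* c3*x + c2 */
    p = _mm256_fmadd_ps(p, x, c1);          /* (...)*x + c1 */
    return _mm256_fmadd_ps(p, x, c0);       /* (...)*x + c0 */
}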
2C 5-139 INSTRUCTION SET REFERENCE, V-Z VFMADD132SD/VFMADD213SD/VFMADD231SD—Fused Multiply-Add of Scalar DoublePrecision Floating-Point Values Opcode/ Instruction Op/ En RVM 64/32 bit Mode Support V/V CPUID Feature Flag FMA VEX.DDS.LIG.66.0F38.W1 99 /r VFMADD132SD xmm1, xmm2, xmm3/m64 VEX.DDS.LIG.66.0F38.W1 A9 /r VFMADD213SD xmm1, xmm2, xmm3/m64 VEX.DDS.LIG.66.0F38.W1 B9 /r VFMADD231SD xmm1, xmm2, xmm3/m64 EVEX.DDS.LIG.66.0F38.W1 99 /r VFMADD132SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} EVEX.DDS.LIG.66.0F38.W1 A9 /r VFMADD213SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} EVEX.DDS.LIG.66.0F38.W1 B9 /r VFMADD231SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} RVM V/V FMA RVM V/V FMA T1S V/V AVX512F T1S V/V AVX512F T1S V/V AVX512F Description Multiply scalar double-precision floating-point value from xmm1 and xmm3/m64, add to xmm2 and put result in xmm1. Multiply scalar double-precision floating-point value from xmm1 and xmm2, add to xmm3/m64 and put result in xmm1. Multiply scalar double-precision floating-point value from xmm2 and xmm3/m64, add to xmm1 and put result in xmm1. Multiply scalar double-precision floating-point value from xmm1 and xmm3/m64, add to xmm2 and put result in xmm1. Multiply scalar double-precision floating-point value from xmm1 and xmm2, add to xmm3/m64 and put result in xmm1. Multiply scalar double-precision floating-point value from xmm2 and xmm3/m64, add to xmm1 and put result in xmm1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RVM ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) NA T1S ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) NA Description Performs a SIMD multiply-add computation on the low double-precision floating-point values using three source operands and writes the multiply-add result in the destination operand. The destination operand is also the first source operand. The first and second operand are XMM registers. The third source operand can be an XMM register or a 64-bit memory location. VFMADD132SD: Multiplies the low double-precision floating-point value from the first source operand to the low double-precision floating-point value in the third source operand, adds the infinite precision intermediate result to the low double-precision floating-point values in the second source operand, performs rounding and stores the resulting double-precision floating-point value to the destination operand (first source operand). VFMADD213SD: Multiplies the low double-precision floating-point value from the second source operand to the low double-precision floating-point value in the first source operand, adds the infinite precision intermediate result to the low double-precision floating-point value in the third source operand, performs rounding and stores the resulting double-precision floating-point value to the destination operand (first source operand). VFMADD231SD: Multiplies the low double-precision floating-point value from the second source to the low doubleprecision floating-point value in the third source operand, adds the infinite precision intermediate result to the low double-precision floating-point value in the first source operand, performs rounding and stores the resulting double-precision floating-point value to the destination operand (first source operand). VEX.128 and EVEX encoded version: The destination operand (also first source operand) is encoded in reg_field. The second source operand is encoded in VEX.vvvv/EVEX.vvvv. The third source operand is encoded in rm_field. Bits 127:64 of the destination are unchanged. 
Bits MAXVL-1:128 of the destination register are zeroed.
EVEX encoded version: The low quadword element of the destination is updated according to the writemask.

Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no rounding).

VFMADD132SD DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← RoundFPControl(DEST[63:0]*SRC3[63:0] + SRC2[63:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← DEST[127:64]
DEST[MAX_VL-1:128] ← 0

VFMADD213SD DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← RoundFPControl(SRC2[63:0]*DEST[63:0] + SRC3[63:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← DEST[127:64]
DEST[MAX_VL-1:128] ← 0

VFMADD231SD DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← RoundFPControl(SRC2[63:0]*SRC3[63:0] + DEST[63:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← DEST[127:64]
DEST[MAX_VL-1:128] ← 0

VFMADD132SD DEST, SRC2, SRC3 (VEX encoded version)
DEST[63:0] ← RoundFPControl_MXCSR(DEST[63:0]*SRC3[63:0] + SRC2[63:0])
DEST[127:64] ← DEST[127:64]
DEST[MAX_VL-1:128] ← 0

VFMADD213SD DEST, SRC2, SRC3 (VEX encoded version)
DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*DEST[63:0] + SRC3[63:0])
DEST[127:64] ← DEST[127:64]
DEST[MAX_VL-1:128] ← 0

VFMADD231SD DEST, SRC2, SRC3 (VEX encoded version)
DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*SRC3[63:0] + DEST[63:0])
DEST[127:64] ← DEST[127:64]
DEST[MAX_VL-1:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VFMADDxxxSD __m128d _mm_fmadd_round_sd(__m128d a, __m128d b, __m128d c, int r);
VFMADDxxxSD __m128d _mm_mask_fmadd_sd(__m128d a, __mmask8 k, __m128d b, __m128d c);
VFMADDxxxSD __m128d _mm_maskz_fmadd_sd(__mmask8 k, __m128d a, __m128d b, __m128d c);
VFMADDxxxSD __m128d _mm_mask3_fmadd_sd(__m128d a, __m128d b, __m128d c, __mmask8 k);
VFMADDxxxSD __m128d _mm_mask_fmadd_round_sd(__m128d a, __mmask8 k, __m128d b, __m128d c, int r);
VFMADDxxxSD __m128d _mm_maskz_fmadd_round_sd(__mmask8 k, __m128d a, __m128d b, __m128d c, int r);
VFMADDxxxSD __m128d _mm_mask3_fmadd_round_sd(__m128d a, __m128d b, __m128d c, __mmask8 k, int r);
VFMADDxxxSD __m128d _mm_fmadd_sd (__m128d a, __m128d b, __m128d c);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
VEX-encoded instructions, see Exceptions Type 3.
EVEX-encoded instructions, see Exceptions Type E3.
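To make the "infinite precision intermediate, then one rounding" point concrete, this hedged sketch (not manual text) contrasts the fused scalar operation with a separate multiply and add; _mm_fmadd_sd is from the list above, the remaining intrinsics are standard SSE2.

#include <immintrin.h>

double fused_vs_split(double a, double b, double c) {
    __m128d va = _mm_set_sd(a), vb = _mm_set_sd(b), vc = _mm_set_sd(c);
    double fused = _mm_cvtsd_f64(_mm_fmadd_sd(va, vb, vc));           /* one rounding */
    double split = _mm_cvtsd_f64(_mm_add_sd(_mm_mul_sd(va, vb), vc)); /* two roundings */
    return fused - split; /* can be nonzero: the split version rounds the product first */
}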
2C VFMADD132SD/VFMADD213SD/VFMADD231SD—Fused Multiply-Add of Scalar Double-Precision Floating-Point Values INSTRUCTION SET REFERENCE, V-Z VFMADD132SS/VFMADD213SS/VFMADD231SS—Fused Multiply-Add of Scalar Single-Precision Floating-Point Values Opcode/ Instruction Op / En RVM 64/32 bit Mode Support V/V CPUID Feature Flag FMA VEX.DDS.LIG.66.0F38.W0 99 /r VFMADD132SS xmm1, xmm2, xmm3/m32 VEX.DDS.LIG.66.0F38.W0 A9 /r VFMADD213SS xmm1, xmm2, xmm3/m32 VEX.DDS.LIG.66.0F38.W0 B9 /r VFMADD231SS xmm1, xmm2, xmm3/m32 EVEX.DDS.LIG.66.0F38.W0 99 /r VFMADD132SS xmm1 {k1}{z}, xmm2, xmm3/m32{er} EVEX.DDS.LIG.66.0F38.W0 A9 /r VFMADD213SS xmm1 {k1}{z}, xmm2, xmm3/m32{er} EVEX.DDS.LIG.66.0F38.W0 B9 /r VFMADD231SS xmm1 {k1}{z}, xmm2, xmm3/m32{er} RVM V/V FMA RVM V/V FMA T1S V/V AVX512F T1S V/V AVX512F T1S V/V AVX512F Description Multiply scalar single-precision floating-point value from xmm1 and xmm3/m32, add to xmm2 and put result in xmm1. Multiply scalar single-precision floating-point value from xmm1 and xmm2, add to xmm3/m32 and put result in xmm1. Multiply scalar single-precision floating-point value from xmm2 and xmm3/m32, add to xmm1 and put result in xmm1. Multiply scalar single-precision floating-point value from xmm1 and xmm3/m32, add to xmm2 and put result in xmm1. Multiply scalar single-precision floating-point value from xmm1 and xmm2, add to xmm3/m32 and put result in xmm1. Multiply scalar single-precision floating-point value from xmm2 and xmm3/m32, add to xmm1 and put result in xmm1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RVM ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) NA T1S ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) NA Description Performs a SIMD multiply-add computation on single-precision floating-point values using three source operands and writes the multiply-add results in the destination operand. The destination operand is also the first source operand. The first and second operands are XMM registers. The third source operand can be a XMM register or a 32-bit memory location. VFMADD132SS: Multiplies the low single-precision floating-point value from the first source operand to the low single-precision floating-point value in the third source operand, adds the infinite precision intermediate result to the low single-precision floating-point value in the second source operand, performs rounding and stores the resulting single-precision floating-point value to the destination operand (first source operand). VFMADD213SS: Multiplies the low single-precision floating-point value from the second source operand to the low single-precision floating-point value in the first source operand, adds the infinite precision intermediate result to the low single-precision floating-point value in the third source operand, performs rounding and stores the resulting single-precision floating-point value to the destination operand (first source operand). VFMADD231SS: Multiplies the low single-precision floating-point value from the second source operand to the low single-precision floating-point value in the third source operand, adds the infinite precision intermediate result to the low single-precision floating-point value in the first source operand, performs rounding and stores the resulting single-precision floating-point value to the destination operand (first source operand). VEX.128 and EVEX encoded version: The destination operand (also first source operand) is encoded in reg_field. The second source operand is encoded in VEX.vvvv/EVEX.vvvv. 
The third source operand is encoded in rm_field. Bits 127:32 of the destination are unchanged. Bits MAXVL-1:128 of the destination register are zeroed.
EVEX encoded version: The low doubleword element of the destination is updated according to the writemask.
Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column.

Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no rounding).

VFMADD132SS DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← RoundFPControl(DEST[31:0]*SRC3[31:0] + SRC2[31:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

VFMADD213SS DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← RoundFPControl(SRC2[31:0]*DEST[31:0] + SRC3[31:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

VFMADD231SS DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← RoundFPControl(SRC2[31:0]*SRC3[31:0] + DEST[31:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

VFMADD132SS DEST, SRC2, SRC3 (VEX encoded version)
DEST[31:0] ← RoundFPControl_MXCSR(DEST[31:0]*SRC3[31:0] + SRC2[31:0])
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

VFMADD213SS DEST, SRC2, SRC3 (VEX encoded version)
DEST[31:0] ← RoundFPControl_MXCSR(SRC2[31:0]*DEST[31:0] + SRC3[31:0])
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

VFMADD231SS DEST, SRC2, SRC3 (VEX encoded version)
DEST[31:0] ← RoundFPControl_MXCSR(SRC2[31:0]*SRC3[31:0] + DEST[31:0])
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VFMADDxxxSS __m128 _mm_fmadd_round_ss(__m128 a, __m128 b, __m128 c, int r);
VFMADDxxxSS __m128 _mm_mask_fmadd_ss(__m128 a, __mmask8 k, __m128 b, __m128 c);
VFMADDxxxSS __m128 _mm_maskz_fmadd_ss(__mmask8 k, __m128 a, __m128 b, __m128 c);
VFMADDxxxSS __m128 _mm_mask3_fmadd_ss(__m128 a, __m128 b, __m128 c, __mmask8 k);
VFMADDxxxSS __m128 _mm_mask_fmadd_round_ss(__m128 a, __mmask8 k, __m128 b, __m128 c, int r);
VFMADDxxxSS __m128 _mm_maskz_fmadd_round_ss(__mmask8 k, __m128 a, __m128 b, __m128 c, int r);
VFMADDxxxSS __m128 _mm_mask3_fmadd_round_ss(__m128 a, __m128 b, __m128 c, __mmask8 k, int r);
VFMADDxxxSS __m128 _mm_fmadd_ss (__m128 a, __m128 b, __m128 c);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision,
Denormal Other Exceptions VEX-encoded instructions, see Exceptions Type 3. EVEX-encoded instructions, see Exceptions Type E3. VFMADD132SS/VFMADD213SS/VFMADD231SS—Fused Multiply-Add of Scalar Single-Precision Floating-Point Values Vol. 2C 5-145 INSTRUCTION SET REFERENCE, V-Z VFMADDSUB132PD/VFMADDSUB213PD/VFMADDSUB231PD—Fused Multiply-Alternating Add/Subtract of Packed Double-Precision Floating-Point Values Opcode/ Instruction Op / En RVM 64/32 bit Mode Support V/V CPUID Feature Flag FMA VEX.DDS.128.66.0F38.W1 96 /r VFMADDSUB132PD xmm1, xmm2, xmm3/m128 VEX.DDS.128.66.0F38.W1 A6 /r VFMADDSUB213PD xmm1, xmm2, xmm3/m128 VEX.DDS.128.66.0F38.W1 B6 /r VFMADDSUB231PD xmm1, xmm2, xmm3/m128 VEX.DDS.256.66.0F38.W1 96 /r VFMADDSUB132PD ymm1, ymm2, ymm3/m256 VEX.DDS.256.66.0F38.W1 A6 /r VFMADDSUB213PD ymm1, ymm2, ymm3/m256 VEX.DDS.256.66.0F38.W1 B6 /r VFMADDSUB231PD ymm1, ymm2, ymm3/m256 EVEX.DDS.128.66.0F38.W1 A6 /r VFMADDSUB213PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst RVM V/V FMA RVM V/V FMA RVM V/V FMA RVM V/V FMA RVM V/V FMA FV V/V AVX512VL AVX512F EVEX.DDS.128.66.0F38.W1 B6 /r VFMADDSUB231PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst FV V/V AVX512VL AVX512F EVEX.DDS.128.66.0F38.W1 96 /r VFMADDSUB132PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst FV V/V AVX512VL AVX512F EVEX.DDS.256.66.0F38.W1 A6 /r VFMADDSUB213PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst FV V/V AVX512VL AVX512F EVEX.DDS.256.66.0F38.W1 B6 /r VFMADDSUB231PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst FV V/V AVX512VL AVX512F EVEX.DDS.256.66.0F38.W1 96 /r VFMADDSUB132PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst FV V/V AVX512VL AVX512F 5-146 Vol. 2C Description Multiply packed double-precision floating-point values from xmm1 and xmm3/mem, add/subtract elements in xmm2 and put result in xmm1. Multiply packed double-precision floating-point values from xmm1 and xmm2, add/subtract elements in xmm3/mem and put result in xmm1. Multiply packed double-precision floating-point values from xmm2 and xmm3/mem, add/subtract elements in xmm1 and put result in xmm1. Multiply packed double-precision floating-point values from ymm1 and ymm3/mem, add/subtract elements in ymm2 and put result in ymm1. Multiply packed double-precision floating-point values from ymm1 and ymm2, add/subtract elements in ymm3/mem and put result in ymm1. Multiply packed double-precision floating-point values from ymm2 and ymm3/mem, add/subtract elements in ymm1 and put result in ymm1. Multiply packed double-precision floating-point values from xmm1 and xmm2, add/subtract elements in xmm3/m128/m64bcst and put result in xmm1 subject to writemask k1. Multiply packed double-precision floating-point values from xmm2 and xmm3/m128/m64bcst, add/subtract elements in xmm1 and put result in xmm1 subject to writemask k1. Multiply packed double-precision floating-point values from xmm1 and xmm3/m128/m64bcst, add/subtract elements in xmm2 and put result in xmm1 subject to writemask k1. Multiply packed double-precision floating-point values from ymm1 and ymm2, add/subtract elements in ymm3/m256/m64bcst and put result in ymm1 subject to writemask k1. Multiply packed double-precision floating-point values from ymm2 and ymm3/m256/m64bcst, add/subtract elements in ymm1 and put result in ymm1 subject to writemask k1. Multiply packed double-precision floating-point values from ymm1 and ymm3/m256/m64bcst, add/subtract elements in ymm2 and put result in ymm1 subject to writemask k1. 
VFMADDSUB132PD/VFMADDSUB213PD/VFMADDSUB231PD—Fused Multiply-Alternating Add/Subtract of Packed Double-Precision INSTRUCTION SET REFERENCE, V-Z Opcode/ Instruction Op / En FV 64/32 bit Mode Support V/V CPUID Feature Flag AVX512F EVEX.DDS.512.66.0F38.W1 A6 /r VFMADDSUB213PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} EVEX.DDS.512.66.0F38.W1 B6 /r VFMADDSUB231PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} FV V/V AVX512F EVEX.DDS.512.66.0F38.W1 96 /r VFMADDSUB132PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} FV V/V AVX512F Description Multiply packed double-precision floating-point values from zmm1and zmm2, add/subtract elements in zmm3/m512/m64bcst and put result in zmm1 subject to writemask k1. Multiply packed double-precision floating-point values from zmm2 and zmm3/m512/m64bcst, add/subtract elements in zmm1 and put result in zmm1 subject to writemask k1. Multiply packed double-precision floating-point values from zmm1 and zmm3/m512/m64bcst, add/subtract elements in zmm2 and put result in zmm1 subject to writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RVM ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) NA FV ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) NA Description VFMADDSUB132PD: Multiplies the two, four, or eight packed double-precision floating-point values from the first source operand to the two or four packed double-precision floating-point values in the third source operand. From the infinite precision intermediate result, adds the odd double-precision floating-point elements and subtracts the even double-precision floating-point values in the second source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand). VFMADDSUB213PD: Multiplies the two, four, or eight packed double-precision floating-point values from the second source operand to the two or four packed double-precision floating-point values in the first source operand. From the infinite precision intermediate result, adds the odd double-precision floating-point elements and subtracts the even double-precision floating-point values in the third source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand). VFMADDSUB231PD: Multiplies the two, four, or eight packed double-precision floating-point values from the second source operand to the two or four packed double-precision floating-point values in the third source operand. From the infinite precision intermediate result, adds the odd double-precision floating-point elements and subtracts the even double-precision floating-point values in the first source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand). EVEX encoded versions: The destination operand (also first source operand) and the second source operand are ZMM/YMM/XMM register. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is conditionally updated with write mask k1. VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. 
The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed.
Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column.

Operation
In the operations below, “*”, “+” and “-” symbols represent multiplication, addition and subtraction with infinite precision inputs and outputs (no rounding).

VFMADDSUB132PD DEST, SRC2, SRC3
IF (VEX.128) THEN
    DEST[63:0] ← RoundFPControl_MXCSR(DEST[63:0]*SRC3[63:0] - SRC2[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(DEST[127:64]*SRC3[127:64] + SRC2[127:64])
    DEST[MAX_VL-1:128] ← 0
ELSEIF (VEX.256)
    DEST[63:0] ← RoundFPControl_MXCSR(DEST[63:0]*SRC3[63:0] - SRC2[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(DEST[127:64]*SRC3[127:64] + SRC2[127:64])
    DEST[191:128] ← RoundFPControl_MXCSR(DEST[191:128]*SRC3[191:128] - SRC2[191:128])
    DEST[255:192] ← RoundFPControl_MXCSR(DEST[255:192]*SRC3[255:192] + SRC2[255:192])
FI

VFMADDSUB213PD DEST, SRC2, SRC3
IF (VEX.128) THEN
    DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*DEST[63:0] - SRC3[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(SRC2[127:64]*DEST[127:64] + SRC3[127:64])
    DEST[MAX_VL-1:128] ← 0
ELSEIF (VEX.256)
    DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*DEST[63:0] - SRC3[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(SRC2[127:64]*DEST[127:64] + SRC3[127:64])
    DEST[191:128] ← RoundFPControl_MXCSR(SRC2[191:128]*DEST[191:128] - SRC3[191:128])
    DEST[255:192] ← RoundFPControl_MXCSR(SRC2[255:192]*DEST[255:192] + SRC3[255:192])
FI

VFMADDSUB231PD DEST, SRC2, SRC3
IF (VEX.128) THEN
    DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*SRC3[63:0] - DEST[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(SRC2[127:64]*SRC3[127:64] + DEST[127:64])
    DEST[MAX_VL-1:128] ← 0
ELSEIF (VEX.256)
    DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*SRC3[63:0] - DEST[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(SRC2[127:64]*SRC3[127:64] + DEST[127:64])
    DEST[191:128] ← RoundFPControl_MXCSR(SRC2[191:128]*SRC3[191:128] - DEST[191:128])
    DEST[255:192] ← RoundFPControl_MXCSR(SRC2[255:192]*SRC3[255:192] + DEST[255:192])
FI
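The even-subtract/odd-add pattern above can be restated as a short scalar model (expository only, not manual text): even-indexed elements get multiply-subtract, odd-indexed elements get multiply-add.

#include <math.h>

/* Scalar model of the 213 form over n double lanes:
   even lanes: src2*dest - src3; odd lanes: src2*dest + src3. */
void vfmaddsub213pd_ref(double *dest, const double *src2, const double *src3, int n) {
    for (int j = 0; j < n; j++)
        dest[j] = (j % 2 == 0) ? fma(src2[j], dest[j], -src3[j])
                               : fma(src2[j], dest[j],  src3[j]);
}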
2C VFMADDSUB132PD/VFMADDSUB213PD/VFMADDSUB231PD—Fused Multiply-Alternating Add/Subtract of Packed Double-Precision INSTRUCTION SET REFERENCE, V-Z VFMADDSUB132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register) (KL, VL) = (2, 128), (4, 256), (8, 512) IF (VL = 512) AND (EVEX.b = 1) THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN IF j *is even* THEN DEST[i+63:i] RoundFPControl(DEST[i+63:i]*SRC3[i+63:i] - SRC2[i+63:i]) ELSE DEST[i+63:i] RoundFPControl(DEST[i+63:i]*SRC3[i+63:i] + SRC2[i+63:i]) FI ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFMADDSUB132PD/VFMADDSUB213PD/VFMADDSUB231PD—Fused Multiply-Alternating Add/Subtract of Packed Double-Precision Vol. 2C 5-149 INSTRUCTION SET REFERENCE, V-Z VFMADDSUB132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source) (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN IF j *is even* THEN IF (EVEX.b = 1) THEN DEST[i+63:i] RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[63:0] - SRC2[i+63:i]) ELSE DEST[i+63:i] RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[i+63:i] - SRC2[i+63:i]) FI; ELSE IF (EVEX.b = 1) THEN DEST[i+63:i] RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[63:0] + SRC2[i+63:i]) ELSE DEST[i+63:i] RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[i+63:i] + SRC2[i+63:i]) FI; FI ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-150 Vol. 2C VFMADDSUB132PD/VFMADDSUB213PD/VFMADDSUB231PD—Fused Multiply-Alternating Add/Subtract of Packed Double-Precision INSTRUCTION SET REFERENCE, V-Z VFMADDSUB213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register) (KL, VL) = (2, 128), (4, 256), (8, 512) IF (VL = 512) AND (EVEX.b = 1) THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN IF j *is even* THEN DEST[i+63:i] RoundFPControl(SRC2[i+63:i]*DEST[i+63:i] - SRC3[i+63:i]) ELSE DEST[i+63:i] RoundFPControl(SRC2[i+63:i]*DEST[i+63:i] + SRC3[i+63:i]) FI ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFMADDSUB132PD/VFMADDSUB213PD/VFMADDSUB231PD—Fused Multiply-Alternating Add/Subtract of Packed Double-Precision Vol. 2C 5-151 INSTRUCTION SET REFERENCE, V-Z VFMADDSUB213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source) (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN IF j *is even* THEN IF (EVEX.b = 1) THEN DEST[i+63:i] RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] - SRC3[63:0]) ELSE DEST[i+63:i] RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] - SRC3[i+63:i]) FI; ELSE IF (EVEX.b = 1) THEN DEST[i+63:i] RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] + SRC3[63:0]) ELSE DEST[i+63:i] RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] + SRC3[i+63:i]) FI; FI ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-152 Vol. 
2C VFMADDSUB132PD/VFMADDSUB213PD/VFMADDSUB231PD—Fused Multiply-Alternating Add/Subtract of Packed Double-Precision INSTRUCTION SET REFERENCE, V-Z VFMADDSUB231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register) (KL, VL) = (2, 128), (4, 256), (8, 512) IF (VL = 512) AND (EVEX.b = 1) THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN IF j *is even* THEN DEST[i+63:i] RoundFPControl(SRC2[i+63:i]*SRC3[i+63:i] - DEST[i+63:i]) ELSE DEST[i+63:i] RoundFPControl(SRC2[i+63:i]*SRC3[i+63:i] + DEST[i+63:i]) FI ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFMADDSUB132PD/VFMADDSUB213PD/VFMADDSUB231PD—Fused Multiply-Alternating Add/Subtract of Packed Double-Precision Vol. 2C 5-153 INSTRUCTION SET REFERENCE, V-Z VFMADDSUB231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source) (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN IF j *is even* THEN IF (EVEX.b = 1) THEN DEST[i+63:i] RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[63:0] - DEST[i+63:i]) ELSE DEST[i+63:i] RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[i+63:i] - DEST[i+63:i]) FI; ELSE IF (EVEX.b = 1) THEN DEST[i+63:i] RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[63:0] + DEST[i+63:i]) ELSE DEST[i+63:i] RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[i+63:i] + DEST[i+63:i]) FI; FI ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-154 Vol. 2C VFMADDSUB132PD/VFMADDSUB213PD/VFMADDSUB231PD—Fused Multiply-Alternating Add/Subtract of Packed Double-Precision INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VFMADDSUBxxxPD __m512d _mm512_fmaddsub_pd(__m512d a, __m512d b, __m512d c); VFMADDSUBxxxPD __m512d _mm512_fmaddsub_round_pd(__m512d a, __m512d b, __m512d c, int r); VFMADDSUBxxxPD __m512d _mm512_mask_fmaddsub_pd(__m512d a, __mmask8 k, __m512d b, __m512d c); VFMADDSUBxxxPD __m512d _mm512_maskz_fmaddsub_pd(__mmask8 k, __m512d a, __m512d b, __m512d c); VFMADDSUBxxxPD __m512d _mm512_mask3_fmaddsub_pd(__m512d a, __m512d b, __m512d c, __mmask8 k); VFMADDSUBxxxPD __m512d _mm512_mask_fmaddsub_round_pd(__m512d a, __mmask8 k, __m512d b, __m512d c, int r); VFMADDSUBxxxPD __m512d _mm512_maskz_fmaddsub_round_pd(__mmask8 k, __m512d a, __m512d b, __m512d c, int r); VFMADDSUBxxxPD __m512d _mm512_mask3_fmaddsub_round_pd(__m512d a, __m512d b, __m512d c, __mmask8 k, int r); VFMADDSUBxxxPD __m256d _mm256_mask_fmaddsub_pd(__m256d a, __mmask8 k, __m256d b, __m256d c); VFMADDSUBxxxPD __m256d _mm256_maskz_fmaddsub_pd(__mmask8 k, __m256d a, __m256d b, __m256d c); VFMADDSUBxxxPD __m256d _mm256_mask3_fmaddsub_pd(__m256d a, __m256d b, __m256d c, __mmask8 k); VFMADDSUBxxxPD __m128d _mm_mask_fmaddsub_pd(__m128d a, __mmask8 k, __m128d b, __m128d c); VFMADDSUBxxxPD __m128d _mm_maskz_fmaddsub_pd(__mmask8 k, __m128d a, __m128d b, __m128d c); VFMADDSUBxxxPD __m128d _mm_mask3_fmaddsub_pd(__m128d a, __m128d b, __m128d c, __mmask8 k); VFMADDSUBxxxPD __m128d _mm_fmaddsub_pd (__m128d a, __m128d b, __m128d c); VFMADDSUBxxxPD __m256d _mm256_fmaddsub_pd (__m256d a, __m256d b, __m256d c); SIMD Floating-Point Exceptions Overflow, Underflow, Invalid, Precision, Denormal Other Exceptions VEX-encoded instructions, see Exceptions Type 2. EVEX-encoded instructions, see Exceptions Type E2. 
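One motivating use of the alternating form (an illustration, not from this manual) is complex multiplication with interleaved real/imaginary data. _mm_fmaddsub_pd comes from the list above; the shuffles are standard SSE2.

#include <immintrin.h>

/* (x + yi)*(u + vi), with a stored as {x, y} and b as {u, v}:
   result = {x*u - y*v, x*v + y*u}, using the even-subtract/odd-add pattern. */
__m128d cmul(__m128d a, __m128d b) {
    __m128d xx = _mm_unpacklo_pd(a, a);          /* {x, x} */
    __m128d yy = _mm_unpackhi_pd(a, a);          /* {y, y} */
    __m128d bswap = _mm_shuffle_pd(b, b, 0x1);   /* {v, u} */
    return _mm_fmaddsub_pd(xx, b, _mm_mul_pd(yy, bswap)); /* {x*u - y*v, x*v + y*u} */
}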
VFMADDSUB132PD/VFMADDSUB213PD/VFMADDSUB231PD—Fused Multiply-Alternating Add/Subtract of Packed Double-Precision Vol. 2C 5-155 INSTRUCTION SET REFERENCE, V-Z VFMADDSUB132PS/VFMADDSUB213PS/VFMADDSUB231PS—Fused Multiply-Alternating Add/Subtract of Packed Single-Precision Floating-Point Values Opcode/ Instruction Op / En RVM 64/32 bit Mode Support V/V CPUID Feature Flag FMA VEX.DDS.128.66.0F38.W0 96 /r VFMADDSUB132PS xmm1, xmm2, xmm3/m128 VEX.DDS.128.66.0F38.W0 A6 /r VFMADDSUB213PS xmm1, xmm2, xmm3/m128 VEX.DDS.128.66.0F38.W0 B6 /r VFMADDSUB231PS xmm1, xmm2, xmm3/m128 VEX.DDS.256.66.0F38.W0 96 /r VFMADDSUB132PS ymm1, ymm2, ymm3/m256 VEX.DDS.256.66.0F38.W0 A6 /r VFMADDSUB213PS ymm1, ymm2, ymm3/m256 VEX.DDS.256.66.0F38.W0 B6 /r VFMADDSUB231PS ymm1, ymm2, ymm3/m256 EVEX.DDS.128.66.0F38.W0 A6 /r VFMADDSUB213PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst RVM V/V FMA RVM V/V FMA RVM V/V FMA RVM V/V FMA RVM V/V FMA FV V/V AVX512VL AVX512F EVEX.DDS.128.66.0F38.W0 B6 /r VFMADDSUB231PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst EVEX.DDS.128.66.0F38.W0 96 /r VFMADDSUB132PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst EVEX.DDS.256.66.0F38.W0 A6 /r VFMADDSUB213PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst FV V/V AVX512VL AVX512F FV V/V AVX512VL AVX512F FV V/V AVX512VL AVX512F EVEX.DDS.256.66.0F38.W0 B6 /r VFMADDSUB231PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst EVEX.DDS.256.66.0F38.W0 96 /r VFMADDSUB132PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst EVEX.DDS.512.66.0F38.W0 A6 /r VFMADDSUB213PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} FV V/V AVX512VL AVX512F FV V/V AVX512VL AVX512F FV V/V AVX512F EVEX.DDS.512.66.0F38.W0 B6 /r VFMADDSUB231PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} EVEX.DDS.512.66.0F38.W0 96 /r VFMADDSUB132PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} FV V/V AVX512F FV V/V AVX512F 5-156 Vol. 2C Description Multiply packed single-precision floating-point values from xmm1 and xmm3/mem, add/subtract elements in xmm2 and put result in xmm1. Multiply packed single-precision floating-point values from xmm1 and xmm2, add/subtract elements in xmm3/mem and put result in xmm1. Multiply packed single-precision floating-point values from xmm2 and xmm3/mem, add/subtract elements in xmm1 and put result in xmm1. Multiply packed single-precision floating-point values from ymm1 and ymm3/mem, add/subtract elements in ymm2 and put result in ymm1. Multiply packed single-precision floating-point values from ymm1 and ymm2, add/subtract elements in ymm3/mem and put result in ymm1. Multiply packed single-precision floating-point values from ymm2 and ymm3/mem, add/subtract elements in ymm1 and put result in ymm1. Multiply packed single-precision floating-point values from xmm1 and xmm2, add/subtract elements in xmm3/m128/m32bcst and put result in xmm1 subject to writemask k1. Multiply packed single-precision floating-point values from xmm2 and xmm3/m128/m32bcst, add/subtract elements in xmm1 and put result in xmm1 subject to writemask k1. Multiply packed single-precision floating-point values from xmm1 and xmm3/m128/m32bcst, add/subtract elements in zmm2 and put result in xmm1 subject to writemask k1. Multiply packed single-precision floating-point values from ymm1 and ymm2, add/subtract elements in ymm3/m256/m32bcst and put result in ymm1 subject to writemask k1. Multiply packed single-precision floating-point values from ymm2 and ymm3/m256/m32bcst, add/subtract elements in ymm1 and put result in ymm1 subject to writemask k1. 
Multiply packed single-precision floating-point values from ymm1 and ymm3/m256/m32bcst, add/subtract elements in ymm2 and put result in ymm1 subject to writemask k1. Multiply packed single-precision floating-point values from zmm1 and zmm2, add/subtract elements in zmm3/m512/m32bcst and put result in zmm1 subject to writemask k1. Multiply packed single-precision floating-point values from zmm2 and zmm3/m512/m32bcst, add/subtract elements in zmm1 and put result in zmm1 subject to writemask k1. Multiply packed single-precision floating-point values from zmm1 and zmm3/m512/m32bcst, add/subtract elements in zmm2 and put result in zmm1 subject to writemask k1. VFMADDSUB132PS/VFMADDSUB213PS/VFMADDSUB231PS—Fused Multiply-Alternating Add/Subtract of Packed Single-Precision INSTRUCTION SET REFERENCE, V-Z Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RVM ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) NA FV ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) NA Description VFMADDSUB132PS: Multiplies the four, eight or sixteen packed single-precision floating-point values from the first source operand to the corresponding packed single-precision floating-point values in the third source operand. From the infinite precision intermediate result, adds the odd single-precision floating-point elements and subtracts the even single-precision floating-point values in the second source operand, performs rounding and stores the resulting packed single-precision floating-point values to the destination operand (first source operand). VFMADDSUB213PS: Multiplies the four, eight or sixteen packed single-precision floating-point values from the second source operand to the corresponding packed single-precision floating-point values in the first source operand. From the infinite precision intermediate result, adds the odd single-precision floating-point elements and subtracts the even single-precision floating-point values in the third source operand, performs rounding and stores the resulting packed single-precision floating-point values to the destination operand (first source operand). VFMADDSUB231PS: Multiplies the four, eight or sixteen packed single-precision floating-point values from the second source operand to the corresponding packed single-precision floating-point values in the third source operand. From the infinite precision intermediate result, adds the odd single-precision floating-point elements and subtracts the even single-precision floating-point values in the first source operand, performs rounding and stores the resulting packed single-precision floating-point values to the destination operand (first source operand). EVEX encoded versions: The destination operand (also first source operand) and the second source operand are ZMM/YMM/XMM register. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is conditionally updated with write mask k1. VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field. VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. 
The third source operand is a XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed.
Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column.

Operation
In the operations below, “*”, “+” and “-” symbols represent multiplication, addition and subtraction with infinite precision inputs and outputs (no rounding).

VFMADDSUB132PS DEST, SRC2, SRC3
IF (VEX.128) THEN
    MAXNUM ← 2
ELSEIF (VEX.256)
    MAXNUM ← 4
FI
For i = 0 to MAXNUM-1 {
    n ← 64*i;
    DEST[n+31:n] ← RoundFPControl_MXCSR(DEST[n+31:n]*SRC3[n+31:n] - SRC2[n+31:n])
    DEST[n+63:n+32] ← RoundFPControl_MXCSR(DEST[n+63:n+32]*SRC3[n+63:n+32] + SRC2[n+63:n+32])
}
IF (VEX.128) THEN
    DEST[MAX_VL-1:128] ← 0
ELSEIF (VEX.256)
    DEST[MAX_VL-1:256] ← 0
FI

VFMADDSUB213PS DEST, SRC2, SRC3
IF (VEX.128) THEN
    MAXNUM ← 2
ELSEIF (VEX.256)
    MAXNUM ← 4
FI
For i = 0 to MAXNUM-1 {
    n ← 64*i;
    DEST[n+31:n] ← RoundFPControl_MXCSR(SRC2[n+31:n]*DEST[n+31:n] - SRC3[n+31:n])
    DEST[n+63:n+32] ← RoundFPControl_MXCSR(SRC2[n+63:n+32]*DEST[n+63:n+32] + SRC3[n+63:n+32])
}
IF (VEX.128) THEN
    DEST[MAX_VL-1:128] ← 0
ELSEIF (VEX.256)
    DEST[MAX_VL-1:256] ← 0
FI

VFMADDSUB231PS DEST, SRC2, SRC3
IF (VEX.128) THEN
    MAXNUM ← 2
ELSEIF (VEX.256)
    MAXNUM ← 4
FI
For i = 0 to MAXNUM-1 {
    n ← 64*i;
    DEST[n+31:n] ← RoundFPControl_MXCSR(SRC2[n+31:n]*SRC3[n+31:n] - DEST[n+31:n])
    DEST[n+63:n+32] ← RoundFPControl_MXCSR(SRC2[n+63:n+32]*SRC3[n+63:n+32] + DEST[n+63:n+32])
}
IF (VEX.128) THEN
    DEST[MAX_VL-1:128] ← 0
ELSEIF (VEX.256)
    DEST[MAX_VL-1:256] ← 0
FI

VFMADDSUB132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF j *is even*
                THEN DEST[i+31:i] ← RoundFPControl(DEST[i+31:i]*SRC3[i+31:i] - SRC2[i+31:i])
                ELSE DEST[i+31:i] ← RoundFPControl(DEST[i+31:i]*SRC3[i+31:i] + SRC2[i+31:i])
            FI
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFMADDSUB132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF j *is even*
                THEN
                    IF (EVEX.b = 1)
                        THEN DEST[i+31:i] ← RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[31:0] - SRC2[i+31:i])
                        ELSE DEST[i+31:i] ← RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[i+31:i] - SRC2[i+31:i])
                    FI;
                ELSE
                    IF (EVEX.b = 1)
                        THEN DEST[i+31:i] ← RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[31:0] + SRC2[i+31:i])
                        ELSE DEST[i+31:i] ← RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[i+31:i] + SRC2[i+31:i])
                    FI;
            FI
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
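The EVEX.b branch in the memory-source pseudocode simply replays element 0 of the memory operand into every lane. A scalar sketch (expository only; kl and bcast are invented parameter names mirroring KL and EVEX.b):

#include <math.h>

/* Model of the VFMADDSUB132PS memory form over kl float lanes.
   When bcast is set (EVEX.b = 1), every lane reads src3[0] (m32bcst). */
void vfmaddsub132ps_mem_ref(float *dest, const float *src2, const float *src3,
                            int kl, int bcast) {
    for (int j = 0; j < kl; j++) {
        float s3 = bcast ? src3[0] : src3[j];
        dest[j] = (j % 2 == 0) ? fmaf(dest[j], s3, -src2[j])  /* even: dest*src3 - src2 */
                               : fmaf(dest[j], s3,  src2[j]); /* odd:  dest*src3 + src2 */
    }
}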
2C VFMADDSUB132PS/VFMADDSUB213PS/VFMADDSUB231PS—Fused Multiply-Alternating Add/Subtract of Packed Single-Precision INSTRUCTION SET REFERENCE, V-Z VFMADDSUB213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register) (KL, VL) = (4, 128), (8, 256), (16, 512) IF (VL = 512) AND (EVEX.b = 1) THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN IF j *is even* THEN DEST[i+31:i] RoundFPControl(SRC2[i+31:i]*DEST[i+31:i] - SRC3[i+31:i]) ELSE DEST[i+31:i] RoundFPControl(SRC2[i+31:i]*DEST[i+31:i] + SRC3[i+31:i]) FI ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFMADDSUB213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN IF j *is even* THEN IF (EVEX.b = 1) THEN DEST[i+31:i] RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] - SRC3[31:0]) ELSE DEST[i+31:i] RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] - SRC3[i+31:i]) FI; ELSE IF (EVEX.b = 1) THEN DEST[i+31:i] RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] + SRC3[31:0]) ELSE DEST[i+31:i] RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] + SRC3[i+31:i]) FI; VFMADDSUB132PS/VFMADDSUB213PS/VFMADDSUB231PS—Fused Multiply-Alternating Add/Subtract of Packed Single-Precision Vol. 2C 5-161 INSTRUCTION SET REFERENCE, V-Z FI ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFMADDSUB231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register) (KL, VL) = (4, 128), (8, 256), (16, 512) IF (VL = 512) AND (EVEX.b = 1) THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN IF j *is even* THEN DEST[i+31:i] RoundFPControl(SRC2[i+31:i]*SRC3[i+31:i] - DEST[i+31:i]) ELSE DEST[i+31:i] RoundFPControl(SRC2[i+31:i]*SRC3[i+31:i] + DEST[i+31:i]) FI ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-162 Vol. 2C VFMADDSUB132PS/VFMADDSUB213PS/VFMADDSUB231PS—Fused Multiply-Alternating Add/Subtract of Packed Single-Precision INSTRUCTION SET REFERENCE, V-Z VFMADDSUB231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN IF j *is even* THEN IF (EVEX.b = 1) THEN DEST[i+31:i] RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[31:0] - DEST[i+31:i]) ELSE DEST[i+31:i] RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[i+31:i] - DEST[i+31:i]) FI; ELSE IF (EVEX.b = 1) THEN DEST[i+31:i] RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[31:0] + DEST[i+31:i]) ELSE DEST[i+31:i] RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[i+31:i] + DEST[i+31:i]) FI; FI ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFMADDSUB132PS/VFMADDSUB213PS/VFMADDSUB231PS—Fused Multiply-Alternating Add/Subtract of Packed Single-Precision Vol. 
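Mirroring the double-precision sketch earlier, a usage-level illustration (not manual text) applies the single-precision form to four interleaved complex numbers at once; _mm256_fmaddsub_ps appears in the Intel C/C++ Compiler Intrinsic Equivalent list that follows, and the remaining intrinsics are standard AVX.

#include <immintrin.h>

/* Multiply four complex floats pairwise: a and b hold {re0, im0, re1, im1, ...}. */
__m256 cmul4(__m256 a, __m256 b) {
    __m256 re = _mm256_moveldup_ps(a);          /* {x0, x0, x1, x1, ...} */
    __m256 im = _mm256_movehdup_ps(a);          /* {y0, y0, y1, y1, ...} */
    __m256 bswap = _mm256_permute_ps(b, 0xB1);  /* {v0, u0, v1, u1, ...} */
    return _mm256_fmaddsub_ps(re, b, _mm256_mul_ps(im, bswap));
}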
2C 5-163 INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VFMADDSUBxxxPS __m512 _mm512_fmaddsub_ps(__m512 a, __m512 b, __m512 c); VFMADDSUBxxxPS __m512 _mm512_fmaddsub_round_ps(__m512 a, __m512 b, __m512 c, int r); VFMADDSUBxxxPS __m512 _mm512_mask_fmaddsub_ps(__m512 a, __mmask16 k, __m512 b, __m512 c); VFMADDSUBxxxPS __m512 _mm512_maskz_fmaddsub_ps(__mmask16 k, __m512 a, __m512 b, __m512 c); VFMADDSUBxxxPS __m512 _mm512_mask3_fmaddsub_ps(__m512 a, __m512 b, __m512 c, __mmask16 k); VFMADDSUBxxxPS __m512 _mm512_mask_fmaddsub_round_ps(__m512 a, __mmask16 k, __m512 b, __m512 c, int r); VFMADDSUBxxxPS __m512 _mm512_maskz_fmaddsub_round_ps(__mmask16 k, __m512 a, __m512 b, __m512 c, int r); VFMADDSUBxxxPS __m512 _mm512_mask3_fmaddsub_round_ps(__m512 a, __m512 b, __m512 c, __mmask16 k, int r); VFMADDSUBxxxPS __m256 _mm256_mask_fmaddsub_ps(__m256 a, __mmask8 k, __m256 b, __m256 c); VFMADDSUBxxxPS __m256 _mm256_maskz_fmaddsub_ps(__mmask8 k, __m256 a, __m256 b, __m256 c); VFMADDSUBxxxPS __m256 _mm256_mask3_fmaddsub_ps(__m256 a, __m256 b, __m256 c, __mmask8 k); VFMADDSUBxxxPS __m128 _mm_mask_fmaddsub_ps(__m128 a, __mmask8 k, __m128 b, __m128 c); VFMADDSUBxxxPS __m128 _mm_maskz_fmaddsub_ps(__mmask8 k, __m128 a, __m128 b, __m128 c); VFMADDSUBxxxPS __m128 _mm_mask3_fmaddsub_ps(__m128 a, __m128 b, __m128 c, __mmask8 k); VFMADDSUBxxxPS __m128 _mm_fmaddsub_ps (__m128 a, __m128 b, __m128 c); VFMADDSUBxxxPS __m256 _mm256_fmaddsub_ps (__m256 a, __m256 b, __m256 c); SIMD Floating-Point Exceptions Overflow, Underflow, Invalid, Precision, Denormal Other Exceptions VEX-encoded instructions, see Exceptions Type 2. EVEX-encoded instructions, see Exceptions Type E2. 5-164 Vol. 2C VFMADDSUB132PS/VFMADDSUB213PS/VFMADDSUB231PS—Fused Multiply-Alternating Add/Subtract of Packed Single-Precision INSTRUCTION SET REFERENCE, V-Z VFMSUBADD132PD/VFMSUBADD213PD/VFMSUBADD231PD—Fused Multiply-Alternating Subtract/Add of Packed Double-Precision Floating-Point Values Opcode/ Instruction Op / En RVM 64/32 bit Mode Support V/V CPUID Feature Flag FMA VEX.DDS.128.66.0F38.W1 97 /r VFMSUBADD132PD xmm1, xmm2, xmm3/m128 VEX.DDS.128.66.0F38.W1 A7 /r VFMSUBADD213PD xmm1, xmm2, xmm3/m128 VEX.DDS.128.66.0F38.W1 B7 /r VFMSUBADD231PD xmm1, xmm2, xmm3/m128 VEX.DDS.256.66.0F38.W1 97 /r VFMSUBADD132PD ymm1, ymm2, ymm3/m256 VEX.DDS.256.66.0F38.W1 A7 /r VFMSUBADD213PD ymm1, ymm2, ymm3/m256 VEX.DDS.256.66.0F38.W1 B7 /r VFMSUBADD231PD ymm1, ymm2, ymm3/m256 EVEX.DDS.128.66.0F38.W1 97 /r VFMSUBADD132PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst RVM V/V FMA RVM V/V FMA RVM V/V FMA RVM V/V FMA RVM V/V FMA FV V/V AVX512VL AVX512F EVEX.DDS.128.66.0F38.W1 A7 /r VFMSUBADD213PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst FV V/V AVX512VL AVX512F EVEX.DDS.128.66.0F38.W1 B7 /r VFMSUBADD231PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst FV V/V AVX512VL AVX512F EVEX.DDS.256.66.0F38.W1 97 /r VFMSUBADD132PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst FV V/V AVX512VL AVX512F EVEX.DDS.256.66.0F38.W1 A7 /r VFMSUBADD213PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst FV V/V AVX512VL AVX512F EVEX.DDS.256.66.0F38.W1 B7 /r VFMSUBADD231PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst FV V/V AVX512VL AVX512F Description Multiply packed double-precision floating-point values from xmm1 and xmm3/mem, subtract/add elements in xmm2 and put result in xmm1. Multiply packed double-precision floating-point values from xmm1 and xmm2, subtract/add elements in xmm3/mem and put result in xmm1. 
Multiply packed double-precision floating-point values from xmm2 and xmm3/mem, subtract/add elements in xmm1 and put result in xmm1. Multiply packed double-precision floating-point values from ymm1 and ymm3/mem, subtract/add elements in ymm2 and put result in ymm1. Multiply packed double-precision floating-point values from ymm1 and ymm2, subtract/add elements in ymm3/mem and put result in ymm1. Multiply packed double-precision floating-point values from ymm2 and ymm3/mem, subtract/add elements in ymm1 and put result in ymm1. Multiply packed double-precision floating-point values from xmm1 and xmm3/m128/m64bcst, subtract/add elements in xmm2 and put result in xmm1 subject to writemask k1. Multiply packed double-precision floating-point values from xmm1 and xmm2, subtract/add elements in xmm3/m128/m64bcst and put result in xmm1 subject to writemask k1. Multiply packed double-precision floating-point values from xmm2 and xmm3/m128/m64bcst, subtract/add elements in xmm1 and put result in xmm1 subject to writemask k1. Multiply packed double-precision floating-point values from ymm1 and ymm3/m256/m64bcst, subtract/add elements in ymm2 and put result in ymm1 subject to writemask k1. Multiply packed double-precision floating-point values from ymm1 and ymm2, subtract/add elements in ymm3/m256/m64bcst and put result in ymm1 subject to writemask k1. Multiply packed double-precision floating-point values from ymm2 and ymm3/m256/m64bcst, subtract/add elements in ymm1 and put result in ymm1 subject to writemask k1. VFMSUBADD132PD/VFMSUBADD213PD/VFMSUBADD231PD—Fused Multiply-Alternating Subtract/Add of Packed Double-Precision Vol. 2C 5-165 INSTRUCTION SET REFERENCE, V-Z Opcode/ Instruction Op / En FV 64/32 bit Mode Support V/V CPUID Feature Flag AVX512F EVEX.DDS.512.66.0F38.W1 97 /r VFMSUBADD132PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} EVEX.DDS.512.66.0F38.W1 A7 /r VFMSUBADD213PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} FV V/V AVX512F EVEX.DDS.512.66.0F38.W1 B7 /r VFMSUBADD231PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} FV V/V AVX512F Description Multiply packed double-precision floating-point values from zmm1 and zmm3/m512/m64bcst, subtract/add elements in zmm2 and put result in zmm1 subject to writemask k1. Multiply packed double-precision floating-point values from zmm1 and zmm2, subtract/add elements in zmm3/m512/m64bcst and put result in zmm1 subject to writemask k1. Multiply packed double-precision floating-point values from zmm2 and zmm3/m512/m64bcst, subtract/add elements in zmm1 and put result in zmm1 subject to writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RVM ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) NA FV ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) NA Description VFMSUBADD132PD: Multiplies the two, four, or eight packed double-precision floating-point values from the first source operand to the two or four packed double-precision floating-point values in the third source operand. From the infinite precision intermediate result, subtracts the odd double-precision floating-point elements and adds the even double-precision floating-point values in the second source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand). VFMSUBADD213PD: Multiplies the two, four, or eight packed double-precision floating-point values from the second source operand to the two or four packed double-precision floating-point values in the first source operand. 
From the infinite precision intermediate result, subtracts the odd double-precision floating-point elements and adds the even double-precision floating-point values in the third source operand, performs rounding and stores the resulting two, four or eight packed double-precision floating-point values to the destination operand (first source operand).

VFMSUBADD231PD: Multiplies the two, four, or eight packed double-precision floating-point values from the second source operand to the two, four or eight packed double-precision floating-point values in the third source operand. From the infinite precision intermediate result, subtracts the odd double-precision floating-point elements and adds the even double-precision floating-point values in the first source operand, performs rounding and stores the resulting two, four or eight packed double-precision floating-point values to the destination operand (first source operand).

EVEX encoded versions: The destination operand (also first source operand) and the second source operand are ZMM/YMM/XMM registers. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is conditionally updated with write mask k1.

VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field.

VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column.

Operation

In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no rounding).
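As an informal companion to the pseudocode that follows (this sketch is ours and not part of the manual's specification; the helper name fmsubadd213_pd_ref is hypothetical), the alternation pattern of the 213PD form can be modeled in C with fma() from <math.h>, which, like the fused hardware operation, rounds only once:

    #include <math.h>
    #include <stddef.h>

    /* Scalar reference model (ours) for the VFMSUBADD..213PD pattern:
       even-indexed elements compute src2*dest + src3 (add),
       odd-indexed elements compute src2*dest - src3 (subtract).
       fma() rounds once, mirroring the fused intermediate described above. */
    static void fmsubadd213_pd_ref(double *dest, const double *src2,
                                   const double *src3, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (i % 2 == 0)
                dest[i] = fma(src2[i], dest[i], src3[i]);    /* even: add */
            else
                dest[i] = fma(src2[i], dest[i], -src3[i]);   /* odd: subtract */
        }
    }

Note that fma(a, b, -c) evaluates a*b - c with a single rounding, so in this sketch the subtract lanes keep the same fused behavior as the add lanes.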
VFMSUBADD132PD DEST, SRC2, SRC3
IF (VEX.128) THEN
    DEST[63:0] ← RoundFPControl_MXCSR(DEST[63:0]*SRC3[63:0] + SRC2[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(DEST[127:64]*SRC3[127:64] - SRC2[127:64])
    DEST[MAX_VL-1:128] ← 0
ELSEIF (VEX.256)
    DEST[63:0] ← RoundFPControl_MXCSR(DEST[63:0]*SRC3[63:0] + SRC2[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(DEST[127:64]*SRC3[127:64] - SRC2[127:64])
    DEST[191:128] ← RoundFPControl_MXCSR(DEST[191:128]*SRC3[191:128] + SRC2[191:128])
    DEST[255:192] ← RoundFPControl_MXCSR(DEST[255:192]*SRC3[255:192] - SRC2[255:192])
FI

VFMSUBADD213PD DEST, SRC2, SRC3
IF (VEX.128) THEN
    DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*DEST[63:0] + SRC3[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(SRC2[127:64]*DEST[127:64] - SRC3[127:64])
    DEST[MAX_VL-1:128] ← 0
ELSEIF (VEX.256)
    DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*DEST[63:0] + SRC3[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(SRC2[127:64]*DEST[127:64] - SRC3[127:64])
    DEST[191:128] ← RoundFPControl_MXCSR(SRC2[191:128]*DEST[191:128] + SRC3[191:128])
    DEST[255:192] ← RoundFPControl_MXCSR(SRC2[255:192]*DEST[255:192] - SRC3[255:192])
FI

VFMSUBADD231PD DEST, SRC2, SRC3
IF (VEX.128) THEN
    DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*SRC3[63:0] + DEST[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(SRC2[127:64]*SRC3[127:64] - DEST[127:64])
    DEST[MAX_VL-1:128] ← 0
ELSEIF (VEX.256)
    DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*SRC3[63:0] + DEST[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(SRC2[127:64]*SRC3[127:64] - DEST[127:64])
    DEST[191:128] ← RoundFPControl_MXCSR(SRC2[191:128]*SRC3[191:128] + DEST[191:128])
    DEST[255:192] ← RoundFPControl_MXCSR(SRC2[255:192]*SRC3[255:192] - DEST[255:192])
FI

VFMSUBADD132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF j *is even*
                THEN DEST[i+63:i] ← RoundFPControl(DEST[i+63:i]*SRC3[i+63:i] + SRC2[i+63:i])
                ELSE DEST[i+63:i] ← RoundFPControl(DEST[i+63:i]*SRC3[i+63:i] - SRC2[i+63:i])
            FI
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFMSUBADD132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF j *is even*
                THEN
                    IF (EVEX.b = 1)
                        THEN DEST[i+63:i] ← RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[63:0] + SRC2[i+63:i])
                        ELSE DEST[i+63:i] ← RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[i+63:i] + SRC2[i+63:i])
                    FI;
                ELSE
                    IF (EVEX.b = 1)
                        THEN DEST[i+63:i] ← RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[63:0] - SRC2[i+63:i])
                        ELSE DEST[i+63:i] ← RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[i+63:i] - SRC2[i+63:i])
                    FI;
            FI
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
VFMSUBADD132PD/VFMSUBADD213PD/VFMSUBADD231PD—Fused Multiply-Alternating Subtract/Add of Packed Double-Precision Vol.
2C 5-169 INSTRUCTION SET REFERENCE, V-Z VFMSUBADD213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register) (KL, VL) = (2, 128), (4, 256), (8, 512) IF (VL = 512) AND (EVEX.b = 1) THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN IF j *is even* THEN DEST[i+63:i] RoundFPControl(SRC2[i+63:i]*DEST[i+63:i] + SRC3[i+63:i]) ELSE DEST[i+63:i] RoundFPControl(SRC2[i+63:i]*DEST[i+63:i] - SRC3[i+63:i]) FI ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-170 Vol. 2C VFMSUBADD132PD/VFMSUBADD213PD/VFMSUBADD231PD—Fused Multiply-Alternating Subtract/Add of Packed Double-Precision INSTRUCTION SET REFERENCE, V-Z VFMSUBADD213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source) (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN IF j *is even* THEN IF (EVEX.b = 1) THEN DEST[i+63:i] RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] + SRC3[63:0]) ELSE DEST[i+63:i] RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] + SRC3[i+63:i]) FI; ELSE IF (EVEX.b = 1) THEN DEST[i+63:i] RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] - SRC3[63:0]) ELSE DEST[i+63:i] RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] - SRC3[i+63:i]) FI; FI ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFMSUBADD132PD/VFMSUBADD213PD/VFMSUBADD231PD—Fused Multiply-Alternating Subtract/Add of Packed Double-Precision Vol. 2C 5-171 INSTRUCTION SET REFERENCE, V-Z VFMSUBADD231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register) (KL, VL) = (2, 128), (4, 256), (8, 512) IF (VL = 512) AND (EVEX.b = 1) THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN IF j *is even* THEN DEST[i+63:i] RoundFPControl(SRC2[i+63:i]*SRC3[i+63:i] + DEST[i+63:i]) ELSE DEST[i+63:i] RoundFPControl(SRC2[i+63:i]*SRC3[i+63:i] - DEST[i+63:i]) FI ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-172 Vol. 2C VFMSUBADD132PD/VFMSUBADD213PD/VFMSUBADD231PD—Fused Multiply-Alternating Subtract/Add of Packed Double-Precision INSTRUCTION SET REFERENCE, V-Z VFMSUBADD231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source) (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN IF j *is even* THEN IF (EVEX.b = 1) THEN DEST[i+63:i] RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[63:0] + DEST[i+63:i]) ELSE DEST[i+63:i] RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[i+63:i] + DEST[i+63:i]) FI; ELSE IF (EVEX.b = 1) THEN DEST[i+63:i] RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[63:0] - DEST[i+63:i]) ELSE DEST[i+63:i] RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[i+63:i] - DEST[i+63:i]) FI; FI ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFMSUBADD132PD/VFMSUBADD213PD/VFMSUBADD231PD—Fused Multiply-Alternating Subtract/Add of Packed Double-Precision Vol. 
2C 5-173 INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VFMSUBADDxxxPD __m512d _mm512_fmsubadd_pd(__m512d a, __m512d b, __m512d c); VFMSUBADDxxxPD __m512d _mm512_fmsubadd_round_pd(__m512d a, __m512d b, __m512d c, int r); VFMSUBADDxxxPD __m512d _mm512_mask_fmsubadd_pd(__m512d a, __mmask8 k, __m512d b, __m512d c); VFMSUBADDxxxPD __m512d _mm512_maskz_fmsubadd_pd(__mmask8 k, __m512d a, __m512d b, __m512d c); VFMSUBADDxxxPD __m512d _mm512_mask3_fmsubadd_pd(__m512d a, __m512d b, __m512d c, __mmask8 k); VFMSUBADDxxxPD __m512d _mm512_mask_fmsubadd_round_pd(__m512d a, __mmask8 k, __m512d b, __m512d c, int r); VFMSUBADDxxxPD __m512d _mm512_maskz_fmsubadd_round_pd(__mmask8 k, __m512d a, __m512d b, __m512d c, int r); VFMSUBADDxxxPD __m512d _mm512_mask3_fmsubadd_round_pd(__m512d a, __m512d b, __m512d c, __mmask8 k, int r); VFMSUBADDxxxPD __m256d _mm256_mask_fmsubadd_pd(__m256d a, __mmask8 k, __m256d b, __m256d c); VFMSUBADDxxxPD __m256d _mm256_maskz_fmsubadd_pd(__mmask8 k, __m256d a, __m256d b, __m256d c); VFMSUBADDxxxPD __m256d _mm256_mask3_fmsubadd_pd(__m256d a, __m256d b, __m256d c, __mmask8 k); VFMSUBADDxxxPD __m128d _mm_mask_fmsubadd_pd(__m128d a, __mmask8 k, __m128d b, __m128d c); VFMSUBADDxxxPD __m128d _mm_maskz_fmsubadd_pd(__mmask8 k, __m128d a, __m128d b, __m128d c); VFMSUBADDxxxPD __m128d _mm_mask3_fmsubadd_pd(__m128d a, __m128d b, __m128d c, __mmask8 k); VFMSUBADDxxxPD __m128d _mm_fmsubadd_pd (__m128d a, __m128d b, __m128d c); VFMSUBADDxxxPD __m256d _mm256_fmsubadd_pd (__m256d a, __m256d b, __m256d c); SIMD Floating-Point Exceptions Overflow, Underflow, Invalid, Precision, Denormal Other Exceptions VEX-encoded instructions, see Exceptions Type 2. EVEX-encoded instructions, see Exceptions Type E2. 5-174 Vol. 2C VFMSUBADD132PD/VFMSUBADD213PD/VFMSUBADD231PD—Fused Multiply-Alternating Subtract/Add of Packed Double-Precision INSTRUCTION SET REFERENCE, V-Z VFMSUBADD132PS/VFMSUBADD213PS/VFMSUBADD231PS—Fused Multiply-Alternating Subtract/Add of Packed Single-Precision Floating-Point Values Opcode/ Instruction Op / En RVM 64/32 bit Mode Support V/V CPUID Feature Flag FMA VEX.DDS.128.66.0F38.W0 97 /r VFMSUBADD132PS xmm1, xmm2, xmm3/m128 VEX.DDS.128.66.0F38.W0 A7 /r VFMSUBADD213PS xmm1, xmm2, xmm3/m128 VEX.DDS.128.66.0F38.W0 B7 /r VFMSUBADD231PS xmm1, xmm2, xmm3/m128 VEX.DDS.256.66.0F38.W0 97 /r VFMSUBADD132PS ymm1, ymm2, ymm3/m256 VEX.DDS.256.66.0F38.W0 A7 /r VFMSUBADD213PS ymm1, ymm2, ymm3/m256 VEX.DDS.256.66.0F38.W0 B7 /r VFMSUBADD231PS ymm1, ymm2, ymm3/m256 EVEX.DDS.128.66.0F38.W0 97 /r VFMSUBADD132PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst EVEX.DDS.128.66.0F38.W0 A7 /r VFMSUBADD213PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst RVM V/V FMA RVM V/V FMA RVM V/V FMA RVM V/V FMA RVM V/V FMA FV V/V AVX512VL AVX512F FV V/V AVX512VL AVX512F EVEX.DDS.128.66.0F38.W0 B7 /r VFMSUBADD231PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst EVEX.DDS.256.66.0F38.W0 97 /r VFMSUBADD132PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst EVEX.DDS.256.66.0F38.W0 A7 /r VFMSUBADD213PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst FV V/V AVX512VL AVX512F FV V/V AVX512VL AVX512F FV V/V AVX512VL AVX512F EVEX.DDS.256.66.0F38.W0 B7 /r VFMSUBADD231PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst EVEX.DDS.512.66.0F38.W0 97 /r VFMSUBADD132PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} EVEX.DDS.512.66.0F38.W0 A7 /r VFMSUBADD213PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} FV V/V AVX512VL AVX512F FV V/V AVX512F FV V/V AVX512F EVEX.DDS.512.66.0F38.W0 B7 /r VFMSUBADD231PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} FV V/V AVX512F Description Multiply packed 
single-precision floating-point values from xmm1 and xmm3/mem, subtract/add elements in xmm2 and put result in xmm1. Multiply packed single-precision floating-point values from xmm1 and xmm2, subtract/add elements in xmm3/mem and put result in xmm1. Multiply packed single-precision floating-point values from xmm2 and xmm3/mem, subtract/add elements in xmm1 and put result in xmm1. Multiply packed single-precision floating-point values from ymm1 and ymm3/mem, subtract/add elements in ymm2 and put result in ymm1. Multiply packed single-precision floating-point values from ymm1 and ymm2, subtract/add elements in ymm3/mem and put result in ymm1. Multiply packed single-precision floating-point values from ymm2 and ymm3/mem, subtract/add elements in ymm1 and put result in ymm1. Multiply packed single-precision floating-point values from xmm1 and xmm3/m128/m32bcst, subtract/add elements in xmm2 and put result in xmm1 subject to writemask k1. Multiply packed single-precision floating-point values from xmm1 and xmm2, subtract/add elements in xmm3/m128/m32bcst and put result in xmm1 subject to writemask k1. Multiply packed single-precision floating-point values from xmm2 and xmm3/m128/m32bcst, subtract/add elements in xmm1 and put result in xmm1 subject to writemask k1. Multiply packed single-precision floating-point values from ymm1 and ymm3/m256/m32bcst, subtract/add elements in ymm2 and put result in ymm1 subject to writemask k1. Multiply packed single-precision floating-point values from ymm1 and ymm2, subtract/add elements in ymm3/m256/m32bcst and put result in ymm1 subject to writemask k1. Multiply packed single-precision floating-point values from ymm2 and ymm3/m256/m32bcst, subtract/add elements in ymm1 and put result in ymm1 subject to writemask k1. Multiply packed single-precision floating-point values from zmm1 and zmm3/m512/m32bcst, subtract/add elements in zmm2 and put result in zmm1 subject to writemask k1. Multiply packed single-precision floating-point values from zmm1 and zmm2, subtract/add elements in zmm3/m512/m32bcst and put result in zmm1 subject to writemask k1. Multiply packed single-precision floating-point values from zmm2 and zmm3/m512/m32bcst, subtract/add elements in zmm1 and put result in zmm1 subject to writemask k1. VFMSUBADD132PS/VFMSUBADD213PS/VFMSUBADD231PS—Fused Multiply-Alternating Subtract/Add of Packed Single-Precision Vol. 2C 5-175 INSTRUCTION SET REFERENCE, V-Z Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RVM ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) NA FV ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) NA Description VFMSUBADD132PS: Multiplies the four, eight or sixteen packed single-precision floating-point values from the first source operand to the corresponding packed single-precision floating-point values in the third source operand. From the infinite precision intermediate result, subtracts the odd single-precision floating-point elements and adds the even single-precision floating-point values in the second source operand, performs rounding and stores the resulting packed single-precision floating-point values to the destination operand (first source operand). VFMSUBADD213PS: Multiplies the four, eight or sixteen packed single-precision floating-point values from the second source operand to the corresponding packed single-precision floating-point values in the first source operand. 
From the infinite precision intermediate result, subtracts the odd single-precision floating-point elements and adds the even single-precision floating-point values in the third source operand, performs rounding and stores the resulting packed single-precision floating-point values to the destination operand (first source operand).

VFMSUBADD231PS: Multiplies the four, eight or sixteen packed single-precision floating-point values from the second source operand to the corresponding packed single-precision floating-point values in the third source operand. From the infinite precision intermediate result, subtracts the odd single-precision floating-point elements and adds the even single-precision floating-point values in the first source operand, performs rounding and stores the resulting packed single-precision floating-point values to the destination operand (first source operand).

EVEX encoded versions: The destination operand (also first source operand) and the second source operand are ZMM/YMM/XMM registers. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is conditionally updated with write mask k1.

VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field.

VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column.

Operation

In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no rounding).
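Before the formal pseudocode, a short usage sketch (ours, not from this manual; it assumes an FMA-capable processor and a compiler option such as -mfma) exercises the 128-bit intrinsic form listed at the end of this entry. Even elements receive a*b + c and odd elements receive a*b - c:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        /* _mm_set_ps lists elements high-to-low: a = {1, 2, 3, 4} as elements 0..3. */
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set1_ps(10.0f);
        __m128 c = _mm_set1_ps(0.5f);

        /* VFMSUBADD..PS: even elements get a*b + c, odd elements get a*b - c. */
        __m128 r = _mm_fmsubadd_ps(a, b, c);

        float out[4];
        _mm_storeu_ps(out, r);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        /* Prints: 10.5 19.5 30.5 39.5 */
        return 0;
    }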
VFMSUBADD132PS DEST, SRC2, SRC3 IF (VEX.128) THEN MAXNUM 2 ELSEIF (VEX.256) MAXNUM 4 FI For i = 0 to MAXNUM -1{ n 64*i; DEST[n+31:n] RoundFPControl_MXCSR(DEST[n+31:n]*SRC3[n+31:n] + SRC2[n+31:n]) DEST[n+63:n+32] RoundFPControl_MXCSR(DEST[n+63:n+32]*SRC3[n+63:n+32] -SRC2[n+63:n+32]) } IF (VEX.128) THEN DEST[MAX_VL-1:128] 0 ELSEIF (VEX.256) DEST[MAX_VL-1:256] 0 FI VFMSUBADD213PS DEST, SRC2, SRC3 IF (VEX.128) THEN MAXNUM 2 ELSEIF (VEX.256) MAXNUM 4 FI For i = 0 to MAXNUM -1{ n 64*i; DEST[n+31:n] RoundFPControl_MXCSR(SRC2[n+31:n]*DEST[n+31:n] +SRC3[n+31:n]) DEST[n+63:n+32] RoundFPControl_MXCSR(SRC2[n+63:n+32]*DEST[n+63:n+32] -SRC3[n+63:n+32]) } IF (VEX.128) THEN DEST[MAX_VL-1:128] 0 ELSEIF (VEX.256) DEST[MAX_VL-1:256] 0 FI VFMSUBADD231PS DEST, SRC2, SRC3 IF (VEX.128) THEN MAXNUM 2 ELSEIF (VEX.256) MAXNUM 4 FI For i = 0 to MAXNUM -1{ n 64*i; DEST[n+31:n] RoundFPControl_MXCSR(SRC2[n+31:n]*SRC3[n+31:n] + DEST[n+31:n]) DEST[n+63:n+32] RoundFPControl_MXCSR(SRC2[n+63:n+32]*SRC3[n+63:n+32] -DEST[n+63:n+32]) } IF (VEX.128) THEN DEST[MAX_VL-1:128] 0 ELSEIF (VEX.256) DEST[MAX_VL-1:256] 0 FI VFMSUBADD132PS/VFMSUBADD213PS/VFMSUBADD231PS—Fused Multiply-Alternating Subtract/Add of Packed Single-Precision Vol. 2C 5-177 INSTRUCTION SET REFERENCE, V-Z VFMSUBADD132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register) (KL, VL) = (4, 128), (8, 256), (16, 512) IF (VL = 512) AND (EVEX.b = 1) THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN IF j *is even* THEN DEST[i+31:i] RoundFPControl(DEST[i+31:i]*SRC3[i+31:i] + SRC2[i+31:i]) ELSE DEST[i+31:i] RoundFPControl(DEST[i+31:i]*SRC3[i+31:i] - SRC2[i+31:i]) FI ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-178 Vol. 2C VFMSUBADD132PS/VFMSUBADD213PS/VFMSUBADD231PS—Fused Multiply-Alternating Subtract/Add of Packed Single-Precision INSTRUCTION SET REFERENCE, V-Z VFMSUBADD132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN IF j *is even* THEN IF (EVEX.b = 1) THEN DEST[i+31:i] RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[31:0] + SRC2[i+31:i]) ELSE DEST[i+31:i] RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[i+31:i] + SRC2[i+31:i]) FI; ELSE IF (EVEX.b = 1) THEN DEST[i+31:i] RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[31:0] - SRC2[i+31:i]) ELSE DEST[i+31:i] RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[i+31:i] - SRC2[i+31:i]) FI; FI ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFMSUBADD132PS/VFMSUBADD213PS/VFMSUBADD231PS—Fused Multiply-Alternating Subtract/Add of Packed Single-Precision Vol. 2C 5-179 INSTRUCTION SET REFERENCE, V-Z VFMSUBADD213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register) (KL, VL) = (4, 128), (8, 256), (16, 512) IF (VL = 512) AND (EVEX.b = 1) THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN IF j *is even* THEN DEST[i+31:i] RoundFPControl(SRC2[i+31:i]*DEST[i+31:i] + SRC3[i+31:i]) ELSE DEST[i+31:i] RoundFPControl(SRC2[i+31:i]*DEST[i+31:i] - SRC3[i+31:i]) FI ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-180 Vol. 
2C VFMSUBADD132PS/VFMSUBADD213PS/VFMSUBADD231PS—Fused Multiply-Alternating Subtract/Add of Packed Single-Precision INSTRUCTION SET REFERENCE, V-Z

VFMSUBADD213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF j *is even*
                THEN
                    IF (EVEX.b = 1)
                        THEN DEST[i+31:i] ← RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] + SRC3[31:0])
                        ELSE DEST[i+31:i] ← RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] + SRC3[i+31:i])
                    FI;
                ELSE
                    IF (EVEX.b = 1)
                        THEN DEST[i+31:i] ← RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] - SRC3[31:0])
                        ELSE DEST[i+31:i] ← RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] - SRC3[i+31:i])
                    FI;
            FI
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFMSUBADD231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF j *is even*
                THEN DEST[i+31:i] ← RoundFPControl(SRC2[i+31:i]*SRC3[i+31:i] + DEST[i+31:i])
                ELSE DEST[i+31:i] ← RoundFPControl(SRC2[i+31:i]*SRC3[i+31:i] - DEST[i+31:i])
            FI
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFMSUBADD231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF j *is even*
                THEN
                    IF (EVEX.b = 1)
                        THEN DEST[i+31:i] ← RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[31:0] + DEST[i+31:i])
                        ELSE DEST[i+31:i] ← RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[i+31:i] + DEST[i+31:i])
                    FI;
                ELSE
                    IF (EVEX.b = 1)
                        THEN DEST[i+31:i] ← RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[31:0] - DEST[i+31:i])
                        ELSE DEST[i+31:i] ← RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[i+31:i] - DEST[i+31:i])
                    FI;
            FI
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
VFMSUBADD132PS/VFMSUBADD213PS/VFMSUBADD231PS—Fused Multiply-Alternating Subtract/Add of Packed Single-Precision Vol.
2C 5-183 INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VFMSUBADDxxxPS __m512 _mm512_fmsubadd_ps(__m512 a, __m512 b, __m512 c); VFMSUBADDxxxPS __m512 _mm512_fmsubadd_round_ps(__m512 a, __m512 b, __m512 c, int r); VFMSUBADDxxxPS __m512 _mm512_mask_fmsubadd_ps(__m512 a, __mmask16 k, __m512 b, __m512 c); VFMSUBADDxxxPS __m512 _mm512_maskz_fmsubadd_ps(__mmask16 k, __m512 a, __m512 b, __m512 c); VFMSUBADDxxxPS __m512 _mm512_mask3_fmsubadd_ps(__m512 a, __m512 b, __m512 c, __mmask16 k); VFMSUBADDxxxPS __m512 _mm512_mask_fmsubadd_round_ps(__m512 a, __mmask16 k, __m512 b, __m512 c, int r); VFMSUBADDxxxPS __m512 _mm512_maskz_fmsubadd_round_ps(__mmask16 k, __m512 a, __m512 b, __m512 c, int r); VFMSUBADDxxxPS __m512 _mm512_mask3_fmsubadd_round_ps(__m512 a, __m512 b, __m512 c, __mmask16 k, int r); VFMSUBADDxxxPS __m256 _mm256_mask_fmsubadd_ps(__m256 a, __mmask8 k, __m256 b, __m256 c); VFMSUBADDxxxPS __m256 _mm256_maskz_fmsubadd_ps(__mmask8 k, __m256 a, __m256 b, __m256 c); VFMSUBADDxxxPS __m256 _mm256_mask3_fmsubadd_ps(__m256 a, __m256 b, __m256 c, __mmask8 k); VFMSUBADDxxxPS __m128 _mm_mask_fmsubadd_ps(__m128 a, __mmask8 k, __m128 b, __m128 c); VFMSUBADDxxxPS __m128 _mm_maskz_fmsubadd_ps(__mmask8 k, __m128 a, __m128 b, __m128 c); VFMSUBADDxxxPS __m128 _mm_mask3_fmsubadd_ps(__m128 a, __m128 b, __m128 c, __mmask8 k); VFMSUBADDxxxPS __m128 _mm_fmsubadd_ps (__m128 a, __m128 b, __m128 c); VFMSUBADDxxxPS __m256 _mm256_fmsubadd_ps (__m256 a, __m256 b, __m256 c); SIMD Floating-Point Exceptions Overflow, Underflow, Invalid, Precision, Denormal Other Exceptions VEX-encoded instructions, see Exceptions Type 2. EVEX-encoded instructions, see Exceptions Type E2. 5-184 Vol. 2C VFMSUBADD132PS/VFMSUBADD213PS/VFMSUBADD231PS—Fused Multiply-Alternating Subtract/Add of Packed Single-Precision INSTRUCTION SET REFERENCE, V-Z VFMSUB132PD/VFMSUB213PD/VFMSUB231PD—Fused Multiply-Subtract of Packed DoublePrecision Floating-Point Values Opcode/ Instruction Op/ En RVM 64/32 bit Mode Support V/V CPUID Feature Flag FMA VEX.NDS.128.66.0F38.W1 9A /r VFMSUB132PD xmm1, xmm2, xmm3/m128 VEX.NDS.128.66.0F38.W1 AA /r VFMSUB213PD xmm1, xmm2, xmm3/m128 VEX.NDS.128.66.0F38.W1 BA /r VFMSUB231PD xmm1, xmm2, xmm3/m128 VEX.NDS.256.66.0F38.W1 9A /r VFMSUB132PD ymm1, ymm2, ymm3/m256 VEX.NDS.256.66.0F38.W1 AA /r VFMSUB213PD ymm1, ymm2, ymm3/m256 VEX.NDS.256.66.0F38.W1 BA /r VFMSUB231PD ymm1, ymm2, ymm3/m256 EVEX.NDS.128.66.0F38.W1 9A /r VFMSUB132PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst EVEX.NDS.128.66.0F38.W1 AA /r VFMSUB213PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst EVEX.NDS.128.66.0F38.W1 BA /r VFMSUB231PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst EVEX.NDS.256.66.0F38.W1 9A /r VFMSUB132PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst EVEX.NDS.256.66.0F38.W1 AA /r VFMSUB213PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst EVEX.NDS.256.66.0F38.W1 BA /r VFMSUB231PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst EVEX.NDS.512.66.0F38.W1 9A /r VFMSUB132PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} EVEX.NDS.512.66.0F38.W1 AA /r VFMSUB213PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} EVEX.NDS.512.66.0F38.W1 BA /r VFMSUB231PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} RVM V/V FMA RVM V/V FMA RVM V/V FMA RVM V/V FMA RVM V/V FMA FV V/V AVX512VL AVX512F FV V/V AVX512VL AVX512F FV V/V AVX512VL AVX512F FV V/V AVX512VL AVX512F FV V/V AVX512VL AVX512F FV V/V AVX512VL AVX512F FV V/V AVX512F FV V/V AVX512F FV V/V AVX512F Description Multiply packed double-precision floating-point values from xmm1 and xmm3/mem, subtract xmm2 and put result in xmm1. 
Multiply packed double-precision floating-point values from xmm1 and xmm2, subtract xmm3/mem and put result in xmm1. Multiply packed double-precision floating-point values from xmm2 and xmm3/mem, subtract xmm1 and put result in xmm1. Multiply packed double-precision floating-point values from ymm1 and ymm3/mem, subtract ymm2 and put result in ymm1. Multiply packed double-precision floating-point values from ymm1 and ymm2, subtract ymm3/mem and put result in ymm1. Multiply packed double-precision floating-point values from ymm2 and ymm3/mem, subtract ymm1 and put result in ymm1. Multiply packed double-precision floating-point values from xmm1 and xmm3/m128/m64bcst, subtract xmm2 and put result in xmm1 subject to writemask k1. Multiply packed double-precision floating-point values from xmm1 and xmm2, subtract xmm3/m128/m64bcst and put result in xmm1 subject to writemask k1. Multiply packed double-precision floating-point values from xmm2 and xmm3/m128/m64bcst, subtract xmm1 and put result in xmm1 subject to writemask k1. Multiply packed double-precision floating-point values from ymm1 and ymm3/m256/m64bcst, subtract ymm2 and put result in ymm1 subject to writemask k1. Multiply packed double-precision floating-point values from ymm1 and ymm2, subtract ymm3/m256/m64bcst and put result in ymm1 subject to writemask k1. Multiply packed double-precision floating-point values from ymm2 and ymm3/m256/m64bcst, subtract ymm1 and put result in ymm1 subject to writemask k1. Multiply packed double-precision floating-point values from zmm1 and zmm3/m512/m64bcst, subtract zmm2 and put result in zmm1 subject to writemask k1. Multiply packed double-precision floating-point values from zmm1 and zmm2, subtract zmm3/m512/m64bcst and put result in zmm1 subject to writemask k1. Multiply packed double-precision floating-point values from zmm2 and zmm3/m512/m64bcst, subtract zmm1 and put result in zmm1 subject to writemask k1.

Instruction Operand Encoding
Op/En    Operand 1           Operand 2      Operand 3       Operand 4
RVM      ModRM:reg (r, w)    VEX.vvvv (r)   ModRM:r/m (r)   NA
FV       ModRM:reg (r, w)    EVEX.vvvv (r)  ModRM:r/m (r)   NA

Description

Performs a set of SIMD multiply-subtract computations on packed double-precision floating-point values using three source operands and writes the multiply-subtract results in the destination operand. The destination operand is also the first source operand. The second operand must be a SIMD register. The third source operand can be a SIMD register or a memory location.

VFMSUB132PD: Multiplies the two, four or eight packed double-precision floating-point values from the first source operand to the two, four or eight packed double-precision floating-point values in the third source operand. From the infinite precision intermediate result, subtracts the two, four or eight packed double-precision floating-point values in the second source operand, performs rounding and stores the resulting two, four or eight packed double-precision floating-point values to the destination operand (first source operand).

VFMSUB213PD: Multiplies the two, four or eight packed double-precision floating-point values from the second source operand to the two, four or eight packed double-precision floating-point values in the first source operand.
From the infinite precision intermediate result, subtracts the two, four or eight packed double-precision floatingpoint values in the third source operand, performs rounding and stores the resulting two, four or eight packed double-precision floating-point values to the destination operand (first source operand). VFMSUB231PD: Multiplies the two, four or eight packed double-precision floating-point values from the second source to the two, four or eight packed double-precision floating-point values in the third source operand. From the infinite precision intermediate result, subtracts the two, four or eight packed double-precision floating-point values in the first source operand, performs rounding and stores the resulting two, four or eight packed double-precision floating-point values to the destination operand (first source operand). EVEX encoded versions: The destination operand (also first source operand) and the second source operand are ZMM/YMM/XMM register. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is conditionally updated with write mask k1. VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field. VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed. 5-186 Vol. 2C VFMSUB132PD/VFMSUB213PD/VFMSUB231PD—Fused Multiply-Subtract of Packed Double-Precision Floating-Point Values INSTRUCTION SET REFERENCE, V-Z Operation In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no rounding). VFMSUB132PD DEST, SRC2, SRC3 (VEX encoded versions) IF (VEX.128) THEN MAXNUM 2 ELSEIF (VEX.256) MAXNUM 4 FI For i = 0 to MAXNUM-1 { n 64*i; DEST[n+63:n] RoundFPControl_MXCSR(DEST[n+63:n]*SRC3[n+63:n] - SRC2[n+63:n]) } IF (VEX.128) THEN DEST[MAX_VL-1:128] 0 ELSEIF (VEX.256) DEST[MAX_VL-1:256] 0 FI VFMSUB213PD DEST, SRC2, SRC3 (VEX encoded versions) IF (VEX.128) THEN MAXNUM 2 ELSEIF (VEX.256) MAXNUM 4 FI For i = 0 to MAXNUM-1 { n 64*i; DEST[n+63:n] RoundFPControl_MXCSR(SRC2[n+63:n]*DEST[n+63:n] - SRC3[n+63:n]) } IF (VEX.128) THEN DEST[MAX_VL-1:128] 0 ELSEIF (VEX.256) DEST[MAX_VL-1:256] 0 FI VFMSUB231PD DEST, SRC2, SRC3 (VEX encoded versions) IF (VEX.128) THEN MAXNUM 2 ELSEIF (VEX.256) MAXNUM 4 FI For i = 0 to MAXNUM-1 { n 64*i; DEST[n+63:n] RoundFPControl_MXCSR(SRC2[n+63:n]*SRC3[n+63:n] - DEST[n+63:n]) } IF (VEX.128) THEN DEST[MAX_VL-1:128] 0 ELSEIF (VEX.256) DEST[MAX_VL-1:256] 0 FI VFMSUB132PD/VFMSUB213PD/VFMSUB231PD—Fused Multiply-Subtract of Packed Double-Precision Floating-Point Values Vol. 
2C 5-187 INSTRUCTION SET REFERENCE, V-Z

VFMSUB132PD DEST, SRC2, SRC3 (EVEX encoded versions, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← RoundFPControl(DEST[i+63:i]*SRC3[i+63:i] - SRC2[i+63:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFMSUB132PD DEST, SRC2, SRC3 (EVEX encoded versions, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+63:i] ← RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[63:0] - SRC2[i+63:i])
                ELSE DEST[i+63:i] ← RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[i+63:i] - SRC2[i+63:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFMSUB213PD DEST, SRC2, SRC3 (EVEX encoded versions, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← RoundFPControl(SRC2[i+63:i]*DEST[i+63:i] - SRC3[i+63:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFMSUB213PD DEST, SRC2, SRC3 (EVEX encoded versions, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+63:i] ← RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] - SRC3[63:0])
                ELSE DEST[i+63:i] ← RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] - SRC3[i+63:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFMSUB231PD DEST, SRC2, SRC3 (EVEX encoded versions, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← RoundFPControl(SRC2[i+63:i]*SRC3[i+63:i] - DEST[i+63:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
5-190 Vol.
2C VFMSUB132PD/VFMSUB213PD/VFMSUB231PD—Fused Multiply-Subtract of Packed Double-Precision Floating-Point Values INSTRUCTION SET REFERENCE, V-Z VFMSUB231PD DEST, SRC2, SRC3 (EVEX encoded versions, when src3 operand is a memory source) (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN IF (EVEX.b = 1) THEN DEST[i+63:i] RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[63:0] - DEST[i+63:i]) ELSE DEST[i+63:i] RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[i+63:i] - DEST[i+63:i]) FI; ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 Intel C/C++ Compiler Intrinsic Equivalent VFMSUBxxxPD __m512d _mm512_fmsub_pd(__m512d a, __m512d b, __m512d c); VFMSUBxxxPD __m512d _mm512_fmsub_round_pd(__m512d a, __m512d b, __m512d c, int r); VFMSUBxxxPD __m512d _mm512_mask_fmsub_pd(__m512d a, __mmask8 k, __m512d b, __m512d c); VFMSUBxxxPD __m512d _mm512_maskz_fmsub_pd(__mmask8 k, __m512d a, __m512d b, __m512d c); VFMSUBxxxPD __m512d _mm512_mask3_fmsub_pd(__m512d a, __m512d b, __m512d c, __mmask8 k); VFMSUBxxxPD __m512d _mm512_mask_fmsub_round_pd(__m512d a, __mmask8 k, __m512d b, __m512d c, int r); VFMSUBxxxPD __m512d _mm512_maskz_fmsub_round_pd(__mmask8 k, __m512d a, __m512d b, __m512d c, int r); VFMSUBxxxPD __m512d _mm512_mask3_fmsub_round_pd(__m512d a, __m512d b, __m512d c, __mmask8 k, int r); VFMSUBxxxPD __m256d _mm256_mask_fmsub_pd(__m256d a, __mmask8 k, __m256d b, __m256d c); VFMSUBxxxPD __m256d _mm256_maskz_fmsub_pd(__mmask8 k, __m256d a, __m256d b, __m256d c); VFMSUBxxxPD __m256d _mm256_mask3_fmsub_pd(__m256d a, __m256d b, __m256d c, __mmask8 k); VFMSUBxxxPD __m128d _mm_mask_fmsub_pd(__m128d a, __mmask8 k, __m128d b, __m128d c); VFMSUBxxxPD __m128d _mm_maskz_fmsub_pd(__mmask8 k, __m128d a, __m128d b, __m128d c); VFMSUBxxxPD __m128d _mm_mask3_fmsub_pd(__m128d a, __m128d b, __m128d c, __mmask8 k); VFMSUBxxxPD __m128d _mm_fmsub_pd (__m128d a, __m128d b, __m128d c); VFMSUBxxxPD __m256d _mm256_fmsub_pd (__m256d a, __m256d b, __m256d c); SIMD Floating-Point Exceptions Overflow, Underflow, Invalid, Precision, Denormal Other Exceptions VEX-encoded instructions, see Exceptions Type 2. EVEX-encoded instructions, see Exceptions Type E2. VFMSUB132PD/VFMSUB213PD/VFMSUB231PD—Fused Multiply-Subtract of Packed Double-Precision Floating-Point Values Vol. 
2C 5-191 INSTRUCTION SET REFERENCE, V-Z

VFMSUB132PS/VFMSUB213PS/VFMSUB231PS—Fused Multiply-Subtract of Packed Single-Precision Floating-Point Values

Opcode/Instruction
VEX.NDS.128.66.0F38.W0 9A /r VFMSUB132PS xmm1, xmm2, xmm3/m128
VEX.NDS.128.66.0F38.W0 AA /r VFMSUB213PS xmm1, xmm2, xmm3/m128
VEX.NDS.128.66.0F38.W0 BA /r VFMSUB231PS xmm1, xmm2, xmm3/m128
VEX.NDS.256.66.0F38.W0 9A /r VFMSUB132PS ymm1, ymm2, ymm3/m256
VEX.NDS.256.66.0F38.W0 AA /r VFMSUB213PS ymm1, ymm2, ymm3/m256
VEX.NDS.256.66.0F38.W0 BA /r VFMSUB231PS ymm1, ymm2, ymm3/m256
EVEX.NDS.128.66.0F38.W0 9A /r VFMSUB132PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
EVEX.NDS.128.66.0F38.W0 AA /r VFMSUB213PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
EVEX.NDS.128.66.0F38.W0 BA /r VFMSUB231PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
EVEX.NDS.256.66.0F38.W0 9A /r VFMSUB132PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
EVEX.NDS.256.66.0F38.W0 AA /r VFMSUB213PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
EVEX.NDS.256.66.0F38.W0 BA /r VFMSUB231PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
EVEX.NDS.512.66.0F38.W0 9A /r VFMSUB132PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er}
EVEX.NDS.512.66.0F38.W0 AA /r VFMSUB213PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er}
EVEX.NDS.512.66.0F38.W0 BA /r VFMSUB231PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er}

Op/En, 64/32 bit Mode Support, CPUID Feature Flag (per row above):
RVM V/V FMA; RVM V/V FMA; RVM V/V FMA; RVM V/V FMA; RVM V/V FMA; RVM V/V FMA; FV V/V AVX512VL AVX512F; FV V/V AVX512VL AVX512F; FV V/V AVX512VL AVX512F; FV V/V AVX512VL AVX512F; FV V/V AVX512VL AVX512F; FV V/V AVX512VL AVX512F; FV V/V AVX512F; FV V/V AVX512F; FV V/V AVX512F

Description
Multiply packed single-precision floating-point values from xmm1 and xmm3/mem, subtract xmm2 and put result in xmm1.
Multiply packed single-precision floating-point values from xmm1 and xmm2, subtract xmm3/mem and put result in xmm1.
Multiply packed single-precision floating-point values from xmm2 and xmm3/mem, subtract xmm1 and put result in xmm1.
Multiply packed single-precision floating-point values from ymm1 and ymm3/mem, subtract ymm2 and put result in ymm1.
Multiply packed single-precision floating-point values from ymm1 and ymm2, subtract ymm3/mem and put result in ymm1.
Multiply packed single-precision floating-point values from ymm2 and ymm3/mem, subtract ymm1 and put result in ymm1.
Multiply packed single-precision floating-point values from xmm1 and xmm3/m128/m32bcst, subtract xmm2 and put result in xmm1.
Multiply packed single-precision floating-point values from xmm1 and xmm2, subtract xmm3/m128/m32bcst and put result in xmm1.
Multiply packed single-precision floating-point values from xmm2 and xmm3/m128/m32bcst, subtract xmm1 and put result in xmm1.
Multiply packed single-precision floating-point values from ymm1 and ymm3/m256/m32bcst, subtract ymm2 and put result in ymm1.
Multiply packed single-precision floating-point values from ymm1 and ymm2, subtract ymm3/m256/m32bcst and put result in ymm1.
Multiply packed single-precision floating-point values from ymm2 and ymm3/m256/m32bcst, subtract ymm1 and put result in ymm1.
Multiply packed single-precision floating-point values from zmm1 and zmm3/m512/m32bcst, subtract zmm2 and put result in zmm1.
Multiply packed single-precision floating-point values from zmm1 and zmm2, subtract zmm3/m512/m32bcst and put result in zmm1.
Multiply packed single-precision floating-point values from zmm2 and zmm3/m512/m32bcst, subtract zmm1 and put result in zmm1.
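Before the encoding details, a brief usage sketch (ours, not part of the manual; it assumes FMA support and a compiler option such as -mfma) shows a classic application of fused multiply-subtract: recovering the rounding error of a product. Because VFMSUB evaluates a*b - c with no intermediate rounding, fmsub(a, b, p) with p = a*b yields the exact error term lost when p was rounded:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        /* a*b = 1 + 2^-13 + 2^-14 + 2^-27; the 2^-27 term is below one ulp
           of a float near 1.0, so it is lost when the product is rounded. */
        __m128 a = _mm_set1_ps(1.0f + 0x1p-13f);
        __m128 b = _mm_set1_ps(1.0f + 0x1p-14f);
        __m128 p = _mm_mul_ps(a, b);        /* rounded product */
        __m128 e = _mm_fmsub_ps(a, b, p);   /* exact a*b - p via VFMSUB..PS */

        float out[4];
        _mm_storeu_ps(out, e);
        printf("rounding error of the product: %a\n", out[0]);
        /* Prints 0x1p-27: exactly the term dropped when a*b was rounded. */
        return 0;
    }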
VFMSUB132PS/VFMSUB213PS/VFMSUB231PS—Fused Multiply-Subtract of Packed Single-Precision Floating-Point Values INSTRUCTION SET REFERENCE, V-Z Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RVM ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) NA FV ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) NA Description Performs a set of SIMD multiply-subtract computation on packed single-precision floating-point values using three source operands and writes the multiply-subtract results in the destination operand. The destination operand is also the first source operand. The second operand must be a SIMD register. The third source operand can be a SIMD register or a memory location. VFMSUB132PS: Multiplies the four, eight or sixteen packed single-precision floating-point values from the first source operand to the four, eight or sixteen packed single-precision floating-point values in the third source operand. From the infinite precision intermediate result, subtracts the four, eight or sixteen packed single-precision floating-point values in the second source operand, performs rounding and stores the resulting four, eight or sixteen packed single-precision floating-point values to the destination operand (first source operand). VFMSUB213PS: Multiplies the four, eight or sixteen packed single-precision floating-point values from the second source operand to the four, eight or sixteen packed single-precision floating-point values in the first source operand. From the infinite precision intermediate result, subtracts the four, eight or sixteen packed single-precision floating-point values in the third source operand, performs rounding and stores the resulting four, eight or sixteen packed single-precision floating-point values to the destination operand (first source operand). VFMSUB231PS: Multiplies the four, eight or sixteen packed single-precision floating-point values from the second source to the four, eight or sixteen packed single-precision floating-point values in the third source operand. From the infinite precision intermediate result, subtracts the four, eight or sixteen packed single-precision floating-point values in the first source operand, performs rounding and stores the resulting four, eight or sixteen packed singleprecision floating-point values to the destination operand (first source operand). EVEX encoded versions: The destination operand (also first source operand) and the second source operand are ZMM/YMM/XMM register. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is conditionally updated with write mask k1. VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field. VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed. VFMSUB132PS/VFMSUB213PS/VFMSUB231PS—Fused Multiply-Subtract of Packed Single-Precision Floating-Point Values Vol. 
2C 5-193 INSTRUCTION SET REFERENCE, V-Z Operation In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no rounding). VFMSUB132PS DEST, SRC2, SRC3 (VEX encoded version) IF (VEX.128) THEN MAXNUM 2 ELSEIF (VEX.256) MAXNUM 4 FI For i = 0 to MAXNUM-1 { n 32*i; DEST[n+31:n] RoundFPControl_MXCSR(DEST[n+31:n]*SRC3[n+31:n] - SRC2[n+31:n]) } IF (VEX.128) THEN DEST[MAX_VL-1:128] 0 ELSEIF (VEX.256) DEST[MAX_VL-1:256] 0 FI VFMSUB213PS DEST, SRC2, SRC3 (VEX encoded version) IF (VEX.128) THEN MAXNUM 2 ELSEIF (VEX.256) MAXNUM 4 FI For i = 0 to MAXNUM-1 { n 32*i; DEST[n+31:n] RoundFPControl_MXCSR(SRC2[n+31:n]*DEST[n+31:n] - SRC3[n+31:n]) } IF (VEX.128) THEN DEST[MAX_VL-1:128] 0 ELSEIF (VEX.256) DEST[MAX_VL-1:256] 0 FI VFMSUB231PS DEST, SRC2, SRC3 (VEX encoded version) IF (VEX.128) THEN MAXNUM 2 ELSEIF (VEX.256) MAXNUM 4 FI For i = 0 to MAXNUM-1 { n 32*i; DEST[n+31:n] RoundFPControl_MXCSR(SRC2[n+31:n]*SRC3[n+31:n] - DEST[n+31:n]) } IF (VEX.128) THEN DEST[MAX_VL-1:128] 0 ELSEIF (VEX.256) DEST[MAX_VL-1:256] 0 FI 5-194 Vol. 2C VFMSUB132PS/VFMSUB213PS/VFMSUB231PS—Fused Multiply-Subtract of Packed Single-Precision Floating-Point Values INSTRUCTION SET REFERENCE, V-Z VFMSUB132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register) (KL, VL) = (4, 128), (8, 256), (16, 512) IF (VL = 512) AND (EVEX.b = 1) THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] RoundFPControl(DEST[i+31:i]*SRC3[i+31:i] - SRC2[i+31:i]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFMSUB132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN IF (EVEX.b = 1) THEN DEST[i+31:i] RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[31:0] - SRC2[i+31:i]) ELSE DEST[i+31:i] RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[i+31:i] - SRC2[i+31:i]) FI; ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFMSUB132PS/VFMSUB213PS/VFMSUB231PS—Fused Multiply-Subtract of Packed Single-Precision Floating-Point Values Vol. 2C 5-195 INSTRUCTION SET REFERENCE, V-Z VFMSUB213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register) (KL, VL) = (4, 128), (8, 256), (16, 512) IF (VL = 512) AND (EVEX.b = 1) THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] - SRC3[i+31:i]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-196 Vol. 
VFMSUB213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+31:i] ← RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] - SRC3[31:0])
                ELSE DEST[i+31:i] ← RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] - SRC3[i+31:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFMSUB231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← RoundFPControl(SRC2[i+31:i]*SRC3[i+31:i] - DEST[i+31:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFMSUB231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+31:i] ← RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[31:0] - DEST[i+31:i])
                ELSE DEST[i+31:i] ← RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[i+31:i] - DEST[i+31:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VFMSUBxxxPS __m512 _mm512_fmsub_ps(__m512 a, __m512 b, __m512 c);
VFMSUBxxxPS __m512 _mm512_fmsub_round_ps(__m512 a, __m512 b, __m512 c, int r);
VFMSUBxxxPS __m512 _mm512_mask_fmsub_ps(__m512 a, __mmask16 k, __m512 b, __m512 c);
VFMSUBxxxPS __m512 _mm512_maskz_fmsub_ps(__mmask16 k, __m512 a, __m512 b, __m512 c);
VFMSUBxxxPS __m512 _mm512_mask3_fmsub_ps(__m512 a, __m512 b, __m512 c, __mmask16 k);
VFMSUBxxxPS __m512 _mm512_mask_fmsub_round_ps(__m512 a, __mmask16 k, __m512 b, __m512 c, int r);
VFMSUBxxxPS __m512 _mm512_maskz_fmsub_round_ps(__mmask16 k, __m512 a, __m512 b, __m512 c, int r);
VFMSUBxxxPS __m512 _mm512_mask3_fmsub_round_ps(__m512 a, __m512 b, __m512 c, __mmask16 k, int r);
VFMSUBxxxPS __m256 _mm256_mask_fmsub_ps(__m256 a, __mmask8 k, __m256 b, __m256 c);
VFMSUBxxxPS __m256 _mm256_maskz_fmsub_ps(__mmask8 k, __m256 a, __m256 b, __m256 c);
VFMSUBxxxPS __m256 _mm256_mask3_fmsub_ps(__m256 a, __m256 b, __m256 c, __mmask8 k);
VFMSUBxxxPS __m128 _mm_mask_fmsub_ps(__m128 a, __mmask8 k, __m128 b, __m128 c);
VFMSUBxxxPS __m128 _mm_maskz_fmsub_ps(__mmask8 k, __m128 a, __m128 b, __m128 c);
VFMSUBxxxPS __m128 _mm_mask3_fmsub_ps(__m128 a, __m128 b, __m128 c, __mmask8 k);
VFMSUBxxxPS __m128 _mm_fmsub_ps (__m128 a, __m128 b, __m128 c);
VFMSUBxxxPS __m256 _mm256_fmsub_ps (__m256 a, __m256 b, __m256 c);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
VEX-encoded instructions, see Exceptions Type 2.
EVEX-encoded instructions, see Exceptions Type E2.
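The merging/zeroing write-mask behavior in the EVEX forms above can be modeled in scalar C. The following is a minimal sketch, not the architectural definition; the function name and parameters are hypothetical, and fmaf() stands in for the single-rounding fused arithmetic.

#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the masked VFMSUB132PS loop: each selected
   lane computes dest*src3 - src2 with one rounding (fmaf(a, b, -c) is a
   fused a*b - c). merging != 0 models merging-masking; merging == 0
   models zeroing-masking. */
static void fmsub132_ps_masked(float *dest, const float *src2,
                               const float *src3, uint16_t k1,
                               size_t kl, int merging) {
    for (size_t j = 0; j < kl; j++) {
        if (k1 & (1u << j))
            dest[j] = fmaf(dest[j], src3[j], -src2[j]);
        else if (!merging)
            dest[j] = 0.0f;       /* zeroing-masking: lane cleared */
        /* else merging-masking: lane left unchanged */
    }
}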
VFMSUB132SD/VFMSUB213SD/VFMSUB231SD—Fused Multiply-Subtract of Scalar Double-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.DDS.LIG.66.0F38.W1 9B /r VFMSUB132SD xmm1, xmm2, xmm3/m64 | RVM | V/V | FMA | Multiply scalar double-precision floating-point value from xmm1 and xmm3/m64, subtract xmm2 and put result in xmm1.
VEX.DDS.LIG.66.0F38.W1 AB /r VFMSUB213SD xmm1, xmm2, xmm3/m64 | RVM | V/V | FMA | Multiply scalar double-precision floating-point value from xmm1 and xmm2, subtract xmm3/m64 and put result in xmm1.
VEX.DDS.LIG.66.0F38.W1 BB /r VFMSUB231SD xmm1, xmm2, xmm3/m64 | RVM | V/V | FMA | Multiply scalar double-precision floating-point value from xmm2 and xmm3/m64, subtract xmm1 and put result in xmm1.
EVEX.DDS.LIG.66.0F38.W1 9B /r VFMSUB132SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | T1S | V/V | AVX512F | Multiply scalar double-precision floating-point value from xmm1 and xmm3/m64, subtract xmm2 and put result in xmm1.
EVEX.DDS.LIG.66.0F38.W1 AB /r VFMSUB213SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | T1S | V/V | AVX512F | Multiply scalar double-precision floating-point value from xmm1 and xmm2, subtract xmm3/m64 and put result in xmm1.
EVEX.DDS.LIG.66.0F38.W1 BB /r VFMSUB231SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | T1S | V/V | AVX512F | Multiply scalar double-precision floating-point value from xmm2 and xmm3/m64, subtract xmm1 and put result in xmm1.

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RVM | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
T1S | ModRM:reg (r, w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description

Performs a SIMD multiply-subtract computation on the low packed double-precision floating-point values using three source operands and writes the multiply-subtract result in the destination operand. The destination operand is also the first source operand. The second operand must be an XMM register. The third source operand can be an XMM register or a 64-bit memory location.

VFMSUB132SD: Multiplies the low packed double-precision floating-point value from the first source operand to the low packed double-precision floating-point value in the third source operand. From the infinite precision intermediate result, subtracts the low packed double-precision floating-point value in the second source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand).

VFMSUB213SD: Multiplies the low packed double-precision floating-point value from the second source operand to the low packed double-precision floating-point value in the first source operand. From the infinite precision intermediate result, subtracts the low packed double-precision floating-point value in the third source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand).

VFMSUB231SD: Multiplies the low packed double-precision floating-point value from the second source operand to the low packed double-precision floating-point value in the third source operand. From the infinite precision intermediate result, subtracts the low packed double-precision floating-point value in the first source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand).
VEX.128 and EVEX encoded version: The destination operand (also first source operand) is encoded in reg_field. The second source operand is encoded in VEX.vvvv/EVEX.vvvv. The third source operand is encoded in rm_field. Bits 127:64 of the destination are unchanged. Bits MAXVL-1:128 of the destination register are zeroed.

EVEX encoded version: The low quadword element of the destination is updated according to the writemask.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NANs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column.

Operation

In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no rounding).

VFMSUB132SD DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← RoundFPControl(DEST[63:0]*SRC3[63:0] - SRC2[63:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← DEST[127:64]
DEST[MAX_VL-1:128] ← 0

VFMSUB213SD DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← RoundFPControl(SRC2[63:0]*DEST[63:0] - SRC3[63:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← DEST[127:64]
DEST[MAX_VL-1:128] ← 0
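The DEST[127:64] ← DEST[127:64] step above means the scalar result replaces only the low element. A brief sketch (illustrative only, assuming an FMA-capable target) using the corresponding intrinsic:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128d a = _mm_set_pd(99.0, 2.0);   /* high = 99.0, low = 2.0 */
    __m128d b = _mm_set_pd(-1.0, 3.0);
    __m128d c = _mm_set_pd(-1.0, 1.0);
    /* Low lane: 2*3 - 1 = 5; bits 127:64 are carried over from a. */
    __m128d r = _mm_fmsub_sd(a, b, c);
    double out[2];
    _mm_storeu_pd(out, r);
    printf("low=%f high=%f\n", out[0], out[1]);  /* low=5, high=99 */
    return 0;
}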
VFMSUB231SD DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← RoundFPControl(SRC2[63:0]*SRC3[63:0] - DEST[63:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← DEST[127:64]
DEST[MAX_VL-1:128] ← 0

VFMSUB132SD DEST, SRC2, SRC3 (VEX encoded version)
DEST[63:0] ← RoundFPControl_MXCSR(DEST[63:0]*SRC3[63:0] - SRC2[63:0])
DEST[127:64] ← DEST[127:64]
DEST[MAX_VL-1:128] ← 0

VFMSUB213SD DEST, SRC2, SRC3 (VEX encoded version)
DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*DEST[63:0] - SRC3[63:0])
DEST[127:64] ← DEST[127:64]
DEST[MAX_VL-1:128] ← 0

VFMSUB231SD DEST, SRC2, SRC3 (VEX encoded version)
DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*SRC3[63:0] - DEST[63:0])
DEST[127:64] ← DEST[127:64]
DEST[MAX_VL-1:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VFMSUBxxxSD __m128d _mm_fmsub_round_sd(__m128d a, __m128d b, __m128d c, int r);
VFMSUBxxxSD __m128d _mm_mask_fmsub_sd(__m128d a, __mmask8 k, __m128d b, __m128d c);
VFMSUBxxxSD __m128d _mm_maskz_fmsub_sd(__mmask8 k, __m128d a, __m128d b, __m128d c);
VFMSUBxxxSD __m128d _mm_mask3_fmsub_sd(__m128d a, __m128d b, __m128d c, __mmask8 k);
VFMSUBxxxSD __m128d _mm_mask_fmsub_round_sd(__m128d a, __mmask8 k, __m128d b, __m128d c, int r);
VFMSUBxxxSD __m128d _mm_maskz_fmsub_round_sd(__mmask8 k, __m128d a, __m128d b, __m128d c, int r);
VFMSUBxxxSD __m128d _mm_mask3_fmsub_round_sd(__m128d a, __m128d b, __m128d c, __mmask8 k, int r);
VFMSUBxxxSD __m128d _mm_fmsub_sd (__m128d a, __m128d b, __m128d c);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
VEX-encoded instructions, see Exceptions Type 3.
EVEX-encoded instructions, see Exceptions Type E3.

VFMSUB132SS/VFMSUB213SS/VFMSUB231SS—Fused Multiply-Subtract of Scalar Single-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.DDS.LIG.66.0F38.W0 9B /r VFMSUB132SS xmm1, xmm2, xmm3/m32 | RVM | V/V | FMA | Multiply scalar single-precision floating-point value from xmm1 and xmm3/m32, subtract xmm2 and put result in xmm1.
VEX.DDS.LIG.66.0F38.W0 AB /r VFMSUB213SS xmm1, xmm2, xmm3/m32 | RVM | V/V | FMA | Multiply scalar single-precision floating-point value from xmm1 and xmm2, subtract xmm3/m32 and put result in xmm1.
VEX.DDS.LIG.66.0F38.W0 BB /r VFMSUB231SS xmm1, xmm2, xmm3/m32 | RVM | V/V | FMA | Multiply scalar single-precision floating-point value from xmm2 and xmm3/m32, subtract xmm1 and put result in xmm1.
EVEX.DDS.LIG.66.0F38.W0 9B /r VFMSUB132SS xmm1 {k1}{z}, xmm2, xmm3/m32{er} | T1S | V/V | AVX512F | Multiply scalar single-precision floating-point value from xmm1 and xmm3/m32, subtract xmm2 and put result in xmm1.
EVEX.DDS.LIG.66.0F38.W0 AB /r VFMSUB213SS xmm1 {k1}{z}, xmm2, xmm3/m32{er} | T1S | V/V | AVX512F | Multiply scalar single-precision floating-point value from xmm1 and xmm2, subtract xmm3/m32 and put result in xmm1.
EVEX.DDS.LIG.66.0F38.W0 BB /r VFMSUB231SS xmm1 {k1}{z}, xmm2, xmm3/m32{er} | T1S | V/V | AVX512F | Multiply scalar single-precision floating-point value from xmm2 and xmm3/m32, subtract xmm1 and put result in xmm1.
Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RVM | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
T1S | ModRM:reg (r, w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description

Performs a SIMD multiply-subtract computation on the low packed single-precision floating-point values using three source operands and writes the multiply-subtract result in the destination operand. The destination operand is also the first source operand. The second operand must be an XMM register. The third source operand can be an XMM register or a 32-bit memory location.

VFMSUB132SS: Multiplies the low packed single-precision floating-point value from the first source operand to the low packed single-precision floating-point value in the third source operand. From the infinite precision intermediate result, subtracts the low packed single-precision floating-point value in the second source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).

VFMSUB213SS: Multiplies the low packed single-precision floating-point value from the second source operand to the low packed single-precision floating-point value in the first source operand. From the infinite precision intermediate result, subtracts the low packed single-precision floating-point value in the third source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).

VFMSUB231SS: Multiplies the low packed single-precision floating-point value from the second source operand to the low packed single-precision floating-point value in the third source operand. From the infinite precision intermediate result, subtracts the low packed single-precision floating-point value in the first source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).

VEX.128 and EVEX encoded version: The destination operand (also first source operand) is encoded in reg_field. The second source operand is encoded in VEX.vvvv/EVEX.vvvv. The third source operand is encoded in rm_field. Bits 127:32 of the destination are unchanged. Bits MAXVL-1:128 of the destination register are zeroed.

EVEX encoded version: The low doubleword element of the destination is updated according to the writemask.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NANs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column.

Operation

In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no rounding).
VFMSUB132SS DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← RoundFPControl(DEST[31:0]*SRC3[31:0] - SRC2[31:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

VFMSUB213SS DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← RoundFPControl(SRC2[31:0]*DEST[31:0] - SRC3[31:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

VFMSUB231SS DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← RoundFPControl(SRC2[31:0]*SRC3[31:0] - DEST[31:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

VFMSUB132SS DEST, SRC2, SRC3 (VEX encoded version)
DEST[31:0] ← RoundFPControl_MXCSR(DEST[31:0]*SRC3[31:0] - SRC2[31:0])
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

VFMSUB213SS DEST, SRC2, SRC3 (VEX encoded version)
DEST[31:0] ← RoundFPControl_MXCSR(SRC2[31:0]*DEST[31:0] - SRC3[31:0])
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

VFMSUB231SS DEST, SRC2, SRC3 (VEX encoded version)
DEST[31:0] ← RoundFPControl_MXCSR(SRC2[31:0]*SRC3[31:0] - DEST[31:0])
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VFMSUBxxxSS __m128 _mm_fmsub_round_ss(__m128 a, __m128 b, __m128 c, int r);
VFMSUBxxxSS __m128 _mm_mask_fmsub_ss(__m128 a, __mmask8 k, __m128 b, __m128 c);
VFMSUBxxxSS __m128 _mm_maskz_fmsub_ss(__mmask8 k, __m128 a, __m128 b, __m128 c);
VFMSUBxxxSS __m128 _mm_mask3_fmsub_ss(__m128 a, __m128 b, __m128 c, __mmask8 k);
VFMSUBxxxSS __m128 _mm_mask_fmsub_round_ss(__m128 a, __mmask8 k, __m128 b, __m128 c, int r);
VFMSUBxxxSS __m128 _mm_maskz_fmsub_round_ss(__mmask8 k, __m128 a, __m128 b, __m128 c, int r);
VFMSUBxxxSS __m128 _mm_mask3_fmsub_round_ss(__m128 a, __m128 b, __m128 c, __mmask8 k, int r);
VFMSUBxxxSS __m128 _mm_fmsub_ss (__m128 a, __m128 b, __m128 c);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
VEX-encoded instructions, see Exceptions Type 3.
EVEX-encoded instructions, see Exceptions Type E3.
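As a brief illustration of the {er} forms (a sketch only, assuming an AVX512F target), the _round_ intrinsic variants listed above encode an embedded rounding control that overrides MXCSR.RM for the single instruction:

#include <immintrin.h>

/* Sketch: VFMSUB132SS/213SS/231SS with {er} via the _round_ intrinsic;
   _MM_FROUND_TO_ZERO selects round-toward-zero for this one operation
   without modifying MXCSR. */
__m128 fmsub_ss_round_to_zero(__m128 a, __m128 b, __m128 c) {
    return _mm_fmsub_round_ss(a, b, c,
                              _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
}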
VFNMADD132PD/VFNMADD213PD/VFNMADD231PD—Fused Negative Multiply-Add of Packed Double-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.NDS.128.66.0F38.W1 9C /r VFNMADD132PD xmm1, xmm2, xmm3/m128 | RVM | V/V | FMA | Multiply packed double-precision floating-point values from xmm1 and xmm3/mem, negate the multiplication result and add to xmm2 and put result in xmm1.
VEX.NDS.128.66.0F38.W1 AC /r VFNMADD213PD xmm1, xmm2, xmm3/m128 | RVM | V/V | FMA | Multiply packed double-precision floating-point values from xmm1 and xmm2, negate the multiplication result and add to xmm3/mem and put result in xmm1.
VEX.NDS.128.66.0F38.W1 BC /r VFNMADD231PD xmm1, xmm2, xmm3/m128 | RVM | V/V | FMA | Multiply packed double-precision floating-point values from xmm2 and xmm3/mem, negate the multiplication result and add to xmm1 and put result in xmm1.
VEX.NDS.256.66.0F38.W1 9C /r VFNMADD132PD ymm1, ymm2, ymm3/m256 | RVM | V/V | FMA | Multiply packed double-precision floating-point values from ymm1 and ymm3/mem, negate the multiplication result and add to ymm2 and put result in ymm1.
VEX.NDS.256.66.0F38.W1 AC /r VFNMADD213PD ymm1, ymm2, ymm3/m256 | RVM | V/V | FMA | Multiply packed double-precision floating-point values from ymm1 and ymm2, negate the multiplication result and add to ymm3/mem and put result in ymm1.
VEX.NDS.256.66.0F38.W1 BC /r VFNMADD231PD ymm1, ymm2, ymm3/m256 | RVM | V/V | FMA | Multiply packed double-precision floating-point values from ymm2 and ymm3/mem, negate the multiplication result and add to ymm1 and put result in ymm1.
EVEX.NDS.128.66.0F38.W1 9C /r VFNMADD132PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Multiply packed double-precision floating-point values from xmm1 and xmm3/m128/m64bcst, negate the multiplication result and add to xmm2 and put result in xmm1.
EVEX.NDS.128.66.0F38.W1 AC /r VFNMADD213PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Multiply packed double-precision floating-point values from xmm1 and xmm2, negate the multiplication result and add to xmm3/m128/m64bcst and put result in xmm1.
EVEX.NDS.128.66.0F38.W1 BC /r VFNMADD231PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Multiply packed double-precision floating-point values from xmm2 and xmm3/m128/m64bcst, negate the multiplication result and add to xmm1 and put result in xmm1.
EVEX.NDS.256.66.0F38.W1 9C /r VFNMADD132PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Multiply packed double-precision floating-point values from ymm1 and ymm3/m256/m64bcst, negate the multiplication result and add to ymm2 and put result in ymm1.
EVEX.NDS.256.66.0F38.W1 AC /r VFNMADD213PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Multiply packed double-precision floating-point values from ymm1 and ymm2, negate the multiplication result and add to ymm3/m256/m64bcst and put result in ymm1.
EVEX.NDS.256.66.0F38.W1 BC /r VFNMADD231PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Multiply packed double-precision floating-point values from ymm2 and ymm3/m256/m64bcst, negate the multiplication result and add to ymm1 and put result in ymm1.
EVEX.NDS.512.66.0F38.W1 9C /r VFNMADD132PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} | FV | V/V | AVX512F | Multiply packed double-precision floating-point values from zmm1 and zmm3/m512/m64bcst, negate the multiplication result and add to zmm2 and put result in zmm1.
EVEX.NDS.512.66.0F38.W1 AC /r VFNMADD213PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} | FV | V/V | AVX512F | Multiply packed double-precision floating-point values from zmm1 and zmm2, negate the multiplication result and add to zmm3/m512/m64bcst and put result in zmm1.
EVEX.NDS.512.66.0F38.W1 BC /r VFNMADD231PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} | FV | V/V | AVX512F | Multiply packed double-precision floating-point values from zmm2 and zmm3/m512/m64bcst, negate the multiplication result and add to zmm1 and put result in zmm1.

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RVM | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FV | ModRM:reg (r, w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description

VFNMADD132PD: Multiplies the two, four or eight packed double-precision floating-point values from the first source operand to the two, four or eight packed double-precision floating-point values in the third source operand, adds the negated infinite precision intermediate result to the two, four or eight packed double-precision floating-point values in the second source operand, performs rounding and stores the resulting two, four or eight packed double-precision floating-point values to the destination operand (first source operand).

VFNMADD213PD: Multiplies the two, four or eight packed double-precision floating-point values from the second source operand to the two, four or eight packed double-precision floating-point values in the first source operand, adds the negated infinite precision intermediate result to the two, four or eight packed double-precision floating-point values in the third source operand, performs rounding and stores the resulting two, four or eight packed double-precision floating-point values to the destination operand (first source operand).

VFNMADD231PD: Multiplies the two, four or eight packed double-precision floating-point values from the second source operand to the two, four or eight packed double-precision floating-point values in the third source operand, adds the negated infinite precision intermediate result to the two, four or eight packed double-precision floating-point values in the first source operand, performs rounding and stores the resulting two, four or eight packed double-precision floating-point values to the destination operand (first source operand).

EVEX encoded versions: The destination operand (also first source operand) and the second source operand are ZMM/YMM/XMM registers. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is conditionally updated with write mask k1.

VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field.

VEX.128 encoded version: The destination operand (also first source operand) is an XMM register and encoded in reg_field. The second source operand is an XMM register and encoded in VEX.vvvv. The third source operand is an XMM register or a 128-bit memory location and encoded in rm_field.
The upper 128 bits of the YMM destination register are zeroed.

Operation

In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no rounding).

VFNMADD132PD DEST, SRC2, SRC3 (VEX encoded version)
IF (VEX.128) THEN MAXNUM ← 2
ELSEIF (VEX.256) MAXNUM ← 4
FI
For i = 0 to MAXNUM-1 {
    n ← 64*i;
    DEST[n+63:n] ← RoundFPControl_MXCSR(-(DEST[n+63:n]*SRC3[n+63:n]) + SRC2[n+63:n])
}
IF (VEX.128) THEN DEST[MAX_VL-1:128] ← 0
ELSEIF (VEX.256) DEST[MAX_VL-1:256] ← 0
FI

VFNMADD213PD DEST, SRC2, SRC3 (VEX encoded version)
IF (VEX.128) THEN MAXNUM ← 2
ELSEIF (VEX.256) MAXNUM ← 4
FI
For i = 0 to MAXNUM-1 {
    n ← 64*i;
    DEST[n+63:n] ← RoundFPControl_MXCSR(-(SRC2[n+63:n]*DEST[n+63:n]) + SRC3[n+63:n])
}
IF (VEX.128) THEN DEST[MAX_VL-1:128] ← 0
ELSEIF (VEX.256) DEST[MAX_VL-1:256] ← 0
FI

VFNMADD231PD DEST, SRC2, SRC3 (VEX encoded version)
IF (VEX.128) THEN MAXNUM ← 2
ELSEIF (VEX.256) MAXNUM ← 4
FI
For i = 0 to MAXNUM-1 {
    n ← 64*i;
    DEST[n+63:n] ← RoundFPControl_MXCSR(-(SRC2[n+63:n]*SRC3[n+63:n]) + DEST[n+63:n])
}
IF (VEX.128) THEN DEST[MAX_VL-1:128] ← 0
ELSEIF (VEX.256) DEST[MAX_VL-1:256] ← 0
FI

VFNMADD132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← RoundFPControl(-(DEST[i+63:i]*SRC3[i+63:i]) + SRC2[i+63:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFNMADD132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+63:i] ← RoundFPControl_MXCSR(-(DEST[i+63:i]*SRC3[63:0]) + SRC2[i+63:i])
                ELSE DEST[i+63:i] ← RoundFPControl_MXCSR(-(DEST[i+63:i]*SRC3[i+63:i]) + SRC2[i+63:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
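Note that only the product is negated, never the addend, so each element computes the equivalent of c - a*b. A small illustrative example (assumes an FMA-capable target):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128d a = _mm_set1_pd(2.0);
    __m128d b = _mm_set1_pd(3.0);
    __m128d c = _mm_set1_pd(10.0);
    __m128d r = _mm_fnmadd_pd(a, b, c);  /* each lane: -(2*3) + 10 = 4 */
    double out[2];
    _mm_storeu_pd(out, r);
    printf("%f %f\n", out[0], out[1]);   /* 4.000000 4.000000 */
    return 0;
}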
VFNMADD213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← RoundFPControl(-(SRC2[i+63:i]*DEST[i+63:i]) + SRC3[i+63:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFNMADD213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+63:i] ← RoundFPControl_MXCSR(-(SRC2[i+63:i]*DEST[i+63:i]) + SRC3[63:0])
                ELSE DEST[i+63:i] ← RoundFPControl_MXCSR(-(SRC2[i+63:i]*DEST[i+63:i]) + SRC3[i+63:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFNMADD231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← RoundFPControl(-(SRC2[i+63:i]*SRC3[i+63:i]) + DEST[i+63:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFNMADD231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+63:i] ← RoundFPControl_MXCSR(-(SRC2[i+63:i]*SRC3[63:0]) + DEST[i+63:i])
                ELSE DEST[i+63:i] ← RoundFPControl_MXCSR(-(SRC2[i+63:i]*SRC3[i+63:i]) + DEST[i+63:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
Intel C/C++ Compiler Intrinsic Equivalent

VFNMADDxxxPD __m512d _mm512_fnmadd_pd(__m512d a, __m512d b, __m512d c);
VFNMADDxxxPD __m512d _mm512_fnmadd_round_pd(__m512d a, __m512d b, __m512d c, int r);
VFNMADDxxxPD __m512d _mm512_mask_fnmadd_pd(__m512d a, __mmask8 k, __m512d b, __m512d c);
VFNMADDxxxPD __m512d _mm512_maskz_fnmadd_pd(__mmask8 k, __m512d a, __m512d b, __m512d c);
VFNMADDxxxPD __m512d _mm512_mask3_fnmadd_pd(__m512d a, __m512d b, __m512d c, __mmask8 k);
VFNMADDxxxPD __m512d _mm512_mask_fnmadd_round_pd(__m512d a, __mmask8 k, __m512d b, __m512d c, int r);
VFNMADDxxxPD __m512d _mm512_maskz_fnmadd_round_pd(__mmask8 k, __m512d a, __m512d b, __m512d c, int r);
VFNMADDxxxPD __m512d _mm512_mask3_fnmadd_round_pd(__m512d a, __m512d b, __m512d c, __mmask8 k, int r);
VFNMADDxxxPD __m256d _mm256_mask_fnmadd_pd(__m256d a, __mmask8 k, __m256d b, __m256d c);
VFNMADDxxxPD __m256d _mm256_maskz_fnmadd_pd(__mmask8 k, __m256d a, __m256d b, __m256d c);
VFNMADDxxxPD __m256d _mm256_mask3_fnmadd_pd(__m256d a, __m256d b, __m256d c, __mmask8 k);
VFNMADDxxxPD __m128d _mm_mask_fnmadd_pd(__m128d a, __mmask8 k, __m128d b, __m128d c);
VFNMADDxxxPD __m128d _mm_maskz_fnmadd_pd(__mmask8 k, __m128d a, __m128d b, __m128d c);
VFNMADDxxxPD __m128d _mm_mask3_fnmadd_pd(__m128d a, __m128d b, __m128d c, __mmask8 k);
VFNMADDxxxPD __m128d _mm_fnmadd_pd (__m128d a, __m128d b, __m128d c);
VFNMADDxxxPD __m256d _mm256_fnmadd_pd (__m256d a, __m256d b, __m256d c);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
VEX-encoded instructions, see Exceptions Type 2.
EVEX-encoded instructions, see Exceptions Type E2.

VFNMADD132PS/VFNMADD213PS/VFNMADD231PS—Fused Negative Multiply-Add of Packed Single-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.NDS.128.66.0F38.W0 9C /r VFNMADD132PS xmm1, xmm2, xmm3/m128 | RVM | V/V | FMA | Multiply packed single-precision floating-point values from xmm1 and xmm3/mem, negate the multiplication result and add to xmm2 and put result in xmm1.
VEX.NDS.128.66.0F38.W0 AC /r VFNMADD213PS xmm1, xmm2, xmm3/m128 | RVM | V/V | FMA | Multiply packed single-precision floating-point values from xmm1 and xmm2, negate the multiplication result and add to xmm3/mem and put result in xmm1.
VEX.NDS.128.66.0F38.W0 BC /r VFNMADD231PS xmm1, xmm2, xmm3/m128 | RVM | V/V | FMA | Multiply packed single-precision floating-point values from xmm2 and xmm3/mem, negate the multiplication result and add to xmm1 and put result in xmm1.
VEX.NDS.256.66.0F38.W0 9C /r VFNMADD132PS ymm1, ymm2, ymm3/m256 | RVM | V/V | FMA | Multiply packed single-precision floating-point values from ymm1 and ymm3/mem, negate the multiplication result and add to ymm2 and put result in ymm1.
VEX.NDS.256.66.0F38.W0 AC /r VFNMADD213PS ymm1, ymm2, ymm3/m256 | RVM | V/V | FMA | Multiply packed single-precision floating-point values from ymm1 and ymm2, negate the multiplication result and add to ymm3/mem and put result in ymm1.
VEX.NDS.256.66.0F38.W0 BC /r VFNMADD231PS ymm1, ymm2, ymm3/m256 | RVM | V/V | FMA | Multiply packed single-precision floating-point values from ymm2 and ymm3/mem, negate the multiplication result and add to ymm1 and put result in ymm1.
EVEX.NDS.128.66.0F38.W0 9C /r VFNMADD132PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply packed single-precision floating-point values from xmm1 and xmm3/m128/m32bcst, negate the multiplication result and add to xmm2 and put result in xmm1.
EVEX.NDS.128.66.0F38.W0 AC /r VFNMADD213PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply packed single-precision floating-point values from xmm1 and xmm2, negate the multiplication result and add to xmm3/m128/m32bcst and put result in xmm1.
EVEX.NDS.128.66.0F38.W0 BC /r VFNMADD231PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply packed single-precision floating-point values from xmm2 and xmm3/m128/m32bcst, negate the multiplication result and add to xmm1 and put result in xmm1.
EVEX.NDS.256.66.0F38.W0 9C /r VFNMADD132PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply packed single-precision floating-point values from ymm1 and ymm3/m256/m32bcst, negate the multiplication result and add to ymm2 and put result in ymm1.
EVEX.NDS.256.66.0F38.W0 AC /r VFNMADD213PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply packed single-precision floating-point values from ymm1 and ymm2, negate the multiplication result and add to ymm3/m256/m32bcst and put result in ymm1.
EVEX.NDS.256.66.0F38.W0 BC /r VFNMADD231PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply packed single-precision floating-point values from ymm2 and ymm3/m256/m32bcst, negate the multiplication result and add to ymm1 and put result in ymm1.
EVEX.NDS.512.66.0F38.W0 9C /r VFNMADD132PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | FV | V/V | AVX512F | Multiply packed single-precision floating-point values from zmm1 and zmm3/m512/m32bcst, negate the multiplication result and add to zmm2 and put result in zmm1.
EVEX.NDS.512.66.0F38.W0 AC /r VFNMADD213PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | FV | V/V | AVX512F | Multiply packed single-precision floating-point values from zmm1 and zmm2, negate the multiplication result and add to zmm3/m512/m32bcst and put result in zmm1.
EVEX.NDS.512.66.0F38.W0 BC /r VFNMADD231PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | FV | V/V | AVX512F | Multiply packed single-precision floating-point values from zmm2 and zmm3/m512/m32bcst, negate the multiplication result and add to zmm1 and put result in zmm1.
Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RVM | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FV | ModRM:reg (r, w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description

VFNMADD132PS: Multiplies the four, eight or sixteen packed single-precision floating-point values from the first source operand to the four, eight or sixteen packed single-precision floating-point values in the third source operand, adds the negated infinite precision intermediate result to the four, eight or sixteen packed single-precision floating-point values in the second source operand, performs rounding and stores the resulting four, eight or sixteen packed single-precision floating-point values to the destination operand (first source operand).

VFNMADD213PS: Multiplies the four, eight or sixteen packed single-precision floating-point values from the second source operand to the four, eight or sixteen packed single-precision floating-point values in the first source operand, adds the negated infinite precision intermediate result to the four, eight or sixteen packed single-precision floating-point values in the third source operand, performs rounding and stores the resulting four, eight or sixteen packed single-precision floating-point values to the destination operand (first source operand).

VFNMADD231PS: Multiplies the four, eight or sixteen packed single-precision floating-point values from the second source operand to the four, eight or sixteen packed single-precision floating-point values in the third source operand, adds the negated infinite precision intermediate result to the four, eight or sixteen packed single-precision floating-point values in the first source operand, performs rounding and stores the resulting four, eight or sixteen packed single-precision floating-point values to the destination operand (first source operand).

EVEX encoded versions: The destination operand (also first source operand) and the second source operand are ZMM/YMM/XMM registers. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is conditionally updated with write mask k1.

VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field.

VEX.128 encoded version: The destination operand (also first source operand) is an XMM register and encoded in reg_field. The second source operand is an XMM register and encoded in VEX.vvvv. The third source operand is an XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed.

Operation

In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no rounding).
VFNMADD132PS DEST, SRC2, SRC3 (VEX encoded version)
IF (VEX.128) THEN MAXNUM ← 4
ELSEIF (VEX.256) MAXNUM ← 8
FI
For i = 0 to MAXNUM-1 {
    n ← 32*i;
    DEST[n+31:n] ← RoundFPControl_MXCSR(-(DEST[n+31:n]*SRC3[n+31:n]) + SRC2[n+31:n])
}
IF (VEX.128) THEN DEST[MAX_VL-1:128] ← 0
ELSEIF (VEX.256) DEST[MAX_VL-1:256] ← 0
FI

VFNMADD213PS DEST, SRC2, SRC3 (VEX encoded version)
IF (VEX.128) THEN MAXNUM ← 4
ELSEIF (VEX.256) MAXNUM ← 8
FI
For i = 0 to MAXNUM-1 {
    n ← 32*i;
    DEST[n+31:n] ← RoundFPControl_MXCSR(-(SRC2[n+31:n]*DEST[n+31:n]) + SRC3[n+31:n])
}
IF (VEX.128) THEN DEST[MAX_VL-1:128] ← 0
ELSEIF (VEX.256) DEST[MAX_VL-1:256] ← 0
FI

VFNMADD231PS DEST, SRC2, SRC3 (VEX encoded version)
IF (VEX.128) THEN MAXNUM ← 4
ELSEIF (VEX.256) MAXNUM ← 8
FI
For i = 0 to MAXNUM-1 {
    n ← 32*i;
    DEST[n+31:n] ← RoundFPControl_MXCSR(-(SRC2[n+31:n]*SRC3[n+31:n]) + DEST[n+31:n])
}
IF (VEX.128) THEN DEST[MAX_VL-1:128] ← 0
ELSEIF (VEX.256) DEST[MAX_VL-1:256] ← 0
FI

VFNMADD132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← RoundFPControl(-(DEST[i+31:i]*SRC3[i+31:i]) + SRC2[i+31:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFNMADD132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+31:i] ← RoundFPControl_MXCSR(-(DEST[i+31:i]*SRC3[31:0]) + SRC2[i+31:i])
                ELSE DEST[i+31:i] ← RoundFPControl_MXCSR(-(DEST[i+31:i]*SRC3[i+31:i]) + SRC2[i+31:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFNMADD213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← RoundFPControl(-(SRC2[i+31:i]*DEST[i+31:i]) + SRC3[i+31:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
VFNMADD213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+31:i] ← RoundFPControl_MXCSR(-(SRC2[i+31:i]*DEST[i+31:i]) + SRC3[31:0])
                ELSE DEST[i+31:i] ← RoundFPControl_MXCSR(-(SRC2[i+31:i]*DEST[i+31:i]) + SRC3[i+31:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFNMADD231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← RoundFPControl(-(SRC2[i+31:i]*SRC3[i+31:i]) + DEST[i+31:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VFNMADD231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+31:i] ← RoundFPControl_MXCSR(-(SRC2[i+31:i]*SRC3[31:0]) + DEST[i+31:i])
                ELSE DEST[i+31:i] ← RoundFPControl_MXCSR(-(SRC2[i+31:i]*SRC3[i+31:i]) + DEST[i+31:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VFNMADDxxxPS __m512 _mm512_fnmadd_ps(__m512 a, __m512 b, __m512 c);
VFNMADDxxxPS __m512 _mm512_fnmadd_round_ps(__m512 a, __m512 b, __m512 c, int r);
VFNMADDxxxPS __m512 _mm512_mask_fnmadd_ps(__m512 a, __mmask16 k, __m512 b, __m512 c);
VFNMADDxxxPS __m512 _mm512_maskz_fnmadd_ps(__mmask16 k, __m512 a, __m512 b, __m512 c);
VFNMADDxxxPS __m512 _mm512_mask3_fnmadd_ps(__m512 a, __m512 b, __m512 c, __mmask16 k);
VFNMADDxxxPS __m512 _mm512_mask_fnmadd_round_ps(__m512 a, __mmask16 k, __m512 b, __m512 c, int r);
VFNMADDxxxPS __m512 _mm512_maskz_fnmadd_round_ps(__mmask16 k, __m512 a, __m512 b, __m512 c, int r);
VFNMADDxxxPS __m512 _mm512_mask3_fnmadd_round_ps(__m512 a, __m512 b, __m512 c, __mmask16 k, int r);
VFNMADDxxxPS __m256 _mm256_mask_fnmadd_ps(__m256 a, __mmask8 k, __m256 b, __m256 c);
VFNMADDxxxPS __m256 _mm256_maskz_fnmadd_ps(__mmask8 k, __m256 a, __m256 b, __m256 c);
VFNMADDxxxPS __m256 _mm256_mask3_fnmadd_ps(__m256 a, __m256 b, __m256 c, __mmask8 k);
VFNMADDxxxPS __m128 _mm_mask_fnmadd_ps(__m128 a, __mmask8 k, __m128 b, __m128 c);
VFNMADDxxxPS __m128 _mm_maskz_fnmadd_ps(__mmask8 k, __m128 a, __m128 b, __m128 c);
VFNMADDxxxPS __m128 _mm_mask3_fnmadd_ps(__m128 a, __m128 b, __m128 c, __mmask8 k);
VFNMADDxxxPS __m128 _mm_fnmadd_ps (__m128 a, __m128 b, __m128 c);
VFNMADDxxxPS __m256 _mm256_fnmadd_ps (__m256 a, __m256 b, __m256 c);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
VEX-encoded instructions, see Exceptions Type 2.
EVEX-encoded instructions, see Exceptions Type E2.
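The SRC3[31:0] term in the EVEX.b = 1 paths above reflects embedded broadcast: a single 32-bit element is fetched from memory and reused for every lane. A scalar C model (hypothetical helper, illustrative only; fmaf() stands in for the single-rounding fused arithmetic):

#include <math.h>
#include <stddef.h>

/* Hypothetical model of VFNMADD132PS with an m32bcst third operand:
   one float is loaded and broadcast, so every lane computes
   -(dest[j]*bcast) + src2[j] with one rounding. */
static void fnmadd132_ps_m32bcst(float *dest, const float *src2,
                                 const float *mem, size_t kl) {
    float bcast = mem[0];            /* EVEX.b = 1: only SRC3[31:0] is read */
    for (size_t j = 0; j < kl; j++)
        dest[j] = fmaf(-dest[j], bcast, src2[j]);
}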
VFNMADD132SD/VFNMADD213SD/VFNMADD231SD—Fused Negative Multiply-Add of Scalar Double-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.DDS.LIG.66.0F38.W1 9D /r VFNMADD132SD xmm1, xmm2, xmm3/m64 | RVM | V/V | FMA | Multiply scalar double-precision floating-point value from xmm1 and xmm3/mem, negate the multiplication result and add to xmm2 and put result in xmm1.
VEX.DDS.LIG.66.0F38.W1 AD /r VFNMADD213SD xmm1, xmm2, xmm3/m64 | RVM | V/V | FMA | Multiply scalar double-precision floating-point value from xmm1 and xmm2, negate the multiplication result and add to xmm3/mem and put result in xmm1.
VEX.DDS.LIG.66.0F38.W1 BD /r VFNMADD231SD xmm1, xmm2, xmm3/m64 | RVM | V/V | FMA | Multiply scalar double-precision floating-point value from xmm2 and xmm3/mem, negate the multiplication result and add to xmm1 and put result in xmm1.
EVEX.DDS.LIG.66.0F38.W1 9D /r VFNMADD132SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | T1S | V/V | AVX512F | Multiply scalar double-precision floating-point value from xmm1 and xmm3/m64, negate the multiplication result and add to xmm2 and put result in xmm1.
EVEX.DDS.LIG.66.0F38.W1 AD /r VFNMADD213SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | T1S | V/V | AVX512F | Multiply scalar double-precision floating-point value from xmm1 and xmm2, negate the multiplication result and add to xmm3/m64 and put result in xmm1.
EVEX.DDS.LIG.66.0F38.W1 BD /r VFNMADD231SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | T1S | V/V | AVX512F | Multiply scalar double-precision floating-point value from xmm2 and xmm3/m64, negate the multiplication result and add to xmm1 and put result in xmm1.

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RVM | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
T1S | ModRM:reg (r, w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description

VFNMADD132SD: Multiplies the low packed double-precision floating-point value from the first source operand to the low packed double-precision floating-point value in the third source operand, adds the negated infinite precision intermediate result to the low packed double-precision floating-point value in the second source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand).

VFNMADD213SD: Multiplies the low packed double-precision floating-point value from the second source operand to the low packed double-precision floating-point value in the first source operand, adds the negated infinite precision intermediate result to the low packed double-precision floating-point value in the third source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand).

VFNMADD231SD: Multiplies the low packed double-precision floating-point value from the second source operand to the low packed double-precision floating-point value in the third source operand, adds the negated infinite precision intermediate result to the low packed double-precision floating-point value in the first source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand).

VEX.128 and EVEX encoded version: The destination operand (also first source operand) is encoded in reg_field. The second source operand is encoded in VEX.vvvv/EVEX.vvvv. The third source operand is encoded in rm_field. Bits 127:64 of the destination are unchanged.
Bits MAXVL-1:128 of the destination register are zeroed.

EVEX encoded version: The low quadword element of the destination is updated according to the writemask.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NANs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column.

Operation

In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no rounding).

VFNMADD132SD DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← RoundFPControl(-(DEST[63:0]*SRC3[63:0]) + SRC2[63:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← DEST[127:64]
DEST[MAX_VL-1:128] ← 0

VFNMADD213SD DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← RoundFPControl(-(SRC2[63:0]*DEST[63:0]) + SRC3[63:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← DEST[127:64]
DEST[MAX_VL-1:128] ← 0

VFNMADD231SD DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← RoundFPControl(-(SRC2[63:0]*SRC3[63:0]) + DEST[63:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← DEST[127:64]
DEST[MAX_VL-1:128] ← 0

VFNMADD132SD DEST, SRC2, SRC3 (VEX encoded version)
DEST[63:0] ← RoundFPControl_MXCSR(-(DEST[63:0]*SRC3[63:0]) + SRC2[63:0])
DEST[127:64] ← DEST[127:64]
DEST[MAX_VL-1:128] ← 0

VFNMADD213SD DEST, SRC2, SRC3 (VEX encoded version)
DEST[63:0] ← RoundFPControl_MXCSR(-(SRC2[63:0]*DEST[63:0]) + SRC3[63:0])
DEST[127:64] ← DEST[127:64]
DEST[MAX_VL-1:128] ← 0

VFNMADD231SD DEST, SRC2, SRC3 (VEX encoded version)
DEST[63:0] ← RoundFPControl_MXCSR(-(SRC2[63:0]*SRC3[63:0]) + DEST[63:0])
DEST[127:64] ← DEST[127:64]
DEST[MAX_VL-1:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VFNMADDxxxSD __m128d _mm_fnmadd_round_sd(__m128d a, __m128d b, __m128d c, int r);
VFNMADDxxxSD __m128d _mm_mask_fnmadd_sd(__m128d a, __mmask8 k, __m128d b, __m128d c);
VFNMADDxxxSD __m128d _mm_maskz_fnmadd_sd(__mmask8 k, __m128d a, __m128d b, __m128d c);
VFNMADDxxxSD __m128d _mm_mask3_fnmadd_sd(__m128d a, __m128d b, __m128d c, __mmask8 k);
VFNMADDxxxSD __m128d _mm_mask_fnmadd_round_sd(__m128d a, __mmask8 k, __m128d b, __m128d c, int r);
VFNMADDxxxSD __m128d _mm_maskz_fnmadd_round_sd(__mmask8 k, __m128d a, __m128d b, __m128d c, int r);
VFNMADDxxxSD __m128d _mm_mask3_fnmadd_round_sd(__m128d a, __m128d b, __m128d c, __mmask8 k, int r);
VFNMADDxxxSD __m128d _mm_fnmadd_sd (__m128d a, __m128d b, __m128d c);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal
Other Exceptions
VEX-encoded instructions, see Exceptions Type 3.
EVEX-encoded instructions, see Exceptions Type E3.

VFNMADD132SS/VFNMADD213SS/VFNMADD231SS—Fused Negative Multiply-Add of Scalar Single-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.DDS.LIG.66.0F38.W0 9D /r VFNMADD132SS xmm1, xmm2, xmm3/m32 | RVM | V/V | FMA | Multiply scalar single-precision floating-point value from xmm1 and xmm3/m32, negate the multiplication result and add to xmm2 and put result in xmm1.
VEX.DDS.LIG.66.0F38.W0 AD /r VFNMADD213SS xmm1, xmm2, xmm3/m32 | RVM | V/V | FMA | Multiply scalar single-precision floating-point value from xmm1 and xmm2, negate the multiplication result and add to xmm3/m32 and put result in xmm1.
VEX.DDS.LIG.66.0F38.W0 BD /r VFNMADD231SS xmm1, xmm2, xmm3/m32 | RVM | V/V | FMA | Multiply scalar single-precision floating-point value from xmm2 and xmm3/m32, negate the multiplication result and add to xmm1 and put result in xmm1.
EVEX.DDS.LIG.66.0F38.W0 9D /r VFNMADD132SS xmm1 {k1}{z}, xmm2, xmm3/m32{er} | T1S | V/V | AVX512F | Multiply scalar single-precision floating-point value from xmm1 and xmm3/m32, negate the multiplication result and add to xmm2 and put result in xmm1.
EVEX.DDS.LIG.66.0F38.W0 AD /r VFNMADD213SS xmm1 {k1}{z}, xmm2, xmm3/m32{er} | T1S | V/V | AVX512F | Multiply scalar single-precision floating-point value from xmm1 and xmm2, negate the multiplication result and add to xmm3/m32 and put result in xmm1.
EVEX.DDS.LIG.66.0F38.W0 BD /r VFNMADD231SS xmm1 {k1}{z}, xmm2, xmm3/m32{er} | T1S | V/V | AVX512F | Multiply scalar single-precision floating-point value from xmm2 and xmm3/m32, negate the multiplication result and add to xmm1 and put result in xmm1.

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RVM | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
T1S | ModRM:reg (r, w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description

VFNMADD132SS: Multiplies the low packed single-precision floating-point value from the first source operand to the low packed single-precision floating-point value in the third source operand, adds the negated infinite precision intermediate result to the low packed single-precision floating-point value in the second source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).

VFNMADD213SS: Multiplies the low packed single-precision floating-point value from the second source operand to the low packed single-precision floating-point value in the first source operand, adds the negated infinite precision intermediate result to the low packed single-precision floating-point value in the third source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).

VFNMADD231SS: Multiplies the low packed single-precision floating-point value from the second source operand to the low packed single-precision floating-point value in the third source operand, adds the negated infinite precision intermediate result to the low packed single-precision floating-point value in the first source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).
VEX.128 and EVEX encoded version: The destination operand (also first source operand) is encoded in reg_field. The second source operand is encoded in VEX.vvvv/EVEX.vvvv. The third source operand is encoded in rm_field. Bits 127:32 of the destination are unchanged. Bits MAXVL-1:128 of the destination register are zeroed.

EVEX encoded version: The low doubleword element of the destination is updated according to the writemask.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NANs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column.

Operation

In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no rounding).

VFNMADD132SS DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← RoundFPControl(-(DEST[31:0]*SRC3[31:0]) + SRC2[31:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

VFNMADD213SS DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← RoundFPControl(-(SRC2[31:0]*DEST[31:0]) + SRC3[31:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0
2C

VFNMADD231SS DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← RoundFPControl(-(SRC2[31:0]*SRC3[31:0]) + DEST[31:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                THEN DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

VFNMADD132SS DEST, SRC2, SRC3 (VEX encoded version)
DEST[31:0] ← RoundFPControl_MXCSR(-(DEST[31:0]*SRC3[31:0]) + SRC2[31:0])
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

VFNMADD213SS DEST, SRC2, SRC3 (VEX encoded version)
DEST[31:0] ← RoundFPControl_MXCSR(-(SRC2[31:0]*DEST[31:0]) + SRC3[31:0])
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

VFNMADD231SS DEST, SRC2, SRC3 (VEX encoded version)
DEST[31:0] ← RoundFPControl_MXCSR(-(SRC2[31:0]*SRC3[31:0]) + DEST[31:0])
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VFNMADDxxxSS __m128 _mm_fnmadd_round_ss(__m128 a, __m128 b, __m128 c, int r);
VFNMADDxxxSS __m128 _mm_mask_fnmadd_ss(__m128 a, __mmask8 k, __m128 b, __m128 c);
VFNMADDxxxSS __m128 _mm_maskz_fnmadd_ss(__mmask8 k, __m128 a, __m128 b, __m128 c);
VFNMADDxxxSS __m128 _mm_mask3_fnmadd_ss(__m128 a, __m128 b, __m128 c, __mmask8 k);
VFNMADDxxxSS __m128 _mm_mask_fnmadd_round_ss(__m128 a, __mmask8 k, __m128 b, __m128 c, int r);
VFNMADDxxxSS __m128 _mm_maskz_fnmadd_round_ss(__mmask8 k, __m128 a, __m128 b, __m128 c, int r);
VFNMADDxxxSS __m128 _mm_mask3_fnmadd_round_ss(__m128 a, __m128 b, __m128 c, __mmask8 k, int r);
VFNMADDxxxSS __m128 _mm_fnmadd_ss(__m128 a, __m128 b, __m128 c);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
VEX-encoded instructions, see Exceptions Type 3.
EVEX-encoded instructions, see Exceptions Type E3.
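As a concrete illustration of the zeroing-masking path above, a minimal sketch (assuming an AVX512F-capable compiler invoked with -mavx512f; values are arbitrary) using _mm_maskz_fnmadd_ss from the list above:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m128 a = _mm_set_ss(2.0f), b = _mm_set_ss(3.0f), c = _mm_set_ss(10.0f);
        /* k[0] = 1: the low element computes -(2*3) + 10 = 4.
           k[0] = 0: zeroing-masking writes +0.0 instead. */
        __m128 kept   = _mm_maskz_fnmadd_ss((__mmask8)1, a, b, c);
        __m128 zeroed = _mm_maskz_fnmadd_ss((__mmask8)0, a, b, c);
        printf("%f %f\n", _mm_cvtss_f32(kept), _mm_cvtss_f32(zeroed)); /* 4.000000 0.000000 */
        return 0;
    }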
2C 5-223 INSTRUCTION SET REFERENCE, V-Z VFNMSUB132PD/VFNMSUB213PD/VFNMSUB231PD—Fused Negative Multiply-Subtract of Packed Double-Precision Floating-Point Values Opcode/ Instruction Op/ En RVM 64/32 bit Mode Support V/V CPUID Feature Flag FMA VEX.NDS.128.66.0F38.W1 9E /r VFNMSUB132PD xmm1, xmm2, xmm3/m128 VEX.NDS.128.66.0F38.W1 AE /r VFNMSUB213PD xmm1, xmm2, xmm3/m128 VEX.NDS.128.66.0F38.W1 BE /r VFNMSUB231PD xmm1, xmm2, xmm3/m128 VEX.NDS.256.66.0F38.W1 9E /r VFNMSUB132PD ymm1, ymm2, ymm3/m256 VEX.NDS.256.66.0F38.W1 AE /r VFNMSUB213PD ymm1, ymm2, ymm3/m256 VEX.NDS.256.66.0F38.W1 BE /r VFNMSUB231PD ymm1, ymm2, ymm3/m256 EVEX.NDS.128.66.0F38.W1 9E /r VFNMSUB132PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst RVM V/V FMA RVM V/V FMA RVM V/V FMA RVM V/V FMA RVM V/V FMA FV V/V AVX512VL AVX512F EVEX.NDS.128.66.0F38.W1 AE /r VFNMSUB213PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst EVEX.NDS.128.66.0F38.W1 BE /r VFNMSUB231PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst FV V/V AVX512VL AVX512F FV V/V AVX512VL AVX512F EVEX.NDS.256.66.0F38.W1 9E /r VFNMSUB132PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst FV V/V AVX512VL AVX512F EVEX.NDS.256.66.0F38.W1 AE /r VFNMSUB213PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst EVEX.NDS.256.66.0F38.W1 BE /r VFNMSUB231PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst FV V/V AVX512VL AVX512F FV V/V AVX512VL AVX512F FV V/V AVX512F FV V/V AVX512F FV V/V AVX512F EVEX.NDS.512.66.0F38.W1 9E /r VFNMSUB132PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} EVEX.NDS.512.66.0F38.W1 AE /r VFNMSUB213PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} EVEX.NDS.512.66.0F38.W1 BE /r VFNMSUB231PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} 5-224 Vol. 2C Description Multiply packed double-precision floating-point values from xmm1 and xmm3/mem, negate the multiplication result and subtract xmm2 and put result in xmm1. Multiply packed double-precision floating-point values from xmm1 and xmm2, negate the multiplication result and subtract xmm3/mem and put result in xmm1. Multiply packed double-precision floating-point values from xmm2 and xmm3/mem, negate the multiplication result and subtract xmm1 and put result in xmm1. Multiply packed double-precision floating-point values from ymm1 and ymm3/mem, negate the multiplication result and subtract ymm2 and put result in ymm1. Multiply packed double-precision floating-point values from ymm1 and ymm2, negate the multiplication result and subtract ymm3/mem and put result in ymm1. Multiply packed double-precision floating-point values from ymm2 and ymm3/mem, negate the multiplication result and subtract ymm1 and put result in ymm1. Multiply packed double-precision floating-point values from xmm1 and xmm3/m128/m64bcst, negate the multiplication result and subtract xmm2 and put result in xmm1. Multiply packed double-precision floating-point values from xmm1 and xmm2, negate the multiplication result and subtract xmm3/m128/m64bcst and put result in xmm1. Multiply packed double-precision floating-point values from xmm2 and xmm3/m128/m64bcst, negate the multiplication result and subtract xmm1 and put result in xmm1. Multiply packed double-precision floating-point values from ymm1 and ymm3/m256/m64bcst, negate the multiplication result and subtract ymm2 and put result in ymm1. Multiply packed double-precision floating-point values from ymm1 and ymm2, negate the multiplication result and subtract ymm3/m256/m64bcst and put result in ymm1. Multiply packed double-precision floating-point values from ymm2 and ymm3/m256/m64bcst, negate the multiplication result and subtract ymm1 and put result in ymm1. 
Multiply packed double-precision floating-point values from zmm1 and zmm3/m512/m64bcst, negate the multiplication result and subtract zmm2 and put result in zmm1. Multiply packed double-precision floating-point values from zmm1 and zmm2, negate the multiplication result and subtract zmm3/m512/m64bcst and put result in zmm1. Multiply packed double-precision floating-point values from zmm2 and zmm3/m512/m64bcst, negate the multiplication result and subtract zmm1 and put result in zmm1. VFNMSUB132PD/VFNMSUB213PD/VFNMSUB231PD—Fused Negative Multiply-Subtract of Packed Double-Precision Floating-Point INSTRUCTION SET REFERENCE, V-Z Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RVM ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) NA FV ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) NA Description VFNMSUB132PD: Multiplies the two, four or eight packed double-precision floating-point values from the first source operand to the two, four or eight packed double-precision floating-point values in the third source operand. From negated infinite precision intermediate results, subtracts the two, four or eight packed double-precision floating-point values in the second source operand, performs rounding and stores the resulting two, four or eight packed double-precision floating-point values to the destination operand (first source operand). VFNMSUB213PD: Multiplies the two, four or eight packed double-precision floating-point values from the second source operand to the two, four or eight packed double-precision floating-point values in the first source operand. From negated infinite precision intermediate results, subtracts the two, four or eight packed double-precision floating-point values in the third source operand, performs rounding and stores the resulting two, four or eight packed double-precision floating-point values to the destination operand (first source operand). VFNMSUB231PD: Multiplies the two, four or eight packed double-precision floating-point values from the second source to the two, four or eight packed double-precision floating-point values in the third source operand. From negated infinite precision intermediate results, subtracts the two, four or eight packed double-precision floatingpoint values in the first source operand, performs rounding and stores the resulting two, four or eight packed double-precision floating-point values to the destination operand (first source operand). EVEX encoded versions: The destination operand (also first source operand) and the second source operand are ZMM/YMM/XMM register. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is conditionally updated with write mask k1. VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field. VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed. 
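A minimal self-contained sketch of the VEX.256 behavior (assuming GCC or Clang with -mfma; values are arbitrary) via the _mm256_fnmsub_pd intrinsic listed later in this section:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m256d a = _mm256_set1_pd(2.0);
        __m256d b = _mm256_set1_pd(3.0);
        __m256d c = _mm256_set1_pd(1.0);
        /* Each of the four lanes computes -(a*b) - c = -(2*3) - 1 = -7,
           with one rounding of the infinite precision intermediate. */
        __m256d r = _mm256_fnmsub_pd(a, b, c);
        double out[4];
        _mm256_storeu_pd(out, r);
        printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
        return 0;
    }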
Operation In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no rounding). VFNMSUB132PD DEST, SRC2, SRC3 (VEX encoded version) IF (VEX.128) THEN MAXNUM 2 ELSEIF (VEX.256) MAXNUM 4 FI For i = 0 to MAXNUM-1 { n 64*i; DEST[n+63:n] RoundFPControl_MXCSR( - (DEST[n+63:n]*SRC3[n+63:n]) - SRC2[n+63:n]) } IF (VEX.128) THEN DEST[MAX_VL-1:128] 0 ELSEIF (VEX.256) DEST[MAX_VL-1:256] 0 FI VFNMSUB132PD/VFNMSUB213PD/VFNMSUB231PD—Fused Negative Multiply-Subtract of Packed Double-Precision Floating-Point Vol. 2C 5-225 INSTRUCTION SET REFERENCE, V-Z VFNMSUB213PD DEST, SRC2, SRC3 (VEX encoded version) IF (VEX.128) THEN MAXNUM 2 ELSEIF (VEX.256) MAXNUM 4 FI For i = 0 to MAXNUM-1 { n 64*i; DEST[n+63:n] RoundFPControl_MXCSR( - (SRC2[n+63:n]*DEST[n+63:n]) - SRC3[n+63:n]) } IF (VEX.128) THEN DEST[MAX_VL-1:128] 0 ELSEIF (VEX.256) DEST[MAX_VL-1:256] 0 FI VFNMSUB231PD DEST, SRC2, SRC3 (VEX encoded version) IF (VEX.128) THEN MAXNUM 2 ELSEIF (VEX.256) MAXNUM 4 FI For i = 0 to MAXNUM-1 { n 64*i; DEST[n+63:n] RoundFPControl_MXCSR( - (SRC2[n+63:n]*SRC3[n+63:n]) - DEST[n+63:n]) } IF (VEX.128) THEN DEST[MAX_VL-1:128] 0 ELSEIF (VEX.256) DEST[MAX_VL-1:256] 0 FI VFNMSUB132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register) (KL, VL) = (2, 128), (4, 256), (8, 512) IF (VL = 512) AND (EVEX.b = 1) THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] RoundFPControl(-(DEST[i+63:i]*SRC3[i+63:i]) - SRC2[i+63:i]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-226 Vol. 2C VFNMSUB132PD/VFNMSUB213PD/VFNMSUB231PD—Fused Negative Multiply-Subtract of Packed Double-Precision Floating-Point INSTRUCTION SET REFERENCE, V-Z VFNMSUB132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source) (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN IF (EVEX.b = 1) THEN DEST[i+63:i] RoundFPControl_MXCSR(-(DEST[i+63:i]*SRC3[63:0]) - SRC2[i+63:i]) ELSE DEST[i+63:i] RoundFPControl_MXCSR(-(DEST[i+63:i]*SRC3[i+63:i]) - SRC2[i+63:i]) FI; ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFNMSUB213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register) (KL, VL) = (2, 128), (4, 256), (8, 512) IF (VL = 512) AND (EVEX.b = 1) THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] RoundFPControl(-(SRC2[i+63:i]*DEST[i+63:i]) - SRC3[i+63:i]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFNMSUB132PD/VFNMSUB213PD/VFNMSUB231PD—Fused Negative Multiply-Subtract of Packed Double-Precision Floating-Point Vol. 
2C 5-227 INSTRUCTION SET REFERENCE, V-Z VFNMSUB213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source) (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN IF (EVEX.b = 1) THEN DEST[i+63:i] RoundFPControl_MXCSR(-(SRC2[i+63:i]*DEST[i+63:i]) - SRC3[63:0]) ELSE DEST[i+63:i] RoundFPControl_MXCSR(-(SRC2[i+63:i]*DEST[i+63:i]) - SRC3[i+63:i]) FI; ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFNMSUB231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register) (KL, VL) = (2, 128), (4, 256), (8, 512) IF (VL = 512) AND (EVEX.b = 1) THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] RoundFPControl(-(SRC2[i+63:i]*SRC3[i+63:i]) - DEST[i+63:i]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-228 Vol. 2C VFNMSUB132PD/VFNMSUB213PD/VFNMSUB231PD—Fused Negative Multiply-Subtract of Packed Double-Precision Floating-Point INSTRUCTION SET REFERENCE, V-Z VFNMSUB231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source) (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN IF (EVEX.b = 1) THEN DEST[i+63:i] RoundFPControl_MXCSR(-(SRC2[i+63:i]*SRC3[63:0]) - DEST[i+63:i]) ELSE DEST[i+63:i] RoundFPControl_MXCSR(-(SRC2[i+63:i]*SRC3[i+63:i]) - DEST[i+63:i]) FI; ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 Intel C/C++ Compiler Intrinsic Equivalent VFNMSUBxxxPD __m512d _mm512_fnmsub_pd(__m512d a, __m512d b, __m512d c); VFNMSUBxxxPD __m512d _mm512_fnmsub_round_pd(__m512d a, __m512d b, __m512d c, int r); VFNMSUBxxxPD __m512d _mm512_mask_fnmsub_pd(__m512d a, __mmask8 k, __m512d b, __m512d c); VFNMSUBxxxPD __m512d _mm512_maskz_fnmsub_pd(__mmask8 k, __m512d a, __m512d b, __m512d c); VFNMSUBxxxPD __m512d _mm512_mask3_fnmsub_pd(__m512d a, __m512d b, __m512d c, __mmask8 k); VFNMSUBxxxPD __m512d _mm512_mask_fnmsub_round_pd(__m512d a, __mmask8 k, __m512d b, __m512d c, int r); VFNMSUBxxxPD __m512d _mm512_maskz_fnmsub_round_pd(__mmask8 k, __m512d a, __m512d b, __m512d c, int r); VFNMSUBxxxPD __m512d _mm512_mask3_fnmsub_round_pd(__m512d a, __m512d b, __m512d c, __mmask8 k, int r); VFNMSUBxxxPD __m256d _mm256_mask_fnmsub_pd(__m256d a, __mmask8 k, __m256d b, __m256d c); VFNMSUBxxxPD __m256d _mm256_maskz_fnmsub_pd(__mmask8 k, __m256d a, __m256d b, __m256d c); VFNMSUBxxxPD __m256d _mm256_mask3_fnmsub_pd(__m256d a, __m256d b, __m256d c, __mmask8 k); VFNMSUBxxxPD __m128d _mm_mask_fnmsub_pd(__m128d a, __mmask8 k, __m128d b, __m128d c); VFNMSUBxxxPD __m128d _mm_maskz_fnmsub_pd(__mmask8 k, __m128d a, __m128d b, __m128d c); VFNMSUBxxxPD __m128d _mm_mask3_fnmsub_pd(__m128d a, __m128d b, __m128d c, __mmask8 k); VFNMSUBxxxPD __m128d _mm_fnmsub_pd (__m128d a, __m128d b, __m128d c); VFNMSUBxxxPD __m256d _mm256_fnmsub_pd (__m256d a, __m256d b, __m256d c); SIMD Floating-Point Exceptions Overflow, Underflow, Invalid, Precision, Denormal Other Exceptions VEX-encoded instructions, see Exceptions Type 2. EVEX-encoded instructions, see Exceptions Type E2. VFNMSUB132PD/VFNMSUB213PD/VFNMSUB231PD—Fused Negative Multiply-Subtract of Packed Double-Precision Floating-Point Vol. 
2C 5-229

VFNMSUB132PS/VFNMSUB213PS/VFNMSUB231PS—Fused Negative Multiply-Subtract of Packed Single-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.NDS.128.66.0F38.W0 9E /r VFNMSUB132PS xmm1, xmm2, xmm3/m128 | RVM | V/V | FMA | Multiply packed single-precision floating-point values from xmm1 and xmm3/mem, negate the multiplication result and subtract xmm2 and put result in xmm1.
VEX.NDS.128.66.0F38.W0 AE /r VFNMSUB213PS xmm1, xmm2, xmm3/m128 | RVM | V/V | FMA | Multiply packed single-precision floating-point values from xmm1 and xmm2, negate the multiplication result and subtract xmm3/mem and put result in xmm1.
VEX.NDS.128.66.0F38.W0 BE /r VFNMSUB231PS xmm1, xmm2, xmm3/m128 | RVM | V/V | FMA | Multiply packed single-precision floating-point values from xmm2 and xmm3/mem, negate the multiplication result and subtract xmm1 and put result in xmm1.
VEX.NDS.256.66.0F38.W0 9E /r VFNMSUB132PS ymm1, ymm2, ymm3/m256 | RVM | V/V | FMA | Multiply packed single-precision floating-point values from ymm1 and ymm3/mem, negate the multiplication result and subtract ymm2 and put result in ymm1.
VEX.NDS.256.66.0F38.W0 AE /r VFNMSUB213PS ymm1, ymm2, ymm3/m256 | RVM | V/V | FMA | Multiply packed single-precision floating-point values from ymm1 and ymm2, negate the multiplication result and subtract ymm3/mem and put result in ymm1.
VEX.NDS.256.66.0F38.W0 BE /r VFNMSUB231PS ymm1, ymm2, ymm3/m256 | RVM | V/V | FMA | Multiply packed single-precision floating-point values from ymm2 and ymm3/mem, negate the multiplication result and subtract ymm1 and put result in ymm1.
EVEX.NDS.128.66.0F38.W0 9E /r VFNMSUB132PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply packed single-precision floating-point values from xmm1 and xmm3/m128/m32bcst, negate the multiplication result and subtract xmm2 and put result in xmm1.
EVEX.NDS.128.66.0F38.W0 AE /r VFNMSUB213PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply packed single-precision floating-point values from xmm1 and xmm2, negate the multiplication result and subtract xmm3/m128/m32bcst and put result in xmm1.
EVEX.NDS.128.66.0F38.W0 BE /r VFNMSUB231PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply packed single-precision floating-point values from xmm2 and xmm3/m128/m32bcst, negate the multiplication result and subtract xmm1 and put result in xmm1.
EVEX.NDS.256.66.0F38.W0 9E /r VFNMSUB132PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply packed single-precision floating-point values from ymm1 and ymm3/m256/m32bcst, negate the multiplication result and subtract ymm2 and put result in ymm1.
EVEX.NDS.256.66.0F38.W0 AE /r VFNMSUB213PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply packed single-precision floating-point values from ymm1 and ymm2, negate the multiplication result and subtract ymm3/m256/m32bcst and put result in ymm1.
EVEX.NDS.256.66.0F38.W0 BE /r VFNMSUB231PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply packed single-precision floating-point values from ymm2 and ymm3/m256/m32bcst, negate the multiplication result and subtract ymm1 and put result in ymm1.
EVEX.NDS.512.66.0F38.W0 9E /r VFNMSUB132PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | FV | V/V | AVX512F | Multiply packed single-precision floating-point values from zmm1 and zmm3/m512/m32bcst, negate the multiplication result and subtract zmm2 and put result in zmm1.
EVEX.NDS.512.66.0F38.W0 AE /r VFNMSUB213PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | FV | V/V | AVX512F | Multiply packed single-precision floating-point values from zmm1 and zmm2, negate the multiplication result and subtract zmm3/m512/m32bcst and put result in zmm1.
EVEX.NDS.512.66.0F38.W0 BE /r VFNMSUB231PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | FV | V/V | AVX512F | Multiply packed single-precision floating-point values from zmm2 and zmm3/m512/m32bcst, negate the multiplication result and subtract zmm1 and put result in zmm1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RVM | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FV | ModRM:reg (r, w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description
VFNMSUB132PS: Multiplies the four, eight or sixteen packed single-precision floating-point values from the first source operand to the four, eight or sixteen packed single-precision floating-point values in the third source operand. From negated infinite precision intermediate results, subtracts the four, eight or sixteen packed single-precision floating-point values in the second source operand, performs rounding and stores the resulting four, eight or sixteen packed single-precision floating-point values to the destination operand (first source operand).

VFNMSUB213PS: Multiplies the four, eight or sixteen packed single-precision floating-point values from the second source operand to the four, eight or sixteen packed single-precision floating-point values in the first source operand. From negated infinite precision intermediate results, subtracts the four, eight or sixteen packed single-precision floating-point values in the third source operand, performs rounding and stores the resulting four, eight or sixteen packed single-precision floating-point values to the destination operand (first source operand).

VFNMSUB231PS: Multiplies the four, eight or sixteen packed single-precision floating-point values from the second source operand to the four, eight or sixteen packed single-precision floating-point values in the third source operand. From negated infinite precision intermediate results, subtracts the four, eight or sixteen packed single-precision floating-point values in the first source operand, performs rounding and stores the resulting four, eight or sixteen packed single-precision floating-point values to the destination operand (first source operand).

EVEX encoded versions: The destination operand (also first source operand) and the second source operand are ZMM/YMM/XMM registers. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is conditionally updated with write mask k1.

VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field.

VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed.
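The conditional update with write mask k1 described above is exposed in C through the EVEX intrinsics listed later in this section; a minimal sketch (assuming a compiler invoked with -mavx512f; the mask value is arbitrary):

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m512 a = _mm512_set1_ps(2.0f);
        __m512 b = _mm512_set1_ps(3.0f);
        __m512 c = _mm512_set1_ps(1.0f);
        /* Merging-masking: lanes selected by k = 0x5555 (even lanes) get
           -(a*b) - c = -7.0; the other lanes keep their old value from a. */
        __m512 r = _mm512_mask_fnmsub_ps(a, (__mmask16)0x5555, b, c);
        float out[16];
        _mm512_storeu_ps(out, r);
        printf("lane0=%f lane1=%f\n", out[0], out[1]); /* -7.000000 2.000000 */
        return 0;
    }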
Operation In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no rounding). VFNMSUB132PS DEST, SRC2, SRC3 (VEX encoded version) IF (VEX.128) THEN MAXNUM 2 ELSEIF (VEX.256) MAXNUM 4 FI For i = 0 to MAXNUM-1 { n 32*i; DEST[n+31:n] RoundFPControl_MXCSR( - (DEST[n+31:n]*SRC3[n+31:n]) - SRC2[n+31:n]) } IF (VEX.128) THEN DEST[MAX_VL-1:128] 0 ELSEIF (VEX.256) DEST[MAX_VL-1:256] 0 FI VFNMSUB132PS/VFNMSUB213PS/VFNMSUB231PS—Fused Negative Multiply-Subtract of Packed Single-Precision Floating-Point Val- Vol. 2C 5-231 INSTRUCTION SET REFERENCE, V-Z VFNMSUB213PS DEST, SRC2, SRC3 (VEX encoded version) IF (VEX.128) THEN MAXNUM 2 ELSEIF (VEX.256) MAXNUM 4 FI For i = 0 to MAXNUM-1 { n 32*i; DEST[n+31:n] RoundFPControl_MXCSR( - (SRC2[n+31:n]*DEST[n+31:n]) - SRC3[n+31:n]) } IF (VEX.128) THEN DEST[MAX_VL-1:128] 0 ELSEIF (VEX.256) DEST[MAX_VL-1:256] 0 FI VFNMSUB231PS DEST, SRC2, SRC3 (VEX encoded version) IF (VEX.128) THEN MAXNUM 2 ELSEIF (VEX.256) MAXNUM 4 FI For i = 0 to MAXNUM-1 { n 32*i; DEST[n+31:n] RoundFPControl_MXCSR( - (SRC2[n+31:n]*SRC3[n+31:n]) - DEST[n+31:n]) } IF (VEX.128) THEN DEST[MAX_VL-1:128] 0 ELSEIF (VEX.256) DEST[MAX_VL-1:256] 0 FI VFNMSUB132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register) (KL, VL) = (4, 128), (8, 256), (16, 512) IF (VL = 512) AND (EVEX.b = 1) THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] RoundFPControl(-(DEST[i+31:i]*SRC3[i+31:i]) - SRC2[i+31:i]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-232 Vol. 2C VFNMSUB132PS/VFNMSUB213PS/VFNMSUB231PS—Fused Negative Multiply-Subtract of Packed Single-Precision Floating-Point Val- INSTRUCTION SET REFERENCE, V-Z VFNMSUB132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN IF (EVEX.b = 1) THEN DEST[i+31:i] RoundFPControl_MXCSR(-(DEST[i+31:i]*SRC3[31:0]) - SRC2[i+31:i]) ELSE DEST[i+31:i] RoundFPControl_MXCSR(-(DEST[i+31:i]*SRC3[i+31:i]) - SRC2[i+31:i]) FI; ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFNMSUB213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register) (KL, VL) = (4, 128), (8, 256), (16, 512) IF (VL = 512) AND (EVEX.b = 1) THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] RoundFPControl_MXCSR(-(SRC2[i+31:i]*DEST[i+31:i]) - SRC3[i+31:i]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFNMSUB132PS/VFNMSUB213PS/VFNMSUB231PS—Fused Negative Multiply-Subtract of Packed Single-Precision Floating-Point Val- Vol. 
2C 5-233 INSTRUCTION SET REFERENCE, V-Z VFNMSUB213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN IF (EVEX.b = 1) THEN DEST[i+31:i] RoundFPControl_MXCSR(-(SRC2[i+31:i]*DEST[i+31:i]) - SRC3[31:0]) ELSE DEST[i+31:i] RoundFPControl_MXCSR(-(SRC2[i+31:i]*DEST[i+31:i]) - SRC3[i+31:i]) FI; ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFNMSUB231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register) (KL, VL) = (4, 128), (8, 256), (16, 512) IF (VL = 512) AND (EVEX.b = 1) THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] RoundFPControl_MXCSR(-(SRC2[i+31:i]*SRC3[i+31:i]) - DEST[i+31:i]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VFNMSUB231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* 5-234 Vol. 2C VFNMSUB132PS/VFNMSUB213PS/VFNMSUB231PS—Fused Negative Multiply-Subtract of Packed Single-Precision Floating-Point Val- INSTRUCTION SET REFERENCE, V-Z THEN IF (EVEX.b = 1) THEN DEST[i+31:i] RoundFPControl_MXCSR(-(SRC2[i+31:i]*SRC3[31:0]) - DEST[i+31:i]) ELSE DEST[i+31:i] RoundFPControl_MXCSR(-(SRC2[i+31:i]*SRC3[i+31:i]) - DEST[i+31:i]) FI; ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 Intel C/C++ Compiler Intrinsic Equivalent VFNMSUBxxxPS __m512 _mm512_fnmsub_ps(__m512 a, __m512 b, __m512 c); VFNMSUBxxxPS __m512 _mm512_fnmsub_round_ps(__m512 a, __m512 b, __m512 c, int r); VFNMSUBxxxPS __m512 _mm512_mask_fnmsub_ps(__m512 a, __mmask16 k, __m512 b, __m512 c); VFNMSUBxxxPS __m512 _mm512_maskz_fnmsub_ps(__mmask16 k, __m512 a, __m512 b, __m512 c); VFNMSUBxxxPS __m512 _mm512_mask3_fnmsub_ps(__m512 a, __m512 b, __m512 c, __mmask16 k); VFNMSUBxxxPS __m512 _mm512_mask_fnmsub_round_ps(__m512 a, __mmask16 k, __m512 b, __m512 c, int r); VFNMSUBxxxPS __m512 _mm512_maskz_fnmsub_round_ps(__mmask16 k, __m512 a, __m512 b, __m512 c, int r); VFNMSUBxxxPS __m512 _mm512_mask3_fnmsub_round_ps(__m512 a, __m512 b, __m512 c, __mmask16 k, int r); VFNMSUBxxxPS __m256 _mm256_mask_fnmsub_ps(__m256 a, __mmask8 k, __m256 b, __m256 c); VFNMSUBxxxPS __m256 _mm256_maskz_fnmsub_ps(__mmask8 k, __m256 a, __m256 b, __m256 c); VFNMSUBxxxPS __m256 _mm256_mask3_fnmsub_ps(__m256 a, __m256 b, __m256 c, __mmask8 k); VFNMSUBxxxPS __m128 _mm_mask_fnmsub_ps(__m128 a, __mmask8 k, __m128 b, __m128 c); VFNMSUBxxxPS __m128 _mm_maskz_fnmsub_ps(__mmask8 k, __m128 a, __m128 b, __m128 c); VFNMSUBxxxPS __m128 _mm_mask3_fnmsub_ps(__m128 a, __m128 b, __m128 c, __mmask8 k); VFNMSUBxxxPS __m128 _mm_fnmsub_ps (__m128 a, __m128 b, __m128 c); VFNMSUBxxxPS __m256 _mm256_fnmsub_ps (__m256 a, __m256 b, __m256 c); SIMD Floating-Point Exceptions Overflow, Underflow, Invalid, Precision, Denormal Other Exceptions VEX-encoded instructions, see Exceptions Type 2. EVEX-encoded instructions, see Exceptions Type E2. VFNMSUB132PS/VFNMSUB213PS/VFNMSUB231PS—Fused Negative Multiply-Subtract of Packed Single-Precision Floating-Point Val- Vol. 
2C 5-235 INSTRUCTION SET REFERENCE, V-Z VFNMSUB132SD/VFNMSUB213SD/VFNMSUB231SD—Fused Negative Multiply-Subtract of Scalar Double-Precision Floating-Point Values Opcode/ Instruction Op / En RVM 64/32 bit Mode Support V/V CPUID Feature Flag FMA VEX.DDS.LIG.66.0F38.W1 9F /r VFNMSUB132SD xmm1, xmm2, xmm3/m64 VEX.DDS.LIG.66.0F38.W1 AF /r VFNMSUB213SD xmm1, xmm2, xmm3/m64 VEX.DDS.LIG.66.0F38.W1 BF /r VFNMSUB231SD xmm1, xmm2, xmm3/m64 EVEX.DDS.LIG.66.0F38.W1 9F /r VFNMSUB132SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} EVEX.DDS.LIG.66.0F38.W1 AF /r VFNMSUB213SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} EVEX.DDS.LIG.66.0F38.W1 BF /r VFNMSUB231SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} RVM V/V FMA RVM V/V FMA T1S V/V AVX512F T1S V/V AVX512F T1S V/V AVX512F Description Multiply scalar double-precision floating-point value from xmm1 and xmm3/mem, negate the multiplication result and subtract xmm2 and put result in xmm1. Multiply scalar double-precision floating-point value from xmm1 and xmm2, negate the multiplication result and subtract xmm3/mem and put result in xmm1. Multiply scalar double-precision floating-point value from xmm2 and xmm3/mem, negate the multiplication result and subtract xmm1 and put result in xmm1. Multiply scalar double-precision floating-point value from xmm1 and xmm3/m64, negate the multiplication result and subtract xmm2 and put result in xmm1. Multiply scalar double-precision floating-point value from xmm1 and xmm2, negate the multiplication result and subtract xmm3/m64 and put result in xmm1. Multiply scalar double-precision floating-point value from xmm2 and xmm3/m64, negate the multiplication result and subtract xmm1 and put result in xmm1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RVM ModRM:reg (r, w) VEX.vvvv (r) ModRM:r/m (r) NA T1S ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) NA Description VFNMSUB132SD: Multiplies the low packed double-precision floating-point value from the first source operand to the low packed double-precision floating-point value in the third source operand. From negated infinite precision intermediate result, subtracts the low double-precision floating-point value in the second source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand). VFNMSUB213SD: Multiplies the low packed double-precision floating-point value from the second source operand to the low packed double-precision floating-point value in the first source operand. From negated infinite precision intermediate result, subtracts the low double-precision floating-point value in the third source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand). VFNMSUB231SD: Multiplies the low packed double-precision floating-point value from the second source to the low packed double-precision floating-point value in the third source operand. From negated infinite precision intermediate result, subtracts the low double-precision floating-point value in the first source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand). VEX.128 and EVEX encoded version: The destination operand (also first source operand) is encoded in reg_field. The second source operand is encoded in VEX.vvvv/EVEX.vvvv. The third source operand is encoded in rm_field. Bits 127:64 of the destination are unchanged. 
Bits MAXVL-1:128 of the destination register are zeroed. 5-236 Vol. 2C VFNMSUB132SD/VFNMSUB213SD/VFNMSUB231SD—Fused Negative Multiply-Subtract of Scalar Double-Precision Floating-Point Val- INSTRUCTION SET REFERENCE, V-Z EVEX encoded version: The low quadword element of the destination is updated according to the writemask. Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NANs are governed by the definition of the instruction mnemonic defined in the opcode/instruction column. Operation In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no rounding). VFNMSUB132SD DEST, SRC2, SRC3 (EVEX encoded version) IF (EVEX.b = 1) and SRC3 *is a register* THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; IF k1[0] or *no writemask* THEN DEST[63:0] RoundFPControl(-(DEST[63:0]*SRC3[63:0]) - SRC2[63:0]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[63:0] remains unchanged* ELSE ; zeroing-masking THEN DEST[63:0] 0 FI; FI; DEST[127:64] DEST[127:64] DEST[MAX_VL-1:128] 0 VFNMSUB213SD DEST, SRC2, SRC3 (EVEX encoded version) IF (EVEX.b = 1) and SRC3 *is a register* THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; IF k1[0] or *no writemask* THEN DEST[63:0] RoundFPControl(-(SRC2[63:0]*DEST[63:0]) - SRC3[63:0]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[63:0] remains unchanged* ELSE ; zeroing-masking THEN DEST[63:0] 0 FI; FI; DEST[127:64] DEST[127:64] DEST[MAX_VL-1:128] 0 VFNMSUB132SD/VFNMSUB213SD/VFNMSUB231SD—Fused Negative Multiply-Subtract of Scalar Double-Precision Floating-Point Val- Vol. 2C 5-237 INSTRUCTION SET REFERENCE, V-Z VFNMSUB231SD DEST, SRC2, SRC3 (EVEX encoded version) IF (EVEX.b = 1) and SRC3 *is a register* THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; IF k1[0] or *no writemask* THEN DEST[63:0] RoundFPControl(-(SRC2[63:0]*SRC3[63:0]) - DEST[63:0]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[63:0] remains unchanged* ELSE ; zeroing-masking THEN DEST[63:0] 0 FI; FI; DEST[127:64] DEST[127:64] DEST[MAX_VL-1:128] 0 VFNMSUB132SD DEST, SRC2, SRC3 (VEX encoded version) DEST[63:0] RoundFPControl_MXCSR(- (DEST[63:0]*SRC3[63:0]) - SRC2[63:0]) DEST[127:64] DEST[127:64] DEST[MAX_VL-1:128] 0 VFNMSUB213SD DEST, SRC2, SRC3 (VEX encoded version) DEST[63:0] RoundFPControl_MXCSR(- (SRC2[63:0]*DEST[63:0]) - SRC3[63:0]) DEST[127:64] DEST[127:64] DEST[MAX_VL-1:128] 0 VFNMSUB231SD DEST, SRC2, SRC3 (VEX encoded version) DEST[63:0] RoundFPControl_MXCSR(- (SRC2[63:0]*SRC3[63:0]) - DEST[63:0]) DEST[127:64] DEST[127:64] DEST[MAX_VL-1:128] 0 Intel C/C++ Compiler Intrinsic Equivalent VFNMSUBxxxSD __m128d _mm_fnmsub_round_sd(__m128d a, __m128d b, __m128d c, int r); VFNMSUBxxxSD __m128d _mm_mask_fnmsub_sd(__m128d a, __mmask8 k, __m128d b, __m128d c); VFNMSUBxxxSD __m128d _mm_maskz_fnmsub_sd(__mmask8 k, __m128d a, __m128d b, __m128d c); VFNMSUBxxxSD __m128d _mm_mask3_fnmsub_sd(__m128d a, __m128d b, __m128d c, __mmask8 k); VFNMSUBxxxSD __m128d _mm_mask_fnmsub_round_sd(__m128d a, __mmask8 k, __m128d b, __m128d c, int r); VFNMSUBxxxSD __m128d _mm_maskz_fnmsub_round_sd(__mmask8 k, __m128d a, __m128d b, __m128d c, int r); VFNMSUBxxxSD __m128d _mm_mask3_fnmsub_round_sd(__m128d a, __m128d b, __m128d c, __mmask8 k, int r); VFNMSUBxxxSD __m128d _mm_fnmsub_sd (__m128d a, __m128d b, __m128d c); SIMD Floating-Point Exceptions Overflow, Underflow, Invalid, 
Precision, Denormal

Other Exceptions
VEX-encoded instructions, see Exceptions Type 3.
EVEX-encoded instructions, see Exceptions Type E3.

VFNMSUB132SS/VFNMSUB213SS/VFNMSUB231SS—Fused Negative Multiply-Subtract of Scalar Single-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.DDS.LIG.66.0F38.W0 9F /r VFNMSUB132SS xmm1, xmm2, xmm3/m32 | RVM | V/V | FMA | Multiply scalar single-precision floating-point value from xmm1 and xmm3/m32, negate the multiplication result and subtract xmm2 and put result in xmm1.
VEX.DDS.LIG.66.0F38.W0 AF /r VFNMSUB213SS xmm1, xmm2, xmm3/m32 | RVM | V/V | FMA | Multiply scalar single-precision floating-point value from xmm1 and xmm2, negate the multiplication result and subtract xmm3/m32 and put result in xmm1.
VEX.DDS.LIG.66.0F38.W0 BF /r VFNMSUB231SS xmm1, xmm2, xmm3/m32 | RVM | V/V | FMA | Multiply scalar single-precision floating-point value from xmm2 and xmm3/m32, negate the multiplication result and subtract xmm1 and put result in xmm1.
EVEX.DDS.LIG.66.0F38.W0 9F /r VFNMSUB132SS xmm1 {k1}{z}, xmm2, xmm3/m32{er} | T1S | V/V | AVX512F | Multiply scalar single-precision floating-point value from xmm1 and xmm3/m32, negate the multiplication result and subtract xmm2 and put result in xmm1.
EVEX.DDS.LIG.66.0F38.W0 AF /r VFNMSUB213SS xmm1 {k1}{z}, xmm2, xmm3/m32{er} | T1S | V/V | AVX512F | Multiply scalar single-precision floating-point value from xmm1 and xmm2, negate the multiplication result and subtract xmm3/m32 and put result in xmm1.
EVEX.DDS.LIG.66.0F38.W0 BF /r VFNMSUB231SS xmm1 {k1}{z}, xmm2, xmm3/m32{er} | T1S | V/V | AVX512F | Multiply scalar single-precision floating-point value from xmm2 and xmm3/m32, negate the multiplication result and subtract xmm1 and put result in xmm1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RVM | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
T1S | ModRM:reg (r, w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description
VFNMSUB132SS: Multiplies the low packed single-precision floating-point value from the first source operand to the low packed single-precision floating-point value in the third source operand. From the negated infinite precision intermediate result, subtracts the low single-precision floating-point value in the second source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).

VFNMSUB213SS: Multiplies the low packed single-precision floating-point value from the second source operand to the low packed single-precision floating-point value in the first source operand. From the negated infinite precision intermediate result, subtracts the low single-precision floating-point value in the third source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).

VFNMSUB231SS: Multiplies the low packed single-precision floating-point value from the second source operand to the low packed single-precision floating-point value in the third source operand. From the negated infinite precision intermediate result, subtracts the low single-precision floating-point value in the first source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).
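The {er} encodings in the table above take their rounding mode from EVEX.RC rather than MXCSR.RM (the SET_RM(EVEX.RC) path in the Operation section that follows). A minimal sketch of the effect (assuming a compiler invoked with -mavx512f; -(1.0*0.1f) - 0.3f is not exactly representable in single precision, so the two directed roundings differ in the last ulp):

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m128 a = _mm_set_ss(1.0f), b = _mm_set_ss(0.1f), c = _mm_set_ss(0.3f);
        /* Static rounding overrides MXCSR.RM for this instruction only;
           _MM_FROUND_NO_EXC is required alongside the direction flag. */
        __m128 up = _mm_fnmsub_round_ss(a, b, c, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC);
        __m128 dn = _mm_fnmsub_round_ss(a, b, c, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
        printf("%.9f\n%.9f\n", _mm_cvtss_f32(up), _mm_cvtss_f32(dn));
        return 0;
    }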
VEX.128 and EVEX encoded version: The destination operand (also first source operand) is encoded in reg_field. The second source operand is encoded in VEX.vvvv/EVEX.vvvv. The third source operand is encoded in rm_field. Bits 127:32 of the destination are unchanged. Bits MAXVL-1:128 of the destination register are zeroed. EVEX encoded version: The low doubleword element of the destination is updated according to the writemask. Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NANs are governed by the definition of the instruction mnemonic defined in the opcode/instruction column. VFNMSUB132SS/VFNMSUB213SS/VFNMSUB231SS—Fused Negative Multiply-Subtract of Scalar Single-Precision Floating-Point Val- Vol. 2C 5-239 INSTRUCTION SET REFERENCE, V-Z Operation In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no rounding). VFNMSUB132SS DEST, SRC2, SRC3 (EVEX encoded version) IF (EVEX.b = 1) and SRC3 *is a register* THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; IF k1[0] or *no writemask* THEN DEST[31:0] RoundFPControl(-(DEST[31:0]*SRC3[31:0]) - SRC2[31:0]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[31:0] remains unchanged* ELSE ; zeroing-masking THEN DEST[31:0] 0 FI; FI; DEST[127:32] DEST[127:32] DEST[MAX_VL-1:128] 0 VFNMSUB213SS DEST, SRC2, SRC3 (EVEX encoded version) IF (EVEX.b = 1) and SRC3 *is a register* THEN SET_RM(EVEX.RC); ELSE SET_RM(MXCSR.RM); FI; IF k1[0] or *no writemask* THEN DEST[31:0] RoundFPControl(-(SRC2[31:0]*DEST[31:0]) - SRC3[31:0]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[31:0] remains unchanged* ELSE ; zeroing-masking THEN DEST[31:0] 0 FI; FI; DEST[127:32] DEST[127:32] DEST[MAX_VL-1:128] 0 5-240 Vol. 
2C

VFNMSUB231SS DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← RoundFPControl(-(SRC2[31:0]*SRC3[31:0]) - DEST[31:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                THEN DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

VFNMSUB132SS DEST, SRC2, SRC3 (VEX encoded version)
DEST[31:0] ← RoundFPControl_MXCSR(-(DEST[31:0]*SRC3[31:0]) - SRC2[31:0])
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

VFNMSUB213SS DEST, SRC2, SRC3 (VEX encoded version)
DEST[31:0] ← RoundFPControl_MXCSR(-(SRC2[31:0]*DEST[31:0]) - SRC3[31:0])
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

VFNMSUB231SS DEST, SRC2, SRC3 (VEX encoded version)
DEST[31:0] ← RoundFPControl_MXCSR(-(SRC2[31:0]*SRC3[31:0]) - DEST[31:0])
DEST[127:32] ← DEST[127:32]
DEST[MAX_VL-1:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VFNMSUBxxxSS __m128 _mm_fnmsub_round_ss(__m128 a, __m128 b, __m128 c, int r);
VFNMSUBxxxSS __m128 _mm_mask_fnmsub_ss(__m128 a, __mmask8 k, __m128 b, __m128 c);
VFNMSUBxxxSS __m128 _mm_maskz_fnmsub_ss(__mmask8 k, __m128 a, __m128 b, __m128 c);
VFNMSUBxxxSS __m128 _mm_mask3_fnmsub_ss(__m128 a, __m128 b, __m128 c, __mmask8 k);
VFNMSUBxxxSS __m128 _mm_mask_fnmsub_round_ss(__m128 a, __mmask8 k, __m128 b, __m128 c, int r);
VFNMSUBxxxSS __m128 _mm_maskz_fnmsub_round_ss(__mmask8 k, __m128 a, __m128 b, __m128 c, int r);
VFNMSUBxxxSS __m128 _mm_mask3_fnmsub_round_ss(__m128 a, __m128 b, __m128 c, __mmask8 k, int r);
VFNMSUBxxxSS __m128 _mm_fnmsub_ss(__m128 a, __m128 b, __m128 c);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
VEX-encoded instructions, see Exceptions Type 3.
EVEX-encoded instructions, see Exceptions Type E3.

VFPCLASSPD—Tests Types Of a Packed Float64 Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F3A.W1 66 /r ib VFPCLASSPD k2 {k1}, xmm2/m128/m64bcst, imm8 | FV | V/V | AVX512VL AVX512DQ | Tests the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result.
EVEX.256.66.0F3A.W1 66 /r ib VFPCLASSPD k2 {k1}, ymm2/m256/m64bcst, imm8 | FV | V/V | AVX512VL AVX512DQ | Tests the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result.
EVEX.512.66.0F3A.W1 66 /r ib VFPCLASSPD k2 {k1}, zmm2/m512/m64bcst, imm8 | FV | V/V | AVX512DQ | Tests the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result.
Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
The FPCLASSPD instruction checks the packed double precision floating point values for special categories, specified by the set bits in the imm8 byte. Each set bit in imm8 specifies a category of floating-point values that the input data element is classified against. The classified results of all specified categories of an input value are ORed together to form the final boolean result for the input element. The result of each element is written to the corresponding bit in a mask register k2 according to the writemask k1. Bits [MAX_KL-1:8/4/2] of the destination are cleared.

The classification categories specified by imm8 are shown in Figure 5-13. The classification test for each category is listed in Table 5-6.

Figure 5-13. Imm8 Byte Specifier of Special Case FP Values for VFPCLASSPD/SD/PS/SS (bit 7: SNaN; bit 6: Neg. Finite; bit 5: Denormal; bit 4: Neg. INF; bit 3: +INF; bit 2: Neg. 0; bit 1: +0; bit 0: QNaN)

Table 5-6. Classifier Operations for VFPCLASSPD/SD/PS/SS
Bits | Category | Classifier
Imm8[0] | QNAN | Checks for QNaN
Imm8[1] | PosZero | Checks for +0
Imm8[2] | NegZero | Checks for -0
Imm8[3] | PosINF | Checks for +INF
Imm8[4] | NegINF | Checks for -INF
Imm8[5] | Denormal | Checks for Denormal
Imm8[6] | Negative | Checks for Negative finite
Imm8[7] | SNAN | Checks for SNaN

The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-bit memory location.

EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Operation
CheckFPClassDP (tsrc[63:0], imm8[7:0]){
    //* Start checking the source operand for special type *//
    NegNum ← tsrc[63];
    IF (tsrc[62:52]=07FFh) Then ExpAllOnes ← 1; FI;
    IF (tsrc[62:52]=0h) Then ExpAllZeros ← 1;
    IF (ExpAllZeros AND MXCSR.DAZ) Then MantAllZeros ← 1;
    ELSIF (tsrc[51:0]=0h) Then MantAllZeros ← 1;
    FI;
    ZeroNumber ← ExpAllZeros AND MantAllZeros
    SignalingBit ← tsrc[51];
    sNaN_res ← ExpAllOnes AND NOT(MantAllZeros) AND NOT(SignalingBit); // sNaN
    qNaN_res ← ExpAllOnes AND NOT(MantAllZeros) AND SignalingBit; // qNaN
    Pzero_res ← NOT(NegNum) AND ExpAllZeros AND MantAllZeros; // +0
    Nzero_res ← NegNum AND ExpAllZeros AND MantAllZeros; // -0
    PInf_res ← NOT(NegNum) AND ExpAllOnes AND MantAllZeros; // +Inf
    NInf_res ← NegNum AND ExpAllOnes AND MantAllZeros; // -Inf
    Denorm_res ← ExpAllZeros AND NOT(MantAllZeros); // denorm
    FinNeg_res ← NegNum AND NOT(ExpAllOnes) AND NOT(ZeroNumber); // -finite
    bResult = ( imm8[0] AND qNaN_res ) OR ( imm8[1] AND Pzero_res ) OR
              ( imm8[2] AND Nzero_res ) OR ( imm8[3] AND PInf_res ) OR
              ( imm8[4] AND NInf_res ) OR ( imm8[5] AND Denorm_res ) OR
              ( imm8[6] AND FinNeg_res ) OR ( imm8[7] AND sNaN_res );
    Return bResult;
} //* end of CheckFPClassDP() *//

VFPCLASSPD (EVEX Encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b == 1) AND (SRC *is memory*)
                THEN DEST[j] ← CheckFPClassDP(SRC1[63:0], imm8[7:0]);
                ELSE DEST[j] ← CheckFPClassDP(SRC1[i+63:i], imm8[7:0]);
            FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0
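As a usage illustration of Table 5-6 (assuming an AVX512DQ-capable compiler invoked with -mavx512dq; the input values are arbitrary), setting imm8 = 0x81 (bit 0 QNaN, bit 7 SNaN) flags every NaN element without raising any floating-point exception:

    #include <immintrin.h>
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        __m512d v = _mm512_set_pd(1.0, -0.0, NAN, 3.0,
                                  INFINITY, -1.0, 0.0, -INFINITY);
        /* imm8 = 0x81 selects the QNaN and SNaN categories from Table 5-6;
           element 5 (the NAN above) sets its bit in the result mask. */
        __mmask8 nan_mask = _mm512_fpclass_pd_mask(v, 0x81);
        printf("NaN mask: 0x%02x\n", (unsigned)nan_mask); /* 0x20 */
        return 0;
    }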
2C 5-243 INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VFPCLASSPD __mmask8 _mm512_fpclass_pd_mask( __m512d a, int c); VFPCLASSPD __mmask8 _mm512_mask_fpclass_pd_mask( __mmask8 m, __m512d a, int c) VFPCLASSPD __mmask8 _mm256_fpclass_pd_mask( __m256d a, int c) VFPCLASSPD __mmask8 _mm256_mask_fpclass_pd_mask( __mmask8 m, __m256d a, int c) VFPCLASSPD __mmask8 _mm_fpclass_pd_mask( __m128d a, int c) VFPCLASSPD __mmask8 _mm_mask_fpclass_pd_mask( __mmask8 m, __m128d a, int c) SIMD Floating-Point Exceptions None Other Exceptions See Exceptions Type E4 #UD 5-244 Vol. 2C If EVEX.vvvv != 1111B. VFPCLASSPD—Tests Types Of a Packed Float64 Values INSTRUCTION SET REFERENCE, V-Z VFPCLASSPS—Tests Types Of a Packed Float32 Values Opcode/ Instruction Op / En FV 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512DQ EVEX.128.66.0F3A.W0 66 /r ib VFPCLASSPS k2 {k1}, xmm2/m128/m32bcst, imm8 EVEX.256.66.0F3A.W0 66 /r ib VFPCLASSPS k2 {k1}, ymm2/m256/m32bcst, imm8 FV V/V AVX512VL AVX512DQ EVEX.512.66.0F3A.W0 66 /r ib VFPCLASSPS k2 {k1}, zmm2/m512/m32bcst, imm8 FV V/V AVX512DQ Description Tests the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result. Tests the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result. Tests the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 FV ModRM:reg (w) ModRM:r/m (r) NA NA Description The FPCLASSPS instruction checks the packed single-precision floating point values for special categories, specified by the set bits in the imm8 byte. Each set bit in imm8 specifies a category of floating-point values that the input data element is classified against. The classified results of all specified categories of an input value are ORed together to form the final boolean result for the input element. The result of each element is written to the corresponding bit in a mask register k2 according to the writemask k1. Bits [MAX_KL-1:16/8/4] of the destination are cleared. The classification categories specified by imm8 are shown in Figure 5-13. The classification test for each category is listed in Table 5-6. The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32-bit memory location. EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD. 
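A short sketch in the same spirit for the packed single-precision form (assuming -mavx512dq, and that MXCSR.DAZ is at its default of 0 so denormal inputs are not already treated as zeros), using imm8 = 0x20 (bit 5, Denormal) to flag subnormal elements:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        /* 1e-40f is below FLT_MIN, hence subnormal in single precision. */
        __m512 v = _mm512_setr_ps(1e-40f, 1.0f, 0.0f, -1e-40f,
                                  2.0f, 3.0f, 4.0f, 5.0f,
                                  6.0f, 7.0f, 8.0f, 9.0f,
                                  10.0f, 11.0f, 12.0f, 13.0f);
        /* Elements 0 and 3 are subnormal, so bits 0 and 3 are set. */
        __mmask16 denorm = _mm512_fpclass_ps_mask(v, 0x20);
        printf("denormal mask: 0x%04x\n", (unsigned)denorm); /* 0x0009 */
        return 0;
    }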
Operation
CheckFPClassSP (tsrc[31:0], imm8[7:0]){
    //* Start checking the source operand for special type *//
    NegNum ← tsrc[31];
    IF (tsrc[30:23]=0FFh) Then ExpAllOnes ← 1; FI;
    IF (tsrc[30:23]=0h) Then ExpAllZeros ← 1;
    IF (ExpAllZeros AND MXCSR.DAZ) Then MantAllZeros ← 1;
    ELSIF (tsrc[22:0]=0h) Then MantAllZeros ← 1;
    FI;
    ZeroNumber ← ExpAllZeros AND MantAllZeros
    SignalingBit ← tsrc[22];
    sNaN_res ← ExpAllOnes AND NOT(MantAllZeros) AND NOT(SignalingBit); // sNaN
    qNaN_res ← ExpAllOnes AND NOT(MantAllZeros) AND SignalingBit; // qNaN
    Pzero_res ← NOT(NegNum) AND ExpAllZeros AND MantAllZeros; // +0
    Nzero_res ← NegNum AND ExpAllZeros AND MantAllZeros; // -0
    PInf_res ← NOT(NegNum) AND ExpAllOnes AND MantAllZeros; // +Inf
    NInf_res ← NegNum AND ExpAllOnes AND MantAllZeros; // -Inf
    Denorm_res ← ExpAllZeros AND NOT(MantAllZeros); // denorm
    FinNeg_res ← NegNum AND NOT(ExpAllOnes) AND NOT(ZeroNumber); // -finite
    bResult = ( imm8[0] AND qNaN_res ) OR ( imm8[1] AND Pzero_res ) OR
              ( imm8[2] AND Nzero_res ) OR ( imm8[3] AND PInf_res ) OR
              ( imm8[4] AND NInf_res ) OR ( imm8[5] AND Denorm_res ) OR
              ( imm8[6] AND FinNeg_res ) OR ( imm8[7] AND sNaN_res );
    Return bResult;
} //* end of CheckFPClassSP() *//

VFPCLASSPS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b == 1) AND (SRC *is memory*)
                THEN DEST[j] ← CheckFPClassSP(SRC1[31:0], imm8[7:0]);
                ELSE DEST[j] ← CheckFPClassSP(SRC1[i+31:i], imm8[7:0]);
            FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VFPCLASSPS __mmask16 _mm512_fpclass_ps_mask( __m512 a, int c);
VFPCLASSPS __mmask16 _mm512_mask_fpclass_ps_mask( __mmask16 m, __m512 a, int c)
VFPCLASSPS __mmask8 _mm256_fpclass_ps_mask( __m256 a, int c)
VFPCLASSPS __mmask8 _mm256_mask_fpclass_ps_mask( __mmask8 m, __m256 a, int c)
VFPCLASSPS __mmask8 _mm_fpclass_ps_mask( __m128 a, int c)
VFPCLASSPS __mmask8 _mm_mask_fpclass_ps_mask( __mmask8 m, __m128 a, int c)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type E4.
#UD If EVEX.vvvv != 1111B.

VFPCLASSSD—Tests Types Of a Scalar Float64 Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.LIG.66.0F3A.W1 67 /r ib VFPCLASSSD k2 {k1}, xmm2/m64, imm8 | T1S | V/V | AVX512DQ | Tests the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
T1S | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
The FPCLASSSD instruction checks the low double precision floating point value in the source operand for special categories, specified by the set bits in the imm8 byte. Each set bit in imm8 specifies a category of floating-point values that the input data element is classified against. The classified results of all specified categories of an input value are ORed together to form the final boolean result for the input element. The result is written to the low bit in a mask register k2 according to the writemask k1. Bits MAX_KL-1:1 of the destination are cleared.

The classification categories specified by imm8 are shown in Figure 5-13.
The classification test for each category is listed in Table 5-6. EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD. Operation CheckFPClassDP (tsrc[63:0], imm8[7:0]){ NegNum tsrc[63]; IF (tsrc[62:52]=07FFh) Then ExpAllOnes 1; FI; IF (tsrc[62:52]=0h) Then ExpAllZeros 1; IF (ExpAllZeros AND MXCSR.DAZ) Then MantAllZeros 1; ELSIF (tsrc[51:0]=0h) Then MantAllZeros 1; FI; ZeroNumber ExpAllZeros AND MantAllZeros SignalingBit tsrc[51]; sNaN_res ExpAllOnes AND NOT(MantAllZeros) AND NOT(SignalingBit) ; // sNaN qNaN_res ExpAllOnes AND NOT(MantAllZeros) AND SignalingBit;; // qNaN Pzero_res NOT(NegNum) AND ExpAllZeros AND MantAllZeros;; // +0 Nzero_res NegNum AND ExpAllZeros AND MantAllZeros;; // -0 PInf_res NOT(NegNum) AND ExpAllOnes AND MantAllZeros;; // +Inf NInf_res NegNum AND ExpAllOnes AND MantAllZeros;; // -Inf Denorm_res ExpAllZeros AND NOT(MantAllZeros);; // denorm FinNeg_res NegNum AND NOT(ExpAllOnes) AND NOT(ZeroNumber);; // -finite bResult = ( imm8[0] AND qNaN_res ) OR (imm8[1] AND Pzero_res ) OR ( imm8[2] AND Nzero_res ) OR ( imm8[3] AND PInf_res ) OR ( imm8[4] AND NInf_res ) OR ( imm8[5] AND Denorm_res ) OR ( imm8[6] AND FinNeg_res ) OR ( imm8[7] AND sNaN_res ) ; Return bResult; } //* end of CheckFPClassDP() *// VFPCLASSSD—Tests Types Of a Scalar Float64 Values Vol. 2C 5-247 INSTRUCTION SET REFERENCE, V-Z VFPCLASSSD (EVEX encoded version) IF k1[0] OR *no writemask* THEN DEST[0] CheckFPClassDP(SRC1[63:0], imm8[7:0]) ELSE DEST[0] 0 ; zeroing-masking only FI; DEST[MAX_KL-1:1] 0 Intel C/C++ Compiler Intrinsic Equivalent VFPCLASSSD __mmask8 _mm_fpclass_sd_mask( __m128d a, int c) VFPCLASSSD __mmask8 _mm_mask_fpclass_sd_mask( __mmask8 m, __m128d a, int c) SIMD Floating-Point Exceptions None Other Exceptions See Exceptions Type E6 #UD 5-248 Vol. 2C If EVEX.vvvv != 1111B. VFPCLASSSD—Tests Types Of a Scalar Float64 Values INSTRUCTION SET REFERENCE, V-Z VFPCLASSSS—Tests Types Of a Scalar Float32 Values Opcode/ Instruction Op / En EVEX.LIG.66.0F3A.W0 67 /r VFPCLASSSS k2 {k1}, xmm2/m32, imm8 T1S 64/32 bit Mode Support V/V CPUID Feature Flag AVX512DQ Description Tests the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 T1S ModRM:reg (w) ModRM:r/m (r) NA NA Description The FPCLASSSS instruction checks the low single-precision floating point value in the source operand for special categories, specified by the set bits in the imm8 byte. Each set bit in imm8 specifies a category of floating-point values that the input data element is classified against. The classified results of all specified categories of an input value are ORed together to form the final boolean result for the input element. The result is written to the low bit in a mask register k2 according to the writemask k1. Bits MAX_KL-1: 1 of the destination are cleared. The classification categories specified by imm8 are shown in Figure 5-13. The classification test for each category is listed in Table 5-6. EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD. VFPCLASSSS—Tests Types Of a Scalar Float32 Values Vol. 
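A minimal scalar sketch (assuming -mavx512dq; the inputs are arbitrary) using imm8 = 0x18 (bits 3 and 4: +INF and -INF) to test the low element for either infinity:

    #include <immintrin.h>
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        __m128 x = _mm_set_ss(INFINITY);
        __m128 y = _mm_set_ss(42.0f);
        /* Bit 0 of the result mask is 1 iff the low element matches one
           of the selected categories; no FP exception is raised. */
        unsigned inf_x = _mm_fpclass_ss_mask(x, 0x18) & 1;
        unsigned inf_y = _mm_fpclass_ss_mask(y, 0x18) & 1;
        printf("x: %u, y: %u\n", inf_x, inf_y); /* x: 1, y: 0 */
        return 0;
    }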
VFPCLASSSS—Tests Types of a Scalar Float32 Value

Opcode/Instruction: EVEX.LIG.66.0F3A.W0 67 /r VFPCLASSSS k2 {k1}, xmm2/m32, imm8. Op/En: T1S. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512DQ. Description: Tests the input for the following categories: NaN, +0, -0, +Infinity, -Infinity, denormal, finite negative. The immediate field provides a mask bit for each of these category tests. The masked test results are OR-ed together to form a mask result.

Instruction Operand Encoding
Op/En T1S: Operand 1: ModRM:reg (w); Operand 2: ModRM:r/m (r); Operand 3: NA; Operand 4: NA

Description
The VFPCLASSSS instruction checks the low single-precision floating-point value in the source operand for special categories specified by the set bits in the imm8 byte. Each set bit in imm8 specifies a category of floating-point values that the input data element is classified against. The classified results of all specified categories of an input value are ORed together to form the final boolean result for the input element. The result is written to the low bit of mask register k2 according to the writemask k1. Bits MAX_KL-1:1 of the destination are cleared. The classification categories specified by imm8 are shown in Figure 5-13. The classification test for each category is listed in Table 5-6. EVEX.vvvv is reserved and must be 1111b; otherwise the instruction will #UD.

Operation
CheckFPClassSP (tsrc[31:0], imm8[7:0]) is defined above in the Operation section of VFPCLASSPS; the low single-precision element is classified with the same checks.

VFPCLASSSS (EVEX encoded version)
IF k1[0] OR *no writemask*
  THEN DEST[0] ← CheckFPClassSP(SRC1[31:0], imm8[7:0])
  ELSE DEST[0] ← 0 ; zeroing-masking only
FI;
DEST[MAX_KL-1:1] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VFPCLASSSS __mmask8 _mm_fpclass_ss_mask( __m128 a, int c);
VFPCLASSSS __mmask8 _mm_mask_fpclass_ss_mask( __mmask8 m, __m128 a, int c);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type E6.
#UD If EVEX.vvvv != 1111B.

VGATHERDPD/VGATHERQPD—Gather Packed DP FP Values Using Signed Dword/Qword Indices

Opcode/Instruction (all rows: Op/En RMV, 64/32-bit Mode V/V, CPUID AVX2):
VEX.DDS.128.66.0F38.W1 92 /r VGATHERDPD xmm1, vm32x, xmm2: Using dword indices specified in vm32x, gather double-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VEX.DDS.128.66.0F38.W1 93 /r VGATHERQPD xmm1, vm64x, xmm2: Using qword indices specified in vm64x, gather double-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VEX.DDS.256.66.0F38.W1 92 /r VGATHERDPD ymm1, vm32x, ymm2: Using dword indices specified in vm32x, gather double-precision FP values from memory conditioned on mask specified by ymm2. Conditionally gathered elements are merged into ymm1.
VEX.DDS.256.66.0F38.W1 93 /r VGATHERQPD ymm1, vm64y, ymm2: Using qword indices specified in vm64y, gather double-precision FP values from memory conditioned on mask specified by ymm2. Conditionally gathered elements are merged into ymm1.

Instruction Operand Encoding
Op/En RMV: Operand 1: ModRM:reg (r,w); Operand 2: BaseReg (R): VSIB:base, VectorReg(R): VSIB:index; Operand 3: VEX.vvvv (r, w); Operand 4: NA

Description
The instruction conditionally loads up to 2 or 4 double-precision floating-point values from memory addresses specified by the memory operand (the second operand), using dword or qword indices. The memory operand uses the VSIB form of the SIB byte to specify a general purpose register operand as the common base, a vector register for an array of indices relative to the base, and a constant scale factor.
The mask operand (the third operand) specifies the conditional load operation from each memory address and the corresponding update of each data element of the destination operand (the first operand). Conditionality is specified by the most significant bit of each data element of the mask register. If an element’s mask bit is not set, the corresponding element of the destination register is left unchanged. The width of data element in the destination register and mask register are identical. The entire mask register will be set to zero by this instruction unless the instruction causes an exception.

Using dword indices in the lower half of the mask register, the instruction conditionally loads up to 2 or 4 double-precision floating-point values from the VSIB addressing memory operand, and updates the destination register.

This instruction can be suspended by an exception if at least one element is already gathered (i.e., if the exception is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination register and the mask operand are partially updated; those elements that have been gathered are placed into the destination register and have their mask bits set to zero. If any traps or interrupts are pending from already gathered elements, they will be delivered in lieu of the exception; in this case, EFLAG.RF is set to one so an instruction breakpoint is not re-triggered when the instruction is continued.

If the data size and index size are different, part of the destination register and part of the mask register do not correspond to any elements being gathered. This instruction sets those parts to zero. It may do this to one or both of those registers even if the instruction triggers an exception, and even if the instruction triggers the exception before gathering any elements.

VEX.128 version: The instruction will gather two double-precision floating-point values. For dword indices, only the lower two indices in the vector index register are used.
VEX.256 version: The instruction will gather four double-precision floating-point values. For dword indices, only the lower four indices in the vector index register are used.

Note that:
• If any pair of the index, mask, or destination registers are the same, this instruction results in a #UD fault.
• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all elements closer to the LSB of the destination will be completed (and non-faulting). Individual elements closer to the MSB may or may not be completed. If a given element triggers multiple faults, they are delivered in the conventional order.
• Elements may be gathered in any order, but faults must be delivered in a right-to-left order; thus, elements to the left of a faulting one may be gathered before the fault is delivered. A given implementation of this instruction is repeatable: given the same input values and architectural state, the same set of elements to the left of the faulting one will be gathered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are ignored.
• The values may be read from memory in any order. Memory ordering with other instructions follows the Intel 64 memory-ordering model.
• This instruction will cause a #UD if the address size attribute is 16-bit.
• This instruction will cause a #UD if the memory operand is encoded without the SIB byte.
• This instruction should not be used to access memory mapped I/O as the ordering of the individual loads it does is implementation specific, and some implementations may use loads larger than the data element size or load elements an indeterminate number of times.

Operation
DEST ← SRC1;
BASE_ADDR: base register encoded in VSIB addressing;
VINDEX: the vector index register encoded by VSIB addressing;
SCALE: scale factor encoded by SIB:[7:6];
DISP: optional 1, 4 byte displacement;
MASK ← SRC3;

VGATHERDPD (VEX.128 version)
FOR j ← 0 to 1
  i ← j * 64;
  IF MASK[63+i] THEN
    MASK[i+63:i] ← FFFFFFFF_FFFFFFFFH; // extend from most significant bit
  ELSE
    MASK[i+63:i] ← 0;
  FI;
ENDFOR
FOR j ← 0 to 1
  k ← j * 32;
  i ← j * 64;
  DATA_ADDR ← BASE_ADDR + SignExtend(VINDEX[k+31:k])*SCALE + DISP;
  IF MASK[63+i] THEN
    DEST[i+63:i] ← FETCH_64BITS(DATA_ADDR); // a fault exits the instruction
  FI;
  MASK[i+63:i] ← 0;
ENDFOR
MASK[VLMAX-1:128] ← 0;
DEST[VLMAX-1:128] ← 0;
(non-masked elements of the mask register have the content of respective element cleared)

VGATHERQPD (VEX.128 version)
FOR j ← 0 to 1
  i ← j * 64;
  IF MASK[63+i] THEN
    MASK[i+63:i] ← FFFFFFFF_FFFFFFFFH; // extend from most significant bit
  ELSE
    MASK[i+63:i] ← 0;
  FI;
ENDFOR
FOR j ← 0 to 1
  i ← j * 64;
  DATA_ADDR ← BASE_ADDR + SignExtend(VINDEX[i+63:i])*SCALE + DISP;
  IF MASK[63+i] THEN
    DEST[i+63:i] ← FETCH_64BITS(DATA_ADDR); // a fault exits this instruction
  FI;
  MASK[i+63:i] ← 0;
ENDFOR
MASK[VLMAX-1:128] ← 0;
DEST[VLMAX-1:128] ← 0;
(non-masked elements of the mask register have the content of respective element cleared)

VGATHERQPD (VEX.256 version)
FOR j ← 0 to 3
  i ← j * 64;
  IF MASK[63+i] THEN
    MASK[i+63:i] ← FFFFFFFF_FFFFFFFFH; // extend from most significant bit
  ELSE
    MASK[i+63:i] ← 0;
  FI;
ENDFOR
FOR j ← 0 to 3
  i ← j * 64;
  DATA_ADDR ← BASE_ADDR + SignExtend(VINDEX[i+63:i])*SCALE + DISP;
  IF MASK[63+i] THEN
    DEST[i+63:i] ← FETCH_64BITS(DATA_ADDR); // a fault exits the instruction
  FI;
  MASK[i+63:i] ← 0;
ENDFOR
(non-masked elements of the mask register have the content of respective element cleared)

VGATHERDPD (VEX.256 version)
FOR j ← 0 to 3
  i ← j * 64;
  IF MASK[63+i] THEN
    MASK[i+63:i] ← FFFFFFFF_FFFFFFFFH; // extend from most significant bit
  ELSE
    MASK[i+63:i] ← 0;
  FI;
ENDFOR
FOR j ← 0 to 3
  k ← j * 32;
  i ← j * 64;
  DATA_ADDR ← BASE_ADDR + SignExtend(VINDEX[k+31:k])*SCALE + DISP;
  IF MASK[63+i] THEN
    DEST[i+63:i] ← FETCH_64BITS(DATA_ADDR); // a fault exits the instruction
  FI;
  MASK[i+63:i] ← 0;
ENDFOR
(non-masked elements of the mask register have the content of respective element cleared)
Intel C/C++ Compiler Intrinsic Equivalent
VGATHERDPD: __m128d _mm_i32gather_pd (double const * base, __m128i index, const int scale);
VGATHERDPD: __m128d _mm_mask_i32gather_pd (__m128d src, double const * base, __m128i index, __m128d mask, const int scale);
VGATHERDPD: __m256d _mm256_i32gather_pd (double const * base, __m128i index, const int scale);
VGATHERDPD: __m256d _mm256_mask_i32gather_pd (__m256d src, double const * base, __m128i index, __m256d mask, const int scale);
VGATHERQPD: __m128d _mm_i64gather_pd (double const * base, __m128i index, const int scale);
VGATHERQPD: __m128d _mm_mask_i64gather_pd (__m128d src, double const * base, __m128i index, __m128d mask, const int scale);
VGATHERQPD: __m256d _mm256_i64gather_pd (double const * base, __m256i index, const int scale);
VGATHERQPD: __m256d _mm256_mask_i64gather_pd (__m256d src, double const * base, __m256i index, __m256d mask, const int scale);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 12.
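As an illustrative sketch of the unmasked intrinsic form (assuming an AVX2-capable processor and a build with -mavx2; the table and indices are arbitrary example data), scale is sizeof(double) = 8, so each dword index selects one table element:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    double table[8] = { 10, 11, 12, 13, 14, 15, 16, 17 };
    __m128i idx = _mm_set_epi32(6, 4, 2, 0);   /* four dword indices */
    /* Gathers table[0], table[2], table[4], table[6] with scale 8. */
    __m256d v = _mm256_i32gather_pd(table, idx, 8);
    double out[4];
    _mm256_storeu_pd(out, v);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}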
VGATHERDPS/VGATHERQPS—Gather Packed SP FP Values Using Signed Dword/Qword Indices

Opcode/Instruction (all rows: Op/En RMV, 64/32-bit Mode V/V, CPUID AVX2):
VEX.DDS.128.66.0F38.W0 92 /r VGATHERDPS xmm1, vm32x, xmm2: Using dword indices specified in vm32x, gather single-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VEX.DDS.128.66.0F38.W0 93 /r VGATHERQPS xmm1, vm64x, xmm2: Using qword indices specified in vm64x, gather single-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VEX.DDS.256.66.0F38.W0 92 /r VGATHERDPS ymm1, vm32y, ymm2: Using dword indices specified in vm32y, gather single-precision FP values from memory conditioned on mask specified by ymm2. Conditionally gathered elements are merged into ymm1.
VEX.DDS.256.66.0F38.W0 93 /r VGATHERQPS xmm1, vm64y, xmm2: Using qword indices specified in vm64y, gather single-precision FP values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.

Instruction Operand Encoding
Op/En RMV: Operand 1: ModRM:reg (r,w); Operand 2: BaseReg (R): VSIB:base, VectorReg(R): VSIB:index; Operand 3: VEX.vvvv (r, w); Operand 4: NA

Description
The instruction conditionally loads up to 4 or 8 single-precision floating-point values from memory addresses specified by the memory operand (the second operand), using dword or qword indices. The memory operand uses the VSIB form of the SIB byte to specify a general purpose register operand as the common base, a vector register for an array of indices relative to the base, and a constant scale factor.

The mask operand (the third operand) specifies the conditional load operation from each memory address and the corresponding update of each data element of the destination operand (the first operand). Conditionality is specified by the most significant bit of each data element of the mask register. If an element’s mask bit is not set, the corresponding element of the destination register is left unchanged. The width of data element in the destination register and mask register are identical. The entire mask register will be set to zero by this instruction unless the instruction causes an exception.

Using qword indices, the instruction conditionally loads up to 2 or 4 single-precision floating-point values from the VSIB addressing memory operand, and updates the lower half of the destination register. The upper 128 or 256 bits of the destination register are zeroed with qword indices.

This instruction can be suspended by an exception if at least one element is already gathered (i.e., if the exception is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination register and the mask operand are partially updated; those elements that have been gathered are placed into the destination register and have their mask bits set to zero. If any traps or interrupts are pending from already gathered elements, they will be delivered in lieu of the exception; in this case, EFLAG.RF is set to one so an instruction breakpoint is not re-triggered when the instruction is continued.

If the data size and index size are different, part of the destination register and part of the mask register do not correspond to any elements being gathered. This instruction sets those parts to zero. It may do this to one or both of those registers even if the instruction triggers an exception, and even if the instruction triggers the exception before gathering any elements.

VEX.128 version: For dword indices, the instruction will gather four single-precision floating-point values. For qword indices, the instruction will gather two values and zeroes the upper 64 bits of the destination.
VEX.256 version: For dword indices, the instruction will gather eight single-precision floating-point values. For qword indices, the instruction will gather four values and zeroes the upper 128 bits of the destination.

Note that:
• If any pair of the index, mask, or destination registers are the same, this instruction results in a #UD fault.
• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all elements closer to the LSB of the destination will be completed (and non-faulting). Individual elements closer to the MSB may or may not be completed. If a given element triggers multiple faults, they are delivered in the conventional order.
• Elements may be gathered in any order, but faults must be delivered in a right-to-left order; thus, elements to the left of a faulting one may be gathered before the fault is delivered. A given implementation of this instruction is repeatable: given the same input values and architectural state, the same set of elements to the left of the faulting one will be gathered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are ignored.
• The values may be read from memory in any order. Memory ordering with other instructions follows the Intel 64 memory-ordering model.
• This instruction will cause a #UD if the address size attribute is 16-bit.
• This instruction will cause a #UD if the memory operand is encoded without the SIB byte.
• This instruction should not be used to access memory mapped I/O as the ordering of the individual loads it does is implementation specific, and some implementations may use loads larger than the data element size or load elements an indeterminate number of times.

Operation
DEST ← SRC1;
BASE_ADDR: base register encoded in VSIB addressing;
VINDEX: the vector index register encoded by VSIB addressing;
SCALE: scale factor encoded by SIB:[7:6];
DISP: optional 1, 4 byte displacement;
MASK ← SRC3;

VGATHERDPS (VEX.128 version)
FOR j ← 0 to 3
  i ← j * 32;
  IF MASK[31+i] THEN
    MASK[i+31:i] ← FFFFFFFFH; // extend from most significant bit
  ELSE
    MASK[i+31:i] ← 0;
  FI;
ENDFOR
MASK[VLMAX-1:128] ← 0;
FOR j ← 0 to 3
  i ← j * 32;
  DATA_ADDR ← BASE_ADDR + SignExtend(VINDEX[i+31:i])*SCALE + DISP;
  IF MASK[31+i] THEN
    DEST[i+31:i] ← FETCH_32BITS(DATA_ADDR); // a fault exits the instruction
  FI;
  MASK[i+31:i] ← 0;
ENDFOR
DEST[VLMAX-1:128] ← 0;
(non-masked elements of the mask register have the content of respective element cleared)

VGATHERQPS (VEX.128 version)
FOR j ← 0 to 3
  i ← j * 32;
  IF MASK[31+i] THEN
    MASK[i+31:i] ← FFFFFFFFH; // extend from most significant bit
  ELSE
    MASK[i+31:i] ← 0;
  FI;
ENDFOR
MASK[VLMAX-1:128] ← 0;
FOR j ← 0 to 1
  k ← j * 64;
  i ← j * 32;
  DATA_ADDR ← BASE_ADDR + SignExtend(VINDEX[k+63:k])*SCALE + DISP;
  IF MASK[31+i] THEN
    DEST[i+31:i] ← FETCH_32BITS(DATA_ADDR); // a fault exits the instruction
  FI;
  MASK[i+31:i] ← 0;
ENDFOR
MASK[127:64] ← 0;
DEST[VLMAX-1:64] ← 0;
(non-masked elements of the mask register have the content of respective element cleared)

VGATHERDPS (VEX.256 version)
FOR j ← 0 to 7
  i ← j * 32;
  IF MASK[31+i] THEN
    MASK[i+31:i] ← FFFFFFFFH; // extend from most significant bit
  ELSE
    MASK[i+31:i] ← 0;
  FI;
ENDFOR
FOR j ← 0 to 7
  i ← j * 32;
  DATA_ADDR ← BASE_ADDR + SignExtend(VINDEX[i+31:i])*SCALE + DISP;
  IF MASK[31+i] THEN
    DEST[i+31:i] ← FETCH_32BITS(DATA_ADDR); // a fault exits the instruction
  FI;
  MASK[i+31:i] ← 0;
ENDFOR
(non-masked elements of the mask register have the content of respective element cleared)

VGATHERQPS (VEX.256 version)
FOR j ← 0 to 7
  i ← j * 32;
  IF MASK[31+i] THEN
    MASK[i+31:i] ← FFFFFFFFH; // extend from most significant bit
  ELSE
    MASK[i+31:i] ← 0;
  FI;
ENDFOR
FOR j ← 0 to 3
  k ← j * 64;
  i ← j * 32;
  DATA_ADDR ← BASE_ADDR + SignExtend(VINDEX[k+63:k])*SCALE + DISP;
  IF MASK[31+i] THEN
    DEST[i+31:i] ← FETCH_32BITS(DATA_ADDR); // a fault exits the instruction
  FI;
  MASK[i+31:i] ← 0;
ENDFOR
MASK[VLMAX-1:128] ← 0;
DEST[VLMAX-1:128] ← 0;
(non-masked elements of the mask register have the content of respective element cleared)
Intel C/C++ Compiler Intrinsic Equivalent
VGATHERDPS: __m128 _mm_i32gather_ps (float const * base, __m128i index, const int scale);
VGATHERDPS: __m128 _mm_mask_i32gather_ps (__m128 src, float const * base, __m128i index, __m128 mask, const int scale);
VGATHERDPS: __m256 _mm256_i32gather_ps (float const * base, __m256i index, const int scale);
VGATHERDPS: __m256 _mm256_mask_i32gather_ps (__m256 src, float const * base, __m256i index, __m256 mask, const int scale);
VGATHERQPS: __m128 _mm_i64gather_ps (float const * base, __m128i index, const int scale);
VGATHERQPS: __m128 _mm_mask_i64gather_ps (__m128 src, float const * base, __m128i index, __m128 mask, const int scale);
VGATHERQPS: __m128 _mm256_i64gather_ps (float const * base, __m256i index, const int scale);
VGATHERQPS: __m128 _mm256_mask_i64gather_ps (__m128 src, float const * base, __m256i index, __m128 mask, const int scale);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 12.
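The masked form merges src into lanes whose mask element has the sign bit clear. A sketch under the same AVX2 assumptions; the even-lane mask below is an arbitrary example:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float table[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
    __m256i idx  = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    __m256  src  = _mm256_set1_ps(-1.0f);  /* kept where the mask sign bit is 0 */
    /* Sign bit set in lanes 0, 2, 4, 6: only those lanes are gathered. */
    __m256  mask = _mm256_castsi256_ps(
                       _mm256_set_epi32(0, -1, 0, -1, 0, -1, 0, -1));
    __m256  v    = _mm256_mask_i32gather_ps(src, table, idx, mask, 4);
    float out[8];
    _mm256_storeu_ps(out, v);
    for (int i = 0; i < 8; i++) printf("%g ", out[i]);  /* 0 -1 2 -1 4 -1 6 -1 */
    printf("\n");
    return 0;
}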
VGATHERDPS/VGATHERDPD—Gather Packed Single, Packed Double with Signed Dword Indices

Opcode/Instruction (all rows: Op/En T1S, 64/32-bit Mode V/V):
EVEX.128.66.0F38.W0 92 /vsib VGATHERDPS xmm1 {k1}, vm32x (AVX512VL AVX512F): Using signed dword indices, gather single-precision floating-point values from memory using k1 as completion mask.
EVEX.256.66.0F38.W0 92 /vsib VGATHERDPS ymm1 {k1}, vm32y (AVX512VL AVX512F): Using signed dword indices, gather single-precision floating-point values from memory using k1 as completion mask.
EVEX.512.66.0F38.W0 92 /vsib VGATHERDPS zmm1 {k1}, vm32z (AVX512F): Using signed dword indices, gather single-precision floating-point values from memory using k1 as completion mask.
EVEX.128.66.0F38.W1 92 /vsib VGATHERDPD xmm1 {k1}, vm32x (AVX512VL AVX512F): Using signed dword indices, gather float64 vector into float64 vector xmm1 using k1 as completion mask.
EVEX.256.66.0F38.W1 92 /vsib VGATHERDPD ymm1 {k1}, vm32x (AVX512VL AVX512F): Using signed dword indices, gather float64 vector into float64 vector ymm1 using k1 as completion mask.
EVEX.512.66.0F38.W1 92 /vsib VGATHERDPD zmm1 {k1}, vm32y (AVX512F): Using signed dword indices, gather float64 vector into float64 vector zmm1 using k1 as completion mask.

Instruction Operand Encoding
Op/En T1S: Operand 1: ModRM:reg (w); Operand 2: BaseReg (R): VSIB:base, VectorReg(R): VSIB:index; Operand 3: NA; Operand 4: NA

Description
A set of single-precision/double-precision floating-point memory locations pointed to by base address BASE_ADDR and index vector VINDEX with scale SCALE are gathered. The result is written into a vector register. The elements are specified via the VSIB (i.e., the index register is a vector register, holding packed indices). Elements will only be loaded if their corresponding mask bit is one. If an element’s mask bit is not set, the corresponding element of the destination register is left unchanged. The entire mask register will be set to zero by this instruction unless it triggers an exception.

This instruction can be suspended by an exception if at least one element is already gathered (i.e., if the exception is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination register and the mask register (k1) are partially updated; those elements that have been gathered are placed into the destination register and have their mask bits set to zero. If any traps or interrupts are pending from already gathered elements, they will be delivered in lieu of the exception; in this case, EFLAG.RF is set to one so an instruction breakpoint is not re-triggered when the instruction is continued.

If the data element size is less than the index element size, the higher part of the destination register and the mask register do not correspond to any elements being gathered. This instruction sets those higher parts to zero. It may set these unused parts to zero in one or both of those registers even if the instruction triggers an exception, and even if the instruction triggers the exception before gathering any elements.

Note that:
• The values may be read from memory in any order. Memory ordering with other instructions follows the Intel 64 memory-ordering model.
• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all elements closer to the LSB of the destination zmm will be completed (and non-faulting). Individual elements closer to the MSB may or may not be completed. If a given element triggers multiple faults, they are delivered in the conventional order.
• Elements may be gathered in any order, but faults must be delivered in a right-to-left order; thus, elements to the left of a faulting one may be gathered before the fault is delivered. A given implementation of this instruction is repeatable: given the same input values and architectural state, the same set of elements to the left of the faulting one will be gathered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• Not valid with 16-bit effective addresses. Will deliver a #UD fault.

Note that the presence of the VSIB byte is enforced in this instruction. Hence, the instruction will #UD fault if ModRM.rm is different than 100b.

This instruction has special disp8*N and alignment rules. N is considered to be the size of a single vector element.

The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are ignored.

The instruction will #UD fault if the destination vector zmm1 is the same as index vector VINDEX. The instruction will #UD fault if the k0 mask register is specified.
Operation
BASE_ADDR stands for the memory operand base address (a GPR); may not exist.
VINDEX stands for the memory operand vector of indices (a vector register).
SCALE stands for the memory operand scalar (1, 2, 4 or 8).
DISP is the optional 1, 2 or 4 byte displacement.

VGATHERDPS (EVEX encoded version)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
  i ← j * 32
  IF k1[j]
    THEN DEST[i+31:i] ← MEM[BASE_ADDR + SignExtend(VINDEX[i+31:i]) * SCALE + DISP]
      k1[j] ← 0
    ELSE *DEST[i+31:i] remains unchanged*
  FI;
ENDFOR
k1[MAX_KL-1:KL] ← 0
DEST[MAX_VL-1:VL] ← 0

VGATHERDPD (EVEX encoded version)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
  i ← j * 64
  k ← j * 32
  IF k1[j]
    THEN DEST[i+63:i] ← MEM[BASE_ADDR + SignExtend(VINDEX[k+31:k]) * SCALE + DISP]
      k1[j] ← 0
    ELSE *DEST[i+63:i] remains unchanged*
  FI;
ENDFOR
k1[MAX_KL-1:KL] ← 0
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VGATHERDPD __m512d _mm512_i32gather_pd( __m256i vdx, void * base, int scale);
VGATHERDPD __m512d _mm512_mask_i32gather_pd(__m512d s, __mmask8 k, __m256i vdx, void * base, int scale);
VGATHERDPD __m256d _mm256_mmask_i32gather_pd(__m256d s, __mmask8 k, __m128i vdx, void * base, int scale);
VGATHERDPD __m128d _mm_mmask_i32gather_pd(__m128d s, __mmask8 k, __m128i vdx, void * base, int scale);
VGATHERDPS __m512 _mm512_i32gather_ps( __m512i vdx, void * base, int scale);
VGATHERDPS __m512 _mm512_mask_i32gather_ps(__m512 s, __mmask16 k, __m512i vdx, void * base, int scale);
VGATHERDPS __m256 _mm256_mmask_i32gather_ps(__m256 s, __mmask8 k, __m256i vdx, void * base, int scale);
VGATHERDPS __m128 _mm_mmask_i32gather_ps(__m128 s, __mmask8 k, __m128i vdx, void * base, int scale);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type E12.
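In the EVEX form the opmask is a completion mask: gathered lanes clear their k1 bit, and unselected lanes keep the src value. A sketch assuming AVX512F (built with, e.g., -mavx512f); the data and mask are arbitrary example values:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float table[16];
    for (int i = 0; i < 16; i++) table[i] = (float)i;
    __m512i idx = _mm512_set_epi32(15, 14, 13, 12, 11, 10, 9, 8,
                                   7, 6, 5, 4, 3, 2, 1, 0);
    __m512  src = _mm512_set1_ps(-1.0f);  /* merged into lanes with k1[j] = 0 */
    __mmask16 k = 0x00FF;                 /* gather only the low eight lanes  */
    __m512  v   = _mm512_mask_i32gather_ps(src, k, idx, table, 4);
    float out[16];
    _mm512_storeu_ps(out, v);
    for (int i = 0; i < 16; i++) printf("%g ", out[i]);
    printf("\n");
    return 0;
}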
VGATHERPF0DPS/VGATHERPF0QPS/VGATHERPF0DPD/VGATHERPF0QPD—Sparse Prefetch Packed SP/DP Data Values with Signed Dword, Signed Qword Indices Using T0 Hint

Opcode/Instruction (all rows: Op/En T1S, 64/32-bit Mode V/V, CPUID AVX512PF):
EVEX.512.66.0F38.W0 C6 /1 /vsib VGATHERPF0DPS vm32z {k1}: Using signed dword indices, prefetch sparse byte memory locations containing single-precision data using opmask k1 and T0 hint.
EVEX.512.66.0F38.W0 C7 /1 /vsib VGATHERPF0QPS vm64z {k1}: Using signed qword indices, prefetch sparse byte memory locations containing single-precision data using opmask k1 and T0 hint.
EVEX.512.66.0F38.W1 C6 /1 /vsib VGATHERPF0DPD vm32y {k1}: Using signed dword indices, prefetch sparse byte memory locations containing double-precision data using opmask k1 and T0 hint.
EVEX.512.66.0F38.W1 C7 /1 /vsib VGATHERPF0QPD vm64z {k1}: Using signed qword indices, prefetch sparse byte memory locations containing double-precision data using opmask k1 and T0 hint.

Instruction Operand Encoding
Op/En T1S: Operand 1: BaseReg (R): VSIB:base, VectorReg(R): VSIB:index; Operand 2: NA; Operand 3: NA; Operand 4: NA

Description
The instruction conditionally prefetches up to sixteen 32-bit or eight 64-bit integer byte data elements. The elements are specified via the VSIB (i.e., the index register is a zmm, holding packed indices). Elements will only be prefetched if their corresponding mask bit is one.

Lines prefetched are loaded into a location in the cache hierarchy specified by a locality hint (T0):
• T0 (temporal data): prefetch data into the first level cache.
[PS data] For dword indices, the instruction will prefetch sixteen memory locations. For qword indices, the instruction will prefetch eight values.
[PD data] For dword and qword indices, the instruction will prefetch eight memory locations.

Note that: (1) The prefetches may happen in any order (or not at all); the instruction is a hint. (2) The mask is left unchanged. (3) Not valid with 16-bit effective addresses; will deliver a #UD fault. (4) No FP nor memory faults may be produced by this instruction. (5) Prefetches do not handle cache line splits. (6) A #UD is signaled if the memory operand is encoded without the SIB byte.

Operation
BASE_ADDR stands for the memory operand base address (a GPR); may not exist.
VINDEX stands for the memory operand vector of indices (a vector register).
SCALE stands for the memory operand scalar (1, 2, 4 or 8).
DISP is the optional 1, 2 or 4 byte displacement.
PREFETCH(mem, Level, State) prefetches a byte memory location pointed to by ‘mem’ into the cache level specified by ‘Level’; a request for exclusive/ownership is done if ‘State’ is 1. Note that the prefetch ignores cache line splits. This operation is considered a hint for the processor and may be skipped depending on implementation.

VGATHERPF0DPS (EVEX encoded version)
(KL, VL) = (16, 512)
FOR j ← 0 TO KL-1
  i ← j * 32
  IF k1[j]
    Prefetch( [BASE_ADDR + SignExtend(VINDEX[i+31:i]) * SCALE + DISP], Level=0, RFO=0)
  FI;
ENDFOR

VGATHERPF0DPD (EVEX encoded version)
(KL, VL) = (8, 512)
FOR j ← 0 TO KL-1
  i ← j * 64
  k ← j * 32
  IF k1[j]
    Prefetch( [BASE_ADDR + SignExtend(VINDEX[k+31:k]) * SCALE + DISP], Level=0, RFO=0)
  FI;
ENDFOR

VGATHERPF0QPS (EVEX encoded version)
(KL, VL) = (8, 256)
FOR j ← 0 TO KL-1
  i ← j * 64
  IF k1[j]
    Prefetch( [BASE_ADDR + SignExtend(VINDEX[i+63:i]) * SCALE + DISP], Level=0, RFO=0)
  FI;
ENDFOR

VGATHERPF0QPD (EVEX encoded version)
(KL, VL) = (8, 512)
FOR j ← 0 TO KL-1
  i ← j * 64
  k ← j * 64
  IF k1[j]
    Prefetch( [BASE_ADDR + SignExtend(VINDEX[k+63:k]) * SCALE + DISP], Level=0, RFO=0)
  FI;
ENDFOR

Intel C/C++ Compiler Intrinsic Equivalent
VGATHERPF0DPD void _mm512_mask_prefetch_i32gather_pd(__m256i vdx, __mmask8 m, void * base, int scale, int hint);
VGATHERPF0DPS void _mm512_mask_prefetch_i32gather_ps(__m512i vdx, __mmask16 m, void * base, int scale, int hint);
VGATHERPF0QPD void _mm512_mask_prefetch_i64gather_pd(__m512i vdx, __mmask8 m, void * base, int scale, int hint);
VGATHERPF0QPS void _mm512_mask_prefetch_i64gather_ps(__m512i vdx, __mmask8 m, void * base, int scale, int hint);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type E12NP.
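A sketch of the prefetch intrinsic listed above. AVX512PF is implemented only on processors that support it (e.g., Intel Xeon Phi x200), so treat this strictly as an illustration; prefetch_then_gather is a hypothetical wrapper name, not part of any library:

#include <immintrin.h>

void prefetch_then_gather(const float *base, __m512i idx, __mmask16 k)
{
    /* Hint the selected lines into the L1 cache (T0 hint, scale 4).  */
    /* This is only a hint; the processor may drop the prefetches.    */
    _mm512_mask_prefetch_i32gather_ps(idx, k, (void *)base, 4, _MM_HINT_T0);
    /* ... do other work, then gather the now-warm elements later ... */
}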
VGATHERPF1DPS/VGATHERPF1QPS/VGATHERPF1DPD/VGATHERPF1QPD—Sparse Prefetch Packed SP/DP Data Values with Signed Dword, Signed Qword Indices Using T1 Hint

Opcode/Instruction (all rows: Op/En T1S, 64/32-bit Mode V/V, CPUID AVX512PF):
EVEX.512.66.0F38.W0 C6 /2 /vsib VGATHERPF1DPS vm32z {k1}: Using signed dword indices, prefetch sparse byte memory locations containing single-precision data using opmask k1 and T1 hint.
EVEX.512.66.0F38.W0 C7 /2 /vsib VGATHERPF1QPS vm64z {k1}: Using signed qword indices, prefetch sparse byte memory locations containing single-precision data using opmask k1 and T1 hint.
EVEX.512.66.0F38.W1 C6 /2 /vsib VGATHERPF1DPD vm32y {k1}: Using signed dword indices, prefetch sparse byte memory locations containing double-precision data using opmask k1 and T1 hint.
EVEX.512.66.0F38.W1 C7 /2 /vsib VGATHERPF1QPD vm64z {k1}: Using signed qword indices, prefetch sparse byte memory locations containing double-precision data using opmask k1 and T1 hint.

Instruction Operand Encoding
Op/En T1S: Operand 1: BaseReg (R): VSIB:base, VectorReg(R): VSIB:index; Operand 2: NA; Operand 3: NA; Operand 4: NA

Description
The instruction conditionally prefetches up to sixteen 32-bit or eight 64-bit integer byte data elements. The elements are specified via the VSIB (i.e., the index register is a zmm, holding packed indices). Elements will only be prefetched if their corresponding mask bit is one.

Lines prefetched are loaded into a location in the cache hierarchy specified by a locality hint (T1):
• T1 (temporal data): prefetch data into the second level cache.
[PS data] For dword indices, the instruction will prefetch sixteen memory locations. For qword indices, the instruction will prefetch eight values.
[PD data] For dword and qword indices, the instruction will prefetch eight memory locations.

Note that: (1) The prefetches may happen in any order (or not at all); the instruction is a hint. (2) The mask is left unchanged. (3) Not valid with 16-bit effective addresses; will deliver a #UD fault. (4) No FP nor memory faults may be produced by this instruction. (5) Prefetches do not handle cache line splits. (6) A #UD is signaled if the memory operand is encoded without the SIB byte.

Operation
BASE_ADDR stands for the memory operand base address (a GPR); may not exist.
VINDEX stands for the memory operand vector of indices (a vector register).
SCALE stands for the memory operand scalar (1, 2, 4 or 8).
DISP is the optional 1, 2 or 4 byte displacement.
PREFETCH(mem, Level, State) prefetches a byte memory location pointed to by ‘mem’ into the cache level specified by ‘Level’; a request for exclusive/ownership is done if ‘State’ is 1. Note that the prefetch ignores cache line splits. This operation is considered a hint for the processor and may be skipped depending on implementation.
VGATHERPF1DPS/VGATHERPF1DPD/VGATHERPF1QPS/VGATHERPF1QPD (EVEX encoded versions)
The operation is identical to the corresponding VGATHERPF0 form above, except that each PREFETCH call specifies Level=1 instead of Level=0. (KL, VL) = (16, 512) for VGATHERPF1DPS, (8, 512) for VGATHERPF1DPD and VGATHERPF1QPD, and (8, 256) for VGATHERPF1QPS.

Intel C/C++ Compiler Intrinsic Equivalent
VGATHERPF1DPD void _mm512_mask_prefetch_i32gather_pd(__m256i vdx, __mmask8 m, void * base, int scale, int hint);
VGATHERPF1DPS void _mm512_mask_prefetch_i32gather_ps(__m512i vdx, __mmask16 m, void * base, int scale, int hint);
VGATHERPF1QPD void _mm512_mask_prefetch_i64gather_pd(__m512i vdx, __mmask8 m, void * base, int scale, int hint);
VGATHERPF1QPS void _mm512_mask_prefetch_i64gather_ps(__m512i vdx, __mmask8 m, void * base, int scale, int hint);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type E12NP.

VGATHERQPS/VGATHERQPD—Gather Packed Single, Packed Double with Signed Qword Indices

Opcode/Instruction (all rows: Op/En T1S, 64/32-bit Mode V/V):
EVEX.128.66.0F38.W0 93 /vsib VGATHERQPS xmm1 {k1}, vm64x (AVX512VL AVX512F): Using signed qword indices, gather single-precision floating-point values from memory using k1 as completion mask.
EVEX.256.66.0F38.W0 93 /vsib VGATHERQPS xmm1 {k1}, vm64y (AVX512VL AVX512F): Using signed qword indices, gather single-precision floating-point values from memory using k1 as completion mask.
EVEX.512.66.0F38.W0 93 /vsib VGATHERQPS ymm1 {k1}, vm64z (AVX512F): Using signed qword indices, gather single-precision floating-point values from memory using k1 as completion mask.
EVEX.128.66.0F38.W1 93 /vsib VGATHERQPD xmm1 {k1}, vm64x (AVX512VL AVX512F): Using signed qword indices, gather float64 vector into float64 vector xmm1 using k1 as completion mask.
EVEX.256.66.0F38.W1 93 /vsib VGATHERQPD ymm1 {k1}, vm64y (AVX512VL AVX512F): Using signed qword indices, gather float64 vector into float64 vector ymm1 using k1 as completion mask.
EVEX.512.66.0F38.W1 93 /vsib VGATHERQPD zmm1 {k1}, vm64z (AVX512F): Using signed qword indices, gather float64 vector into float64 vector zmm1 using k1 as completion mask.

Instruction Operand Encoding
Op/En T1S: Operand 1: ModRM:reg (w); Operand 2: BaseReg (R): VSIB:base, VectorReg(R): VSIB:index; Operand 3: NA; Operand 4: NA

Description
A set of 8 single-precision/double-precision floating-point memory locations pointed to by base address BASE_ADDR and index vector VINDEX with scale SCALE are gathered. The result is written into a vector register. The elements are specified via the VSIB (i.e., the index register is a vector register, holding packed indices). Elements will only be loaded if their corresponding mask bit is one.
If an element’s mask bit is not set, the corresponding element of the destination register is left unchanged. The entire mask register will be set to zero by this instruction unless it triggers an exception.

This instruction can be suspended by an exception if at least one element is already gathered (i.e., if the exception is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination register and the mask register (k1) are partially updated; those elements that have been gathered are placed into the destination register and have their mask bits set to zero. If any traps or interrupts are pending from already gathered elements, they will be delivered in lieu of the exception; in this case, EFLAG.RF is set to one so an instruction breakpoint is not re-triggered when the instruction is continued.

If the data element size is less than the index element size, the higher part of the destination register and the mask register do not correspond to any elements being gathered. This instruction sets those higher parts to zero. It may set these unused parts to zero in one or both of those registers even if the instruction triggers an exception, and even if the instruction triggers the exception before gathering any elements.

Note that:
• The values may be read from memory in any order. Memory ordering with other instructions follows the Intel 64 memory-ordering model.
• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all elements closer to the LSB of the destination zmm will be completed (and non-faulting). Individual elements closer to the MSB may or may not be completed. If a given element triggers multiple faults, they are delivered in the conventional order.
• Elements may be gathered in any order, but faults must be delivered in a right-to-left order; thus, elements to the left of a faulting one may be gathered before the fault is delivered. A given implementation of this instruction is repeatable: given the same input values and architectural state, the same set of elements to the left of the faulting one will be gathered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• Not valid with 16-bit effective addresses. Will deliver a #UD fault.

Note that the presence of the VSIB byte is enforced in this instruction. Hence, the instruction will #UD fault if ModRM.rm is different than 100b.

This instruction has special disp8*N and alignment rules. N is considered to be the size of a single vector element.

The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are ignored.

The instruction will #UD fault if the destination vector zmm1 is the same as index vector VINDEX. The instruction will #UD fault if the k0 mask register is specified.
Operation
BASE_ADDR stands for the memory operand base address (a GPR); may not exist.
VINDEX stands for the memory operand vector of indices (a ZMM register).
SCALE stands for the memory operand scalar (1, 2, 4 or 8).
DISP is the optional 1, 2 or 4 byte displacement.

VGATHERQPS (EVEX encoded version)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
  i ← j * 32
  k ← j * 64
  IF k1[j]
    THEN DEST[i+31:i] ← MEM[BASE_ADDR + (VINDEX[k+63:k]) * SCALE + DISP]
      k1[j] ← 0
    ELSE *DEST[i+31:i] remains unchanged*
  FI;
ENDFOR
k1[MAX_KL-1:KL] ← 0
DEST[MAX_VL-1:VL/2] ← 0

VGATHERQPD (EVEX encoded version)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
  i ← j * 64
  IF k1[j]
    THEN DEST[i+63:i] ← MEM[BASE_ADDR + (VINDEX[i+63:i]) * SCALE + DISP]
      k1[j] ← 0
    ELSE *DEST[i+63:i] remains unchanged*
  FI;
ENDFOR
k1[MAX_KL-1:KL] ← 0
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VGATHERQPD __m512d _mm512_i64gather_pd( __m512i vdx, void * base, int scale);
VGATHERQPD __m512d _mm512_mask_i64gather_pd(__m512d s, __mmask8 k, __m512i vdx, void * base, int scale);
VGATHERQPD __m256d _mm256_mask_i64gather_pd(__m256d s, __mmask8 k, __m256i vdx, void * base, int scale);
VGATHERQPD __m128d _mm_mask_i64gather_pd(__m128d s, __mmask8 k, __m128i vdx, void * base, int scale);
VGATHERQPS __m256 _mm512_i64gather_ps( __m512i vdx, void * base, int scale);
VGATHERQPS __m256 _mm512_mask_i64gather_ps(__m256 s, __mmask16 k, __m512i vdx, void * base, int scale);
VGATHERQPS __m128 _mm256_mask_i64gather_ps(__m128 s, __mmask8 k, __m256i vdx, void * base, int scale);
VGATHERQPS __m128 _mm_mask_i64gather_ps(__m128 s, __mmask8 k, __m128i vdx, void * base, int scale);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type E12.
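A sketch of the qword-index form under the same AVX512F assumptions; eight qword indices select eight doubles (scale 8):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    double table[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
    __m512i idx = _mm512_set_epi64(7, 6, 5, 4, 3, 2, 1, 0);
    /* Unmasked form: gathers all eight elements. */
    __m512d v = _mm512_i64gather_pd(idx, table, 8);
    double out[8];
    _mm512_storeu_pd(out, v);
    for (int i = 0; i < 8; i++) printf("%g ", out[i]);
    printf("\n");
    return 0;
}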
VPGATHERDD/VPGATHERQD—Gather Packed Dword Values Using Signed Dword/Qword Indices

Opcode/Instruction (all rows: Op/En RMV, 64/32-bit Mode V/V, CPUID AVX2):
VEX.DDS.128.66.0F38.W0 90 /r VPGATHERDD xmm1, vm32x, xmm2: Using dword indices specified in vm32x, gather dword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VEX.DDS.128.66.0F38.W0 91 /r VPGATHERQD xmm1, vm64x, xmm2: Using qword indices specified in vm64x, gather dword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.
VEX.DDS.256.66.0F38.W0 90 /r VPGATHERDD ymm1, vm32y, ymm2: Using dword indices specified in vm32y, gather dword values from memory conditioned on mask specified by ymm2. Conditionally gathered elements are merged into ymm1.
VEX.DDS.256.66.0F38.W0 91 /r VPGATHERQD xmm1, vm64y, xmm2: Using qword indices specified in vm64y, gather dword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.

Instruction Operand Encoding
Op/En RMV: Operand 1: ModRM:reg (r,w); Operand 2: BaseReg (R): VSIB:base, VectorReg(R): VSIB:index; Operand 3: VEX.vvvv (r, w); Operand 4: NA

Description
The instruction conditionally loads up to 4 or 8 dword values from memory addresses specified by the memory operand (the second operand), using dword or qword indices. The memory operand uses the VSIB form of the SIB byte to specify a general purpose register operand as the common base, a vector register for an array of indices relative to the base, and a constant scale factor.

The mask operand (the third operand) specifies the conditional load operation from each memory address and the corresponding update of each data element of the destination operand (the first operand). Conditionality is specified by the most significant bit of each data element of the mask register. If an element’s mask bit is not set, the corresponding element of the destination register is left unchanged. The width of data element in the destination register and mask register are identical. The entire mask register will be set to zero by this instruction unless the instruction causes an exception.

Using qword indices, the instruction conditionally loads up to 2 or 4 dword values from the VSIB addressing memory operand, and updates the lower half of the destination register. The upper 128 or 256 bits of the destination register are zeroed with qword indices.

This instruction can be suspended by an exception if at least one element is already gathered (i.e., if the exception is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination register and the mask operand are partially updated; those elements that have been gathered are placed into the destination register and have their mask bits set to zero. If any traps or interrupts are pending from already gathered elements, they will be delivered in lieu of the exception; in this case, EFLAG.RF is set to one so an instruction breakpoint is not re-triggered when the instruction is continued.

If the data size and index size are different, part of the destination register and part of the mask register do not correspond to any elements being gathered. This instruction sets those parts to zero. It may do this to one or both of those registers even if the instruction triggers an exception, and even if the instruction triggers the exception before gathering any elements.

VEX.128 version: For dword indices, the instruction will gather four dword values. For qword indices, the instruction will gather two values and zeroes the upper 64 bits of the destination.
VEX.256 version: For dword indices, the instruction will gather eight dword values. For qword indices, the instruction will gather four values and zeroes the upper 128 bits of the destination.

Note that:
• If any pair of the index, mask, or destination registers are the same, this instruction results in a #UD fault.
• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all elements closer to the LSB of the destination will be completed (and non-faulting). Individual elements closer to the MSB may or may not be completed. If a given element triggers multiple faults, they are delivered in the conventional order.
• Elements may be gathered in any order, but faults must be delivered in a right-to-left order; thus, elements to the left of a faulting one may be gathered before the fault is delivered. A given implementation of this instruction is repeatable: given the same input values and architectural state, the same set of elements to the left of the faulting one will be gathered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are ignored.
• The values may be read from memory in any order. Memory ordering with other instructions follows the Intel 64 memory-ordering model.
• This instruction will cause a #UD if the address size attribute is 16-bit.
• This instruction will cause a #UD if the memory operand is encoded without the SIB byte.
• This instruction should not be used to access memory mapped I/O as the ordering of the individual loads it does is implementation specific, and some implementations may use loads larger than the data element size or load elements an indeterminate number of times.

Operation
DEST ← SRC1;
BASE_ADDR: base register encoded in VSIB addressing;
VINDEX: the vector index register encoded by VSIB addressing;
SCALE: scale factor encoded by SIB:[7:6];
DISP: optional 1, 4 byte displacement;
MASK ← SRC3;

VPGATHERDD (VEX.128 version)
FOR j ← 0 to 3
  i ← j * 32;
  IF MASK[31+i] THEN
    MASK[i+31:i] ← FFFFFFFFH; // extend from most significant bit
  ELSE
    MASK[i+31:i] ← 0;
  FI;
ENDFOR
MASK[VLMAX-1:128] ← 0;
FOR j ← 0 to 3
  i ← j * 32;
  DATA_ADDR ← BASE_ADDR + SignExtend(VINDEX[i+31:i])*SCALE + DISP;
  IF MASK[31+i] THEN
    DEST[i+31:i] ← FETCH_32BITS(DATA_ADDR); // a fault exits the instruction
  FI;
  MASK[i+31:i] ← 0;
ENDFOR
DEST[VLMAX-1:128] ← 0;
(non-masked elements of the mask register have the content of respective element cleared)

VPGATHERQD (VEX.128 version)
FOR j ← 0 to 3
  i ← j * 32;
  IF MASK[31+i] THEN
    MASK[i+31:i] ← FFFFFFFFH; // extend from most significant bit
  ELSE
    MASK[i+31:i] ← 0;
  FI;
ENDFOR
MASK[VLMAX-1:128] ← 0;
FOR j ← 0 to 1
  k ← j * 64;
  i ← j * 32;
  DATA_ADDR ← BASE_ADDR + SignExtend(VINDEX[k+63:k])*SCALE + DISP;
  IF MASK[31+i] THEN
    DEST[i+31:i] ← FETCH_32BITS(DATA_ADDR); // a fault exits the instruction
  FI;
  MASK[i+31:i] ← 0;
ENDFOR
MASK[127:64] ← 0;
DEST[VLMAX-1:64] ← 0;
(non-masked elements of the mask register have the content of respective element cleared)

VPGATHERDD (VEX.256 version)
FOR j ← 0 to 7
  i ← j * 32;
  IF MASK[31+i] THEN
    MASK[i+31:i] ← FFFFFFFFH; // extend from most significant bit
  ELSE
    MASK[i+31:i] ← 0;
  FI;
ENDFOR
FOR j ← 0 to 7
  i ← j * 32;
  DATA_ADDR ← BASE_ADDR + SignExtend(VINDEX[i+31:i])*SCALE + DISP;
  IF MASK[31+i] THEN
    DEST[i+31:i] ← FETCH_32BITS(DATA_ADDR); // a fault exits the instruction
  FI;
  MASK[i+31:i] ← 0;
ENDFOR
(non-masked elements of the mask register have the content of respective element cleared)
VPGATHERQD (VEX.256 version)
FOR j ← 0 to 7
  i ← j * 32;
  IF MASK[31+i] THEN
    MASK[i+31:i] ← FFFFFFFFH; // extend from most significant bit
  ELSE
    MASK[i+31:i] ← 0;
  FI;
ENDFOR
FOR j ← 0 to 3
  k ← j * 64;
  i ← j * 32;
  DATA_ADDR ← BASE_ADDR + SignExtend(VINDEX[k+63:k])*SCALE + DISP;
  IF MASK[31+i] THEN
    DEST[i+31:i] ← FETCH_32BITS(DATA_ADDR); // a fault exits the instruction
  FI;
  MASK[i+31:i] ← 0;
ENDFOR
MASK[VLMAX-1:128] ← 0;
DEST[VLMAX-1:128] ← 0;
(non-masked elements of the mask register have the content of respective element cleared)

Intel C/C++ Compiler Intrinsic Equivalent
VPGATHERDD: __m128i _mm_i32gather_epi32 (int const * base, __m128i index, const int scale);
VPGATHERDD: __m128i _mm_mask_i32gather_epi32 (__m128i src, int const * base, __m128i index, __m128i mask, const int scale);
VPGATHERDD: __m256i _mm256_i32gather_epi32 (int const * base, __m256i index, const int scale);
VPGATHERDD: __m256i _mm256_mask_i32gather_epi32 (__m256i src, int const * base, __m256i index, __m256i mask, const int scale);
VPGATHERQD: __m128i _mm_i64gather_epi32 (int const * base, __m128i index, const int scale);
VPGATHERQD: __m128i _mm_mask_i64gather_epi32 (__m128i src, int const * base, __m128i index, __m128i mask, const int scale);
VPGATHERQD: __m128i _mm256_i64gather_epi32 (int const * base, __m256i index, const int scale);
VPGATHERQD: __m128i _mm256_mask_i64gather_epi32 (__m128i src, int const * base, __m256i index, __m128i mask, const int scale);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 12.
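A sketch of the AVX2 dword gather (same assumptions as the earlier AVX2 examples); the reversed indices make the gather permutation visible in the output:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    int table[8] = { 100, 101, 102, 103, 104, 105, 106, 107 };
    __m256i idx = _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7); /* lane j reads table[7-j] */
    __m256i v   = _mm256_i32gather_epi32(table, idx, 4);
    int out[8];
    _mm256_storeu_si256((__m256i *)out, v);
    for (int i = 0; i < 8; i++) printf("%d ", out[i]);      /* 107 106 ... 100 */
    printf("\n");
    return 0;
}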
VPGATHERDD/VPGATHERDQ—Gather Packed Dword, Packed Qword with Signed Dword Indices

Opcode/Instruction (all rows: Op/En T1S, 64/32-bit Mode V/V):
EVEX.128.66.0F38.W0 90 /vsib VPGATHERDD xmm1 {k1}, vm32x (AVX512VL AVX512F): Using signed dword indices, gather dword values from memory using writemask k1 for merging-masking.
EVEX.256.66.0F38.W0 90 /vsib VPGATHERDD ymm1 {k1}, vm32y (AVX512VL AVX512F): Using signed dword indices, gather dword values from memory using writemask k1 for merging-masking.
EVEX.512.66.0F38.W0 90 /vsib VPGATHERDD zmm1 {k1}, vm32z (AVX512F): Using signed dword indices, gather dword values from memory using writemask k1 for merging-masking.
EVEX.128.66.0F38.W1 90 /vsib VPGATHERDQ xmm1 {k1}, vm32x (AVX512VL AVX512F): Using signed dword indices, gather quadword values from memory using writemask k1 for merging-masking.
EVEX.256.66.0F38.W1 90 /vsib VPGATHERDQ ymm1 {k1}, vm32x (AVX512VL AVX512F): Using signed dword indices, gather quadword values from memory using writemask k1 for merging-masking.
EVEX.512.66.0F38.W1 90 /vsib VPGATHERDQ zmm1 {k1}, vm32y (AVX512F): Using signed dword indices, gather quadword values from memory using writemask k1 for merging-masking.

Instruction Operand Encoding
Op/En T1S: Operand 1: ModRM:reg (w); Operand 2: BaseReg (R): VSIB:base, VectorReg(R): VSIB:index; Operand 3: NA; Operand 4: NA

Description
A set of 16 or 8 doubleword/quadword memory locations pointed to by base address BASE_ADDR and index vector VINDEX with scale SCALE are gathered. The result is written into vector zmm1. The elements are specified via the VSIB (i.e., the index register is a zmm, holding packed indices). Elements will only be loaded if their corresponding mask bit is one. If an element’s mask bit is not set, the corresponding element of the destination register (zmm1) is left unchanged. The entire mask register will be set to zero by this instruction unless it triggers an exception.

This instruction can be suspended by an exception if at least one element is already gathered (i.e., if the exception is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination register and the mask register (k1) are partially updated; those elements that have been gathered are placed into the destination register and have their mask bits set to zero. If any traps or interrupts are pending from already gathered elements, they will be delivered in lieu of the exception; in this case, EFLAG.RF is set to one so an instruction breakpoint is not re-triggered when the instruction is continued.

If the data element size is less than the index element size, the higher part of the destination register and the mask register do not correspond to any elements being gathered. This instruction sets those higher parts to zero. It may set these unused parts to zero in one or both of those registers even if the instruction triggers an exception, and even if the instruction triggers the exception before gathering any elements.

Note that:
• The values may be read from memory in any order. Memory ordering with other instructions follows the Intel 64 memory-ordering model.
• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all elements closer to the LSB of the destination zmm will be completed (and non-faulting). Individual elements closer to the MSB may or may not be completed. If a given element triggers multiple faults, they are delivered in the conventional order.
• Elements may be gathered in any order, but faults must be delivered in a right-to-left order; thus, elements to the left of a faulting one may be gathered before the fault is delivered. A given implementation of this instruction is repeatable: given the same input values and architectural state, the same set of elements to the left of the faulting one will be gathered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• Not valid with 16-bit effective addresses. Will deliver a #UD fault.
• These instructions do not accept zeroing-masking since the 0 values in k1 are used to determine completion.

Note that the presence of the VSIB byte is enforced in this instruction. Hence, the instruction will #UD fault if ModRM.rm is different than 100b.

This instruction has the same disp8*N and alignment rules as for scalar instructions (Tuple 1).

The instruction will #UD fault if the destination vector zmm1 is the same as index vector VINDEX. The instruction will #UD fault if the k0 mask register is specified.

The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are ignored.
Operation BASE_ADDR stands for the memory operand base address (a GPR); may not exist VINDEX stands for the memory operand vector of indices (a ZMM register) SCALE stands for the memory operand scalar (1, 2, 4 or 8) DISP is the optional 1, 2 or 4 byte displacement VPGATHERDD (EVEX encoded version) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF k1[j] THEN DEST[i+31:i] MEM[BASE_ADDR + SignExtend(VINDEX[i+31:i]) * SCALE + DISP]), 1) k1[j] 0 ELSE *DEST[i+31:i] remains unchanged* ; Only merging masking is allowed FI; ENDFOR k1[MAX_KL-1:KL] 0 DEST[MAX_VL-1:VL] 0 VPGATHERDQ (EVEX encoded version) (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 k j * 32 IF k1[j] THEN DEST[i+63:i] MEM[BASE_ADDR + SignExtend(VINDEX[k+31:k]) * SCALE + DISP]) k1[j] 0 ELSE *DEST[i+63:i] remains unchanged* ; Only merging masking is allowed FI; ENDFOR k1[MAX_KL-1:KL] 0 DEST[MAX_VL-1:VL] 0 5-278 Vol. 2C VPGATHERDD/VPGATHERDQ—Gather Packed Dword, Packed Qword with Signed Dword Indices INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VPGATHERDD __m512i _mm512_i32gather_epi32( __m512i vdx, void * base, int scale); VPGATHERDD __m512i _mm512_mask_i32gather_epi32(__m512i s, __mmask16 k, __m512i vdx, void * base, int scale); VPGATHERDD __m256i _mm256_mmask_i32gather_epi32(__m256i s, __mmask8 k, __m256i vdx, void * base, int scale); VPGATHERDD __m128i _mm_mmask_i32gather_epi32(__m128i s, __mmask8 k, __m128i vdx, void * base, int scale); VPGATHERDQ __m512i _mm512_i32logather_epi64( __m256i vdx, void * base, int scale); VPGATHERDQ __m512i _mm512_mask_i32logather_epi64(__m512i s, __mmask8 k, __m256i vdx, void * base, int scale); VPGATHERDQ __m256i _mm256_mmask_i32logather_epi64(__m256i s, __mmask8 k, __m128i vdx, void * base, int scale); VPGATHERDQ __m128i _mm_mmask_i32gather_epi64(__m128i s, __mmask8 k, __m128i vdx, void * base, int scale); SIMD Floating-Point Exceptions None Other Exceptions See Exceptions Type E12. VPGATHERDD/VPGATHERDQ—Gather Packed Dword, Packed Qword with Signed Dword Indices Vol. 2C 5-279 INSTRUCTION SET REFERENCE, V-Z VPGATHERDQ/VPGATHERQQ — Gather Packed Qword Values Using Signed Dword/Qword Indices Opcode/ Instruction Op/ En RMV 64/32 -bit Mode V/V CPUID Feature Flag AVX2 VEX.DDS.128.66.0F38.W1 90 /r VPGATHERDQ xmm1, vm32x, xmm2 VEX.DDS.128.66.0F38.W1 91 /r VPGATHERQQ xmm1, vm64x, xmm2 RMV V/V AVX2 VEX.DDS.256.66.0F38.W1 90 /r VPGATHERDQ ymm1, vm32x, ymm2 RMV V/V AVX2 VEX.DDS.256.66.0F38.W1 91 /r VPGATHERQQ ymm1, vm64y, ymm2 RMV V/V AVX2 Description Using dword indices specified in vm32x, gather qword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1. Using qword indices specified in vm64x, gather qword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1. Using dword indices specified in vm32x, gather qword values from memory conditioned on mask specified by ymm2. Conditionally gathered elements are merged into ymm1. Using qword indices specified in vm64y, gather qword values from memory conditioned on mask specified by ymm2. Conditionally gathered elements are merged into ymm1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 A ModRM:reg (r,w) BaseReg (R): VSIB:base, VectorReg(R): VSIB:index VEX.vvvv (r, w) NA Description The instruction conditionally loads up to 2 or 4 qword values from memory addresses specified by the memory operand (the second operand) and using qword indices. 
The memory operand uses the VSIB form of the SIB byte to specify a general-purpose register operand as the common base, a vector register for an array of indices relative to the base, and a constant scale factor.

The mask operand (the third operand) specifies the conditional load operation from each memory address and the corresponding update of each data element of the destination operand (the first operand). Conditionality is specified by the most significant bit of each data element of the mask register. If an element’s mask bit is not set, the corresponding element of the destination register is left unchanged. The widths of the data elements in the destination register and the mask register are identical. The entire mask register will be set to zero by this instruction unless the instruction causes an exception.

Using dword indices in the lower half of the vector index register, the instruction conditionally loads up to 2 or 4 qword values from the VSIB addressing memory operand, and updates the destination register.

This instruction can be suspended by an exception if at least one element is already gathered (i.e., if the exception is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination register and the mask operand are partially updated; those elements that have been gathered are placed into the destination register and have their mask bits set to zero. If any traps or interrupts are pending from already gathered elements, they will be delivered in lieu of the exception; in this case, EFLAGS.RF is set to one so an instruction breakpoint is not re-triggered when the instruction is continued.

If the data size and index size are different, part of the destination register and part of the mask register do not correspond to any elements being gathered. This instruction sets those parts to zero. It may do this to one or both of those registers even if the instruction triggers an exception, and even if the instruction triggers the exception before gathering any elements.

VEX.128 version: The instruction will gather two qword values. For dword indices, only the lower two indices in the vector index register are used.

VEX.256 version: The instruction will gather four qword values. For dword indices, only the lower four indices in the vector index register are used.

Note that:
• If any pair of the index, mask, or destination registers are the same, this instruction results in a #UD fault.
• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all elements closer to the LSB of the destination will be completed (and non-faulting). Individual elements closer to the MSB may or may not be completed. If a given element triggers multiple faults, they are delivered in the conventional order.
• Elements may be gathered in any order, but faults must be delivered in a right-to-left order; thus, elements to the left of a faulting one may be gathered before the fault is delivered. A given implementation of this instruction is repeatable: given the same input values and architectural state, the same set of elements to the left of the faulting one will be gathered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32bit mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are ignored. The values may be read from memory in any order. Memory ordering with other instructions follows the Intel64 memory-ordering model. This instruction will cause a #UD if the address size attribute is 16-bit. This instruction will cause a #UD if the memory operand is encoded without the SIB byte. This instruction should not be used to access memory mapped I/O as the ordering of the individual loads it does is implementation specific, and some implementations may use loads larger than the data element size or load elements an indeterminate number of times. VPGATHERDQ/VPGATHERQQ — Gather Packed Qword Values Using Signed Dword/Qword Indices Vol. 2C 5-281 INSTRUCTION SET REFERENCE, V-Z Operation DEST SRC1; BASE_ADDR: base register encoded in VSIB addressing; VINDEX: the vector index register encoded by VSIB addressing; SCALE: scale factor encoded by SIB:[7:6]; DISP: optional 1, 4 byte displacement; MASK SRC3; VPGATHERDQ (VEX.128 version) FOR j 0 to 1 i j * 64; IF MASK[63+i] THEN MASK[i +63:i] FFFFFFFF_FFFFFFFFH; // extend from most significant bit ELSE MASK[i +63:i] 0; FI; ENDFOR FOR j 0 to 1 k j * 32; i j * 64; DATA_ADDR BASE_ADDR + (SignExtend(VINDEX[k+31:k])*SCALE + DISP; IF MASK[63+i] THEN DEST[i +63:i] FETCH_64BITS(DATA_ADDR); // a fault exits the instruction FI; MASK[i +63:i] 0; ENDFOR MASK[VLMAX-1:128] 0; DEST[VLMAX-1:128] 0; (non-masked elements of the mask register have the content of respective element cleared) VPGATHERQQ (VEX.128 version) FOR j 0 to 1 i j * 64; IF MASK[63+i] THEN MASK[i +63:i] FFFFFFFF_FFFFFFFFH; // extend from most significant bit ELSE MASK[i +63:i] 0; FI; ENDFOR FOR j 0 to 1 i j * 64; DATA_ADDR BASE_ADDR + (SignExtend(VINDEX1[i+63:i])*SCALE + DISP; IF MASK[63+i] THEN DEST[i +63:i] FETCH_64BITS(DATA_ADDR); // a fault exits the instruction FI; MASK[i +63:i] 0; ENDFOR MASK[VLMAX-1:128] 0; DEST[VLMAX-1:128] 0; (non-masked elements of the mask register have the content of respective element cleared) 5-282 Vol. 2C VPGATHERDQ/VPGATHERQQ — Gather Packed Qword Values Using Signed Dword/Qword Indices INSTRUCTION SET REFERENCE, V-Z VPGATHERQQ (VEX.256 version) FOR j 0 to 3 i j * 64; IF MASK[63+i] THEN MASK[i +63:i] FFFFFFFF_FFFFFFFFH; // extend from most significant bit ELSE MASK[i +63:i] 0; FI; ENDFOR FOR j 0 to 3 i j * 64; DATA_ADDR BASE_ADDR + (SignExtend(VINDEX1[i+63:i])*SCALE + DISP; IF MASK[63+i] THEN DEST[i +63:i] FETCH_64BITS(DATA_ADDR); // a fault exits the instruction FI; MASK[i +63:i] 0; ENDFOR (non-masked elements of the mask register have the content of respective element cleared) VPGATHERDQ (VEX.256 version) FOR j 0 to 3 i j * 64; IF MASK[63+i] THEN MASK[i +63:i] FFFFFFFF_FFFFFFFFH; // extend from most significant bit ELSE MASK[i +63:i] 0; FI; ENDFOR FOR j 0 to 3 k j * 32; i j * 64; DATA_ADDR BASE_ADDR + (SignExtend(VINDEX1[k+31:k])*SCALE + DISP; IF MASK[63+i] THEN DEST[i +63:i] FETCH_64BITS(DATA_ADDR); // a fault exits the instruction FI; MASK[i +63:i] 0; ENDFOR (non-masked elements of the mask register have the content of respective element cleared) VPGATHERDQ/VPGATHERQQ — Gather Packed Qword Values Using Signed Dword/Qword Indices Vol. 
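For the VEX encodings, a short sketch under the same caveats (illustrative data only; AVX2 assumed, e.g., compile with -mavx2), using the unmasked _mm256_i32gather_epi64 variant from the intrinsic list that follows. It shows dword indices in the low half of an XMM index register selecting qword elements:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    long long table[16];
    for (int i = 0; i < 16; i++)
        table[i] = 1000 + i;

    /* Four dword indices select four qword elements (scale = 8 bytes). */
    __m128i vindex = _mm_setr_epi32(0, 5, 10, 15);
    __m256i dst = _mm256_i32gather_epi64(table, vindex, 8);

    long long out[4];
    _mm256_storeu_si256((__m256i *)out, dst);
    for (int i = 0; i < 4; i++)
        printf("%lld ", out[i]);                 /* 1000 1005 1010 1015 */
    printf("\n");
    return 0;
}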
2C 5-283 INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VPGATHERDQ: __m128i _mm_i32gather_epi64 (__int64 const * base, __m128i index, const int scale); VPGATHERDQ: __m128i _mm_mask_i32gather_epi64 (__m128i src, __int64 const * base, __m128i index, __m128i mask, const int scale); VPGATHERDQ: __m256i _mm256_i32gather_epi64 (__int64 const * base, __m128i index, const int scale); VPGATHERDQ: __m256i _mm256_mask_i32gather_epi64 (__m256i src, __int64 const * base, __m128i index, __m256i mask, const int scale); VPGATHERQQ: __m128i _mm_i64gather_epi64 (__int64 const * base, __m128i index, const int scale); VPGATHERQQ: __m128i _mm_mask_i64gather_epi64 (__m128i src, __int64 const * base, __m128i index, __m128i mask, const int scale); VPGATHERQQ: __m256i _mm256_i64gather_epi64 __(int64 const * base, __m256i index, const int scale); VPGATHERQQ: __m256i _mm256_mask_i64gather_epi64 (__m256i src, __int64 const * base, __m256i index, __m256i mask, const int scale); SIMD Floating-Point Exceptions None Other Exceptions See Exceptions Type 12. 5-284 Vol. 2C VPGATHERDQ/VPGATHERQQ — Gather Packed Qword Values Using Signed Dword/Qword Indices INSTRUCTION SET REFERENCE, V-Z VPGATHERQD/VPGATHERQQ—Gather Packed Dword, Packed Qword with Signed Qword Indices Opcode/ Instruction Op/ En EVEX.128.66.0F38.W0 91 /vsib VPGATHERQD xmm1 {k1}, vm64x EVEX.256.66.0F38.W0 91 /vsib VPGATHERQD xmm1 {k1}, vm64y EVEX.512.66.0F38.W0 91 /vsib VPGATHERQD ymm1 {k1}, vm64z EVEX.128.66.0F38.W1 91 /vsib VPGATHERQQ xmm1 {k1}, vm64x EVEX.256.66.0F38.W1 91 /vsib VPGATHERQQ ymm1 {k1}, vm64y EVEX.512.66.0F38.W1 91 /vsib VPGATHERQQ zmm1 {k1}, vm64z T1S 64/32 bit Mode Support V/V T1S V/V T1S V/V T1S V/V T1S V/V T1S V/V CPUID Feature Flag AVX512VL AVX512F AVX512VL AVX512F AVX512F AVX512VL AVX512F AVX512VL AVX512F AVX512F Description Using signed qword indices, gather dword values from memory using writemask k1 for merging-masking. Using signed qword indices, gather dword values from memory using writemask k1 for merging-masking. Using signed qword indices, gather dword values from memory using writemask k1 for merging-masking. Using signed qword indices, gather quadword values from memory using writemask k1 for merging-masking. Using signed qword indices, gather quadword values from memory using writemask k1 for merging-masking. Using signed qword indices, gather quadword values from memory using writemask k1 for merging-masking. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 T1S ModRM:reg (w) BaseReg (R): VSIB:base, VectorReg(R): VSIB:index NA NA Description A set of 8 doubleword/quadword memory locations pointed to by base address BASE_ADDR and index vector VINDEX with scale SCALE are gathered. The result is written into a vector register. The elements are specified via the VSIB (i.e., the index register is a vector register, holding packed indices). Elements will only be loaded if their corresponding mask bit is one. If an element’s mask bit is not set, the corresponding element of the destination register is left unchanged. The entire mask register will be set to zero by this instruction unless it triggers an exception. This instruction can be suspended by an exception if at least one element is already gathered (i.e., if the exception is triggered by an element other than the rightmost one with its mask bit set). 
When this happens, the destination register and the mask register (k1) are partially updated; those elements that have been gathered are placed into the destination register and have their mask bits set to zero. If any traps or interrupts are pending from already gathered elements, they will be delivered in lieu of the exception; in this case, EFLAGS.RF is set to one so an instruction breakpoint is not re-triggered when the instruction is continued.

If the data element size is less than the index element size, the higher part of the destination register and the mask register do not correspond to any elements being gathered. This instruction sets those higher parts to zero. It may update these unused parts of one or both of those registers even if the instruction triggers an exception, and even if the instruction triggers the exception before gathering any elements.

Note that:
• The values may be read from memory in any order. Memory ordering with other instructions follows the Intel-64 memory-ordering model.
• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all elements closer to the LSB of the destination zmm will be completed (and non-faulting). Individual elements closer to the MSB may or may not be completed. If a given element triggers multiple faults, they are delivered in the conventional order.
• Elements may be gathered in any order, but faults must be delivered in a right-to-left order; thus, elements to the left of a faulting one may be gathered before the fault is delivered. A given implementation of this instruction is repeatable: given the same input values and architectural state, the same set of elements to the left of the faulting one will be gathered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• This instruction is not valid with 16-bit effective addresses, and will deliver a #UD fault.
• These instructions do not accept zeroing-masking, since the 0 values in k1 are used to determine completion.

Note that the presence of the VSIB byte is enforced in this instruction. Hence, the instruction will #UD fault if ModRM.rm is different than 100b.

This instruction has the same disp8*N and alignment rules as for scalar instructions (Tuple 1).

The instruction will #UD fault if the destination vector zmm1 is the same as the index vector VINDEX. The instruction will #UD fault if the k0 mask register is specified.

The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are ignored.
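A small sketch of the dword-gather-with-qword-indices case discussed above (illustrative values; AVX-512F assumed), using the _mm512_i64gather_epi32 form from the intrinsic list that follows. Because the data elements are half the width of the indices, the result of the 512-bit form occupies only a YMM register, matching the zeroed upper destination parts described in this section:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    int table[256];
    for (int i = 0; i < 256; i++)
        table[i] = i * 10;

    /* Eight 64-bit indices gather eight 32-bit elements. */
    __m512i vindex = _mm512_setr_epi64(3, 1, 4, 1, 5, 9, 2, 6);
    __m256i dst = _mm512_i64gather_epi32(vindex, table, 4);

    int out[8];
    _mm256_storeu_si256((__m256i *)out, dst);
    for (int i = 0; i < 8; i++)
        printf("%d ", out[i]);                   /* 30 10 40 10 50 90 20 60 */
    printf("\n");
    return 0;
}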
Operation BASE_ADDR stands for the memory operand base address (a GPR); may not exist VINDEX stands for the memory operand vector of indices (a ZMM register) SCALE stands for the memory operand scalar (1, 2, 4 or 8) DISP is the optional 1, 2 or 4 byte displacement VPGATHERQD (EVEX encoded version) (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 32 k j * 64 IF k1[j] THEN DEST[i+31:i] MEM[BASE_ADDR + (VINDEX[k+63:k]) * SCALE + DISP]), 1) k1[j] 0 ELSE *DEST[i+31:i] remains unchanged* ; Only merging masking is allowed FI; ENDFOR k1[MAX_KL-1:KL] 0 DEST[MAX_VL-1:VL/2] 0 VPGATHERQQ (EVEX encoded version) (KL, VL) = (2, 64), (4, 128), (8, 256) FOR j 0 TO KL-1 i j * 64 IF k1[j] THEN DEST[i+63:i] MEM[BASE_ADDR + (VINDEX[i+63:i]) * SCALE + DISP]) k1[j] 0 ELSE *DEST[i+63:i] remains unchanged* ; Only merging masking is allowed FI; ENDFOR k1[MAX_KL-1:KL] 0 DEST[MAX_VL-1:VL] 0 5-286 Vol. 2C VPGATHERQD/VPGATHERQQ—Gather Packed Dword, Packed Qword with Signed Qword Indices INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VPGATHERQD __m256i _mm512_i64gather_epi32(__m512i vdx, void * base, int scale); VPGATHERQD __m256i _mm512_mask_i64gather_epi32lo(__m256i s, __mmask8 k, __m512i vdx, void * base, int scale); VPGATHERQD __m128i _mm256_mask_i64gather_epi32lo(__m128i s, __mmask8 k, __m256i vdx, void * base, int scale); VPGATHERQD __m128i _mm_mask_i64gather_epi32(__m128i s, __mmask8 k, __m128i vdx, void * base, int scale); VPGATHERQQ __m512i _mm512_i64gather_epi64( __m512i vdx, void * base, int scale); VPGATHERQQ __m512i _mm512_mask_i64gather_epi64(__m512i s, __mmask8 k, __m512i vdx, void * base, int scale); VPGATHERQQ __m256i _mm256_mask_i64gather_epi64(__m256i s, __mmask8 k, __m256i vdx, void * base, int scale); VPGATHERQQ __m128i _mm_mask_i64gather_epi64(__m128i s, __mmask8 k, __m128i vdx, void * base, int scale); SIMD Floating-Point Exceptions None Other Exceptions See Exceptions Type E12. VPGATHERQD/VPGATHERQQ—Gather Packed Dword, Packed Qword with Signed Qword Indices Vol. 2C 5-287 INSTRUCTION SET REFERENCE, V-Z VGETEXPPD—Convert Exponents of Packed DP FP Values to DP FP Values Opcode/ Instruction Op/ En FV 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512F EVEX.128.66.0F38.W1 42 /r VGETEXPPD xmm1 {k1}{z}, xmm2/m128/m64bcst EVEX.256.66.0F38.W1 42 /r VGETEXPPD ymm1 {k1}{z}, ymm2/m256/m64bcst FV V/V AVX512VL AVX512F EVEX.512.66.0F38.W1 42 /r VGETEXPPD zmm1 {k1}{z}, zmm2/m512/m64bcst{sae} FV V/V AVX512F Description Convert the exponent of packed double-precision floating-point values in the source operand to DP FP results representing unbiased integer exponents and stores the results in the destination register. Convert the exponent of packed double-precision floating-point values in the source operand to DP FP results representing unbiased integer exponents and stores the results in the destination register. Convert the exponent of packed double-precision floating-point values in the source operand to DP FP results representing unbiased integer exponents and stores the results in the destination under writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 FV ModRM:reg (w) ModRM:r/m (r) NA NA Description Extracts the biased exponents from the normalized DP FP representation of each qword data element of the source operand (the second operand) as unbiased signed integer value, or convert the denormal representation of input data to unbiased negative integer values. 
Each integer value of the unbiased exponent is converted to doubleprecision FP value and written to the corresponding qword elements of the destination operand (the first operand) as DP FP numbers. The destination operand is a ZMM/YMM/XMM register and updated under the writemask. The source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-bit memory location. EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD. Each GETEXP operation converts the exponent value into a FP number (permitting input value in denormal representation). Special cases of input values are listed in Table 5-7. The formula is: GETEXP(x) = floor(log2(|x|)) Notation floor(x) stands for the greatest integer not exceeding real number x. Table 5-7. VGETEXPPD/SD Special Cases 5-288 Vol. 2C Input Operand Result Comments src1 = NaN QNaN(src1) No Exceptions 0 < |src1| < INF floor(log2(|src1|)) | src1| = +INF +INF | src1| = 0 -INF VGETEXPPD—Convert Exponents of Packed DP FP Values to DP FP Values INSTRUCTION SET REFERENCE, V-Z Operation NormalizeExpTinyDPFP(SRC[63:0]) { // Jbit is the hidden integral bit of a FP number. In case of denormal number it has the value of ZERO. Src.Jbit 0; Dst.exp 1; Dst.fraction SRC[51:0]; WHILE(Src.Jbit = 0) { Src.Jbit Dst.fraction[51]; // Get the fraction MSB Dst.fraction Dst.fraction << 1 ; // One bit shift left Dst.exp-- ; // Decrement the exponent } Dst.fraction 0; // zero out fraction bits Dst.sign 1; // Return negative sign TMP[63:0] MXCSR.DAZ? 0 : (Dst.sign << 63) OR (Dst.exp << 52) OR (Dst.fraction) ; Return (TMP[63:0]); } ConvertExpDPFP(SRC[63:0]) { Src.sign 0; // Zero out sign bit Src.exp SRC[62:52]; Src.fraction SRC[51:0]; // Check for NaN IF (SRC = NaN) { IF ( SRC = SNAN ) SET IE; Return QNAN(SRC); } // Check for +INF IF (SRC = +INF) Return (SRC); // check if zero operand IF ((Src.exp = 0) AND ((Src.fraction = 0) OR (MXCSR.DAZ = 1))) Return (-INF); } ELSE // check if denormal operand (notice that MXCSR.DAZ = 0) { IF ((Src.exp = 0) AND (Src.fraction != 0)) { TMP[63:0] NormalizeExpTinyDPFP(SRC[63:0]) ; // Get Normalized Exponent Set #DE } ELSE // exponent value is correct { Dst.fraction 0; // zero out fraction bits TMP[63:0] (Src.sign << 63) OR (Src.exp << 52) OR (Src.fraction) ; } TMP SAR(TMP, 52) ; // Shift Arithmetic Right TMP TMP – 1023; // Subtract Bias Return CvtI2D(TMP) ; // Convert INT to Double-Precision FP number } } VGETEXPPD—Convert Exponents of Packed DP FP Values to DP FP Values Vol. 
2C 5-289 INSTRUCTION SET REFERENCE, V-Z VGETEXPPD (EVEX encoded versions) (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN IF (EVEX.b = 1) AND (SRC *is memory*) THEN DEST[i+63:i] ConvertExpDPFP(SRC[63:0]) ELSE DEST[i+63:i] ConvertExpDPFP(SRC[i+63:i]) FI; ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 Intel C/C++ Compiler Intrinsic Equivalent VGETEXPPD __m512d _mm512_getexp_pd(__m512d a); VGETEXPPD __m512d _mm512_mask_getexp_pd(__m512d s, __mmask8 k, __m512d a); VGETEXPPD __m512d _mm512_maskz_getexp_pd( __mmask8 k, __m512d a); VGETEXPPD __m512d _mm512_getexp_round_pd(__m512d a, int sae); VGETEXPPD __m512d _mm512_mask_getexp_round_pd(__m512d s, __mmask8 k, __m512d a, int sae); VGETEXPPD __m512d _mm512_maskz_getexp_round_pd( __mmask8 k, __m512d a, int sae); VGETEXPPD __m256d _mm256_getexp_pd(__m256d a); VGETEXPPD __m256d _mm256_mask_getexp_pd(__m256d s, __mmask8 k, __m256d a); VGETEXPPD __m256d _mm256_maskz_getexp_pd( __mmask8 k, __m256d a); VGETEXPPD __m128d _mm_getexp_pd(__m128d a); VGETEXPPD __m128d _mm_mask_getexp_pd(__m128d s, __mmask8 k, __m128d a); VGETEXPPD __m128d _mm_maskz_getexp_pd( __mmask8 k, __m128d a); SIMD Floating-Point Exceptions Invalid, Denormal Other Exceptions See Exceptions Type E2. #UD 5-290 Vol. 2C If EVEX.vvvv != 1111B. VGETEXPPD—Convert Exponents of Packed DP FP Values to DP FP Values INSTRUCTION SET REFERENCE, V-Z VGETEXPPS—Convert Exponents of Packed SP FP Values to SP FP Values Opcode/ Instruction Op/ En FV 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512F EVEX.128.66.0F38.W0 42 /r VGETEXPPS xmm1 {k1}{z}, xmm2/m128/m32bcst EVEX.256.66.0F38.W0 42 /r VGETEXPPS ymm1 {k1}{z}, ymm2/m256/m32bcst FV V/V AVX512VL AVX512F EVEX.512.66.0F38.W0 42 /r VGETEXPPS zmm1 {k1}{z}, zmm2/m512/m32bcst{sae} FV V/V AVX512F Description Convert the exponent of packed single-precision floating-point values in the source operand to SP FP results representing unbiased integer exponents and stores the results in the destination register. Convert the exponent of packed single-precision floating-point values in the source operand to SP FP results representing unbiased integer exponents and stores the results in the destination register. Convert the exponent of packed single-precision floating-point values in the source operand to SP FP results representing unbiased integer exponents and stores the results in the destination register. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 FV ModRM:reg (w) ModRM:r/m (r) NA NA Description Extracts the biased exponents from the normalized SP FP representation of each dword element of the source operand (the second operand) as unbiased signed integer value, or convert the denormal representation of input data to unbiased negative integer values. Each integer value of the unbiased exponent is converted to single-precision FP value and written to the corresponding dword elements of the destination operand (the first operand) as SP FP numbers. The destination operand is a ZMM/YMM/XMM register and updated under the writemask. The source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32-bit memory location. EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD. Each GETEXP operation converts the exponent value into a FP number (permitting input value in denormal representation). 
Special cases of input values are listed in Table 5-8.

The formula is:
GETEXP(x) = floor(log2(|x|))
Notation floor(x) stands for the maximal integer not exceeding real number x.

Software usage of the VGETEXPxx and VGETMANTxx instructions generally involves a combination of the GETEXP operation and the GETMANT operation (see VGETMANTPD). Thus VGETEXPxx instructions do not require software to handle SIMD FP exceptions.

Table 5-8. VGETEXPPS/SS Special Cases
Input Operand      | Result              | Comments
src1 = NaN         | QNaN(src1)          | No Exceptions
0 < |src1| < INF   | floor(log2(|src1|)) |
|src1| = +INF      | +INF                |
|src1| = 0         | -INF                |

Figure 5-14 illustrates the VGETEXPPS functionality on input values with normalized representation: for Src = 2^1, SAR(Src, 23) isolates the biased exponent (080h); subtracting the bias (7Fh) leaves 1; and converting that integer to single precision yields 1.0 = 2^0. (Figure 5-14. VGETEXPPS Functionality On Normal Input Values; bit-level diagram not reproduced.)

Operation
NormalizeExpTinySPFP(SRC[31:0])
{
    // Jbit is the hidden integral bit of a FP number. In case of denormal number it has the value of ZERO.
    Src.Jbit ← 0;
    Dst.exp ← 1;
    Dst.fraction ← SRC[22:0];
    WHILE (Src.Jbit = 0)
    {
        Src.Jbit ← Dst.fraction[22];        // Get the fraction MSB
        Dst.fraction ← Dst.fraction << 1;   // One bit shift left
        Dst.exp-- ;                         // Decrement the exponent
    }
    Dst.fraction ← 0;                       // Zero out fraction bits
    Dst.sign ← 1;                           // Return negative sign
    TMP[31:0] ← MXCSR.DAZ ? 0 : (Dst.sign << 31) OR (Dst.exp << 23) OR (Dst.fraction);
    Return (TMP[31:0]);
}

ConvertExpSPFP(SRC[31:0])
{
    Src.sign ← 0;                           // Zero out sign bit
    Src.exp ← SRC[30:23];
    Src.fraction ← SRC[22:0];
    // Check for NaN
    IF (SRC = NaN)
    {
        IF (SRC = SNAN) SET IE;
        Return QNAN(SRC);
    }
    // Check for +INF
    IF (SRC = +INF) Return (SRC);
    // Check if zero operand
    IF ((Src.exp = 0) AND ((Src.fraction = 0) OR (MXCSR.DAZ = 1)))
        { Return (-INF); }
    ELSE // Check if denormal operand (notice that MXCSR.DAZ = 0)
    {
        IF ((Src.exp = 0) AND (Src.fraction != 0))
        {
            TMP[31:0] ← NormalizeExpTinySPFP(SRC[31:0]);    // Get Normalized Exponent
            Set #DE;
        }
        ELSE // exponent value is correct
        {
            Dst.fraction ← 0;               // Zero out fraction bits
            TMP[31:0] ← (Src.sign << 31) OR (Src.exp << 23) OR (Src.fraction);
        }
        TMP ← SAR(TMP, 23);                 // Shift Arithmetic Right
        TMP ← TMP – 127;                    // Subtract Bias
        Return CvtI2SP(TMP);                // Convert INT to Single-Precision FP number
    }
}

VGETEXPPS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC *is memory*)
                THEN DEST[i+31:i] ← ConvertExpSPFP(SRC[31:0])
                ELSE DEST[i+31:i] ← ConvertExpSPFP(SRC[i+31:i])
            FI;
        ELSE
            IF *merging-masking*            ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE                        ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
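As a small numeric check of GETEXP(x) = floor(log2(|x|)), the sketch below applies the packed single-precision form through the _mm512_getexp_ps intrinsic listed next (sample inputs chosen arbitrarily; AVX-512F assumed):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    /* Expected: 8.0 -> 3.0, 3.0 -> 1.0, 2.0 -> 1.0, 0.75 -> -1.0,
       0.5 -> -1.0, 0.1 -> -4.0, 1.0 -> 0.0, 100.0 -> 6.0 */
    __m512 a = _mm512_setr_ps(8.0f, 3.0f, 2.0f, 0.75f, 0.5f, 0.1f, 1.0f, 100.0f,
                              8.0f, 3.0f, 2.0f, 0.75f, 0.5f, 0.1f, 1.0f, 100.0f);
    __m512 e = _mm512_getexp_ps(a);

    float in[16], out[16];
    _mm512_storeu_ps(in, a);
    _mm512_storeu_ps(out, e);
    for (int i = 0; i < 8; i++)
        printf("getexp(%g) = %g\n", in[i], out[i]);
    return 0;
}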
2C 5-293 INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VGETEXPPS __m512 _mm512_getexp_ps( __m512 a); VGETEXPPS __m512 _mm512_mask_getexp_ps(__m512 s, __mmask16 k, __m512 a); VGETEXPPS __m512 _mm512_maskz_getexp_ps( __mmask16 k, __m512 a); VGETEXPPS __m512 _mm512_getexp_round_ps( __m512 a, int sae); VGETEXPPS __m512 _mm512_mask_getexp_round_ps(__m512 s, __mmask16 k, __m512 a, int sae); VGETEXPPS __m512 _mm512_maskz_getexp_round_ps( __mmask16 k, __m512 a, int sae); VGETEXPPS __m256 _mm256_getexp_ps(__m256 a); VGETEXPPS __m256 _mm256_mask_getexp_ps(__m256 s, __mmask8 k, __m256 a); VGETEXPPS __m256 _mm256_maskz_getexp_ps( __mmask8 k, __m256 a); VGETEXPPS __m128 _mm_getexp_ps(__m128 a); VGETEXPPS __m128 _mm_mask_getexp_ps(__m128 s, __mmask8 k, __m128 a); VGETEXPPS __m128 _mm_maskz_getexp_ps( __mmask8 k, __m128 a); SIMD Floating-Point Exceptions Invalid, Denormal Other Exceptions See Exceptions Type E2. #UD 5-294 Vol. 2C If EVEX.vvvv != 1111B. VGETEXPPS—Convert Exponents of Packed SP FP Values to SP FP Values INSTRUCTION SET REFERENCE, V-Z VGETEXPSD—Convert Exponents of Scalar DP FP Values to DP FP Value Opcode/ Instruction Op/ En EVEX.NDS.LIG.66.0F38.W1 43 /r VGETEXPSD xmm1 {k1}{z}, xmm2, xmm3/m64{sae} T1S 64/32 bit Mode Support V/V CPUID Feature Flag AVX512F Description Convert the biased exponent (bits 62:52) of the low doubleprecision floating-point value in xmm3/m64 to a DP FP value representing unbiased integer exponent. Stores the result to the low 64-bit of xmm1 under the writemask k1 and merge with the other elements of xmm2. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 T1S ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA Description Extracts the biased exponent from the normalized DP FP representation of the low qword data element of the source operand (the third operand) as unbiased signed integer value, or convert the denormal representation of input data to unbiased negative integer values. The integer value of the unbiased exponent is converted to doubleprecision FP value and written to the destination operand (the first operand) as DP FP numbers. Bits (127:64) of the XMM register destination are copied from corresponding bits in the first source operand. The destination must be a XMM register, the source operand can be a XMM register or a float64 memory location. The low quadword element of the destination operand is conditionally updated with writemask k1. Each GETEXP operation converts the exponent value into a FP number (permitting input value in denormal representation). Special cases of input values are listed in Table 5-7. The formula is: GETEXP(x) = floor(log2(|x|)) Notation floor(x) stands for maximal integer not exceeding real number x. Operation // NormalizeExpTinyDPFP(SRC[63:0]) is defined in the Operation section of VGETEXPPD // ConvertExpDPFP(SRC[63:0]) is defined in the Operation section of VGETEXPPD VGETEXPSD (EVEX encoded version) IF k1[0] OR *no writemask* THEN DEST[63:0] ConvertExpDPFP(SRC2[63:0]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[63:0] remains unchanged* ELSE ; zeroing-masking DEST[63:0] 0 FI FI; DEST[127:64] SRC1[127:64] DEST[MAX_VL-1:128] 0 VGETEXPSD—Convert Exponents of Scalar DP FP Values to DP FP Value Vol. 
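A corresponding sketch for the scalar double-precision form, under the same caveats (illustrative values; AVX-512F assumed); the upper element of the result comes from the first operand, as the Operation section above specifies:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128d a = _mm_set_sd(0.0);     /* supplies bits 127:64 of the result */
    __m128d b = _mm_set_sd(40.0);    /* exponent is extracted from the low element */

    __m128d r = _mm_getexp_sd(a, b);
    printf("getexp(40.0) = %g\n", _mm_cvtsd_f64(r));   /* floor(log2(40)) = 5 */
    return 0;
}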
2C 5-295 INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VGETEXPSD __m128d _mm_getexp_sd( __m128d a, __m128d b); VGETEXPSD __m128d _mm_mask_getexp_sd(__m128d s, __mmask8 k, __m128d a, __m128d b); VGETEXPSD __m128d _mm_maskz_getexp_sd( __mmask8 k, __m128d a, __m128d b); VGETEXPSD __m128d _mm_getexp_round_sd( __m128d a, __m128d b, int sae); VGETEXPSD __m128d _mm_mask_getexp_round_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, int sae); VGETEXPSD __m128d _mm_maskz_getexp_round_sd( __mmask8 k, __m128d a, __m128d b, int sae); SIMD Floating-Point Exceptions Invalid, Denormal Other Exceptions See Exceptions Type E3. 5-296 Vol. 2C VGETEXPSD—Convert Exponents of Scalar DP FP Values to DP FP Value INSTRUCTION SET REFERENCE, V-Z VGETEXPSS—Convert Exponents of Scalar SP FP Values to SP FP Value Opcode/ Instruction Op/ En EVEX.NDS.LIG.66.0F38.W0 43 /r VGETEXPSS xmm1 {k1}{z}, xmm2, xmm3/m32{sae} T1S 64/32 bit Mode Support V/V CPUID Feature Flag AVX512F Description Convert the biased exponent (bits 30:23) of the low singleprecision floating-point value in xmm3/m32 to a SP FP value representing unbiased integer exponent. Stores the result to xmm1 under the writemask k1 and merge with the other elements of xmm2. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 T1S ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA Description Extracts the biased exponent from the normalized SP FP representation of the low doubleword data element of the source operand (the third operand) as unbiased signed integer value, or convert the denormal representation of input data to unbiased negative integer values. The integer value of the unbiased exponent is converted to singleprecision FP value and written to the destination operand (the first operand) as SP FP numbers. Bits (127:32) of the XMM register destination are copied from corresponding bits in the first source operand. The destination must be a XMM register, the source operand can be a XMM register or a float32 memory location. The the low doubleword element of the destination operand is conditionally updated with writemask k1. Each GETEXP operation converts the exponent value into a FP number (permitting input value in denormal representation). Special cases of input values are listed in Table 5-8. The formula is: GETEXP(x) = floor(log2(|x|)) Notation floor(x) stands for maximal integer not exceeding real number x. Software usage of VGETEXPxx and VGETMANTxx instructions generally involve a combination of GETEXP operation and GETMANT operation (see VGETMANTPD). Thus VGETEXPxx instruction do not require software to handle SIMD FP exceptions. Operation // NormalizeExpTinySPFP(SRC[31:0]) is defined in the Operation section of VGETEXPPS // ConvertExpSPFP(SRC[31:0]) is defined in the Operation section of VGETEXPPS VGETEXPSS (EVEX encoded version) IF k1[0] OR *no writemask* THEN DEST[31:0] ConvertExpDPFP(SRC2[31:0]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[31:0] remains unchanged* ELSE ; zeroing-masking DEST[31:0] 0 FI FI; ENDFOR DEST[127:32] SRC1[127:32] DEST[MAX_VL-1:128] 0 VGETEXPSS—Convert Exponents of Scalar SP FP Values to SP FP Value Vol. 
Intel C/C++ Compiler Intrinsic Equivalent
VGETEXPSS __m128 _mm_getexp_ss( __m128 a, __m128 b);
VGETEXPSS __m128 _mm_mask_getexp_ss(__m128 s, __mmask8 k, __m128 a, __m128 b);
VGETEXPSS __m128 _mm_maskz_getexp_ss( __mmask8 k, __m128 a, __m128 b);
VGETEXPSS __m128 _mm_getexp_round_ss( __m128 a, __m128 b, int sae);
VGETEXPSS __m128 _mm_mask_getexp_round_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, int sae);
VGETEXPSS __m128 _mm_maskz_getexp_round_ss( __mmask8 k, __m128 a, __m128 b, int sae);
SIMD Floating-Point Exceptions
Invalid, Denormal
Other Exceptions
See Exceptions Type E3.

VGETMANTPD—Extract Float64 Vector of Normalized Mantissas from Float64 Vector

EVEX.128.66.0F3A.W1 26 /r ib VGETMANTPD xmm1 {k1}{z}, xmm2/m128/m64bcst, imm8
  Op/En: FV. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512VL AVX512F.
  Description: Get normalized mantissa from float64 vector xmm2/m128/m64bcst and store the result in xmm1, using imm8 for sign control and mantissa interval normalization, under writemask.
EVEX.256.66.0F3A.W1 26 /r ib VGETMANTPD ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8
  Op/En: FV. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512VL AVX512F.
  Description: Get normalized mantissa from float64 vector ymm2/m256/m64bcst and store the result in ymm1, using imm8 for sign control and mantissa interval normalization, under writemask.
EVEX.512.66.0F3A.W1 26 /r ib VGETMANTPD zmm1 {k1}{z}, zmm2/m512/m64bcst{sae}, imm8
  Op/En: FV. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F.
  Description: Get normalized mantissa from float64 vector zmm2/m512/m64bcst and store the result in zmm1, using imm8 for sign control and mantissa interval normalization, under writemask.

Instruction Operand Encoding
Op/En: FVI. Operand 1: ModRM:reg (w). Operand 2: ModRM:r/m (r). Operand 3: Imm8. Operand 4: NA.

Description
Convert double-precision floating-point values in the source operand (the second operand) to DP FP values with the mantissa normalization and sign control specified by the imm8 byte; see Figure 5-15. The converted results are written to the destination operand (the first operand) using writemask k1. The normalized mantissa is specified by interv (imm8[1:0]) and the sign control (sc) is specified by bits 3:2 of the immediate byte.

The destination operand is a ZMM/YMM/XMM register updated under the writemask. The source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-bit memory location.

Figure 5-15. Imm8 Controls for VGETMANTPD/SD/PS/SS
imm8[7:4]: Must Be Zero.
imm8[3:2]: Sign Control (SC)
  imm8[3:2] = 00b : sign(SRC)
  imm8[3:2] = 01b : 0
  imm8[3] = 1b : qNaN_Indefinite if sign(SRC) != 0, regardless of imm8[2]
imm8[1:0]: Normalization Interval
  imm8[1:0] = 00b : interval is [1, 2)
  imm8[1:0] = 01b : interval is [1/2, 2)
  imm8[1:0] = 10b : interval is [1/2, 1)
  imm8[1:0] = 11b : interval is [3/4, 3/2)

For each input DP FP value x, the conversion operation is:
GetMant(x) = ±2^k |x.significand|
where: 1 <= |x.significand| < 2

Unbiased exponent k depends on the interval range defined by interv and whether the exponent of the source is even or odd. The sign of the final result is determined by sc and the source sign. If interv != 0 then k = -1, otherwise k = 0. The encoded value of imm8[1:0] and sign control are shown in Figure 5-15.
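For illustration, a sketch that pairs GETMANT with GETEXP so that x = GetMant(x) * 2^GETEXP(x) for the [1, 2) interval with source sign, i.e., imm8 = 0 in the encoding above (values arbitrary; AVX-512F assumed; the _MM_MANT_NORM_1_2 and _MM_MANT_SIGN_src enumeration names come from the compiler's immintrin.h, not from this manual):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m512d x = _mm512_setr_pd(-24.0, 0.375, 3.0, 96.0, 1.5, -0.25, 7.0, 1.0e6);

    /* interv = [1,2), sign control = sign(SRC). */
    __m512d m = _mm512_getmant_pd(x, _MM_MANT_NORM_1_2, _MM_MANT_SIGN_src);
    __m512d e = _mm512_getexp_pd(x);

    double xv[8], mv[8], ev[8];
    _mm512_storeu_pd(xv, x);
    _mm512_storeu_pd(mv, m);
    _mm512_storeu_pd(ev, e);
    for (int i = 0; i < 8; i++)
        printf("%g = %g * 2^%g\n", xv[i], mv[i], ev[i]);   /* e.g. -24 = -1.5 * 2^4 */
    return 0;
}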
Each converted DP FP result is encoded according to the sign control, the unbiased exponent k (adding bias) and a mantissa normalized to the range specified by interv. The GetMant() function follows Table 5-9 when dealing with floating-point special numbers. This instruction is writemasked, so only those elements with the corresponding bit set in vector mask register k1 are computed and stored into the destination. Elements in zmm1 with the corresponding bit clear in k1 retain their previous values. Note: EVEX.vvvv is reserved and must be 1111b; otherwise instructions will #UD. Table 5-9. GetMant() Special Float Values Behavior 5-300 Vol. 2C Input Result Exceptions / Comments NaN QNaN(SRC) Ignore interv If (SRC = SNaN) then #IE +∞ 1.0 Ignore interv +0 1.0 Ignore interv -0 IF (SC[0]) THEN +1.0 ELSE -1.0 Ignore interv -∞ IF (SC[1]) THEN {QNaN_Indefinite} ELSE { IF (SC[0]) THEN +1.0 ELSE -1.0 Ignore interv If (SC[1]) then #IE negative SC[1] ? QNaN_Indefinite : Getmant(SRC) If (SC[1]) then #IE VGETMANTPD—Extract Float64 Vector of Normalized Mantissas from Float64 Vector INSTRUCTION SET REFERENCE, V-Z Operation GetNormalizeMantissaDP(SRC[63:0], SignCtrl[1:0], Interv[1:0]) { // Extracting the SRC sign, exponent and mantissa fields Dst.sign SignCtrl[0] ? 0 : Src[63]; // Get sign bit Dst.exp SRC[62:52]; ; Get original exponent value Dst.fraction SRC[51:0];; Get original fraction value ZeroOperand (Dst.exp = 0) AND (Dst.fraction = 0); DenormOperand (Dst.exp = 0h) AND (Dst.fraction != 0); InfiniteOperand (Dst.exp = 07FFh) AND (Dst.fraction = 0); NaNOperand (Dst.exp = 07FFh) AND (Dst.fraction != 0); // Check for NAN operand IF (NaNOperand) { IF (SRC = SNaN) {Set #IE;} Return QNAN(SRC); } // Check for Zero and Infinite operands IF ((ZeroOperand) OR (InfiniteOperand) { Dst.exp 03FFh; // Override exponent with BIAS Return ((Dst.sign<<63) | (Dst.exp<<52) | (Dst.fraction)); } // Check for negative operand (including -0.0) IF ((Src[63] = 1) AND SignCtrl[1]) { Set #IE; Return QNaN_Indefinite; } // Checking for denormal operands IF (DenormOperand) { IF (MXCSR.DAZ=1) Dst.fraction 0;// Zero out fraction ELSE { // Jbit is the hidden integral bit. Zero in case of denormal operand. Src.Jbit 0; // Zero Src Jbit Dst.exp 03FFh; // Override exponent with BIAS WHILE (Src.Jbit = 0) { // normalize mantissa Src.Jbit Dst.fraction[51]; // Get the fraction MSB Dst.fraction (Dst.fraction << 1); // Start normalizing the mantissa Dst.exp-- ; // Adjust the exponent } SET #DE; // Set DE bit } } // At this point, Dst.fraction is normalized. // Checking for exponent response Unbiased.exp Dst.exp – 03FFh; // subtract the bias from exponent IsOddExp Unbiased.exp[0]; // recognized unbiased ODD exponent SignalingBit Dst.fraction[51]; CASE (interv[1:0]) 00: Dst.exp 03FFh; // This is the bias 01: Dst.exp (IsOddExp) ? 03FEh : 03FFh; // either bias-1, or bias 10: Dst.exp 03FEh; // bias-1 11: Dst.exp (SignalingBit) ? 03FEh : 03FFh; // either bias-1, or bias ESCA // At this point Dst.exp has the correct result. Form the final destination DEST[63:0] (Dst.sign << 63) OR (Dst.exp << 52) OR (Dst.fraction); Return (DEST); VGETMANTPD—Extract Float64 Vector of Normalized Mantissas from Float64 Vector Vol. 
2C 5-301 INSTRUCTION SET REFERENCE, V-Z } SignCtrl[1:0] IMM8[3:2]; Interv[1:0] IMM8[1:0]; VGETMANTPD (EVEX encoded versions) (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN IF (EVEX.b = 1) AND (SRC *is memory*) THEN DEST[i+63:i] GetNormalizedMantissaDP(SRC[63:0], sc, interv) ELSE DEST[i+63:i] GetNormalizedMantissaDP(SRC[i+63:i], sc, interv) FI; ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 Intel C/C++ Compiler Intrinsic Equivalent VGETMANTPD __m512d _mm512_getmant_pd( __m512d a, enum intv, enum sgn); VGETMANTPD __m512d _mm512_mask_getmant_pd(__m512d s, __mmask8 k, __m512d a, enum intv, enum sgn); VGETMANTPD __m512d _mm512_maskz_getmant_pd( __mmask8 k, __m512d a, enum intv, enum sgn); VGETMANTPD __m512d _mm512_getmant_round_pd( __m512d a, enum intv, enum sgn, int r); VGETMANTPD __m512d _mm512_mask_getmant_round_pd(__m512d s, __mmask8 k, __m512d a, enum intv, enum sgn, int r); VGETMANTPD __m512d _mm512_maskz_getmant_round_pd( __mmask8 k, __m512d a, enum intv, enum sgn, int r); VGETMANTPD __m256d _mm256_getmant_pd( __m256d a, enum intv, enum sgn); VGETMANTPD __m256d _mm256_mask_getmant_pd(__m256d s, __mmask8 k, __m256d a, enum intv, enum sgn); VGETMANTPD __m256d _mm256_maskz_getmant_pd( __mmask8 k, __m256d a, enum intv, enum sgn); VGETMANTPD __m128d _mm_getmant_pd( __m128d a, enum intv, enum sgn); VGETMANTPD __m128d _mm_mask_getmant_pd(__m128d s, __mmask8 k, __m128d a, enum intv, enum sgn); VGETMANTPD __m128d _mm_maskz_getmant_pd( __mmask8 k, __m128d a, enum intv, enum sgn); SIMD Floating-Point Exceptions Denormal, Invalid Other Exceptions See Exceptions Type E2. #UD 5-302 Vol. 2C If EVEX.vvvv != 1111B. VGETMANTPD—Extract Float64 Vector of Normalized Mantissas from Float64 Vector INSTRUCTION SET REFERENCE, V-Z VGETMANTPS—Extract Float32 Vector of Normalized Mantissas from Float32 Vector Opcode/ Instruction Op/ En FV 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512F EVEX.128.66.0F3A.W0 26 /r ib VGETMANTPS xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8 EVEX.256.66.0F3A.W0 26 /r ib VGETMANTPS ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8 FV V/V AVX512VL AVX512F EVEX.512.66.0F3A.W0 26 /r ib VGETMANTPS zmm1 {k1}{z}, zmm2/m512/m32bcst{sae}, imm8 FV V/V AVX512F Description Get normalized mantissa from float32 vector xmm2/m128/m32bcst and store the result in xmm1, using imm8 for sign control and mantissa interval normalization, under writemask. Get normalized mantissa from float32 vector ymm2/m256/m32bcst and store the result in ymm1, using imm8 for sign control and mantissa interval normalization, under writemask. Get normalized mantissa from float32 vector zmm2/m512/m32bcst and store the result in zmm1, using imm8 for sign control and mantissa interval normalization, under writemask. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 FVI ModRM:reg (w) ModRM:r/m (r) Imm8 NA Description Convert single-precision floating values in the source operand (the second operand) to SP FP values with the mantissa normalization and sign control specified by the imm8 byte, see Figure 5-15. The converted results are written to the destination operand (the first operand) using writemask k1. The normalized mantissa is specified by interv (imm8[1:0]) and the sign control (sc) is specified by bits 3:2 of the immediate byte. The destination operand is a ZMM/YMM/XMM register updated under the writemask. 
The source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32bit memory location. For each input SP FP value x, The conversion operation is: GetMant(x) = ±2k|x.significand| where: 1 <= |x.significand| < 2 Unbiased exponent k depends on the interval range defined by interv and whether the exponent of the source is even or odd. The sign of the final result is determined by sc and the source sign. if interv != 0 then k = -1, otherwise K = 0. The encoded value of imm8[1:0] and sign control are shown in Figure 5-15. Each converted SP FP result is encoded according to the sign control, the unbiased exponent k (adding bias) and a mantissa normalized to the range specified by interv. The GetMant() function follows Table 5-9 when dealing with floating-point special numbers. This instruction is writemasked, so only those elements with the corresponding bit set in vector mask register k1 are computed and stored into the destination. Elements in zmm1 with the corresponding bit clear in k1 retain their previous values. Note: EVEX.vvvv is reserved and must be 1111b, VEX.L must be 0; otherwise instructions will #UD. VGETMANTPS—Extract Float32 Vector of Normalized Mantissas from Float32 Vector Vol. 2C 5-303 INSTRUCTION SET REFERENCE, V-Z Operation GetNormalizeMantissaSP(SRC[31:0] , SignCtrl[1:0], Interv[1:0]) { // Extracting the SRC sign, exponent and mantissa fields Dst.sign SignCtrl[0] ? 0 : Src[31]; // Get sign bit Dst.exp SRC[30:23]; ; Get original exponent value Dst.fraction SRC[22:0];; Get original fraction value ZeroOperand (Dst.exp = 0) AND (Dst.fraction = 0); DenormOperand (Dst.exp = 0h) AND (Dst.fraction != 0); InfiniteOperand (Dst.exp = 0FFh) AND (Dst.fraction = 0); NaNOperand (Dst.exp = 0FFh) AND (Dst.fraction != 0); // Check for NAN operand IF (NaNOperand) { IF (SRC = SNaN) {Set #IE;} Return QNAN(SRC); } // Check for Zero and Infinite operands IF ((ZeroOperand) OR (InfiniteOperand) { Dst.exp 07Fh; // Override exponent with BIAS Return ((Dst.sign<<31) | (Dst.exp<<23) | (Dst.fraction)); } // Check for negative operand (including -0.0) IF ((Src[31] = 1) AND SignCtrl[1]) { Set #IE; Return QNaN_Indefinite; } // Checking for denormal operands IF (DenormOperand) { IF (MXCSR.DAZ=1) Dst.fraction 0;// Zero out fraction ELSE { // Jbit is the hidden integral bit. Zero in case of denormal operand. Src.Jbit 0; // Zero Src Jbit Dst.exp 07Fh; // Override exponent with BIAS WHILE (Src.Jbit = 0) { // normalize mantissa Src.Jbit Dst.fraction[22]; // Get the fraction MSB Dst.fraction (Dst.fraction << 1); // Start normalizing the mantissa Dst.exp-- ; // Adjust the exponent } SET #DE; // Set DE bit } } // At this point, Dst.fraction is normalized. // Checking for exponent response Unbiased.exp Dst.exp – 07Fh; // subtract the bias from exponent IsOddExp Unbiased.exp[0]; // recognized unbiased ODD exponent SignalingBit Dst.fraction[22]; CASE (interv[1:0]) 00: Dst.exp 07Fh; // This is the bias 01: Dst.exp (IsOddExp) ? 07Eh : 07Fh; // either bias-1, or bias 10: Dst.exp 07Eh; // bias-1 11: Dst.exp (SignalingBit) ? 07Eh : 07Fh; // either bias-1, or bias ESCA // Form the final destination DEST[31:0] (Dst.sign << 31) OR (Dst.exp << 23) OR (Dst.fraction); 5-304 Vol. 
2C VGETMANTPS—Extract Float32 Vector of Normalized Mantissas from Float32 Vector INSTRUCTION SET REFERENCE, V-Z Return (DEST); } SignCtrl[1:0] IMM8[3:2]; Interv[1:0] IMM8[1:0]; VGETMANTPS (EVEX encoded versions) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN IF (EVEX.b = 1) AND (SRC *is memory*) THEN DEST[i+31:i] GetNormalizedMantissaSP(SRC[31:0], sc, interv) ELSE DEST[i+31:i] GetNormalizedMantissaSP(SRC[i+31:i], sc, interv) FI; ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 Intel C/C++ Compiler Intrinsic Equivalent VGETMANTPS __m512 _mm512_getmant_ps( __m512 a, enum intv, enum sgn); VGETMANTPS __m512 _mm512_mask_getmant_ps(__m512 s, __mmask16 k, __m512 a, enum intv, enum sgn; VGETMANTPS __m512 _mm512_maskz_getmant_ps(__mmask16 k, __m512 a, enum intv, enum sgn); VGETMANTPS __m512 _mm512_getmant_round_ps( __m512 a, enum intv, enum sgn, int r); VGETMANTPS __m512 _mm512_mask_getmant_round_ps(__m512 s, __mmask16 k, __m512 a, enum intv, enum sgn, int r); VGETMANTPS __m512 _mm512_maskz_getmant_round_ps(__mmask16 k, __m512 a, enum intv, enum sgn, int r); VGETMANTPS __m256 _mm256_getmant_ps( __m256 a, enum intv, enum sgn); VGETMANTPS __m256 _mm256_mask_getmant_ps(__m256 s, __mmask8 k, __m256 a, enum intv, enum sgn); VGETMANTPS __m256 _mm256_maskz_getmant_ps( __mmask8 k, __m256 a, enum intv, enum sgn); VGETMANTPS __m128 _mm_getmant_ps( __m128 a, enum intv, enum sgn); VGETMANTPS __m128 _mm_mask_getmant_ps(__m128 s, __mmask8 k, __m128 a, enum intv, enum sgn); VGETMANTPS __m128 _mm_maskz_getmant_ps( __mmask8 k, __m128 a, enum intv, enum sgn); SIMD Floating-Point Exceptions Denormal, Invalid Other Exceptions See Exceptions Type E2. #UD If EVEX.vvvv != 1111B. VGETMANTPS—Extract Float32 Vector of Normalized Mantissas from Float32 Vector Vol. 2C 5-305 INSTRUCTION SET REFERENCE, V-Z VGETMANTSD—Extract Float64 of Normalized Mantissas from Float64 Scalar Opcode/ Instruction Op/ En EVEX.NDS.LIG.66.0F3A.W1 27 /r ib VGETMANTSD xmm1 {k1}{z}, xmm2, xmm3/m64{sae}, imm8 T1S 64/32 bit Mode Support V/V CPUID Feature Flag AVX512F Description Extract the normalized mantissa of the low float64 element in xmm3/m64 using imm8 for sign control and mantissa interval normalization. Store the mantissa to xmm1 under the writemask k1 and merge with the other elements of xmm2. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 T1S ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA Description Convert the double-precision floating values in the low quadword element of the second source operand (the third operand) to DP FP value with the mantissa normalization and sign control specified by the imm8 byte, see Figure 5-15. The converted result is written to the low quadword element of the destination operand (the first operand) using writemask k1. Bits (127:64) of the XMM register destination are copied from corresponding bits in the first source operand. The normalized mantissa is specified by interv (imm8[1:0]) and the sign control (sc) is specified by bits 3:2 of the immediate byte. The conversion operation is: GetMant(x) = ±2k|x.significand| where: 1 <= |x.significand| < 2 Unbiased exponent k depends on the interval range defined by interv and whether the exponent of the source is even or odd. The sign of the final result is determined by sc and the source sign. If interv != 0 then k = -1, otherwise K = 0. 
The encoded value of imm8[1:0] and sign control are shown in Figure 5-15. The converted DP FP result is encoded according to the sign control, the unbiased exponent k (adding bias) and a mantissa normalized to the range specified by interv. The GetMant() function follows Table 5-9 when dealing with floating-point special numbers. This instruction is writemasked, so only those elements with the corresponding bit set in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with the corresponding bit clear in k1 retain their previous values. 5-306 Vol. 2C VGETMANTSD—Extract Float64 of Normalized Mantissas from Float64 Scalar INSTRUCTION SET REFERENCE, V-Z Operation // GetNormalizeMantissaDP(SRC[63:0], SignCtrl[1:0], Interv[1:0]) is defined in the operation section of VGETMANTPD SignCtrl[1:0] IMM8[3:2]; Interv[1:0] IMM8[1:0]; VGETMANTSD (EVEX encoded version) IF k1[0] OR *no writemask* THEN DEST[63:0] GetNormalizedMantissaDP(SRC2[63:0], sc, interv) ELSE IF *merging-masking* ; merging-masking THEN *DEST[63:0] remains unchanged* ELSE ; zeroing-masking DEST[63:0] 0 FI FI; DEST[127:64] SRC1[127:64] DEST[MAX_VL-1:128] 0 Intel C/C++ Compiler Intrinsic Equivalent VGETMANTSD __m128d _mm_getmant_sd( __m128d a, __m128 b, enum intv, enum sgn); VGETMANTSD __m128d _mm_mask_getmant_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, enum intv, enum sgn); VGETMANTSD __m128d _mm_maskz_getmant_sd( __mmask8 k, __m128 a, __m128d b, enum intv, enum sgn); VGETMANTSD __m128d _mm_getmant_round_sd( __m128d a, __m128 b, enum intv, enum sgn, int r); VGETMANTSD __m128d _mm_mask_getmant_round_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, enum intv, enum sgn, int r); VGETMANTSD __m128d _mm_maskz_getmant_round_sd( __mmask8 k, __m128d a, __m128d b, enum intv, enum sgn, int r); SIMD Floating-Point Exceptions Denormal, Invalid Other Exceptions See Exceptions Type E3. VGETMANTSD—Extract Float64 of Normalized Mantissas from Float64 Scalar Vol. 2C 5-307 INSTRUCTION SET REFERENCE, V-Z VGETMANTSS—Extract Float32 Vector of Normalized Mantissa from Float32 Vector Opcode/ Instruction Op/ En EVEX.NDS.LIG.66.0F3A.W0 27 /r ib VGETMANTSS xmm1 {k1}{z}, xmm2, xmm3/m32{sae}, imm8 T1S 64/32 bit Mode Support V/V CPUID Feature Flag AVX512F Description Extract the normalized mantissa from the low float32 element of xmm3/m32 using imm8 for sign control and mantissa interval normalization, store the mantissa to xmm1 under the writemask k1 and merge with the other elements of xmm2. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 T1S ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA Description Convert the single-precision floating values in the low doubleword element of the second source operand (the third operand) to SP FP value with the mantissa normalization and sign control specified by the imm8 byte, see Figure 5-15. The converted result is written to the low doubleword element of the destination operand (the first operand) using writemask k1. Bits (127:32) of the XMM register destination are copied from corresponding bits in the first source operand. The normalized mantissa is specified by interv (imm8[1:0]) and the sign control (sc) is specified by bits 3:2 of the immediate byte. The conversion operation is: GetMant(x) = ±2k|x.significand| where: 1 <= |x.significand| < 2 Unbiased exponent k depends on the interval range defined by interv and whether the exponent of the source is even or odd. The sign of the final result is determined by sc and the source sign. 
if interv != 0 then k = -1, otherwise K = 0. The encoded value of imm8[1:0] and sign control are shown in Figure 5-15. The converted SP FP result is encoded according to the sign control, the unbiased exponent k (adding bias) and a mantissa normalized to the range specified by interv. The GetMant() function follows Table 5-9 when dealing with floating-point special numbers. This instruction is writemasked, so only those elements with the corresponding bit set in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with the corresponding bit clear in k1 retain their previous values. 5-308 Vol. 2C VGETMANTSS—Extract Float32 Vector of Normalized Mantissa from Float32 Vector INSTRUCTION SET REFERENCE, V-Z Operation // GetNormalizeMantissaSP(SRC[31:0], SignCtrl[1:0], Interv[1:0]) is defined in the operation section of VGETMANTPD SignCtrl[1:0] IMM8[3:2]; Interv[1:0] IMM8[1:0]; VGETMANTSS (EVEX encoded version) IF k1[0] OR *no writemask* THEN DEST[31:0] GetNormalizedMantissaSP(SRC2[31:0], sc, interv) ELSE IF *merging-masking* ; merging-masking THEN *DEST[31:0] remains unchanged* ELSE ; zeroing-masking DEST[31:0] 0 FI FI; DEST[127:32] SRC1[127:64] DEST[MAX_VL-1:128] 0 Intel C/C++ Compiler Intrinsic Equivalent VGETMANTSS __m128 _mm_getmant_ss( __m128 a, __m128 b, enum intv, enum sgn); VGETMANTSS __m128 _mm_mask_getmant_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, enum intv, enum sgn); VGETMANTSS __m128 _mm_maskz_getmant_ss( __mmask8 k, __m128 a, __m128 b, enum intv, enum sgn); VGETMANTSS __m128 _mm_getmant_round_ss( __m128 a, __m128 b, enum intv, enum sgn, int r); VGETMANTSS __m128 _mm_mask_getmant_round_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, enum intv, enum sgn, int r); VGETMANTSS __m128 _mm_maskz_getmant_round_ss( __mmask8 k, __m128 a, __m128 b, enum intv, enum sgn, int r); SIMD Floating-Point Exceptions Denormal, Invalid Other Exceptions See Exceptions Type E3. VGETMANTSS—Extract Float32 Vector of Normalized Mantissa from Float32 Vector Vol. 2C 5-309 INSTRUCTION SET REFERENCE, V-Z VINSERTF128/VINSERTF32x4/VINSERTF64x2/VINSERTF32x8/VINSERTF64x4—Insert Packed Floating-Point Values Opcode/ Instruction Op / En RVMI 64/32 bit Mode Support V/V CPUID Feature Flag AVX VEX.NDS.256.66.0F3A.W0 18 /r ib VINSERTF128 ymm1, ymm2, xmm3/m128, imm8 EVEX.NDS.256.66.0F3A.W0 18 /r ib VINSERTF32X4 ymm1 {k1}{z}, ymm2, xmm3/m128, imm8 EVEX.NDS.512.66.0F3A.W0 18 /r ib VINSERTF32X4 zmm1 {k1}{z}, zmm2, xmm3/m128, imm8 EVEX.NDS.256.66.0F3A.W1 18 /r ib VINSERTF64X2 ymm1 {k1}{z}, ymm2, xmm3/m128, imm8 EVEX.NDS.512.66.0F3A.W1 18 /r ib VINSERTF64X2 zmm1 {k1}{z}, zmm2, xmm3/m128, imm8 EVEX.NDS.512.66.0F3A.W0 1A /r ib VINSERTF32X8 zmm1 {k1}{z}, zmm2, ymm3/m256, imm8 EVEX.NDS.512.66.0F3A.W1 1A /r ib VINSERTF64X4 zmm1 {k1}{z}, zmm2, ymm3/m256, imm8 T4 V/V AVX512VL AVX512F T4 V/V AVX512F T2 V/V AVX512VL AVX512DQ T2 V/V AVX512DQ T8 V/V AVX512DQ T4 V/V AVX512F Description Insert 128 bits of packed floating-point values from xmm3/m128 and the remaining values from ymm2 into ymm1. Insert 128 bits of packed single-precision floatingpoint values from xmm3/m128 and the remaining values from ymm2 into ymm1 under writemask k1. Insert 128 bits of packed single-precision floatingpoint values from xmm3/m128 and the remaining values from zmm2 into zmm1 under writemask k1. Insert 128 bits of packed double-precision floatingpoint values from xmm3/m128 and the remaining values from ymm2 into ymm1 under writemask k1. 
Insert 128 bits of packed double-precision floatingpoint values from xmm3/m128 and the remaining values from zmm2 into zmm1 under writemask k1. Insert 256 bits of packed single-precision floatingpoint values from ymm3/m256 and the remaining values from zmm2 into zmm1 under writemask k1. Insert 256 bits of packed double-precision floatingpoint values from ymm3/m256 and the remaining values from zmm2 into zmm1 under writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RVMI ModRM:reg (w) VEX.vvvv ModRM:r/m (r) Imm8 T2, T4, T8 ModRM:reg (w) EVEX.vvvv ModRM:r/m (r) Imm8 Description VINSERTF128/VINSERTF32x4 and VINSERTF64x2 insert 128-bits of packed floating-point values from the second source operand (the third operand) into the destination operand (the first operand) at an 128-bit granularity offset multiplied by imm8[0] (256-bit) or imm8[1:0]. The remaining portions of the destination operand are copied from the corresponding fields of the first source operand (the second operand). The second source operand can be either an XMM register or a 128-bit memory location. The destination and first source operands are vector registers. VINSERTF32x4: The destination operand is a ZMM/YMM register and updated at 32-bit granularity according to the writemask. The high 6/7 bits of the immediate are ignored. VINSERTF64x2: The destination operand is a ZMM/YMM register and updated at 64-bit granularity according to the writemask. The high 6/7 bits of the immediate are ignored. VINSERTF32x8 and VINSERTF64x4 inserts 256-bits of packed floating-point values from the second source operand (the third operand) into the destination operand (the first operand) at a 256-bit granular offset multiplied by imm8[0]. The remaining portions of the destination are copied from the corresponding fields of the first source operand (the second operand). The second source operand can be either an YMM register or a 256-bit memory location. The high 7 bits of the immediate are ignored. The destination operand is a ZMM register and updated at 32/64-bit granularity according to the writemask. 5-310 Vol. 2C VINSERTF128/VINSERTF32x4/VINSERTF64x2/VINSERTF32x8/VINSERTF64x4—Insert Packed Floating-Point Values INSTRUCTION SET REFERENCE, V-Z Operation VINSERTF32x4 (EVEX encoded versions) (KL, VL) = (8, 256), (16, 512) TEMP_DEST[VL-1:0] SRC1[VL-1:0] IF VL = 256 CASE (imm8[0]) OF 0: TMP_DEST[127:0] SRC2[127:0] 1: TMP_DEST[255:128] SRC2[127:0] ESAC. FI; IF VL = 512 CASE (imm8[1:0]) OF 00: TMP_DEST[127:0] SRC2[127:0] 01: TMP_DEST[255:128] SRC2[127:0] 10: TMP_DEST[383:256] SRC2[127:0] 11: TMP_DEST[511:384] SRC2[127:0] ESAC. FI; FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] TMP_DEST[i+31:i] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VINSERTF64x2 (EVEX encoded versions) (KL, VL) = (4, 256), (8, 512) TEMP_DEST[VL-1:0] SRC1[VL-1:0] IF VL = 256 CASE (imm8[0]) OF 0: TMP_DEST[127:0] SRC2[127:0] 1: TMP_DEST[255:128] SRC2[127:0] ESAC. FI; IF VL = 512 CASE (imm8[1:0]) OF 00: TMP_DEST[127:0] SRC2[127:0] 01: TMP_DEST[255:128] SRC2[127:0] 10: TMP_DEST[383:256] SRC2[127:0] 11: TMP_DEST[511:384] SRC2[127:0] ESAC. FI; FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] TMP_DEST[i+63:i] ELSE VINSERTF128/VINSERTF32x4/VINSERTF64x2/VINSERTF32x8/VINSERTF64x4—Insert Packed Floating-Point Values Vol. 
2C 5-311 INSTRUCTION SET REFERENCE, V-Z IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VINSERTF32x8 (EVEX.U1.512 encoded version) TEMP_DEST[VL-1:0] SRC1[VL-1:0] CASE (imm8[0]) OF 0: TMP_DEST[255:0] SRC2[255:0] 1: TMP_DEST[511:256] SRC2[255:0] ESAC. FOR j 0 TO 15 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] TMP_DEST[i+31:i] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VINSERTF64x4 (EVEX.512 encoded version) VL = 512 TEMP_DEST[VL-1:0] SRC1[VL-1:0] CASE (imm8[0]) OF 0: TMP_DEST[255:0] SRC2[255:0] 1: TMP_DEST[511:256] SRC2[255:0] ESAC. FOR j 0 TO 7 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] TMP_DEST[i+63:i] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-312 Vol. 2C VINSERTF128/VINSERTF32x4/VINSERTF64x2/VINSERTF32x8/VINSERTF64x4—Insert Packed Floating-Point Values INSTRUCTION SET REFERENCE, V-Z VINSERTF128 (VEX encoded version) TEMP[255:0] SRC1[255:0] CASE (imm8[0]) OF 0: TEMP[127:0] SRC2[127:0] 1: TEMP[255:128] SRC2[127:0] ESAC DEST TEMP Intel C/C++ Compiler Intrinsic Equivalent VINSERTF32x4 __m512 _mm512_insertf32x4( __m512 a, __m128 b, int imm); VINSERTF32x4 __m512 _mm512_mask_insertf32x4(__m512 s, __mmask16 k, __m512 a, __m128 b, int imm); VINSERTF32x4 __m512 _mm512_maskz_insertf32x4( __mmask16 k, __m512 a, __m128 b, int imm); VINSERTF32x4 __m256 _mm256_insertf32x4( __m256 a, __m128 b, int imm); VINSERTF32x4 __m256 _mm256_mask_insertf32x4(__m256 s, __mmask8 k, __m256 a, __m128 b, int imm); VINSERTF32x4 __m256 _mm256_maskz_insertf32x4( __mmask8 k, __m256 a, __m128 b, int imm); VINSERTF32x8 __m512 _mm512_insertf32x8( __m512 a, __m256 b, int imm); VINSERTF32x8 __m512 _mm512_mask_insertf32x8(__m512 s, __mmask16 k, __m512 a, __m256 b, int imm); VINSERTF32x8 __m512 _mm512_maskz_insertf32x8( __mmask16 k, __m512 a, __m256 b, int imm); VINSERTF64x2 __m512d _mm512_insertf64x2( __m512d a, __m128d b, int imm); VINSERTF64x2 __m512d _mm512_mask_insertf64x2(__m512d s, __mmask8 k, __m512d a, __m128d b, int imm); VINSERTF64x2 __m512d _mm512_maskz_insertf64x2( __mmask8 k, __m512d a, __m128d b, int imm); VINSERTF64x2 __m256d _mm256_insertf64x2( __m256d a, __m128d b, int imm); VINSERTF64x2 __m256d _mm256_mask_insertf64x2(__m256d s, __mmask8 k, __m256d a, __m128d b, int imm); VINSERTF64x2 __m256d _mm256_maskz_insertf64x2( __mmask8 k, __m256d a, __m128d b, int imm); VINSERTF64x4 __m512d _mm512_insertf64x4( __m512d a, __m256d b, int imm); VINSERTF64x4 __m512d _mm512_mask_insertf64x4(__m512d s, __mmask8 k, __m512d a, __m256d b, int imm); VINSERTF64x4 __m512d _mm512_maskz_insertf64x4( __mmask8 k, __m512d a, __m256d b, int imm); VINSERTF128 __m256 _mm256_insertf128_ps (__m256 a, __m128 b, int offset); VINSERTF128 __m256d _mm256_insertf128_pd (__m256d a, __m128d b, int offset); VINSERTF128 __m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int offset); SIMD Floating-Point Exceptions None Other Exceptions VEX-encoded instruction, see Exceptions Type 6; additionally #UD If VEX.L = 0. EVEX-encoded instruction, see Exceptions Type E6NF. VINSERTF128/VINSERTF32x4/VINSERTF64x2/VINSERTF32x8/VINSERTF64x4—Insert Packed Floating-Point Values Vol. 
2C 5-313 INSTRUCTION SET REFERENCE, V-Z VINSERTI128/VINSERTI32x4/VINSERTI64x2/VINSERTI32x8/VINSERTI64x4—Insert Packed Integer Values Opcode/ Instruction Op / En CPUID Feature Flag AVX2 Description RVMI 64/32 bit Mode Support V/V VEX.NDS.256.66.0F3A.W0 38 /r ib VINSERTI128 ymm1, ymm2, xmm3/m128, imm8 EVEX.NDS.256.66.0F3A.W0 38 /r ib VINSERTI32X4 ymm1 {k1}{z}, ymm2, xmm3/m128, imm8 EVEX.NDS.512.66.0F3A.W0 38 /r ib VINSERTI32X4 zmm1 {k1}{z}, zmm2, xmm3/m128, imm8 EVEX.NDS.256.66.0F3A.W1 38 /r ib VINSERTI64X2 ymm1 {k1}{z}, ymm2, xmm3/m128, imm8 EVEX.NDS.512.66.0F3A.W1 38 /r ib VINSERTI64X2 zmm1 {k1}{z}, zmm2, xmm3/m128, imm8 EVEX.NDS.512.66.0F3A.W0 3A /r ib VINSERTI32X8 zmm1 {k1}{z}, zmm2, ymm3/m256, imm8 EVEX.NDS.512.66.0F3A.W1 3A /r ib VINSERTI64X4 zmm1 {k1}{z}, zmm2, ymm3/m256, imm8 T4 V/V AVX512VL AVX512F T4 V/V AVX512F T2 V/V AVX512VL AVX512DQ T2 V/V AVX512DQ T8 V/V AVX512DQ T4 V/V AVX512F Insert 128 bits of packed doubleword integer values from xmm3/m128 and the remaining values from ymm2 into ymm1 under writemask k1. Insert 128 bits of packed doubleword integer values from xmm3/m128 and the remaining values from zmm2 into zmm1 under writemask k1. Insert 128 bits of packed quadword integer values from xmm3/m128 and the remaining values from ymm2 into ymm1 under writemask k1. Insert 128 bits of packed quadword integer values from xmm3/m128 and the remaining values from zmm2 into zmm1 under writemask k1. Insert 256 bits of packed doubleword integer values from ymm3/m256 and the remaining values from zmm2 into zmm1 under writemask k1. Insert 256 bits of packed quadword integer values from ymm3/m256 and the remaining values from zmm2 into zmm1 under writemask k1. Insert 128 bits of integer data from xmm3/m128 and the remaining values from ymm2 into ymm1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RVMI ModRM:reg (w) VEX.vvvv ModRM:r/m (r) Imm8 T2, T4, T8 ModRM:reg (w) EVEX.vvvv ModRM:r/m (r) Imm8 Description VINSERTI32x4 and VINSERTI64x2 inserts 128-bits of packed integer values from the second source operand (the third operand) into the destination operand (the first operand) at an 128-bit granular offset multiplied by imm8[0] (256-bit) or imm8[1:0]. The remaining portions of the destination are copied from the corresponding fields of the first source operand (the second operand). The second source operand can be either an XMM register or a 128-bit memory location. The high 6/7bits of the immediate are ignored. The destination operand is a ZMM/YMM register and updated at 32 and 64-bit granularity according to the writemask. VINSERTI32x8 and VINSERTI64x4 inserts 256-bits of packed integer values from the second source operand (the third operand) into the destination operand (the first operand) at a 256-bit granular offset multiplied by imm8[0]. The remaining portions of the destination are copied from the corresponding fields of the first source operand (the second operand). The second source operand can be either an YMM register or a 256-bit memory location. The upper bits of the immediate are ignored. The destination operand is a ZMM register and updated at 32 and 64-bit granularity according to the writemask. VINSERTI128 inserts 128-bits of packed integer data from the second source operand (the third operand) into the destination operand (the first operand) at a 128-bit granular offset multiplied by imm8[0]. The remaining portions of the destination are copied from the corresponding fields of the first source operand (the second operand). 
The second source operand can be either an XMM register or a 128-bit memory location. The high 7 bits of the immediate are ignored. VEX.L must be 1, otherwise attempt to execute this instruction with VEX.L=0 will cause #UD. 5-314 Vol. 2C VINSERTI128/VINSERTI32x4/VINSERTI64x2/VINSERTI32x8/VINSERTI64x4—Insert Packed Integer Values INSTRUCTION SET REFERENCE, V-Z Operation VINSERTI32x4 (EVEX encoded versions) (KL, VL) = (8, 256), (16, 512) TEMP_DEST[VL-1:0] SRC1[VL-1:0] IF VL = 256 CASE (imm8[0]) OF 0: TMP_DEST[127:0] SRC2[127:0] 1: TMP_DEST[255:128] SRC2[127:0] ESAC. FI; IF VL = 512 CASE (imm8[1:0]) OF 00: TMP_DEST[127:0] SRC2[127:0] 01: TMP_DEST[255:128] SRC2[127:0] 10: TMP_DEST[383:256] SRC2[127:0] 11: TMP_DEST[511:384] SRC2[127:0] ESAC. FI; FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] TMP_DEST[i+31:i] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VINSERTI64x2 (EVEX encoded versions) (KL, VL) = (4, 256), (8, 512) TEMP_DEST[VL-1:0] SRC1[VL-1:0] IF VL = 256 CASE (imm8[0]) OF 0: TMP_DEST[127:0] SRC2[127:0] 1: TMP_DEST[255:128] SRC2[127:0] ESAC. FI; IF VL = 512 CASE (imm8[1:0]) OF 00: TMP_DEST[127:0] SRC2[127:0] 01: TMP_DEST[255:128] SRC2[127:0] 10: TMP_DEST[383:256] SRC2[127:0] 11: TMP_DEST[511:384] SRC2[127:0] ESAC. FI; FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] TMP_DEST[i+63:i] ELSE VINSERTI128/VINSERTI32x4/VINSERTI64x2/VINSERTI32x8/VINSERTI64x4—Insert Packed Integer Values Vol. 2C 5-315 INSTRUCTION SET REFERENCE, V-Z IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VINSERTI32x8 (EVEX.U1.512 encoded version) TEMP_DEST[VL-1:0] SRC1[VL-1:0] CASE (imm8[0]) OF 0: TMP_DEST[255:0] SRC2[255:0] 1: TMP_DEST[511:256] SRC2[255:0] ESAC. FOR j 0 TO 15 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] TMP_DEST[i+31:i] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VINSERTI64x4 (EVEX.512 encoded version) VL = 512 TEMP_DEST[VL-1:0] SRC1[VL-1:0] CASE (imm8[0]) OF 0: TMP_DEST[255:0] SRC2[255:0] 1: TMP_DEST[511:256] SRC2[255:0] ESAC. FOR j 0 TO 7 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] TMP_DEST[i+63:i] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-316 Vol. 
2C VINSERTI128/VINSERTI32x4/VINSERTI64x2/VINSERTI32x8/VINSERTI64x4—Insert Packed Integer Values INSTRUCTION SET REFERENCE, V-Z VINSERTI128 TEMP[255:0] SRC1[255:0] CASE (imm8[0]) OF 0: TEMP[127:0] SRC2[127:0] 1: TEMP[255:128] SRC2[127:0] ESAC DEST TEMP Intel C/C++ Compiler Intrinsic Equivalent VINSERTI32x4 _mm512i _inserti32x4( __m512i a, __m128i b, int imm); VINSERTI32x4 _mm512i _mask_inserti32x4(__m512i s, __mmask16 k, __m512i a, __m128i b, int imm); VINSERTI32x4 _mm512i _maskz_inserti32x4( __mmask16 k, __m512i a, __m128i b, int imm); VINSERTI32x4 __m256i _mm256_inserti32x4( __m256i a, __m128i b, int imm); VINSERTI32x4 __m256i _mm256_mask_inserti32x4(__m256i s, __mmask8 k, __m256i a, __m128i b, int imm); VINSERTI32x4 __m256i _mm256_maskz_inserti32x4( __mmask8 k, __m256i a, __m128i b, int imm); VINSERTI32x8 __m512i _mm512_inserti32x8( __m512i a, __m256i b, int imm); VINSERTI32x8 __m512i _mm512_mask_inserti32x8(__m512i s, __mmask16 k, __m512i a, __m256i b, int imm); VINSERTI32x8 __m512i _mm512_maskz_inserti32x8( __mmask16 k, __m512i a, __m256i b, int imm); VINSERTI64x2 __m512i _mm512_inserti64x2( __m512i a, __m128i b, int imm); VINSERTI64x2 __m512i _mm512_mask_inserti64x2(__m512i s, __mmask8 k, __m512i a, __m128i b, int imm); VINSERTI64x2 __m512i _mm512_maskz_inserti64x2( __mmask8 k, __m512i a, __m128i b, int imm); VINSERTI64x2 __m256i _mm256_inserti64x2( __m256i a, __m128i b, int imm); VINSERTI64x2 __m256i _mm256_mask_inserti64x2(__m256i s, __mmask8 k, __m256i a, __m128i b, int imm); VINSERTI64x2 __m256i _mm256_maskz_inserti64x2( __mmask8 k, __m256i a, __m128i b, int imm); VINSERTI64x4 _mm512_inserti64x4( __m512i a, __m256i b, int imm); VINSERTI64x4 _mm512_mask_inserti64x4(__m512i s, __mmask8 k, __m512i a, __m256i b, int imm); VINSERTI64x4 _mm512_maskz_inserti64x4( __mmask m, __m512i a, __m256i b, int imm); VINSERTI128 __m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int offset); SIMD Floating-Point Exceptions None Other Exceptions VEX-encoded instruction, see Exceptions Type 6; additionally #UD If VEX.L = 0. EVEX-encoded instruction, see Exceptions Type E6NF. VINSERTI128/VINSERTI32x4/VINSERTI64x2/VINSERTI32x8/VINSERTI64x4—Insert Packed Integer Values Vol. 2C 5-317 INSTRUCTION SET REFERENCE, V-Z VMASKMOV—Conditional SIMD Packed Loads and Stores Opcode/ Instruction Op/ En 64/32-bit CPUID Mode Feature Flag VEX.NDS.128.66.0F38.W0 2C /r RVM V/V AVX Conditionally load packed single-precision values from m128 using mask in xmm2 and store in xmm1. RVM V/V AVX Conditionally load packed single-precision values from m256 using mask in ymm2 and store in ymm1. RVM V/V AVX Conditionally load packed double-precision values from m128 using mask in xmm2 and store in xmm1. RVM V/V AVX Conditionally load packed double-precision values from m256 using mask in ymm2 and store in ymm1. MVR V/V AVX Conditionally store packed single-precision values from xmm2 using mask in xmm1. MVR V/V AVX Conditionally store packed single-precision values from ymm2 using mask in ymm1. MVR V/V AVX Conditionally store packed double-precision values from xmm2 using mask in xmm1. MVR V/V AVX Conditionally store packed double-precision values from ymm2 using mask in ymm1. 
VMASKMOVPS xmm1, xmm2, m128 VEX.NDS.256.66.0F38.W0 2C /r VMASKMOVPS ymm1, ymm2, m256 VEX.NDS.128.66.0F38.W0 2D /r VMASKMOVPD xmm1, xmm2, m128 VEX.NDS.256.66.0F38.W0 2D /r VMASKMOVPD ymm1, ymm2, m256 VEX.NDS.128.66.0F38.W0 2E /r VMASKMOVPS m128, xmm1, xmm2 VEX.NDS.256.66.0F38.W0 2E /r VMASKMOVPS m256, ymm1, ymm2 VEX.NDS.128.66.0F38.W0 2F /r VMASKMOVPD m128, xmm1, xmm2 VEX.NDS.256.66.0F38.W0 2F /r VMASKMOVPD m256, ymm1, ymm2 Description Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RVM ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) NA MVR ModRM:r/m (w) VEX.vvvv (r) ModRM:reg (r) NA Description Conditionally moves packed data elements from the second source operand into the corresponding data element of the destination operand, depending on the mask bits associated with each data element. The mask bits are specified in the first source operand. The mask bit for each data element is the most significant bit of that element in the first source operand. If a mask is 1, the corresponding data element is copied from the second source operand to the destination operand. If the mask is 0, the corresponding data element is set to zero in the load form of these instructions, and unmodified in the store form. The second source operand is a memory address for the load form of these instruction. The destination operand is a memory address for the store form of these instructions. The other operands are both XMM registers (for VEX.128 version) or YMM registers (for VEX.256 version). Faults occur only due to mask-bit required memory accesses that caused the faults. Faults will not occur due to referencing any memory location if the corresponding mask bit for that memory location is 0. For example, no faults will be detected if the mask bits are all zero. Unlike previous MASKMOV instructions (MASKMOVQ and MASKMOVDQU), a nontemporal hint is not applied to these instructions. Instruction behavior on alignment check reporting with mask bits of less than all 1s are the same as with mask bits of all 1s. VMASKMOV should not be used to access memory mapped I/O and un-cached memory as the access and the ordering of the individual loads or stores it does is implementation specific. 5-318 Vol. 2C VMASKMOV—Conditional SIMD Packed Loads and Stores INSTRUCTION SET REFERENCE, V-Z In cases where mask bits indicate data should not be loaded or stored paging A and D bits will be set in an implementation dependent way. However, A and D bits are always set for pages where data is actually loaded/stored. Note: for load forms, the first source (the mask) is encoded in VEX.vvvv; the second source is encoded in rm_field, and the destination register is encoded in reg_field. Note: for store forms, the first source (the mask) is encoded in VEX.vvvv; the second source register is encoded in reg_field, and the destination memory location is encoded in rm_field. 
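A short C sketch can make the masked-access semantics above concrete. The example below is illustrative only (it is not part of the formal operation listing that follows) and assumes an AVX-capable C compiler, e.g. gcc -mavx; the array contents and mask pattern are arbitrary.

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float src[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float dst[8] = {0};

    /* A lane is selected when the most significant bit of its mask
       element is 1; the value -1 sets all bits of a lane. */
    __m256i mask = _mm256_setr_epi32(-1, -1, -1, 0, 0, -1, 0, -1);

    /* Masked-off lanes read as 0.0f and cannot fault. */
    __m256 v = _mm256_maskload_ps(src, mask);

    /* Masked-off lanes of dst are left unmodified by the store. */
    _mm256_maskstore_ps(dst, mask, v);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", dst[i]);    /* prints: 1 2 3 0 0 6 0 8 */
    printf("\n");
    return 0;
}

Because masked-off lanes do not fault, this load/store pair is a common way to handle the tail of a loop whose trip count is not a multiple of the vector width.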
Operation

VMASKMOVPS - 128-bit load
DEST[31:0] ← IF (SRC1[31]) Load_32(mem) ELSE 0
DEST[63:32] ← IF (SRC1[63]) Load_32(mem + 4) ELSE 0
DEST[95:64] ← IF (SRC1[95]) Load_32(mem + 8) ELSE 0
DEST[127:96] ← IF (SRC1[127]) Load_32(mem + 12) ELSE 0
DEST[VLMAX-1:128] ← 0

VMASKMOVPS - 256-bit load
DEST[31:0] ← IF (SRC1[31]) Load_32(mem) ELSE 0
DEST[63:32] ← IF (SRC1[63]) Load_32(mem + 4) ELSE 0
DEST[95:64] ← IF (SRC1[95]) Load_32(mem + 8) ELSE 0
DEST[127:96] ← IF (SRC1[127]) Load_32(mem + 12) ELSE 0
DEST[159:128] ← IF (SRC1[159]) Load_32(mem + 16) ELSE 0
DEST[191:160] ← IF (SRC1[191]) Load_32(mem + 20) ELSE 0
DEST[223:192] ← IF (SRC1[223]) Load_32(mem + 24) ELSE 0
DEST[255:224] ← IF (SRC1[255]) Load_32(mem + 28) ELSE 0

VMASKMOVPD - 128-bit load
DEST[63:0] ← IF (SRC1[63]) Load_64(mem) ELSE 0
DEST[127:64] ← IF (SRC1[127]) Load_64(mem + 8) ELSE 0
DEST[VLMAX-1:128] ← 0

VMASKMOVPD - 256-bit load
DEST[63:0] ← IF (SRC1[63]) Load_64(mem) ELSE 0
DEST[127:64] ← IF (SRC1[127]) Load_64(mem + 8) ELSE 0
DEST[191:128] ← IF (SRC1[191]) Load_64(mem + 16) ELSE 0
DEST[255:192] ← IF (SRC1[255]) Load_64(mem + 24) ELSE 0

VMASKMOVPS - 128-bit store
IF (SRC1[31]) DEST[31:0] ← SRC2[31:0]
IF (SRC1[63]) DEST[63:32] ← SRC2[63:32]
IF (SRC1[95]) DEST[95:64] ← SRC2[95:64]
IF (SRC1[127]) DEST[127:96] ← SRC2[127:96]

VMASKMOVPS - 256-bit store
IF (SRC1[31]) DEST[31:0] ← SRC2[31:0]
IF (SRC1[63]) DEST[63:32] ← SRC2[63:32]
IF (SRC1[95]) DEST[95:64] ← SRC2[95:64]
IF (SRC1[127]) DEST[127:96] ← SRC2[127:96]
IF (SRC1[159]) DEST[159:128] ← SRC2[159:128]
IF (SRC1[191]) DEST[191:160] ← SRC2[191:160]
IF (SRC1[223]) DEST[223:192] ← SRC2[223:192]
IF (SRC1[255]) DEST[255:224] ← SRC2[255:224]

VMASKMOVPD - 128-bit store
IF (SRC1[63]) DEST[63:0] ← SRC2[63:0]
IF (SRC1[127]) DEST[127:64] ← SRC2[127:64]

VMASKMOVPD - 256-bit store
IF (SRC1[63]) DEST[63:0] ← SRC2[63:0]
IF (SRC1[127]) DEST[127:64] ← SRC2[127:64]
IF (SRC1[191]) DEST[191:128] ← SRC2[191:128]
IF (SRC1[255]) DEST[255:192] ← SRC2[255:192]

Intel C/C++ Compiler Intrinsic Equivalent
__m256 _mm256_maskload_ps(float const *a, __m256i mask);
void _mm256_maskstore_ps(float *a, __m256i mask, __m256 b);
__m256d _mm256_maskload_pd(double const *a, __m256i mask);
void _mm256_maskstore_pd(double *a, __m256i mask, __m256d b);
__m128 _mm_maskload_ps(float const *a, __m128i mask);
void _mm_maskstore_ps(float *a, __m128i mask, __m128 b);
__m128d _mm_maskload_pd(double const *a, __m128i mask);
void _mm_maskstore_pd(double *a, __m128i mask, __m128d b);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 6 (No AC# reported for any mask bit combinations); additionally
#UD If VEX.W = 1.

VPBLENDD — Blend Packed Dwords

Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
VEX.NDS.128.66.0F3A.W0 02 /r ib VPBLENDD xmm1, xmm2, xmm3/m128, imm8 | RVMI | V/V | AVX2 | Select dwords from xmm2 and xmm3/m128 using the mask specified in imm8 and store the values into xmm1.
VEX.NDS.256.66.0F3A.W0 02 /r ib VPBLENDD ymm1, ymm2, ymm3/m256, imm8 | RVMI | V/V | AVX2 | Select dwords from ymm2 and ymm3/m256 using the mask specified in imm8 and store the values into ymm1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RVMI | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | Imm8

Description
Dword elements from the source operand (second operand) are conditionally written to the destination operand (first operand) depending on bits in the immediate operand (third operand).
The immediate bits (bits 7:0) form a mask that determines whether the corresponding dword in the destination is copied from the source. If a bit in the mask, corresponding to a dword, is 1, then the dword is copied, else the dword is unchanged.

VEX.128 encoded version: The second source operand can be an XMM register or a 128-bit memory location. The first source and destination operands are XMM registers. Bits (VLMAX-1:128) of the corresponding YMM register are zeroed.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register.

Operation

VPBLENDD (VEX.256 encoded version)
IF (imm8[0] == 1) THEN DEST[31:0] ← SRC2[31:0] ELSE DEST[31:0] ← SRC1[31:0]
IF (imm8[1] == 1) THEN DEST[63:32] ← SRC2[63:32] ELSE DEST[63:32] ← SRC1[63:32]
IF (imm8[2] == 1) THEN DEST[95:64] ← SRC2[95:64] ELSE DEST[95:64] ← SRC1[95:64]
IF (imm8[3] == 1) THEN DEST[127:96] ← SRC2[127:96] ELSE DEST[127:96] ← SRC1[127:96]
IF (imm8[4] == 1) THEN DEST[159:128] ← SRC2[159:128] ELSE DEST[159:128] ← SRC1[159:128]
IF (imm8[5] == 1) THEN DEST[191:160] ← SRC2[191:160] ELSE DEST[191:160] ← SRC1[191:160]
IF (imm8[6] == 1) THEN DEST[223:192] ← SRC2[223:192] ELSE DEST[223:192] ← SRC1[223:192]
IF (imm8[7] == 1) THEN DEST[255:224] ← SRC2[255:224] ELSE DEST[255:224] ← SRC1[255:224]

VPBLENDD (VEX.128 encoded version)
IF (imm8[0] == 1) THEN DEST[31:0] ← SRC2[31:0] ELSE DEST[31:0] ← SRC1[31:0]
IF (imm8[1] == 1) THEN DEST[63:32] ← SRC2[63:32] ELSE DEST[63:32] ← SRC1[63:32]
IF (imm8[2] == 1) THEN DEST[95:64] ← SRC2[95:64] ELSE DEST[95:64] ← SRC1[95:64]
IF (imm8[3] == 1) THEN DEST[127:96] ← SRC2[127:96] ELSE DEST[127:96] ← SRC1[127:96]
DEST[VLMAX-1:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VPBLENDD: __m128i _mm_blend_epi32 (__m128i v1, __m128i v2, const int mask)
VPBLENDD: __m256i _mm256_blend_epi32 (__m256i v1, __m256i v2, const int mask)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.W = 1.

VPBLENDMB/VPBLENDMW—Blend Byte/Word Vectors Using an Opmask Control

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.66.0F38.W0 66 /r VPBLENDMB xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Blend byte integer vector xmm2 and byte vector xmm3/m128 and store the result in xmm1, under control mask.
EVEX.NDS.256.66.0F38.W0 66 /r VPBLENDMB ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Blend byte integer vector ymm2 and byte vector ymm3/m256 and store the result in ymm1, under control mask.
EVEX.NDS.512.66.0F38.W0 66 /r VPBLENDMB zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Blend byte integer vector zmm2 and byte vector zmm3/m512 and store the result in zmm1, under control mask.
EVEX.NDS.128.66.0F38.W1 66 /r VPBLENDMW xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Blend word integer vector xmm2 and word vector xmm3/m128 and store the result in xmm1, under control mask.
EVEX.NDS.256.66.0F38.W1 66 /r VPBLENDMW ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Blend word integer vector ymm2 and word vector ymm3/m256 and store the result in ymm1, under control mask.
EVEX.NDS.512.66.0F38.W1 66 /r VPBLENDMW zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Blend word integer vector zmm2 and word vector zmm3/m512 and store the result in zmm1, under control mask.
Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 FVM ModRM:reg (w) EVEX.vvvv ModRM:r/m (r) NA Description Performs an element-by-element blending of byte/word elements between the first source operand byte vector register and the second source operand byte vector from memory or register, using the instruction mask as selector. The result is written into the destination byte vector register. The destination and first source operands are ZMM/YMM/XMM registers. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit memory location. The mask is not used as a writemask for this instruction. Instead, the mask is used as an element selector: every element of the destination is conditionally selected between first source or second source using the value of the related mask bit (0 for first source, 1 for second source). VPBLENDMB/VPBLENDMW—Blend Byte/Word Vectors Using an Opmask Control Vol. 2C 5-323 INSTRUCTION SET REFERENCE, V-Z Operation VPBLENDMB (EVEX encoded versions) (KL, VL) = (16, 128), (32, 256), (64, 512) FOR j 0 TO KL-1 ij*8 IF k1[j] OR *no writemask* THEN DEST[i+7:i] SRC2[i+7:i] ELSE IF *merging-masking* ; merging-masking THEN DEST[i+7:i] SRC1[i+7:i] ELSE ; zeroing-masking DEST[i+7:i] 0 FI; FI; ENDFOR DEST[MAX_VL-1:VL] 0; VPBLENDMW (EVEX encoded versions) (KL, VL) = (8, 128), (16, 256), (32, 512) FOR j 0 TO KL-1 i j * 16 IF k1[j] OR *no writemask* THEN DEST[i+15:i] SRC2[i+15:i] ELSE IF *merging-masking* ; merging-masking THEN DEST[i+15:i] SRC1[i+15:i] ELSE ; zeroing-masking DEST[i+15:i] 0 FI; FI; ENDFOR DEST[MAX_VL-1:VL] 0 Intel C/C++ Compiler Intrinsic Equivalent VPBLENDMB __m512i _mm512_mask_blend_epi8(__mmask64 m, __m512i a, __m512i b); VPBLENDMB __m256i _mm256_mask_blend_epi8(__mmask32 m, __m256i a, __m256i b); VPBLENDMB __m128i _mm_mask_blend_epi8(__mmask16 m, __m128i a, __m128i b); VPBLENDMW __m512i _mm512_mask_blend_epi16(__mmask32 m, __m512i a, __m512i b); VPBLENDMW __m256i _mm256_mask_blend_epi16(__mmask16 m, __m256i a, __m256i b); VPBLENDMW __m128i _mm_mask_blend_epi16(__mmask8 m, __m128i a, __m128i b); SIMD Floating-Point Exceptions None Other Exceptions See Exceptions Type E4. 5-324 Vol. 2C VPBLENDMB/VPBLENDMW—Blend Byte/Word Vectors Using an Opmask Control INSTRUCTION SET REFERENCE, V-Z VPBLENDMD/VPBLENDMQ—Blend Int32/Int64 Vectors Using an OpMask Control Opcode/ Instruction Op / En FV 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512F EVEX.NDS.128.66.0F38.W0 64 /r VPBLENDMD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst EVEX.NDS.256.66.0F38.W0 64 /r VPBLENDMD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst EVEX.NDS.512.66.0F38.W0 64 /r VPBLENDMD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst EVEX.NDS.128.66.0F38.W1 64 /r VPBLENDMQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst EVEX.NDS.256.66.0F38.W1 64 /r VPBLENDMQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst EVEX.NDS.512.66.0F38.W1 64 /r VPBLENDMQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst FV V/V AVX512VL AVX512F FV V/V AVX512F FV V/V AVX512VL AVX512F FV V/V AVX512VL AVX512F FV V/V AVX512F Description Blend doubleword integer vector xmm2 and doubleword vector xmm3/m128/m32bcst and store the result in xmm1, under control mask. Blend doubleword integer vector ymm2 and doubleword vector ymm3/m256/m32bcst and store the result in ymm1, under control mask. Blend doubleword integer vector zmm2 and doubleword vector zmm3/m512/m32bcst and store the result in zmm1, under control mask. 
Blend quadword integer vector xmm2 and quadword vector xmm3/m128/m64bcst and store the result in xmm1, under control mask.
Blend quadword integer vector ymm2 and quadword vector ymm3/m256/m64bcst and store the result in ymm1, under control mask.
Blend quadword integer vector zmm2 and quadword vector zmm3/m512/m64bcst and store the result in zmm1, under control mask.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA

Description
Performs an element-by-element blending of dword/qword elements between the first source operand (the second operand) and the elements of the second source operand (the third operand) using an opmask register as select control. The blended result is written into the destination. The destination and first source operands are ZMM/YMM/XMM registers. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a vector broadcasted from a 32-bit (VPBLENDMD) or 64-bit (VPBLENDMQ) memory location.
The opmask register is not used as a writemask for this instruction. Instead, the mask is used as an element selector: every element of the destination is conditionally selected between first source or second source using the value of the related mask bit (0 for the first source operand, 1 for the second source operand). If EVEX.z is set, the elements with corresponding mask bit value of 0 in the destination operand are zeroed.

Operation

VPBLENDMD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no controlmask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[i+31:i] ← SRC2[31:0]
                ELSE DEST[i+31:i] ← SRC2[i+31:i]
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN DEST[i+31:i] ← SRC1[i+31:i]
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI;
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VPBLENDMQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no controlmask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[i+63:i] ← SRC2[63:0]
                ELSE DEST[i+63:i] ← SRC2[i+63:i]
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN DEST[i+63:i] ← SRC1[i+63:i]
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI;
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VPBLENDMD __m512i _mm512_mask_blend_epi32(__mmask16 k, __m512i a, __m512i b);
VPBLENDMD __m256i _mm256_mask_blend_epi32(__mmask8 m, __m256i a, __m256i b);
VPBLENDMD __m128i _mm_mask_blend_epi32(__mmask8 m, __m128i a, __m128i b);
VPBLENDMQ __m512i _mm512_mask_blend_epi64(__mmask8 k, __m512i a, __m512i b);
VPBLENDMQ __m256i _mm256_mask_blend_epi64(__mmask8 m, __m256i a, __m256i b);
VPBLENDMQ __m128i _mm_mask_blend_epi64(__mmask8 m, __m128i a, __m128i b);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type E4.

VPBLENDMD/VPBLENDMQ—Blend Int32/Int64 Vectors Using an OpMask Control Vol.
2C 5-327 INSTRUCTION SET REFERENCE, V-Z VPBROADCASTB/W/D/Q—Load with Broadcast Integer Data from General Purpose Register Opcode/ Instruction Op / En EVEX.128.66.0F38.W0 7A /r VPBROADCASTB xmm1 {k1}{z}, reg EVEX.256.66.0F38.W0 7A /r VPBROADCASTB ymm1 {k1}{z}, reg EVEX.512.66.0F38.W0 7A /r VPBROADCASTB zmm1 {k1}{z}, reg EVEX.128.66.0F38.W0 7B /r VPBROADCASTW xmm1 {k1}{z}, reg EVEX.256.66.0F38.W0 7B /r VPBROADCASTW ymm1 {k1}{z}, reg EVEX.512.66.0F38.W0 7B /r VPBROADCASTW zmm1 {k1}{z}, reg EVEX.128.66.0F38.W0 7C /r VPBROADCASTD xmm1 {k1}{z}, r32 EVEX.256.66.0F38.W0 7C /r VPBROADCASTD ymm1 {k1}{z}, r32 EVEX.512.66.0F38.W0 7C /r VPBROADCASTD zmm1 {k1}{z}, r32 EVEX.128.66.0F38.W1 7C /r VPBROADCASTQ xmm1 {k1}{z}, r64 EVEX.256.66.0F38.W1 7C /r VPBROADCASTQ ymm1 {k1}{z}, r64 EVEX.512.66.0F38.W1 7C /r VPBROADCASTQ zmm1 {k1}{z}, r64 T1S 64/32 bit Mode Support V/V T1S V/V T1S V/V T1S V/V T1S V/V T1S V/V T1S V/V T1S V/V T1S V/V T1S V/N.E.1 T1S V/N.E.1 T1S V/N.E.1 CPUID Feature Flag Description AVX512VL AVX512BW AVX512VL AVX512BW AVX512BW Broadcast an 8-bit value from a GPR to all bytes in the 128-bit destination subject to writemask k1. Broadcast an 8-bit value from a GPR to all bytes in the 256-bit destination subject to writemask k1. Broadcast an 8-bit value from a GPR to all bytes in the 512-bit destination subject to writemask k1. Broadcast a 16-bit value from a GPR to all words in the 128-bit destination subject to writemask k1. Broadcast a 16-bit value from a GPR to all words in the 256-bit destination subject to writemask k1. Broadcast a 16-bit value from a GPR to all words in the 512-bit destination subject to writemask k1. Broadcast a 32-bit value from a GPR to all double-words in the 128-bit destination subject to writemask k1. Broadcast a 32-bit value from a GPR to all double-words in the 256-bit destination subject to writemask k1. Broadcast a 32-bit value from a GPR to all double-words in the 512-bit destination subject to writemask k1. Broadcast a 64-bit value from a GPR to all quad-words in the 128-bit destination subject to writemask k1. Broadcast a 64-bit value from a GPR to all quad-words in the 256-bit destination subject to writemask k1. Broadcast a 64-bit value from a GPR to all quad-words in the 512-bit destination subject to writemask k1. AVX512VL AVX512BW AVX512VL AVX512BW AVX512BW AVX512VL AVX512F AVX512VL AVX512F AVX512F AVX512VL AVX512F AVX512VL AVX512F AVX512F NOTES: 1. EVEX.W in non-64 bit is ignored; the instructions behaves as if the W0 version is used. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 T1S ModRM:reg (w) ModRM:r/m (r) NA NA Description Broadcasts a 8-bit, 16-bit, 32-bit or 64-bit value from a general-purpose register (the second operand) to all the locations in the destination vector register (the first operand) using the writemask k1. EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD. 5-328 Vol. 
2C VPBROADCASTB/W/D/Q—Load with Broadcast Integer Data from General Purpose Register INSTRUCTION SET REFERENCE, V-Z Operation VPBROADCASTB (EVEX encoded versions) (KL, VL) = (16, 128), (32, 256), (64, 512) FOR j 0 TO KL-1 i j * 8 IF k1[j] OR *no writemask* THEN DEST[i+7:i] SRC[7:0] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+7:i] remains unchanged* ELSE ; zeroing-masking DEST[i+7:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPBROADCASTW (EVEX encoded versions) (KL, VL) = (8, 128), (16, 256), (32, 512) FOR j 0 TO KL-1 i j * 16 IF k1[j] OR *no writemask* THEN DEST[i+15:i] SRC[15:0] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+15:i] remains unchanged* ELSE ; zeroing-masking DEST[i+15:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPBROADCASTD (EVEX encoded versions) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] SRC[31:0] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPBROADCASTB/W/D/Q—Load with Broadcast Integer Data from General Purpose Register Vol. 2C 5-329 INSTRUCTION SET REFERENCE, V-Z VPBROADCASTQ (EVEX encoded versions) (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] SRC[63:0] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 Intel C/C++ Compiler Intrinsic Equivalent VPBROADCASTB __m512i _mm512_mask_set1_epi8(__m512i s, __mmask64 k, int a); VPBROADCASTB __m512i _mm512_maskz_set1_epi8( __mmask64 k, int a); VPBROADCASTB __m256i _mm256_mask_set1_epi8(__m256i s, __mmask32 k, int a); VPBROADCASTB __m256i _mm256_maskz_set1_epi8( __mmask32 k, int a); VPBROADCASTB __m128i _mm_mask_set1_epi8(__m128i s, __mmask16 k, int a); VPBROADCASTB __m128i _mm_maskz_set1_epi8( __mmask16 k, int a); VPBROADCASTD __m512i _mm512_mask_set1_epi32(__m512i s, __mmask16 k, int a); VPBROADCASTD __m512i _mm512_maskz_set1_epi32( __mmask16 k, int a); VPBROADCASTD __m256i _mm256_mask_set1_epi32(__m256i s, __mmask8 k, int a); VPBROADCASTD __m256i _mm256_maskz_set1_epi32( __mmask8 k, int a); VPBROADCASTD __m128i _mm_mask_set1_epi32(__m128i s, __mmask8 k, int a); VPBROADCASTD __m128i _mm_maskz_set1_epi32( __mmask8 k, int a); VPBROADCASTQ __m512i _mm512_mask_set1_epi64(__m512i s, __mmask8 k, __int64 a); VPBROADCASTQ __m512i _mm512_maskz_set1_epi64( __mmask8 k, __int64 a); VPBROADCASTQ __m256i _mm256_mask_set1_epi64(__m256i s, __mmask8 k, __int64 a); VPBROADCASTQ __m256i _mm256_maskz_set1_epi64( __mmask8 k, __int64 a); VPBROADCASTQ __m128i _mm_mask_set1_epi64(__m128i s, __mmask8 k, __int64 a); VPBROADCASTQ __m128i _mm_maskz_set1_epi64( __mmask8 k, __int64 a); VPBROADCASTW __m512i _mm512_mask_set1_epi16(__m512i s, __mmask32 k, int a); VPBROADCASTW __m512i _mm512_maskz_set1_epi16( __mmask32 k, int a); VPBROADCASTW __m256i _mm256_mask_set1_epi16(__m256i s, __mmask16 k, int a); VPBROADCASTW __m256i _mm256_maskz_set1_epi16( __mmask16 k, int a); VPBROADCASTW __m128i _mm_mask_set1_epi16(__m128i s, __mmask8 k, int a); VPBROADCASTW __m128i _mm_maskz_set1_epi16( __mmask8 k, int a); Exceptions EVEX-encoded instructions, see Exceptions Type E7NM. #UD 5-330 Vol. 2C If EVEX.vvvv != 1111B. 
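The masked set1 intrinsics listed above give a compact way to exercise the GPR-broadcast form. The sketch below is an assumed usage example, not manual text; it requires an AVX512F-capable C compiler (e.g. gcc -mavx512f), and the mask and values are arbitrary.

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    int out[16];
    __m512i old = _mm512_set1_epi32(7);    /* stands in for prior destination contents */
    __mmask16 k = 0x00FF;                  /* select the low eight dword elements */

    /* Merging form: elements with k[j] = 0 keep their value from 'old';
       a compiler may emit the GPR form of VPBROADCASTD for this. */
    __m512i r = _mm512_mask_set1_epi32(old, k, 42);

    _mm512_storeu_si512(out, r);
    for (int j = 0; j < 16; j++)
        printf("%d ", out[j]);             /* prints 42 eight times, then 7 eight times */
    printf("\n");
    return 0;
}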
VPBROADCASTB/W/D/Q—Load with Broadcast Integer Data from General Purpose Register INSTRUCTION SET REFERENCE, V-Z VPBROADCAST—Load Integer and Broadcast Opcode/ Instruction Op / En RM 64/32 bit Mode Support V/V CPUID Feature Flag AVX2 VEX.128.66.0F38.W0 78 /r VPBROADCASTB xmm1, xmm2/m8 VEX.256.66.0F38.W0 78 /r VPBROADCASTB ymm1, xmm2/m8 EVEX.128.66.0F38.W0 78 /r VPBROADCASTB xmm1{k1}{z}, xmm2/m8 EVEX.256.66.0F38.W0 78 /r VPBROADCASTB ymm1{k1}{z}, xmm2/m8 EVEX.512.66.0F38.W0 78 /r VPBROADCASTB zmm1{k1}{z}, xmm2/m8 RM V/V AVX2 T1S V/V T1S V/V T1S V/V AVX512VL AVX512BW AVX512VL AVX512BW AVX512BW VEX.128.66.0F38.W0 79 /r VPBROADCASTW xmm1, xmm2/m16 VEX.256.66.0F38.W0 79 /r VPBROADCASTW ymm1, xmm2/m16 EVEX.128.66.0F38.W0 79 /r VPBROADCASTW xmm1{k1}{z}, xmm2/m16 RM V/V AVX2 RM V/V AVX2 T1S V/V AVX512VL AVX512BW EVEX.256.66.0F38.W0 79 /r VPBROADCASTW ymm1{k1}{z}, xmm2/m16 T1S V/V AVX512VL AVX512BW EVEX.512.66.0F38.W0 79 /r VPBROADCASTW zmm1{k1}{z}, xmm2/m16 T1S V/V AVX512BW VEX.128.66.0F38.W0 58 /r VPBROADCASTD xmm1, xmm2/m32 VEX.256.66.0F38.W0 58 /r VPBROADCASTD ymm1, xmm2/m32 EVEX.128.66.0F38.W0 58 /r VPBROADCASTD xmm1 {k1}{z}, xmm2/m32 RM V/V AVX2 RM V/V AVX2 T1S V/V AVX512VL AVX512F EVEX.256.66.0F38.W0 58 /r VPBROADCASTD ymm1 {k1}{z}, xmm2/m32 T1S V/V AVX512VL AVX512F EVEX.512.66.0F38.W0 58 /r VPBROADCASTD zmm1 {k1}{z}, xmm2/m32 T1S V/V AVX512F VEX.128.66.0F38.W0 59 /r VPBROADCASTQ xmm1, xmm2/m64 VEX.256.66.0F38.W0 59 /r VPBROADCASTQ ymm1, xmm2/m64 EVEX.128.66.0F38.W1 59 /r VPBROADCASTQ xmm1 {k1}{z}, xmm2/m64 EVEX.256.66.0F38.W1 59 /r VPBROADCASTQ ymm1 {k1}{z}, xmm2/m64 EVEX.512.66.0F38.W1 59 /r VPBROADCASTQ zmm1 {k1}{z}, xmm2/m64 EVEX.128.66.0F38.W0 59 /r VBROADCASTI32x2 xmm1 {k1}{z}, xmm2/m64 RM V/V AVX2 RM V/V AVX2 T1S V/V T1S V/V T1S V/V AVX512VL AVX512F AVX512VL AVX512F AVX512F T2 V/V VPBROADCAST—Load Integer and Broadcast AVX512VL AVX512DQ Description Broadcast a byte integer in the source operand to sixteen locations in xmm1. Broadcast a byte integer in the source operand to thirty-two locations in ymm1. Broadcast a byte integer in the source operand to locations in xmm1 subject to writemask k1. Broadcast a byte integer in the source operand to locations in ymm1 subject to writemask k1. Broadcast a byte integer in the source operand to 64 locations in zmm1 subject to writemask k1. Broadcast a word integer in the source operand to eight locations in xmm1. Broadcast a word integer in the source operand to sixteen locations in ymm1. Broadcast a word integer in the source operand to locations in xmm1 subject to writemask k1. Broadcast a word integer in the source operand to locations in ymm1 subject to writemask k1. Broadcast a word integer in the source operand to 32 locations in zmm1 subject to writemask k1. Broadcast a dword integer in the source operand to four locations in xmm1. Broadcast a dword integer in the source operand to eight locations in ymm1. Broadcast a dword integer in the source operand to locations in xmm1 subject to writemask k1. Broadcast a dword integer in the source operand to locations in ymm1 subject to writemask k1. Broadcast a dword integer in the source operand to locations in zmm1 subject to writemask k1. Broadcast a qword element in source operand to two locations in xmm1. Broadcast a qword element in source operand to four locations in ymm1. Broadcast a qword element in source operand to locations in xmm1 subject to writemask k1. Broadcast a qword element in source operand to locations in ymm1 subject to writemask k1. 
Broadcast a qword element in source operand to locations in zmm1 subject to writemask k1. Broadcast two dword elements in source operand to locations in xmm1 subject to writemask k1. Vol. 2C 5-331 INSTRUCTION SET REFERENCE, V-Z Opcode/ Instruction Op / En T2 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512DQ EVEX.256.66.0F38.W0 59 /r VBROADCASTI32x2 ymm1 {k1}{z}, xmm2/m64 EVEX.512.66.0F38.W0 59 /r VBROADCASTI32x2 zmm1 {k1}{z}, xmm2/m64 T2 V/V AVX512DQ VEX.256.66.0F38.W0 5A /r VBROADCASTI128 ymm1, m128 EVEX.256.66.0F38.W0 5A /r VBROADCASTI32X4 ymm1 {k1}{z}, m128 RM V/V AVX2 T4 V/V AVX512VL AVX512F EVEX.512.66.0F38.W0 5A /r VBROADCASTI32X4 zmm1 {k1}{z}, m128 T4 V/V AVX512F EVEX.256.66.0F38.W1 5A /r VBROADCASTI64X2 ymm1 {k1}{z}, m128 T2 V/V AVX512VL AVX512DQ EVEX.512.66.0F38.W1 5A /r VBROADCASTI64X2 zmm1 {k1}{z}, m128 T2 V/V AVX512DQ EVEX.512.66.0F38.W0 5B /r VBROADCASTI32X8 zmm1 {k1}{z}, m256 T8 V/V AVX512DQ EVEX.512.66.0F38.W1 5B /r VBROADCASTI64X4 zmm1 {k1}{z}, m256 T4 V/V AVX512F Description Broadcast two dword elements in source operand to locations in ymm1 subject to writemask k1. Broadcast two dword elements in source operand to locations in zmm1 subject to writemask k1. Broadcast 128 bits of integer data in mem to low and high 128-bits in ymm1. Broadcast 128 bits of 4 doubleword integer data in mem to locations in ymm1 using writemask k1. Broadcast 128 bits of 4 doubleword integer data in mem to locations in zmm1 using writemask k1. Broadcast 128 bits of 2 quadword integer data in mem to locations in ymm1 using writemask k1. Broadcast 128 bits of 2 quadword integer data in mem to locations in zmm1 using writemask k1. Broadcast 256 bits of 8 doubleword integer data in mem to locations in zmm1 using writemask k1. Broadcast 256 bits of 4 quadword integer data in mem to locations in zmm1 using writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RM ModRM:reg (w) ModRM:r/m (r) NA NA T1S, T2, T4, T8 ModRM:reg (w) ModRM:r/m (r) NA NA Description Load integer data from the source operand (the second operand) and broadcast to all elements of the destination operand (the first operand). VEX256-encoded VPBROADCASTB/W/D/Q: The source operand is 8-bit, 16-bit, 32-bit, 64-bit memory location or the low 8-bit, 16-bit 32-bit, 64-bit data in an XMM register. The destination operand is a YMM register. VPBROADCASTI128 support the source operand of 128-bit memory location. Register source encodings for VPBROADCASTI128 is reserved and will #UD. Bits (MAX_VL-1:256) of the destination register are zeroed. EVEX-encoded VPBROADCASTD/Q: The source operand is a 32-bit, 64-bit memory location or the low 32-bit, 64bit data in an XMM register. The destination operand is a ZMM/YMM/XMM register and updated according to the writemask k1. VPBROADCASTI32X4 and VPBROADCASTI64X4: The destination operand is a ZMM register and updated according to the writemask k1. The source operand is 128-bit or 256-bit memory location. Register source encodings for VBROADCASTI32X4 and VBROADCASTI64X4 are reserved and will #UD. Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD. If VPBROADCASTI128 is encoded with VEX.L= 0, an attempt to execute the instruction encoded with VEX.L= 0 will cause an #UD exception. 5-332 Vol. 2C VPBROADCAST—Load Integer and Broadcast INSTRUCTION SET REFERENCE, V-Z X0 m32 DEST X0 X0 X0 X0 X0 X0 X0 X0 Figure 5-16. VPBROADCASTD Operation (VEX.256 encoded version) X0 m32 DEST 0 0 0 0 X0 X0 X0 X0 Figure 5-17. 
VPBROADCASTD Operation (128-bit version) m64 DEST X0 X0 X0 X0 X0 Figure 5-18. VPBROADCASTQ Operation (256-bit version) m128 DEST X0 X0 X0 Figure 5-19. VBROADCASTI128 Operation (256-bit version) VPBROADCAST—Load Integer and Broadcast Vol. 2C 5-333 INSTRUCTION SET REFERENCE, V-Z m256 DEST X0 X0 X0 Figure 5-20. VBROADCASTI256 Operation (512-bit version) Operation VPBROADCASTB (EVEX encoded versions) (KL, VL) = (16, 128), (32, 256), (64, 512) FOR j 0 TO KL-1 i j * 8 IF k1[j] OR *no writemask* THEN DEST[i+7:i] SRC[7:0] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+7:i] remains unchanged* ELSE ; zeroing-masking DEST[i+7:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPBROADCASTW (EVEX encoded versions) (KL, VL) = (8, 128), (16, 256), (32, 512) FOR j 0 TO KL-1 i j * 16 IF k1[j] OR *no writemask* THEN DEST[i+15:i] SRC[15:0] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+15:i] remains unchanged* ELSE ; zeroing-masking DEST[i+15:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPBROADCASTD (128 bit version) temp SRC[31:0] DEST[31:0] temp DEST[63:32] temp DEST[95:64] temp DEST[127:96] temp DEST[MAX_VL-1:128] 0 5-334 Vol. 2C VPBROADCAST—Load Integer and Broadcast INSTRUCTION SET REFERENCE, V-Z VPBROADCASTD (VEX.256 encoded version) temp SRC[31:0] DEST[31:0] temp DEST[63:32] temp DEST[95:64] temp DEST[127:96] temp DEST[159:128] temp DEST[191:160] temp DEST[223:192] temp DEST[255:224] temp DEST[MAX_VL-1:256] 0 VPBROADCASTD (EVEX encoded versions) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] SRC[31:0] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPBROADCASTQ (VEX.256 encoded version) temp SRC[63:0] DEST[63:0] temp DEST[127:64] temp DEST[191:128] temp DEST[255:192] temp DEST[MAX_VL-1:256] 0 VPBROADCASTQ (EVEX encoded versions) (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] SRC[63:0] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VBROADCASTI32x2 (EVEX encoded versions) (KL, VL) = (4, 128), (8, 256), (16, 512) VPBROADCAST—Load Integer and Broadcast Vol. 2C 5-335 INSTRUCTION SET REFERENCE, V-Z FOR j 0 TO KL-1 i j * 32 n (j mod 2) * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] SRC[n+31:n] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VBROADCASTI128 (VEX.256 encoded version) temp SRC[127:0] DEST[127:0] temp DEST[255:128] temp DEST[MAX_VL-1:256] 0 VBROADCASTI32X4 (EVEX encoded versions) (KL, VL) = (8, 256), (16, 512) FOR j 0 TO KL-1 i j* 32 n (j modulo 4) * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] SRC[n+31:n] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VBROADCASTI64X2 (EVEX encoded versions) (KL, VL) = (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 64 n (j modulo 2) * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] SRC[n+63:n] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] = 0 FI FI; ENDFOR; 5-336 Vol. 
2C VPBROADCAST—Load Integer and Broadcast INSTRUCTION SET REFERENCE, V-Z VBROADCASTI32X8 (EVEX.U1.512 encoded version) FOR j 0 TO 15 i j * 32 n (j modulo 8) * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] SRC[n+31:n] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VBROADCASTI64X4 (EVEX.512 encoded version) FOR j 0 TO 7 i j * 64 n (j modulo 4) * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] SRC[n+63:n] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 Intel C/C++ Compiler Intrinsic Equivalent VPBROADCASTB __m512i _mm512_broadcastb_epi8( __m128i a); VPBROADCASTB __m512i _mm512_mask_broadcastb_epi8(__m512i s, __mmask64 k, __m128i a); VPBROADCASTB __m512i _mm512_maskz_broadcastb_epi8( __mmask64 k, __m128i a); VPBROADCASTB __m256i _mm256_broadcastb_epi8(__m128i a); VPBROADCASTB __m256i _mm256_mask_broadcastb_epi8(__m256i s, __mmask32 k, __m128i a); VPBROADCASTB __m256i _mm256_maskz_broadcastb_epi8( __mmask32 k, __m128i a); VPBROADCASTB __m128i _mm_mask_broadcastb_epi8(__m128i s, __mmask16 k, __m128i a); VPBROADCASTB __m128i _mm_maskz_broadcastb_epi8( __mmask16 k, __m128i a); VPBROADCASTB __m128i _mm_broadcastb_epi8(__m128i a); VPBROADCASTD __m512i _mm512_broadcastd_epi32( __m128i a); VPBROADCASTD __m512i _mm512_mask_broadcastd_epi32(__m512i s, __mmask16 k, __m128i a); VPBROADCASTD __m512i _mm512_maskz_broadcastd_epi32( __mmask16 k, __m128i a); VPBROADCASTD __m256i _mm256_broadcastd_epi32( __m128i a); VPBROADCASTD __m256i _mm256_mask_broadcastd_epi32(__m256i s, __mmask8 k, __m128i a); VPBROADCASTD __m256i _mm256_maskz_broadcastd_epi32( __mmask8 k, __m128i a); VPBROADCASTD __m128i _mm_broadcastd_epi32(__m128i a); VPBROADCASTD __m128i _mm_mask_broadcastd_epi32(__m128i s, __mmask8 k, __m128i a); VPBROADCASTD __m128i _mm_maskz_broadcastd_epi32( __mmask8 k, __m128i a); VPBROADCASTQ __m512i _mm512_broadcastq_epi64( __m128i a); VPBROADCASTQ __m512i _mm512_mask_broadcastq_epi64(__m512i s, __mmask8 k, __m128i a); VPBROADCASTQ __m512i _mm512_maskz_broadcastq_epi64( __mmask8 k, __m128i a); VPBROADCAST—Load Integer and Broadcast Vol. 
2C 5-337 INSTRUCTION SET REFERENCE, V-Z VPBROADCASTQ __m256i _mm256_broadcastq_epi64(__m128i a); VPBROADCASTQ __m256i _mm256_mask_broadcastq_epi64(__m256i s, __mmask8 k, __m128i a); VPBROADCASTQ __m256i _mm256_maskz_broadcastq_epi64( __mmask8 k, __m128i a); VPBROADCASTQ __m128i _mm_broadcastq_epi64(__m128i a); VPBROADCASTQ __m128i _mm_mask_broadcastq_epi64(__m128i s, __mmask8 k, __m128i a); VPBROADCASTQ __m128i _mm_maskz_broadcastq_epi64( __mmask8 k, __m128i a); VPBROADCASTW __m512i _mm512_broadcastw_epi16(__m128i a); VPBROADCASTW __m512i _mm512_mask_broadcastw_epi16(__m512i s, __mmask32 k, __m128i a); VPBROADCASTW __m512i _mm512_maskz_broadcastw_epi16( __mmask32 k, __m128i a); VPBROADCASTW __m256i _mm256_broadcastw_epi16(__m128i a); VPBROADCASTW __m256i _mm256_mask_broadcastw_epi16(__m256i s, __mmask16 k, __m128i a); VPBROADCASTW __m256i _mm256_maskz_broadcastw_epi16( __mmask16 k, __m128i a); VPBROADCASTW __m128i _mm_broadcastw_epi16(__m128i a); VPBROADCASTW __m128i _mm_mask_broadcastw_epi16(__m128i s, __mmask8 k, __m128i a); VPBROADCASTW __m128i _mm_maskz_broadcastw_epi16( __mmask8 k, __m128i a); VBROADCASTI32x2 __m512i _mm512_broadcast_i32x2( __m128i a); VBROADCASTI32x2 __m512i _mm512_mask_broadcast_i32x2(__m512i s, __mmask16 k, __m128i a); VBROADCASTI32x2 __m512i _mm512_maskz_broadcast_i32x2( __mmask16 k, __m128i a); VBROADCASTI32x2 __m256i _mm256_broadcast_i32x2( __m128i a); VBROADCASTI32x2 __m256i _mm256_mask_broadcast_i32x2(__m256i s, __mmask8 k, __m128i a); VBROADCASTI32x2 __m256i _mm256_maskz_broadcast_i32x2( __mmask8 k, __m128i a); VBROADCASTI32x2 __m128i _mm_broadcastq_i32x2(__m128i a); VBROADCASTI32x2 __m128i _mm_mask_broadcastq_i32x2(__m128i s, __mmask8 k, __m128i a); VBROADCASTI32x2 __m128i _mm_maskz_broadcastq_i32x2( __mmask8 k, __m128i a); VBROADCASTI32x4 __m512i _mm512_broadcast_i32x4( __m128i a); VBROADCASTI32x4 __m512i _mm512_mask_broadcast_i32x4(__m512i s, __mmask16 k, __m128i a); VBROADCASTI32x4 __m512i _mm512_maskz_broadcast_i32x4( __mmask16 k, __m128i a); VBROADCASTI32x4 __m256i _mm256_broadcast_i32x4( __m128i a); VBROADCASTI32x4 __m256i _mm256_mask_broadcast_i32x4(__m256i s, __mmask8 k, __m128i a); VBROADCASTI32x4 __m256i _mm256_maskz_broadcast_i32x4( __mmask8 k, __m128i a); VBROADCASTI32x8 __m512i _mm512_broadcast_i32x8( __m256i a); VBROADCASTI32x8 __m512i _mm512_mask_broadcast_i32x8(__m512i s, __mmask16 k, __m256i a); VBROADCASTI32x8 __m512i _mm512_maskz_broadcast_i32x8( __mmask16 k, __m256i a); VBROADCASTI64x2 __m512i _mm512_broadcast_i64x2( __m128i a); VBROADCASTI64x2 __m512i _mm512_mask_broadcast_i64x2(__m512i s, __mmask8 k, __m128i a); VBROADCASTI64x2 __m512i _mm512_maskz_broadcast_i64x2( __mmask8 k, __m128i a); VBROADCASTI64x2 __m256i _mm256_broadcast_i64x2( __m128i a); VBROADCASTI64x2 __m256i _mm256_mask_broadcast_i64x2(__m256i s, __mmask8 k, __m128i a); VBROADCASTI64x2 __m256i _mm256_maskz_broadcast_i64x2( __mmask8 k, __m128i a); VBROADCASTI64x4 __m512i _mm512_broadcast_i64x4( __m256i a); VBROADCASTI64x4 __m512i _mm512_mask_broadcast_i64x4(__m512i s, __mmask8 k, __m256i a); VBROADCASTI64x4 __m512i _mm512_maskz_broadcast_i64x4( __mmask8 k, __m256i a); SIMD Floating-Point Exceptions None Other Exceptions EVEX-encoded instructions, see Exceptions Type 6; EVEX-encoded instructions, syntax with reg/mem operand, see Exceptions Type E6. #UD If VEX.L = 0 for VPBROADCASTQ, VPBROADCASTI128. If EVEX.L’L = 0 for VBROADCASTI32X4/VBROADCASTI64X2. If EVEX.L’L < 10b for VBROADCASTI32X8/VBROADCASTI64X4. 5-338 Vol. 
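As an assumed usage example of the 128-bit tuple broadcast described above (VBROADCASTI32X4), the sketch below replicates one XMM lane across a ZMM register through the _mm512_broadcast_i32x4 intrinsic listed earlier; it requires AVX512F (e.g. gcc -mavx512f), and the lane contents are arbitrary.

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    int out[16];
    __m128i lane = _mm_setr_epi32(10, 11, 12, 13);

    /* The four dwords of 'lane' are copied into all four 128-bit
       chunks of the 512-bit destination. */
    __m512i r = _mm512_broadcast_i32x4(lane);

    _mm512_storeu_si512(out, r);
    for (int j = 0; j < 16; j++)
        printf("%d ", out[j]);    /* prints 10 11 12 13 four times */
    printf("\n");
    return 0;
}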
2C VPBROADCAST—Load Integer and Broadcast INSTRUCTION SET REFERENCE, V-Z VPCMPB/VPCMPUB—Compare Packed Byte Values Into Mask Opcode/ Instruction Op/ En FVM 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512BW EVEX.NDS.128.66.0F3A.W0 3F /r ib VPCMPB k1 {k2}, xmm2, xmm3/m128, imm8 EVEX.NDS.256.66.0F3A.W0 3F /r ib FVM V/V VPCMPB k1 {k2}, ymm2, ymm3/m256, imm8 EVEX.NDS.512.66.0F3A.W0 3F /r ib VPCMPB k1 {k2}, zmm2, zmm3/m512, imm8 AVX512VL AVX512BW FVM V/V AVX512BW EVEX.NDS.128.66.0F3A.W0 3E /r ib FVM V/V VPCMPUB k1 {k2}, xmm2, xmm3/m128, imm8 EVEX.NDS.256.66.0F3A.W0 3E /r ib AVX512VL AVX512BW FVM V/V AVX512VL AVX512BW FVM V/V AVX512BW VPCMPUB k1 {k2}, ymm2, ymm3/m256, imm8 EVEX.NDS.512.66.0F3A.W0 3E /r ib VPCMPUB k1 {k2}, zmm2, zmm3/m512, imm8 Description Compare packed signed byte values in xmm3/m128 and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Compare packed signed byte values in ymm3/m256 and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Compare packed signed byte values in zmm3/m512 and zmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Compare packed unsigned byte values in xmm3/m128 and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Compare packed unsigned byte values in ymm3/m256 and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Compare packed unsigned byte values in zmm3/m512 and zmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 FVM ModRM:reg (w) vvvv (r) ModRM:r/m (r) NA Description Performs a SIMD compare of the packed byte values in the second source operand and the first source operand and returns the results of the comparison to the mask destination operand. The comparison predicate operand (immediate byte) specifies the type of comparison performed on each pair of packed values in the two source operands. The result of each comparison is a single mask bit result of 1 (comparison true) or 0 (comparison false). VPCMPB performs a comparison between pairs of signed byte values. VPCMPUB performs a comparison between pairs of unsigned byte values. The first source operand (second operand) is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand (first operand) is a mask register k1. Up to 64/32/16 comparisons are performed with results written to the destination operand under the writemask k2. VPCMPB/VPCMPUB—Compare Packed Byte Values Into Mask Vol. 2C 5-339 INSTRUCTION SET REFERENCE, V-Z The comparison predicate operand is an 8-bit immediate: bits 2:0 define the type of comparison to be performed. Bits 3 through 7 of the immediate are reserved. Compiler can implement the pseudo-op mnemonic listed in Table 5-10. Table 5-10. 
Pseudo-Op and VPCMP* Implementation

Pseudo-Op                     PCMPM Implementation
VPCMPEQ* reg1, reg2, reg3     VPCMP* reg1, reg2, reg3, 0
VPCMPLT* reg1, reg2, reg3     VPCMP* reg1, reg2, reg3, 1
VPCMPLE* reg1, reg2, reg3     VPCMP* reg1, reg2, reg3, 2
VPCMPNEQ* reg1, reg2, reg3    VPCMP* reg1, reg2, reg3, 4
VPCMPNLT* reg1, reg2, reg3    VPCMP* reg1, reg2, reg3, 5
VPCMPNLE* reg1, reg2, reg3    VPCMP* reg1, reg2, reg3, 6

Operation

CASE (COMPARISON PREDICATE) OF
    0: OP ← EQ;
    1: OP ← LT;
    2: OP ← LE;
    3: OP ← FALSE;
    4: OP ← NEQ;
    5: OP ← NLT;
    6: OP ← NLE;
    7: OP ← TRUE;
ESAC;

VPCMPB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k2[j] OR *no writemask*
        THEN
            CMP ← SRC1[i+7:i] OP SRC2[i+7:i];
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0

VPCMPUB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k2[j] OR *no writemask*
        THEN
            CMP ← SRC1[i+7:i] OP SRC2[i+7:i];
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VPCMPB __mmask64 _mm512_cmp_epi8_mask( __m512i a, __m512i b, int cmp);
VPCMPB __mmask64 _mm512_mask_cmp_epi8_mask( __mmask64 m, __m512i a, __m512i b, int cmp);
VPCMPB __mmask32 _mm256_cmp_epi8_mask( __m256i a, __m256i b, int cmp);
VPCMPB __mmask32 _mm256_mask_cmp_epi8_mask( __mmask32 m, __m256i a, __m256i b, int cmp);
VPCMPB __mmask16 _mm_cmp_epi8_mask( __m128i a, __m128i b, int cmp);
VPCMPB __mmask16 _mm_mask_cmp_epi8_mask( __mmask16 m, __m128i a, __m128i b, int cmp);
VPCMPB __mmask64 _mm512_cmp[eq|ge|gt|le|lt|neq]_epi8_mask( __m512i a, __m512i b);
VPCMPB __mmask64 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epi8_mask( __mmask64 m, __m512i a, __m512i b);
VPCMPB __mmask32 _mm256_cmp[eq|ge|gt|le|lt|neq]_epi8_mask( __m256i a, __m256i b);
VPCMPB __mmask32 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epi8_mask( __mmask32 m, __m256i a, __m256i b);
VPCMPB __mmask16 _mm_cmp[eq|ge|gt|le|lt|neq]_epi8_mask( __m128i a, __m128i b);
VPCMPB __mmask16 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epi8_mask( __mmask16 m, __m128i a, __m128i b);
VPCMPUB __mmask64 _mm512_cmp_epu8_mask( __m512i a, __m512i b, int cmp);
VPCMPUB __mmask64 _mm512_mask_cmp_epu8_mask( __mmask64 m, __m512i a, __m512i b, int cmp);
VPCMPUB __mmask32 _mm256_cmp_epu8_mask( __m256i a, __m256i b, int cmp);
VPCMPUB __mmask32 _mm256_mask_cmp_epu8_mask( __mmask32 m, __m256i a, __m256i b, int cmp);
VPCMPUB __mmask16 _mm_cmp_epu8_mask( __m128i a, __m128i b, int cmp);
VPCMPUB __mmask16 _mm_mask_cmp_epu8_mask( __mmask16 m, __m128i a, __m128i b, int cmp);
VPCMPUB __mmask64 _mm512_cmp[eq|ge|gt|le|lt|neq]_epu8_mask( __m512i a, __m512i b);
VPCMPUB __mmask64 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epu8_mask( __mmask64 m, __m512i a, __m512i b);
VPCMPUB __mmask32 _mm256_cmp[eq|ge|gt|le|lt|neq]_epu8_mask( __m256i a, __m256i b);
VPCMPUB __mmask32 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epu8_mask( __mmask32 m, __m256i a, __m256i b);
VPCMPUB __mmask16 _mm_cmp[eq|ge|gt|le|lt|neq]_epu8_mask( __m128i a, __m128i b);
VPCMPUB __mmask16 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epu8_mask( __mmask16 m, __m128i a, __m128i b);

SIMD Floating-Point Exceptions
None

Other Exceptions
EVEX-encoded instruction, see Exceptions Type E4.nb.
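To make the 3-bit predicate encoding concrete, a minimal sketch assuming an AVX-512BW target; the helper name and threshold parameter are illustrative, and the immediate follows the 0 through 7 CASE table above:

#include <immintrin.h>
#include <stdint.h>

/* Count the bytes of a 64-byte block that are strictly below a signed
   threshold: predicate 1 = LT in the CASE table, signed form = VPCMPB. */
int count_bytes_below(const void *p, int8_t thresh)
{
    __m512i v = _mm512_loadu_si512(p);
    __mmask64 m = _mm512_cmp_epi8_mask(v, _mm512_set1_epi8(thresh), 1 /* LT */);
    return (int)_mm_popcnt_u64((uint64_t)m);
}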
2C 5-341 INSTRUCTION SET REFERENCE, V-Z VPCMPD/VPCMPUD—Compare Packed Integer Values into Mask Opcode/ Instruction Op/ En FV 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512F EVEX.NDS.128.66.0F3A.W0 1F /r ib VPCMPD k1 {k2}, xmm2, xmm3/m128/m32bcst, imm8 EVEX.NDS.256.66.0F3A.W0 1F /r ib VPCMPD k1 {k2}, ymm2, ymm3/m256/m32bcst, imm8 FV V/V AVX512VL AVX512F EVEX.NDS.512.66.0F3A.W0 1F /r ib VPCMPD k1 {k2}, zmm2, zmm3/m512/m32bcst, imm8 FV V/V AVX512F EVEX.NDS.128.66.0F3A.W0 1E /r ib VPCMPUD k1 {k2}, xmm2, xmm3/m128/m32bcst, imm8 FV V/V AVX512VL AVX512F EVEX.NDS.256.66.0F3A.W0 1E /r ib VPCMPUD k1 {k2}, ymm2, ymm3/m256/m32bcst, imm8 FV V/V AVX512VL AVX512F EVEX.NDS.512.66.0F3A.W0 1E /r ib VPCMPUD k1 {k2}, zmm2, zmm3/m512/m32bcst, imm8 FV V/V AVX512F Description Compare packed signed doubleword integer values in xmm3/m128/m32bcst and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Compare packed signed doubleword integer values in ymm3/m256/m32bcst and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Compare packed signed doubleword integer values in zmm2 and zmm3/m512/m32bcst using bits 2:0 of imm8 as a comparison predicate. The comparison results are written to the destination k1 under writemask k2. Compare packed unsigned doubleword integer values in xmm3/m128/m32bcst and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Compare packed unsigned doubleword integer values in ymm3/m256/m32bcst and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Compare packed unsigned doubleword integer values in zmm2 and zmm3/m512/m32bcst using bits 2:0 of imm8 as a comparison predicate. The comparison results are written to the destination k1 under writemask k2. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 FV ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) Imm8 Description Performs a SIMD compare of the packed integer values in the second source operand and the first source operand and returns the results of the comparison to the mask destination operand. The comparison predicate operand (immediate byte) specifies the type of comparison performed on each pair of packed values in the two source operands. The result of each comparison is a single mask bit result of 1 (comparison true) or 0 (comparison false). VPCMPD/VPCMPUD performs a comparison between pairs of signed/unsigned doubleword integer values. The first source operand (second operand) is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location or a 512-bit vector broadcasted from a 32-bit memory location. The destination operand (first operand) is a mask register k1. Up to 16/8/4 comparisons are performed with results written to the destination operand under the writemask k2. The comparison predicate operand is an 8-bit immediate: bits 2:0 define the type of comparison to be performed. Bits 3 through 7 of the immediate are reserved. Compiler can implement the pseudo-op mnemonic listed in Table 5-10. 5-342 Vol. 
2C VPCMPD/VPCMPUD—Compare Packed Integer Values into Mask INSTRUCTION SET REFERENCE, V-Z Operation CASE (COMPARISON PREDICATE) OF 0: OP EQ; 1: OP LT; 2: OP LE; 3: OP FALSE; 4: OP NEQ; 5: OP NLT; 6: OP NLE; 7: OP TRUE; ESAC; VPCMPD (EVEX encoded versions) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF k2[j] OR *no writemask* THEN IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN CMP SRC1[i+31:i] OP SRC2[31:0]; ELSE CMP SRC1[i+31:i] OP SRC2[i+31:i]; FI; IF CMP = TRUE THEN DEST[j] 1; ELSE DEST[j] 0; FI; ELSE DEST[j] 0 ; zeroing-masking onlyFI; FI; ENDFOR DEST[MAX_KL-1:KL] 0 VPCMPUD (EVEX encoded versions) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF k2[j] OR *no writemask* THEN IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN CMP SRC1[i+31:i] OP SRC2[31:0]; ELSE CMP SRC1[i+31:i] OP SRC2[i+31:i]; FI; IF CMP = TRUE THEN DEST[j] 1; ELSE DEST[j] 0; FI; ELSE DEST[j] 0 ; zeroing-masking onlyFI; FI; ENDFOR DEST[MAX_KL-1:KL] 0 VPCMPD/VPCMPUD—Compare Packed Integer Values into Mask Vol. 2C 5-343 INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VPCMPD __mmask16 _mm512_cmp_epi32_mask( __m512i a, __m512i b, int imm); VPCMPD __mmask16 _mm512_mask_cmp_epi32_mask(__mmask16 k, __m512i a, __m512i b, int imm); VPCMPD __mmask16 _mm512_cmp[eq|ge|gt|le|lt|neq]_epi32_mask( __m512i a, __m512i b); VPCMPD __mmask16 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epi32_mask(__mmask16 k, __m512i a, __m512i b); VPCMPUD __mmask16 _mm512_cmp_epu32_mask( __m512i a, __m512i b, int imm); VPCMPUD __mmask16 _mm512_mask_cmp_epu32_mask(__mmask16 k, __m512i a, __m512i b, int imm); VPCMPUD __mmask16 _mm512_cmp[eq|ge|gt|le|lt|neq]_epu32_mask( __m512i a, __m512i b); VPCMPUD __mmask16 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epu32_mask(__mmask16 k, __m512i a, __m512i b); VPCMPD __mmask8 _mm256_cmp_epi32_mask( __m256i a, __m256i b, int imm); VPCMPD __mmask8 _mm256_mask_cmp_epi32_mask(__mmask8 k, __m256i a, __m256i b, int imm); VPCMPD __mmask8 _mm256_cmp[eq|ge|gt|le|lt|neq]_epi32_mask( __m256i a, __m256i b); VPCMPD __mmask8 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epi32_mask(__mmask8 k, __m256i a, __m256i b); VPCMPUD __mmask8 _mm256_cmp_epu32_mask( __m256i a, __m256i b, int imm); VPCMPUD __mmask8 _mm256_mask_cmp_epu32_mask(__mmask8 k, __m256i a, __m256i b, int imm); VPCMPUD __mmask8 _mm256_cmp[eq|ge|gt|le|lt|neq]_epu32_mask( __m256i a, __m256i b); VPCMPUD __mmask8 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epu32_mask(__mmask8 k, __m256i a, __m256i b); VPCMPD __mmask8 _mm_cmp_epi32_mask( __m128i a, __m128i b, int imm); VPCMPD __mmask8 _mm_mask_cmp_epi32_mask(__mmask8 k, __m128i a, __m128i b, int imm); VPCMPD __mmask8 _mm_cmp[eq|ge|gt|le|lt|neq]_epi32_mask( __m128i a, __m128i b); VPCMPD __mmask8 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epi32_mask(__mmask8 k, __m128i a, __m128i b); VPCMPUD __mmask8 _mm_cmp_epu32_mask( __m128i a, __m128i b, int imm); VPCMPUD __mmask8 _mm_mask_cmp_epu32_mask(__mmask8 k, __m128i a, __m128i b, int imm); VPCMPUD __mmask8 _mm_cmp[eq|ge|gt|le|lt|neq]_epu32_mask( __m128i a, __m128i b); VPCMPUD __mmask8 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epu32_mask(__mmask8 k, __m128i a, __m128i b); SIMD Floating-Point Exceptions None Other Exceptions EVEX-encoded instruction, see Exceptions Type E4. 5-344 Vol. 
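A short sketch of the k2 writemask behavior described in the Operation section, assuming AVX-512F; per that pseudocode, result bits for lanes with k2[j] = 0 are zeroed rather than preserved:

#include <immintrin.h>

/* Compare only the dword lanes selected by k2 (predicate 0 = EQ);
   e.g., with k2 = 0x00FF only the low eight result bits can be set. */
__mmask16 masked_dword_eq(__mmask16 k2, __m512i a, __m512i b)
{
    return _mm512_mask_cmp_epi32_mask(k2, a, b, 0 /* EQ */);
}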
2C VPCMPD/VPCMPUD—Compare Packed Integer Values into Mask INSTRUCTION SET REFERENCE, V-Z VPCMPQ/VPCMPUQ—Compare Packed Integer Values into Mask Opcode/ Instruction Op/ En FV 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512F EVEX.NDS.128.66.0F3A.W1 1F /r ib VPCMPQ k1 {k2}, xmm2, xmm3/m128/m64bcst, imm8 EVEX.NDS.256.66.0F3A.W1 1F /r ib VPCMPQ k1 {k2}, ymm2, ymm3/m256/m64bcst, imm8 FV V/V AVX512VL AVX512F EVEX.NDS.512.66.0F3A.W1 1F /r ib VPCMPQ k1 {k2}, zmm2, zmm3/m512/m64bcst, imm8 FV V/V AVX512F EVEX.NDS.128.66.0F3A.W1 1E /r ib VPCMPUQ k1 {k2}, xmm2, xmm3/m128/m64bcst, imm8 FV V/V AVX512VL AVX512F EVEX.NDS.256.66.0F3A.W1 1E /r ib VPCMPUQ k1 {k2}, ymm2, ymm3/m256/m64bcst, imm8 FV V/V AVX512VL AVX512F EVEX.NDS.512.66.0F3A.W1 1E /r ib VPCMPUQ k1 {k2}, zmm2, zmm3/m512/m64bcst, imm8 FV V/V AVX512F Description Compare packed signed quadword integer values in xmm3/m128/m64bcst and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Compare packed signed quadword integer values in ymm3/m256/m64bcst and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Compare packed signed quadword integer values in zmm3/m512/m64bcst and zmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Compare packed unsigned quadword integer values in xmm3/m128/m64bcst and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Compare packed unsigned quadword integer values in ymm3/m256/m64bcst and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Compare packed unsigned quadword integer values in zmm3/m512/m64bcst and zmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 FV ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) Imm8 Description Performs a SIMD compare of the packed integer values in the second source operand and the first source operand and returns the results of the comparison to the mask destination operand. The comparison predicate operand (immediate byte) specifies the type of comparison performed on each pair of packed values in the two source operands. The result of each comparison is a single mask bit result of 1 (comparison true) or 0 (comparison false). VPCMPQ/VPCMPUQ performs a comparison between pairs of signed/unsigned quadword integer values. The first source operand (second operand) is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location or a 512-bit vector broadcasted from a 64-bit memory location. The destination operand (first operand) is a mask register k1. Up to 8/4/2 comparisons are performed with results written to the destination operand under the writemask k2. The comparison predicate operand is an 8-bit immediate: bits 2:0 define the type of comparison to be performed. Bits 3 through 7 of the immediate are reserved. Compiler can implement the pseudo-op mnemonic listed in Table 5-10. VPCMPQ/VPCMPUQ—Compare Packed Integer Values into Mask Vol. 
2C 5-345 INSTRUCTION SET REFERENCE, V-Z Operation CASE (COMPARISON PREDICATE) OF 0: OP EQ; 1: OP LT; 2: OP LE; 3: OP FALSE; 4: OP NEQ; 5: OP NLT; 6: OP NLE; 7: OP TRUE; ESAC; VPCMPQ (EVEX encoded versions) (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF k2[j] OR *no writemask* THEN IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN CMP SRC1[i+63:i] OP SRC2[63:0]; ELSE CMP SRC1[i+63:i] OP SRC2[i+63:i]; FI; IF CMP = TRUE THEN DEST[j] 1; ELSE DEST[j] 0; FI; ELSE DEST[j] 0 ; zeroing-masking only FI; ENDFOR DEST[MAX_KL-1:KL] 0 VPCMPUQ (EVEX encoded versions) (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF k2[j] OR *no writemask* THEN IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN CMP SRC1[i+63:i] OP SRC2[63:0]; ELSE CMP SRC1[i+63:i] OP SRC2[i+63:i]; FI; IF CMP = TRUE THEN DEST[j] 1; ELSE DEST[j] 0; FI; ELSE DEST[j] 0 ; zeroing-masking only FI; ENDFOR DEST[MAX_KL-1:KL] 0 5-346 Vol. 2C VPCMPQ/VPCMPUQ—Compare Packed Integer Values into Mask INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VPCMPQ __mmask8 _mm512_cmp_epi64_mask( __m512i a, __m512i b, int imm); VPCMPQ __mmask8 _mm512_mask_cmp_epi64_mask(__mmask8 k, __m512i a, __m512i b, int imm); VPCMPQ __mmask8 _mm512_cmp[eq|ge|gt|le|lt|neq]_epi64_mask( __m512i a, __m512i b); VPCMPQ __mmask8 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epi64_mask(__mmask8 k, __m512i a, __m512i b); VPCMPUQ __mmask8 _mm512_cmp_epu64_mask( __m512i a, __m512i b, int imm); VPCMPUQ __mmask8 _mm512_mask_cmp_epu64_mask(__mmask8 k, __m512i a, __m512i b, int imm); VPCMPUQ __mmask8 _mm512_cmp[eq|ge|gt|le|lt|neq]_epu64_mask( __m512i a, __m512i b); VPCMPUQ __mmask8 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epu64_mask(__mmask8 k, __m512i a, __m512i b); VPCMPQ __mmask8 _mm256_cmp_epi64_mask( __m256i a, __m256i b, int imm); VPCMPQ __mmask8 _mm256_mask_cmp_epi64_mask(__mmask8 k, __m256i a, __m256i b, int imm); VPCMPQ __mmask8 _mm256_cmp[eq|ge|gt|le|lt|neq]_epi64_mask( __m256i a, __m256i b); VPCMPQ __mmask8 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epi64_mask(__mmask8 k, __m256i a, __m256i b); VPCMPUQ __mmask8 _mm256_cmp_epu64_mask( __m256i a, __m256i b, int imm); VPCMPUQ __mmask8 _mm256_mask_cmp_epu64_mask(__mmask8 k, __m256i a, __m256i b, int imm); VPCMPUQ __mmask8 _mm256_cmp[eq|ge|gt|le|lt|neq]_epu64_mask( __m256i a, __m256i b); VPCMPUQ __mmask8 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epu64_mask(__mmask8 k, __m256i a, __m256i b); VPCMPQ __mmask8 _mm_cmp_epi64_mask( __m128i a, __m128i b, int imm); VPCMPQ __mmask8 _mm_mask_cmp_epi64_mask(__mmask8 k, __m128i a, __m128i b, int imm); VPCMPQ __mmask8 _mm_cmp[eq|ge|gt|le|lt|neq]_epi64_mask( __m128i a, __m128i b); VPCMPQ __mmask8 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epi64_mask(__mmask8 k, __m128i a, __m128i b); VPCMPUQ __mmask8 _mm_cmp_epu64_mask( __m128i a, __m128i b, int imm); VPCMPUQ __mmask8 _mm_mask_cmp_epu64_mask(__mmask8 k, __m128i a, __m128i b, int imm); VPCMPUQ __mmask8 _mm_cmp[eq|ge|gt|le|lt|neq]_epu64_mask( __m128i a, __m128i b); VPCMPUQ __mmask8 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epu64_mask(__mmask8 k, __m128i a, __m128i b); SIMD Floating-Point Exceptions None Other Exceptions EVEX-encoded instruction, see Exceptions Type E4. VPCMPQ/VPCMPUQ—Compare Packed Integer Values into Mask Vol. 
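The m64bcst form is typically reached from C by comparing against a splatted scalar, as in the sketch below (AVX-512F assumed); whether the compiler folds the _mm512_set1_epi64 into an embedded broadcast (EVEX.b) is a code-generation choice, not something the intrinsic guarantees:

#include <immintrin.h>
#include <stdint.h>

/* Flag the unsigned qword lanes greater than a bound: predicate 6 = NLE,
   which for the unsigned form VPCMPUQ is a greater-than test. */
__mmask8 qwords_above(__m512i v, uint64_t bound)
{
    return _mm512_cmp_epu64_mask(v, _mm512_set1_epi64((long long)bound), 6 /* NLE */);
}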
2C 5-347 INSTRUCTION SET REFERENCE, V-Z VPCMPW/VPCMPUW—Compare Packed Word Values Into Mask Opcode/ Instruction Op/ En FVM 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512BW EVEX.NDS.128.66.0F3A.W1 3F /r ib VPCMPW k1 {k2}, xmm2, xmm3/m128, imm8 EVEX.NDS.256.66.0F3A.W1 3F /r ib FVM V/V VPCMPW k1 {k2}, ymm2, ymm3/m256, imm8 EVEX.NDS.512.66.0F3A.W1 3F /r ib VPCMPW k1 {k2}, zmm2, zmm3/m512, imm8 AVX512VL AVX512BW FVM V/V AVX512BW EVEX.NDS.128.66.0F3A.W1 3E /r ib FVM V/V VPCMPUW k1 {k2}, xmm2, xmm3/m128, imm8 EVEX.NDS.256.66.0F3A.W1 3E /r ib AVX512VL AVX512BW FVM V/V AVX512VL AVX512BW FVM V/V AVX512BW VPCMPUW k1 {k2}, ymm2, ymm3/m256, imm8 VPCMPUW k1 {k2}, zmm2, zmm3/m512, imm8 Description Compare packed signed word integers in xmm3/m128 and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Compare packed signed word integers in ymm3/m256 and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Compare packed signed word integers in zmm3/m512 and zmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Compare packed unsigned word integers in xmm3/m128 and xmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Compare packed unsigned word integers in ymm3/m256 and ymm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Compare packed unsigned word integers in zmm3/m512 and zmm2 using bits 2:0 of imm8 as a comparison predicate with writemask k2 and leave the result in mask register k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 FVM ModRM:reg (w) vvvv (r) ModRM:r/m (r) NA Description Performs a SIMD compare of the packed integer word in the second source operand and the first source operand and returns the results of the comparison to the mask destination operand. The comparison predicate operand (immediate byte) specifies the type of comparison performed on each pair of packed values in the two source operands. The result of each comparison is a single mask bit result of 1 (comparison true) or 0 (comparison false). VPCMPW performs a comparison between pairs of signed word values. VPCMPUW performs a comparison between pairs of unsigned word values. The first source operand (second operand) is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand (first operand) is a mask register k1. Up to 32/16/8 comparisons are performed with results written to the destination operand under the writemask k2. The comparison predicate operand is an 8-bit immediate: bits 2:0 define the type of comparison to be performed. Bits 3 through 7 of the immediate are reserved. Compiler can implement the pseudo-op mnemonic listed in Table 5-10. 5-348 Vol. 
2C VPCMPW/VPCMPUW—Compare Packed Word Values Into Mask

Operation

CASE (COMPARISON PREDICATE) OF
    0: OP ← EQ;
    1: OP ← LT;
    2: OP ← LE;
    3: OP ← FALSE;
    4: OP ← NEQ;
    5: OP ← NLT;
    6: OP ← NLE;
    7: OP ← TRUE;
ESAC;

VPCMPW (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k2[j] OR *no writemask*
        THEN
            CMP ← SRC1[i+15:i] OP SRC2[i+15:i];
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0

VPCMPUW (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k2[j] OR *no writemask*
        THEN
            CMP ← SRC1[i+15:i] OP SRC2[i+15:i];
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VPCMPW __mmask32 _mm512_cmp_epi16_mask( __m512i a, __m512i b, int cmp);
VPCMPW __mmask32 _mm512_mask_cmp_epi16_mask( __mmask32 m, __m512i a, __m512i b, int cmp);
VPCMPW __mmask16 _mm256_cmp_epi16_mask( __m256i a, __m256i b, int cmp);
VPCMPW __mmask16 _mm256_mask_cmp_epi16_mask( __mmask16 m, __m256i a, __m256i b, int cmp);
VPCMPW __mmask8 _mm_cmp_epi16_mask( __m128i a, __m128i b, int cmp);
VPCMPW __mmask8 _mm_mask_cmp_epi16_mask( __mmask8 m, __m128i a, __m128i b, int cmp);
VPCMPW __mmask32 _mm512_cmp[eq|ge|gt|le|lt|neq]_epi16_mask( __m512i a, __m512i b);
VPCMPW __mmask32 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epi16_mask( __mmask32 m, __m512i a, __m512i b);
VPCMPW __mmask16 _mm256_cmp[eq|ge|gt|le|lt|neq]_epi16_mask( __m256i a, __m256i b);
VPCMPW __mmask16 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epi16_mask( __mmask16 m, __m256i a, __m256i b);
VPCMPW __mmask8 _mm_cmp[eq|ge|gt|le|lt|neq]_epi16_mask( __m128i a, __m128i b);
VPCMPW __mmask8 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epi16_mask( __mmask8 m, __m128i a, __m128i b);
VPCMPUW __mmask32 _mm512_cmp_epu16_mask( __m512i a, __m512i b, int cmp);
VPCMPUW __mmask32 _mm512_mask_cmp_epu16_mask( __mmask32 m, __m512i a, __m512i b, int cmp);
VPCMPUW __mmask16 _mm256_cmp_epu16_mask( __m256i a, __m256i b, int cmp);
VPCMPUW __mmask16 _mm256_mask_cmp_epu16_mask( __mmask16 m, __m256i a, __m256i b, int cmp);
VPCMPUW __mmask8 _mm_cmp_epu16_mask( __m128i a, __m128i b, int cmp);
VPCMPUW __mmask8 _mm_mask_cmp_epu16_mask( __mmask8 m, __m128i a, __m128i b, int cmp);
VPCMPUW __mmask32 _mm512_cmp[eq|ge|gt|le|lt|neq]_epu16_mask( __m512i a, __m512i b);
VPCMPUW __mmask32 _mm512_mask_cmp[eq|ge|gt|le|lt|neq]_epu16_mask( __mmask32 m, __m512i a, __m512i b);
VPCMPUW __mmask16 _mm256_cmp[eq|ge|gt|le|lt|neq]_epu16_mask( __m256i a, __m256i b);
VPCMPUW __mmask16 _mm256_mask_cmp[eq|ge|gt|le|lt|neq]_epu16_mask( __mmask16 m, __m256i a, __m256i b);
VPCMPUW __mmask8 _mm_cmp[eq|ge|gt|le|lt|neq]_epu16_mask( __m128i a, __m128i b);
VPCMPUW __mmask8 _mm_mask_cmp[eq|ge|gt|le|lt|neq]_epu16_mask( __mmask8 m, __m128i a, __m128i b);

SIMD Floating-Point Exceptions
None

Other Exceptions
EVEX-encoded instruction, see Exceptions Type E4.nb.
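The pseudo-op mapping of Table 5-10 carries over to the intrinsic level, as the sketch below illustrates (AVX-512BW assumed); the two calls encode the same VPCMPW with imm8 = 0, so the masks should be identical:

#include <immintrin.h>
#include <assert.h>

/* VPCMPEQW is VPCMPW with imm8 = 0 (Table 5-10), so the eq-suffixed
   intrinsic and the generic predicate form are interchangeable. */
void cmp_pseudo_op_demo(__m512i a, __m512i b)
{
    __mmask32 m1 = _mm512_cmpeq_epi16_mask(a, b);
    __mmask32 m2 = _mm512_cmp_epi16_mask(a, b, 0 /* EQ */);
    assert(m1 == m2);
}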
2C VPCMPW/VPCMPUW—Compare Packed Word Values Into Mask INSTRUCTION SET REFERENCE, V-Z VPCOMPRESSD—Store Sparse Packed Doubleword Integer Values into Dense Memory/Register Opcode/ Instruction Op/ En EVEX.128.66.0F38.W0 8B /r VPCOMPRESSD xmm1/m128 {k1}{z}, xmm2 EVEX.256.66.0F38.W0 8B /r VPCOMPRESSD ymm1/m256 {k1}{z}, ymm2 EVEX.512.66.0F38.W0 8B /r VPCOMPRESSD zmm1/m512 {k1}{z}, zmm2 T1S 64/32 bit Mode Support V/V T1S V/V T1S V/V CPUID Feature Flag AVX512VL AVX512F AVX512VL AVX512F AVX512F Description Compress packed doubleword integer values from xmm2 to xmm1/m128 using controlmask k1. Compress packed doubleword integer values from ymm2 to ymm1/m256 using controlmask k1. Compress packed doubleword integer values from zmm2 to zmm1/m512 using controlmask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 T1S ModRM:r/m (w) ModRM:reg (r) NA NA Description Compress (store) up to 16/8/4 doubleword integer values from the source operand (second operand) to the destination operand (first operand). The source operand is a ZMM/YMM/XMM register, the destination operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The opmask register k1 selects the active elements (partial vector or possibly non-contiguous if less than 16 active elements) from the source operand to compress into a contiguous vector. The contiguous vector is written to the destination starting from the low element of the destination operand. Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z must be zero. Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the source operand, the upper bits of the destination register are unmodified if EVEX.z is not set, otherwise the upper bits are zeroed. Note: EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD. Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element instead of the size of the full vector. Operation VPCOMPRESSD (EVEX encoded versions) store form (KL, VL) = (4, 128), (8, 256), (16, 512) SIZE 32 k0 FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no controlmask* THEN DEST[k+SIZE-1:k] SRC[i+31:i] k k + SIZE FI; ENDFOR; VPCOMPRESSD—Store Sparse Packed Doubleword Integer Values into Dense Memory/Register Vol. 
2C 5-351 INSTRUCTION SET REFERENCE, V-Z VPCOMPRESSD (EVEX encoded versions) reg-reg form (KL, VL) = (4, 128), (8, 256), (16, 512) SIZE 32 k0 FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no controlmask* THEN DEST[k+SIZE-1:k] SRC[i+31:i] k k + SIZE FI; ENDFOR IF *merging-masking* THEN *DEST[VL-1:k] remains unchanged* ELSE DEST[VL-1:k] ← 0 FI DEST[MAX_VL-1:VL] 0 Intel C/C++ Compiler Intrinsic Equivalent VPCOMPRESSD __m512i _mm512_mask_compress_epi32(__m512i s, __mmask16 c, __m512i a); VPCOMPRESSD __m512i _mm512_maskz_compress_epi32( __mmask16 c, __m512i a); VPCOMPRESSD void _mm512_mask_compressstoreu_epi32(void * a, __mmask16 c, __m512i s); VPCOMPRESSD __m256i _mm256_mask_compress_epi32(__m256i s, __mmask8 c, __m256i a); VPCOMPRESSD __m256i _mm256_maskz_compress_epi32( __mmask8 c, __m256i a); VPCOMPRESSD void _mm256_mask_compressstoreu_epi32(void * a, __mmask8 c, __m256i s); VPCOMPRESSD __m128i _mm_mask_compress_epi32(__m128i s, __mmask8 c, __m128i a); VPCOMPRESSD __m128i _mm_maskz_compress_epi32( __mmask8 c, __m128i a); VPCOMPRESSD void _mm_mask_compressstoreu_epi32(void * a, __mmask8 c, __m128i s); SIMD Floating-Point Exceptions None Other Exceptions EVEX-encoded instruction, see Exceptions Type E4.nb. 5-352 Vol. 2C VPCOMPRESSD—Store Sparse Packed Doubleword Integer Values into Dense Memory/Register INSTRUCTION SET REFERENCE, V-Z VPCOMPRESSQ—Store Sparse Packed Quadword Integer Values into Dense Memory/Register Opcode/ Instruction Op/ En EVEX.128.66.0F38.W1 8B /r VPCOMPRESSQ xmm1/m128 {k1}{z}, xmm2 EVEX.256.66.0F38.W1 8B /r VPCOMPRESSQ ymm1/m256 {k1}{z}, ymm2 EVEX.512.66.0F38.W1 8B /r VPCOMPRESSQ zmm1/m512 {k1}{z}, zmm2 T1S 64/32 bit Mode Support V/V T1S V/V T1S V/V CPUID Feature Flag AVX512VL AVX512F AVX512VL AVX512F AVX512F Description Compress packed quadword integer values from xmm2 to xmm1/m128 using controlmask k1. Compress packed quadword integer values from ymm2 to ymm1/m256 using controlmask k1. Compress packed quadword integer values from zmm2 to zmm1/m512 using controlmask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 T1S ModRM:r/m (w) ModRM:reg (r) NA NA Description Compress (stores) up to 8/4/2 quadword integer values from the source operand (second operand) to the destination operand (first operand). The source operand is a ZMM/YMM/XMM register, the destination operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The opmask register k1 selects the active elements (partial vector or possibly non-contiguous if less than 8 active elements) from the source operand to compress into a contiguous vector. The contiguous vector is written to the destination starting from the low element of the destination operand. Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z must be zero. Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the source operand, the upper bits of the destination register are unmodified if EVEX.z is not set, otherwise the upper bits are zeroed. Note: EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD. Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element instead of the size of the full vector. 
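A minimal left-packing (stream compaction) sketch built on the memory-destination form just described, assuming AVX-512F; the helper name is illustrative:

#include <immintrin.h>
#include <stdint.h>

/* Write only the non-zero qwords of an 8-element block to *dst,
   contiguously from the low address, and return how many were stored.
   The compare supplies the opmask; VPCOMPRESSQ does the packing. */
int compact_nonzero_qwords(int64_t *dst, const int64_t *src)
{
    __m512i v = _mm512_loadu_si512(src);
    __mmask8 keep = _mm512_cmp_epi64_mask(v, _mm512_setzero_si512(), 4 /* NEQ */);
    _mm512_mask_compressstoreu_epi64(dst, keep, v);
    return (int)_mm_popcnt_u32((unsigned)keep);
}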
Operation

VPCOMPRESSQ (EVEX encoded versions) store form
(KL, VL) = (2, 128), (4, 256), (8, 512)
SIZE ← 64
k ← 0
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no controlmask*
        THEN
            DEST[k+SIZE-1:k] ← SRC[i+63:i]
            k ← k + SIZE
    FI;
ENDFOR

VPCOMPRESSQ (EVEX encoded versions) reg-reg form
(KL, VL) = (2, 128), (4, 256), (8, 512)
SIZE ← 64
k ← 0
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no controlmask*
        THEN
            DEST[k+SIZE-1:k] ← SRC[i+63:i]
            k ← k + SIZE
    FI;
ENDFOR
IF *merging-masking*
    THEN *DEST[VL-1:k] remains unchanged*
    ELSE DEST[VL-1:k] ← 0
FI
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VPCOMPRESSQ __m512i _mm512_mask_compress_epi64(__m512i s, __mmask8 c, __m512i a);
VPCOMPRESSQ __m512i _mm512_maskz_compress_epi64( __mmask8 c, __m512i a);
VPCOMPRESSQ void _mm512_mask_compressstoreu_epi64(void * a, __mmask8 c, __m512i s);
VPCOMPRESSQ __m256i _mm256_mask_compress_epi64(__m256i s, __mmask8 c, __m256i a);
VPCOMPRESSQ __m256i _mm256_maskz_compress_epi64( __mmask8 c, __m256i a);
VPCOMPRESSQ void _mm256_mask_compressstoreu_epi64(void * a, __mmask8 c, __m256i s);
VPCOMPRESSQ __m128i _mm_mask_compress_epi64(__m128i s, __mmask8 c, __m128i a);
VPCOMPRESSQ __m128i _mm_maskz_compress_epi64( __mmask8 c, __m128i a);
VPCOMPRESSQ void _mm_mask_compressstoreu_epi64(void * a, __mmask8 c, __m128i s);

SIMD Floating-Point Exceptions
None

Other Exceptions
EVEX-encoded instruction, see Exceptions Type E4.nb.

VPCONFLICTD/Q—Detect Conflicts Within a Vector of Packed Dword/Qword Values into Dense Memory/Register

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 C4 /r VPCONFLICTD xmm1 {k1}{z}, xmm2/m128/m32bcst | FV | V/V | AVX512VL AVX512CD | Detect duplicate double-word values in xmm2/m128/m32bcst using writemask k1.
EVEX.256.66.0F38.W0 C4 /r VPCONFLICTD ymm1 {k1}{z}, ymm2/m256/m32bcst | FV | V/V | AVX512VL AVX512CD | Detect duplicate double-word values in ymm2/m256/m32bcst using writemask k1.
EVEX.512.66.0F38.W0 C4 /r VPCONFLICTD zmm1 {k1}{z}, zmm2/m512/m32bcst | FV | V/V | AVX512CD | Detect duplicate double-word values in zmm2/m512/m32bcst using writemask k1.
EVEX.128.66.0F38.W1 C4 /r VPCONFLICTQ xmm1 {k1}{z}, xmm2/m128/m64bcst | FV | V/V | AVX512VL AVX512CD | Detect duplicate quad-word values in xmm2/m128/m64bcst using writemask k1.
EVEX.256.66.0F38.W1 C4 /r VPCONFLICTQ ymm1 {k1}{z}, ymm2/m256/m64bcst | FV | V/V | AVX512VL AVX512CD | Detect duplicate quad-word values in ymm2/m256/m64bcst using writemask k1.
EVEX.512.66.0F38.W1 C4 /r VPCONFLICTQ zmm1 {k1}{z}, zmm2/m512/m64bcst | FV | V/V | AVX512CD | Detect duplicate quad-word values in zmm2/m512/m64bcst using writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
Test each dword/qword element of the source operand (the second operand) for equality with all other elements in the source operand closer to the least significant element. Each element’s comparison results form a bit vector, which is then zero extended and written to the destination according to the writemask.
EVEX.512 encoded version: The source operand is a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 32/64-bit memory location. The destination operand is a ZMM register, conditionally updated using writemask k1.
EVEX.256 encoded version: The source operand is a YMM register, a 256-bit memory location, or a 256-bit vector broadcasted from a 32/64-bit memory location. The destination operand is a YMM register, conditionally updated using writemask k1. EVEX.128 encoded version: The source operand is a XMM register, a 128-bit memory location, or a 128-bit vector broadcasted from a 32/64-bit memory location. The destination operand is a XMM register, conditionally updated using writemask k1. EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD. VPCONFLICTD/Q—Detect Conflicts Within a Vector of Packed Dword/Qword Values into Dense Memory/ Register Vol. 2C 5-355 INSTRUCTION SET REFERENCE, V-Z Operation VPCONFLICTD (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j*32 IF MaskBit(j) OR *no writemask*THEN FOR k 0 TO j-1 m k*32 IF ((SRC[i+31:i] = SRC[m+31:m])) THEN DEST[i+k] 1 ELSE DEST[i+k] 0 FI ENDFOR DEST[i+31:i+j] 0 ELSE IF *merging-masking* THEN *DEST[i+31:i] remains unchanged* ELSE DEST[i+31:i] 0 FI FI ENDFOR DEST[MAX_VL-1:VL] ← 0 VPCONFLICTQ (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j*64 IF MaskBit(j) OR *no writemask*THEN FOR k 0 TO j-1 m k*64 IF ((SRC[i+63:i] = SRC[m+63:m])) THEN DEST[i+k] 1 ELSE DEST[i+k] 0 FI ENDFOR DEST[i+63:i+j] 0 ELSE IF *merging-masking* THEN *DEST[i+63:i] remains unchanged* ELSE DEST[i+63:i] 0 FI FI ENDFOR DEST[MAX_VL-1:VL] 0 5-356 Vol. 2C VPCONFLICTD/Q—Detect Conflicts Within a Vector of Packed Dword/Qword Values into Dense Memory/ Register INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VPCONFLICTD __m512i _mm512_conflict_epi32( __m512i a); VPCONFLICTD __m512i _mm512_mask_conflict_epi32(__m512i s, __mmask16 m, __m512i a); VPCONFLICTD __m512i _mm512_maskz_conflict_epi32(__mmask16 m, __m512i a); VPCONFLICTQ __m512i _mm512_conflict_epi64( __m512i a); VPCONFLICTQ __m512i _mm512_mask_conflict_epi64(__m512i s, __mmask8 m, __m512i a); VPCONFLICTQ __m512i _mm512_maskz_conflict_epi64(__mmask8 m, __m512i a); VPCONFLICTD __m256i _mm256_conflict_epi32( __m256i a); VPCONFLICTD __m256i _mm256_mask_conflict_epi32(__m256i s, __mmask8 m, __m256i a); VPCONFLICTD __m256i _mm256_maskz_conflict_epi32(__mmask8 m, __m256i a); VPCONFLICTQ __m256i _mm256_conflict_epi64( __m256i a); VPCONFLICTQ __m256i _mm256_mask_conflict_epi64(__m256i s, __mmask8 m, __m256i a); VPCONFLICTQ __m256i _mm256_maskz_conflict_epi64(__mmask8 m, __m256i a); VPCONFLICTD __m128i _mm_conflict_epi32( __m128i a); VPCONFLICTD __m128i _mm_mask_conflict_epi32(__m128i s, __mmask8 m, __m128i a); VPCONFLICTD __m128i _mm_maskz_conflict_epi32(__mmask8 m, __m128i a); VPCONFLICTQ __m128i _mm_conflict_epi64( __m128i a); VPCONFLICTQ __m128i _mm_mask_conflict_epi64(__m128i s, __mmask8 m, __m128i a); VPCONFLICTQ __m128i _mm_maskz_conflict_epi64(__mmask8 m, __m128i a); SIMD Floating-Point Exceptions None Other Exceptions EVEX-encoded instruction, see Exceptions Type E4. VPCONFLICTD/Q—Detect Conflicts Within a Vector of Packed Dword/Qword Values into Dense Memory/ Register Vol. 2C 5-357 INSTRUCTION SET REFERENCE, V-Z VPERM2F128 — Permute Floating-Point Values Opcode/ Instruction Op/ En 64/32 bit Mode Support VEX.NDS.256.66.0F3A.W0 06 /r ib VPERM2F128 ymm1, ymm2, ymm3/m256, imm8 RVMI V/V CPUID Feature Flag Description AVX Permute 128-bit floating-point fields in ymm2 and ymm3/mem using controls from imm8 and store result in ymm1. 
Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RVMI | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | imm8

Description
Permute 128-bit floating-point-containing fields from the first source operand (second operand) and the second source operand (third operand) using bits in the 8-bit immediate, and store the results in the destination operand (first operand). The first source operand is a YMM register, the second source operand is a YMM register or a 256-bit memory location, and the destination operand is a YMM register.

Figure 5-21. VPERM2F128 Operation

Imm8[1:0] select the source for the first destination 128-bit field; imm8[5:4] select the source for the second destination field. If imm8[3] is set, the low 128-bit field is zeroed. If imm8[7] is set, the high 128-bit field is zeroed. VEX.L must be 1, otherwise the instruction will #UD.

Operation

VPERM2F128
CASE IMM8[1:0] of
    0: DEST[127:0] ← SRC1[127:0]
    1: DEST[127:0] ← SRC1[255:128]
    2: DEST[127:0] ← SRC2[127:0]
    3: DEST[127:0] ← SRC2[255:128]
ESAC
CASE IMM8[5:4] of
    0: DEST[255:128] ← SRC1[127:0]
    1: DEST[255:128] ← SRC1[255:128]
    2: DEST[255:128] ← SRC2[127:0]
    3: DEST[255:128] ← SRC2[255:128]
ESAC
IF (imm8[3]) DEST[127:0] ← 0 FI
IF (imm8[7]) DEST[VLMAX-1:128] ← 0 FI

Intel C/C++ Compiler Intrinsic Equivalent
VPERM2F128: __m256 _mm256_permute2f128_ps (__m256 a, __m256 b, int control)
VPERM2F128: __m256d _mm256_permute2f128_pd (__m256d a, __m256d b, int control)
VPERM2F128: __m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int control)

SIMD Floating-Point Exceptions
None.

Other Exceptions
See Exceptions Type 6; additionally
#UD If VEX.L = 0.
If VEX.W = 1.

VPERM2I128 — Permute Integer Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.NDS.256.66.0F3A.W0 46 /r ib VPERM2I128 ymm1, ymm2, ymm3/m256, imm8 | RVMI | V/V | AVX2 | Permute 128-bit integer data in ymm2 and ymm3/mem using controls from imm8 and store result in ymm1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RVMI | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | Imm8

Description
Permute 128-bit integer data from the first source operand (second operand) and the second source operand (third operand) using bits in the 8-bit immediate, and store the results in the destination operand (first operand). The first source operand is a YMM register, the second source operand is a YMM register or a 256-bit memory location, and the destination operand is a YMM register.

Figure 5-22. VPERM2I128 Operation

Imm8[1:0] select the source for the first destination 128-bit field; imm8[5:4] select the source for the second destination field. If imm8[3] is set, the low 128-bit field is zeroed. If imm8[7] is set, the high 128-bit field is zeroed. VEX.L must be 1, otherwise the instruction will #UD. 5-360 Vol.
2C VPERM2I128 — Permute Integer Values INSTRUCTION SET REFERENCE, V-Z Operation VPERM2I128 CASE IMM8[1:0] of 0: DEST[127:0] SRC1[127:0] 1: DEST[127:0] SRC1[255:128] 2: DEST[127:0] SRC2[127:0] 3: DEST[127:0] SRC2[255:128] ESAC CASE IMM8[5:4] of 0: DEST[255:128] SRC1[127:0] 1: DEST[255:128] SRC1[255:128] 2: DEST[255:128] SRC2[127:0] 3: DEST[255:128] SRC2[255:128] ESAC IF (imm8[3]) DEST[127:0] 0 FI IF (imm8[7]) DEST[255:128] 0 FI Intel C/C++ Compiler Intrinsic Equivalent VPERM2I128: __m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, int control) SIMD Floating-Point Exceptions None Other Exceptions See Exceptions Type 6; additionally #UD If VEX.L = 0, If VEX.W = 1. VPERM2I128 — Permute Integer Values Vol. 2C 5-361 INSTRUCTION SET REFERENCE, V-Z VPERMD/VPERMW—Permute Packed Doublewords/Words Elements Opcode/ Instruction Op / En VEX.NDS.256.66.0F38.W0 36 /r VPERMD ymm1, ymm2, ymm3/m256 EVEX.NDS.256.66.0F38.W0 36 /r VPERMD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst EVEX.NDS.512.66.0F38.W0 36 /r VPERMD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst EVEX.NDS.128.66.0F38.W1 8D /r VPERMW xmm1 {k1}{z}, xmm2, xmm3/m128 EVEX.NDS.256.66.0F38.W1 8D /r VPERMW ymm1 {k1}{z}, ymm2, ymm3/m256 EVEX.NDS.512.66.0F38.W1 8D /r VPERMW zmm1 {k1}{z}, zmm2, zmm3/m512 RVM 64/32 bit Mode Support V/V CPUID Feature Flag AVX2 FV V/V AVX512VL AVX512F FV V/V AVX512F FVM V/V AVX512VL AVX512BW FVM V/V AVX512VL AVX512BW Permute word integers in ymm3/m256 using indexes in ymm2 and store the result in ymm1 using writemask k1. FVM V/V AVX512BW Permute word integers in zmm3/m512 using indexes in zmm2 and store the result in zmm1 using writemask k1. Description Permute doublewords in ymm3/m256 using indices in ymm2 and store the result in ymm1. Permute doublewords in ymm3/m256/m32bcst using indexes in ymm2 and store the result in ymm1 using writemask k1. Permute doublewords in zmm3/m512/m32bcst using indices in zmm2 and store the result in zmm1 using writemask k1. Permute word integers in xmm3/m128 using indexes in xmm2 and store the result in xmm1 using writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RVM ModRM:reg (w) VEX.vvvv ModRM:r/m (r) NA FV ModRM:reg (w) EVEX.vvvv ModRM:r/m (r) NA FVM ModRM:reg (w) VEX.vvvv ModRM:r/m (r) NA Description Copies doublewords (or words) from the second source operand (the third operand) to the destination operand (the first operand) according to the indices in the first source operand (the second operand). Note that this instruction permits a doubleword (word) in the source operand to be copied to more than one location in the destination operand. VEX.256 encoded VPERMD: The first and second operands are YMM registers, the third operand can be a YMM register or memory location. Bits (MAX_VL-1:256) of the corresponding destination register are zeroed. EVEX encoded VPERMD: The first and second operands are ZMM/YMM registers, the third operand can be a ZMM/YMM register, a 512/256-bit memory location or a 512/256-bit vector broadcasted from a 32-bit memory location. The elements in the destination are updated using the writemask k1. VPERMW: first and second operands are ZMM/YMM/XMM registers, the third operand can be a ZMM/YMM/XMM register, or a 512/256/128-bit memory location. The destination is updated using the writemask k1. EVEX.128 encoded versions: Bits (MAX_VL-1:128) of the corresponding ZMM register are zeroed. 5-362 Vol. 
2C VPERMD/VPERMW—Permute Packed Doublewords/Words Elements INSTRUCTION SET REFERENCE, V-Z Operation VPERMD (EVEX encoded versions) (KL, VL) = (8, 256), (16, 512) IF VL = 256 THEN n 2; FI; IF VL = 512 THEN n 3; FI; FOR j 0 TO KL-1 i j * 32 id 32*SRC1[i+n:i] IF k1[j] OR *no writemask* THEN IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN DEST[i+31:i] SRC2[31:0]; ELSE DEST[i+31:i] SRC2[id+31:id]; FI; ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPERMD (VEX.256 encoded version) DEST[31:0] (SRC2[255:0] >> (SRC1[2:0] * 32))[31:0]; DEST[63:32] (SRC2[255:0] >> (SRC1[34:32] * 32))[31:0]; DEST[95:64] (SRC2[255:0] >> (SRC1[66:64] * 32))[31:0]; DEST[127:96] (SRC2[255:0] >> (SRC1[98:96] * 32))[31:0]; DEST[159:128] (SRC2[255:0] >> (SRC1[130:128] * 32))[31:0]; DEST[191:160] (SRC2[255:0] >> (SRC1[162:160] * 32))[31:0]; DEST[223:192] (SRC2[255:0] >> (SRC1[194:192] * 32))[31:0]; DEST[255:224] (SRC2[255:0] >> (SRC1[226:224] * 32))[31:0]; DEST[MAX_VL-1:256] 0 VPERMW (EVEX encoded versions) (KL, VL) = (8, 128), (16, 256), (32, 512) IF VL = 128 THEN n 2; FI; IF VL = 256 THEN n 3; FI; IF VL = 512 THEN n 4; FI; FOR j 0 TO KL-1 i j * 16 id 16*SRC1[i+n:i] IF k1[j] OR *no writemask* THEN DEST[i+15:i] SRC2[id+15:id] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+15:i] remains unchanged* ELSE ; zeroing-masking DEST[i+15:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPERMD/VPERMW—Permute Packed Doublewords/Words Elements Vol. 2C 5-363 INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VPERMD __m512i _mm512_permutexvar_epi32( __m512i idx, __m512i a); VPERMD __m512i _mm512_mask_permutexvar_epi32(__m512i s, __mmask16 k, __m512i idx, __m512i a); VPERMD __m512i _mm512_maskz_permutexvar_epi32( __mmask16 k, __m512i idx, __m512i a); VPERMD __m256i _mm256_permutexvar_epi32( __m256i idx, __m256i a); VPERMD __m256i _mm256_mask_permutexvar_epi32(__m256i s, __mmask8 k, __m256i idx, __m256i a); VPERMD __m256i _mm256_maskz_permutexvar_epi32( __mmask8 k, __m256i idx, __m256i a); VPERMW __m512i _mm512_permutexvar_epi16( __m512i idx, __m512i a); VPERMW __m512i _mm512_mask_permutexvar_epi16(__m512i s, __mmask32 k, __m512i idx, __m512i a); VPERMW __m512i _mm512_maskz_permutexvar_epi16( __mmask32 k, __m512i idx, __m512i a); VPERMW __m256i _mm256_permutexvar_epi16( __m256i idx, __m256i a); VPERMW __m256i _mm256_mask_permutexvar_epi16(__m256i s, __mmask16 k, __m256i idx, __m256i a); VPERMW __m256i _mm256_maskz_permutexvar_epi16( __mmask16 k, __m256i idx, __m256i a); VPERMW __m128i _mm_permutexvar_epi16( __m128i idx, __m128i a); VPERMW __m128i _mm_mask_permutexvar_epi16(__m128i s, __mmask8 k, __m128i idx, __m128i a); VPERMW __m128i _mm_maskz_permutexvar_epi16( __mmask8 k, __m128i idx, __m128i a); SIMD Floating-Point Exceptions None Other Exceptions Non-EVEX-encoded instruction, see Exceptions Type 4. EVEX-encoded VPERMD, see Exceptions Type E4NF. EVEX-encoded VPERMW, see Exceptions Type E4NF.nb. #UD If VEX.L = 0. If EVEX.L’L = 0 for VPERMD. 5-364 Vol. 
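For example, the index-vector form of VPERMD can reverse the sixteen dword lanes of a ZMM register; a minimal sketch, assuming AVX-512F:

#include <immintrin.h>

/* Result lane j receives source lane idx[j]; idx[j] = 15-j reverses the
   vector. Only the low bits of each index element are used, per the
   Operation section above. */
__m512i reverse_dword_lanes(__m512i a)
{
    const __m512i idx = _mm512_set_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                         8, 9, 10, 11, 12, 13, 14, 15);
    return _mm512_permutexvar_epi32(idx, a);
}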
2C VPERMD/VPERMW—Permute Packed Doublewords/Words Elements INSTRUCTION SET REFERENCE, V-Z VPERMI2W/D/Q/PS/PD—Full Permute From Two Tables Overwriting the Index Opcode/ Instruction Op / En FVM 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512BW EVEX.DDS.128.66.0F38.W1 75 /r VPERMI2W xmm1 {k1}{z}, xmm2, xmm3/m128 EVEX.DDS.256.66.0F38.W1 75 /r VPERMI2W ymm1 {k1}{z}, ymm2, ymm3/m256 EVEX.DDS.512.66.0F38.W1 75 /r VPERMI2W zmm1 {k1}{z}, zmm2, zmm3/m512 EVEX.DDS.128.66.0F38.W0 76 /r VPERMI2D xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst FVM V/V AVX512VL AVX512BW FVM V/V AVX512BW FV V/V AVX512VL AVX512F EVEX.DDS.256.66.0F38.W0 76 /r VPERMI2D ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst FV V/V AVX512VL AVX512F EVEX.DDS.512.66.0F38.W0 76 /r VPERMI2D zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst FV V/V AVX512F EVEX.DDS.128.66.0F38.W1 76 /r VPERMI2Q xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst FV V/V AVX512VL AVX512F EVEX.DDS.256.66.0F38.W1 76 /r VPERMI2Q ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst FV V/V AVX512VL AVX512F EVEX.DDS.512.66.0F38.W1 76 /r VPERMI2Q zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst FV V/V AVX512F EVEX.DDS.128.66.0F38.W0 77 /r VPERMI2PS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst FV V/V AVX512VL AVX512F EVEX.DDS.256.66.0F38.W0 77 /r VPERMI2PS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst FV V/V AVX512VL AVX512F EVEX.DDS.512.66.0F38.W0 77 /r VPERMI2PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst FV V/V AVX512F VPERMI2W/D/Q/PS/PD—Full Permute From Two Tables Overwriting the Index Description Permute word integers from two tables in xmm3/m128 and xmm2 using indexes in xmm1 and store the result in xmm1 using writemask k1. Permute word integers from two tables in ymm3/m256 and ymm2 using indexes in ymm1 and store the result in ymm1 using writemask k1. Permute word integers from two tables in zmm3/m512 and zmm2 using indexes in zmm1 and store the result in zmm1 using writemask k1. Permute double-words from two tables in xmm3/m128/m32bcst and xmm2 using indexes in xmm1 and store the result in xmm1 using writemask k1. Permute double-words from two tables in ymm3/m256/m32bcst and ymm2 using indexes in ymm1 and store the result in ymm1 using writemask k1. Permute double-words from two tables in zmm3/m512/m32bcst and zmm2 using indices in zmm1 and store the result in zmm1 using writemask k1. Permute quad-words from two tables in xmm3/m128/m64bcst and xmm2 using indexes in xmm1 and store the result in xmm1 using writemask k1. Permute quad-words from two tables in ymm3/m256/m64bcst and ymm2 using indexes in ymm1 and store the result in ymm1 using writemask k1. Permute quad-words from two tables in zmm3/m512/m64bcst and zmm2 using indices in zmm1 and store the result in zmm1 using writemask k1. Permute single-precision FP values from two tables in xmm3/m128/m32bcst and xmm2 using indexes in xmm1 and store the result in xmm1 using writemask k1. Permute single-precision FP values from two tables in ymm3/m256/m32bcst and ymm2 using indexes in ymm1 and store the result in ymm1 using writemask k1. Permute single-precision FP values from two tables in zmm3/m512/m32bcst and zmm2 using indices in zmm1 and store the result in zmm1 using writemask k1. Vol. 
2C 5-365 INSTRUCTION SET REFERENCE, V-Z Opcode/ Instruction Op / En FV 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512F EVEX.DDS.128.66.0F38.W1 77 /r VPERMI2PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst EVEX.DDS.256.66.0F38.W1 77 /r VPERMI2PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst FV V/V AVX512VL AVX512F EVEX.DDS.512.66.0F38.W1 77 /r VPERMI2PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst FV V/V AVX512F Description Permute double-precision FP values from two tables in xmm3/m128/m64bcst and xmm2 using indexes in xmm1 and store the result in xmm1 using writemask k1. Permute double-precision FP values from two tables in ymm3/m256/m64bcst and ymm2 using indexes in ymm1 and store the result in ymm1 using writemask k1. Permute double-precision FP values from two tables in zmm3/m512/m64bcst and zmm2 using indices in zmm1 and store the result in zmm1 using writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 FVM ModRM:reg (r,w) EVEX.vvvv (r) ModRM:r/m (r) NA FV ModRM:reg (r, w) EVEX.vvvv (r) ModRM:r/m (r) NA Description Permutes 16-bit/32-bit/64-bit values in the second operand (the first source operand) and the third operand (the second source operand) using indices in the first operand to select elements from the second and third operands. The selected elements are written to the destination operand (the first operand) according to the writemask k1. The first and second operands are ZMM/YMM/XMM registers. The first operand contains input indices to select elements from the two input tables in the 2nd and 3rd operands. The first operand is also the destination of the result. D/Q/PS/PD element versions: The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. Broadcast from the low 32/64-bit memory location is performed if EVEX.b and the id bit for table selection are set (selecting table_2). Dword/PS versions: The id bit for table selection is bit 4/3/2, depending on VL=512, 256, 128. Bits [3:0]/[2:0]/[1:0] of each element in the input index vector select an element within the two source operands, If the id bit is 0, table_1 (the first source) is selected; otherwise the second source operand is selected. Qword/PD versions: The id bit for table selection is bit 3/2/1, and bits [2:0]/[1:0] /bit 0 selects element within each input table. Word element versions: The second source operand can be a ZMM/YMM/XMM register, or a 512/256/128-bit memory location. The id bit for table selection is bit 5/4/3, and bits [4:0]/[3:0]/[2:0] selects element within each input table. Note that these instructions permit a 16-bit/32-bit/64-bit value in the source operands to be copied to more than one location in the destination operand. Note also that in this case, the same table can be reused for example for a second iteration, while the index elements are overwritten. Bits (MAX_VL-1:256/128) of the destination are zeroed for VL=256,128. 5-366 Vol. 2C VPERMI2W/D/Q/PS/PD—Full Permute From Two Tables Overwriting the Index INSTRUCTION SET REFERENCE, V-Z Operation VPERMI2W (EVEX encoded versions) (KL, VL) = (8, 128), (16, 256), (32, 512) IF VL = 128 id 2 FI; IF VL = 256 id 3 FI; IF VL = 512 id 4 FI; TMP_DEST DEST FOR j 0 TO KL-1 i j * 16 off 16*TMP_DEST[i+id:i] IF k1[j] OR *no writemask* THEN DEST[i+15:i]=TMP_DEST[i+id+1] ? 
SRC2[off+15:off] : SRC1[off+15:off] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+15:i] remains unchanged* ELSE ; zeroing-masking DEST[i+15:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPERMI2D/VPERMI2PS (EVEX encoded versions) (KL, VL) = (4, 128), (8, 256), (16, 512) IF VL = 128 id 1 FI; IF VL = 256 id 2 FI; IF VL = 512 id 3 FI; TMP_DEST DEST FOR j 0 TO KL-1 i j * 32 off 32*TMP_DEST[i+id:i] IF k1[j] OR *no writemask* THEN IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN DEST[i+31:i] TMP_DEST[i+id+1] ? SRC2[31:0] : SRC1[off+31:off] ELSE DEST[i+31:i] TMP_DEST[i+id+1] ? SRC2[off+31:off] : SRC1[off+31:off] VPERMI2W/D/Q/PS/PD—Full Permute From Two Tables Overwriting the Index Vol. 2C 5-367 INSTRUCTION SET REFERENCE, V-Z FI ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPERMI2Q/VPERMI2PD (EVEX encoded versions) (KL, VL) = (2, 128), (4, 256), (8 512) IF VL = 128 id 0 FI; IF VL = 256 id 1 FI; IF VL = 512 id 2 FI; TMP_DEST DEST FOR j 0 TO KL-1 i j * 64 off 64*TMP_DEST[i+id:i] IF k1[j] OR *no writemask* THEN IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN DEST[i+63:i] TMP_DEST[i+id+1] ? SRC2[63:0] : SRC1[off+63:off] ELSE DEST[i+63:i] TMP_DEST[i+id+1] ? SRC2[off+63:off] : SRC1[off+63:off] FI ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-368 Vol. 2C VPERMI2W/D/Q/PS/PD—Full Permute From Two Tables Overwriting the Index INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VPERMI2D __m512i _mm512_permutex2var_epi32(__m512i a, __m512i idx, __m512i b); VPERMI2D __m512i _mm512_mask_permutex2var_epi32(__m512i a, __mmask16 k, __m512i idx, __m512i b); VPERMI2D __m512i _mm512_mask2_permutex2var_epi32(__m512i a, __m512i idx, __mmask16 k, __m512i b); VPERMI2D __m512i _mm512_maskz_permutex2var_epi32(__mmask16 k, __m512i a, __m512i idx, __m512i b); VPERMI __m256i _mm256_permutex2var_epi32(__m256i a, __m256i idx, __m256i b); VPERMI2D __m256i _mm256_mask_permutex2var_epi32(__m256i a, __mmask8 k, __m256i idx, __m256i b); VPERMI2D __m256i _mm256_mask2_permutex2var_epi32(__m256i a, __m256i idx, __mmask8 k, __m256i b); VPERMI2D __m256i _mm256_maskz_permutex2var_epi32(__mmask8 k, __m256i a, __m256i idx, __m256i b); VPERMI2D __m128i _mm_permutex2var_epi32(__m128i a, __m128i idx, __m128i b); VPERMI2D __m128i _mm_mask_permutex2var_epi32(__m128i a, __mmask8 k, __m128i idx, __m128i b); VPERMI2D __m128i _mm_mask2_permutex2var_epi32(__m128i a, __m128i idx, __mmask8 k, __m128i b); VPERMI2D __m128i _mm_maskz_permutex2var_epi32(__mmask8 k, __m128i a, __m128i idx, __m128i b); VPERMI2PD __m512d _mm512_permutex2var_pd(__m512d a, __m512i idx, __m512d b); VPERMI2PD __m512d _mm512_mask_permutex2var_pd(__m512d a, __mmask8 k, __m512i idx, __m512d b); VPERMI2PD __m512d _mm512_mask2_permutex2var_pd(__m512d a, __m512i idx, __mmask8 k, __m512d b); VPERMI2PD __m512d _mm512_maskz_permutex2var_pd(__mmask8 k, __m512d a, __m512i idx, __m512d b); VPERMI2PD __m256d _mm256_permutex2var_pd(__m256d a, __m256i idx, __m256d b); VPERMI2PD __m256d _mm256_mask_permutex2var_pd(__m256d a, __mmask8 k, __m256i idx, __m256d b); VPERMI2PD __m256d _mm256_mask2_permutex2var_pd(__m256d a, __m256i idx, __mmask8 k, __m256d b); VPERMI2PD __m256d _mm256_maskz_permutex2var_pd(__mmask8 k, __m256d a, __m256i idx, __m256d b); VPERMI2PD __m128d _mm_permutex2var_pd(__m128d a, __m128i idx, __m128d b); VPERMI2PD __m128d 
_mm_mask_permutex2var_pd(__m128d a, __mmask8 k, __m128i idx, __m128d b); VPERMI2PD __m128d _mm_mask2_permutex2var_pd(__m128d a, __m128i idx, __mmask8 k, __m128d b); VPERMI2PD __m128d _mm_maskz_permutex2var_pd(__mmask8 k, __m128d a, __m128i idx, __m128d b); VPERMI2PS __m512 _mm512_permutex2var_ps(__m512 a, __m512i idx, __m512 b); VPERMI2PS __m512 _mm512_mask_permutex2var_ps(__m512 a, __mmask16 k, __m512i idx, __m512 b); VPERMI2PS __m512 _mm512_mask2_permutex2var_ps(__m512 a, __m512i idx, __mmask16 k, __m512 b); VPERMI2PS __m512 _mm512_maskz_permutex2var_ps(__mmask16 k, __m512 a, __m512i idx, __m512 b); VPERMI2PS __m256 _mm256_permutex2var_ps(__m256 a, __m256i idx, __m256 b); VPERMI2PS __m256 _mm256_mask_permutex2var_ps(__m256 a, __mmask8 k, __m256i idx, __m256 b); VPERMI2PS __m256 _mm256_mask2_permutex2var_ps(__m256 a, __m256i idx, __mmask8 k, __m256 b); VPERMI2PS __m256 _mm256_maskz_permutex2var_ps(__mmask8 k, __m256 a, __m256i idx, __m256 b); VPERMI2PS __m128 _mm_permutex2var_ps(__m128 a, __m128i idx, __m128 b); VPERMI2PS __m128 _mm_mask_permutex2var_ps(__m128 a, __mmask8 k, __m128i idx, __m128 b); VPERMI2PS __m128 _mm_mask2_permutex2var_ps(__m128 a, __m128i idx, __mmask8 k, __m128 b); VPERMI2PS __m128 _mm_maskz_permutex2var_ps(__mmask8 k, __m128 a, __m128i idx, __m128 b); VPERMI2Q __m512i _mm512_permutex2var_epi64(__m512i a, __m512i idx, __m512i b); VPERMI2Q __m512i _mm512_mask_permutex2var_epi64(__m512i a, __mmask8 k, __m512i idx, __m512i b); VPERMI2Q __m512i _mm512_mask2_permutex2var_epi64(__m512i a, __m512i idx, __mmask8 k, __m512i b); VPERMI2Q __m512i _mm512_maskz_permutex2var_epi64(__mmask8 k, __m512i a, __m512i idx, __m512i b); VPERMI2Q __m256i _mm256_permutex2var_epi64(__m256i a, __m256i idx, __m256i b); VPERMI2Q __m256i _mm256_mask_permutex2var_epi64(__m256i a, __mmask8 k, __m256i idx, __m256i b); VPERMI2Q __m256i _mm256_mask2_permutex2var_epi64(__m256i a, __m256i idx, __mmask8 k, __m256i b); VPERMI2Q __m256i _mm256_maskz_permutex2var_epi64(__mmask8 k, __m256i a, __m256i idx, __m256i b); VPERMI2Q __m128i _mm_permutex2var_epi64(__m128i a, __m128i idx, __m128i b); VPERMI2Q __m128i _mm_mask_permutex2var_epi64(__m128i a, __mmask8 k, __m128i idx, __m128i b); VPERMI2Q __m128i _mm_mask2_permutex2var_epi64(__m128i a, __m128i idx, __mmask8 k, __m128i b); VPERMI2Q __m128i _mm_maskz_permutex2var_epi64(__mmask8 k, __m128i a, __m128i idx, __m128i b); VPERMI2W/D/Q/PS/PD—Full Permute From Two Tables Overwriting the Index Vol. 
2C 5-369 INSTRUCTION SET REFERENCE, V-Z VPERMI2W __m512i _mm512_permutex2var_epi16(__m512i a, __m512i idx, __m512i b); VPERMI2W __m512i _mm512_mask_permutex2var_epi16(__m512i a, __mmask32 k, __m512i idx, __m512i b); VPERMI2W __m512i _mm512_mask2_permutex2var_epi16(__m512i a, __m512i idx, __mmask32 k, __m512i b); VPERMI2W __m512i _mm512_maskz_permutex2var_epi16(__mmask32 k, __m512i a, __m512i idx, __m512i b); VPERMI2W __m256i _mm256_permutex2var_epi16(__m256i a, __m256i idx, __m256i b); VPERMI2W __m256i _mm256_mask_permutex2var_epi16(__m256i a, __mmask16 k, __m256i idx, __m256i b); VPERMI2W __m256i _mm256_mask2_permutex2var_epi16(__m256i a, __m256i idx, __mmask16 k, __m256i b); VPERMI2W __m256i _mm256_maskz_permutex2var_epi16(__mmask16 k, __m256i a, __m256i idx, __m256i b); VPERMI2W __m128i _mm_permutex2var_epi16(__m128i a, __m128i idx, __m128i b); VPERMI2W __m128i _mm_mask_permutex2var_epi16(__m128i a, __mmask8 k, __m128i idx, __m128i b); VPERMI2W __m128i _mm_mask2_permutex2var_epi16(__m128i a, __m128i idx, __mmask8 k, __m128i b); VPERMI2W __m128i _mm_maskz_permutex2var_epi16(__mmask8 k, __m128i a, __m128i idx, __m128i b); SIMD Floating-Point Exceptions None Other Exceptions VPERMI2D/Q/PS/PD: See Exceptions Type E4NF. VPERMI2W: See Exceptions Type E4NF.nb. 5-370 Vol. 2C VPERMI2W/D/Q/PS/PD—Full Permute From Two Tables Overwriting the Index INSTRUCTION SET REFERENCE, V-Z VPERMILPD—Permute In-Lane of Pairs of Double-Precision Floating-Point Values Opcode/ Instruction Op / En RVM 64/32 bit Mode Support V/V CPUID Feature Flag AVX VEX.NDS.128.66.0F38.W0 0D /r VPERMILPD xmm1, xmm2, xmm3/m128 VEX.NDS.256.66.0F38.W0 0D /r VPERMILPD ymm1, ymm2, ymm3/m256 RVM V/V AVX EVEX.NDS.128.66.0F38.W1 0D /r VPERMILPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst EVEX.NDS.256.66.0F38.W1 0D /r VPERMILPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst EVEX.NDS.512.66.0F38.W1 0D /r VPERMILPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst VEX.128.66.0F3A.W0 05 /r ib VPERMILPD xmm1, xmm2/m128, imm8 VEX.256.66.0F3A.W0 05 /r ib VPERMILPD ymm1, ymm2/m256, imm8 EVEX.128.66.0F3A.W1 05 /r ib VPERMILPD xmm1 {k1}{z}, xmm2/m128/m64bcst, imm8 EVEX.256.66.0F3A.W1 05 /r ib VPERMILPD ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8 EVEX.512.66.0F3A.W1 05 /r ib VPERMILPD zmm1 {k1}{z}, zmm2/m512/m64bcst, imm8 FV-RVM V/V AVX512VL AVX512F FV-RVM V/V AVX512VL AVX512F FV-RVM V/V AVX512F RM V/V AVX RM V/V AVX FV-RM V/V AVX512VL AVX512F FV-RM V/V AVX512VL AVX512F FV-RM V/V AVX512F Description Permute double-precision floating-point values in xmm2 using controls from xmm3/m128 and store result in xmm1. Permute double-precision floating-point values in ymm2 using controls from ymm3/m256 and store result in ymm1. Permute double-precision floating-point values in xmm2 using control from xmm3/m128/m64bcst and store the result in xmm1 using writemask k1. Permute double-precision floating-point values in ymm2 using control from ymm3/m256/m64bcst and store the result in ymm1 using writemask k1. Permute double-precision floating-point values in zmm2 using control from zmm3/m512/m64bcst and store the result in zmm1 using writemask k1. Permute double-precision floating-point values in xmm2/m128 using controls from imm8. Permute double-precision floating-point values in ymm2/m256 using controls from imm8. Permute double-precision floating-point values in xmm2/m128/m64bcst using controls from imm8 and store the result in xmm1 using writemask k1. Permute double-precision floating-point values in ymm2/m256/m64bcst using controls from imm8 and store the result in ymm1 using writemask k1. 
Permute double-precision floating-point values in zmm2/m512/m64bcst using controls from imm8 and store the result in zmm1 using writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RVM ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) NA RM ModRM:reg (w) ModRM:r/m (r) NA NA FV-RVM ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA FV-RM ModRM:reg (w) ModRM:r/m (r) NA NA VPERMILPD—Permute In-Lane of Pairs of Double-Precision Floating-Point Values Vol. 2C 5-371 INSTRUCTION SET REFERENCE, V-Z Description (variable control version) Permute pairs of double-precision floating-point values in the first source operand (second operand), each using a 1-bit control field residing in the corresponding quadword element of the second source operand (third operand). Permuted results are stored in the destination operand (first operand). The control bits are located at bit 1 of each quadword element (see Figure 5-24). Each control determines which of the source elements in an input pair is selected for the destination element. Each pair of source elements must lie in the same 128-bit region as the destination. EVEX version: The second source operand (third operand) is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. Permuted results are written to the destination under the writemask. Figure 5-23. VPERMILPD Operation VEX.256 encoded version: Bits (MAX_VL-1:256) of the corresponding ZMM register are zeroed. Figure 5-24. VPERMILPD Shuffle Control (immediate control version) Permute pairs of double-precision floating-point values in the first source operand (second operand), each pair using a 1-bit control field in the imm8 byte. Each element in the destination operand (first operand) uses a separate control bit of the imm8 byte. VEX version: The source operand is a YMM/XMM register or a 256/128-bit memory location and the destination operand is a YMM/XMM register. The imm8 byte provides the lower 4/2 bits as permute control fields. EVEX version: The source operand (second operand) is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. Permuted results are written to the destination under the writemask. The imm8 byte provides the lower 8/4/2 bits as permute control fields. Note: For the imm8 versions, VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise the instruction will #UD. 5-372 Vol.
2C VPERMILPD—Permute In-Lane of Pairs of Double-Precision Floating-Point Values INSTRUCTION SET REFERENCE, V-Z Operation VPERMILPD (EVEX immediate versions) (KL, VL) = (8, 512) FOR j 0 TO KL-1 i j * 64 IF (EVEX.b = 1) AND (SRC1 *is memory*) THEN TMP_SRC1[i+63:i] SRC1[63:0]; ELSE TMP_SRC1[i+63:i] SRC1[i+63:i]; FI; ENDFOR; IF (imm8[0] = 0) THEN TMP_DEST[63:0] SRC1[63:0]; FI; IF (imm8[0] = 1) THEN TMP_DEST[63:0] TMP_SRC1[127:64]; FI; IF (imm8[1] = 0) THEN TMP_DEST[127:64] TMP_SRC1[63:0]; FI; IF (imm8[1] = 1) THEN TMP_DEST[127:64] TMP_SRC1[127:64]; FI; IF VL >= 256 IF (imm8[2] = 0) THEN TMP_DEST[191:128] TMP_SRC1[191:128]; FI; IF (imm8[2] = 1) THEN TMP_DEST[191:128] TMP_SRC1[255:192]; FI; IF (imm8[3] = 0) THEN TMP_DEST[255:192] TMP_SRC1[191:128]; FI; IF (imm8[3] = 1) THEN TMP_DEST[255:192] TMP_SRC1[255:192]; FI; FI; IF VL >= 512 IF (imm8[4] = 0) THEN TMP_DEST[319:256] TMP_SRC1[319:256]; FI; IF (imm8[4] = 1) THEN TMP_DEST[319:256] TMP_SRC1[383:320]; FI; IF (imm8[5] = 0) THEN TMP_DEST[383:320] TMP_SRC1[319:256]; FI; IF (imm8[5] = 1) THEN TMP_DEST[383:320] TMP_SRC1[383:320]; FI; IF (imm8[6] = 0) THEN TMP_DEST[447:384] TMP_SRC1[447:384]; FI; IF (imm8[6] = 1) THEN TMP_DEST[447:384] TMP_SRC1[511:448]; FI; IF (imm8[7] = 0) THEN TMP_DEST[511:448] TMP_SRC1[447:384]; FI; IF (imm8[7] = 1) THEN TMP_DEST[511:448] TMP_SRC1[511:448]; FI; FI; FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] TMP_DEST[i+63:i] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPERMILPD (256-bit immediate version) IF (imm8[0] = 0) THEN DEST[63:0]SRC1[63:0] IF (imm8[0] = 1) THEN DEST[63:0]SRC1[127:64] IF (imm8[1] = 0) THEN DEST[127:64]SRC1[63:0] IF (imm8[1] = 1) THEN DEST[127:64]SRC1[127:64] IF (imm8[2] = 0) THEN DEST[191:128]SRC1[191:128] IF (imm8[2] = 1) THEN DEST[191:128]SRC1[255:192] IF (imm8[3] = 0) THEN DEST[255:192]SRC1[191:128] IF (imm8[3] = 1) THEN DEST[255:192]SRC1[255:192] DEST[MAX_VL-1:256]0 VPERMILPD—Permute In-Lane of Pairs of Double-Precision Floating-Point Values Vol. 
2C 5-373 INSTRUCTION SET REFERENCE, V-Z VPERMILPD (128-bit immediate version) IF (imm8[0] = 0) THEN DEST[63:0]SRC1[63:0] IF (imm8[0] = 1) THEN DEST[63:0]SRC1[127:64] IF (imm8[1] = 0) THEN DEST[127:64]SRC1[63:0] IF (imm8[1] = 1) THEN DEST[127:64]SRC1[127:64] DEST[MAX_VL-1:128]0 VPERMILPD (EVEX variable versions) (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN TMP_SRC2[i+63:i] SRC2[63:0]; ELSE TMP_SRC2[i+63:i] SRC2[i+63:i]; FI; ENDFOR; IF (TMP_SRC2[1] = 0) THEN TMP_DEST[63:0] SRC1[63:0]; FI; IF (TMP_SRC2[1] = 1) THEN TMP_DEST[63:0] SRC1[127:64]; FI; IF (TMP_SRC2[65] = 0) THEN TMP_DEST[127:64] SRC1[63:0]; FI; IF (TMP_SRC2[65] = 1) THEN TMP_DEST[127:64] SRC1[127:64]; FI; IF VL >= 256 IF (TMP_SRC2[129] = 0) THEN TMP_DEST[191:128] SRC1[191:128]; FI; IF (TMP_SRC2[129] = 1) THEN TMP_DEST[191:128] SRC1[255:192]; FI; IF (TMP_SRC2[193] = 0) THEN TMP_DEST[255:192] SRC1[191:128]; FI; IF (TMP_SRC2[193] = 1) THEN TMP_DEST[255:192] SRC1[255:192]; FI; FI; IF VL >= 512 IF (TMP_SRC2[257] = 0) THEN TMP_DEST[319:256] SRC1[319:256]; FI; IF (TMP_SRC2[257] = 1) THEN TMP_DEST[319:256] SRC1[383:320]; FI; IF (TMP_SRC2[321] = 0) THEN TMP_DEST[383:320] SRC1[319:256]; FI; IF (TMP_SRC2[321] = 1) THEN TMP_DEST[383:320] SRC1[383:320]; FI; IF (TMP_SRC2[385] = 0) THEN TMP_DEST[447:384] SRC1[447:384]; FI; IF (TMP_SRC2[385] = 1) THEN TMP_DEST[447:384] SRC1[511:448]; FI; IF (TMP_SRC2[449] = 0) THEN TMP_DEST[511:448] SRC1[447:384]; FI; IF (TMP_SRC2[449] = 1) THEN TMP_DEST[511:448] SRC1[511:448]; FI; FI; FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] TMP_DEST[i+63:i] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-374 Vol. 
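Example (informative): the variable form selects on bit 1 of each control qword, not bit 0, as the pseudocode above (SRC2[1], SRC2[65], ...) shows; this is a common pitfall. A minimal sketch using the _mm256_permutevar_pd intrinsic listed at the end of this section, assuming an AVX-capable compiler (e.g., -mavx); names are illustrative.
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256d a = _mm256_setr_pd(1.0, 2.0, 3.0, 4.0);
    /* Bit 1 (value 2) of each control qword selects the high element of
       its 128-bit lane; bit 0 is ignored by the variable form. */
    __m256i ctl = _mm256_setr_epi64x(2, 0, 2, 0);
    __m256d r = _mm256_permutevar_pd(a, ctl);   /* -> 2.0 1.0 4.0 3.0 */

    double out[4];
    _mm256_storeu_pd(out, r);
    printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
    return 0;
}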
2C VPERMILPD—Permute In-Lane of Pairs of Double-Precision Floating-Point Values INSTRUCTION SET REFERENCE, V-Z VPERMILPD (256-bit variable version) IF (SRC2[1] = 0) THEN DEST[63:0]SRC1[63:0] IF (SRC2[1] = 1) THEN DEST[63:0]SRC1[127:64] IF (SRC2[65] = 0) THEN DEST[127:64]SRC1[63:0] IF (SRC2[65] = 1) THEN DEST[127:64]SRC1[127:64] IF (SRC2[129] = 0) THEN DEST[191:128]SRC1[191:128] IF (SRC2[129] = 1) THEN DEST[191:128]SRC1[255:192] IF (SRC2[193] = 0) THEN DEST[255:192]SRC1[191:128] IF (SRC2[193] = 1) THEN DEST[255:192]SRC1[255:192] DEST[MAX_VL-1:256]0 VPERMILPD (128-bit variable version) IF (SRC2[1] = 0) THEN DEST[63:0]SRC1[63:0] IF (SRC2[1] = 1) THEN DEST[63:0]SRC1[127:64] IF (SRC2[65] = 0) THEN DEST[127:64]SRC1[63:0] IF (SRC2[65] = 1) THEN DEST[127:64]SRC1[127:64] DEST[MAX_VL-1:128]0 Intel C/C++ Compiler Intrinsic Equivalent VPERMILPD __m512d _mm512_permute_pd( __m512d a, int imm); VPERMILPD __m512d _mm512_mask_permute_pd(__m512d s, __mmask8 k, __m512d a, int imm); VPERMILPD __m512d _mm512_maskz_permute_pd( __mmask8 k, __m512d a, int imm); VPERMILPD __m256d _mm256_mask_permute_pd(__m256d s, __mmask8 k, __m256d a, int imm); VPERMILPD __m256d _mm256_maskz_permute_pd( __mmask8 k, __m256d a, int imm); VPERMILPD __m128d _mm_mask_permute_pd(__m128d s, __mmask8 k, __m128d a, int imm); VPERMILPD __m128d _mm_maskz_permute_pd( __mmask8 k, __m128d a, int imm); VPERMILPD __m512d _mm512_permutevar_pd( __m512i i, __m512d a); VPERMILPD __m512d _mm512_mask_permutevar_pd(__m512d s, __mmask8 k, __m512i i, __m512d a); VPERMILPD __m512d _mm512_maskz_permutevar_pd( __mmask8 k, __m512i i, __m512d a); VPERMILPD __m256d _mm256_mask_permutevar_pd(__m256d s, __mmask8 k, __m256d i, __m256d a); VPERMILPD __m256d _mm256_maskz_permutevar_pd( __mmask8 k, __m256d i, __m256d a); VPERMILPD __m128d _mm_mask_permutevar_pd(__m128d s, __mmask8 k, __m128d i, __m128d a); VPERMILPD __m128d _mm_maskz_permutevar_pd( __mmask8 k, __m128d i, __m128d a); VPERMILPD __m128d _mm_permute_pd (__m128d a, int control) VPERMILPD __m256d _mm256_permute_pd (__m256d a, int control) VPERMILPD __m128d _mm_permutevar_pd (__m128d a, __m128i control); VPERMILPD __m256d _mm256_permutevar_pd (__m256d a, __m256i control); SIMD Floating-Point Exceptions None Other Exceptions Non-EVEX-encoded instruction, see Exceptions Type 4; additionally #UD If VEX.W = 1. EVEX-encoded instruction, see Exceptions Type E4NF. #UD If either (E)VEX.vvvv != 1111B and with imm8. VPERMILPD—Permute In-Lane of Pairs of Double-Precision Floating-Point Values Vol. 
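Example (informative): a worked imm8 for the immediate form, assuming AVX (-mavx); names are illustrative. With imm8 = 0101b, each set bit selects the high element of its pair, so the elements of both 128-bit lanes are swapped internally.
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256d a = _mm256_setr_pd(1.0, 2.0, 3.0, 4.0);
    /* imm8 = 0b0101: bits 0/1 control lane 0, bits 2/3 control lane 1;
       a set bit selects the high element, so 0x5 swaps within each lane. */
    __m256d r = _mm256_permute_pd(a, 0x5);      /* -> 2.0 1.0 4.0 3.0 */

    double out[4];
    _mm256_storeu_pd(out, r);
    printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
    return 0;
}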
2C 5-375 INSTRUCTION SET REFERENCE, V-Z VPERMILPS—Permute In-Lane of Quadruples of Single-Precision Floating-Point Values Opcode/ Instruction Op / En RVM 64/32 bit Mode Support V/V CPUID Feature Flag AVX VEX.NDS.128.66.0F38.W0 0C /r VPERMILPS xmm1, xmm2, xmm3/m128 VEX.128.66.0F3A.W0 04 /r ib VPERMILPS xmm1, xmm2/m128, imm8 RM V/V AVX VEX.NDS.256.66.0F38.W0 0C /r VPERMILPS ymm1, ymm2, ymm3/m256 RVM V/V AVX VEX.256.66.0F3A.W0 04 /r ib VPERMILPS ymm1, ymm2/m256, imm8 RM V/V AVX EVEX.NDS.128.66.0F38.W0 0C /r VPERMILPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst EVEX.NDS.256.66.0F38.W0 0C /r VPERMILPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst EVEX.NDS.512.66.0F38.W0 0C /r VPERMILPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst EVEX.128.66.0F3A.W0 04 /r ib VPERMILPS xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8 EVEX.256.66.0F3A.W0 04 /r ib VPERMILPS ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8 EVEX.512.66.0F3A.W0 04 /r ibVPERMILPS zmm1 {k1}{z}, zmm2/m512/m32bcst, imm8 FV-RVM V/V AVX512VL AVX512F FV-RVM V/V AVX512VL AVX512F FV-RVM V/V AVX512F FV-RM V/V AVX512VL AVX512F FV-RM V/V AVX512VL AVX512F FV-RM V/V AVX512F Description Permute single-precision floating-point values in xmm2 using controls from xmm3/m128 and store result in xmm1. Permute single-precision floating-point values in xmm2/m128 using controls from imm8 and store result in xmm1. Permute single-precision floating-point values in ymm2 using controls from ymm3/m256 and store result in ymm1. Permute single-precision floating-point values in ymm2/m256 using controls from imm8 and store result in ymm1. Permute single-precision floating-point values xmm2 using control from xmm3/m128/m32bcst and store the result in xmm1 using writemask k1. Permute single-precision floating-point values ymm2 using control from ymm3/m256/m32bcst and store the result in ymm1 using writemask k1. Permute single-precision floating-point values zmm2 using control from zmm3/m512/m32bcst and store the result in zmm1 using writemask k1. Permute single-precision floating-point values xmm2/m128/m32bcst using controls from imm8 and store the result in xmm1 using writemask k1. Permute single-precision floating-point values ymm2/m256/m32bcst using controls from imm8 and store the result in ymm1 using writemask k1. Permute single-precision floating-point values zmm2/m512/m32bcst using controls from imm8 and store the result in zmm1 using writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RVM ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) NA RM ModRM:reg (w) ModRM:r/m (r) NA NA FV-RVM ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA FV-RM ModRM:reg (w) ModRM:r/m (r) NA NA 5-376 Vol. 2C VPERMILPS—Permute In-Lane of Quadruples of Single-Precision Floating-Point Values INSTRUCTION SET REFERENCE, V-Z Description (variable control version) Permute quadruples of single-precision floating-point values in the first source operand (second operand), each quadruplet using a 2-bit control field in the corresponding dword element of the second source operand. Permuted results are stored in the destination operand (first operand). The 2-bit control fields are located at the low two bits of each dword element (see Figure 5-26). Each control determines which of the source element in an input quadruple is selected for the destination element. Each quadruple of source elements must lie in the same 128-bit region as the destination. 
EVEX version: The second source operand (third operand) is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. Permuted results are written to the destination under the writemask. Figure 5-25. VPERMILPS Operation Figure 5-26. VPERMILPS Shuffle Control (immediate control version) Permute quadruples of single-precision floating-point values in the first source operand (second operand), each quadruplet using a 2-bit control field in the imm8 byte. Each 128-bit lane in the destination operand (first operand) uses the four control fields of the same imm8 byte. VEX version: The source operand is a YMM/XMM register or a 256/128-bit memory location and the destination operand is a YMM/XMM register. EVEX version: The source operand (second operand) is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. Permuted results are written to the destination under the writemask. Note: For the imm8 version, VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise the instruction will #UD. VPERMILPS—Permute In-Lane of Quadruples of Single-Precision Floating-Point Values Vol. 2C 5-377 INSTRUCTION SET REFERENCE, V-Z Operation Select4(SRC, control) { CASE (control[1:0]) OF 0: TMP SRC[31:0]; 1: TMP SRC[63:32]; 2: TMP SRC[95:64]; 3: TMP SRC[127:96]; ESAC; RETURN TMP } VPERMILPS (EVEX immediate versions) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF (EVEX.b = 1) AND (SRC1 *is memory*) THEN TMP_SRC1[i+31:i] SRC1[31:0]; ELSE TMP_SRC1[i+31:i] SRC1[i+31:i]; FI; ENDFOR; TMP_DEST[31:0] Select4(TMP_SRC1[127:0], imm8[1:0]); TMP_DEST[63:32] Select4(TMP_SRC1[127:0], imm8[3:2]); TMP_DEST[95:64] Select4(TMP_SRC1[127:0], imm8[5:4]); TMP_DEST[127:96] Select4(TMP_SRC1[127:0], imm8[7:6]); IF VL >= 256 TMP_DEST[159:128] Select4(TMP_SRC1[255:128], imm8[1:0]); TMP_DEST[191:160] Select4(TMP_SRC1[255:128], imm8[3:2]); TMP_DEST[223:192] Select4(TMP_SRC1[255:128], imm8[5:4]); TMP_DEST[255:224] Select4(TMP_SRC1[255:128], imm8[7:6]); FI; IF VL >= 512 TMP_DEST[287:256] Select4(TMP_SRC1[383:256], imm8[1:0]); TMP_DEST[319:288] Select4(TMP_SRC1[383:256], imm8[3:2]); TMP_DEST[351:320] Select4(TMP_SRC1[383:256], imm8[5:4]); TMP_DEST[383:352] Select4(TMP_SRC1[383:256], imm8[7:6]); TMP_DEST[415:384] Select4(TMP_SRC1[511:384], imm8[1:0]); TMP_DEST[447:416] Select4(TMP_SRC1[511:384], imm8[3:2]); TMP_DEST[479:448] Select4(TMP_SRC1[511:384], imm8[5:4]); TMP_DEST[511:480] Select4(TMP_SRC1[511:384], imm8[7:6]); FI; FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] TMP_DEST[i+31:i] ELSE IF *merging-masking* THEN *DEST[i+31:i] remains unchanged* ELSE DEST[i+31:i] 0 ;zeroing-masking FI; FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-378 Vol.
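Example (informative): a worked imm8 for the immediate form in terms of Select4 above. _MM_SHUFFLE(0,1,2,3) encodes imm8 = 1Bh, and the same byte is applied to every 128-bit lane. A minimal sketch assuming AVX (-mavx); names are illustrative.
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 a = _mm256_setr_ps(0, 1, 2, 3, 4, 5, 6, 7);
    /* _MM_SHUFFLE(0,1,2,3) = 0x1B: each destination dword i takes
       Select4(lane, imm8[2i+1:2i]), reversing the order in each lane. */
    __m256 r = _mm256_permute_ps(a, _MM_SHUFFLE(0, 1, 2, 3));

    float out[8];
    _mm256_storeu_ps(out, r);
    for (int i = 0; i < 8; i++)
        printf("%.0f ", out[i]);   /* prints: 3 2 1 0 7 6 5 4 */
    printf("\n");
    return 0;
}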
2C VPERMILPS—Permute In-Lane of Quadruples of Single-Precision Floating-Point Values INSTRUCTION SET REFERENCE, V-Z VPERMILPS (256-bit immediate version) DEST[31:0] Select4(SRC1[127:0], imm8[1:0]); DEST[63:32] Select4(SRC1[127:0], imm8[3:2]); DEST[95:64] Select4(SRC1[127:0], imm8[5:4]); DEST[127:96] Select4(SRC1[127:0], imm8[7:6]); DEST[159:128] Select4(SRC1[255:128], imm8[1:0]); DEST[191:160] Select4(SRC1[255:128], imm8[3:2]); DEST[223:192] Select4(SRC1[255:128], imm8[5:4]); DEST[255:224] Select4(SRC1[255:128], imm8[7:6]); DEST[MAX_VL-1:256] 0 VPERMILPS (128-bit immediate version) DEST[31:0] Select4(SRC1[127:0], imm8[1:0]); DEST[63:32] Select4(SRC1[127:0], imm8[3:2]); DEST[95:64] Select4(SRC1[127:0], imm8[5:4]); DEST[127:96] Select4(SRC1[127:0], imm8[7:6]); DEST[MAX_VL-1:128] 0 VPERMILPS (EVEX variable versions) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN TMP_SRC2[i+31:i] SRC2[31:0]; ELSE TMP_SRC2[i+31:i] SRC2[i+31:i]; FI; ENDFOR; TMP_DEST[31:0] Select4(SRC1[127:0], TMP_SRC2[1:0]); TMP_DEST[63:32] Select4(SRC1[127:0], TMP_SRC2[33:32]); TMP_DEST[95:64] Select4(SRC1[127:0], TMP_SRC2[65:64]); TMP_DEST[127:96] Select4(SRC1[127:0], TMP_SRC2[97:96]); IF VL >= 256 TMP_DEST[159:128] Select4(SRC1[255:128], TMP_SRC2[129:128]); TMP_DEST[191:160] Select4(SRC1[255:128], TMP_SRC2[161:160]); TMP_DEST[223:192] Select4(SRC1[255:128], TMP_SRC2[193:192]); TMP_DEST[255:224] Select4(SRC1[255:128], TMP_SRC2[225:224]); FI; IF VL >= 512 TMP_DEST[287:256] Select4(SRC1[383:256], TMP_SRC2[257:256]); TMP_DEST[319:288] Select4(SRC1[383:256], TMP_SRC2[289:288]); TMP_DEST[351:320] Select4(SRC1[383:256], TMP_SRC2[321:320]); TMP_DEST[383:352] Select4(SRC1[383:256], TMP_SRC2[353:352]); TMP_DEST[415:384] Select4(SRC1[511:384], TMP_SRC2[385:384]); TMP_DEST[447:416] Select4(SRC1[511:384], TMP_SRC2[417:416]); TMP_DEST[479:448] Select4(SRC1[511:384], TMP_SRC2[449:448]); TMP_DEST[511:480] Select4(SRC1[511:384], TMP_SRC2[481:480]); FI; FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] TMP_DEST[i+31:i] ELSE IF *merging-masking* THEN *DEST[i+31:i] remains unchanged* ELSE DEST[i+31:i] 0 ;zeroing-masking VPERMILPS—Permute In-Lane of Quadruples of Single-Precision Floating-Point Values Vol.
2C 5-379 INSTRUCTION SET REFERENCE, V-Z FI; FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPERMILPS (256-bit variable version) DEST[31:0] Select4(SRC1[127:0], SRC2[1:0]); DEST[63:32] Select4(SRC1[127:0], SRC2[33:32]); DEST[95:64] Select4(SRC1[127:0], SRC2[65:64]); DEST[127:96] Select4(SRC1[127:0], SRC2[97:96]); DEST[159:128] Select4(SRC1[255:128], SRC2[129:128]); DEST[191:160] Select4(SRC1[255:128], SRC2[161:160]); DEST[223:192] Select4(SRC1[255:128], SRC2[193:192]); DEST[255:224] Select4(SRC1[255:128], SRC2[225:224]); DEST[MAX_VL-1:256]0 VPERMILPS (128-bit variable version) DEST[31:0] Select4(SRC1[127:0], SRC2[1:0]); DEST[63:32] Select4(SRC1[127:0], SRC2[33:32]); DEST[95:64] Select4(SRC1[127:0], SRC2[65:64]); DEST[127:96] Select4(SRC1[127:0], SRC2[97:96]); DEST[MAX_VL-1:128]0 Intel C/C++ Compiler Intrinsic Equivalent VPERMILPS __m512 _mm512_permute_ps( __m512 a, int imm); VPERMILPS __m512 _mm512_mask_permute_ps(__m512 s, __mmask16 k, __m512 a, int imm); VPERMILPS __m512 _mm512_maskz_permute_ps( __mmask16 k, __m512 a, int imm); VPERMILPS __m256 _mm256_mask_permute_ps(__m256 s, __mmask8 k, __m256 a, int imm); VPERMILPS __m256 _mm256_maskz_permute_ps( __mmask8 k, __m256 a, int imm); VPERMILPS __m128 _mm_mask_permute_ps(__m128 s, __mmask8 k, __m128 a, int imm); VPERMILPS __m128 _mm_maskz_permute_ps( __mmask8 k, __m128 a, int imm); VPERMILPS __m512 _mm512_permutevar_ps( __m512i i, __m512 a); VPERMILPS __m512 _mm512_mask_permutevar_ps(__m512 s, __mmask16 k, __m512i i, __m512 a); VPERMILPS __m512 _mm512_maskz_permutevar_ps( __mmask16 k, __m512i i, __m512 a); VPERMILPS __m256 _mm256_mask_permutevar_ps(__m256 s, __mmask8 k, __m256 i, __m256 a); VPERMILPS __m256 _mm256_maskz_permutevar_ps( __mmask8 k, __m256 i, __m256 a); VPERMILPS __m128 _mm_mask_permutevar_ps(__m128 s, __mmask8 k, __m128 i, __m128 a); VPERMILPS __m128 _mm_maskz_permutevar_ps( __mmask8 k, __m128 i, __m128 a); VPERMILPS __m128 _mm_permute_ps (__m128 a, int control); VPERMILPS __m256 _mm256_permute_ps (__m256 a, int control); VPERMILPS __m128 _mm_permutevar_ps (__m128 a, __m128i control); VPERMILPS __m256 _mm256_permutevar_ps (__m256 a, __m256i control); SIMD Floating-Point Exceptions None Other Exceptions Non-EVEX-encoded instruction, see Exceptions Type 4; #UD If VEX.W = 1. EVEX-encoded instruction, see Exceptions Type E4NF. #UD 5-380 Vol. 2C If either (E)VEX.vvvv != 1111B and with imm8. VPERMILPS—Permute In-Lane of Quadruples of Single-Precision Floating-Point Values INSTRUCTION SET REFERENCE, V-Z VPERMPD—Permute Double-Precision Floating-Point Elements Opcode/ Instruction Op / En RMI 64/32 bit Mode Support V/V CPUID Feature Flag AVX2 VEX.256.66.0F3A.W1 01 /r ib VPERMPD ymm1, ymm2/m256, imm8 EVEX.256.66.0F3A.W1 01 /r ib VPERMPD ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8 EVEX.512.66.0F3A.W1 01 /r ib VPERMPD zmm1 {k1}{z}, zmm2/m512/m64bcst, imm8 EVEX.NDS.256.66.0F38.W1 16 /r VPERMPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst EVEX.NDS.512.66.0F38.W1 16 /r VPERMPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst FV-RM V/V AVX512VL AVX512F FV-RMI V/V AVX512F FV-RVM V/V AVX512VL AVX512F FV-RVM V/V AVX512F Description Permute double-precision floating-point elements in ymm2/m256 using indices in imm8 and store the result in ymm1. Permute double-precision floating-point elements in ymm2/m256/m64bcst using indexes in imm8 and store the result in ymm1 subject to writemask k1. Permute double-precision floating-point elements in zmm2/m512/m64bcst using indices in imm8 and store the result in zmm1 subject to writemask k1. 
Permute double-precision floating-point elements in ymm3/m256/m64bcst using indices in ymm2 and store the result in ymm1 subject to writemask k1. Permute double-precision floating-point elements in zmm3/m512/m64bcst using indices in zmm2 and store the result in zmm1 subject to writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RMI ModRM:reg (w) ModRM:r/m (r) Imm8 NA FV-RMI ModRM:reg (w) ModRM:r/m (r) Imm8 NA FV-RVM ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA Description The imm8 version: Copies quadword elements of double-precision floating-point values from the source operand (the second operand) to the destination operand (the first operand) according to the indices specified by the immediate operand (the third operand). Each two-bit value in the immediate byte selects a qword element in the source operand. VEX version: The source operand can be a YMM register or a memory location. Bits (MAX_VL-1:256) of the corresponding destination register are zeroed. In the EVEX.512 encoded version, the elements in the destination are updated using the writemask k1, and the imm8 bits are reused as control bits for the upper 256-bit half when the control bits come from the immediate. The source operand can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 64-bit memory location. The imm8 versions: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise the instruction will #UD. The vector control version: Copies quadword elements of double-precision floating-point values from the second source operand (the third operand) to the destination operand (the first operand) according to the indices in the first source operand (the second operand). The first 3 bits of each 64-bit element in the index operand select which quadword in the second source operand to copy. The first and second operands are ZMM registers, the third operand can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 64-bit memory location. The elements in the destination are updated using the writemask k1. Note that this instruction permits a qword in the source operand to be copied to multiple locations in the destination operand. If VPERMPD is encoded with VEX.L = 0, an attempt to execute the instruction will cause an #UD exception. VPERMPD—Permute Double-Precision Floating-Point Elements Vol.
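Example (informative): a worked imm8 form. Each 2-bit field imm8[2i+1:2i] selects the source qword for destination element i, so imm8 = 1Bh reverses the four elements across lanes. The sketch uses _mm256_permute4x64_pd, the AVX2 intrinsic name that current compilers map to this instruction; assumes -mavx2; names are illustrative.
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256d a = _mm256_setr_pd(1.0, 2.0, 3.0, 4.0);
    /* imm8 = _MM_SHUFFLE(0,1,2,3) = 0x1B: destination element i is
       (SRC >> (imm8[2i+1:2i] * 64))[63:0], reversing the qwords. */
    __m256d r = _mm256_permute4x64_pd(a, _MM_SHUFFLE(0, 1, 2, 3));

    double out[4];
    _mm256_storeu_pd(out, r);
    printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]); /* 4 3 2 1 */
    return 0;
}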
2C 5-381 INSTRUCTION SET REFERENCE, V-Z Operation VPERMPD (EVEX - imm8 control forms) (KL, VL) = (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF (EVEX.b = 1) AND (SRC *is memory*) THEN TMP_SRC[i+63:i] SRC[63:0]; ELSE TMP_SRC[i+63:i] SRC[i+63:i]; FI; ENDFOR; TMP_DEST[63:0] (TMP_SRC[255:0] >> (IMM8[1:0] * 64))[63:0]; TMP_DEST[127:64] (TMP_SRC[255:0] >> (IMM8[3:2] * 64))[63:0]; TMP_DEST[191:128] (TMP_SRC[255:0] >> (IMM8[5:4] * 64))[63:0]; TMP_DEST[255:192] (TMP_SRC[255:0] >> (IMM8[7:6] * 64))[63:0]; IF VL >= 512 TMP_DEST[319:256] (TMP_SRC[511:256] >> (IMM8[1:0] * 64))[63:0]; TMP_DEST[383:320] (TMP_SRC[511:256] >> (IMM8[3:2] * 64))[63:0]; TMP_DEST[447:384] (TMP_SRC[511:256] >> (IMM8[5:4] * 64))[63:0]; TMP_DEST[511:448] (TMP_SRC[511:256] >> (IMM8[7:6] * 64))[63:0]; FI; FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] TMP_DEST[i+63:i] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI; FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPERMPD (EVEX - vector control forms) (KL, VL) = (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN TMP_SRC2[i+63:i] SRC2[63:0]; ELSE TMP_SRC2[i+63:i] SRC2[i+63:i]; FI; ENDFOR; IF VL = 256 TMP_DEST[63:0] (TMP_SRC2[255:0] >> (SRC1[1:0] * 64))[63:0]; TMP_DEST[127:64] (TMP_SRC2[255:0] >> (SRC1[65:64] * 64))[63:0]; TMP_DEST[191:128] (TMP_SRC2[255:0] >> (SRC1[129:128] * 64))[63:0]; TMP_DEST[255:192] (TMP_SRC2[255:0] >> (SRC1[193:192] * 64))[63:0]; FI; IF VL = 512 TMP_DEST[63:0] (TMP_SRC2[511:0] >> (SRC1[2:0] * 64))[63:0]; 5-382 Vol. 2C VPERMPD—Permute Double-Precision Floating-Point Elements INSTRUCTION SET REFERENCE, V-Z TMP_DEST[127:64] (TMP_SRC2[511:0] >> (SRC1[66:64] * 64))[63:0]; TMP_DEST[191:128] (TMP_SRC2[511:0] >> (SRC1[130:128] * 64))[63:0]; TMP_DEST[255:192] (TMP_SRC2[511:0] >> (SRC1[194:192] * 64))[63:0]; TMP_DEST[319:256] (TMP_SRC2[511:0] >> (SRC1[258:256] * 64))[63:0]; TMP_DEST[383:320] (TMP_SRC2[511:0] >> (SRC1[322:320] * 64))[63:0]; TMP_DEST[447:384] (TMP_SRC2[511:0] >> (SRC1[386:384] * 64))[63:0]; TMP_DEST[511:448] (TMP_SRC2[511:0] >> (SRC1[450:448] * 64))[63:0]; FI; FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] TMP_DEST[i+63:i] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 FI; FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPERMPD (VEX.256 encoded version) DEST[63:0] (SRC[255:0] >> (IMM8[1:0] * 64))[63:0]; DEST[127:64] (SRC[255:0] >> (IMM8[3:2] * 64))[63:0]; DEST[191:128] (SRC[255:0] >> (IMM8[5:4] * 64))[63:0]; DEST[255:192] (SRC[255:0] >> (IMM8[7:6] * 64))[63:0]; DEST[MAX_VL-1:256] 0 Intel C/C++ Compiler Intrinsic Equivalent VPERMPD __m512d _mm512_permutex_pd( __m512d a, int imm); VPERMPD __m512d _mm512_mask_permutex_pd(__m512d s, __mmask8 k, __m512d a, int imm); VPERMPD __m512d _mm512_maskz_permutex_pd( __mmask8 k, __m512d a, int imm); VPERMPD __m512d _mm512_permutexvar_pd( __m512i i, __m512d a); VPERMPD __m512d _mm512_mask_permutexvar_pd(__m512d s, __mmask8 k, __m512i i, __m512d a); VPERMPD __m512d _mm512_maskz_permutexvar_pd( __mmask8 k, __m512i i, __m512d a); VPERMPD __m256d _mm256_permutex_pd( __m256d a, int imm); VPERMPD __m256d _mm256_mask_permutex_pd(__m256d s, __mmask8 k, __m256d a, int imm); VPERMPD __m256d _mm256_maskz_permutex_pd( __mmask8 k, __m256d a, int imm); VPERMPD __m256d _mm256_permutexvar_pd( __m256i i, __m256d a); VPERMPD __m256d _mm256_mask_permutexvar_pd(__m256d s, __mmask8 k, __m256i i,
__m256d a); VPERMPD __m256d _mm256_maskz_permutexvar_pd( __mmask8 k, __m256i i, __m256d a); SIMD Floating-Point Exceptions None Other Exceptions Non-EVEX-encoded instruction, see Exceptions Type 4; additionally #UD If VEX.L = 0. If VEX.vvvv != 1111B. EVEX-encoded instruction, see Exceptions Type E4NF. #UD If encoded with EVEX.128. If EVEX.vvvv != 1111B and with imm8. VPERMPD—Permute Double-Precision Floating-Point Elements Vol. 2C 5-383 INSTRUCTION SET REFERENCE, V-Z VPERMPS—Permute Single-Precision Floating-Point Elements Opcode/ Instruction Op / En RVM 64/32 bit Mode Support V/V CPUID Feature Flag AVX2 VEX.256.66.0F38.W0 16 /r VPERMPS ymm1, ymm2, ymm3/m256 EVEX.NDS.256.66.0F38.W0 16 /r VPERMPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst EVEX.NDS.512.66.0F38.W0 16 /r VPERMPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst FV V/V AVX512VL AVX512F FV V/V AVX512F Description Permute single-precision floating-point elements in ymm3/m256 using indices in ymm2 and store the result in ymm1. Permute single-precision floating-point elements in ymm3/m256/m32bcst using indices in ymm2 and store the result in ymm1 subject to writemask k1. Permute single-precision floating-point values in zmm3/m512/m32bcst using indices in zmm2 and store the result in zmm1 subject to writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RVM ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) NA FV ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA Description Copies doubleword elements of single-precision floating-point values from the second source operand (the third operand) to the destination operand (the first operand) according to the indices in the first source operand (the second operand). Note that this instruction permits a doubleword in the source operand to be copied to more than one location in the destination operand. VEX.256 versions: The first and second operands are YMM registers; the third operand can be a YMM register or memory location. Bits (MAX_VL-1:256) of the corresponding destination register are zeroed. EVEX encoded version: The first and second operands are ZMM registers, the third operand can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 32-bit memory location. The elements in the destination are updated using the writemask k1. If VPERMPS is encoded with VEX.L = 0, an attempt to execute the instruction will cause an #UD exception. Operation VPERMPS (EVEX forms) (KL, VL) = (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN TMP_SRC2[i+31:i] SRC2[31:0]; ELSE TMP_SRC2[i+31:i] SRC2[i+31:i]; FI; ENDFOR; IF VL = 256 TMP_DEST[31:0] (TMP_SRC2[255:0] >> (SRC1[2:0] * 32))[31:0]; TMP_DEST[63:32] (TMP_SRC2[255:0] >> (SRC1[34:32] * 32))[31:0]; TMP_DEST[95:64] (TMP_SRC2[255:0] >> (SRC1[66:64] * 32))[31:0]; TMP_DEST[127:96] (TMP_SRC2[255:0] >> (SRC1[98:96] * 32))[31:0]; TMP_DEST[159:128] (TMP_SRC2[255:0] >> (SRC1[130:128] * 32))[31:0]; TMP_DEST[191:160] (TMP_SRC2[255:0] >> (SRC1[162:160] * 32))[31:0]; TMP_DEST[223:192] (TMP_SRC2[255:0] >> (SRC1[194:192] * 32))[31:0]; TMP_DEST[255:224] (TMP_SRC2[255:0] >> (SRC1[226:224] * 32))[31:0]; 5-384 Vol.
2C VPERMPS—Permute Single-Precision Floating-Point Elements INSTRUCTION SET REFERENCE, V-Z FI; IF VL = 512 TMP_DEST[31:0] (TMP_SRC2[511:0] >> (SRC1[3:0] * 32))[31:0]; TMP_DEST[63:32] (TMP_SRC2[511:0] >> (SRC1[35:32] * 32))[31:0]; TMP_DEST[95:64] (TMP_SRC2[511:0] >> (SRC1[67:64] * 32))[31:0]; TMP_DEST[127:96] (TMP_SRC2[511:0] >> (SRC1[99:96] * 32))[31:0]; TMP_DEST[159:128] (TMP_SRC2[511:0] >> (SRC1[131:128] * 32))[31:0]; TMP_DEST[191:160] (TMP_SRC2[511:0] >> (SRC1[163:160] * 32))[31:0]; TMP_DEST[223:192] (TMP_SRC2[511:0] >> (SRC1[195:192] * 32))[31:0]; TMP_DEST[255:224] (TMP_SRC2[511:0] >> (SRC1[227:224] * 32))[31:0]; TMP_DEST[287:256] (TMP_SRC2[511:0] >> (SRC1[259:256] * 32))[31:0]; TMP_DEST[319:288] (TMP_SRC2[511:0] >> (SRC1[291:288] * 32))[31:0]; TMP_DEST[351:320] (TMP_SRC2[511:0] >> (SRC1[323:320] * 32))[31:0]; TMP_DEST[383:352] (TMP_SRC2[511:0] >> (SRC1[355:352] * 32))[31:0]; TMP_DEST[415:384] (TMP_SRC2[511:0] >> (SRC1[387:384] * 32))[31:0]; TMP_DEST[447:416] (TMP_SRC2[511:0] >> (SRC1[419:416] * 32))[31:0]; TMP_DEST[479:448] (TMP_SRC2[511:0] >> (SRC1[451:448] * 32))[31:0]; TMP_DEST[511:480] (TMP_SRC2[511:0] >> (SRC1[483:480] * 32))[31:0]; FI; FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] TMP_DEST[i+31:i] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 ;zeroing-masking FI; FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPERMPS (VEX.256 encoded version) DEST[31:0] (SRC2[255:0] >> (SRC1[2:0] * 32))[31:0]; DEST[63:32] (SRC2[255:0] >> (SRC1[34:32] * 32))[31:0]; DEST[95:64] (SRC2[255:0] >> (SRC1[66:64] * 32))[31:0]; DEST[127:96] (SRC2[255:0] >> (SRC1[98:96] * 32))[31:0]; DEST[159:128] (SRC2[255:0] >> (SRC1[130:128] * 32))[31:0]; DEST[191:160] (SRC2[255:0] >> (SRC1[162:160] * 32))[31:0]; DEST[223:192] (SRC2[255:0] >> (SRC1[194:192] * 32))[31:0]; DEST[255:224] (SRC2[255:0] >> (SRC1[226:224] * 32))[31:0]; DEST[MAX_VL-1:256] 0 VPERMPS—Permute Single-Precision Floating-Point Elements Vol. 2C 5-385 INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VPERMPS __m512 _mm512_permutexvar_ps(__m512i i, __m512 a); VPERMPS __m512 _mm512_mask_permutexvar_ps(__m512 s, __mmask16 k, __m512i i, __m512 a); VPERMPS __m512 _mm512_maskz_permutexvar_ps( __mmask16 k, __m512i i, __m512 a); VPERMPS __m256 _mm256_permutexvar_ps(__m256 i, __m256 a); VPERMPS __m256 _mm256_mask_permutexvar_ps(__m256 s, __mmask8 k, __m256 i, __m256 a); VPERMPS __m256 _mm256_maskz_permutexvar_ps( __mmask8 k, __m256 i, __m256 a); SIMD Floating-Point Exceptions None Other Exceptions Non-EVEX-encoded instruction, see Exceptions Type 4; additionally #UD If VEX.L = 0. EVEX-encoded instruction, see Exceptions Type E4NF. 5-386 Vol. 2C VPERMPS—Permute Single-Precision Floating-Point Elements INSTRUCTION SET REFERENCE, V-Z VPERMQ—Qwords Element Permutation Opcode/ Instruction Op / En RMI 64/32 bit Mode Support V/V CPUID Feature Flag AVX2 VEX.256.66.0F3A.W1 00 /r ib VPERMQ ymm1, ymm2/m256, imm8 EVEX.256.66.0F3A.W1 00 /r ib VPERMQ ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8 EVEX.512.66.0F3A.W1 00 /r ib VPERMQ zmm1 {k1}{z}, zmm2/m512/m64bcst, imm8 EVEX.NDS.256.66.0F38.W1 36 /r VPERMQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst EVEX.NDS.512.66.0F38.W1 36 /r VPERMQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst Description FV-RM V/V AVX512VL AVX512F FV-RMI V/V AVX512F Permute qwords in zmm2/m512/m64bcst using indices in imm8 and store the result in zmm1. FV-RVM V/V AVX512VL AVX512F Permute qwords in ymm3/m256/m64bcst using indexes in ymm2 and store the result in ymm1. 
FV-RVM V/V AVX512F Permute qwords in zmm3/m512/m64bcst using indices in zmm2 and store the result in zmm1. Permute qwords in ymm2/m256 using indices in imm8 and store the result in ymm1. Permute qwords in ymm2/m256/m64bcst using indices in imm8 and store the result in ymm1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RMI ModRM:reg (w) ModRM:r/m (r) Imm8 NA FV-RMI ModRM:reg (w) ModRM:r/m (r) Imm8 NA FV-RVM ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA Description The imm8 version: Copies quadwords from the source operand (the second operand) to the destination operand (the first operand) according to the indices specified by the immediate operand (the third operand). Each two-bit value in the immediate byte selects a qword element in the source operand. VEX version: The source operand can be a YMM register or a memory location. Bits (MAX_VL-1:256) of the corresponding destination register are zeroed. In the EVEX.512 encoded version, the elements in the destination are updated using the writemask k1, and the imm8 bits are reused as control bits for the upper 256-bit half when the control bits come from the immediate. The source operand can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 64-bit memory location. Immediate control versions: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise the instruction will #UD. The vector control version: Copies quadwords from the second source operand (the third operand) to the destination operand (the first operand) according to the indices in the first source operand (the second operand). The first 3 bits of each 64-bit element in the index operand select which quadword in the second source operand to copy. The first and second operands are ZMM registers, the third operand can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 64-bit memory location. The elements in the destination are updated using the writemask k1. Note that this instruction permits a qword in the source operand to be copied to multiple locations in the destination operand. If VPERMQ is encoded with VEX.L = 0 or EVEX.128, an attempt to execute the instruction will cause an #UD exception. VPERMQ—Qwords Element Permutation Vol.
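Example (informative): the same shift-and-truncate selection as the pseudocode below, applied by hand for imm8 = 39h (_MM_SHUFFLE(0,3,2,1)), which rotates the four qwords down by one. The sketch uses the AVX2 intrinsic _mm256_permute4x64_epi64; assumes -mavx2; names are illustrative.
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256i a = _mm256_setr_epi64x(10, 11, 12, 13);
    /* imm8 = 0x39: qword i of the result is (a >> (imm8[2i+1:2i] * 64))[63:0],
       i.e. fields 1,2,3,0, a one-element rotate. */
    __m256i r = _mm256_permute4x64_epi64(a, 0x39);   /* -> 11 12 13 10 */

    long long out[4];
    _mm256_storeu_si256((__m256i *)out, r);
    printf("%lld %lld %lld %lld\n", out[0], out[1], out[2], out[3]);
    return 0;
}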
2C 5-387 INSTRUCTION SET REFERENCE, V-Z Operation VPERMQ (EVEX - imm8 control forms) (KL, VL) = (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF (EVEX.b = 1) AND (SRC *is memory*) THEN TMP_SRC[i+63:i] SRC[63:0]; ELSE TMP_SRC[i+63:i] SRC[i+63:i]; FI; ENDFOR; TMP_DEST[63:0] (TMP_SRC[255:0] >> (IMM8[1:0] * 64))[63:0]; TMP_DEST[127:64] (TMP_SRC[255:0] >> (IMM8[3:2] * 64))[63:0]; TMP_DEST[191:128] (TMP_SRC[255:0] >> (IMM8[5:4] * 64))[63:0]; TMP_DEST[255:192] (TMP_SRC[255:0] >> (IMM8[7:6] * 64))[63:0]; IF VL >= 512 TMP_DEST[319:256] (TMP_SRC[511:256] >> (IMM8[1:0] * 64))[63:0]; TMP_DEST[383:320] (TMP_SRC[511:256] >> (IMM8[3:2] * 64))[63:0]; TMP_DEST[447:384] (TMP_SRC[511:256] >> (IMM8[5:4] * 64))[63:0]; TMP_DEST[511:448] (TMP_SRC[511:256] >> (IMM8[7:6] * 64))[63:0]; FI; FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] TMP_DEST[i+63:i] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 ;zeroing-masking FI; FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPERMQ (EVEX - vector control forms) (KL, VL) = (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN TMP_SRC2[i+63:i] SRC2[63:0]; ELSE TMP_SRC2[i+63:i] SRC2[i+63:i]; FI; ENDFOR; IF VL = 256 TMP_DEST[63:0] (TMP_SRC2[255:0] >> (SRC1[1:0] * 64))[63:0]; TMP_DEST[127:64] (TMP_SRC2[255:0] >> (SRC1[65:64] * 64))[63:0]; TMP_DEST[191:128] (TMP_SRC2[255:0] >> (SRC1[129:128] * 64))[63:0]; TMP_DEST[255:192] (TMP_SRC2[255:0] >> (SRC1[193:192] * 64))[63:0]; FI; IF VL = 512 TMP_DEST[63:0] (TMP_SRC2[511:0] >> (SRC1[2:0] * 64))[63:0]; TMP_DEST[127:64] (TMP_SRC2[511:0] >> (SRC1[66:64] * 64))[63:0]; TMP_DEST[191:128] (TMP_SRC2[511:0] >> (SRC1[130:128] * 64))[63:0]; TMP_DEST[255:192] (TMP_SRC2[511:0] >> (SRC1[194:192] * 64))[63:0]; 5-388 Vol. 
2C VPERMQ—Qwords Element Permutation INSTRUCTION SET REFERENCE, V-Z TMP_DEST[319:256] (TMP_SRC2[511:0] >> (SRC1[258:256] * 64))[63:0]; TMP_DEST[383:320] (TMP_SRC2[511:0] >> (SRC1[322:320] * 64))[63:0]; TMP_DEST[447:384] (TMP_SRC2[511:0] >> (SRC1[386:384] * 64))[63:0]; TMP_DEST[511:448] (TMP_SRC2[511:0] >> (SRC1[450:448] * 64))[63:0]; FI; FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] TMP_DEST[i+63:i] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking DEST[i+63:i] 0 ;zeroing-masking FI; FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPERMQ (VEX.256 encoded version) DEST[63:0] (SRC[255:0] >> (IMM8[1:0] * 64))[63:0]; DEST[127:64] (SRC[255:0] >> (IMM8[3:2] * 64))[63:0]; DEST[191:128] (SRC[255:0] >> (IMM8[5:4] * 64))[63:0]; DEST[255:192] (SRC[255:0] >> (IMM8[7:6] * 64))[63:0]; DEST[MAX_VL-1:256] 0 Intel C/C++ Compiler Intrinsic Equivalent VPERMQ __m512i _mm512_permutex_epi64( __m512i a, int imm); VPERMQ __m512i _mm512_mask_permutex_epi64(__m512i s, __mmask8 k, __m512i a, int imm); VPERMQ __m512i _mm512_maskz_permutex_epi64( __mmask8 k, __m512i a, int imm); VPERMQ __m512i _mm512_permutexvar_epi64( __m512i a, __m512i b); VPERMQ __m512i _mm512_mask_permutexvar_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b); VPERMQ __m512i _mm512_maskz_permutexvar_epi64( __mmask8 k, __m512i a, __m512i b); VPERMQ __m256i _mm256_permutex_epi64( __m256i a, int imm); VPERMQ __m256i _mm256_mask_permutex_epi64(__m256i s, __mmask8 k, __m256i a, int imm); VPERMQ __m256i _mm256_maskz_permutex_epi64( __mmask8 k, __m256i a, int imm); VPERMQ __m256i _mm256_permutexvar_epi64( __m256i a, __m256i b); VPERMQ __m256i _mm256_mask_permutexvar_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b); VPERMQ __m256i _mm256_maskz_permutexvar_epi64( __mmask8 k, __m256i a, __m256i b); SIMD Floating-Point Exceptions None Other Exceptions Non-EVEX-encoded instruction, see Exceptions Type 4; additionally #UD If VEX.L = 0. If VEX.vvvv != 1111B. EVEX-encoded instruction, see Exceptions Type E4NF. #UD If encoded with EVEX.128. If EVEX.vvvv != 1111B and with imm8. VPERMQ—Qwords Element Permutation Vol. 2C 5-389 INSTRUCTION SET REFERENCE, V-Z VPEXPANDD—Load Sparse Packed Doubleword Integer Values from Dense Memory / Register Opcode/ Instruction Op/ En T1S 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512F EVEX.128.66.0F38.W0 89 /r VPEXPANDD xmm1 {k1}{z}, xmm2/m128 EVEX.256.66.0F38.W0 89 /r VPEXPANDD ymm1 {k1}{z}, ymm2/m256 EVEX.512.66.0F38.W0 89 /r VPEXPANDD zmm1 {k1}{z}, zmm2/m512 Description Expand packed double-word integer values from xmm2/m128 to xmm1 using writemask k1. T1S V/V AVX512VL AVX512F Expand packed double-word integer values from ymm2/m256 to ymm1 using writemask k1. T1S V/V AVX512F Expand packed double-word integer values from zmm2/m512 to zmm1 using writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 T1S ModRM:reg (w) ModRM:r/m (r) NA NA Description Expand (load) up to 16 contiguous doubleword integer values of the input vector in the source operand (the second operand) to sparse elements in the destination operand (the first operand), selected by the writemask k1. The destination operand is a ZMM register, the source operand can be a ZMM register or memory location. The input vector starts from the lowest element in the source operand. The opmask register k1 selects the destination elements (a partial vector or sparse elements if less than 8 elements) to be replaced by the ascending elements in the input vector. 
Destination elements not selected by the writemask k1 are either unmodified or zeroed, depending on EVEX.z. Note: EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD. Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element instead of the size of the full vector. Operation VPEXPANDD (EVEX encoded versions) (KL, VL) = (4, 128), (8, 256), (16, 512) k0 FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] SRC[k+31:k]; k k + 32 ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-390 Vol. 2C VPEXPANDD—Load Sparse Packed Doubleword Integer Values from Dense Memory / Register INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VPEXPANDD __m512i _mm512_mask_expandloadu_epi32(__m512i s, __mmask16 k, void * a); VPEXPANDD __m512i _mm512_maskz_expandloadu_epi32( __mmask16 k, void * a); VPEXPANDD __m512i _mm512_mask_expand_epi32(__m512i s, __mmask16 k, __m512i a); VPEXPANDD __m512i _mm512_maskz_expand_epi32( __mmask16 k, __m512i a); VPEXPANDD __m256i _mm256_mask_expandloadu_epi32(__m256i s, __mmask8 k, void * a); VPEXPANDD __m256i _mm256_maskz_expandloadu_epi32( __mmask8 k, void * a); VPEXPANDD __m256i _mm256_mask_expand_epi32(__m256i s, __mmask8 k, __m256i a); VPEXPANDD __m256i _mm256_maskz_expand_epi32( __mmask8 k, __m256i a); VPEXPANDD __m128i _mm_mask_expandloadu_epi32(__m128i s, __mmask8 k, void * a); VPEXPANDD __m128i _mm_maskz_expandloadu_epi32( __mmask8 k, void * a); VPEXPANDD __m128i _mm_mask_expand_epi32(__m128i s, __mmask8 k, __m128i a); VPEXPANDD __m128i _mm_maskz_expand_epi32( __mmask8 k, __m128i a); SIMD Floating-Point Exceptions None Other Exceptions EVEX-encoded instruction, see Exceptions Type E4.nb. #UD If EVEX.vvvv != 1111B. VPEXPANDD—Load Sparse Packed Doubleword Integer Values from Dense Memory / Register Vol. 2C 5-391 INSTRUCTION SET REFERENCE, V-Z VPEXPANDQ—Load Sparse Packed Quadword Integer Values from Dense Memory / Register Opcode/ Instruction Op/ En EVEX.128.66.0F38.W1 89 /r VPEXPANDQ xmm1 {k1}{z}, xmm2/m128 EVEX.256.66.0F38.W1 89 /r VPEXPANDQ ymm1 {k1}{z}, ymm2/m256 EVEX.512.66.0F38.W1 89 /r VPEXPANDQ zmm1 {k1}{z}, zmm2/m512 T1S 64/32 bit Mode Support V/V T1S V/V T1S V/V CPUID Feature Flag AVX512VL AVX512F AVX512VL AVX512F AVX512F Description Expand packed quad-word integer values from xmm2/m128 to xmm1 using writemask k1. Expand packed quad-word integer values from ymm2/m256 to ymm1 using writemask k1. Expand packed quad-word integer values from zmm2/m512 to zmm1 using writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 T1S ModRM:reg (w) ModRM:r/m (r) NA NA Description Expand (load) up to 8 quadword integer values from the source operand (the second operand) to sparse elements in the destination operand (the first operand), selected by the writemask k1. The destination operand is a ZMM register, the source operand can be a ZMM register or memory location. The input vector starts from the lowest element in the source operand. The opmask register k1 selects the destination elements (a partial vector or sparse elements if less than 8 elements) to be replaced by the ascending elements in the input vector. Destination elements not selected by the writemask k1 are either unmodified or zeroed, depending on EVEX.z. Note: EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD. 
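Example (informative): a minimal C sketch of the expand-load behavior shared by VPEXPANDD and VPEXPANDQ, using the _mm512_maskz_expandloadu_epi32 intrinsic listed under VPEXPANDD above. Exactly one packed element is read from memory per set mask bit, and unselected destination elements are zeroed by the maskz form. Assumes AVX512F (-mavx512f); names and data are illustrative.
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Four packed dwords in memory, expanded into destination
       positions 1, 4, 9 and 12 (the set bits of k), in ascending order. */
    int src[4] = {100, 101, 102, 103};
    __mmask16 k = (1 << 1) | (1 << 4) | (1 << 9) | (1 << 12);
    __m512i r = _mm512_maskz_expandloadu_epi32(k, src);

    int out[16];
    _mm512_storeu_si512(out, r);
    for (int i = 0; i < 16; i++)
        printf("%d ", out[i]);   /* 0 100 0 0 101 0 ... 102 ... 103 ... */
    printf("\n");
    return 0;
}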
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element instead of the size of the full vector. Operation VPEXPANDQ (EVEX encoded versions) (KL, VL) = (2, 128), (4, 256), (8, 512) k0 FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] SRC[k+63:k]; k k + 64 ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE ; zeroing-masking THEN DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-392 Vol. 2C VPEXPANDQ—Load Sparse Packed Quadword Integer Values from Dense Memory / Register INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VPEXPANDQ __m512i _mm512_mask_expandloadu_epi64(__m512i s, __mmask8 k, void * a); VPEXPANDQ __m512i _mm512_maskz_expandloadu_epi64( __mmask8 k, void * a); VPEXPANDQ __m512i _mm512_mask_expand_epi64(__m512i s, __mmask8 k, __m512i a); VPEXPANDQ __m512i _mm512_maskz_expand_epi64( __mmask8 k, __m512i a); VPEXPANDQ __m256i _mm256_mask_expandloadu_epi64(__m256i s, __mmask8 k, void * a); VPEXPANDQ __m256i _mm256_maskz_expandloadu_epi64( __mmask8 k, void * a); VPEXPANDQ __m256i _mm256_mask_expand_epi64(__m256i s, __mmask8 k, __m256i a); VPEXPANDQ __m256i _mm256_maskz_expand_epi64( __mmask8 k, __m256i a); VPEXPANDQ __m128i _mm_mask_expandloadu_epi64(__m128i s, __mmask8 k, void * a); VPEXPANDQ __m128i _mm_maskz_expandloadu_epi64( __mmask8 k, void * a); VPEXPANDQ __m128i _mm_mask_expand_epi64(__m128i s, __mmask8 k, __m128i a); VPEXPANDQ __m128i _mm_maskz_expand_epi64( __mmask8 k, __m128i a); SIMD Floating-Point Exceptions None Other Exceptions EVEX-encoded instruction, see Exceptions Type E4.nb. #UD If EVEX.vvvv != 1111B. VPEXPANDQ—Load Sparse Packed Quadword Integer Values from Dense Memory / Register Vol. 2C 5-393 INSTRUCTION SET REFERENCE, V-Z VPLZCNTD/Q—Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values Opcode/ Instruction Op/ En CPUID Feature Flag AVX512VL AVX512CD Description FV 64/32 bit Mode Support V/V EVEX.128.66.0F38.W0 44 /r VPLZCNTD xmm1 {k1}{z}, xmm2/m128/m32bcst EVEX.256.66.0F38.W0 44 /r VPLZCNTD ymm1 {k1}{z}, ymm2/m256/m32bcst FV V/V AVX512VL AVX512CD Count the number of leading zero bits in each dword element of ymm2/m256/m32bcst using writemask k1. EVEX.512.66.0F38.W0 44 /r VPLZCNTD zmm1 {k1}{z}, zmm2/m512/m32bcst FV V/V AVX512CD Count the number of leading zero bits in each dword element of zmm2/m512/m32bcst using writemask k1. EVEX.128.66.0F38.W1 44 /r VPLZCNTQ xmm1 {k1}{z}, xmm2/m128/m64bcst FV V/V AVX512VL AVX512CD Count the number of leading zero bits in each qword element of xmm2/m128/m64bcst using writemask k1. EVEX.256.66.0F38.W1 44 /r VPLZCNTQ ymm1 {k1}{z}, ymm2/m256/m64bcst FV V/V AVX512VL AVX512CD Count the number of leading zero bits in each qword element of ymm2/m256/m64bcst using writemask k1. EVEX.512.66.0F38.W1 44 /r VPLZCNTQ zmm1 {k1}{z}, zmm2/m512/m64bcst FV V/V AVX512CD Count the number of leading zero bits in each qword element of zmm2/m512/m64bcst using writemask k1. Count the number of leading zero bits in each dword element of xmm2/m128/m32bcst using writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 FV ModRM:reg (w) ModRM:r/m (r) NA NA Description Counts the number of leading most significant zero bits in each dword or qword element of the source operand (the second operand) and stores the results in the destination register (the first operand) according to the writemask. 
If an element is zero, the result for that element is the operand size of the element. EVEX.512 encoded version: The source operand is a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 32/64-bit memory location. The destination operand is a ZMM register, conditionally updated using writemask k1. EVEX.256 encoded version: The source operand is a YMM register, a 256-bit memory location, or a 256-bit vector broadcasted from a 32/64-bit memory location. The destination operand is a YMM register, conditionally updated using writemask k1. EVEX.128 encoded version: The source operand is a XMM register, a 128-bit memory location, or a 128-bit vector broadcasted from a 32/64-bit memory location. The destination operand is a XMM register, conditionally updated using writemask k1. EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD. 5-394 Vol. 2C VPLZCNTD/Q—Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values INSTRUCTION SET REFERENCE, V-Z Operation VPLZCNTD (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j*32 IF MaskBit(j) OR *no writemask* THEN temp 32 DEST[i+31:i] 0 WHILE (temp > 0) AND (SRC[i+temp-1] = 0) DO temp temp – 1 DEST[i+31:i] DEST[i+31:i] + 1 OD ELSE IF *merging-masking* THEN *DEST[i+31:i] remains unchanged* ELSE DEST[i+31:i] 0 FI FI ENDFOR DEST[MAX_VL-1:VL] 0 VPLZCNTQ (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j*64 IF MaskBit(j) OR *no writemask* THEN temp 64 DEST[i+63:i] 0 WHILE (temp > 0) AND (SRC[i+temp-1] = 0) DO temp temp – 1 DEST[i+63:i] DEST[i+63:i] + 1 OD ELSE IF *merging-masking* THEN *DEST[i+63:i] remains unchanged* ELSE DEST[i+63:i] 0 FI FI ENDFOR DEST[MAX_VL-1:VL] 0 VPLZCNTD/Q—Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values Vol. 2C 5-395 INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VPLZCNTD __m512i _mm512_lzcnt_epi32(__m512i a); VPLZCNTD __m512i _mm512_mask_lzcnt_epi32(__m512i s, __mmask16 m, __m512i a); VPLZCNTD __m512i _mm512_maskz_lzcnt_epi32( __mmask16 m, __m512i a); VPLZCNTQ __m512i _mm512_lzcnt_epi64(__m512i a); VPLZCNTQ __m512i _mm512_mask_lzcnt_epi64(__m512i s, __mmask8 m, __m512i a); VPLZCNTQ __m512i _mm512_maskz_lzcnt_epi64(__mmask8 m, __m512i a); VPLZCNTD __m256i _mm256_lzcnt_epi32(__m256i a); VPLZCNTD __m256i _mm256_mask_lzcnt_epi32(__m256i s, __mmask8 m, __m256i a); VPLZCNTD __m256i _mm256_maskz_lzcnt_epi32( __mmask8 m, __m256i a); VPLZCNTQ __m256i _mm256_lzcnt_epi64(__m256i a); VPLZCNTQ __m256i _mm256_mask_lzcnt_epi64(__m256i s, __mmask8 m, __m256i a); VPLZCNTQ __m256i _mm256_maskz_lzcnt_epi64(__mmask8 m, __m256i a); VPLZCNTD __m128i _mm_lzcnt_epi32(__m128i a); VPLZCNTD __m128i _mm_mask_lzcnt_epi32(__m128i s, __mmask8 m, __m128i a); VPLZCNTD __m128i _mm_maskz_lzcnt_epi32( __mmask8 m, __m128i a); VPLZCNTQ __m128i _mm_lzcnt_epi64(__m128i a); VPLZCNTQ __m128i _mm_mask_lzcnt_epi64(__m128i s, __mmask8 m, __m128i a); VPLZCNTQ __m128i _mm_maskz_lzcnt_epi64(__mmask8 m, __m128i a); SIMD Floating-Point Exceptions None Other Exceptions EVEX-encoded instruction, see Exceptions Type E4. 5-396 Vol. 
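Example (informative): a minimal sketch of the leading-zero-count semantics described above, including the all-zero case, which reports the element width (32 for dwords). Uses _mm512_lzcnt_epi32 from the intrinsic list; assumes AVX512CD (-mavx512f -mavx512cd); names are illustrative.
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* 1 has 31 leading zeros in a dword, 0x80000000 has none,
       an all-zero element reports 32, and 0xFFFF reports 16. */
    __m512i a = _mm512_setr_epi32(1, (int)0x80000000, 0, 0xFFFF,
                                  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
    __m512i r = _mm512_lzcnt_epi32(a);

    int out[16];
    _mm512_storeu_si512(out, r);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* 31 0 32 16 */
    return 0;
}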
2C VPLZCNTD/Q—Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values INSTRUCTION SET REFERENCE, V-Z VPMASKMOV — Conditional SIMD Integer Packed Loads and Stores Opcode/ Instruction Op/ En RVM 64/32 -bit Mode V/V CPUID Feature Flag AVX2 VEX.NDS.128.66.0F38.W0 8C /rVPMASKMOVD xmm1, xmm2, m128 VEX.NDS.256.66.0F38.W0 8C /r VPMASKMOVD ymm1, ymm2, m256 VEX.NDS.128.66.0F38.W1 8C /r VPMASKMOVQ xmm1, xmm2, m128 VEX.NDS.256.66.0F38.W1 8C /r VPMASKMOVQ ymm1, ymm2, m256 VEX.NDS.128.66.0F38.W0 8E /r VPMASKMOVD m128, xmm1, xmm2 VEX.NDS.256.66.0F38.W0 8E /r VPMASKMOVD m256, ymm1, ymm2 VEX.NDS.128.66.0F38.W1 8E /r VPMASKMOVQ m128, xmm1, xmm2 VEX.NDS.256.66.0F38.W1 8E /r VPMASKMOVQ m256, ymm1, ymm2 RVM V/V AVX2 RVM V/V AVX2 RVM V/V AVX2 MVR V/V AVX2 MVR V/V AVX2 MVR V/V AVX2 MVR V/V AVX2 Description Conditionally load dword values from m128 using mask in xmm2 and store in xmm1. Conditionally load dword values from m256 using mask in ymm2 and store in ymm1. Conditionally load qword values from m128 using mask in xmm2 and store in xmm1. Conditionally load qword values from m256 using mask in ymm2 and store in ymm1. Conditionally store dword values from xmm2 using mask in xmm1. Conditionally store dword values from ymm2 using mask in ymm1. Conditionally store qword values from xmm2 using mask in xmm1. Conditionally store qword values from ymm2 using mask in ymm1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RVM ModRM:reg (w) VEX.vvvv ModRM:r/m (r) NA MVR ModRM:r/m (w) VEX.vvvv ModRM:reg (r) NA Description Conditionally moves packed data elements from the second source operand into the corresponding data element of the destination operand, depending on the mask bits associated with each data element. The mask bits are specified in the first source operand. The mask bit for each data element is the most significant bit of that element in the first source operand. If a mask is 1, the corresponding data element is copied from the second source operand to the destination operand. If the mask is 0, the corresponding data element is set to zero in the load form of these instructions, and unmodified in the store form. The second source operand is a memory address for the load form of these instructions. The destination operand is a memory address for the store form of these instructions. The other operands are either XMM registers (for VEX.128 version) or YMM registers (for VEX.256 version). Faults occur only due to mask-bit required memory accesses that caused the faults. Faults will not occur due to referencing any memory location if the corresponding mask bit for that memory location is 0. For example, no faults will be detected if the mask bits are all zero. Unlike previous MASKMOV instructions (MASKMOVQ and MASKMOVDQU), a nontemporal hint is not applied to these instructions. Instruction behavior on alignment check reporting with mask bits of less than all 1s are the same as with mask bits of all 1s. VPMASKMOV — Conditional SIMD Integer Packed Loads and Stores Vol. 2C 5-397 INSTRUCTION SET REFERENCE, V-Z VMASKMOV should not be used to access memory mapped I/O as the ordering of the individual loads or stores it does is implementation specific. In cases where mask bits indicate data should not be loaded or stored paging A and D bits will be set in an implementation dependent way. However, A and D bits are always set for pages where data is actually loaded/stored. 
Note: for load forms, the first source (the mask) is encoded in VEX.vvvv; the second source is encoded in rm_field, and the destination register is encoded in reg_field.
Note: for store forms, the first source (the mask) is encoded in VEX.vvvv; the second source register is encoded in reg_field, and the destination memory location is encoded in rm_field.

Operation
VPMASKMOVD - 256-bit load
DEST[31:0] ← IF (SRC1[31]) Load_32(mem) ELSE 0
DEST[63:32] ← IF (SRC1[63]) Load_32(mem + 4) ELSE 0
DEST[95:64] ← IF (SRC1[95]) Load_32(mem + 8) ELSE 0
DEST[127:96] ← IF (SRC1[127]) Load_32(mem + 12) ELSE 0
DEST[159:128] ← IF (SRC1[159]) Load_32(mem + 16) ELSE 0
DEST[191:160] ← IF (SRC1[191]) Load_32(mem + 20) ELSE 0
DEST[223:192] ← IF (SRC1[223]) Load_32(mem + 24) ELSE 0
DEST[255:224] ← IF (SRC1[255]) Load_32(mem + 28) ELSE 0

VPMASKMOVD - 128-bit load
DEST[31:0] ← IF (SRC1[31]) Load_32(mem) ELSE 0
DEST[63:32] ← IF (SRC1[63]) Load_32(mem + 4) ELSE 0
DEST[95:64] ← IF (SRC1[95]) Load_32(mem + 8) ELSE 0
DEST[127:96] ← IF (SRC1[127]) Load_32(mem + 12) ELSE 0
DEST[VLMAX-1:128] ← 0

VPMASKMOVQ - 256-bit load
DEST[63:0] ← IF (SRC1[63]) Load_64(mem) ELSE 0
DEST[127:64] ← IF (SRC1[127]) Load_64(mem + 8) ELSE 0
DEST[191:128] ← IF (SRC1[191]) Load_64(mem + 16) ELSE 0
DEST[255:192] ← IF (SRC1[255]) Load_64(mem + 24) ELSE 0

VPMASKMOVQ - 128-bit load
DEST[63:0] ← IF (SRC1[63]) Load_64(mem) ELSE 0
DEST[127:64] ← IF (SRC1[127]) Load_64(mem + 8) ELSE 0
DEST[VLMAX-1:128] ← 0

VPMASKMOVD - 256-bit store
IF (SRC1[31]) DEST[31:0] ← SRC2[31:0]
IF (SRC1[63]) DEST[63:32] ← SRC2[63:32]
IF (SRC1[95]) DEST[95:64] ← SRC2[95:64]
IF (SRC1[127]) DEST[127:96] ← SRC2[127:96]
IF (SRC1[159]) DEST[159:128] ← SRC2[159:128]
IF (SRC1[191]) DEST[191:160] ← SRC2[191:160]
IF (SRC1[223]) DEST[223:192] ← SRC2[223:192]
IF (SRC1[255]) DEST[255:224] ← SRC2[255:224]

VPMASKMOVD - 128-bit store
IF (SRC1[31]) DEST[31:0] ← SRC2[31:0]
IF (SRC1[63]) DEST[63:32] ← SRC2[63:32]
IF (SRC1[95]) DEST[95:64] ← SRC2[95:64]
IF (SRC1[127]) DEST[127:96] ← SRC2[127:96]

VPMASKMOVQ - 256-bit store
IF (SRC1[63]) DEST[63:0] ← SRC2[63:0]
IF (SRC1[127]) DEST[127:64] ← SRC2[127:64]
IF (SRC1[191]) DEST[191:128] ← SRC2[191:128]
IF (SRC1[255]) DEST[255:192] ← SRC2[255:192]

VPMASKMOVQ - 128-bit store
IF (SRC1[63]) DEST[63:0] ← SRC2[63:0]
IF (SRC1[127]) DEST[127:64] ← SRC2[127:64]

Intel C/C++ Compiler Intrinsic Equivalent
VPMASKMOVD: __m256i _mm256_maskload_epi32(int const *a, __m256i mask)
VPMASKMOVD: void _mm256_maskstore_epi32(int *a, __m256i mask, __m256i b)
VPMASKMOVQ: __m256i _mm256_maskload_epi64(__int64 const *a, __m256i mask);
VPMASKMOVQ: void _mm256_maskstore_epi64(__int64 *a, __m256i mask, __m256i b);
VPMASKMOVD: __m128i _mm_maskload_epi32(int const *a, __m128i mask)
VPMASKMOVD: void _mm_maskstore_epi32(int *a, __m128i mask, __m128i b)
VPMASKMOVQ: __m128i _mm_maskload_epi64(__int64 const *a, __m128i mask);
VPMASKMOVQ: void _mm_maskstore_epi64(__int64 *a, __m128i mask, __m128i b);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 6 (No AC# reported for any mask bit combinations).
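A minimal sketch of both forms through the intrinsics listed above (AVX2; the data, the mask pattern, and the harness are ours, chosen only for illustration):

#include <immintrin.h>
#include <stdio.h>
int main(void) {
    int src[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    int dst[8] = {0};
    /* The MSB of each dword element selects its lane: here lanes 1, 3, 5, 7. */
    __m256i mask = _mm256_set_epi32(-1, 0, -1, 0, -1, 0, -1, 0);
    __m256i v = _mm256_maskload_epi32(src, mask);  /* masked-off lanes load as 0 */
    _mm256_maskstore_epi32(dst, mask, v);          /* masked-off lanes not written */
    for (int j = 0; j < 8; j++)
        printf("dst[%d] = %d\n", j, dst[j]);       /* 0,20,0,40,0,60,0,80 */
    return 0;
}

Per the fault behavior described above, lanes with a clear mask bit never generate a memory fault, which is what makes this pattern usable on a partially valid buffer tail.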
2C 5-399 INSTRUCTION SET REFERENCE, V-Z VPMOVM2B/VPMOVM2W/VPMOVM2D/VPMOVM2Q—Convert a Mask Register to a Vector Register Opcode/ Instruction Op/ En EVEX.128.F3.0F38.W0 28 /r VPMOVM2B xmm1, k1 EVEX.256.F3.0F38.W0 28 /r VPMOVM2B ymm1, k1 EVEX.512.F3.0F38.W0 28 /r VPMOVM2B zmm1, k1 EVEX.128.F3.0F38.W1 28 /r VPMOVM2W xmm1, k1 EVEX.256.F3.0F38.W1 28 /r VPMOVM2W ymm1, k1 EVEX.512.F3.0F38.W1 28 /r VPMOVM2W zmm1, k1 EVEX.128.F3.0F38.W0 38 /r VPMOVM2D xmm1, k1 EVEX.256.F3.0F38.W0 38 /r VPMOVM2D ymm1, k1 EVEX.512.F3.0F38.W0 38 /r VPMOVM2D zmm1, k1 EVEX.128.F3.0F38.W1 38 /r VPMOVM2Q xmm1, k1 EVEX.256.F3.0F38.W1 38 /r VPMOVM2Q ymm1, k1 EVEX.512.F3.0F38.W1 38 /r VPMOVM2Q zmm1, k1 RM 64/32 bit Mode Support V/V RM V/V RM V/V RM V/V RM V/V RM V/V RM V/V RM V/V RM V/V RM V/V RM V/V RM V/V CPUID Feature Flag AVX512VL AVX512BW AVX512VL AVX512BW AVX512BW AVX512VL AVX512BW AVX512VL AVX512BW AVX512BW AVX512VL AVX512DQ AVX512VL AVX512DQ AVX512DQ AVX512VL AVX512DQ AVX512VL AVX512DQ AVX512DQ Description Sets each byte in XMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1. Sets each byte in YMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1. Sets each byte in ZMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1. Sets each word in XMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1. Sets each word in YMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1. Sets each word in ZMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1. Sets each doubleword in XMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1. Sets each doubleword in YMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1. Sets each doubleword in ZMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1. Sets each quadword in XMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1. Sets each quadword in YMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1. Sets each quadword in ZMM1 to all 1’s or all 0’s based on the value of the corresponding bit in k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RM ModRM:reg (w) ModRM:r/m (r) NA NA Description Converts a mask register to a vector register. Each element in the destination register is set to all 1’s or all 0’s depending on the value of the corresponding bit in the source mask register. The source operand is a mask register. The destination operand is a ZMM/YMM/XMM register. EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD. 5-400 Vol. 
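Ahead of the formal operation below, a short illustration of the word-granularity form through the _mm512_movm_epi16 intrinsic listed later in this entry (AVX512BW; the mask constant is arbitrary and the harness is ours):

#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
int main(void) {
    __mmask32 k = 0xA5A5A5A5u;          /* bit j drives word lane j */
    __m512i v = _mm512_movm_epi16(k);   /* VPMOVM2W: lane = bit ? 0xFFFF : 0 */
    uint16_t out[32];
    _mm512_storeu_si512(out, v);
    for (int j = 0; j < 32; j++)
        printf("lane %2d: 0x%04X\n", j, (unsigned)out[j]);
    return 0;
}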
2C VPMOVM2B/VPMOVM2W/VPMOVM2D/VPMOVM2Q—Convert a Mask Register to a Vector Register INSTRUCTION SET REFERENCE, V-Z Operation VPMOVM2B (EVEX encoded versions) (KL, VL) = (16, 128), (32, 256), (64, 512) FOR j 0 TO KL-1 ij*8 IF SRC[j] THEN DEST[i+7:i] -1 ELSE DEST[i+7:i] 0 FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPMOVM2W (EVEX encoded versions) (KL, VL) = (8, 128), (16, 256), (32, 512) FOR j 0 TO KL-1 i j * 16 IF SRC[j] THEN DEST[i+15:i] -1 ELSE DEST[i+15:i] 0 FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPMOVM2D (EVEX encoded versions) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF SRC[j] THEN DEST[i+31:i] -1 ELSE DEST[i+31:i] 0 FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPMOVM2Q (EVEX encoded versions) (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF SRC[j] THEN DEST[i+63:i] -1 ELSE DEST[i+63:i] 0 FI; ENDFOR DEST[MAX_VL-1:VL] 0 VPMOVM2B/VPMOVM2W/VPMOVM2D/VPMOVM2Q—Convert a Mask Register to a Vector Register Vol. 2C 5-401 INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalents VPMOVM2B __m512i _mm512_movm_epi8(__mmask64 ); VPMOVM2D __m512i _mm512_movm_epi32(__mmask8 ); VPMOVM2Q __m512i _mm512_movm_epi64(__mmask16 ); VPMOVM2W __m512i _mm512_movm_epi16(__mmask32 ); VPMOVM2B __m256i _mm256_movm_epi8(__mmask32 ); VPMOVM2D __m256i _mm256_movm_epi32(__mmask8 ); VPMOVM2Q __m256i _mm256_movm_epi64(__mmask8 ); VPMOVM2W __m256i _mm256_movm_epi16(__mmask16 ); VPMOVM2B __m128i _mm_movm_epi8(__mmask16 ); VPMOVM2D __m128i _mm_movm_epi32(__mmask8 ); VPMOVM2Q __m128i _mm_movm_epi64(__mmask8 ); VPMOVM2W __m128i _mm_movm_epi16(__mmask8 ); SIMD Floating-Point Exceptions None Other Exceptions EVEX-encoded instruction, see Exceptions Type E7NM #UD 5-402 Vol. 2C If EVEX.vvvv != 1111B. VPMOVM2B/VPMOVM2W/VPMOVM2D/VPMOVM2Q—Convert a Mask Register to a Vector Register INSTRUCTION SET REFERENCE, V-Z VPMOVB2M/VPMOVW2M/VPMOVD2M/VPMOVQ2M—Convert a Vector Register to a Mask Opcode/ Instruction Op/ En EVEX.128.F3.0F38.W0 29 /r VPMOVB2M k1, xmm1 EVEX.256.F3.0F38.W0 29 /r VPMOVB2M k1, ymm1 EVEX.512.F3.0F38.W0 29 /r VPMOVB2M k1, zmm1 EVEX.128.F3.0F38.W1 29 /r VPMOVW2M k1, xmm1 EVEX.256.F3.0F38.W1 29 /r VPMOVW2M k1, ymm1 EVEX.512.F3.0F38.W1 29 /r VPMOVW2M k1, zmm1 EVEX.128.F3.0F38.W0 39 /r VPMOVD2M k1, xmm1 EVEX.256.F3.0F38.W0 39 /r VPMOVD2M k1, ymm1 EVEX.512.F3.0F38.W0 39 /r VPMOVD2M k1, zmm1 EVEX.128.F3.0F38.W1 39 /r VPMOVQ2M k1, xmm1 EVEX.256.F3.0F38.W1 39 /r VPMOVQ2M k1, ymm1 EVEX.512.F3.0F38.W1 39 /r VPMOVQ2M k1, zmm1 RM 64/32 bit Mode Support V/V RM V/V RM V/V RM V/V RM V/V RM V/V RM V/V RM V/V RM V/V RM V/V RM V/V RM V/V CPUID Feature Flag AVX512VL AVX512BW AVX512VL AVX512BW AVX512BW AVX512VL AVX512BW AVX512VL AVX512BW AVX512BW AVX512VL AVX512DQ AVX512VL AVX512DQ AVX512DQ AVX512VL AVX512DQ AVX512VL AVX512DQ AVX512DQ Description Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding byte in XMM1. Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding byte in YMM1. Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding byte in ZMM1. Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding word in XMM1. Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding word in YMM1. Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding word in ZMM1. 
Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding doubleword in XMM1.
Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding doubleword in YMM1.
Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding doubleword in ZMM1.
Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding quadword in XMM1.
Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding quadword in YMM1.
Sets each bit in k1 to 1 or 0 based on the value of the most significant bit of the corresponding quadword in ZMM1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
Converts a vector register to a mask register. Each element in the destination register is set to 1 or 0 depending on the value of the most significant bit of the corresponding element in the source register.
The source operand is a ZMM/YMM/XMM register. The destination operand is a mask register.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VPMOVB2M (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF SRC[i+7]
        THEN DEST[j] ← 1
        ELSE DEST[j] ← 0
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0

VPMOVW2M (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF SRC[i+15]
        THEN DEST[j] ← 1
        ELSE DEST[j] ← 0
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0

VPMOVD2M (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF SRC[i+31]
        THEN DEST[j] ← 1
        ELSE DEST[j] ← 0
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0

VPMOVQ2M (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF SRC[i+63]
        THEN DEST[j] ← 1
        ELSE DEST[j] ← 0
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0

Intel C/C++ Compiler Intrinsic Equivalents
VPMOVB2M __mmask64 _mm512_movepi8_mask( __m512i );
VPMOVD2M __mmask16 _mm512_movepi32_mask( __m512i );
VPMOVQ2M __mmask8 _mm512_movepi64_mask( __m512i );
VPMOVW2M __mmask32 _mm512_movepi16_mask( __m512i );
VPMOVB2M __mmask32 _mm256_movepi8_mask( __m256i );
VPMOVD2M __mmask8 _mm256_movepi32_mask( __m256i );
VPMOVQ2M __mmask8 _mm256_movepi64_mask( __m256i );
VPMOVW2M __mmask16 _mm256_movepi16_mask( __m256i );
VPMOVB2M __mmask16 _mm_movepi8_mask( __m128i );
VPMOVD2M __mmask8 _mm_movepi32_mask( __m128i );
VPMOVQ2M __mmask8 _mm_movepi64_mask( __m128i );
VPMOVW2M __mmask8 _mm_movepi16_mask( __m128i );

SIMD Floating-Point Exceptions
None

Other Exceptions
EVEX-encoded instruction, see Exceptions Type E7NM
#UD If EVEX.vvvv != 1111B.
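The inverse of the preceding entry, sketched with the _mm512_movepi16_mask intrinsic listed above (AVX512BW; the input values and harness are ours):

#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
int main(void) {
    int16_t in[32];
    for (int j = 0; j < 32; j++)
        in[j] = (int16_t)((j & 1) ? -j : j);   /* odd lanes negative, so MSB set */
    __m512i v = _mm512_loadu_si512(in);
    __mmask32 k = _mm512_movepi16_mask(v);     /* VPMOVW2M: bit j = sign of lane j */
    printf("mask = 0x%08X\n", (unsigned)k);    /* expect 0xAAAAAAAA */
    return 0;
}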
2C 5-405 INSTRUCTION SET REFERENCE, V-Z VPMOVQB/VPMOVSQB/VPMOVUSQB—Down Convert QWord to Byte Opcode/ Instruction Op / En OVM 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512F EVEX.128.F3.0F38.W0 32 /r VPMOVQB xmm1/m16 {k1}{z}, xmm2 EVEX.128.F3.0F38.W0 22 /r VPMOVSQB xmm1/m16 {k1}{z}, xmm2 OVM V/V AVX512VL AVX512F EVEX.128.F3.0F38.W0 12 /r VPMOVUSQB xmm1/m16 {k1}{z}, xmm2 OVM V/V AVX512VL AVX512F EVEX.256.F3.0F38.W0 32 /r VPMOVQB xmm1/m32 {k1}{z}, ymm2 OVM V/V AVX512VL AVX512F EVEX.256.F3.0F38.W0 22 /r VPMOVSQB xmm1/m32 {k1}{z}, ymm2 OVM V/V AVX512VL AVX512F EVEX.256.F3.0F38.W0 12 /r VPMOVUSQB xmm1/m32 {k1}{z}, ymm2 OVM V/V AVX512VL AVX512F EVEX.512.F3.0F38.W0 32 /r VPMOVQB xmm1/m64 {k1}{z}, zmm2 OVM V/V AVX512F EVEX.512.F3.0F38.W0 22 /r VPMOVSQB xmm1/m64 {k1}{z}, zmm2 OVM V/V AVX512F EVEX.512.F3.0F38.W0 12 /r VPMOVUSQB xmm1/m64 {k1}{z}, zmm2 OVM V/V AVX512F Description Converts 2 packed quad-word integers from xmm2 into 2 packed byte integers in xmm1/m16 with truncation under writemask k1. Converts 2 packed signed quad-word integers from xmm2 into 2 packed signed byte integers in xmm1/m16 using signed saturation under writemask k1. Converts 2 packed unsigned quad-word integers from xmm2 into 2 packed unsigned byte integers in xmm1/m16 using unsigned saturation under writemask k1. Converts 4 packed quad-word integers from ymm2 into 4 packed byte integers in xmm1/m32 with truncation under writemask k1. Converts 4 packed signed quad-word integers from ymm2 into 4 packed signed byte integers in xmm1/m32 using signed saturation under writemask k1. Converts 4 packed unsigned quad-word integers from ymm2 into 4 packed unsigned byte integers in xmm1/m32 using unsigned saturation under writemask k1. Converts 8 packed quad-word integers from zmm2 into 8 packed byte integers in xmm1/m64 with truncation under writemask k1. Converts 8 packed signed quad-word integers from zmm2 into 8 packed signed byte integers in xmm1/m64 using signed saturation under writemask k1. Converts 8 packed unsigned quad-word integers from zmm2 into 8 packed unsigned byte integers in xmm1/m64 using unsigned saturation under writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 OVM ModRM:r/m (w) ModRM:reg (r) NA NA Description VPMOVQB down converts 64-bit integer elements in the source operand (the second operand) into packed byte elements using truncation. VPMOVSQB converts signed 64-bit integers into packed signed bytes using signed saturation. VPMOVUSQB convert unsigned quad-word values into unsigned byte values using unsigned saturation. The source operand is a vector register. The destination operand is an XMM register or a memory location. Down-converted byte elements are written to the destination operand (the first operand) from the least-significant byte. Byte elements of the destination operand are updated according to the writemask. Bits (MAX_VL-1:64) of the destination are zeroed. EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD. 5-406 Vol. 
2C VPMOVQB/VPMOVSQB/VPMOVUSQB—Down Convert QWord to Byte INSTRUCTION SET REFERENCE, V-Z Operation VPMOVQB instruction (EVEX encoded versions) when dest is a register (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 ij*8 m j * 64 IF k1[j] OR *no writemask* THEN DEST[i+7:i] TruncateQuadWordToByte (SRC[m+63:m]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+7:i] remains unchanged* ELSE *zeroing-masking* ; zeroing-masking DEST[i+7:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL/8] 0; VPMOVQB instruction (EVEX encoded versions) when dest is memory (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 ij*8 m j * 64 IF k1[j] OR *no writemask* THEN DEST[i+7:i] TruncateQuadWordToByte (SRC[m+63:m]) ELSE *DEST[i+7:i] remains unchanged* ; merging-masking FI; ENDFOR VPMOVSQB instruction (EVEX encoded versions) when dest is a register (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 ij*8 m j * 64 IF k1[j] OR *no writemask* THEN DEST[i+7:i] SaturateSignedQuadWordToByte (SRC[m+63:m]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+7:i] remains unchanged* ELSE *zeroing-masking* ; zeroing-masking DEST[i+7:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL/8] 0; VPMOVQB/VPMOVSQB/VPMOVUSQB—Down Convert QWord to Byte Vol. 2C 5-407 INSTRUCTION SET REFERENCE, V-Z VPMOVSQB instruction (EVEX encoded versions) when dest is memory (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 ij*8 m j * 64 IF k1[j] OR *no writemask* THEN DEST[i+7:i] SaturateSignedQuadWordToByte (SRC[m+63:m]) ELSE *DEST[i+7:i] remains unchanged* ; merging-masking FI; ENDFOR VPMOVUSQB instruction (EVEX encoded versions) when dest is a register (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 ij*8 m j * 64 IF k1[j] OR *no writemask* THEN DEST[i+7:i] SaturateUnsignedQuadWordToByte (SRC[m+63:m]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+7:i] remains unchanged* ELSE *zeroing-masking* ; zeroing-masking DEST[i+7:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL/8] 0; VPMOVUSQB instruction (EVEX encoded versions) when dest is memory (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 ij*8 m j * 64 IF k1[j] OR *no writemask* THEN DEST[i+7:i] SaturateUnsignedQuadWordToByte (SRC[m+63:m]) ELSE *DEST[i+7:i] remains unchanged* ; merging-masking FI; ENDFOR 5-408 Vol. 
2C VPMOVQB/VPMOVSQB/VPMOVUSQB—Down Convert QWord to Byte INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalents VPMOVQB __m128i _mm512_cvtepi64_epi8( __m512i a); VPMOVQB __m128i _mm512_mask_cvtepi64_epi8(__m128i s, __mmask8 k, __m512i a); VPMOVQB __m128i _mm512_maskz_cvtepi64_epi8( __mmask8 k, __m512i a); VPMOVQB void _mm512_mask_cvtepi64_storeu_epi8(void * d, __mmask8 k, __m512i a); VPMOVSQB __m128i _mm512_cvtsepi64_epi8( __m512i a); VPMOVSQB __m128i _mm512_mask_cvtsepi64_epi8(__m128i s, __mmask8 k, __m512i a); VPMOVSQB __m128i _mm512_maskz_cvtsepi64_epi8( __mmask8 k, __m512i a); VPMOVSQB void _mm512_mask_cvtsepi64_storeu_epi8(void * d, __mmask8 k, __m512i a); VPMOVUSQB __m128i _mm512_cvtusepi64_epi8( __m512i a); VPMOVUSQB __m128i _mm512_mask_cvtusepi64_epi8(__m128i s, __mmask8 k, __m512i a); VPMOVUSQB __m128i _mm512_maskz_cvtusepi64_epi8( __mmask8 k, __m512i a); VPMOVUSQB void _mm512_mask_cvtusepi64_storeu_epi8(void * d, __mmask8 k, __m512i a); VPMOVUSQB __m128i _mm256_cvtusepi64_epi8(__m256i a); VPMOVUSQB __m128i _mm256_mask_cvtusepi64_epi8(__m128i a, __mmask8 k, __m256i b); VPMOVUSQB __m128i _mm256_maskz_cvtusepi64_epi8( __mmask8 k, __m256i b); VPMOVUSQB void _mm256_mask_cvtusepi64_storeu_epi8(void * , __mmask8 k, __m256i b); VPMOVUSQB __m128i _mm_cvtusepi64_epi8(__m128i a); VPMOVUSQB __m128i _mm_mask_cvtusepi64_epi8(__m128i a, __mmask8 k, __m128i b); VPMOVUSQB __m128i _mm_maskz_cvtusepi64_epi8( __mmask8 k, __m128i b); VPMOVUSQB void _mm_mask_cvtusepi64_storeu_epi8(void * , __mmask8 k, __m128i b); VPMOVSQB __m128i _mm256_cvtsepi64_epi8(__m256i a); VPMOVSQB __m128i _mm256_mask_cvtsepi64_epi8(__m128i a, __mmask8 k, __m256i b); VPMOVSQB __m128i _mm256_maskz_cvtsepi64_epi8( __mmask8 k, __m256i b); VPMOVSQB void _mm256_mask_cvtsepi64_storeu_epi8(void * , __mmask8 k, __m256i b); VPMOVSQB __m128i _mm_cvtsepi64_epi8(__m128i a); VPMOVSQB __m128i _mm_mask_cvtsepi64_epi8(__m128i a, __mmask8 k, __m128i b); VPMOVSQB __m128i _mm_maskz_cvtsepi64_epi8( __mmask8 k, __m128i b); VPMOVSQB void _mm_mask_cvtsepi64_storeu_epi8(void * , __mmask8 k, __m128i b); VPMOVQB __m128i _mm256_cvtepi64_epi8(__m256i a); VPMOVQB __m128i _mm256_mask_cvtepi64_epi8(__m128i a, __mmask8 k, __m256i b); VPMOVQB __m128i _mm256_maskz_cvtepi64_epi8( __mmask8 k, __m256i b); VPMOVQB void _mm256_mask_cvtepi64_storeu_epi8(void * , __mmask8 k, __m256i b); VPMOVQB __m128i _mm_cvtepi64_epi8(__m128i a); VPMOVQB __m128i _mm_mask_cvtepi64_epi8(__m128i a, __mmask8 k, __m128i b); VPMOVQB __m128i _mm_maskz_cvtepi64_epi8( __mmask8 k, __m128i b); VPMOVQB void _mm_mask_cvtepi64_storeu_epi8(void * , __mmask8 k, __m128i b); SIMD Floating-Point Exceptions None Other Exceptions EVEX-encoded instruction, see Exceptions Type E6. #UD If EVEX.vvvv != 1111B. VPMOVQB/VPMOVSQB/VPMOVUSQB—Down Convert QWord to Byte Vol. 
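A small sketch contrasting truncation (VPMOVQB) with signed saturation (VPMOVSQB) through the 512-bit intrinsics listed above (AVX512F; the values and harness are ours):

#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
int main(void) {
    int64_t in[8] = {1, -1, 127, 128, -129, 300, 0x1FF, -70000};
    __m512i v = _mm512_loadu_si512(in);
    int8_t t[16], s[16];
    _mm_storeu_si128((__m128i *)t, _mm512_cvtepi64_epi8(v));   /* truncate */
    _mm_storeu_si128((__m128i *)s, _mm512_cvtsepi64_epi8(v));  /* signed saturate */
    for (int j = 0; j < 8; j++)
        printf("%7lld -> trunc %4d, sat %4d\n", (long long)in[j], t[j], s[j]);
    return 0;
}

Values outside [-128, 127] keep only their low byte under truncation but clamp to -128 or 127 under saturation.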
VPMOVQW/VPMOVSQW/VPMOVUSQW—Down Convert QWord to Word

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.F3.0F38.W0 34 /r VPMOVQW xmm1/m32 {k1}{z}, xmm2 | QVM | V/V | AVX512VL AVX512F | Converts 2 packed quad-word integers from xmm2 into 2 packed word integers in xmm1/m32 with truncation under writemask k1.
EVEX.128.F3.0F38.W0 24 /r VPMOVSQW xmm1/m32 {k1}{z}, xmm2 | QVM | V/V | AVX512VL AVX512F | Converts 2 packed signed quad-word integers from xmm2 into 2 packed signed word integers in xmm1/m32 using signed saturation under writemask k1.
EVEX.128.F3.0F38.W0 14 /r VPMOVUSQW xmm1/m32 {k1}{z}, xmm2 | QVM | V/V | AVX512VL AVX512F | Converts 2 packed unsigned quad-word integers from xmm2 into 2 packed unsigned word integers in xmm1/m32 using unsigned saturation under writemask k1.
EVEX.256.F3.0F38.W0 34 /r VPMOVQW xmm1/m64 {k1}{z}, ymm2 | QVM | V/V | AVX512VL AVX512F | Converts 4 packed quad-word integers from ymm2 into 4 packed word integers in xmm1/m64 with truncation under writemask k1.
EVEX.256.F3.0F38.W0 24 /r VPMOVSQW xmm1/m64 {k1}{z}, ymm2 | QVM | V/V | AVX512VL AVX512F | Converts 4 packed signed quad-word integers from ymm2 into 4 packed signed word integers in xmm1/m64 using signed saturation under writemask k1.
EVEX.256.F3.0F38.W0 14 /r VPMOVUSQW xmm1/m64 {k1}{z}, ymm2 | QVM | V/V | AVX512VL AVX512F | Converts 4 packed unsigned quad-word integers from ymm2 into 4 packed unsigned word integers in xmm1/m64 using unsigned saturation under writemask k1.
EVEX.512.F3.0F38.W0 34 /r VPMOVQW xmm1/m128 {k1}{z}, zmm2 | QVM | V/V | AVX512F | Converts 8 packed quad-word integers from zmm2 into 8 packed word integers in xmm1/m128 with truncation under writemask k1.
EVEX.512.F3.0F38.W0 24 /r VPMOVSQW xmm1/m128 {k1}{z}, zmm2 | QVM | V/V | AVX512F | Converts 8 packed signed quad-word integers from zmm2 into 8 packed signed word integers in xmm1/m128 using signed saturation under writemask k1.
EVEX.512.F3.0F38.W0 14 /r VPMOVUSQW xmm1/m128 {k1}{z}, zmm2 | QVM | V/V | AVX512F | Converts 8 packed unsigned quad-word integers from zmm2 into 8 packed unsigned word integers in xmm1/m128 using unsigned saturation under writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
QVM | ModRM:r/m (w) | ModRM:reg (r) | NA | NA

Description
VPMOVQW down converts 64-bit integer elements in the source operand (the second operand) into packed words using truncation. VPMOVSQW converts signed 64-bit integers into packed signed words using signed saturation. VPMOVUSQW converts unsigned quad-word values into unsigned word values using unsigned saturation.
The source operand is a ZMM/YMM/XMM register. The destination operand is an XMM register or a 128/64/32-bit memory location.
Down-converted word elements are written to the destination operand (the first operand) from the least-significant word. Word elements of the destination operand are updated according to the writemask. Bits (MAX_VL-1:128/64/32) of the register destination are zeroed.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
2C VPMOVQW/VPMOVSQW/VPMOVUSQW—Down Convert QWord to Word INSTRUCTION SET REFERENCE, V-Z Operation VPMOVQW instruction (EVEX encoded versions) when dest is a register (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 16 m j * 64 IF k1[j] OR *no writemask* THEN DEST[i+15:i] TruncateQuadWordToWord (SRC[m+63:m]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+15:i] remains unchanged* ELSE *zeroing-masking* ; zeroing-masking DEST[i+15:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL/4] 0; VPMOVQW instruction (EVEX encoded versions) when dest is memory (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 16 m j * 64 IF k1[j] OR *no writemask* THEN DEST[i+15:i] TruncateQuadWordToWord (SRC[m+63:m]) ELSE *DEST[i+15:i] remains unchanged* ; merging-masking FI; ENDFOR VPMOVSQW instruction (EVEX encoded versions) when dest is a register (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 16 m j * 64 IF k1[j] OR *no writemask* THEN DEST[i+15:i] SaturateSignedQuadWordToWord (SRC[m+63:m]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+15:i] remains unchanged* ELSE *zeroing-masking* ; zeroing-masking DEST[i+15:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL/4] 0; VPMOVQW/VPMOVSQW/VPMOVUSQW—Down Convert QWord to Word Vol. 2C 5-411 INSTRUCTION SET REFERENCE, V-Z VPMOVSQW instruction (EVEX encoded versions) when dest is memory (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 16 m j * 64 IF k1[j] OR *no writemask* THEN DEST[i+15:i] SaturateSignedQuadWordToWord (SRC[m+63:m]) ELSE *DEST[i+15:i] remains unchanged* ; merging-masking FI; ENDFOR VPMOVUSQW instruction (EVEX encoded versions) when dest is a register (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 16 m j * 64 IF k1[j] OR *no writemask* THEN DEST[i+15:i] SaturateUnsignedQuadWordToWord (SRC[m+63:m]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+15:i] remains unchanged* ELSE *zeroing-masking* ; zeroing-masking DEST[i+15:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL/4] 0; VPMOVUSQW instruction (EVEX encoded versions) when dest is memory (KL, VL) = (2, 128), (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 16 m j * 64 IF k1[j] OR *no writemask* THEN DEST[i+15:i] SaturateUnsignedQuadWordToWord (SRC[m+63:m]) ELSE *DEST[i+15:i] remains unchanged* ; merging-masking FI; ENDFOR 5-412 Vol. 
2C VPMOVQW/VPMOVSQW/VPMOVUSQW—Down Convert QWord to Word INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalents VPMOVQW __m128i _mm512_cvtepi64_epi16( __m512i a); VPMOVQW __m128i _mm512_mask_cvtepi64_epi16(__m128i s, __mmask8 k, __m512i a); VPMOVQW __m128i _mm512_maskz_cvtepi64_epi16( __mmask8 k, __m512i a); VPMOVQW void _mm512_mask_cvtepi64_storeu_epi16(void * d, __mmask8 k, __m512i a); VPMOVSQW __m128i _mm512_cvtsepi64_epi16( __m512i a); VPMOVSQW __m128i _mm512_mask_cvtsepi64_epi16(__m128i s, __mmask8 k, __m512i a); VPMOVSQW __m128i _mm512_maskz_cvtsepi64_epi16( __mmask8 k, __m512i a); VPMOVSQW void _mm512_mask_cvtsepi64_storeu_epi16(void * d, __mmask8 k, __m512i a); VPMOVUSQW __m128i _mm512_cvtusepi64_epi16( __m512i a); VPMOVUSQW __m128i _mm512_mask_cvtusepi64_epi16(__m128i s, __mmask8 k, __m512i a); VPMOVUSQW __m128i _mm512_maskz_cvtusepi64_epi16( __mmask8 k, __m512i a); VPMOVUSQW void _mm512_mask_cvtusepi64_storeu_epi16(void * d, __mmask8 k, __m512i a); VPMOVUSQD __m128i _mm256_cvtusepi64_epi32(__m256i a); VPMOVUSQD __m128i _mm256_mask_cvtusepi64_epi32(__m128i a, __mmask8 k, __m256i b); VPMOVUSQD __m128i _mm256_maskz_cvtusepi64_epi32( __mmask8 k, __m256i b); VPMOVUSQD void _mm256_mask_cvtusepi64_storeu_epi32(void * , __mmask8 k, __m256i b); VPMOVUSQD __m128i _mm_cvtusepi64_epi32(__m128i a); VPMOVUSQD __m128i _mm_mask_cvtusepi64_epi32(__m128i a, __mmask8 k, __m128i b); VPMOVUSQD __m128i _mm_maskz_cvtusepi64_epi32( __mmask8 k, __m128i b); VPMOVUSQD void _mm_mask_cvtusepi64_storeu_epi32(void * , __mmask8 k, __m128i b); VPMOVSQD __m128i _mm256_cvtsepi64_epi32(__m256i a); VPMOVSQD __m128i _mm256_mask_cvtsepi64_epi32(__m128i a, __mmask8 k, __m256i b); VPMOVSQD __m128i _mm256_maskz_cvtsepi64_epi32( __mmask8 k, __m256i b); VPMOVSQD void _mm256_mask_cvtsepi64_storeu_epi32(void * , __mmask8 k, __m256i b); VPMOVSQD __m128i _mm_cvtsepi64_epi32(__m128i a); VPMOVSQD __m128i _mm_mask_cvtsepi64_epi32(__m128i a, __mmask8 k, __m128i b); VPMOVSQD __m128i _mm_maskz_cvtsepi64_epi32( __mmask8 k, __m128i b); VPMOVSQD void _mm_mask_cvtsepi64_storeu_epi32(void * , __mmask8 k, __m128i b); VPMOVQD __m128i _mm256_cvtepi64_epi32(__m256i a); VPMOVQD __m128i _mm256_mask_cvtepi64_epi32(__m128i a, __mmask8 k, __m256i b); VPMOVQD __m128i _mm256_maskz_cvtepi64_epi32( __mmask8 k, __m256i b); VPMOVQD void _mm256_mask_cvtepi64_storeu_epi32(void * , __mmask8 k, __m256i b); VPMOVQD __m128i _mm_cvtepi64_epi32(__m128i a); VPMOVQD __m128i _mm_mask_cvtepi64_epi32(__m128i a, __mmask8 k, __m128i b); VPMOVQD __m128i _mm_maskz_cvtepi64_epi32( __mmask8 k, __m128i b); VPMOVQD void _mm_mask_cvtepi64_storeu_epi32(void * , __mmask8 k, __m128i b); SIMD Floating-Point Exceptions None Other Exceptions EVEX-encoded instruction, see Exceptions Type E6. #UD If EVEX.vvvv != 1111B. VPMOVQW/VPMOVSQW/VPMOVUSQW—Down Convert QWord to Word Vol. 
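The memory form with merging masking can be reached directly through the storeu intrinsics listed above; a sketch (AVX512F; buffer contents and harness are ours):

#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
int main(void) {
    int64_t in[8] = {0, 1, 2, 3, 40000, -40000, 70000, -70000};
    int16_t out[8] = {-1, -1, -1, -1, -1, -1, -1, -1};
    __m512i v = _mm512_loadu_si512(in);
    /* Memory destinations merge: only elements with a set k-bit are written. */
    _mm512_mask_cvtsepi64_storeu_epi16(out, (__mmask8)0x0F, v);
    for (int j = 0; j < 8; j++)
        printf("out[%d] = %d\n", j, out[j]);  /* elements 4..7 keep their -1 */
    return 0;
}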
VPMOVQD/VPMOVSQD/VPMOVUSQD—Down Convert QWord to DWord

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.F3.0F38.W0 35 /r VPMOVQD xmm1/m64 {k1}{z}, xmm2 | HVM | V/V | AVX512VL AVX512F | Converts 2 packed quad-word integers from xmm2 into 2 packed double-word integers in xmm1/m64 with truncation subject to writemask k1.
EVEX.128.F3.0F38.W0 25 /r VPMOVSQD xmm1/m64 {k1}{z}, xmm2 | HVM | V/V | AVX512VL AVX512F | Converts 2 packed signed quad-word integers from xmm2 into 2 packed signed double-word integers in xmm1/m64 using signed saturation subject to writemask k1.
EVEX.128.F3.0F38.W0 15 /r VPMOVUSQD xmm1/m64 {k1}{z}, xmm2 | HVM | V/V | AVX512VL AVX512F | Converts 2 packed unsigned quad-word integers from xmm2 into 2 packed unsigned double-word integers in xmm1/m64 using unsigned saturation subject to writemask k1.
EVEX.256.F3.0F38.W0 35 /r VPMOVQD xmm1/m128 {k1}{z}, ymm2 | HVM | V/V | AVX512VL AVX512F | Converts 4 packed quad-word integers from ymm2 into 4 packed double-word integers in xmm1/m128 with truncation subject to writemask k1.
EVEX.256.F3.0F38.W0 25 /r VPMOVSQD xmm1/m128 {k1}{z}, ymm2 | HVM | V/V | AVX512VL AVX512F | Converts 4 packed signed quad-word integers from ymm2 into 4 packed signed double-word integers in xmm1/m128 using signed saturation subject to writemask k1.
EVEX.256.F3.0F38.W0 15 /r VPMOVUSQD xmm1/m128 {k1}{z}, ymm2 | HVM | V/V | AVX512VL AVX512F | Converts 4 packed unsigned quad-word integers from ymm2 into 4 packed unsigned double-word integers in xmm1/m128 using unsigned saturation subject to writemask k1.
EVEX.512.F3.0F38.W0 35 /r VPMOVQD ymm1/m256 {k1}{z}, zmm2 | HVM | V/V | AVX512F | Converts 8 packed quad-word integers from zmm2 into 8 packed double-word integers in ymm1/m256 with truncation subject to writemask k1.
EVEX.512.F3.0F38.W0 25 /r VPMOVSQD ymm1/m256 {k1}{z}, zmm2 | HVM | V/V | AVX512F | Converts 8 packed signed quad-word integers from zmm2 into 8 packed signed double-word integers in ymm1/m256 using signed saturation subject to writemask k1.
EVEX.512.F3.0F38.W0 15 /r VPMOVUSQD ymm1/m256 {k1}{z}, zmm2 | HVM | V/V | AVX512F | Converts 8 packed unsigned quad-word integers from zmm2 into 8 packed unsigned double-word integers in ymm1/m256 using unsigned saturation subject to writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
HVM | ModRM:r/m (w) | ModRM:reg (r) | NA | NA

Description
VPMOVQD down converts 64-bit integer elements in the source operand (the second operand) into packed doublewords using truncation. VPMOVSQD converts signed 64-bit integers into packed signed doublewords using signed saturation. VPMOVUSQD converts unsigned quad-word values into unsigned double-word values using unsigned saturation.
The source operand is a ZMM/YMM/XMM register. The destination operand is a YMM/XMM/XMM register or a 256/128/64-bit memory location.
Down-converted doubleword elements are written to the destination operand (the first operand) from the least-significant doubleword. Doubleword elements of the destination operand are updated according to the writemask. Bits (MAX_VL-1:256/128/64) of the register destination are zeroed.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Operation
VPMOVQD instruction (EVEX encoded version) reg-reg form
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    m ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TruncateQuadWordToDWord (SRC[m+63:m])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL/2] ← 0;

VPMOVQD instruction (EVEX encoded version) memory form
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    m ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TruncateQuadWordToDWord (SRC[m+63:m])
        ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
    FI;
ENDFOR

VPMOVSQD instruction (EVEX encoded version) reg-reg form
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    m ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SaturateSignedQuadWordToDWord (SRC[m+63:m])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL/2] ← 0;

VPMOVSQD instruction (EVEX encoded version) memory form
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    m ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SaturateSignedQuadWordToDWord (SRC[m+63:m])
        ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
    FI;
ENDFOR

VPMOVUSQD instruction (EVEX encoded version) reg-reg form
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    m ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SaturateUnsignedQuadWordToDWord (SRC[m+63:m])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL/2] ← 0;

VPMOVUSQD instruction (EVEX encoded version) memory form
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    m ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SaturateUnsignedQuadWordToDWord (SRC[m+63:m])
        ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
    FI;
ENDFOR
2C VPMOVQD/VPMOVSQD/VPMOVUSQD—Down Convert QWord to DWord INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalents VPMOVQD __m256i _mm512_cvtepi64_epi32( __m512i a); VPMOVQD __m256i _mm512_mask_cvtepi64_epi32(__m256i s, __mmask8 k, __m512i a); VPMOVQD __m256i _mm512_maskz_cvtepi64_epi32( __mmask8 k, __m512i a); VPMOVQD void _mm512_mask_cvtepi64_storeu_epi32(void * d, __mmask8 k, __m512i a); VPMOVSQD __m256i _mm512_cvtsepi64_epi32( __m512i a); VPMOVSQD __m256i _mm512_mask_cvtsepi64_epi32(__m256i s, __mmask8 k, __m512i a); VPMOVSQD __m256i _mm512_maskz_cvtsepi64_epi32( __mmask8 k, __m512i a); VPMOVSQD void _mm512_mask_cvtsepi64_storeu_epi32(void * d, __mmask8 k, __m512i a); VPMOVUSQD __m256i _mm512_cvtusepi64_epi32( __m512i a); VPMOVUSQD __m256i _mm512_mask_cvtusepi64_epi32(__m256i s, __mmask8 k, __m512i a); VPMOVUSQD __m256i _mm512_maskz_cvtusepi64_epi32( __mmask8 k, __m512i a); VPMOVUSQD void _mm512_mask_cvtusepi64_storeu_epi32(void * d, __mmask8 k, __m512i a); VPMOVUSQD __m128i _mm256_cvtusepi64_epi32(__m256i a); VPMOVUSQD __m128i _mm256_mask_cvtusepi64_epi32(__m128i a, __mmask8 k, __m256i b); VPMOVUSQD __m128i _mm256_maskz_cvtusepi64_epi32( __mmask8 k, __m256i b); VPMOVUSQD void _mm256_mask_cvtusepi64_storeu_epi32(void * , __mmask8 k, __m256i b); VPMOVUSQD __m128i _mm_cvtusepi64_epi32(__m128i a); VPMOVUSQD __m128i _mm_mask_cvtusepi64_epi32(__m128i a, __mmask8 k, __m128i b); VPMOVUSQD __m128i _mm_maskz_cvtusepi64_epi32( __mmask8 k, __m128i b); VPMOVUSQD void _mm_mask_cvtusepi64_storeu_epi32(void * , __mmask8 k, __m128i b); VPMOVSQD __m128i _mm256_cvtsepi64_epi32(__m256i a); VPMOVSQD __m128i _mm256_mask_cvtsepi64_epi32(__m128i a, __mmask8 k, __m256i b); VPMOVSQD __m128i _mm256_maskz_cvtsepi64_epi32( __mmask8 k, __m256i b); VPMOVSQD void _mm256_mask_cvtsepi64_storeu_epi32(void * , __mmask8 k, __m256i b); VPMOVSQD __m128i _mm_cvtsepi64_epi32(__m128i a); VPMOVSQD __m128i _mm_mask_cvtsepi64_epi32(__m128i a, __mmask8 k, __m128i b); VPMOVSQD __m128i _mm_maskz_cvtsepi64_epi32( __mmask8 k, __m128i b); VPMOVSQD void _mm_mask_cvtsepi64_storeu_epi32(void * , __mmask8 k, __m128i b); VPMOVQD __m128i _mm256_cvtepi64_epi32(__m256i a); VPMOVQD __m128i _mm256_mask_cvtepi64_epi32(__m128i a, __mmask8 k, __m256i b); VPMOVQD __m128i _mm256_maskz_cvtepi64_epi32( __mmask8 k, __m256i b); VPMOVQD void _mm256_mask_cvtepi64_storeu_epi32(void * , __mmask8 k, __m256i b); VPMOVQD __m128i _mm_cvtepi64_epi32(__m128i a); VPMOVQD __m128i _mm_mask_cvtepi64_epi32(__m128i a, __mmask8 k, __m128i b); VPMOVQD __m128i _mm_maskz_cvtepi64_epi32( __mmask8 k, __m128i b); VPMOVQD void _mm_mask_cvtepi64_storeu_epi32(void * , __mmask8 k, __m128i b); SIMD Floating-Point Exceptions None Other Exceptions EVEX-encoded instruction, see Exceptions Type E6. #UD If EVEX.vvvv != 1111B. VPMOVQD/VPMOVSQD/VPMOVUSQD—Down Convert QWord to DWord Vol. 
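A sketch showing how truncation and signed saturation diverge once a quadword no longer fits in a doubleword, using the 512-bit intrinsics listed above (AVX512F; the values and harness are ours):

#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
int main(void) {
    int64_t in[8];
    for (int j = 0; j < 8; j++)
        in[j] = (int64_t)1 << (4 * j + 28);    /* powers of two from 2^28 to 2^56 */
    __m512i v = _mm512_loadu_si512(in);
    int32_t t[8], s[8];
    _mm256_storeu_si256((__m256i *)t, _mm512_cvtepi64_epi32(v));   /* truncate */
    _mm256_storeu_si256((__m256i *)s, _mm512_cvtsepi64_epi32(v));  /* saturate */
    for (int j = 0; j < 8; j++)
        printf("2^%-2d -> trunc %11d, sat %11d\n", 4 * j + 28, t[j], s[j]);
    return 0;
}

Only 2^28 survives both paths; larger values truncate to 0 but saturate to INT32_MAX.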
2C 5-417 INSTRUCTION SET REFERENCE, V-Z VPMOVDB/VPMOVSDB/VPMOVUSDB—Down Convert DWord to Byte Opcode/ Instruction Op / En QVM 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512F EVEX.128.F3.0F38.W0 31 /r VPMOVDB xmm1/m32 {k1}{z}, xmm2 EVEX.128.F3.0F38.W0 21 /r VPMOVSDB xmm1/m32 {k1}{z}, xmm2 QVM V/V AVX512VL AVX512F EVEX.128.F3.0F38.W0 11 /r VPMOVUSDB xmm1/m32 {k1}{z}, xmm2 QVM V/V AVX512VL AVX512F EVEX.256.F3.0F38.W0 31 /r VPMOVDB xmm1/m64 {k1}{z}, ymm2 QVM V/V AVX512VL AVX512F EVEX.256.F3.0F38.W0 21 /r VPMOVSDB xmm1/m64 {k1}{z}, ymm2 QVM V/V AVX512VL AVX512F EVEX.256.F3.0F38.W0 11 /r VPMOVUSDB xmm1/m64 {k1}{z}, ymm2 QVM V/V AVX512VL AVX512F EVEX.512.F3.0F38.W0 31 /r VPMOVDB xmm1/m128 {k1}{z}, zmm2 QVM V/V AVX512F EVEX.512.F3.0F38.W0 21 /r VPMOVSDB xmm1/m128 {k1}{z}, zmm2 QVM V/V AVX512F EVEX.512.F3.0F38.W0 11 /r VPMOVUSDB xmm1/m128 {k1}{z}, zmm2 QVM V/V AVX512F Description Converts 4 packed double-word integers from xmm2 into 4 packed byte integers in xmm1/m32 with truncation under writemask k1. Converts 4 packed signed double-word integers from xmm2 into 4 packed signed byte integers in xmm1/m32 using signed saturation under writemask k1. Converts 4 packed unsigned double-word integers from xmm2 into 4 packed unsigned byte integers in xmm1/m32 using unsigned saturation under writemask k1. Converts 8 packed double-word integers from ymm2 into 8 packed byte integers in xmm1/m64 with truncation under writemask k1. Converts 8 packed signed double-word integers from ymm2 into 8 packed signed byte integers in xmm1/m64 using signed saturation under writemask k1. Converts 8 packed unsigned double-word integers from ymm2 into 8 packed unsigned byte integers in xmm1/m64 using unsigned saturation under writemask k1. Converts 16 packed double-word integers from zmm2 into 16 packed byte integers in xmm1/m128 with truncation under writemask k1. Converts 16 packed signed double-word integers from zmm2 into 16 packed signed byte integers in xmm1/m128 using signed saturation under writemask k1. Converts 16 packed unsigned double-word integers from zmm2 into 16 packed unsigned byte integers in xmm1/m128 using unsigned saturation under writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 QVM ModRM:r/m (w) ModRM:reg (r) NA NA Description VPMOVDB down converts 32-bit integer elements in the source operand (the second operand) into packed bytes using truncation. VPMOVSDB converts signed 32-bit integers into packed signed bytes using signed saturation. VPMOVUSDB convert unsigned double-word values into unsigned byte values using unsigned saturation. The source operand is a ZMM/YMM/XMM register. The destination operand is a XMM register or a 128/64/32-bit memory location. Down-converted byte elements are written to the destination operand (the first operand) from the least-significant byte. Byte elements of the destination operand are updated according to the writemask. Bits (MAX_VL1:128/64/32) of the register destination are zeroed. EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD. 5-418 Vol. 
2C VPMOVDB/VPMOVSDB/VPMOVUSDB—Down Convert DWord to Byte INSTRUCTION SET REFERENCE, V-Z Operation VPMOVDB instruction (EVEX encoded versions) when dest is a register (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 ij*8 m j * 32 IF k1[j] OR *no writemask* THEN DEST[i+7:i] TruncateDoubleWordToByte (SRC[m+31:m]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+7:i] remains unchanged* ELSE *zeroing-masking* ; zeroing-masking DEST[i+7:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL/4] 0; VPMOVDB instruction (EVEX encoded versions) when dest is memory (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 ij*8 m j * 32 IF k1[j] OR *no writemask* THEN DEST[i+7:i] TruncateDoubleWordToByte (SRC[m+31:m]) ELSE *DEST[i+7:i] remains unchanged* ; merging-masking FI; ENDFOR VPMOVSDB instruction (EVEX encoded versions) when dest is a register (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 ij*8 m j * 32 IF k1[j] OR *no writemask* THEN DEST[i+7:i] SaturateSignedDoubleWordToByte (SRC[m+31:m]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+7:i] remains unchanged* ELSE *zeroing-masking* ; zeroing-masking DEST[i+7:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL/4] 0; VPMOVDB/VPMOVSDB/VPMOVUSDB—Down Convert DWord to Byte Vol. 2C 5-419 INSTRUCTION SET REFERENCE, V-Z VPMOVSDB instruction (EVEX encoded versions) when dest is memory (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 ij*8 m j * 32 IF k1[j] OR *no writemask* THEN DEST[i+7:i] SaturateSignedDoubleWordToByte (SRC[m+31:m]) ELSE *DEST[i+7:i] remains unchanged* ; merging-masking FI; ENDFOR VPMOVUSDB instruction (EVEX encoded versions) when dest is a register (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 ij*8 m j * 32 IF k1[j] OR *no writemask* THEN DEST[i+7:i] SaturateUnsignedDoubleWordToByte (SRC[m+31:m]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+7:i] remains unchanged* ELSE *zeroing-masking* ; zeroing-masking DEST[i+7:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL/4] 0; VPMOVUSDB instruction (EVEX encoded versions) when dest is memory (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 ij*8 m j * 32 IF k1[j] OR *no writemask* THEN DEST[i+7:i] SaturateUnsignedDoubleWordToByte (SRC[m+31:m]) ELSE *DEST[i+7:i] remains unchanged* ; merging-masking FI; ENDFOR 5-420 Vol. 
2C VPMOVDB/VPMOVSDB/VPMOVUSDB—Down Convert DWord to Byte INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalents VPMOVDB __m128i _mm512_cvtepi32_epi8( __m512i a); VPMOVDB __m128i _mm512_mask_cvtepi32_epi8(__m128i s, __mmask16 k, __m512i a); VPMOVDB __m128i _mm512_maskz_cvtepi32_epi8( __mmask16 k, __m512i a); VPMOVDB void _mm512_mask_cvtepi32_storeu_epi8(void * d, __mmask16 k, __m512i a); VPMOVSDB __m128i _mm512_cvtsepi32_epi8( __m512i a); VPMOVSDB __m128i _mm512_mask_cvtsepi32_epi8(__m128i s, __mmask16 k, __m512i a); VPMOVSDB __m128i _mm512_maskz_cvtsepi32_epi8( __mmask16 k, __m512i a); VPMOVSDB void _mm512_mask_cvtsepi32_storeu_epi8(void * d, __mmask16 k, __m512i a); VPMOVUSDB __m128i _mm512_cvtusepi32_epi8( __m512i a); VPMOVUSDB __m128i _mm512_mask_cvtusepi32_epi8(__m128i s, __mmask16 k, __m512i a); VPMOVUSDB __m128i _mm512_maskz_cvtusepi32_epi8( __mmask16 k, __m512i a); VPMOVUSDB void _mm512_mask_cvtusepi32_storeu_epi8(void * d, __mmask16 k, __m512i a); VPMOVUSDB __m128i _mm256_cvtusepi32_epi8(__m256i a); VPMOVUSDB __m128i _mm256_mask_cvtusepi32_epi8(__m128i a, __mmask8 k, __m256i b); VPMOVUSDB __m128i _mm256_maskz_cvtusepi32_epi8( __mmask8 k, __m256i b); VPMOVUSDB void _mm256_mask_cvtusepi32_storeu_epi8(void * , __mmask8 k, __m256i b); VPMOVUSDB __m128i _mm_cvtusepi32_epi8(__m128i a); VPMOVUSDB __m128i _mm_mask_cvtusepi32_epi8(__m128i a, __mmask8 k, __m128i b); VPMOVUSDB __m128i _mm_maskz_cvtusepi32_epi8( __mmask8 k, __m128i b); VPMOVUSDB void _mm_mask_cvtusepi32_storeu_epi8(void * , __mmask8 k, __m128i b); VPMOVSDB __m128i _mm256_cvtsepi32_epi8(__m256i a); VPMOVSDB __m128i _mm256_mask_cvtsepi32_epi8(__m128i a, __mmask8 k, __m256i b); VPMOVSDB __m128i _mm256_maskz_cvtsepi32_epi8( __mmask8 k, __m256i b); VPMOVSDB void _mm256_mask_cvtsepi32_storeu_epi8(void * , __mmask8 k, __m256i b); VPMOVSDB __m128i _mm_cvtsepi32_epi8(__m128i a); VPMOVSDB __m128i _mm_mask_cvtsepi32_epi8(__m128i a, __mmask8 k, __m128i b); VPMOVSDB __m128i _mm_maskz_cvtsepi32_epi8( __mmask8 k, __m128i b); VPMOVSDB void _mm_mask_cvtsepi32_storeu_epi8(void * , __mmask8 k, __m128i b); VPMOVDB __m128i _mm256_cvtepi32_epi8(__m256i a); VPMOVDB __m128i _mm256_mask_cvtepi32_epi8(__m128i a, __mmask8 k, __m256i b); VPMOVDB __m128i _mm256_maskz_cvtepi32_epi8( __mmask8 k, __m256i b); VPMOVDB void _mm256_mask_cvtepi32_storeu_epi8(void * , __mmask8 k, __m256i b); VPMOVDB __m128i _mm_cvtepi32_epi8(__m128i a); VPMOVDB __m128i _mm_mask_cvtepi32_epi8(__m128i a, __mmask8 k, __m128i b); VPMOVDB __m128i _mm_maskz_cvtepi32_epi8( __mmask8 k, __m128i b); VPMOVDB void _mm_mask_cvtepi32_storeu_epi8(void * , __mmask8 k, __m128i b); SIMD Floating-Point Exceptions None Other Exceptions EVEX-encoded instruction, see Exceptions Type E6. #UD If EVEX.vvvv != 1111B. VPMOVDB/VPMOVSDB/VPMOVUSDB—Down Convert DWord to Byte Vol. 
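A sketch of dword-to-byte narrowing with truncation versus unsigned saturation, using the 512-bit intrinsics listed above (AVX512F; the values and harness are ours):

#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
int main(void) {
    int32_t in[16];
    for (int j = 0; j < 16; j++)
        in[j] = 100 * j;                       /* 0..1500; values above 255 overflow a byte */
    __m512i v = _mm512_loadu_si512(in);
    uint8_t t[16], s[16];
    _mm_storeu_si128((__m128i *)t, _mm512_cvtepi32_epi8(v));     /* truncate */
    _mm_storeu_si128((__m128i *)s, _mm512_cvtusepi32_epi8(v));   /* unsigned saturate */
    for (int j = 0; j < 16; j++)
        printf("%4d -> trunc %3u, usat %3u\n", in[j], (unsigned)t[j], (unsigned)s[j]);
    return 0;
}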
2C 5-421 INSTRUCTION SET REFERENCE, V-Z VPMOVDW/VPMOVSDW/VPMOVUSDW—Down Convert DWord to Word Opcode/ Instruction Op / En HVM 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512F EVEX.128.F3.0F38.W0 33 /r VPMOVDW xmm1/m64 {k1}{z}, xmm2 EVEX.128.F3.0F38.W0 23 /r VPMOVSDW xmm1/m64 {k1}{z}, xmm2 HVM V/V AVX512VL AVX512F EVEX.128.F3.0F38.W0 13 /r VPMOVUSDW xmm1/m64 {k1}{z}, xmm2 HVM V/V AVX512VL AVX512F EVEX.256.F3.0F38.W0 33 /r VPMOVDW xmm1/m128 {k1}{z}, ymm2 HVM V/V AVX512VL AVX512F EVEX.256.F3.0F38.W0 23 /r VPMOVSDW xmm1/m128 {k1}{z}, ymm2 HVM V/V AVX512VL AVX512F EVEX.256.F3.0F38.W0 13 /r VPMOVUSDW xmm1/m128 {k1}{z}, ymm2 HVM V/V AVX512VL AVX512F EVEX.512.F3.0F38.W0 33 /r VPMOVDW ymm1/m256 {k1}{z}, zmm2 HVM V/V AVX512F EVEX.512.F3.0F38.W0 23 /r VPMOVSDW ymm1/m256 {k1}{z}, zmm2 HVM V/V AVX512F EVEX.512.F3.0F38.W0 13 /r VPMOVUSDW ymm1/m256 {k1}{z}, zmm2 HVM V/V AVX512F Description Converts 4 packed double-word integers from xmm2 into 4 packed word integers in xmm1/m64 with truncation under writemask k1. Converts 4 packed signed double-word integers from xmm2 into 4 packed signed word integers in ymm1/m64 using signed saturation under writemask k1. Converts 4 packed unsigned double-word integers from xmm2 into 4 packed unsigned word integers in xmm1/m64 using unsigned saturation under writemask k1. Converts 8 packed double-word integers from ymm2 into 8 packed word integers in xmm1/m128 with truncation under writemask k1. Converts 8 packed signed double-word integers from ymm2 into 8 packed signed word integers in xmm1/m128 using signed saturation under writemask k1. Converts 8 packed unsigned double-word integers from ymm2 into 8 packed unsigned word integers in xmm1/m128 using unsigned saturation under writemask k1. Converts 16 packed double-word integers from zmm2 into 16 packed word integers in ymm1/m256 with truncation under writemask k1. Converts 16 packed signed double-word integers from zmm2 into 16 packed signed word integers in ymm1/m256 using signed saturation under writemask k1. Converts 16 packed unsigned double-word integers from zmm2 into 16 packed unsigned word integers in ymm1/m256 using unsigned saturation under writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 HVM ModRM:r/m (w) ModRM:reg (r) NA NA Description VPMOVDW down converts 32-bit integer elements in the source operand (the second operand) into packed words using truncation. VPMOVSDW converts signed 32-bit integers into packed signed words using signed saturation. VPMOVUSDW convert unsigned double-word values into unsigned word values using unsigned saturation. The source operand is a ZMM/YMM/XMM register. The destination operand is a YMM/XMM/XMM register or a 256/128/64-bit memory location. Down-converted word elements are written to the destination operand (the first operand) from the least-significant word. Word elements of the destination operand are updated according to the writemask. Bits (MAX_VL1:256/128/64) of the register destination are zeroed. EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD. 5-422 Vol. 
2C VPMOVDW/VPMOVSDW/VPMOVUSDW—Down Convert DWord to Word INSTRUCTION SET REFERENCE, V-Z Operation VPMOVDW instruction (EVEX encoded versions) when dest is a register (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 16 m j * 32 IF k1[j] OR *no writemask* THEN DEST[i+15:i] TruncateDoubleWordToWord (SRC[m+31:m]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+15:i] remains unchanged* ELSE *zeroing-masking* ; zeroing-masking DEST[i+15:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL/2] 0; VPMOVDW instruction (EVEX encoded versions) when dest is memory (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 16 m j * 32 IF k1[j] OR *no writemask* THEN DEST[i+15:i] TruncateDoubleWordToWord (SRC[m+31:m]) ELSE *DEST[i+15:i] remains unchanged* ; merging-masking FI; ENDFOR VPMOVSDW instruction (EVEX encoded versions) when dest is a register (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 16 m j * 32 IF k1[j] OR *no writemask* THEN DEST[i+15:i] SaturateSignedDoubleWordToWord (SRC[m+31:m]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+15:i] remains unchanged* ELSE *zeroing-masking* ; zeroing-masking DEST[i+15:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL/2] 0; VPMOVDW/VPMOVSDW/VPMOVUSDW—Down Convert DWord to Word Vol. 2C 5-423 INSTRUCTION SET REFERENCE, V-Z VPMOVSDW instruction (EVEX encoded versions) when dest is memory (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 16 m j * 32 IF k1[j] OR *no writemask* THEN DEST[i+15:i] SaturateSignedDoubleWordToWord (SRC[m+31:m]) ELSE *DEST[i+15:i] remains unchanged* ; merging-masking FI; ENDFOR VPMOVUSDW instruction (EVEX encoded versions) when dest is a register (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 16 m j * 32 IF k1[j] OR *no writemask* THEN DEST[i+15:i] SaturateUnsignedDoubleWordToWord (SRC[m+31:m]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+15:i] remains unchanged* ELSE *zeroing-masking* ; zeroing-masking DEST[i+15:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL/2] 0; VPMOVUSDW instruction (EVEX encoded versions) when dest is memory (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 16 m j * 32 IF k1[j] OR *no writemask* THEN DEST[i+15:i] SaturateUnsignedDoubleWordToWord (SRC[m+31:m]) ELSE *DEST[i+15:i] remains unchanged* ; merging-masking FI; ENDFOR 5-424 Vol. 
2C VPMOVDW/VPMOVSDW/VPMOVUSDW—Down Convert DWord to Word INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalents VPMOVDW __m256i _mm512_cvtepi32_epi16( __m512i a); VPMOVDW __m256i _mm512_mask_cvtepi32_epi16(__m256i s, __mmask16 k, __m512i a); VPMOVDW __m256i _mm512_maskz_cvtepi32_epi16( __mmask16 k, __m512i a); VPMOVDW void _mm512_mask_cvtepi32_storeu_epi16(void * d, __mmask16 k, __m512i a); VPMOVSDW __m256i _mm512_cvtsepi32_epi16( __m512i a); VPMOVSDW __m256i _mm512_mask_cvtsepi32_epi16(__m256i s, __mmask16 k, __m512i a); VPMOVSDW __m256i _mm512_maskz_cvtsepi32_epi16( __mmask16 k, __m512i a); VPMOVSDW void _mm512_mask_cvtsepi32_storeu_epi16(void * d, __mmask16 k, __m512i a); VPMOVUSDW __m256i _mm512_cvtusepi32_epi16 __m512i a); VPMOVUSDW __m256i _mm512_mask_cvtusepi32_epi16(__m256i s, __mmask16 k, __m512i a); VPMOVUSDW __m256i _mm512_maskz_cvtusepi32_epi16( __mmask16 k, __m512i a); VPMOVUSDW void _mm512_mask_cvtusepi32_storeu_epi16(void * d, __mmask16 k, __m512i a); VPMOVUSDW __m128i _mm256_cvtusepi32_epi16(__m256i a); VPMOVUSDW __m128i _mm256_mask_cvtusepi32_epi16(__m128i a, __mmask8 k, __m256i b); VPMOVUSDW __m128i _mm256_maskz_cvtusepi32_epi16( __mmask8 k, __m256i b); VPMOVUSDW void _mm256_mask_cvtusepi32_storeu_epi16(void * , __mmask8 k, __m256i b); VPMOVUSDW __m128i _mm_cvtusepi32_epi16(__m128i a); VPMOVUSDW __m128i _mm_mask_cvtusepi32_epi16(__m128i a, __mmask8 k, __m128i b); VPMOVUSDW __m128i _mm_maskz_cvtusepi32_epi16( __mmask8 k, __m128i b); VPMOVUSDW void _mm_mask_cvtusepi32_storeu_epi16(void * , __mmask8 k, __m128i b); VPMOVSDW __m128i _mm256_cvtsepi32_epi16(__m256i a); VPMOVSDW __m128i _mm256_mask_cvtsepi32_epi16(__m128i a, __mmask8 k, __m256i b); VPMOVSDW __m128i _mm256_maskz_cvtsepi32_epi16( __mmask8 k, __m256i b); VPMOVSDW void _mm256_mask_cvtsepi32_storeu_epi16(void * , __mmask8 k, __m256i b); VPMOVSDW __m128i _mm_cvtsepi32_epi16(__m128i a); VPMOVSDW __m128i _mm_mask_cvtsepi32_epi16(__m128i a, __mmask8 k, __m128i b); VPMOVSDW __m128i _mm_maskz_cvtsepi32_epi16( __mmask8 k, __m128i b); VPMOVSDW void _mm_mask_cvtsepi32_storeu_epi16(void * , __mmask8 k, __m128i b); VPMOVDW __m128i _mm256_cvtepi32_epi16(__m256i a); VPMOVDW __m128i _mm256_mask_cvtepi32_epi16(__m128i a, __mmask8 k, __m256i b); VPMOVDW __m128i _mm256_maskz_cvtepi32_epi16( __mmask8 k, __m256i b); VPMOVDW void _mm256_mask_cvtepi32_storeu_epi16(void * , __mmask8 k, __m256i b); VPMOVDW __m128i _mm_cvtepi32_epi16(__m128i a); VPMOVDW __m128i _mm_mask_cvtepi32_epi16(__m128i a, __mmask8 k, __m128i b); VPMOVDW __m128i _mm_maskz_cvtepi32_epi16( __mmask8 k, __m128i b); VPMOVDW void _mm_mask_cvtepi32_storeu_epi16(void * , __mmask8 k, __m128i b); SIMD Floating-Point Exceptions None Other Exceptions EVEX-encoded instruction, see Exceptions Type E6. #UD If EVEX.vvvv != 1111B. VPMOVDW/VPMOVSDW/VPMOVUSDW—Down Convert DWord to Word Vol. 
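The register form with merging masking, sketched with the mask intrinsic listed above (AVX512F; the merge value 9999 and the harness are ours):

#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
int main(void) {
    int32_t in[16];
    for (int j = 0; j < 16; j++)
        in[j] = j - 8;                          /* small signed dwords */
    __m512i v = _mm512_loadu_si512(in);
    __m256i keep = _mm256_set1_epi16(9999);     /* merge source for masked-off lanes */
    __m256i r = _mm512_mask_cvtepi32_epi16(keep, (__mmask16)0x00FF, v);
    int16_t out[16];
    _mm256_storeu_si256((__m256i *)out, r);
    for (int j = 0; j < 16; j++)
        printf("out[%2d] = %d\n", j, out[j]);   /* lanes 8..15 stay 9999 */
    return 0;
}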
2C 5-425 INSTRUCTION SET REFERENCE, V-Z VPMOVWB/VPMOVSWB/VPMOVUSWB—Down Convert Word to Byte Opcode/ Instruction Op / En HVM 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512BW EVEX.128.F3.0F38.W0 30 /r VPMOVWB xmm1/m64 {k1}{z}, xmm2 EVEX.128.F3.0F38.W0 20 /r VPMOVSWB xmm1/m64 {k1}{z}, xmm2 EVEX.128.F3.0F38.W0 10 /r VPMOVUSWB xmm1/m64 {k1}{z}, xmm2 EVEX.256.F3.0F38.W0 30 /r VPMOVWB xmm1/m128 {k1}{z}, ymm2 EVEX.256.F3.0F38.W0 20 /r VPMOVSWB xmm1/m128 {k1}{z}, ymm2 EVEX.256.F3.0F38.W0 10 /r VPMOVUSWB xmm1/m128 {k1}{z}, ymm2 EVEX.512.F3.0F38.W0 30 /r VPMOVWB ymm1/m256 {k1}{z}, zmm2 EVEX.512.F3.0F38.W0 20 /r VPMOVSWB ymm1/m256 {k1}{z}, zmm2 EVEX.512.F3.0F38.W0 10 /r VPMOVUSWB ymm1/m256 {k1}{z}, zmm2 HVM V/V AVX512VL AVX512BW HVM V/V AVX512VL AVX512BW HVM V/V AVX512VL AVX512BW HVM V/V AVX512VL AVX512BW HVM V/V AVX512VL AVX512BW HVM V/V AVX512BW HVM V/V AVX512BW HVM V/V AVX512BW Description Converts 8 packed word integers from xmm2 into 8 packed bytes in xmm1/m64 with truncation under writemask k1. Converts 8 packed signed word integers from xmm2 into 8 packed signed bytes in xmm1/m64 using signed saturation under writemask k1. Converts 8 packed unsigned word integers from xmm2 into 8 packed unsigned bytes in 8mm1/m64 using unsigned saturation under writemask k1. Converts 16 packed word integers from ymm2 into 16 packed bytes in xmm1/m128 with truncation under writemask k1. Converts 16 packed signed word integers from ymm2 into 16 packed signed bytes in xmm1/m128 using signed saturation under writemask k1. Converts 16 packed unsigned word integers from ymm2 into 16 packed unsigned bytes in xmm1/m128 using unsigned saturation under writemask k1. Converts 32 packed word integers from zmm2 into 32 packed bytes in ymm1/m256 with truncation under writemask k1. Converts 32 packed signed word integers from zmm2 into 32 packed signed bytes in ymm1/m256 using signed saturation under writemask k1. Converts 32 packed unsigned word integers from zmm2 into 32 packed unsigned bytes in ymm1/m256 using unsigned saturation under writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 HVM ModRM:r/m (w) ModRM:reg (r) NA NA Description VPMOVWB down converts 16-bit integers into packed bytes using truncation. VPMOVSWB converts signed 16-bit integers into packed signed bytes using signed saturation. VPMOVUSWB convert unsigned word values into unsigned byte values using unsigned saturation. The source operand is a ZMM/YMM/XMM register. The destination operand is a YMM/XMM/XMM register or a 256/128/64-bit memory location. Down-converted byte elements are written to the destination operand (the first operand) from the least-significant byte. Byte elements of the destination operand are updated according to the writemask. Bits (MAX_VL1:256/128/64) of the register destination are zeroed. Note: EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD. 5-426 Vol. 
2C VPMOVWB/VPMOVSWB/VPMOVUSWB—Down Convert Word to Byte INSTRUCTION SET REFERENCE, V-Z Operation VPMOVWB instruction (EVEX encoded versions) when dest is a register (KL, VL) = (8, 128), (16, 256), (32, 512) FOR j 0 TO Kl-1 ij*8 m j * 16 IF k1[j] OR *no writemask* THEN DEST[i+7:i] TruncateWordToByte (SRC[m+15:m]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+7:i] remains unchanged* ELSE *zeroing-masking* ; zeroing-masking DEST[i+7:i] = 0 FI FI; ENDFOR DEST[MAX_VL-1:VL/2] 0; VPMOVWB instruction (EVEX encoded versions) when dest is memory (KL, VL) = (8, 128), (16, 256), (32, 512) FOR j 0 TO Kl-1 ij*8 m j * 16 IF k1[j] OR *no writemask* THEN DEST[i+7:i] TruncateWordToByte (SRC[m+15:m]) ELSE *DEST[i+7:i] remains unchanged* ; merging-masking FI; ENDFOR VPMOVSWB instruction (EVEX encoded versions) when dest is a register (KL, VL) = (8, 128), (16, 256), (32, 512) FOR j 0 TO Kl-1 ij*8 m j * 16 IF k1[j] OR *no writemask* THEN DEST[i+7:i] SaturateSignedWordToByte (SRC[m+15:m]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+7:i] remains unchanged* ELSE *zeroing-masking* ; zeroing-masking DEST[i+7:i] = 0 FI FI; ENDFOR DEST[MAX_VL-1:VL/2] 0; VPMOVWB/VPMOVSWB/VPMOVUSWB—Down Convert Word to Byte Vol. 2C 5-427 INSTRUCTION SET REFERENCE, V-Z VPMOVSWB instruction (EVEX encoded versions) when dest is memory (KL, VL) = (8, 128), (16, 256), (32, 512) FOR j 0 TO Kl-1 ij*8 m j * 16 IF k1[j] OR *no writemask* THEN DEST[i+7:i] SaturateSignedWordToByte (SRC[m+15:m]) ELSE *DEST[i+7:i] remains unchanged* ; merging-masking FI; ENDFOR VPMOVUSWB instruction (EVEX encoded versions) when dest is a register (KL, VL) = (8, 128), (16, 256), (32, 512) FOR j 0 TO Kl-1 ij*8 m j * 16 IF k1[j] OR *no writemask* THEN DEST[i+7:i] SaturateUnsignedWordToByte (SRC[m+15:m]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+7:i] remains unchanged* ELSE *zeroing-masking* ; zeroing-masking DEST[i+7:i] = 0 FI FI; ENDFOR DEST[MAX_VL-1:VL/2] 0; VPMOVUSWB instruction (EVEX encoded versions) when dest is memory (KL, VL) = (8, 128), (16, 256), (32, 512) FOR j 0 TO Kl-1 ij*8 m j * 16 IF k1[j] OR *no writemask* THEN DEST[i+7:i] SaturateUnsignedWordToByte (SRC[m+15:m]) ELSE *DEST[i+7:i] remains unchanged* ; merging-masking FI; ENDFOR 5-428 Vol. 
Intel C/C++ Compiler Intrinsic Equivalents
VPMOVUSWB __m256i _mm512_cvtusepi16_epi8(__m512i a);
VPMOVUSWB __m256i _mm512_mask_cvtusepi16_epi8(__m256i a, __mmask32 k, __m512i b);
VPMOVUSWB __m256i _mm512_maskz_cvtusepi16_epi8( __mmask32 k, __m512i b);
VPMOVUSWB void _mm512_mask_cvtusepi16_storeu_epi8(void * , __mmask32 k, __m512i b);
VPMOVSWB __m256i _mm512_cvtsepi16_epi8(__m512i a);
VPMOVSWB __m256i _mm512_mask_cvtsepi16_epi8(__m256i a, __mmask32 k, __m512i b);
VPMOVSWB __m256i _mm512_maskz_cvtsepi16_epi8( __mmask32 k, __m512i b);
VPMOVSWB void _mm512_mask_cvtsepi16_storeu_epi8(void * , __mmask32 k, __m512i b);
VPMOVWB __m256i _mm512_cvtepi16_epi8(__m512i a);
VPMOVWB __m256i _mm512_mask_cvtepi16_epi8(__m256i a, __mmask32 k, __m512i b);
VPMOVWB __m256i _mm512_maskz_cvtepi16_epi8( __mmask32 k, __m512i b);
VPMOVWB void _mm512_mask_cvtepi16_storeu_epi8(void * , __mmask32 k, __m512i b);
VPMOVUSWB __m128i _mm256_cvtusepi16_epi8(__m256i a);
VPMOVUSWB __m128i _mm256_mask_cvtusepi16_epi8(__m128i a, __mmask16 k, __m256i b);
VPMOVUSWB __m128i _mm256_maskz_cvtusepi16_epi8( __mmask16 k, __m256i b);
VPMOVUSWB void _mm256_mask_cvtusepi16_storeu_epi8(void * , __mmask16 k, __m256i b);
VPMOVUSWB __m128i _mm_cvtusepi16_epi8(__m128i a);
VPMOVUSWB __m128i _mm_mask_cvtusepi16_epi8(__m128i a, __mmask8 k, __m128i b);
VPMOVUSWB __m128i _mm_maskz_cvtusepi16_epi8( __mmask8 k, __m128i b);
VPMOVUSWB void _mm_mask_cvtusepi16_storeu_epi8(void * , __mmask8 k, __m128i b);
VPMOVSWB __m128i _mm256_cvtsepi16_epi8(__m256i a);
VPMOVSWB __m128i _mm256_mask_cvtsepi16_epi8(__m128i a, __mmask16 k, __m256i b);
VPMOVSWB __m128i _mm256_maskz_cvtsepi16_epi8( __mmask16 k, __m256i b);
VPMOVSWB void _mm256_mask_cvtsepi16_storeu_epi8(void * , __mmask16 k, __m256i b);
VPMOVSWB __m128i _mm_cvtsepi16_epi8(__m128i a);
VPMOVSWB __m128i _mm_mask_cvtsepi16_epi8(__m128i a, __mmask8 k, __m128i b);
VPMOVSWB __m128i _mm_maskz_cvtsepi16_epi8( __mmask8 k, __m128i b);
VPMOVSWB void _mm_mask_cvtsepi16_storeu_epi8(void * , __mmask8 k, __m128i b);
VPMOVWB __m128i _mm256_cvtepi16_epi8(__m256i a);
VPMOVWB __m128i _mm256_mask_cvtepi16_epi8(__m128i a, __mmask16 k, __m256i b);
VPMOVWB __m128i _mm256_maskz_cvtepi16_epi8( __mmask16 k, __m256i b);
VPMOVWB void _mm256_mask_cvtepi16_storeu_epi8(void * , __mmask16 k, __m256i b);
VPMOVWB __m128i _mm_cvtepi16_epi8(__m128i a);
VPMOVWB __m128i _mm_mask_cvtepi16_epi8(__m128i a, __mmask8 k, __m128i b);
VPMOVWB __m128i _mm_maskz_cvtepi16_epi8( __mmask8 k, __m128i b);
VPMOVWB void _mm_mask_cvtepi16_storeu_epi8(void * , __mmask8 k, __m128i b);

SIMD Floating-Point Exceptions
None

Other Exceptions
EVEX-encoded instruction, see Exceptions Type E6NF.
#UD If EVEX.vvvv != 1111B.
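A usage sketch for the 512-bit forms listed above. It assumes an AVX512BW-capable CPU and a compiler option such as -mavx512bw; the buffer names and values are illustrative. The masked memory-store form demonstrates that memory destinations are merging-masked only:

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int16_t words[32];
    int8_t bytes[32], masked[32];
    for (int i = 0; i < 32; i++) words[i] = (int16_t)(i * 50 - 400); /* some values exceed int8_t range */

    __m512i w = _mm512_loadu_si512(words);

    /* VPMOVSWB: narrow 32 words to 32 bytes with signed saturation. */
    __m256i b = _mm512_cvtsepi16_epi8(w);
    _mm256_storeu_si256((__m256i *)bytes, b);

    /* VPMOVSWB with a writemask and a memory destination:
       only even-indexed bytes are written; odd bytes keep the sentinel. */
    for (int i = 0; i < 32; i++) masked[i] = 99;
    _mm512_mask_cvtsepi16_storeu_epi8(masked, 0x55555555, w);

    for (int i = 0; i < 32; i++) printf("%d/%d ", bytes[i], masked[i]);
    printf("\n");
    return 0;
}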
PROLD/PROLVD/PROLQ/PROLVQ—Bit Rotate Left

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.66.0F38.W0 15 /r VPROLVD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV-RVM | V/V | AVX512VL AVX512F | Rotate doublewords in xmm2 left by count in the corresponding element of xmm3/m128/m32bcst. Result written to xmm1 under writemask k1.
EVEX.NDD.128.66.0F.W0 72 /1 ib VPROLD xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8 | FV-VMI | V/V | AVX512VL AVX512F | Rotate doublewords in xmm2/m128/m32bcst left by imm8. Result written to xmm1 using writemask k1.
EVEX.NDS.128.66.0F38.W1 15 /r VPROLVQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV-RVM | V/V | AVX512VL AVX512F | Rotate quadwords in xmm2 left by count in the corresponding element of xmm3/m128/m64bcst. Result written to xmm1 under writemask k1.
EVEX.NDD.128.66.0F.W1 72 /1 ib VPROLQ xmm1 {k1}{z}, xmm2/m128/m64bcst, imm8 | FV-VMI | V/V | AVX512VL AVX512F | Rotate quadwords in xmm2/m128/m64bcst left by imm8. Result written to xmm1 using writemask k1.
EVEX.NDS.256.66.0F38.W0 15 /r VPROLVD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV-RVM | V/V | AVX512VL AVX512F | Rotate doublewords in ymm2 left by count in the corresponding element of ymm3/m256/m32bcst. Result written to ymm1 under writemask k1.
EVEX.NDD.256.66.0F.W0 72 /1 ib VPROLD ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8 | FV-VMI | V/V | AVX512VL AVX512F | Rotate doublewords in ymm2/m256/m32bcst left by imm8. Result written to ymm1 using writemask k1.
EVEX.NDS.256.66.0F38.W1 15 /r VPROLVQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV-RVM | V/V | AVX512VL AVX512F | Rotate quadwords in ymm2 left by count in the corresponding element of ymm3/m256/m64bcst. Result written to ymm1 under writemask k1.
EVEX.NDD.256.66.0F.W1 72 /1 ib VPROLQ ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8 | FV-VMI | V/V | AVX512VL AVX512F | Rotate quadwords in ymm2/m256/m64bcst left by imm8. Result written to ymm1 using writemask k1.
EVEX.NDS.512.66.0F38.W0 15 /r VPROLVD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | FV-RVM | V/V | AVX512F | Rotate doublewords in zmm2 left by count in the corresponding element of zmm3/m512/m32bcst. Result written to zmm1 using writemask k1.
EVEX.NDD.512.66.0F.W0 72 /1 ib VPROLD zmm1 {k1}{z}, zmm2/m512/m32bcst, imm8 | FV-VMI | V/V | AVX512F | Rotate doublewords in zmm2/m512/m32bcst left by imm8. Result written to zmm1 using writemask k1.
EVEX.NDS.512.66.0F38.W1 15 /r VPROLVQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | FV-RVM | V/V | AVX512F | Rotate quadwords in zmm2 left by count in the corresponding element of zmm3/m512/m64bcst. Result written to zmm1 under writemask k1.
EVEX.NDD.512.66.0F.W1 72 /1 ib VPROLQ zmm1 {k1}{z}, zmm2/m512/m64bcst, imm8 | FV-VMI | V/V | AVX512F | Rotate quadwords in zmm2/m512/m64bcst left by imm8. Result written to zmm1 using writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV-VMI | VEX.vvvv (w) | ModRM:r/m (r) | Imm8 | NA
FV-RVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description
Rotates the bits in the individual data elements (doublewords, or quadwords) in the first source operand to the left by the number of bits specified in the count operand. If the value specified by the count operand is greater than 31 (for doublewords), or 63 (for a quadword), then the count operand modulo the data size (32 or 64) is used.
EVEX.128 encoded version: The destination operand is an XMM register. The source operand is an XMM register or a memory location (for the immediate form).
The count operand can come either from an XMM register, a memory location, or an 8-bit immediate. Bits (MAX_VL-1:128) of the corresponding ZMM register are zeroed.
EVEX.256 encoded version: The destination operand is a YMM register. The source operand is a YMM register or a memory location (for the immediate form). The count operand can come either from an XMM register, a memory location, or an 8-bit immediate. Bits (MAX_VL-1:256) of the corresponding ZMM register are zeroed.
EVEX.512 encoded version: The destination operand is a ZMM register updated according to the writemask. For the count operand in immediate form, the source operand can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 32/64-bit memory location; the count operand is an 8-bit immediate. For the count operand in variable form, the first source operand (the second operand) is a ZMM register and the count operand (the third operand) is a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 32/64-bit memory location.

Operation

LEFT_ROTATE_DWORDS(SRC, COUNT_SRC)
COUNT ← COUNT_SRC modulo 32;
DEST[31:0] ← (SRC << COUNT) | (SRC >> (32 - COUNT));

LEFT_ROTATE_QWORDS(SRC, COUNT_SRC)
COUNT ← COUNT_SRC modulo 64;
DEST[63:0] ← (SRC << COUNT) | (SRC >> (64 - COUNT));

VPROLD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC1 *is memory*)
            THEN DEST[i+31:i] ← LEFT_ROTATE_DWORDS(SRC1[31:0], imm8)
            ELSE DEST[i+31:i] ← LEFT_ROTATE_DWORDS(SRC1[i+31:i], imm8)
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+31:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VPROLVD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*)
            THEN DEST[i+31:i] ← LEFT_ROTATE_DWORDS(SRC1[i+31:i], SRC2[31:0])
            ELSE DEST[i+31:i] ← LEFT_ROTATE_DWORDS(SRC1[i+31:i], SRC2[i+31:i])
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+31:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VPROLQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC1 *is memory*)
            THEN DEST[i+63:i] ← LEFT_ROTATE_QWORDS(SRC1[63:0], imm8)
            ELSE DEST[i+63:i] ← LEFT_ROTATE_QWORDS(SRC1[i+63:i], imm8)
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VPROLVQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*)
            THEN DEST[i+63:i] ← LEFT_ROTATE_QWORDS(SRC1[i+63:i], SRC2[63:0])
            ELSE DEST[i+63:i] ← LEFT_ROTATE_QWORDS(SRC1[i+63:i], SRC2[i+63:i])
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
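A scalar C sketch of LEFT_ROTATE_DWORDS as defined above: the count is reduced modulo 32 first, and the zero-count case is handled separately because a literal x >> (32 - 0) would be undefined behavior in C (the helper name is illustrative):

#include <stdint.h>
#include <stdio.h>

/* LEFT_ROTATE_DWORDS: rotate x left by count modulo 32. */
static uint32_t left_rotate_dword(uint32_t x, uint32_t count) {
    count %= 32;
    if (count == 0) return x;  /* avoid x >> 32, which is UB in C */
    return (x << count) | (x >> (32 - count));
}

int main(void) {
    printf("0x%08X\n", left_rotate_dword(0x80000001u, 1));  /* 0x00000003 */
    printf("0x%08X\n", left_rotate_dword(0x12345678u, 36)); /* same as count 4: 0x23456781 */
    return 0;
}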
Intel C/C++ Compiler Intrinsic Equivalent
VPROLD __m512i _mm512_rol_epi32(__m512i a, int imm);
VPROLD __m512i _mm512_mask_rol_epi32(__m512i a, __mmask16 k, __m512i b, int imm);
VPROLD __m512i _mm512_maskz_rol_epi32( __mmask16 k, __m512i a, int imm);
VPROLD __m256i _mm256_rol_epi32(__m256i a, int imm);
VPROLD __m256i _mm256_mask_rol_epi32(__m256i a, __mmask8 k, __m256i b, int imm);
VPROLD __m256i _mm256_maskz_rol_epi32( __mmask8 k, __m256i a, int imm);
VPROLD __m128i _mm_rol_epi32(__m128i a, int imm);
VPROLD __m128i _mm_mask_rol_epi32(__m128i a, __mmask8 k, __m128i b, int imm);
VPROLD __m128i _mm_maskz_rol_epi32( __mmask8 k, __m128i a, int imm);
VPROLQ __m512i _mm512_rol_epi64(__m512i a, int imm);
VPROLQ __m512i _mm512_mask_rol_epi64(__m512i a, __mmask8 k, __m512i b, int imm);
VPROLQ __m512i _mm512_maskz_rol_epi64(__mmask8 k, __m512i a, int imm);
VPROLQ __m256i _mm256_rol_epi64(__m256i a, int imm);
VPROLQ __m256i _mm256_mask_rol_epi64(__m256i a, __mmask8 k, __m256i b, int imm);
VPROLQ __m256i _mm256_maskz_rol_epi64( __mmask8 k, __m256i a, int imm);
VPROLQ __m128i _mm_rol_epi64(__m128i a, int imm);
VPROLQ __m128i _mm_mask_rol_epi64(__m128i a, __mmask8 k, __m128i b, int imm);
VPROLQ __m128i _mm_maskz_rol_epi64( __mmask8 k, __m128i a, int imm);
VPROLVD __m512i _mm512_rolv_epi32(__m512i a, __m512i cnt);
VPROLVD __m512i _mm512_mask_rolv_epi32(__m512i a, __mmask16 k, __m512i b, __m512i cnt);
VPROLVD __m512i _mm512_maskz_rolv_epi32(__mmask16 k, __m512i a, __m512i cnt);
VPROLVD __m256i _mm256_rolv_epi32(__m256i a, __m256i cnt);
VPROLVD __m256i _mm256_mask_rolv_epi32(__m256i a, __mmask8 k, __m256i b, __m256i cnt);
VPROLVD __m256i _mm256_maskz_rolv_epi32(__mmask8 k, __m256i a, __m256i cnt);
VPROLVD __m128i _mm_rolv_epi32(__m128i a, __m128i cnt);
VPROLVD __m128i _mm_mask_rolv_epi32(__m128i a, __mmask8 k, __m128i b, __m128i cnt);
VPROLVD __m128i _mm_maskz_rolv_epi32(__mmask8 k, __m128i a, __m128i cnt);
VPROLVQ __m512i _mm512_rolv_epi64(__m512i a, __m512i cnt);
VPROLVQ __m512i _mm512_mask_rolv_epi64(__m512i a, __mmask8 k, __m512i b, __m512i cnt);
VPROLVQ __m512i _mm512_maskz_rolv_epi64( __mmask8 k, __m512i a, __m512i cnt);
VPROLVQ __m256i _mm256_rolv_epi64(__m256i a, __m256i cnt);
VPROLVQ __m256i _mm256_mask_rolv_epi64(__m256i a, __mmask8 k, __m256i b, __m256i cnt);
VPROLVQ __m256i _mm256_maskz_rolv_epi64(__mmask8 k, __m256i a, __m256i cnt);
VPROLVQ __m128i _mm_rolv_epi64(__m128i a, __m128i cnt);
VPROLVQ __m128i _mm_mask_rolv_epi64(__m128i a, __mmask8 k, __m128i b, __m128i cnt);
VPROLVQ __m128i _mm_maskz_rolv_epi64(__mmask8 k, __m128i a, __m128i cnt);

SIMD Floating-Point Exceptions
None

Other Exceptions
EVEX-encoded instruction, see Exceptions Type E4.
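A usage sketch for the immediate and variable forms listed above. It assumes an AVX512F-capable CPU and a compiler option such as -mavx512f; array contents are illustrative:

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t in[16], counts[16], out_imm[16], out_var[16];
    for (int i = 0; i < 16; i++) { in[i] = 0x80000001u; counts[i] = (uint32_t)i; }

    __m512i v = _mm512_loadu_si512(in);
    __m512i c = _mm512_loadu_si512(counts);

    /* VPROLD: rotate every dword left by the same immediate. */
    __m512i r_imm = _mm512_rol_epi32(v, 4);

    /* VPROLVD: rotate each dword left by its own per-element count. */
    __m512i r_var = _mm512_rolv_epi32(v, c);

    _mm512_storeu_si512(out_imm, r_imm);
    _mm512_storeu_si512(out_var, r_var);
    printf("imm form, lane 0: 0x%08X\n", out_imm[0]);  /* 0x00000018 */
    printf("var form, lane 5: 0x%08X\n", out_var[5]);  /* rotate by 5 */
    return 0;
}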
2C PROLD/PROLVD/PROLQ/PROLVQ—Bit Rotate Left INSTRUCTION SET REFERENCE, V-Z PRORD/PRORVD/PRORQ/PRORVQ—Bit Rotate Right Opcode/ Instruction Op / En FV-RVM 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512F EVEX.NDS.128.66.0F38.W0 14 /r VPRORVD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst EVEX.NDD.128.66.0F.W0 72 /0 ib VPRORD xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8 EVEX.NDS.128.66.0F38.W1 14 /r VPRORVQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst EVEX.NDD.128.66.0F.W1 72 /0 ib VPRORQ xmm1 {k1}{z}, xmm2/m128/m64bcst, imm8 EVEX.NDS.256.66.0F38.W0 14 /r VPRORVD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst FV-VMI V/V AVX512VL AVX512F FV-RVM V/V AVX512VL AVX512F FV-VMI V/V AVX512VL AVX512F FV-RVM V/V AVX512VL AVX512F EVEX.NDD.256.66.0F.W0 72 /0 ib VPRORD ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8 EVEX.NDS.256.66.0F38.W1 14 /r VPRORVQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst EVEX.NDD.256.66.0F.W1 72 /0 ib VPRORQ ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8 EVEX.NDS.512.66.0F38.W0 14 /r VPRORVD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst FV-VMI V/V AVX512VL AVX512F FV-RVM V/V AVX512VL AVX512F FV-VMI V/V AVX512VL AVX512F FV-RVM V/V AVX512F EVEX.NDD.512.66.0F.W0 72 /0 ib VPRORD zmm1 {k1}{z}, zmm2/m512/m32bcst, imm8 EVEX.NDS.512.66.0F38.W1 14 /r VPRORVQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst EVEX.NDD.512.66.0F.W1 72 /0 ib VPRORQ zmm1 {k1}{z}, zmm2/m512/m64bcst, imm8 FV-VMI V/V AVX512F FV-RVM V/V AVX512F FV-VMI V/V AVX512F Description Rotate doublewords in xmm2 right by count in the corresponding element of xmm3/m128/m32bcst, store result using writemask k1. Rotate doublewords in xmm2/m128/m32bcst right by imm8, store result using writemask k1. Rotate quadwords in xmm2 right by count in the corresponding element of xmm3/m128/m64bcst, store result using writemask k1. Rotate quadwords in xmm2/m128/m64bcst right by imm8, store result using writemask k1. Rotate doublewords in ymm2 right by count in the corresponding element of ymm3/m256/m32bcst, store using result writemask k1. Rotate doublewords in ymm2/m256/m32bcst right by imm8, store result using writemask k1. Rotate quadwords in ymm2 right by count in the corresponding element of ymm3/m256/m64bcst, store result using writemask k1. Rotate quadwords in ymm2/m256/m64bcst right by imm8, store result using writemask k1. Rotate doublewords in zmm2 right by count in the corresponding element of zmm3/m512/m32bcst, store result using writemask k1. Rotate doublewords in zmm2/m512/m32bcst right by imm8, store result using writemask k1. Rotate quadwords in zmm2 right by count in the corresponding element of zmm3/m512/m64bcst, store result using writemask k1. Rotate quadwords in zmm2/m512/m64bcst right by imm8, store result using writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 FV-VMI VEX.vvvv (w) ModRM:r/m (R) Imm8 NA FV-RVM ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA PRORD/PRORVD/PRORQ/PRORVQ—Bit Rotate Right Vol. 2C 5-435 INSTRUCTION SET REFERENCE, V-Z Description Rotates the bits in the individual data elements (doublewords, or quadword) in the first source operand to the right by the number of bits specified in the count operand. If the value specified by the count operand is greater than 31 (for doublewords), or 63 (for a quadword), then the count operand modulo the data size (32 or 64) is used. EVEX.128 encoded version: The destination operand is a XMM register. The source operand is a XMM register or a memory location (for immediate form). The count operand can come either from an XMM register or a memory location or an 8-bit immediate. 
Bits (MAX_VL-1:128) of the corresponding ZMM register are zeroed.
EVEX.256 encoded version: The destination operand is a YMM register. The source operand is a YMM register or a memory location (for the immediate form). The count operand can come either from an XMM register, a memory location, or an 8-bit immediate. Bits (MAX_VL-1:256) of the corresponding ZMM register are zeroed.
EVEX.512 encoded version: The destination operand is a ZMM register updated according to the writemask. For the count operand in immediate form, the source operand can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 32/64-bit memory location; the count operand is an 8-bit immediate. For the count operand in variable form, the first source operand (the second operand) is a ZMM register and the count operand (the third operand) is a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 32/64-bit memory location.

Operation

RIGHT_ROTATE_DWORDS(SRC, COUNT_SRC)
COUNT ← COUNT_SRC modulo 32;
DEST[31:0] ← (SRC >> COUNT) | (SRC << (32 - COUNT));

RIGHT_ROTATE_QWORDS(SRC, COUNT_SRC)
COUNT ← COUNT_SRC modulo 64;
DEST[63:0] ← (SRC >> COUNT) | (SRC << (64 - COUNT));

VPRORD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC1 *is memory*)
            THEN DEST[i+31:i] ← RIGHT_ROTATE_DWORDS(SRC1[31:0], imm8)
            ELSE DEST[i+31:i] ← RIGHT_ROTATE_DWORDS(SRC1[i+31:i], imm8)
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+31:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VPRORVD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*)
            THEN DEST[i+31:i] ← RIGHT_ROTATE_DWORDS(SRC1[i+31:i], SRC2[31:0])
            ELSE DEST[i+31:i] ← RIGHT_ROTATE_DWORDS(SRC1[i+31:i], SRC2[i+31:i])
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+31:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VPRORQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC1 *is memory*)
            THEN DEST[i+63:i] ← RIGHT_ROTATE_QWORDS(SRC1[63:0], imm8)
            ELSE DEST[i+63:i] ← RIGHT_ROTATE_QWORDS(SRC1[i+63:i], imm8)
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VPRORVQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*)
            THEN DEST[i+63:i] ← RIGHT_ROTATE_QWORDS(SRC1[i+63:i], SRC2[63:0])
            ELSE DEST[i+63:i] ← RIGHT_ROTATE_QWORDS(SRC1[i+63:i], SRC2[i+63:i])
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
2C PRORD/PRORVD/PRORQ/PRORVQ—Bit Rotate Right INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VPRORD __m512i _mm512_ror_epi32(__m512i a, int imm); VPRORD __m512i _mm512_mask_ror_epi32(__m512i a, __mmask16 k, __m512i b, int imm); VPRORD __m512i _mm512_maskz_ror_epi32( __mmask16 k, __m512i a, int imm); VPRORD __m256i _mm256_ror_epi32(__m256i a, int imm); VPRORD __m256i _mm256_mask_ror_epi32(__m256i a, __mmask8 k, __m256i b, int imm); VPRORD __m256i _mm256_maskz_ror_epi32( __mmask8 k, __m256i a, int imm); VPRORD __m128i _mm_ror_epi32(__m128i a, int imm); VPRORD __m128i _mm_mask_ror_epi32(__m128i a, __mmask8 k, __m128i b, int imm); VPRORD __m128i _mm_maskz_ror_epi32( __mmask8 k, __m128i a, int imm); VPRORQ __m512i _mm512_ror_epi64(__m512i a, int imm); VPRORQ __m512i _mm512_mask_ror_epi64(__m512i a, __mmask8 k, __m512i b, int imm); VPRORQ __m512i _mm512_maskz_ror_epi64(__mmask8 k, __m512i a, int imm); VPRORQ __m256i _mm256_ror_epi64(__m256i a, int imm); VPRORQ __m256i _mm256_mask_ror_epi64(__m256i a, __mmask8 k, __m256i b, int imm); VPRORQ __m256i _mm256_maskz_ror_epi64( __mmask8 k, __m256i a, int imm); VPRORQ __m128i _mm_ror_epi64(__m128i a, int imm); VPRORQ __m128i _mm_mask_ror_epi64(__m128i a, __mmask8 k, __m128i b, int imm); VPRORQ __m128i _mm_maskz_ror_epi64( __mmask8 k, __m128i a, int imm); VPRORVD __m512i _mm512_rorv_epi32(__m512i a, __m512i cnt); VPRORVD __m512i _mm512_mask_rorv_epi32(__m512i a, __mmask16 k, __m512i b, __m512i cnt); VPRORVD __m512i _mm512_maskz_rorv_epi32(__mmask16 k, __m512i a, __m512i cnt); VPRORVD __m256i _mm256_rorv_epi32(__m256i a, __m256i cnt); VPRORVD __m256i _mm256_mask_rorv_epi32(__m256i a, __mmask8 k, __m256i b, __m256i cnt); VPRORVD __m256i _mm256_maskz_rorv_epi32(__mmask8 k, __m256i a, __m256i cnt); VPRORVD __m128i _mm_rorv_epi32(__m128i a, __m128i cnt); VPRORVD __m128i _mm_mask_rorv_epi32(__m128i a, __mmask8 k, __m128i b, __m128i cnt); VPRORVD __m128i _mm_maskz_rorv_epi32(__mmask8 k, __m128i a, __m128i cnt); VPRORVQ __m512i _mm512_rorv_epi64(__m512i a, __m512i cnt); VPRORVQ __m512i _mm512_mask_rorv_epi64(__m512i a, __mmask8 k, __m512i b, __m512i cnt); VPRORVQ __m512i _mm512_maskz_rorv_epi64( __mmask8 k, __m512i a, __m512i cnt); VPRORVQ __m256i _mm256_rorv_epi64(__m256i a, __m256i cnt); VPRORVQ __m256i _mm256_mask_rorv_epi64(__m256i a, __mmask8 k, __m256i b, __m256i cnt); VPRORVQ __m256i _mm256_maskz_rorv_epi64(__mmask8 k, __m256i a, __m256i cnt); VPRORVQ __m128i _mm_rorv_epi64(__m128i a, __m128i cnt); VPRORVQ __m128i _mm_mask_rorv_epi64(__m128i a, __mmask8 k, __m128i b, __m128i cnt); VPRORVQ __m128i _mm_maskz_rorv_epi64(__mmask8 k, __m128i a, __m128i cnt); SIMD Floating-Point Exceptions None Other Exceptions EVEX-encoded instruction, see Exceptions Type E4. PRORD/PRORVD/PRORQ/PRORVQ—Bit Rotate Right Vol. 
2C 5-439 INSTRUCTION SET REFERENCE, V-Z VPSCATTERDD/VPSCATTERDQ/VPSCATTERQD/VPSCATTERQQ—Scatter Packed Dword, Packed Qword with Signed Dword, Signed Qword Indices Opcode/ Instruction Op/ En EVEX.128.66.0F38.W0 A0 /vsib VPSCATTERDD vm32x {k1}, xmm1 EVEX.256.66.0F38.W0 A0 /vsib VPSCATTERDD vm32y {k1}, ymm1 EVEX.512.66.0F38.W0 A0 /vsib VPSCATTERDD vm32z {k1}, zmm1 EVEX.128.66.0F38.W1 A0 /vsib VPSCATTERDQ vm32x {k1}, xmm1 EVEX.256.66.0F38.W1 A0 /vsib VPSCATTERDQ vm32x {k1}, ymm1 EVEX.512.66.0F38.W1 A0 /vsib VPSCATTERDQ vm32y {k1}, zmm1 EVEX.128.66.0F38.W0 A1 /vsib VPSCATTERQD vm64x {k1}, xmm1 EVEX.256.66.0F38.W0 A1 /vsib VPSCATTERQD vm64y {k1}, xmm1 EVEX.512.66.0F38.W0 A1 /vsib VPSCATTERQD vm64z {k1}, ymm1 EVEX.128.66.0F38.W1 A1 /vsib VPSCATTERQQ vm64x {k1}, xmm1 EVEX.256.66.0F38.W1 A1 /vsib VPSCATTERQQ vm64y {k1}, ymm1 EVEX.512.66.0F38.W1 A1 /vsib VPSCATTERQQ vm64z {k1}, zmm1 T1S 64/32 bit Mode Support V/V T1S V/V T1S V/V T1S V/V T1S V/V T1S V/V T1S V/V T1S V/V T1S V/V T1S V/V T1S V/V T1S V/V CPUID Feature Flag AVX512VL AVX512F AVX512VL AVX512F AVX512F AVX512VL AVX512F AVX512VL AVX512F AVX512F AVX512VL AVX512F AVX512VL AVX512F AVX512F AVX512VL AVX512F AVX512VL AVX512F AVX512F Description Using signed dword indices, scatter dword values to memory using writemask k1. Using signed dword indices, scatter dword values to memory using writemask k1. Using signed dword indices, scatter dword values to memory using writemask k1. Using signed dword indices, scatter qword values to memory using writemask k1. Using signed dword indices, scatter qword values to memory using writemask k1. Using signed dword indices, scatter qword values to memory using writemask k1. Using signed qword indices, scatter dword values to memory using writemask k1. Using signed qword indices, scatter dword values to memory using writemask k1. Using signed qword indices, scatter dword values to memory using writemask k1. Using signed qword indices, scatter qword values to memory using writemask k1. Using signed qword indices, scatter qword values to memory using writemask k1. Using signed qword indices, scatter qword values to memory using writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 T1S BaseReg (R): VSIB:base, VectorReg(R): VSIB:index ModRM:reg (r) NA NA Description Stores up to 16 elements (8 elements for qword indices) in doubleword vector or 8 elements in quadword vector to the memory locations pointed by base address BASE_ADDR and index vector VINDEX, with scale SCALE. The elements are specified via the VSIB (i.e., the index register is a vector register, holding packed indices). Elements will only be stored if their corresponding mask bit is one. The entire mask register will be set to zero by this instruction unless it triggers an exception. This instruction can be suspended by an exception if at least one element is already scattered (i.e., if the exception is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination register and the mask register are partially updated. If any traps or interrupts are pending from already scattered elements, they will be delivered in lieu of the exception; in this case, EFLAG.RF is set to one so an instruction breakpoint is not re-triggered when the instruction is continued. Note that: • Only writes to overlapping vector indices are guaranteed to be ordered with respect to each other (from LSB to MSB of the source registers). Note that this also include partially overlapping vector indices. 
Writes that are not overlapped may happen in any order. Memory ordering with other instructions follows the Intel-64 memory ordering model. Note that this does not account for non-overlapping indices that map into the same physical address locations.
• If two or more destination indices completely overlap, the “earlier” write(s) may be skipped.
• Elements may be scattered in any order, but faults must be delivered in a right-to-left order; thus, elements to the left of a faulting one may be scattered before the fault is delivered. A given implementation of this instruction is repeatable: given the same input values and architectural state, the same set of elements to the left of the faulting one will be scattered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all elements closer to the LSB of the source ZMM will be completed (and non-faulting). Individual elements closer to the MSB may or may not be completed. If a given element triggers multiple faults, they are delivered in the conventional order.
• Not valid with 16-bit effective addresses. Will deliver a #UD fault.
• If this instruction overwrites itself and then takes a fault, only a subset of elements may be completed before the fault is delivered (as described above). If the fault handler completes and attempts to re-execute this instruction, the new instruction will be executed, and the scatter will not complete.
Note that the presence of the VSIB byte is enforced in this instruction. Hence, the instruction will #UD fault if ModRM.rm is different than 100b.
This instruction has special disp8*N and alignment rules. N is considered to be the size of a single vector element.
The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are ignored.
The instruction will #UD fault if the k0 mask register is specified.
The instruction will #UD fault if EVEX.Z = 1.

Operation
BASE_ADDR stands for the memory operand base address (a GPR); may not exist.
VINDEX stands for the memory operand vector of indices (a ZMM register).
SCALE stands for the memory operand scalar (1, 2, 4 or 8).
DISP is the optional 1, 2 or 4 byte displacement.
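Before the formal pseudocode, a scalar C sketch of VPSCATTERDD in terms of the notation just defined (BASE_ADDR, VINDEX, SCALE, DISP). The helper name, the kl parameter, and the mask representation are illustrative only; it also shows the LSB-to-MSB ordering guarantee for overlapping indices:

#include <stdint.h>
#include <stdio.h>

/* Scalar model of VPSCATTERDD: for each set mask bit, store element j of
   src at base + sign_extend(index[j]) * scale + disp, then clear the bit. */
static void scatter_dd(uint8_t *base, const int32_t *vindex, const int32_t *src,
                       uint32_t *k1, int kl, int scale, int64_t disp) {
    for (int j = 0; j < kl; j++) {
        if (*k1 & (1u << j)) {
            int64_t addr = (int64_t)vindex[j] * scale + disp; /* dword index sign-extended */
            *(int32_t *)(base + addr) = src[j];
            *k1 &= ~(1u << j); /* mask bit cleared as the element completes */
        }
    }
}

int main(void) {
    int32_t table[16] = {0};
    int32_t idx[4] = {3, 0, 3, 7};  /* indices 0 and 2 overlap; element 2 wins (LSB-to-MSB order) */
    int32_t val[4] = {10, 20, 30, 40};
    uint32_t mask = 0xF;
    scatter_dd((uint8_t *)table, idx, val, &mask, 4, (int)sizeof(int32_t), 0);
    printf("table[3] = %d (last overlapping write)\n", table[3]); /* 30 */
    return 0;
}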
VPSCATTERDD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            MEM[BASE_ADDR + SignExtend(VINDEX[i+31:i]) * SCALE + DISP] ← SRC[i+31:i]
            k1[j] ← 0
    FI;
ENDFOR
k1[MAX_KL-1:KL] ← 0

VPSCATTERDQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            MEM[BASE_ADDR + SignExtend(VINDEX[k+31:k]) * SCALE + DISP] ← SRC[i+63:i]
            k1[j] ← 0
    FI;
ENDFOR
k1[MAX_KL-1:KL] ← 0

VPSCATTERQD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    k ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            MEM[BASE_ADDR + (VINDEX[k+63:k]) * SCALE + DISP] ← SRC[i+31:i]
            k1[j] ← 0
    FI;
ENDFOR
k1[MAX_KL-1:KL] ← 0

VPSCATTERQQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            MEM[BASE_ADDR + (VINDEX[i+63:i]) * SCALE + DISP] ← SRC[i+63:i]
            k1[j] ← 0
    FI;
ENDFOR
k1[MAX_KL-1:KL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VPSCATTERDD void _mm512_i32scatter_epi32(void * base, __m512i vdx, __m512i a, int scale);
VPSCATTERDD void _mm256_i32scatter_epi32(void * base, __m256i vdx, __m256i a, int scale);
VPSCATTERDD void _mm_i32scatter_epi32(void * base, __m128i vdx, __m128i a, int scale);
VPSCATTERDD void _mm512_mask_i32scatter_epi32(void * base, __mmask16 k, __m512i vdx, __m512i a, int scale);
VPSCATTERDD void _mm256_mask_i32scatter_epi32(void * base, __mmask8 k, __m256i vdx, __m256i a, int scale);
VPSCATTERDD void _mm_mask_i32scatter_epi32(void * base, __mmask8 k, __m128i vdx, __m128i a, int scale);
VPSCATTERDQ void _mm512_i32scatter_epi64(void * base, __m256i vdx, __m512i a, int scale);
VPSCATTERDQ void _mm256_i32scatter_epi64(void * base, __m128i vdx, __m256i a, int scale);
VPSCATTERDQ void _mm_i32scatter_epi64(void * base, __m128i vdx, __m128i a, int scale);
VPSCATTERDQ void _mm512_mask_i32scatter_epi64(void * base, __mmask8 k, __m256i vdx, __m512i a, int scale);
VPSCATTERDQ void _mm256_mask_i32scatter_epi64(void * base, __mmask8 k, __m128i vdx, __m256i a, int scale);
VPSCATTERDQ void _mm_mask_i32scatter_epi64(void * base, __mmask8 k, __m128i vdx, __m128i a, int scale);
VPSCATTERQD void _mm512_i64scatter_epi32(void * base, __m512i vdx, __m256i a, int scale);
VPSCATTERQD void _mm256_i64scatter_epi32(void * base, __m256i vdx, __m128i a, int scale);
VPSCATTERQD void _mm_i64scatter_epi32(void * base, __m128i vdx, __m128i a, int scale);
VPSCATTERQD void _mm512_mask_i64scatter_epi32(void * base, __mmask8 k, __m512i vdx, __m256i a, int scale);
VPSCATTERQD void _mm256_mask_i64scatter_epi32(void * base, __mmask8 k, __m256i vdx, __m128i a, int scale);
VPSCATTERQD void _mm_mask_i64scatter_epi32(void * base, __mmask8 k, __m128i vdx, __m128i a, int scale);
VPSCATTERQQ void _mm512_i64scatter_epi64(void * base, __m512i vdx, __m512i a, int scale);
VPSCATTERQQ void _mm256_i64scatter_epi64(void * base, __m256i vdx, __m256i a, int scale);
VPSCATTERQQ void _mm_i64scatter_epi64(void * base, __m128i vdx, __m128i a, int scale);
VPSCATTERQQ void _mm512_mask_i64scatter_epi64(void * base, __mmask8 k, __m512i vdx, __m512i a, int scale);
VPSCATTERQQ void _mm256_mask_i64scatter_epi64(void * base, __mmask8 k, __m256i vdx, __m256i a, int scale);
VPSCATTERQQ void
_mm_mask_i64scatter_epi64(void * base, __mmask8 k, __m128i vdx, __m128i a, int scale); SIMD Floating-Point Exceptions None Other Exceptions See Exceptions Type E12. 5-444 Vol. 2C VPSCATTERDD/VPSCATTERDQ/VPSCATTERQD/VPSCATTERQQ—Scatter Packed Dword, Packed Qword with Signed Dword, Signed INSTRUCTION SET REFERENCE, V-Z VPSLLVW/VPSLLVD/VPSLLVQ—Variable Bit Shift Left Logical Opcode/ Instruction Op / En RVM 64/32 bit Mode Support V/V CPUID Feature Flag AVX2 VEX.NDS.128.66.0F38.W0 47 /r VPSLLVD xmm1, xmm2, xmm3/m128 VEX.NDS.128.66.0F38.W1 47 /r VPSLLVQ xmm1, xmm2, xmm3/m128 RVM V/V AVX2 VEX.NDS.256.66.0F38.W0 47 /r VPSLLVD ymm1, ymm2, ymm3/m256 RVM V/V AVX2 VEX.NDS.256.66.0F38.W1 47 /r VPSLLVQ ymm1, ymm2, ymm3/m256 RVM V/V AVX2 EVEX.NDS.128.66.0F38.W1 12 /r VPSLLVW xmm1 {k1}{z}, xmm2, xmm3/m128 EVEX.NDS.256.66.0F38.W1 12 /r VPSLLVW ymm1 {k1}{z}, ymm2, ymm3/m256 EVEX.NDS.512.66.0F38.W1 12 /r VPSLLVW zmm1 {k1}{z}, zmm2, zmm3/m512 EVEX.NDS.128.66.0F38.W0 47 /r VPSLLVD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst EVEX.NDS.256.66.0F38.W0 47 /r VPSLLVD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst EVEX.NDS.512.66.0F38.W0 47 /r VPSLLVD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst EVEX.NDS.128.66.0F38.W1 47 /r VPSLLVQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst EVEX.NDS.256.66.0F38.W1 47 /r VPSLLVQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst EVEX.NDS.512.66.0F38.W1 47 /r VPSLLVQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst FVM V/V AVX512VL AVX512BW FVM V/V AVX512VL AVX512BW FVM V/V AVX512BW FV V/V AVX512VL AVX512F FV V/V AVX512VL AVX512F FV V/V AVX512F FV V/V AVX512VL AVX512F FV V/V AVX512VL AVX512F FV V/V AVX512F Description Shift doublewords in xmm2 left by amount specified in the corresponding element of xmm3/m128 while shifting in 0s. Shift quadwords in xmm2 left by amount specified in the corresponding element of xmm3/m128 while shifting in 0s. Shift doublewords in ymm2 left by amount specified in the corresponding element of ymm3/m256 while shifting in 0s. Shift quadwords in ymm2 left by amount specified in the corresponding element of ymm3/m256 while shifting in 0s. Shift words in xmm2 left by amount specified in the corresponding element of xmm3/m128 while shifting in 0s using writemask k1. Shift words in ymm2 left by amount specified in the corresponding element of ymm3/m256 while shifting in 0s using writemask k1. Shift words in zmm2 left by amount specified in the corresponding element of zmm3/m512 while shifting in 0s using writemask k1. Shift doublewords in xmm2 left by amount specified in the corresponding element of xmm3/m128/m32bcst while shifting in 0s using writemask k1. Shift doublewords in ymm2 left by amount specified in the corresponding element of ymm3/m256/m32bcst while shifting in 0s using writemask k1. Shift doublewords in zmm2 left by amount specified in the corresponding element of zmm3/m512/m32bcst while shifting in 0s using writemask k1. Shift quadwords in xmm2 left by amount specified in the corresponding element of xmm3/m128/m64bcst while shifting in 0s using writemask k1. Shift quadwords in ymm2 left by amount specified in the corresponding element of ymm3/m256/m64bcst while shifting in 0s using writemask k1. Shift quadwords in zmm2 left by amount specified in the corresponding element of zmm3/m512/m64bcst while shifting in 0s using writemask k1. 
Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RVM ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) NA FVM ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA FV ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA VPSLLVW/VPSLLVD/VPSLLVQ—Variable Bit Shift Left Logical Vol. 2C 5-445 INSTRUCTION SET REFERENCE, V-Z Description Shifts the bits in the individual data elements (words, doublewords or quadword) in the first source operand to the left by the count value of respective data elements in the second source operand. As the bits in the data elements are shifted left, the empty low-order bits are cleared (set to 0). The count values are specified individually in each data element of the second source operand. If the unsigned integer value specified in the respective data element of the second source operand is greater than 15 (for word), 31 (for doublewords), or 63 (for a quadword), then the destination data element are written with 0. VEX.128 encoded version: The destination and first source operands are XMM registers. The count operand can be either an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding destination register are zeroed. VEX.256 encoded version: The destination and first source operands are YMM registers. The count operand can be either an YMM register or a 256-bit memory. Bits (MAX_VL-1:256) of the corresponding ZMM register are zeroed. EVEX encoded VPSLLVD/Q: The destination and first source operands are ZMM/YMM/XMM registers. The count operand can be either a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512-bit vector broadcasted from a 32/64-bit memory location. The destination is conditionally updated with writemask k1. EVEX encoded VPSLLVW: The destination and first source operands are ZMM/YMM/XMM registers. The count operand can be either a ZMM/YMM/XMM register, a 512/256/128-bit memory location. The destination is conditionally updated with writemask k1. Operation VPSLLVW (EVEX encoded version) (KL, VL) = (8, 128), (16, 256), (32, 512) FOR j 0 TO KL-1 i j * 16 IF k1[j] OR *no writemask* THEN DEST[i+15:i] ZeroExtend(SRC1[i+15:i] << SRC2[i+15:i]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+15:i] remains unchanged* ELSE ; zeroing-masking DEST[i+15:i] 0 FI FI; ENDFOR; DEST[MAX_VL-1:VL] 0; 5-446 Vol. 
VPSLLVD (VEX.128 version)
COUNT_0 ← SRC2[31 : 0]
(* Repeat Each COUNT_i for the 2nd through 4th dwords of SRC2 *)
COUNT_3 ← SRC2[127 : 96];
IF COUNT_0 < 32 THEN DEST[31:0] ← ZeroExtend(SRC1[31:0] << COUNT_0);
ELSE DEST[31:0] ← 0;
(* Repeat shift operation for 2nd through 4th dwords *)
IF COUNT_3 < 32 THEN DEST[127:96] ← ZeroExtend(SRC1[127:96] << COUNT_3);
ELSE DEST[127:96] ← 0;
DEST[MAX_VL-1:128] ← 0;

VPSLLVD (VEX.256 version)
COUNT_0 ← SRC2[31 : 0];
(* Repeat Each COUNT_i for the 2nd through 7th dwords of SRC2 *)
COUNT_7 ← SRC2[255 : 224];
IF COUNT_0 < 32 THEN DEST[31:0] ← ZeroExtend(SRC1[31:0] << COUNT_0);
ELSE DEST[31:0] ← 0;
(* Repeat shift operation for 2nd through 7th dwords *)
IF COUNT_7 < 32 THEN DEST[255:224] ← ZeroExtend(SRC1[255:224] << COUNT_7);
ELSE DEST[255:224] ← 0;
DEST[MAX_VL-1:256] ← 0;

VPSLLVD (EVEX encoded version)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*)
            THEN DEST[i+31:i] ← ZeroExtend(SRC1[i+31:i] << SRC2[31:0])
            ELSE DEST[i+31:i] ← ZeroExtend(SRC1[i+31:i] << SRC2[i+31:i])
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+31:i] ← 0
        FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0;

VPSLLVQ (VEX.128 version)
COUNT_0 ← SRC2[63 : 0];
COUNT_1 ← SRC2[127 : 64];
IF COUNT_0 < 64 THEN DEST[63:0] ← ZeroExtend(SRC1[63:0] << COUNT_0);
ELSE DEST[63:0] ← 0;
IF COUNT_1 < 64 THEN DEST[127:64] ← ZeroExtend(SRC1[127:64] << COUNT_1);
ELSE DEST[127:64] ← 0;
DEST[MAX_VL-1:128] ← 0;

VPSLLVQ (VEX.256 version)
COUNT_0 ← SRC2[63 : 0];
(* Repeat Each COUNT_i for the 2nd through 4th quadwords of SRC2 *)
COUNT_3 ← SRC2[255 : 192];
IF COUNT_0 < 64 THEN DEST[63:0] ← ZeroExtend(SRC1[63:0] << COUNT_0);
ELSE DEST[63:0] ← 0;
(* Repeat shift operation for 2nd through 4th quadwords *)
IF COUNT_3 < 64 THEN DEST[255:192] ← ZeroExtend(SRC1[255:192] << COUNT_3);
ELSE DEST[255:192] ← 0;
DEST[MAX_VL-1:256] ← 0;

VPSLLVQ (EVEX encoded version)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*)
            THEN DEST[i+63:i] ← ZeroExtend(SRC1[i+63:i] << SRC2[63:0])
            ELSE DEST[i+63:i] ← ZeroExtend(SRC1[i+63:i] << SRC2[i+63:i])
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0;
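A scalar C sketch of the per-element rule in the operations above: the full count element is consulted, and any count of 32 or more yields zero. This differs from the scalar x86 SHL instruction, which masks the count to 5 bits (the helper name is illustrative):

#include <stdint.h>
#include <stdio.h>

/* Per-element rule of VPSLLVD: counts >= 32 produce 0, not a masked shift. */
static uint32_t sllv_dword(uint32_t x, uint32_t count) {
    return (count < 32) ? (x << count) : 0;
}

int main(void) {
    printf("0x%08X\n", sllv_dword(0x00000001u, 4));  /* 0x00000010 */
    printf("0x%08X\n", sllv_dword(0x00000001u, 33)); /* 0x00000000; scalar SHL would use 33 & 31 = 1 */
    return 0;
}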
Intel C/C++ Compiler Intrinsic Equivalent
VPSLLVW __m512i _mm512_sllv_epi16(__m512i a, __m512i cnt);
VPSLLVW __m512i _mm512_mask_sllv_epi16(__m512i s, __mmask32 k, __m512i a, __m512i cnt);
VPSLLVW __m512i _mm512_maskz_sllv_epi16( __mmask32 k, __m512i a, __m512i cnt);
VPSLLVW __m256i _mm256_mask_sllv_epi16(__m256i s, __mmask16 k, __m256i a, __m256i cnt);
VPSLLVW __m256i _mm256_maskz_sllv_epi16( __mmask16 k, __m256i a, __m256i cnt);
VPSLLVW __m128i _mm_mask_sllv_epi16(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSLLVW __m128i _mm_maskz_sllv_epi16( __mmask8 k, __m128i a, __m128i cnt);
VPSLLVD __m512i _mm512_sllv_epi32(__m512i a, __m512i cnt);
VPSLLVD __m512i _mm512_mask_sllv_epi32(__m512i s, __mmask16 k, __m512i a, __m512i cnt);
VPSLLVD __m512i _mm512_maskz_sllv_epi32( __mmask16 k, __m512i a, __m512i cnt);
VPSLLVD __m256i _mm256_mask_sllv_epi32(__m256i s, __mmask8 k, __m256i a, __m256i cnt);
VPSLLVD __m256i _mm256_maskz_sllv_epi32( __mmask8 k, __m256i a, __m256i cnt);
VPSLLVD __m128i _mm_mask_sllv_epi32(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSLLVD __m128i _mm_maskz_sllv_epi32( __mmask8 k, __m128i a, __m128i cnt);
VPSLLVQ __m512i _mm512_sllv_epi64(__m512i a, __m512i cnt);
VPSLLVQ __m512i _mm512_mask_sllv_epi64(__m512i s, __mmask8 k, __m512i a, __m512i cnt);
VPSLLVQ __m512i _mm512_maskz_sllv_epi64( __mmask8 k, __m512i a, __m512i cnt);
VPSLLVQ __m256i _mm256_mask_sllv_epi64(__m256i s, __mmask8 k, __m256i a, __m256i cnt);
VPSLLVQ __m256i _mm256_maskz_sllv_epi64( __mmask8 k, __m256i a, __m256i cnt);
VPSLLVQ __m128i _mm_mask_sllv_epi64(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSLLVQ __m128i _mm_maskz_sllv_epi64( __mmask8 k, __m128i a, __m128i cnt);
VPSLLVD __m256i _mm256_sllv_epi32 (__m256i m, __m256i count)
VPSLLVQ __m256i _mm256_sllv_epi64 (__m256i m, __m256i count)

SIMD Floating-Point Exceptions
None

Other Exceptions
VEX-encoded instructions, see Exceptions Type 4.
EVEX-encoded VPSLLVD/VPSLLVQ, see Exceptions Type E4.
EVEX-encoded VPSLLVW, see Exceptions Type E4.nb.

VPSRAVW/VPSRAVD/VPSRAVQ—Variable Bit Shift Right Arithmetic

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.NDS.128.66.0F38.W0 46 /r VPSRAVD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX2 | Shift doublewords in xmm2 right by amount specified in the corresponding element of xmm3/m128 while shifting in sign bits.
VEX.NDS.256.66.0F38.W0 46 /r VPSRAVD ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Shift doublewords in ymm2 right by amount specified in the corresponding element of ymm3/m256 while shifting in sign bits.
EVEX.NDS.128.66.0F38.W1 11 /r VPSRAVW xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Shift words in xmm2 right by amount specified in the corresponding element of xmm3/m128 while shifting in sign bits using writemask k1.
EVEX.NDS.256.66.0F38.W1 11 /r VPSRAVW ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Shift words in ymm2 right by amount specified in the corresponding element of ymm3/m256 while shifting in sign bits using writemask k1.
EVEX.NDS.512.66.0F38.W1 11 /r VPSRAVW zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Shift words in zmm2 right by amount specified in the corresponding element of zmm3/m512 while shifting in sign bits using writemask k1.
EVEX.NDS.128.66.0F38.W0 46 /r VPSRAVD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Shift doublewords in xmm2 right by amount specified in the corresponding element of xmm3/m128/m32bcst while shifting in sign bits using writemask k1.
EVEX.NDS.256.66.0F38.W0 46 /r VPSRAVD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Shift doublewords in ymm2 right by amount specified in the corresponding element of ymm3/m256/m32bcst while shifting in sign bits using writemask k1.
EVEX.NDS.512.66.0F38.W0 46 /r VPSRAVD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | FV | V/V | AVX512F | Shift doublewords in zmm2 right by amount specified in the corresponding element of zmm3/m512/m32bcst while shifting in sign bits using writemask k1.
EVEX.NDS.128.66.0F38.W1 46 /r VPSRAVQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Shift quadwords in xmm2 right by amount specified in the corresponding element of xmm3/m128/m64bcst while shifting in sign bits using writemask k1.
EVEX.NDS.256.66.0F38.W1 46 /r VPSRAVQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Shift quadwords in ymm2 right by amount specified in the corresponding element of ymm3/m256/m64bcst while shifting in sign bits using writemask k1.
EVEX.NDS.512.66.0F38.W1 46 /r VPSRAVQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | FV | V/V | AVX512F | Shift quadwords in zmm2 right by amount specified in the corresponding element of zmm3/m512/m64bcst while shifting in sign bits using writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description
Shifts the bits in the individual data elements (words, doublewords or quadwords) in the first source operand (the second operand) to the right by the number of bits specified in the count value of respective data elements in the second source operand (the third operand). As the bits in the data elements are shifted right, the empty high-order bits are set to the MSB (sign extension).
The count values are specified individually in each data element of the second source operand. If the unsigned integer value specified in the respective data element of the second source operand is greater than 15 (for words), 31 (for doublewords), or 63 (for a quadword), then the destination data element is filled with the corresponding sign bit of the source element.
VEX.128 encoded version: The destination and first source operands are XMM registers. The count operand can be either an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding destination register are zeroed.
VEX.256 encoded version: The destination and first source operands are YMM registers. The count operand can be either a YMM register or a 256-bit memory location. Bits (MAX_VL-1:256) of the corresponding destination register are zeroed.
EVEX.512/256/128 encoded VPSRAVD/Q: The destination and first source operands are ZMM/YMM/XMM registers.
The count operand can be either a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcast from a 32/64-bit memory location. The destination is conditionally updated with writemask k1.
EVEX.512/256/128 encoded VPSRAVW: The destination and first source operands are ZMM/YMM/XMM registers. The count operand can be either a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination is conditionally updated with writemask k1.

Operation

VPSRAVW (EVEX encoded version)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask* THEN
        COUNT ← SRC2[i+15:i]
        IF COUNT < 16
            THEN DEST[i+15:i] ← SignExtend(SRC1[i+15:i] >> COUNT)
            ELSE
                FOR k ← 0 TO 15
                    DEST[i+k] ← SRC1[i+15]
                ENDFOR;
        FI
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+15:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+15:i] ← 0
        FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0;

VPSRAVD (VEX.128 version)
COUNT_0 ← SRC2[31 : 0]
(* Repeat Each COUNT_i for the 2nd through 4th dwords of SRC2 *)
COUNT_3 ← SRC2[127 : 96];
DEST[31:0] ← SignExtend(SRC1[31:0] >> COUNT_0);
(* Repeat shift operation for 2nd through 4th dwords *)
DEST[127:96] ← SignExtend(SRC1[127:96] >> COUNT_3);
DEST[MAX_VL-1:128] ← 0;

VPSRAVD (VEX.256 version)
COUNT_0 ← SRC2[31 : 0];
(* Repeat Each COUNT_i for the 2nd through 8th dwords of SRC2 *)
COUNT_7 ← SRC2[255 : 224];
DEST[31:0] ← SignExtend(SRC1[31:0] >> COUNT_0);
(* Repeat shift operation for 2nd through 8th dwords *)
DEST[255:224] ← SignExtend(SRC1[255:224] >> COUNT_7);
DEST[MAX_VL-1:256] ← 0;

VPSRAVD (EVEX encoded version)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*)
            THEN
                COUNT ← SRC2[31:0]
                IF COUNT < 32
                    THEN DEST[i+31:i] ← SignExtend(SRC1[i+31:i] >> COUNT)
                    ELSE
                        FOR k ← 0 TO 31
                            DEST[i+k] ← SRC1[i+31]
                        ENDFOR;
                FI
            ELSE
                COUNT ← SRC2[i+31:i]
                IF COUNT < 32
                    THEN DEST[i+31:i] ← SignExtend(SRC1[i+31:i] >> COUNT)
                    ELSE
                        FOR k ← 0 TO 31
                            DEST[i+k] ← SRC1[i+31]
                        ENDFOR;
                FI
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+31:i] ← 0
        FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0;

VPSRAVQ (EVEX encoded version)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*)
            THEN
                COUNT ← SRC2[63:0]
                IF COUNT < 64
                    THEN DEST[i+63:i] ← SignExtend(SRC1[i+63:i] >> COUNT)
                    ELSE
                        FOR k ← 0 TO 63
                            DEST[i+k] ← SRC1[i+63]
                        ENDFOR;
                FI
            ELSE
                COUNT ← SRC2[i+63:i]
                IF COUNT < 64
                    THEN DEST[i+63:i] ← SignExtend(SRC1[i+63:i] >> COUNT)
                    ELSE
                        FOR k ← 0 TO 63
                            DEST[i+k] ← SRC1[i+63]
                        ENDFOR;
                FI
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0;
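A scalar C sketch of the arithmetic-shift rule above: counts at or beyond the element width fill the element with copies of the sign bit, which is the same result as shifting by width minus one (the helper name is illustrative):

#include <stdint.h>
#include <stdio.h>

/* Per-element rule of VPSRAVD: counts >= 32 fill the element with the sign bit.
   Note: >> on a negative signed int is implementation-defined in C; on
   mainstream compilers it is an arithmetic shift, matching this model. */
static int32_t srav_dword(int32_t x, uint32_t count) {
    if (count >= 32) count = 31;  /* saturate: the result is all sign bits either way */
    return x >> count;
}

int main(void) {
    printf("%d\n", srav_dword(-64, 3));    /* -8 */
    printf("%d\n", srav_dword(-64, 100));  /* -1: all sign bits */
    printf("%d\n", srav_dword(64, 100));   /* 0 */
    return 0;
}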
Intel C/C++ Compiler Intrinsic Equivalent
VPSRAVD __m512i _mm512_srav_epi32(__m512i a, __m512i cnt);
VPSRAVD __m512i _mm512_mask_srav_epi32(__m512i s, __mmask16 m, __m512i a, __m512i cnt);
VPSRAVD __m512i _mm512_maskz_srav_epi32(__mmask16 m, __m512i a, __m512i cnt);
VPSRAVD __m256i _mm256_srav_epi32(__m256i a, __m256i cnt);
VPSRAVD __m256i _mm256_mask_srav_epi32(__m256i s, __mmask8 m, __m256i a, __m256i cnt);
VPSRAVD __m256i _mm256_maskz_srav_epi32(__mmask8 m, __m256i a, __m256i cnt);
VPSRAVD __m128i _mm_srav_epi32(__m128i a, __m128i cnt);
VPSRAVD __m128i _mm_mask_srav_epi32(__m128i s, __mmask8 m, __m128i a, __m128i cnt);
VPSRAVD __m128i _mm_maskz_srav_epi32(__mmask8 m, __m128i a, __m128i cnt);
VPSRAVQ __m512i _mm512_srav_epi64(__m512i a, __m512i cnt);
VPSRAVQ __m512i _mm512_mask_srav_epi64(__m512i s, __mmask8 m, __m512i a, __m512i cnt);
VPSRAVQ __m512i _mm512_maskz_srav_epi64( __mmask8 m, __m512i a, __m512i cnt);
VPSRAVQ __m256i _mm256_srav_epi64(__m256i a, __m256i cnt);
VPSRAVQ __m256i _mm256_mask_srav_epi64(__m256i s, __mmask8 m, __m256i a, __m256i cnt);
VPSRAVQ __m256i _mm256_maskz_srav_epi64( __mmask8 m, __m256i a, __m256i cnt);
VPSRAVQ __m128i _mm_srav_epi64(__m128i a, __m128i cnt);
VPSRAVQ __m128i _mm_mask_srav_epi64(__m128i s, __mmask8 m, __m128i a, __m128i cnt);
VPSRAVQ __m128i _mm_maskz_srav_epi64( __mmask8 m, __m128i a, __m128i cnt);
VPSRAVW __m512i _mm512_srav_epi16(__m512i a, __m512i cnt);
VPSRAVW __m512i _mm512_mask_srav_epi16(__m512i s, __mmask32 m, __m512i a, __m512i cnt);
VPSRAVW __m512i _mm512_maskz_srav_epi16(__mmask32 m, __m512i a, __m512i cnt);
VPSRAVW __m256i _mm256_srav_epi16(__m256i a, __m256i cnt);
VPSRAVW __m256i _mm256_mask_srav_epi16(__m256i s, __mmask16 m, __m256i a, __m256i cnt);
VPSRAVW __m256i _mm256_maskz_srav_epi16(__mmask16 m, __m256i a, __m256i cnt);
VPSRAVW __m128i _mm_srav_epi16(__m128i a, __m128i cnt);
VPSRAVW __m128i _mm_mask_srav_epi16(__m128i s, __mmask8 m, __m128i a, __m128i cnt);
VPSRAVW __m128i _mm_maskz_srav_epi16(__mmask8 m, __m128i a, __m128i cnt);
VPSRAVD __m256i _mm256_srav_epi32 (__m256i m, __m256i count)

SIMD Floating-Point Exceptions
None

Other Exceptions
Non-EVEX-encoded instruction, see Exceptions Type 4.
EVEX-encoded instruction, see Exceptions Type E4.
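A usage sketch for the AVX2 form listed above (compile with an option such as -mavx2; values are illustrative and chosen to exercise the out-of-range count rule):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int32_t x[8] = {-64, -64, 64, 64, -1, -1, 8, 8};
    int32_t c[8] = {3, 100, 3, 100, 0, 31, 1, 2};
    int32_t out[8];

    __m256i v   = _mm256_loadu_si256((const __m256i *)x);
    __m256i cnt = _mm256_loadu_si256((const __m256i *)c);

    /* VPSRAVD: per-element arithmetic right shift. */
    __m256i r = _mm256_srav_epi32(v, cnt);
    _mm256_storeu_si256((__m256i *)out, r);

    for (int i = 0; i < 8; i++) printf("%d ", out[i]); /* -8 -1 8 0 -1 -1 4 2 */
    printf("\n");
    return 0;
}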
2C VPSRAVW/VPSRAVD/VPSRAVQ—Variable Bit Shift Right Arithmetic INSTRUCTION SET REFERENCE, V-Z VPSRLVW/VPSRLVD/VPSRLVQ—Variable Bit Shift Right Logical Opcode/ Instruction Op / En RVM 64/32 bit Mode Support V/V CPUID Feature Flag AVX2 VEX.NDS.128.66.0F38.W0 45 /r VPSRLVD xmm1, xmm2, xmm3/m128 VEX.NDS.128.66.0F38.W1 45 /r VPSRLVQ xmm1, xmm2, xmm3/m128 RVM V/V AVX2 VEX.NDS.256.66.0F38.W0 45 /r VPSRLVD ymm1, ymm2, ymm3/m256 RVM V/V AVX2 VEX.NDS.256.66.0F38.W1 45 /r VPSRLVQ ymm1, ymm2, ymm3/m256 RVM V/V AVX2 EVEX.NDS.128.66.0F38.W1 10 /r VPSRLVW xmm1 {k1}{z}, xmm2, xmm3/m128 EVEX.NDS.256.66.0F38.W1 10 /r VPSRLVW ymm1 {k1}{z}, ymm2, ymm3/m256 EVEX.NDS.512.66.0F38.W1 10 /r VPSRLVW zmm1 {k1}{z}, zmm2, zmm3/m512 EVEX.NDS.128.66.0F38.W0 45 /r VPSRLVD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst EVEX.NDS.256.66.0F38.W0 45 /r VPSRLVD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst EVEX.NDS.512.66.0F38.W0 45 /r VPSRLVD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst EVEX.NDS.128.66.0F38.W1 45 /r VPSRLVQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst EVEX.NDS.256.66.0F38.W1 45 /r VPSRLVQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst EVEX.NDS.512.66.0F38.W1 45 /r VPSRLVQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst FVM V/V AVX512VL AVX512BW FVM V/V AVX512VL AVX512BW FVM V/V AVX512BW FV V/V AVX512VL AVX512F FV V/V AVX512VL AVX512F FV V/V AVX512F FV V/V AVX512VL AVX512F FV V/V AVX512VL AVX512F FV V/V AVX512F Description Shift doublewords in xmm2 right by amount specified in the corresponding element of xmm3/m128 while shifting in 0s. Shift quadwords in xmm2 right by amount specified in the corresponding element of xmm3/m128 while shifting in 0s. Shift doublewords in ymm2 right by amount specified in the corresponding element of ymm3/m256 while shifting in 0s. Shift quadwords in ymm2 right by amount specified in the corresponding element of ymm3/m256 while shifting in 0s. Shift words in xmm2 right by amount specified in the corresponding element of xmm3/m128 while shifting in 0s using writemask k1. Shift words in ymm2 right by amount specified in the corresponding element of ymm3/m256 while shifting in 0s using writemask k1. Shift words in zmm2 right by amount specified in the corresponding element of zmm3/m512 while shifting in 0s using writemask k1. Shift doublewords in xmm2 right by amount specified in the corresponding element of xmm3/m128/m32bcst while shifting in 0s using writemask k1. Shift doublewords in ymm2 right by amount specified in the corresponding element of ymm3/m256/m32bcst while shifting in 0s using writemask k1. Shift doublewords in zmm2 right by amount specified in the corresponding element of zmm3/m512/m32bcst while shifting in 0s using writemask k1. Shift quadwords in xmm2 right by amount specified in the corresponding element of xmm3/m128/m64bcst while shifting in 0s using writemask k1. Shift quadwords in ymm2 right by amount specified in the corresponding element of ymm3/m256/m64bcst while shifting in 0s using writemask k1. Shift quadwords in zmm2 right by amount specified in the corresponding element of zmm3/m512/m64bcst while shifting in 0s using writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 RVM ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) NA FVM ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA FV ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA VPSRLVW/VPSRLVD/VPSRLVQ—Variable Bit Shift Right Logical Vol. 
2C 5-455 INSTRUCTION SET REFERENCE, V-Z Description Shifts the bits in the individual data elements (words, doublewords or quadword) in the first source operand to the right by the count value of respective data elements in the second source operand. As the bits in the data elements are shifted right, the empty high-order bits are cleared (set to 0). The count values are specified individually in each data element of the second source operand. If the unsigned integer value specified in the respective data element of the second source operand is greater than 15 (for word), 31 (for doublewords), or 63 (for a quadword), then the destination data element are written with 0. VEX.128 encoded version: The destination and first source operands are XMM registers. The count operand can be either an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding destination register are zeroed. VEX.256 encoded version: The destination and first source operands are YMM registers. The count operand can be either an YMM register or a 256-bit memory. Bits (MAX_VL-1:256) of the corresponding ZMM register are zeroed. EVEX encoded VPSRLVD/Q: The destination and first source operands are ZMM/YMM/XMM registers. The count operand can be either a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512-bit vector broadcasted from a 32/64-bit memory location. The destination is conditionally updated with writemask k1. EVEX encoded VPSRLVW: The destination and first source operands are ZMM/YMM/XMM registers. The count operand can be either a ZMM/YMM/XMM register, a 512/256/128-bit memory location. The destination is conditionally updated with writemask k1. Operation VPSRLVW (EVEX encoded version) (KL, VL) = (8, 128), (16, 256), (32, 512) FOR j 0 TO KL-1 i j * 16 IF k1[j] OR *no writemask* THEN DEST[i+15:i] ZeroExtend(SRC1[i+15:i] >> SRC2[i+15:i]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+15:i] remains unchanged* ELSE ; zeroing-masking DEST[i+15:i] 0 FI FI; ENDFOR; DEST[MAX_VL-1:VL] 0; 5-456 Vol. 2C VPSRLVW/VPSRLVD/VPSRLVQ—Variable Bit Shift Right Logical INSTRUCTION SET REFERENCE, V-Z VPSRLVD (VEX.128 version) COUNT_0 SRC2[31 : 0] (* Repeat Each COUNT_i for the 2nd through 4th dwords of SRC2*) COUNT_3 SRC2[127 : 96]; IF COUNT_0 < 32 THEN DEST[31:0] ZeroExtend(SRC1[31:0] >> COUNT_0); ELSE DEST[31:0] 0; (* Repeat shift operation for 2nd through 4th dwords *) IF COUNT_3 < 32 THEN DEST[127:96] ZeroExtend(SRC1[127:96] >> COUNT_3); ELSE DEST[127:96] 0; DEST[MAX_VL-1:128] 0; VPSRLVD (VEX.256 version) COUNT_0 SRC2[31 : 0]; (* Repeat Each COUNT_i for the 2nd through 7th dwords of SRC2*) COUNT_7 SRC2[255 : 224]; IF COUNT_0 < 32 THEN DEST[31:0] ZeroExtend(SRC1[31:0] >> COUNT_0); ELSE DEST[31:0] 0; (* Repeat shift operation for 2nd through 7th dwords *) IF COUNT_7 < 32 THEN DEST[255:224] ZeroExtend(SRC1[255:224] >> COUNT_7); ELSE DEST[255:224] 0; DEST[MAX_VL-1:256] 0; VPSRLVD (EVEX encoded version) (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN DEST[i+31:i] ZeroExtend(SRC1[i+31:i] >> SRC2[31:0]) ELSE DEST[i+31:i] ZeroExtend(SRC1[i+31:i] >> SRC2[i+31:i]) FI; ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI FI; ENDFOR; DEST[MAX_VL-1:VL] 0; VPSRLVW/VPSRLVD/VPSRLVQ—Variable Bit Shift Right Logical Vol. 
VPSRLVQ (VEX.128 version)
COUNT_0 ← SRC2[63 : 0];
COUNT_1 ← SRC2[127 : 64];
IF COUNT_0 < 64
    THEN DEST[63:0] ← ZeroExtend(SRC1[63:0] >> COUNT_0);
    ELSE DEST[63:0] ← 0;
IF COUNT_1 < 64
    THEN DEST[127:64] ← ZeroExtend(SRC1[127:64] >> COUNT_1);
    ELSE DEST[127:64] ← 0;
DEST[MAX_VL-1:128] ← 0;

VPSRLVQ (VEX.256 version)
COUNT_0 ← SRC2[63 : 0];
(* Repeat Each COUNT_i for the 2nd through 4th qwords of SRC2 *)
COUNT_3 ← SRC2[255 : 192];
IF COUNT_0 < 64
    THEN DEST[63:0] ← ZeroExtend(SRC1[63:0] >> COUNT_0);
    ELSE DEST[63:0] ← 0;
(* Repeat shift operation for 2nd through 4th qwords *)
IF COUNT_3 < 64
    THEN DEST[255:192] ← ZeroExtend(SRC1[255:192] >> COUNT_3);
    ELSE DEST[255:192] ← 0;
DEST[MAX_VL-1:256] ← 0;

VPSRLVQ (EVEX encoded version)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[i+63:i] ← ZeroExtend(SRC1[i+63:i] >> SRC2[63:0])
                ELSE DEST[i+63:i] ← ZeroExtend(SRC1[i+63:i] >> SRC2[i+63:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0;

Intel C/C++ Compiler Intrinsic Equivalent
VPSRLVW __m512i _mm512_srlv_epi16(__m512i a, __m512i cnt);
VPSRLVW __m512i _mm512_mask_srlv_epi16(__m512i s, __mmask32 k, __m512i a, __m512i cnt);
VPSRLVW __m512i _mm512_maskz_srlv_epi16(__mmask32 k, __m512i a, __m512i cnt);
VPSRLVW __m256i _mm256_mask_srlv_epi16(__m256i s, __mmask16 k, __m256i a, __m256i cnt);
VPSRLVW __m256i _mm256_maskz_srlv_epi16(__mmask16 k, __m256i a, __m256i cnt);
VPSRLVW __m128i _mm_mask_srlv_epi16(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSRLVW __m128i _mm_maskz_srlv_epi16(__mmask8 k, __m128i a, __m128i cnt);
VPSRLVD __m256i _mm256_srlv_epi32(__m256i m, __m256i count);
VPSRLVD __m512i _mm512_srlv_epi32(__m512i a, __m512i cnt);
VPSRLVD __m512i _mm512_mask_srlv_epi32(__m512i s, __mmask16 k, __m512i a, __m512i cnt);
VPSRLVD __m512i _mm512_maskz_srlv_epi32(__mmask16 k, __m512i a, __m512i cnt);
VPSRLVD __m256i _mm256_mask_srlv_epi32(__m256i s, __mmask8 k, __m256i a, __m256i cnt);
VPSRLVD __m256i _mm256_maskz_srlv_epi32(__mmask8 k, __m256i a, __m256i cnt);
VPSRLVD __m128i _mm_mask_srlv_epi32(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSRLVD __m128i _mm_maskz_srlv_epi32(__mmask8 k, __m128i a, __m128i cnt);
VPSRLVQ __m512i _mm512_srlv_epi64(__m512i a, __m512i cnt);
VPSRLVQ __m512i _mm512_mask_srlv_epi64(__m512i s, __mmask8 k, __m512i a, __m512i cnt);
VPSRLVQ __m512i _mm512_maskz_srlv_epi64(__mmask8 k, __m512i a, __m512i cnt);
VPSRLVQ __m256i _mm256_mask_srlv_epi64(__m256i s, __mmask8 k, __m256i a, __m256i cnt);
VPSRLVQ __m256i _mm256_maskz_srlv_epi64(__mmask8 k, __m256i a, __m256i cnt);
VPSRLVQ __m128i _mm_mask_srlv_epi64(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSRLVQ __m128i _mm_maskz_srlv_epi64(__mmask8 k, __m128i a, __m128i cnt);
VPSRLVQ __m256i _mm256_srlv_epi64(__m256i m, __m256i count);
VPSRLVD __m128i _mm_srlv_epi32(__m128i a, __m128i cnt);
VPSRLVQ __m128i _mm_srlv_epi64(__m128i a, __m128i cnt);

SIMD Floating-Point Exceptions
None

Other Exceptions
VEX-encoded instructions, see Exceptions Type 4.
EVEX-encoded VPSRLVD/Q, see Exceptions Type E4.
EVEX-encoded VPSRLVW, see Exceptions Type E4.nb.
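For illustration, a minimal C sketch of the VEX-encoded form using the AVX2 intrinsic (the variable names and the printout are ours, not part of the instruction definition; compile with AVX2 enabled, e.g. gcc -mavx2):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Each doubleword of a is shifted right by the corresponding doubleword
       of counts; counts greater than 31 yield 0. */
    __m256i a      = _mm256_set_epi32(0x80000000, 0x80000000, 0xFF, 0xFF, 16, 16, 1, 1);
    __m256i counts = _mm256_set_epi32(31, 33, 4, 0, 2, 1, 1, 0);
    __m256i r      = _mm256_srlv_epi32(a, counts);

    unsigned int out[8];
    _mm256_storeu_si256((__m256i *)out, r);
    for (int i = 0; i < 8; i++)
        printf("element %d: 0x%08X\n", i, out[i]);
    return 0;
}

Note that the element shifted by 33 is written with 0, while the element shifted by 31 becomes 1, matching the count-saturation rule above.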
VPTERNLOGD/VPTERNLOGQ—Bitwise Ternary Logic

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.DDS.128.66.0F3A.W0 25 /r ib VPTERNLOGD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst, imm8 | FV | V/V | AVX512VL AVX512F | Bitwise ternary logic taking xmm1, xmm2 and xmm3/m128/m32bcst as source operands and writing the result to xmm1 under writemask k1 with dword granularity. The immediate value determines the specific Boolean function being implemented.
EVEX.DDS.256.66.0F3A.W0 25 /r ib VPTERNLOGD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst, imm8 | FV | V/V | AVX512VL AVX512F | Bitwise ternary logic taking ymm1, ymm2 and ymm3/m256/m32bcst as source operands and writing the result to ymm1 under writemask k1 with dword granularity. The immediate value determines the specific Boolean function being implemented.
EVEX.DDS.512.66.0F3A.W0 25 /r ib VPTERNLOGD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst, imm8 | FV | V/V | AVX512F | Bitwise ternary logic taking zmm1, zmm2 and zmm3/m512/m32bcst as source operands and writing the result to zmm1 under writemask k1 with dword granularity. The immediate value determines the specific Boolean function being implemented.
EVEX.DDS.128.66.0F3A.W1 25 /r ib VPTERNLOGQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst, imm8 | FV | V/V | AVX512VL AVX512F | Bitwise ternary logic taking xmm1, xmm2 and xmm3/m128/m64bcst as source operands and writing the result to xmm1 under writemask k1 with qword granularity. The immediate value determines the specific Boolean function being implemented.
EVEX.DDS.256.66.0F3A.W1 25 /r ib VPTERNLOGQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst, imm8 | FV | V/V | AVX512VL AVX512F | Bitwise ternary logic taking ymm1, ymm2 and ymm3/m256/m64bcst as source operands and writing the result to ymm1 under writemask k1 with qword granularity. The immediate value determines the specific Boolean function being implemented.
EVEX.DDS.512.66.0F3A.W1 25 /r ib VPTERNLOGQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst, imm8 | FV | V/V | AVX512F | Bitwise ternary logic taking zmm1, zmm2 and zmm3/m512/m64bcst as source operands and writing the result to zmm1 under writemask k1 with qword granularity. The immediate value determines the specific Boolean function being implemented.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (r, w) | EVEX.vvvv (r) | ModRM:r/m (r) | Imm8

Description
VPTERNLOGD/Q takes three bit vectors of 512-bit length (in the first, second and third operand) as input data to form a set of 512 indices; each index is comprised of one bit from each input vector. The imm8 byte specifies a Boolean logic table producing a binary value for each 3-bit index value. The final 512-bit Boolean result is written to the destination operand (the first operand) using the writemask k1 with doubleword-element or quadword-element granularity.
The destination operand is a ZMM (EVEX.512)/YMM (EVEX.256)/XMM (EVEX.128) register. The first source operand is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. The destination is conditionally updated with writemask k1.
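As a concrete illustration of how an imm8 value is derived (a hedged sketch; the helper function name below is ours, and AVX512F support is assumed): bit k of imm8 holds the desired function value at index k = (operand-1 bit << 2) | (operand-2 bit << 1) | (operand-3 bit). Evaluating f(a, b, c) = (a AND b) OR c at indices 7 down to 0 gives 11101010b = 0xEA:

#include <immintrin.h>

/* r = (a & b) | c, bit for bit, in a single VPTERNLOGD. */
__m512i and_or_ternlog(__m512i a, __m512i b, __m512i c) {
    return _mm512_ternarylogic_epi32(a, b, c, 0xEA);
}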
Table 5-11 shows two examples of Boolean functions specified by the immediate values 0xE2 and 0xE4, with the lookup result listed in the result columns following the three columns containing all possible values of the 3-bit index.

Table 5-11. Examples of VPTERNLOGD/Q Imm8 Boolean Function and Input Index Values
(VPTERNLOGD reg1, reg2, src3, 0xE2 and VPTERNLOGD reg1, reg2, src3, 0xE4)
Bit(reg1) | Bit(reg2) | Bit(src3) | Bit Result with Imm8=0xE2 | Bit Result with Imm8=0xE4
0 | 0 | 0 | 0 | 0
0 | 0 | 1 | 1 | 0
0 | 1 | 0 | 0 | 1
0 | 1 | 1 | 0 | 0
1 | 0 | 0 | 0 | 0
1 | 0 | 1 | 1 | 1
1 | 1 | 0 | 1 | 1
1 | 1 | 1 | 1 | 1

Specifying different values in imm8 allows any arbitrary three-input Boolean function to be implemented in software using VPTERNLOGD/Q. Table 5-1 and Table 5-2 provide a mapping of all 256 possible imm8 values to various Boolean expressions.

Operation
VPTERNLOGD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            FOR k ← 0 TO 31
                IF (EVEX.b = 1) AND (SRC2 *is memory*)
                    THEN DEST[j][k] ← imm[(DEST[i+k] << 2) + (SRC1[i+k] << 1) + SRC2[k]]
                    ELSE DEST[j][k] ← imm[(DEST[i+k] << 2) + (SRC1[i+k] << 1) + SRC2[i+k]]
                FI; ; table lookup of immediate below
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[31+i:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[31+i:i] ← 0
            FI;
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0

VPTERNLOGQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            FOR k ← 0 TO 63
                IF (EVEX.b = 1) AND (SRC2 *is memory*)
                    THEN DEST[j][k] ← imm[(DEST[i+k] << 2) + (SRC1[i+k] << 1) + SRC2[k]]
                    ELSE DEST[j][k] ← imm[(DEST[i+k] << 2) + (SRC1[i+k] << 1) + SRC2[i+k]]
                FI; ; table lookup of immediate below
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[63+i:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[63+i:i] ← 0
            FI;
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalents
VPTERNLOGD __m512i _mm512_ternarylogic_epi32(__m512i a, __m512i b, __m512i c, int imm);
VPTERNLOGD __m512i _mm512_mask_ternarylogic_epi32(__m512i s, __mmask16 m, __m512i a, __m512i b, int imm);
VPTERNLOGD __m512i _mm512_maskz_ternarylogic_epi32(__mmask16 m, __m512i a, __m512i b, __m512i c, int imm);
VPTERNLOGD __m256i _mm256_ternarylogic_epi32(__m256i a, __m256i b, __m256i c, int imm);
VPTERNLOGD __m256i _mm256_mask_ternarylogic_epi32(__m256i s, __mmask8 m, __m256i a, __m256i b, int imm);
VPTERNLOGD __m256i _mm256_maskz_ternarylogic_epi32(__mmask8 m, __m256i a, __m256i b, __m256i c, int imm);
VPTERNLOGD __m128i _mm_ternarylogic_epi32(__m128i a, __m128i b, __m128i c, int imm);
VPTERNLOGD __m128i _mm_mask_ternarylogic_epi32(__m128i s, __mmask8 m, __m128i a, __m128i b, int imm);
VPTERNLOGD __m128i _mm_maskz_ternarylogic_epi32(__mmask8 m, __m128i a, __m128i b, __m128i c, int imm);
VPTERNLOGQ __m512i _mm512_ternarylogic_epi64(__m512i a, __m512i b, __m512i c, int imm);
VPTERNLOGQ __m512i _mm512_mask_ternarylogic_epi64(__m512i s, __mmask8 m, __m512i a, __m512i b, int imm);
VPTERNLOGQ __m512i _mm512_maskz_ternarylogic_epi64(__mmask8 m, __m512i a, __m512i b, __m512i c, int imm);
VPTERNLOGQ __m256i _mm256_ternarylogic_epi64(__m256i a, __m256i b, __m256i c, int imm);
VPTERNLOGQ __m256i _mm256_mask_ternarylogic_epi64(__m256i s, __mmask8 m, __m256i a, __m256i b, int imm);
VPTERNLOGQ __m256i _mm256_maskz_ternarylogic_epi64(__mmask8 m, __m256i a, __m256i b, __m256i c, int imm);
VPTERNLOGQ __m128i _mm_ternarylogic_epi64(__m128i a, __m128i b, __m128i c, int imm);
VPTERNLOGQ __m128i _mm_mask_ternarylogic_epi64(__m128i s, __mmask8 m, __m128i a, __m128i b, int imm);
VPTERNLOGQ __m128i _mm_maskz_ternarylogic_epi64(__mmask8 m, __m128i a, __m128i b, __m128i c, int imm);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type E4.

VPTESTMB/VPTESTMW/VPTESTMD/VPTESTMQ—Logical AND and Set Mask

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.66.0F38.W0 26 /r VPTESTMB k2 {k1}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Bitwise AND of packed byte integers in xmm2 and xmm3/m128 and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.256.66.0F38.W0 26 /r VPTESTMB k2 {k1}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Bitwise AND of packed byte integers in ymm2 and ymm3/m256 and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.512.66.0F38.W0 26 /r VPTESTMB k2 {k1}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Bitwise AND of packed byte integers in zmm2 and zmm3/m512 and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.128.66.0F38.W1 26 /r VPTESTMW k2 {k1}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Bitwise AND of packed word integers in xmm2 and xmm3/m128 and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.256.66.0F38.W1 26 /r VPTESTMW k2 {k1}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Bitwise AND of packed word integers in ymm2 and ymm3/m256 and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.512.66.0F38.W1 26 /r VPTESTMW k2 {k1}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Bitwise AND of packed word integers in zmm2 and zmm3/m512 and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.128.66.0F38.W0 27 /r VPTESTMD k2 {k1}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Bitwise AND of packed doubleword integers in xmm2 and xmm3/m128/m32bcst and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.256.66.0F38.W0 27 /r VPTESTMD k2 {k1}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Bitwise AND of packed doubleword integers in ymm2 and ymm3/m256/m32bcst and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.512.66.0F38.W0 27 /r VPTESTMD k2 {k1}, zmm2, zmm3/m512/m32bcst | FV | V/V | AVX512F | Bitwise AND of packed doubleword integers in zmm2 and zmm3/m512/m32bcst and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.128.66.0F38.W1 27 /r VPTESTMQ k2 {k1}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Bitwise AND of packed quadword integers in xmm2 and xmm3/m128/m64bcst and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.256.66.0F38.W1 27 /r VPTESTMQ k2 {k1}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Bitwise AND of packed quadword integers in ymm2 and ymm3/m256/m64bcst and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.512.66.0F38.W1 27 /r VPTESTMQ k2 {k1}, zmm2, zmm3/m512/m64bcst | FV | V/V | AVX512F | Bitwise AND of packed quadword integers in zmm2 and zmm3/m512/m64bcst and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description
Performs a bitwise logical AND operation on the first source operand (the second operand) and the second source operand (the third operand) and stores the result in the destination operand (the first operand) under the writemask. Each bit of the result is set to 1 if the bitwise AND of the corresponding elements of the first and second source operands is non-zero; otherwise it is set to 0.
VPTESTMD/VPTESTMQ: The first source operand is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. The destination operand is a mask register updated under the writemask.
VPTESTMB/VPTESTMW: The first source operand is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a mask register updated under the writemask.

Operation
VPTESTMB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask*
        THEN DEST[j] ← (SRC1[i+7:i] BITWISE AND SRC2[i+7:i] != 0) ? 1 : 0;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0

VPTESTMW (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[j] ← (SRC1[i+15:i] BITWISE AND SRC2[i+15:i] != 0) ? 1 : 0;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0

VPTESTMD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[j] ← (SRC1[i+31:i] BITWISE AND SRC2[31:0] != 0) ? 1 : 0;
                ELSE DEST[j] ← (SRC1[i+31:i] BITWISE AND SRC2[i+31:i] != 0) ? 1 : 0;
            FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0

VPTESTMQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[j] ← (SRC1[i+63:i] BITWISE AND SRC2[63:0] != 0) ? 1 : 0;
                ELSE DEST[j] ← (SRC1[i+63:i] BITWISE AND SRC2[i+63:i] != 0) ? 1 : 0;
            FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0

Intel C/C++ Compiler Intrinsic Equivalents
VPTESTMB __mmask64 _mm512_test_epi8_mask(__m512i a, __m512i b);
VPTESTMB __mmask64 _mm512_mask_test_epi8_mask(__mmask64, __m512i a, __m512i b);
VPTESTMW __mmask32 _mm512_test_epi16_mask(__m512i a, __m512i b);
VPTESTMW __mmask32 _mm512_mask_test_epi16_mask(__mmask32, __m512i a, __m512i b);
VPTESTMD __mmask16 _mm512_test_epi32_mask(__m512i a, __m512i b);
VPTESTMD __mmask16 _mm512_mask_test_epi32_mask(__mmask16, __m512i a, __m512i b);
VPTESTMQ __mmask8 _mm512_test_epi64_mask(__m512i a, __m512i b);
VPTESTMQ __mmask8 _mm512_mask_test_epi64_mask(__mmask8, __m512i a, __m512i b);

SIMD Floating-Point Exceptions
None

Other Exceptions
VPTESTMD/Q: See Exceptions Type E4.
VPTESTMB/W: See Exceptions Type E4.nb.
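A common use is producing a mask of the non-zero elements of a vector by testing it against itself; a minimal C sketch assuming AVX512F (the function name is ours):

#include <immintrin.h>

/* Mask bit j is 1 iff dword element j of v is non-zero ((v AND v) != 0). */
__mmask16 nonzero_lanes(__m512i v) {
    return _mm512_test_epi32_mask(v, v);
}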
VPTESTNMB/W/D/Q—Logical NAND and Set

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.F3.0F38.W0 26 /r VPTESTNMB k2 {k1}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Bitwise NAND of packed byte integers in xmm2 and xmm3/m128 and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.256.F3.0F38.W0 26 /r VPTESTNMB k2 {k1}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Bitwise NAND of packed byte integers in ymm2 and ymm3/m256 and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.512.F3.0F38.W0 26 /r VPTESTNMB k2 {k1}, zmm2, zmm3/m512 | FVM | V/V | AVX512F AVX512BW | Bitwise NAND of packed byte integers in zmm2 and zmm3/m512 and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.128.F3.0F38.W1 26 /r VPTESTNMW k2 {k1}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Bitwise NAND of packed word integers in xmm2 and xmm3/m128 and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.256.F3.0F38.W1 26 /r VPTESTNMW k2 {k1}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Bitwise NAND of packed word integers in ymm2 and ymm3/m256 and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.512.F3.0F38.W1 26 /r VPTESTNMW k2 {k1}, zmm2, zmm3/m512 | FVM | V/V | AVX512F AVX512BW | Bitwise NAND of packed word integers in zmm2 and zmm3/m512 and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.128.F3.0F38.W0 27 /r VPTESTNMD k2 {k1}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Bitwise NAND of packed doubleword integers in xmm2 and xmm3/m128/m32bcst and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.256.F3.0F38.W0 27 /r VPTESTNMD k2 {k1}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Bitwise NAND of packed doubleword integers in ymm2 and ymm3/m256/m32bcst and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.512.F3.0F38.W0 27 /r VPTESTNMD k2 {k1}, zmm2, zmm3/m512/m32bcst | FV | V/V | AVX512F | Bitwise NAND of packed doubleword integers in zmm2 and zmm3/m512/m32bcst and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.128.F3.0F38.W1 27 /r VPTESTNMQ k2 {k1}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Bitwise NAND of packed quadword integers in xmm2 and xmm3/m128/m64bcst and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.256.F3.0F38.W1 27 /r VPTESTNMQ k2 {k1}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Bitwise NAND of packed quadword integers in ymm2 and ymm3/m256/m64bcst and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
EVEX.NDS.512.F3.0F38.W1 27 /r VPTESTNMQ k2 {k1}, zmm2, zmm3/m512/m64bcst | FV | V/V | AVX512F | Bitwise NAND of packed quadword integers in zmm2 and zmm3/m512/m64bcst and set mask k2 to reflect the zero/non-zero status of each element of the result, under writemask k1.
Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description
Performs a bitwise logical NAND operation on the byte/word/doubleword/quadword elements of the first source operand (the second operand) with the corresponding elements of the second source operand (the third operand) and stores the logical comparison result into each bit of the destination operand (the first operand) according to the writemask k1. Each bit of the result is set to 1 if the bitwise AND of the corresponding elements of the first and second source operands is zero; otherwise it is set to 0.
EVEX encoded VPTESTNMD/Q: The first source operand is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. The destination is updated according to the writemask.
EVEX encoded VPTESTNMB/W: The first source operand is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination is updated according to the writemask.

Operation
VPTESTNMB
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j*8
    IF MaskBit(j) OR *no writemask*
        THEN DEST[j] ← (SRC1[i+7:i] BITWISE AND SRC2[i+7:i] == 0) ? 1 : 0
        ELSE DEST[j] ← 0; zeroing masking only
    FI
ENDFOR
DEST[MAX_KL-1:KL] ← 0

VPTESTNMW
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j*16
    IF MaskBit(j) OR *no writemask*
        THEN DEST[j] ← (SRC1[i+15:i] BITWISE AND SRC2[i+15:i] == 0) ? 1 : 0
        ELSE DEST[j] ← 0; zeroing masking only
    FI
ENDFOR
DEST[MAX_KL-1:KL] ← 0

VPTESTNMD
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j*32
    IF MaskBit(j) OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[j] ← (SRC1[i+31:i] BITWISE AND SRC2[31:0] == 0) ? 1 : 0
                ELSE DEST[j] ← (SRC1[i+31:i] BITWISE AND SRC2[i+31:i] == 0) ? 1 : 0
            FI
        ELSE DEST[j] ← 0; zeroing masking only
    FI
ENDFOR
DEST[MAX_KL-1:KL] ← 0

VPTESTNMQ
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j*64
    IF MaskBit(j) OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[j] ← (SRC1[i+63:i] BITWISE AND SRC2[63:0] == 0) ? 1 : 0;
                ELSE DEST[j] ← (SRC1[i+63:i] BITWISE AND SRC2[i+63:i] == 0) ? 1 : 0;
            FI;
        ELSE DEST[j] ← 0; zeroing masking only
    FI
ENDFOR
DEST[MAX_KL-1:KL] ← 0
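As a usage sketch ahead of the intrinsic list below (AVX512F assumed; the helper name is ours), the complementary idiom to VPTESTM's non-zero test finds the zero elements of a vector:

#include <immintrin.h>

/* Mask bit j is 1 iff dword element j of v is zero ((v AND v) == 0). */
__mmask16 zero_lanes(__m512i v) {
    return _mm512_testn_epi32_mask(v, v);
}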
Intel C/C++ Compiler Intrinsic Equivalent
VPTESTNMB __mmask64 _mm512_testn_epi8_mask(__m512i a, __m512i b);
VPTESTNMB __mmask64 _mm512_mask_testn_epi8_mask(__mmask64, __m512i a, __m512i b);
VPTESTNMB __mmask32 _mm256_testn_epi8_mask(__m256i a, __m256i b);
VPTESTNMB __mmask32 _mm256_mask_testn_epi8_mask(__mmask32, __m256i a, __m256i b);
VPTESTNMB __mmask16 _mm_testn_epi8_mask(__m128i a, __m128i b);
VPTESTNMB __mmask16 _mm_mask_testn_epi8_mask(__mmask16, __m128i a, __m128i b);
VPTESTNMW __mmask32 _mm512_testn_epi16_mask(__m512i a, __m512i b);
VPTESTNMW __mmask32 _mm512_mask_testn_epi16_mask(__mmask32, __m512i a, __m512i b);
VPTESTNMW __mmask16 _mm256_testn_epi16_mask(__m256i a, __m256i b);
VPTESTNMW __mmask16 _mm256_mask_testn_epi16_mask(__mmask16, __m256i a, __m256i b);
VPTESTNMW __mmask8 _mm_testn_epi16_mask(__m128i a, __m128i b);
VPTESTNMW __mmask8 _mm_mask_testn_epi16_mask(__mmask8, __m128i a, __m128i b);
VPTESTNMD __mmask16 _mm512_testn_epi32_mask(__m512i a, __m512i b);
VPTESTNMD __mmask16 _mm512_mask_testn_epi32_mask(__mmask16, __m512i a, __m512i b);
VPTESTNMD __mmask8 _mm256_testn_epi32_mask(__m256i a, __m256i b);
VPTESTNMD __mmask8 _mm256_mask_testn_epi32_mask(__mmask8, __m256i a, __m256i b);
VPTESTNMD __mmask8 _mm_testn_epi32_mask(__m128i a, __m128i b);
VPTESTNMD __mmask8 _mm_mask_testn_epi32_mask(__mmask8, __m128i a, __m128i b);
VPTESTNMQ __mmask8 _mm512_testn_epi64_mask(__m512i a, __m512i b);
VPTESTNMQ __mmask8 _mm512_mask_testn_epi64_mask(__mmask8, __m512i a, __m512i b);
VPTESTNMQ __mmask8 _mm256_testn_epi64_mask(__m256i a, __m256i b);
VPTESTNMQ __mmask8 _mm256_mask_testn_epi64_mask(__mmask8, __m256i a, __m256i b);
VPTESTNMQ __mmask8 _mm_testn_epi64_mask(__m128i a, __m128i b);
VPTESTNMQ __mmask8 _mm_mask_testn_epi64_mask(__mmask8, __m128i a, __m128i b);

SIMD Floating-Point Exceptions
None

Other Exceptions
VPTESTNMD/VPTESTNMQ: See Exceptions Type E4.
VPTESTNMB/VPTESTNMW: See Exceptions Type E4.nb.

VRANGEPD—Range Restriction Calculation For Packed Pairs of Float64 Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.66.0F3A.W1 50 /r ib VRANGEPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst, imm8 | FV | V/V | AVX512VL AVX512DQ | Calculate two RANGE operation output values from 2 pairs of double-precision floating-point values in xmm2 and xmm3/m128/m64bcst, store the results to xmm1 under the writemask k1. Imm8 specifies the comparison and sign of the range operation.
EVEX.NDS.256.66.0F3A.W1 50 /r ib VRANGEPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst, imm8 | FV | V/V | AVX512VL AVX512DQ | Calculate four RANGE operation output values from 4 pairs of double-precision floating-point values in ymm2 and ymm3/m256/m64bcst, store the results to ymm1 under the writemask k1. Imm8 specifies the comparison and sign of the range operation.
EVEX.NDS.512.66.0F3A.W1 50 /r ib VRANGEPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{sae}, imm8 | FV | V/V | AVX512DQ | Calculate eight RANGE operation output values from 8 pairs of double-precision floating-point values in zmm2 and zmm3/m512/m64bcst, store the results to zmm1 under the writemask k1. Imm8 specifies the comparison and sign of the range operation.
Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | Imm8

Description
This instruction calculates 2/4/8 range operation outputs from two sets of packed input double-precision FP values in the first source operand (the second operand) and the second source operand (the third operand). The range outputs are written to the destination operand (the first operand) under the writemask k1. Bits 7:4 of the imm8 byte must be zero.
The range operation output is performed in two parts, each configured by a two-bit control field within imm8[3:0]:
• Imm8[1:0] specifies the initial comparison operation to be one of max, min, max absolute value or min absolute value of the input value pair. Each comparison of two input values produces an intermediate result that combines with the sign selection control (Imm8[3:2]) to determine the final range operation output.
• Imm8[3:2] specifies the sign of the range operation output to be one of the following: from the first input value, from the comparison result, set or clear.
The encodings of Imm8[1:0] and Imm8[3:2] are shown in Figure 5-27.

Figure 5-27. Imm8 Controls for VRANGEPD/SD/PS/SS (imm8[7:4]: Must Be Zero; imm8[3:2]: Sign Control (SC); imm8[1:0]: Compare Operation Select)
Imm8[1:0] = 00b : Select Min value
Imm8[1:0] = 01b : Select Max value
Imm8[1:0] = 10b : Select Min-Abs value
Imm8[1:0] = 11b : Select Max-Abs value
Imm8[3:2] = 00b : Select sign(SRC1)
Imm8[3:2] = 01b : Select sign(Compare_Result)
Imm8[3:2] = 10b : Set sign to 0
Imm8[3:2] = 11b : Set sign to 1

When one or more of the input values is a NaN, the comparison operation may signal an invalid exception (IE). The details for cases where one or more input values are NaN are listed in Table 5-12. If the comparison raises an IE, the sign select control (Imm8[3:2]) has no effect on the range operation output; this is also indicated in Table 5-12.
When both input values are zeros of opposite signs, the comparison operation of MIN/MAX in the range compare operation is slightly different from the conceptually similar FP MIN/MAX operations found in the instructions VMAXPD/VMINPD. The details of the MIN/MAX/MIN_ABS/MAX_ABS operations for VRANGEPD/PS/SD/SS for magnitude-0, opposite-signed input cases are listed in Table 5-13. Additionally, non-zero, equal-magnitude, opposite-signed input values perform the MIN_ABS or MAX_ABS comparison operation with the results listed in Table 5-14.

Table 5-12. Signaling of Comparison Operation of One or More NaN Input Values and Effect of Imm8[3:2]
Src1 | Src2 | Result | IE Signaling Due to Comparison | Imm8[3:2] Effect to Range Output
sNaN1 | sNaN2 | Quiet(sNaN1) | Yes | Ignored
sNaN1 | qNaN2 | Quiet(sNaN1) | Yes | Ignored
sNaN1 | Norm2 | Quiet(sNaN1) | Yes | Ignored
qNaN1 | sNaN2 | Quiet(sNaN2) | Yes | Ignored
qNaN1 | qNaN2 | qNaN1 | No | Applicable
qNaN1 | Norm2 | Norm2 | No | Applicable
Norm1 | sNaN2 | Quiet(sNaN2) | Yes | Ignored
Norm1 | qNaN2 | Norm1 | No | Applicable

Table 5-13. Comparison Result for Opposite-Signed Zero Cases for MIN, MIN_ABS and MAX, MAX_ABS
MIN and MIN_ABS: Src1=+0, Src2=-0 | Result=-0; Src1=-0, Src2=+0 | Result=-0
MAX and MAX_ABS: Src1=+0, Src2=-0 | Result=+0; Src1=-0, Src2=+0 | Result=+0

Table 5-14. Comparison Result of Equal-Magnitude Input Cases for MIN_ABS and MAX_ABS, (|a| = |b|, a>0, b<0)
MIN_ABS: Src1=a, Src2=b | Result=b; Src1=b, Src2=a | Result=b
MAX_ABS: Src1=a, Src2=b | Result=a; Src1=b, Src2=a | Result=a
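To summarize the two imm8 control fields, here is a hedged scalar C model for normal, non-NaN inputs (the special cases of Tables 5-12 through 5-14 are omitted; the function name is ours, not part of the manual):

#include <math.h>

/* Scalar model of the range operation: imm8[1:0] picks the compare,
   imm8[3:2] picks the sign of the result. */
static double range_dp_model(double src1, double src2, int imm8) {
    double tmp;
    switch (imm8 & 3) {                        /* CmpOpCtl */
    case 0:  tmp = (src1 <= src2) ? src1 : src2; break;              /* Min */
    case 1:  tmp = (src1 <= src2) ? src2 : src1; break;              /* Max */
    case 2:  tmp = (fabs(src1) <= fabs(src2)) ? src1 : src2; break;  /* Min-Abs */
    default: tmp = (fabs(src1) <= fabs(src2)) ? src2 : src1; break;  /* Max-Abs */
    }
    switch ((imm8 >> 2) & 3) {                 /* SignSelCtl */
    case 0:  return copysign(tmp, src1);       /* sign(SRC1) */
    case 1:  return tmp;                       /* sign of compare result */
    case 2:  return fabs(tmp);                 /* sign cleared */
    default: return -fabs(tmp);                /* sign set */
    }
}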
Operation
RangeDP(SRC1[63:0], SRC2[63:0], CmpOpCtl[1:0], SignSelCtl[1:0])
{
    // Check if SNaN and report IE, see also Table 5-12
    IF (SRC1 = SNAN) THEN RETURN (QNAN(SRC1), set IE);
    IF (SRC2 = SNAN) THEN RETURN (QNAN(SRC2), set IE);
    Src1.exp ← SRC1[62:52];
    Src1.fraction ← SRC1[51:0];
    IF ((Src1.exp = 0) and (Src1.fraction != 0)) THEN // Src1 is a denormal number
        IF DAZ THEN Src1.fraction ← 0;
        ELSE IF (SRC2 <> QNAN) Set DE; FI;
    FI;
    Src2.exp ← SRC2[62:52];
    Src2.fraction ← SRC2[51:0];
    IF ((Src2.exp = 0) and (Src2.fraction != 0)) THEN // Src2 is a denormal number
        IF DAZ THEN Src2.fraction ← 0;
        ELSE IF (SRC1 <> QNAN) Set DE; FI;
    FI;
    IF (SRC2 = QNAN) THEN { TMP[63:0] ← SRC1[63:0] }
    ELSE IF (SRC1 = QNAN) THEN { TMP[63:0] ← SRC2[63:0] }
    ELSE IF (Both SRC1, SRC2 are magnitude-0 and opposite-signed) TMP[63:0] ← from Table 5-13
    ELSE IF (Both SRC1, SRC2 are magnitude-equal and opposite-signed and CmpOpCtl[1:0] > 01) TMP[63:0] ← from Table 5-14
    ELSE
        Case(CmpOpCtl[1:0])
            00: TMP[63:0] ← (SRC1[63:0] ≤ SRC2[63:0]) ? SRC1[63:0] : SRC2[63:0];
            01: TMP[63:0] ← (SRC1[63:0] ≤ SRC2[63:0]) ? SRC2[63:0] : SRC1[63:0];
            10: TMP[63:0] ← (ABS(SRC1[63:0]) ≤ ABS(SRC2[63:0])) ? SRC1[63:0] : SRC2[63:0];
            11: TMP[63:0] ← (ABS(SRC1[63:0]) ≤ ABS(SRC2[63:0])) ? SRC2[63:0] : SRC1[63:0];
        ESAC;
    FI;
    Case(SignSelCtl[1:0])
        00: dest ← (SRC1[63] << 63) OR (TMP[62:0]); // Preserve Src1 sign bit
        01: dest ← TMP[63:0]; // Preserve sign of compare result
        10: dest ← (0 << 63) OR (TMP[62:0]); // Zero out sign bit
        11: dest ← (1 << 63) OR (TMP[62:0]); // Set the sign bit
    ESAC;
    RETURN dest[63:0];
}
CmpOpCtl[1:0] = imm8[1:0];
SignSelCtl[1:0] = imm8[3:2];

VRANGEPD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[i+63:i] ← RangeDP(SRC1[i+63:i], SRC2[63:0], CmpOpCtl[1:0], SignSelCtl[1:0]);
                ELSE DEST[i+63:i] ← RangeDP(SRC1[i+63:i], SRC2[i+63:i], CmpOpCtl[1:0], SignSelCtl[1:0]);
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI;
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0

The following example describes a common usage of this instruction for checking that the input operand is bounded between ±1023.
VRANGEPD zmm_dst, zmm_src, zmm_1023, 02h;
Where:
zmm_dst is the destination operand.
zmm_src is the input operand to compare against ±1023 (this is SRC1).
zmm_1023 is the reference operand, contains the value of 1023 (and this is SRC2).
IMM8=02h (imm8[1:0]=10b) selects the Min-Abs value operation with selection of SRC1.sign.
In case |zmm_src| < 1023 (i.e. SRC1 is smaller than 1023 in magnitude), its value will be written into zmm_dst. Otherwise, zmm_dst will get the value of 1023 (received in zmm_1023, which is SRC2). However, the sign control (imm8[3:2]=00b) instructs the instruction to select the sign of SRC1 received from zmm_src. So, even in the case of |zmm_src| ≥ 1023, the selected sign of SRC1 is kept. Thus, if zmm_src < -1023, the result of VRANGEPD will be the minimal value of -1023, while if zmm_src > +1023, the result of VRANGEPD will be the maximal value of +1023.
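The same ±1023 bounding expressed with the packed intrinsic (a sketch assuming AVX512DQ support; the function name is ours):

#include <immintrin.h>

/* Clamp each element's magnitude to at most 1023.0 while keeping its sign:
   imm8 = 02h selects Min-Abs compare with the sign taken from the first source. */
__m512d clamp_pm1023(__m512d src) {
    return _mm512_range_pd(src, _mm512_set1_pd(1023.0), 0x02);
}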
Intel C/C++ Compiler Intrinsic Equivalent
VRANGEPD __m512d _mm512_range_pd(__m512d a, __m512d b, int imm);
VRANGEPD __m512d _mm512_range_round_pd(__m512d a, __m512d b, int imm, int sae);
VRANGEPD __m512d _mm512_mask_range_pd(__m512d s, __mmask8 k, __m512d a, __m512d b, int imm);
VRANGEPD __m512d _mm512_mask_range_round_pd(__m512d s, __mmask8 k, __m512d a, __m512d b, int imm, int sae);
VRANGEPD __m512d _mm512_maskz_range_pd(__mmask8 k, __m512d a, __m512d b, int imm);
VRANGEPD __m512d _mm512_maskz_range_round_pd(__mmask8 k, __m512d a, __m512d b, int imm, int sae);
VRANGEPD __m256d _mm256_range_pd(__m256d a, __m256d b, int imm);
VRANGEPD __m256d _mm256_mask_range_pd(__m256d s, __mmask8 k, __m256d a, __m256d b, int imm);
VRANGEPD __m256d _mm256_maskz_range_pd(__mmask8 k, __m256d a, __m256d b, int imm);
VRANGEPD __m128d _mm_range_pd(__m128d a, __m128d b, int imm);
VRANGEPD __m128d _mm_mask_range_pd(__m128d s, __mmask8 k, __m128d a, __m128d b, int imm);
VRANGEPD __m128d _mm_maskz_range_pd(__mmask8 k, __m128d a, __m128d b, int imm);

SIMD Floating-Point Exceptions
Invalid, Denormal

Other Exceptions
See Exceptions Type E2.

VRANGEPS—Range Restriction Calculation For Packed Pairs of Float32 Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.66.0F3A.W0 50 /r ib VRANGEPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst, imm8 | FV | V/V | AVX512VL AVX512DQ | Calculate four RANGE operation output values from 4 pairs of single-precision floating-point values in xmm2 and xmm3/m128/m32bcst, store the results to xmm1 under the writemask k1. Imm8 specifies the comparison and sign of the range operation.
EVEX.NDS.256.66.0F3A.W0 50 /r ib VRANGEPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst, imm8 | FV | V/V | AVX512VL AVX512DQ | Calculate eight RANGE operation output values from 8 pairs of single-precision floating-point values in ymm2 and ymm3/m256/m32bcst, store the results to ymm1 under the writemask k1. Imm8 specifies the comparison and sign of the range operation.
EVEX.NDS.512.66.0F3A.W0 50 /r ib VRANGEPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{sae}, imm8 | FV | V/V | AVX512DQ | Calculate 16 RANGE operation output values from 16 pairs of single-precision floating-point values in zmm2 and zmm3/m512/m32bcst, store the results to zmm1 under the writemask k1. Imm8 specifies the comparison and sign of the range operation.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | Imm8

Description
This instruction calculates 4/8/16 range operation outputs from two sets of packed input single-precision FP values in the first source operand (the second operand) and the second source operand (the third operand). The range outputs are written to the destination operand (the first operand) under the writemask k1. Bits 7:4 of the imm8 byte must be zero.
The range operation output is performed in two parts, each configured by a two-bit control field within imm8[3:0]:
• Imm8[1:0] specifies the initial comparison operation to be one of max, min, max absolute value or min absolute value of the input value pair. Each comparison of two input values produces an intermediate result that combines with the sign selection control (Imm8[3:2]) to determine the final range operation output.
• Imm8[3:2] specifies the sign of the range operation output to be one of the following: from the first input value, from the comparison result, set or clear.
The encodings of Imm8[1:0] and Imm8[3:2] are shown in Figure 5-27.
When one or more of the input values is a NaN, the comparison operation may signal an invalid exception (IE). The details for cases where one or more input values are NaN are listed in Table 5-12. If the comparison raises an IE, the sign select control (Imm8[3:2]) has no effect on the range operation output; this is also indicated in Table 5-12.
When both input values are zeros of opposite signs, the comparison operation of MIN/MAX in the range compare operation is slightly different from the conceptually similar FP MIN/MAX operations found in the instructions VMAXPD/VMINPD. The details of the MIN/MAX/MIN_ABS/MAX_ABS operations for VRANGEPD/PS/SD/SS for magnitude-0, opposite-signed input cases are listed in Table 5-13. Additionally, non-zero, equal-magnitude, opposite-signed input values perform the MIN_ABS or MAX_ABS comparison operation with the results listed in Table 5-14.

Operation
RangeSP(SRC1[31:0], SRC2[31:0], CmpOpCtl[1:0], SignSelCtl[1:0])
{
    // Check if SNaN and report IE, see also Table 5-12
    IF (SRC1 = SNAN) THEN RETURN (QNAN(SRC1), set IE);
    IF (SRC2 = SNAN) THEN RETURN (QNAN(SRC2), set IE);
    Src1.exp ← SRC1[30:23];
    Src1.fraction ← SRC1[22:0];
    IF ((Src1.exp = 0) and (Src1.fraction != 0)) THEN // Src1 is a denormal number
        IF DAZ THEN Src1.fraction ← 0;
        ELSE IF (SRC2 <> QNAN) Set DE; FI;
    FI;
    Src2.exp ← SRC2[30:23];
    Src2.fraction ← SRC2[22:0];
    IF ((Src2.exp = 0) and (Src2.fraction != 0)) THEN // Src2 is a denormal number
        IF DAZ THEN Src2.fraction ← 0;
        ELSE IF (SRC1 <> QNAN) Set DE; FI;
    FI;
    IF (SRC2 = QNAN) THEN { TMP[31:0] ← SRC1[31:0] }
    ELSE IF (SRC1 = QNAN) THEN { TMP[31:0] ← SRC2[31:0] }
    ELSE IF (Both SRC1, SRC2 are magnitude-0 and opposite-signed) TMP[31:0] ← from Table 5-13
    ELSE IF (Both SRC1, SRC2 are magnitude-equal and opposite-signed and CmpOpCtl[1:0] > 01) TMP[31:0] ← from Table 5-14
    ELSE
        Case(CmpOpCtl[1:0])
            00: TMP[31:0] ← (SRC1[31:0] ≤ SRC2[31:0]) ? SRC1[31:0] : SRC2[31:0];
            01: TMP[31:0] ← (SRC1[31:0] ≤ SRC2[31:0]) ? SRC2[31:0] : SRC1[31:0];
            10: TMP[31:0] ← (ABS(SRC1[31:0]) ≤ ABS(SRC2[31:0])) ? SRC1[31:0] : SRC2[31:0];
            11: TMP[31:0] ← (ABS(SRC1[31:0]) ≤ ABS(SRC2[31:0])) ? SRC2[31:0] : SRC1[31:0];
        ESAC;
    FI;
    Case(SignSelCtl[1:0])
        00: dest ← (SRC1[31] << 31) OR (TMP[30:0]); // Preserve Src1 sign bit
        01: dest ← TMP[31:0]; // Preserve sign of compare result
        10: dest ← (0 << 31) OR (TMP[30:0]); // Zero out sign bit
        11: dest ← (1 << 31) OR (TMP[30:0]); // Set the sign bit
    ESAC;
    RETURN dest[31:0];
}
CmpOpCtl[1:0] = imm8[1:0];
SignSelCtl[1:0] = imm8[3:2];

VRANGEPS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[i+31:i] ← RangeSP(SRC1[i+31:i], SRC2[31:0], CmpOpCtl[1:0], SignSelCtl[1:0]);
                ELSE DEST[i+31:i] ← RangeSP(SRC1[i+31:i], SRC2[i+31:i], CmpOpCtl[1:0], SignSelCtl[1:0]);
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI;
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0

The following example describes a common usage of this instruction for checking that the input operand is bounded between ±150.
VRANGEPS zmm_dst, zmm_src, zmm_150, 02h;
Where:
zmm_dst is the destination operand.
zmm_src is the input operand to compare against ±150.
zmm_150 is the reference operand, contains the value of 150.
IMM8=02h (imm8[1:0]=10b) selects the Min-Abs value operation with selection of SRC1.sign.
In case |zmm_src| < 150, its value will be written into zmm_dst. Otherwise, zmm_dst will get the value of 150 (received in zmm_150). However, the sign control (imm8[3:2]=00b) instructs the instruction to select the sign of SRC1 received from zmm_src. So, even in the case of |zmm_src| ≥ 150, the selected sign of SRC1 is kept. Thus, if zmm_src < -150, the result of VRANGEPS will be the minimal value of -150, while if zmm_src > +150, the result of VRANGEPS will be the maximal value of +150.

Intel C/C++ Compiler Intrinsic Equivalent
VRANGEPS __m512 _mm512_range_ps(__m512 a, __m512 b, int imm);
VRANGEPS __m512 _mm512_range_round_ps(__m512 a, __m512 b, int imm, int sae);
VRANGEPS __m512 _mm512_mask_range_ps(__m512 s, __mmask16 k, __m512 a, __m512 b, int imm);
VRANGEPS __m512 _mm512_mask_range_round_ps(__m512 s, __mmask16 k, __m512 a, __m512 b, int imm, int sae);
VRANGEPS __m512 _mm512_maskz_range_ps(__mmask16 k, __m512 a, __m512 b, int imm);
VRANGEPS __m512 _mm512_maskz_range_round_ps(__mmask16 k, __m512 a, __m512 b, int imm, int sae);
VRANGEPS __m256 _mm256_range_ps(__m256 a, __m256 b, int imm);
VRANGEPS __m256 _mm256_mask_range_ps(__m256 s, __mmask8 k, __m256 a, __m256 b, int imm);
VRANGEPS __m256 _mm256_maskz_range_ps(__mmask8 k, __m256 a, __m256 b, int imm);
VRANGEPS __m128 _mm_range_ps(__m128 a, __m128 b, int imm);
VRANGEPS __m128 _mm_mask_range_ps(__m128 s, __mmask8 k, __m128 a, __m128 b, int imm);
VRANGEPS __m128 _mm_maskz_range_ps(__mmask8 k, __m128 a, __m128 b, int imm);

SIMD Floating-Point Exceptions
Invalid, Denormal

Other Exceptions
See Exceptions Type E2.

VRANGESD—Range Restriction Calculation From a Pair of Scalar Float64 Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.LIG.66.0F3A.W1 51 /r VRANGESD xmm1 {k1}{z}, xmm2, xmm3/m64{sae}, imm8 | T1S | V/V | AVX512DQ | Calculate a RANGE operation output value from 2 double-precision floating-point values in xmm2 and xmm3/m64, store the output to xmm1 under writemask. Imm8 specifies the comparison and sign of the range operation.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
T1S | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | Imm8

Description
This instruction calculates a range operation output from two input double-precision FP values in the low qword element of the first source operand (the second operand) and the second source operand (the third operand). The range output is written to the low qword element of the destination operand (the first operand) under the writemask k1. Bits 7:4 of the imm8 byte must be zero.
The range operation output is performed in two parts, each configured by a two-bit control field within imm8[3:0]:
• Imm8[1:0] specifies the initial comparison operation to be one of max, min, max absolute value or min absolute value of the input value pair.
Each comparison of two input values produces an intermediate result that combines with the sign selection control (Imm8[3:2]) to determine the final range operation output.
• Imm8[3:2] specifies the sign of the range operation output to be one of the following: from the first input value, from the comparison result, set or clear.
The encodings of Imm8[1:0] and Imm8[3:2] are shown in Figure 5-27.
Bits 127:64 of the destination operand are copied from the respective element of the first source operand.
When one or more of the input values is a NaN, the comparison operation may signal an invalid exception (IE). The details for cases where one or more input values are NaN are listed in Table 5-12. If the comparison raises an IE, the sign select control (Imm8[3:2]) has no effect on the range operation output; this is also indicated in Table 5-12.
When both input values are zeros of opposite signs, the comparison operation of MIN/MAX in the range compare operation is slightly different from the conceptually similar FP MIN/MAX operations found in the instructions VMAXPD/VMINPD. The details of the MIN/MAX/MIN_ABS/MAX_ABS operations for VRANGEPD/PS/SD/SS for magnitude-0, opposite-signed input cases are listed in Table 5-13. Additionally, non-zero, equal-magnitude, opposite-signed input values perform the MIN_ABS or MAX_ABS comparison operation with the results listed in Table 5-14.

Operation
RangeDP(SRC1[63:0], SRC2[63:0], CmpOpCtl[1:0], SignSelCtl[1:0])
{
    // Check if SNaN and report IE, see also Table 5-12
    IF (SRC1 = SNAN) THEN RETURN (QNAN(SRC1), set IE);
    IF (SRC2 = SNAN) THEN RETURN (QNAN(SRC2), set IE);
    Src1.exp ← SRC1[62:52];
    Src1.fraction ← SRC1[51:0];
    IF ((Src1.exp = 0) and (Src1.fraction != 0)) THEN // Src1 is a denormal number
        IF DAZ THEN Src1.fraction ← 0;
        ELSE IF (SRC2 <> QNAN) Set DE; FI;
    FI;
    Src2.exp ← SRC2[62:52];
    Src2.fraction ← SRC2[51:0];
    IF ((Src2.exp = 0) and (Src2.fraction != 0)) THEN // Src2 is a denormal number
        IF DAZ THEN Src2.fraction ← 0;
        ELSE IF (SRC1 <> QNAN) Set DE; FI;
    FI;
    IF (SRC2 = QNAN) THEN { TMP[63:0] ← SRC1[63:0] }
    ELSE IF (SRC1 = QNAN) THEN { TMP[63:0] ← SRC2[63:0] }
    ELSE IF (Both SRC1, SRC2 are magnitude-0 and opposite-signed) TMP[63:0] ← from Table 5-13
    ELSE IF (Both SRC1, SRC2 are magnitude-equal and opposite-signed and CmpOpCtl[1:0] > 01) TMP[63:0] ← from Table 5-14
    ELSE
        Case(CmpOpCtl[1:0])
            00: TMP[63:0] ← (SRC1[63:0] ≤ SRC2[63:0]) ? SRC1[63:0] : SRC2[63:0];
            01: TMP[63:0] ← (SRC1[63:0] ≤ SRC2[63:0]) ? SRC2[63:0] : SRC1[63:0];
            10: TMP[63:0] ← (ABS(SRC1[63:0]) ≤ ABS(SRC2[63:0])) ? SRC1[63:0] : SRC2[63:0];
            11: TMP[63:0] ← (ABS(SRC1[63:0]) ≤ ABS(SRC2[63:0])) ? SRC2[63:0] : SRC1[63:0];
        ESAC;
    FI;
    Case(SignSelCtl[1:0])
        00: dest ← (SRC1[63] << 63) OR (TMP[62:0]); // Preserve Src1 sign bit
        01: dest ← TMP[63:0]; // Preserve sign of compare result
        10: dest ← (0 << 63) OR (TMP[62:0]); // Zero out sign bit
        11: dest ← (1 << 63) OR (TMP[62:0]); // Set the sign bit
    ESAC;
    RETURN dest[63:0];
}
CmpOpCtl[1:0] = imm8[1:0];
SignSelCtl[1:0] = imm8[3:2];
VRANGESD
IF k1[0] OR *no writemask*
    THEN DEST[63:0] ← RangeDP(SRC1[63:0], SRC2[63:0], CmpOpCtl[1:0], SignSelCtl[1:0]);
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAX_VL-1:128] ← 0

The following example describes a common usage of this instruction for checking that the input operand is bounded between ±1023.
VRANGESD xmm_dst, xmm_src, xmm_1023, 02h;
Where:
xmm_dst is the destination operand.
xmm_src is the input operand to compare against ±1023.
xmm_1023 is the reference operand, contains the value of 1023.
IMM8=02h (imm8[1:0]=10b) selects the Min-Abs value operation with selection of SRC1.sign.
In case |xmm_src| < 1023, its value will be written into xmm_dst. Otherwise, xmm_dst will get the value of 1023 (received in xmm_1023). However, the sign control (imm8[3:2]=00b) instructs the instruction to select the sign of SRC1 received from xmm_src. So, even in the case of |xmm_src| ≥ 1023, the selected sign of SRC1 is kept. Thus, if xmm_src < -1023, the result of VRANGESD will be the minimal value of -1023, while if xmm_src > +1023, the result of VRANGESD will be the maximal value of +1023.

Intel C/C++ Compiler Intrinsic Equivalent
VRANGESD __m128d _mm_range_sd(__m128d a, __m128d b, int imm);
VRANGESD __m128d _mm_range_round_sd(__m128d a, __m128d b, int imm, int sae);
VRANGESD __m128d _mm_mask_range_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, int imm);
VRANGESD __m128d _mm_mask_range_round_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, int imm, int sae);
VRANGESD __m128d _mm_maskz_range_sd(__mmask8 k, __m128d a, __m128d b, int imm);
VRANGESD __m128d _mm_maskz_range_round_sd(__mmask8 k, __m128d a, __m128d b, int imm, int sae);

SIMD Floating-Point Exceptions
Invalid, Denormal

Other Exceptions
See Exceptions Type E3.

VRANGESS—Range Restriction Calculation From a Pair of Scalar Float32 Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.LIG.66.0F3A.W0 51 /r VRANGESS xmm1 {k1}{z}, xmm2, xmm3/m32{sae}, imm8 | T1S | V/V | AVX512DQ | Calculate a RANGE operation output value from 2 single-precision floating-point values in xmm2 and xmm3/m32, store the output to xmm1 under writemask. Imm8 specifies the comparison and sign of the range operation.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
T1S | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | Imm8

Description
This instruction calculates a range operation output from two input single-precision FP values in the low dword element of the first source operand (the second operand) and the second source operand (the third operand). The range output is written to the low dword element of the destination operand (the first operand) under the writemask k1. Bits 7:4 of the imm8 byte must be zero.
The range operation output is performed in two parts, each configured by a two-bit control field within imm8[3:0]:
• Imm8[1:0] specifies the initial comparison operation to be one of max, min, max absolute value or min absolute value of the input value pair. Each comparison of two input values produces an intermediate result that combines with the sign selection control (Imm8[3:2]) to determine the final range operation output.
• Imm8[3:2] specifies the sign of the range operation output to be one of the following: from the first input value, from the comparison result, set or clear.
The encodings of Imm8[1:0] and Imm8[3:2] are shown in Figure 5-27.
Bits 127:32 of the destination operand are copied from the respective elements of the first source operand.
When one or more of the input values is a NaN, the comparison operation may signal an invalid exception (IE). The details for cases where one or more input values are NaN are listed in Table 5-12. If the comparison raises an IE, the sign select control (Imm8[3:2]) has no effect on the range operation output; this is also indicated in Table 5-12.
When both input values are zeros of opposite signs, the comparison operation of MIN/MAX in the range compare operation is slightly different from the conceptually similar FP MIN/MAX operations found in the instructions VMAXPD/VMINPD. The details of the MIN/MAX/MIN_ABS/MAX_ABS operations for VRANGEPD/PS/SD/SS for magnitude-0, opposite-signed input cases are listed in Table 5-13. Additionally, non-zero, equal-magnitude, opposite-signed input values perform the MIN_ABS or MAX_ABS comparison operation with the results listed in Table 5-14.

Operation
RangeSP(SRC1[31:0], SRC2[31:0], CmpOpCtl[1:0], SignSelCtl[1:0])
{
    // Check if SNaN and report IE, see also Table 5-12
    IF (SRC1 = SNAN) THEN RETURN (QNAN(SRC1), set IE);
    IF (SRC2 = SNAN) THEN RETURN (QNAN(SRC2), set IE);
    Src1.exp ← SRC1[30:23];
    Src1.fraction ← SRC1[22:0];
    IF ((Src1.exp = 0) and (Src1.fraction != 0)) THEN // Src1 is a denormal number
        IF DAZ THEN Src1.fraction ← 0;
        ELSE IF (SRC2 <> QNAN) Set DE; FI;
    FI;
    Src2.exp ← SRC2[30:23];
    Src2.fraction ← SRC2[22:0];
    IF ((Src2.exp = 0) and (Src2.fraction != 0)) THEN // Src2 is a denormal number
        IF DAZ THEN Src2.fraction ← 0;
        ELSE IF (SRC1 <> QNAN) Set DE; FI;
    FI;
    IF (SRC2 = QNAN) THEN { TMP[31:0] ← SRC1[31:0] }
    ELSE IF (SRC1 = QNAN) THEN { TMP[31:0] ← SRC2[31:0] }
    ELSE IF (Both SRC1, SRC2 are magnitude-0 and opposite-signed) TMP[31:0] ← from Table 5-13
    ELSE IF (Both SRC1, SRC2 are magnitude-equal and opposite-signed and CmpOpCtl[1:0] > 01) TMP[31:0] ← from Table 5-14
    ELSE
        Case(CmpOpCtl[1:0])
            00: TMP[31:0] ← (SRC1[31:0] ≤ SRC2[31:0]) ? SRC1[31:0] : SRC2[31:0];
            01: TMP[31:0] ← (SRC1[31:0] ≤ SRC2[31:0]) ? SRC2[31:0] : SRC1[31:0];
            10: TMP[31:0] ← (ABS(SRC1[31:0]) ≤ ABS(SRC2[31:0])) ? SRC1[31:0] : SRC2[31:0];
            11: TMP[31:0] ← (ABS(SRC1[31:0]) ≤ ABS(SRC2[31:0])) ? SRC2[31:0] : SRC1[31:0];
        ESAC;
    FI;
    Case(SignSelCtl[1:0])
        00: dest ← (SRC1[31] << 31) OR (TMP[30:0]); // Preserve Src1 sign bit
        01: dest ← TMP[31:0]; // Preserve sign of compare result
        10: dest ← (0 << 31) OR (TMP[30:0]); // Zero out sign bit
        11: dest ← (1 << 31) OR (TMP[30:0]); // Set the sign bit
    ESAC;
    RETURN dest[31:0];
}
CmpOpCtl[1:0] = imm8[1:0];
SignSelCtl[1:0] = imm8[3:2];

VRANGESS
IF k1[0] OR *no writemask*
    THEN DEST[31:0] ← RangeSP(SRC1[31:0], SRC2[31:0], CmpOpCtl[1:0], SignSelCtl[1:0]);
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAX_VL-1:128] ← 0

The following example describes a common usage of this instruction for checking that the input operand is bounded between ±150.
VRANGESS xmm_dst, xmm_src, xmm_150, 02h;
Where:
xmm_dst is the destination operand.
xmm_src is the input operand to compare against ±150.
xmm_150 is the reference operand, contains the value of 150.
IMM8=02h (imm8[1:0]=10b) selects the Min-Abs value operation with selection of SRC1.sign.
In case |xmm_src| < 150, its value will be written into xmm_dst. Otherwise, xmm_dst will get the value of 150 (received in xmm_150). However, the sign control (imm8[3:2]=00b) instructs the instruction to select the sign of SRC1 received from xmm_src. So, even in the case of |xmm_src| ≥ 150, the selected sign of SRC1 is kept. Thus, if xmm_src < -150, the result of VRANGESS will be the minimal value of -150, while if xmm_src > +150, the result of VRANGESS will be the maximal value of +150.

Intel C/C++ Compiler Intrinsic Equivalent
VRANGESS __m128 _mm_range_ss(__m128 a, __m128 b, int imm);
VRANGESS __m128 _mm_range_round_ss(__m128 a, __m128 b, int imm, int sae);
VRANGESS __m128 _mm_mask_range_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, int imm);
VRANGESS __m128 _mm_mask_range_round_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, int imm, int sae);
VRANGESS __m128 _mm_maskz_range_ss(__mmask8 k, __m128 a, __m128 b, int imm);
VRANGESS __m128 _mm_maskz_range_round_ss(__mmask8 k, __m128 a, __m128 b, int imm, int sae);

SIMD Floating-Point Exceptions
Invalid, Denormal

Other Exceptions
See Exceptions Type E3.

VRCP14PD—Compute Approximate Reciprocals of Packed Float64 Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W1 4C /r VRCP14PD xmm1 {k1}{z}, xmm2/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Computes the approximate reciprocals of the packed double-precision floating-point values in xmm2/m128/m64bcst and stores the results in xmm1. Under writemask.
EVEX.256.66.0F38.W1 4C /r VRCP14PD ymm1 {k1}{z}, ymm2/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Computes the approximate reciprocals of the packed double-precision floating-point values in ymm2/m256/m64bcst and stores the results in ymm1. Under writemask.
EVEX.512.66.0F38.W1 4C /r VRCP14PD zmm1 {k1}{z}, zmm2/m512/m64bcst | FV | V/V | AVX512F | Computes the approximate reciprocals of the packed double-precision floating-point values in zmm2/m512/m64bcst and stores the results in zmm1. Under writemask.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
This instruction performs a SIMD computation of the approximate reciprocals of eight/four/two packed double-precision floating-point values in the source operand (the second operand) and stores the packed double-precision floating-point results in the destination operand. The maximum relative error for this approximation is less than 2^-14.
The source operand can be a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 64-bit memory location. The destination operand is a ZMM register conditionally updated according to the writemask.
The VRCP14PD instruction is not affected by the rounding control bits in the MXCSR register. When a source value is a 0.0, an ∞ with the sign of the source value is returned. A denormal source value will be treated as zero only if the DAZ bit is set in MXCSR; otherwise it is treated correctly (i.e. not as a 0.0). Underflow results are flushed to zero only if the FTZ bit is set in MXCSR; otherwise the correct underflow result is written, with the sign of the operand.
Table 5-15. VRCP14PD/VRCP14SD Special Cases
Input value | Result value | Comments
0 ≤ X ≤ 2^-1024 | INF | Very small denormal
-2^-1024 ≤ X ≤ -0 | -INF | Very small denormal
X > 2^1022 | Underflow | Up to 18 bits of fractions are returned*
X < -2^1022 | -Underflow | Up to 18 bits of fractions are returned*
X = 2^-n | 2^n |
X = -2^-n | -2^n |
* In this case the mantissa is shifted right by one or two bits.
A numerically exact implementation of VRCP14xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.
VRCP14PD—Compute Approximate Reciprocals of Packed Float64 Values Vol. 2C 5-485
INSTRUCTION SET REFERENCE, V-Z
Operation
VRCP14PD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC *is memory*)
                THEN DEST[i+63:i] ← APPROXIMATE(1.0/SRC[63:0]);
                ELSE DEST[i+63:i] ← APPROXIMATE(1.0/SRC[i+63:i]);
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI;
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0
Intel C/C++ Compiler Intrinsic Equivalent
VRCP14PD __m512d _mm512_rcp14_pd( __m512d a);
VRCP14PD __m512d _mm512_mask_rcp14_pd(__m512d s, __mmask8 k, __m512d a);
VRCP14PD __m512d _mm512_maskz_rcp14_pd( __mmask8 k, __m512d a);
SIMD Floating-Point Exceptions
None
Other Exceptions
See Exceptions Type E4.
5-486 Vol. 2C VRCP14PD—Compute Approximate Reciprocals of Packed Float64 Values
INSTRUCTION SET REFERENCE, V-Z
VRCP14SD—Compute Approximate Reciprocal of Scalar Float64 Value
Opcode/Instruction: EVEX.NDS.LIG.66.0F38.W1 4D /r VRCP14SD xmm1 {k1}{z}, xmm2, xmm3/m64
Op/En: T1S. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F.
Description: Computes the approximate reciprocal of the scalar double-precision floating-point value in xmm3/m64 and stores the result in xmm1 using writemask k1. Also, the upper double-precision floating-point value (bits[127:64]) from xmm2 is copied to xmm1[127:64].
Instruction Operand Encoding
Op/En: T1S. Operand 1: ModRM:reg (w). Operand 2: EVEX.vvvv (r). Operand 3: ModRM:r/m (r). Operand 4: NA.
Description
This instruction performs a SIMD computation of the approximate reciprocal of the low double-precision floating-point value in the second source operand (the third operand) and stores the result in the low quadword element of the destination operand (the first operand) according to the writemask k1. Bits (127:64) of the XMM register destination are copied from the corresponding bits in the first source operand (the second operand). The maximum relative error for this approximation is less than 2^-14.
The source operand can be an XMM register or a 64-bit memory location. The destination operand is an XMM register.
The VRCP14SD instruction is not affected by the rounding control bits in the MXCSR register. When a source value is a 0.0, an ∞ with the sign of the source value is returned. A denormal source value is treated as zero only if the DAZ bit is set in MXCSR; otherwise it is treated correctly (i.e., not as a 0.0). Underflow results are flushed to zero only if the FTZ bit is set in MXCSR; otherwise the correct underflow result is written, with the sign of the operand.
When a source value is a SNaN or QNaN, the SNaN is converted to a QNaN or the source QNaN is returned. See Table 5-15 for special-case input values. MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported. A numerically exact implementation of VRCP14xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2. Operation VRCP14SD (EVEX version) IF k1[0] OR *no writemask* THEN DEST[63:0] APPROXIMATE(1.0/SRC2[63:0]); ELSE IF *merging-masking* ; merging-masking THEN *DEST[63:0] remains unchanged* ELSE ; zeroing-masking DEST[63:0] 0 FI; FI; DEST[127:64] SRC1[127:64] DEST[MAX_VL-1:128] 0 VRCP14SD—Compute Approximate Reciprocal of Scalar Float64 Value Vol. 2C 5-487 INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VRCP14SD __m128d _mm_rcp14_sd( __m128d a, __m128d b); VRCP14SD __m128d _mm_mask_rcp14_sd(__m128d s, __mmask8 k, __m128d a, __m128d b); VRCP14SD __m128d _mm_maskz_rcp14_sd( __mmask8 k, __m128d a, __m128d b); SIMD Floating-Point Exceptions None Other Exceptions See Exceptions Type E5. 5-488 Vol. 2C VRCP14SD—Compute Approximate Reciprocal of Scalar Float64 Value INSTRUCTION SET REFERENCE, V-Z VRCP14PS—Compute Approximate Reciprocals of Packed Float32 Values Opcode/ Instruction Op / En FV 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512F EVEX.128.66.0F38.W0 4C /r VRCP14PS xmm1 {k1}{z}, xmm2/m128/m32bcst EVEX.256.66.0F38.W0 4C /r VRCP14PS ymm1 {k1}{z}, ymm2/m256/m32bcst EVEX.512.66.0F38.W0 4C /r VRCP14PS zmm1 {k1}{z}, zmm2/m512/m32bcst FV V/V AVX512VL AVX512F FV V/V AVX512F Description Computes the approximate reciprocals of the packed singleprecision floating-point values in xmm2/m128/m32bcst and stores the results in xmm1. Under writemask. Computes the approximate reciprocals of the packed singleprecision floating-point values in ymm2/m256/m32bcst and stores the results in ymm1. Under writemask. Computes the approximate reciprocals of the packed singleprecision floating-point values in zmm2/m512/m32bcst and stores the results in zmm1. Under writemask. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 FV ModRM:reg (w) ModRM:r/m (r) NA NA Description This instruction performs a SIMD computation of the approximate reciprocals of the packed single-precision floating-point values in the source operand (the second operand) and stores the packed single-precision floatingpoint results in the destination operand (the first operand). The maximum relative error for this approximation is less than 2-14. The source operand can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 32bit memory location. The destination operand is a ZMM register conditionally updated according to the writemask. The VRCP14PS instruction is not affected by the rounding control bits in the MXCSR register. When a source value is a 0.0, an ∞ with the sign of the source value is returned. A denormal source value will be treated as zero only in case of DAZ bit set in MXCSR. Otherwise it is treated correctly (i.e. not as a 0.0). Underflow results are flushed to zero only in case of FTZ bit set in MXCSR. Otherwise it will be treated correctly (i.e. correct underflow result is written) with the sign of the operand. When a source value is a SNaN or QNaN, the SNaN is converted to a QNaN or the source QNaN is returned. EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD. 
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.
Table 5-16. VRCP14PS/VRCP14SS Special Cases
Input value | Result value | Comments
0 ≤ X ≤ 2^-128 | INF | Very small denormal
-2^-128 ≤ X ≤ -0 | -INF | Very small denormal
X > 2^126 | Underflow | Up to 18 bits of fractions are returned*
X < -2^126 | -Underflow | Up to 18 bits of fractions are returned*
X = 2^-n | 2^n |
X = -2^-n | -2^n |
* In this case the mantissa is shifted right by one or two bits.
A numerically exact implementation of VRCP14xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.
VRCP14PS—Compute Approximate Reciprocals of Packed Float32 Values Vol. 2C 5-489
INSTRUCTION SET REFERENCE, V-Z
Operation
VRCP14PS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC *is memory*)
                THEN DEST[i+31:i] ← APPROXIMATE(1.0/SRC[31:0]);
                ELSE DEST[i+31:i] ← APPROXIMATE(1.0/SRC[i+31:i]);
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI;
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0
Intel C/C++ Compiler Intrinsic Equivalent
VRCP14PS __m512 _mm512_rcp14_ps( __m512 a);
VRCP14PS __m512 _mm512_mask_rcp14_ps(__m512 s, __mmask16 k, __m512 a);
VRCP14PS __m512 _mm512_maskz_rcp14_ps( __mmask16 k, __m512 a);
VRCP14PS __m256 _mm256_rcp14_ps( __m256 a);
VRCP14PS __m256 _mm256_mask_rcp14_ps(__m256 s, __mmask8 k, __m256 a);
VRCP14PS __m256 _mm256_maskz_rcp14_ps( __mmask8 k, __m256 a);
VRCP14PS __m128 _mm_rcp14_ps( __m128 a);
VRCP14PS __m128 _mm_mask_rcp14_ps(__m128 s, __mmask8 k, __m128 a);
VRCP14PS __m128 _mm_maskz_rcp14_ps( __mmask8 k, __m128 a);
SIMD Floating-Point Exceptions
None
Other Exceptions
See Exceptions Type E4.
5-490 Vol. 2C VRCP14PS—Compute Approximate Reciprocals of Packed Float32 Values
INSTRUCTION SET REFERENCE, V-Z
VRCP14SS—Compute Approximate Reciprocal of Scalar Float32 Value
Opcode/Instruction: EVEX.NDS.LIG.66.0F38.W0 4D /r VRCP14SS xmm1 {k1}{z}, xmm2, xmm3/m32
Op/En: T1S. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F.
Description: Computes the approximate reciprocal of the scalar single-precision floating-point value in xmm3/m32 and stores the result in xmm1 using writemask k1. Also, the upper three single-precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32].
Instruction Operand Encoding
Op/En: T1S. Operand 1: ModRM:reg (w). Operand 2: EVEX.vvvv (r). Operand 3: ModRM:r/m (r). Operand 4: NA.
Description
This instruction performs a SIMD computation of the approximate reciprocal of the low single-precision floating-point value in the second source operand (the third operand) and stores the result in the low doubleword element of the destination operand (the first operand) according to the writemask k1. Bits (127:32) of the XMM register destination are copied from the corresponding bits in the first source operand (the second operand). The maximum relative error for this approximation is less than 2^-14.
The source operand can be an XMM register or a 32-bit memory location. The destination operand is an XMM register.
The VRCP14SS instruction is not affected by the rounding control bits in the MXCSR register. When a source value is a 0.0, an ∞ with the sign of the source value is returned. A denormal source value is treated as zero only if the DAZ bit is set in MXCSR; otherwise it is treated correctly (i.e., not as a 0.0).
Underflow results are flushed to zero only in case of FTZ bit set in MXCSR. Otherwise it will be treated correctly (i.e. correct underflow result is written) with the sign of the operand. When a source value is a SNaN or QNaN, the SNaN is converted to a QNaN or the source QNaN is returned. See Table 5-16 for special-case input values. MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported. A numerically exact implementation of VRCP14xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2. Operation VRCP14SS (EVEX version) IF k1[0] OR *no writemask* THEN DEST[31:0] APPROXIMATE(1.0/SRC2[31:0]); ELSE IF *merging-masking* ; merging-masking THEN *DEST[31:0] remains unchanged* ELSE ; zeroing-masking DEST[31:0] 0 FI; FI; DEST[127:32] SRC1[127:32] DEST[MAX_VL-1:128] 0 VRCP14SS—Compute Approximate Reciprocal of Scalar Float32 Value Vol. 2C 5-491 INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VRCP14SS __m128 _mm_rcp14_ss( __m128 a, __m128 b); VRCP14SS __m128 _mm_mask_rcp14_ss(__m128 s, __mmask8 k, __m128 a, __m128 b); VRCP14SS __m128 _mm_maskz_rcp14_ss( __mmask8 k, __m128 a, __m128 b); SIMD Floating-Point Exceptions None Other Exceptions See Exceptions Type E5. 5-492 Vol. 2C VRCP14SS—Compute Approximate Reciprocal of Scalar Float32 Value INSTRUCTION SET REFERENCE, V-Z VRCP28PD—Approximation to the Reciprocal of Packed Double-Precision Floating-Point Values with Less Than 2^-28 Relative Error Opcode/ Instruction Op / En EVEX.512.66.0F38.W1 CA /r VRCP28PD zmm1 {k1}{z}, zmm2/m512/m64bcst {sae} FV 64/32 bit Mode Support V/V CPUID Feature Flag AVX512ER Description Computes the approximate reciprocals ( < 2^-28 relative error) of the packed double-precision floating-point values in zmm2/m512/m64bcst and stores the results in zmm1. Under writemask. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 FV ModRM:reg (w) ModRM:r/m (r) NA NA Description Computes the reciprocal approximation of the float64 values in the source operand (the second operand) and store the results to the destination operand (the first operand). The approximate reciprocal is evaluated with less than 2^-28 of maximum relative error. Denormal input values are treated as zeros and do not signal #DE, irrespective of MXCSR.DAZ. Denormal results are flushed to zeros and do not signal #UE, irrespective of MXCSR.FZ. If any source element is NaN, the quietized NaN source value is returned for that element. If any source element is ±∞, ±0.0 is returned for that element. Also, if any source element is ±0.0, ±∞ is returned for that element. The source operand is a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 64-bit memory location. The destination operand is a ZMM register, conditionally updated using writemask k1. EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD. A numerically exact implementation of VRCP28xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2. 
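A minimal usage sketch for the packed form described above, assuming an AVX512ER target (e.g., compiled with -mavx512er); the helper name recip28_pd is illustrative:

#include <immintrin.h>

// Approximate 1.0/a elementwise with < 2^-28 relative error.
// _MM_FROUND_NO_EXC requests the {sae} (suppress-all-exceptions) behavior.
// Per the special cases below, ±0.0 inputs return ±∞ and denormal inputs
// are treated as zeros irrespective of MXCSR.DAZ.
__m512d recip28_pd(__m512d a)
{
    return _mm512_rcp28_round_pd(a, _MM_FROUND_NO_EXC);
}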
Operation
VRCP28PD (EVEX encoded version)
(KL, VL) = (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC *is memory*)
                THEN DEST[i+63:i] ← RCP_28_DP(1.0/SRC[63:0]);
                ELSE DEST[i+63:i] ← RCP_28_DP(1.0/SRC[i+63:i]);
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI;
    FI;
ENDFOR;
VRCP28PD—Approximation to the Reciprocal of Packed Double-Precision Floating-Point Values with Less Than 2^-28 Relative Error Vol. 2C 5-493
INSTRUCTION SET REFERENCE, V-Z
Table 5-17. VRCP28PD Special Cases
Input value | Result value | Comments
NAN | QNAN(input) | If (SRC = SNaN) then #I
0 ≤ X < 2^-1022 | INF | Positive input denormal or zero; #Z
-2^-1022 < X ≤ -0 | -INF | Negative input denormal or zero; #Z
X > 2^1022 | +0.0f |
X < -2^1022 | -0.0f |
X = +∞ | +0.0f |
X = -∞ | -0.0f |
X = 2^-n | 2^n | Exact result (unless input/output is a denormal)
X = -2^-n | -2^n | Exact result (unless input/output is a denormal)
Intel C/C++ Compiler Intrinsic Equivalent
VRCP28PD __m512d _mm512_rcp28_round_pd ( __m512d a, int sae);
VRCP28PD __m512d _mm512_mask_rcp28_round_pd(__m512d s, __mmask8 m, __m512d a, int sae);
VRCP28PD __m512d _mm512_maskz_rcp28_round_pd( __mmask8 m, __m512d a, int sae);
SIMD Floating-Point Exceptions
Invalid (if SNaN input), Divide-by-zero
Other Exceptions
See Exceptions Type E2.
5-494 Vol. 2C VRCP28PD—Approximation to the Reciprocal of Packed Double-Precision Floating-Point Values with Less Than 2^-28 Relative Error
INSTRUCTION SET REFERENCE, V-Z
VRCP28SD—Approximation to the Reciprocal of Scalar Double-Precision Floating-Point Value with Less Than 2^-28 Relative Error
Opcode/Instruction: EVEX.NDS.LIG.66.0F38.W1 CB /r VRCP28SD xmm1 {k1}{z}, xmm2, xmm3/m64 {sae}
Op/En: T1S. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512ER.
Description: Computes the approximate reciprocal ( < 2^-28 relative error) of the scalar double-precision floating-point value in xmm3/m64 and stores the result in xmm1. Under writemask. Also, the upper double-precision floating-point value (bits[127:64]) from xmm2 is copied to xmm1[127:64].
Instruction Operand Encoding
Op/En: T1S. Operand 1: ModRM:reg (w). Operand 2: EVEX.vvvv (r). Operand 3: ModRM:r/m (r). Operand 4: NA.
Description
Computes the reciprocal approximation of the low float64 value in the second source operand (the third operand) and stores the result in the destination operand (the first operand). The approximate reciprocal is evaluated with less than 2^-28 of maximum relative error. The result is written into the low float64 element of the destination operand according to the writemask k1. Bits 127:64 of the destination are copied from the corresponding bits of the first source operand (the second operand).
A denormal input value is treated as zero and does not signal #DE, irrespective of MXCSR.DAZ. A denormal result is flushed to zero and does not signal #UE, irrespective of MXCSR.FZ.
If any source element is NaN, the quietized NaN source value is returned for that element. If any source element is ±∞, ±0.0 is returned for that element. Also, if any source element is ±0.0, ±∞ is returned for that element.
The first source operand is an XMM register. The second source operand is an XMM register or a 64-bit memory location. The destination operand is an XMM register, conditionally updated using writemask k1.
A numerically exact implementation of VRCP28xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.
Operation
VRCP28SD (EVEX encoded version)
IF k1[0] OR *no writemask*
    THEN DEST[63:0] ← RCP_28_DP(1.0/SRC2[63:0]);
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAX_VL-1:128] ← 0
VRCP28SD—Approximation to the Reciprocal of Scalar Double-Precision Floating-Point Value with Less Than 2^-28 Relative Error Vol. 2C 5-495
INSTRUCTION SET REFERENCE, V-Z
Table 5-18. VRCP28SD Special Cases
Input value | Result value | Comments
NAN | QNAN(input) | If (SRC = SNaN) then #I
0 ≤ X < 2^-1022 | INF | Positive input denormal or zero; #Z
-2^-1022 < X ≤ -0 | -INF | Negative input denormal or zero; #Z
X > 2^1022 | +0.0f |
X < -2^1022 | -0.0f |
X = +∞ | +0.0f |
X = -∞ | -0.0f |
X = 2^-n | 2^n | Exact result (unless input/output is a denormal)
X = -2^-n | -2^n | Exact result (unless input/output is a denormal)
Intel C/C++ Compiler Intrinsic Equivalent
VRCP28SD __m128d _mm_rcp28_round_sd ( __m128d a, __m128d b, int sae);
VRCP28SD __m128d _mm_mask_rcp28_round_sd(__m128d s, __mmask8 m, __m128d a, __m128d b, int sae);
VRCP28SD __m128d _mm_maskz_rcp28_round_sd(__mmask8 m, __m128d a, __m128d b, int sae);
SIMD Floating-Point Exceptions
Invalid (if SNaN input), Divide-by-zero
Other Exceptions
See Exceptions Type E3.
5-496 Vol. 2C VRCP28SD—Approximation to the Reciprocal of Scalar Double-Precision Floating-Point Value with Less Than 2^-28 Relative Error
INSTRUCTION SET REFERENCE, V-Z
VRCP28PS—Approximation to the Reciprocal of Packed Single-Precision Floating-Point Values with Less Than 2^-28 Relative Error
Opcode/Instruction: EVEX.512.66.0F38.W0 CA /r VRCP28PS zmm1 {k1}{z}, zmm2/m512/m32bcst {sae}
Op/En: FV. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512ER.
Description: Computes the approximate reciprocals ( < 2^-28 relative error) of the packed single-precision floating-point values in zmm2/m512/m32bcst and stores the results in zmm1. Under writemask.
Instruction Operand Encoding
Op/En: FV. Operand 1: ModRM:reg (w). Operand 2: ModRM:r/m (r). Operand 3: NA. Operand 4: NA.
Description
Computes the reciprocal approximation of the float32 values in the source operand (the second operand) and stores the results in the destination operand (the first operand) using the writemask k1. The approximate reciprocal is evaluated with less than 2^-28 of maximum relative error prior to final rounding. The final results are rounded to < 2^-23 relative error before being written to the destination.
Denormal input values are treated as zeros and do not signal #DE, irrespective of MXCSR.DAZ. Denormal results are flushed to zeros and do not signal #UE, irrespective of MXCSR.FZ.
If any source element is NaN, the quietized NaN source value is returned for that element. If any source element is ±∞, ±0.0 is returned for that element. Also, if any source element is ±0.0, ±∞ is returned for that element.
The source operand is a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM register, conditionally updated using writemask k1.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
A numerically exact implementation of VRCP28xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.
Operation
VRCP28PS (EVEX encoded version)
(KL, VL) = (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC *is memory*)
                THEN DEST[i+31:i] ← RCP_28_SP(1.0/SRC[31:0]);
                ELSE DEST[i+31:i] ← RCP_28_SP(1.0/SRC[i+31:i]);
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI;
    FI;
ENDFOR;
VRCP28PS—Approximation to the Reciprocal of Packed Single-Precision Floating-Point Values with Less Than 2^-28 Relative Error Vol. 2C 5-497
INSTRUCTION SET REFERENCE, V-Z
Table 5-19. VRCP28PS Special Cases
Input value | Result value | Comments
NAN | QNAN(input) | If (SRC = SNaN) then #I
0 ≤ X < 2^-126 | INF | Positive input denormal or zero; #Z
-2^-126 < X ≤ -0 | -INF | Negative input denormal or zero; #Z
X > 2^126 | +0.0f |
X < -2^126 | -0.0f |
X = +∞ | +0.0f |
X = -∞ | -0.0f |
X = 2^-n | 2^n | Exact result (unless input/output is a denormal)
X = -2^-n | -2^n | Exact result (unless input/output is a denormal)
Intel C/C++ Compiler Intrinsic Equivalent
VRCP28PS __m512 _mm512_rcp28_round_ps ( __m512 a, int sae);
VRCP28PS __m512 _mm512_mask_rcp28_round_ps(__m512 s, __mmask16 m, __m512 a, int sae);
VRCP28PS __m512 _mm512_maskz_rcp28_round_ps( __mmask16 m, __m512 a, int sae);
SIMD Floating-Point Exceptions
Invalid (if SNaN input), Divide-by-zero
Other Exceptions
See Exceptions Type E2.
5-498 Vol. 2C VRCP28PS—Approximation to the Reciprocal of Packed Single-Precision Floating-Point Values with Less Than 2^-28 Relative Error
INSTRUCTION SET REFERENCE, V-Z
VRCP28SS—Approximation to the Reciprocal of Scalar Single-Precision Floating-Point Value with Less Than 2^-28 Relative Error
Opcode/Instruction: EVEX.NDS.LIG.66.0F38.W0 CB /r VRCP28SS xmm1 {k1}{z}, xmm2, xmm3/m32 {sae}
Op/En: T1S. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512ER.
Description: Computes the approximate reciprocal ( < 2^-28 relative error) of the scalar single-precision floating-point value in xmm3/m32 and stores the result in xmm1. Under writemask. Also, the upper 3 single-precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32].
Instruction Operand Encoding
Op/En: T1S. Operand 1: ModRM:reg (w). Operand 2: EVEX.vvvv (r). Operand 3: ModRM:r/m (r). Operand 4: NA.
Description
Computes the reciprocal approximation of the low float32 value in the second source operand (the third operand) and stores the result in the destination operand (the first operand). The approximate reciprocal is evaluated with less than 2^-28 of maximum relative error prior to final rounding. The final result is rounded to < 2^-23 relative error before being written into the low float32 element of the destination according to writemask k1. Bits 127:32 of the destination are copied from the corresponding bits of the first source operand (the second operand).
A denormal input value is treated as zero and does not signal #DE, irrespective of MXCSR.DAZ. A denormal result is flushed to zero and does not signal #UE, irrespective of MXCSR.FZ.
If any source element is NaN, the quietized NaN source value is returned for that element. If any source element is ±∞, ±0.0 is returned for that element. Also, if any source element is ±0.0, ±∞ is returned for that element.
The first source operand is an XMM register. The second source operand is an XMM register or a 32-bit memory location. The destination operand is an XMM register, conditionally updated using writemask k1.
A numerically exact implementation of VRCP28xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.
Operation
VRCP28SS (EVEX encoded version)
IF k1[0] OR *no writemask*
    THEN DEST[31:0] ← RCP_28_SP(1.0/SRC2[31:0]);
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAX_VL-1:128] ← 0
VRCP28SS—Approximation to the Reciprocal of Scalar Single-Precision Floating-Point Value with Less Than 2^-28 Relative Error Vol. 2C 5-499
INSTRUCTION SET REFERENCE, V-Z
Table 5-20. VRCP28SS Special Cases
Input value | Result value | Comments
NAN | QNAN(input) | If (SRC = SNaN) then #I
0 ≤ X < 2^-126 | INF | Positive input denormal or zero; #Z
-2^-126 < X ≤ -0 | -INF | Negative input denormal or zero; #Z
X > 2^126 | +0.0f |
X < -2^126 | -0.0f |
X = +∞ | +0.0f |
X = -∞ | -0.0f |
X = 2^-n | 2^n | Exact result (unless input/output is a denormal)
X = -2^-n | -2^n | Exact result (unless input/output is a denormal)
Intel C/C++ Compiler Intrinsic Equivalent
VRCP28SS __m128 _mm_rcp28_round_ss ( __m128 a, __m128 b, int sae);
VRCP28SS __m128 _mm_mask_rcp28_round_ss(__m128 s, __mmask8 m, __m128 a, __m128 b, int sae);
VRCP28SS __m128 _mm_maskz_rcp28_round_ss(__mmask8 m, __m128 a, __m128 b, int sae);
SIMD Floating-Point Exceptions
Invalid (if SNaN input), Divide-by-zero
Other Exceptions
See Exceptions Type E3.
5-500 Vol. 2C VRCP28SS—Approximation to the Reciprocal of Scalar Single-Precision Floating-Point Value with Less Than 2^-28 Relative Error
INSTRUCTION SET REFERENCE, V-Z
VREDUCEPD—Perform Reduction Transformation on Packed Float64 Values
Opcode/Instruction: EVEX.128.66.0F3A.W1 56 /r ib VREDUCEPD xmm1 {k1}{z}, xmm2/m128/m64bcst, imm8
Op/En: FV. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512VL AVX512DQ.
Description: Perform reduction transformation on packed double-precision floating-point values in xmm2/m128/m64bcst by subtracting a number of fraction bits specified by the imm8 field. Stores the result in the xmm1 register under writemask k1.
Opcode/Instruction: EVEX.256.66.0F3A.W1 56 /r ib VREDUCEPD ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8
Op/En: FV. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512VL AVX512DQ.
Description: Perform reduction transformation on packed double-precision floating-point values in ymm2/m256/m64bcst by subtracting a number of fraction bits specified by the imm8 field. Stores the result in the ymm1 register under writemask k1.
Opcode/Instruction: EVEX.512.66.0F3A.W1 56 /r ib VREDUCEPD zmm1 {k1}{z}, zmm2/m512/m64bcst{sae}, imm8
Op/En: FV. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512DQ.
Description: Perform reduction transformation on packed double-precision floating-point values in zmm2/m512/m64bcst by subtracting a number of fraction bits specified by the imm8 field. Stores the result in the zmm1 register under writemask k1.
Instruction Operand Encoding
Op/En: FV. Operand 1: ModRM:reg (w). Operand 2: ModRM:r/m (r). Operand 3: Imm8. Operand 4: NA.
Description
Perform a reduction transformation of the packed binary encoded double-precision FP values in the source operand (the second operand) and store the reduced results in binary FP format to the destination operand (the first operand) under the writemask k1. The reduction transformation subtracts the integer part and the leading M fractional bits from the binary FP source value, where M is an unsigned integer specified by imm8[7:4]; see Figure 5-28.
Specifically, the reduction transformation can be expressed as:
dest = src - (ROUND(2^M * src)) * 2^-M;
where "Round()" treats "src", "2^M", and their product as binary FP numbers with normalized significand and biased exponents.
The magnitude of the reduced result can be expressed by considering src = 2^p * man2, where 'man2' is the normalized significand and 'p' is the unbiased exponent.
Then if RC = RNE: 0 ≤ |Reduced Result| ≤ 2^(p-M-1).
Then if RC ≠ RNE: 0 ≤ |Reduced Result| < 2^(p-M).
This instruction might end up with a precision exception set. However, in case of SPE set (i.e., Suppress Precision Exception, which is imm8[3]=1), no precision exception is reported.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
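As a concrete illustration of the formula above (the numbers are chosen here for exposition and do not come from the manual): with M = imm8[7:4] = 2 and RNE rounding, an input src = 1.7 gives ROUND(2^2 * 1.7) = ROUND(6.8) = 7, so the subtracted value is 7 * 2^-2 = 1.75 and dest = 1.7 - 1.75 = -0.05. The integer part and the two leading fraction bits are removed, and since 1.7 = 2^0 * 1.7 (p = 0), the residual satisfies |dest| = 0.05 ≤ 2^(p-M-1) = 0.125, consistent with the RNE bound.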
Figure 5-28. Imm8 Controls for VREDUCEPD/SD/PS/SS
imm8[7:4]: Fixed point length (number of fixed points to subtract).
imm8[3]: SPE (Suppress Precision Exception). Imm8[3] = 0b: use MXCSR exception mask; Imm8[3] = 1b: suppress.
imm8[2]: RS (Round Select). Imm8[2] = 0b: use Imm8[1:0]; Imm8[2] = 1b: use MXCSR.
imm8[1:0]: Round Control Override. 00b: round nearest even; 01b: round down; 10b: round up; 11b: truncate.
VREDUCEPD—Perform Reduction Transformation on Packed Float64 Values Vol. 2C 5-501
INSTRUCTION SET REFERENCE, V-Z
Handling of special cases of input values is listed in Table 5-21.
Table 5-21. VREDUCEPD/SD/PS/SS Special Cases
Input value | Round Mode | Returned value
|Src1| < 2^(-M-1) | RNE | Src1
|Src1| < 2^-M | RPI, Src1 > 0 | Round(Src1 - 2^-M)*
|Src1| < 2^-M | RPI, Src1 ≤ 0 | Src1
|Src1| < 2^-M | RNI, Src1 ≥ 0 | Src1
|Src1| < 2^-M | RNI, Src1 < 0 | Round(Src1 + 2^-M)*
Src1 = ±0, or Dest = ±0 (Src1 != INF) | NOT RNI | +0.0
Src1 = ±0, or Dest = ±0 (Src1 != INF) | RNI | -0.0
Src1 = ±INF | any | +0.0
Src1 = ±NAN | n/a | QNaN(Src1)
* Round control = (imm8.MS1) ? MXCSR.RC : imm8.RC
Operation
ReduceArgumentDP(SRC[63:0], imm8[7:0])
{
    // Check for NaN
    IF (SRC[63:0] = NAN) THEN
        RETURN (Convert SRC[63:0] to QNaN); FI;
    M ← imm8[7:4]; // Number of fraction bits of the normalized significand to be subtracted
    RC ← imm8[1:0]; // Round Control for ROUND() operation
    RC_source ← imm8[2];
    SPE ← 0; // Suppress Precision Exception
    TMP[63:0] ← 2^-M * {ROUND(2^M * SRC[63:0], SPE, RC_source, RC)}; // ROUND() treats SRC and 2^M as standard binary FP values
    TMP[63:0] ← SRC[63:0] - TMP[63:0]; // subtraction under the same RC, SPE controls
    RETURN TMP[63:0]; // binary encoded FP with biased exponent and normalized significand
}
VREDUCEPD
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC *is memory*)
                THEN DEST[i+63:i] ← ReduceArgumentDP(SRC[63:0], imm8[7:0]);
                ELSE DEST[i+63:i] ← ReduceArgumentDP(SRC[i+63:i], imm8[7:0]);
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI;
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0
5-502 Vol. 2C VREDUCEPD—Perform Reduction Transformation on Packed Float64 Values
INSTRUCTION SET REFERENCE, V-Z
Intel C/C++ Compiler Intrinsic Equivalent
VREDUCEPD __m512d _mm512_reduce_round_pd( __m512d a, int imm, int sae)
VREDUCEPD __m512d _mm512_mask_reduce_round_pd(__m512d s, __mmask8 k, __m512d a, int imm, int sae)
VREDUCEPD __m512d _mm512_maskz_reduce_round_pd(__mmask8 k, __m512d a, int imm, int sae)
VREDUCEPD __m256d _mm256_reduce_pd( __m256d a, int imm)
VREDUCEPD __m256d _mm256_mask_reduce_pd(__m256d s, __mmask8 k, __m256d a, int imm)
VREDUCEPD __m256d _mm256_maskz_reduce_pd(__mmask8 k, __m256d a, int imm)
VREDUCEPD __m128d _mm_reduce_pd( __m128d a, int imm)
VREDUCEPD __m128d _mm_mask_reduce_pd(__m128d s, __mmask8 k, __m128d a, int imm)
VREDUCEPD __m128d _mm_maskz_reduce_pd(__mmask8 k, __m128d a, int imm)
SIMD Floating-Point Exceptions
Invalid, Precision
If SPE is enabled, the precision exception is not reported (regardless of the MXCSR exception mask).
Other Exceptions
See Exceptions Type E2, additionally #UD If EVEX.vvvv != 1111B.
VREDUCEPD—Perform Reduction Transformation on Packed Float64 Values Vol. 2C 5-503
INSTRUCTION SET REFERENCE, V-Z
VREDUCESD—Perform a Reduction Transformation on a Scalar Float64 Value
Opcode/Instruction: EVEX.NDS.LIG.66.0F3A.W1 57 /r ib VREDUCESD xmm1 {k1}{z}, xmm2, xmm3/m64{sae}, imm8
Op/En: T1S. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512DQ.
Description: Perform a reduction transformation on a scalar double-precision floating-point value in xmm3/m64 by subtracting a number of fraction bits specified by the imm8 field. Also, the upper double-precision floating-point value (bits[127:64]) from xmm2 is copied to xmm1[127:64]. Stores the result in the xmm1 register.
Instruction Operand Encoding
Op/En: T1S. Operand 1: ModRM:reg (w). Operand 2: EVEX.vvvv (r). Operand 3: ModRM:r/m (r). Operand 4: Imm8.
Description
Perform a reduction transformation of the binary encoded double-precision FP value in the low qword element of the second source operand (the third operand) and store the reduced result in binary FP format to the low qword element of the destination operand (the first operand) under the writemask k1. Bits 127:64 of the destination operand are copied from the respective qword elements of the first source operand (the second operand).
The reduction transformation subtracts the integer part and the leading M fractional bits from the binary FP source value, where M is an unsigned integer specified by imm8[7:4]; see Figure 5-28.
Specifically, the reduction transformation can be expressed as:
dest = src - (ROUND(2^M * src)) * 2^-M;
where "Round()" treats "src", "2^M", and their product as binary FP numbers with normalized significand and biased exponents.
The magnitude of the reduced result can be expressed by considering src = 2^p * man2, where 'man2' is the normalized significand and 'p' is the unbiased exponent.
Then if RC = RNE: 0 ≤ |Reduced Result| ≤ 2^(p-M-1).
Then if RC ≠ RNE: 0 ≤ |Reduced Result| < 2^(p-M).
This instruction might end up with a precision exception set. However, in case of SPE set (i.e., Suppress Precision Exception, which is imm8[3]=1), no precision exception is reported. The operation is write masked.
Handling of special cases of input values is listed in Table 5-21.
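Before the formal pseudocode that follows, the transformation can be modeled in scalar C. This is a minimal sketch assuming the default round-to-nearest-even mode; it ignores the SPE control and the unlimited-exponent-range detail noted above, and the name reduce_rne is illustrative:

#include <math.h>

// dest = src - ROUND(2^M * src) * 2^-M, with ROUND at round-to-nearest-even.
// rint() honors the current rounding mode, which defaults to RNE;
// ldexp(x, M) computes x * 2^M.
double reduce_rne(double src, int M)
{
    double rounded = ldexp(rint(ldexp(src, M)), -M);
    return src - rounded;
}
// Example: reduce_rne(1.7, 2) is approximately -0.05 (1.7 minus 7/4).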
Operation ReduceArgumentDP(SRC[63:0], imm8[7:0]) { // Check for NaN IF (SRC [63:0] = NAN) THEN RETURN (Convert SRC[63:0] to QNaN); FI; M imm8[7:4]; // Number of fraction bits of the normalized significand to be subtracted RC imm8[1:0];// Round Control for ROUND() operation RC source imm[2]; SPE 0;// Suppress Precision Exception TMP[63:0] 2-M *{ROUND(2M*SRC[63:0], SPE, RC_source, RC)}; // ROUND() treats SRC and 2M as standard binary FP values TMP[63:0] SRC[63:0] – TMP[63:0]; // subtraction under the same RC,SPE controls RETURN TMP[63:0]; // binary encoded FP with biased exponent and normalized significand } 5-504 Vol. 2C VREDUCESD—Perform a Reduction Transformation on a Scalar Float64 Value INSTRUCTION SET REFERENCE, V-Z VREDUCESD IF k1[0] or *no writemask* THEN DEST[63:0] ReduceArgumentDP(SRC2[63:0], imm8[7:0]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[63:0] remains unchanged* ELSE ; zeroing-masking THEN DEST[63:0] = 0 FI; FI; DEST[127:64] SRC1[127:64] DEST[MAX_VL-1:128] 0 Intel C/C++ Compiler Intrinsic Equivalent VREDUCESD __m128d _mm_mask_reduce_sd( __m128d a, __m128d b, int imm, int sae) VREDUCESD __m128d _mm_mask_reduce_sd(__m128d s, __mmask16 k, __m128d a, __m128d b, int imm, int sae) VREDUCESD __m128d _mm_maskz_reduce_sd(__mmask16 k, __m128d a, __m128d b, int imm, int sae) SIMD Floating-Point Exceptions Invalid, Precision If SPE is enabled, precision exception is not reported (regardless of MXCSR exception mask). Other Exceptions See Exceptions Type E3. VREDUCESD—Perform a Reduction Transformation on a Scalar Float64 Value Vol. 2C 5-505 INSTRUCTION SET REFERENCE, V-Z VREDUCEPS—Perform Reduction Transformation on Packed Float32 Values Opcode/ Instruction Op / En FV 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512DQ EVEX.128.66.0F3A.W0 56 /r ib VREDUCEPS xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8 EVEX.256.66.0F3A.W0 56 /r ib VREDUCEPS ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8 FV V/V AVX512VL AVX512DQ EVEX.512.66.0F3A.W0 56 /r ib VREDUCEPS zmm1 {k1}{z}, zmm2/m512/m32bcst{sae}, imm8 FV V/V AVX512DQ Description Perform reduction transformation on packed single-precision floating point values in xmm2/m128/m32bcst by subtracting a number of fraction bits specified by the imm8 field. Stores the result in xmm1 register under writemask k1. Perform reduction transformation on packed single-precision floating point values in ymm2/m256/m32bcst by subtracting a number of fraction bits specified by the imm8 field. Stores the result in ymm1 register under writemask k1. Perform reduction transformation on packed single-precision floating point values in zmm2/m512/m32bcst by subtracting a number of fraction bits specified by the imm8 field. Stores the result in zmm1 register under writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 FV ModRM:reg (w) ModRM:r/m (r) Imm8 NA Description Perform reduction transformation of the packed binary encoded single-precision FP values in the source operand (the second operand) and store the reduced results in binary FP format to the destination operand (the first operand) under the writemask k1. The reduction transformation subtracts the integer part and the leading M fractional bits from the binary FP source value, where M is a unsigned integer specified by imm8[7:4], see Figure 5-28. Specifically, the reduction transformation can be expressed as: dest = src – (ROUND(2M*src))*2-M; where “Round()” treats “src”, “2M”, and their product as binary FP numbers with normalized significand and biased exponents. 
The magnitude of the reduced result can be expressed by considering src= 2p*man2, where ‘man2’ is the normalized significand and ‘p’ is the unbiased exponent Then if RC = RNE: 0<=|Reduced Result|<=2p-M-1 Then if RC ≠ RNE: 0<=|Reduced Result|<2p-M This instruction might end up with a precision exception set. However, in case of SPE set (i.e. Suppress Precision Exception, which is imm8[3]=1), no precision exception is reported. EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD. Handling of special case of input values are listed in Table 5-21. 5-506 Vol. 2C VREDUCEPS—Perform Reduction Transformation on Packed Float32 Values INSTRUCTION SET REFERENCE, V-Z Operation ReduceArgumentSP(SRC[31:0], imm8[7:0]) { // Check for NaN IF (SRC [31:0] = NAN) THEN RETURN (Convert SRC[31:0] to QNaN); FI M imm8[7:4]; // Number of fraction bits of the normalized significand to be subtracted RC imm8[1:0];// Round Control for ROUND() operation RC source imm[2]; SPE 0;// Suppress Precision Exception TMP[31:0] 2-M *{ROUND(2M*SRC[31:0], SPE, RC_source, RC)}; // ROUND() treats SRC and 2M as standard binary FP values TMP[31:0] SRC[31:0] – TMP[31:0]; // subtraction under the same RC,SPE controls RETURN TMP[31:0]; // binary encoded FP with biased exponent and normalized significand } VREDUCEPS (KL, VL) = (4, 128), (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN IF (EVEX.b == 1) AND (SRC *is memory*) THEN DEST[i+31:i] ReduceArgumentSP(SRC[31:0], imm8[7:0]); ELSE DEST[i+31:i] ReduceArgumentSP(SRC[i+31:i], imm8[7:0]); FI; ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] = 0 FI; FI; ENDFOR; DEST[MAX_VL-1:VL] 0 Intel C/C++ Compiler Intrinsic Equivalent VREDUCEPS __m512 _mm512_mask_reduce_ps( __m512 a, int imm, int sae) VREDUCEPS __m512 _mm512_mask_reduce_ps(__m512 s, __mmask16 k, __m512 a, int imm, int sae) VREDUCEPS __m512 _mm512_maskz_reduce_ps(__mmask16 k, __m512 a, int imm, int sae) VREDUCEPS __m256 _mm256_mask_reduce_ps( __m256 a, int imm) VREDUCEPS __m256 _mm256_mask_reduce_ps(__m256 s, __mmask8 k, __m256 a, int imm) VREDUCEPS __m256 _mm256_maskz_reduce_ps(__mmask8 k, __m256 a, int imm) VREDUCEPS __m128 _mm_mask_reduce_ps( __m128 a, int imm) VREDUCEPS __m128 _mm_mask_reduce_ps(__m128 s, __mmask8 k, __m128 a, int imm) VREDUCEPS __m128 _mm_maskz_reduce_ps(__mmask8 k, __m128 a, int imm) SIMD Floating-Point Exceptions Invalid, Precision If SPE is enabled, precision exception is not reported (regardless of MXCSR exception mask). Other Exceptions See Exceptions Type E2, additionally #UD If EVEX.vvvv != 1111B. VREDUCEPS—Perform Reduction Transformation on Packed Float32 Values Vol. 2C 5-507 INSTRUCTION SET REFERENCE, V-Z VREDUCESS—Perform a Reduction Transformation on a Scalar Float32 Value Opcode/ Instruction Op / En EVEX.NDS.LIG.66.0F3A.W0 57 /r /ib VREDUCESS xmm1 {k1}{z}, xmm2, xmm3/m32{sae}, imm8 T1S 64/32 bit Mode Support V/V CPUID Feature Flag AVX512DQ Description Perform a reduction transformation on a scalar single-precision floating point value in xmm3/m32 by subtracting a number of fraction bits specified by the imm8 field. Also, upper single precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32]. Stores the result in xmm1 register. 
Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 T1S ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA Description Perform a reduction transformation of the binary encoded single-precision FP value in the low dword element of the second source operand (the third operand) and store the reduced result in binary FP format to the low dword element of the destination operand (the first operand) under the writemask k1. Bits 127:32 of the destination operand are copied from respective dword elements of the first source operand (the second operand). The reduction transformation subtracts the integer part and the leading M fractional bits from the binary FP source value, where M is a unsigned integer specified by imm8[7:4], see Figure 5-28. Specifically, the reduction transformation can be expressed as: dest = src – (ROUND(2M*src))*2-M; where “Round()” treats “src”, “2M”, and their product as binary FP numbers with normalized significand and biased exponents. The magnitude of the reduced result can be expressed by considering src= 2p*man2, where ‘man2’ is the normalized significand and ‘p’ is the unbiased exponent Then if RC = RNE: 0<=|Reduced Result|<=2p-M-1 Then if RC ≠ RNE: 0<=|Reduced Result|<2p-M This instruction might end up with a precision exception set. However, in case of SPE set (i.e. Suppress Precision Exception, which is imm8[3]=1), no precision exception is reported. Handling of special case of input values are listed in Table 5-21. Operation ReduceArgumentSP(SRC[31:0], imm8[7:0]) { // Check for NaN IF (SRC [31:0] = NAN) THEN RETURN (Convert SRC[31:0] to QNaN); FI M imm8[7:4]; // Number of fraction bits of the normalized significand to be subtracted RC imm8[1:0];// Round Control for ROUND() operation RC source imm[2]; SPE 0;// Suppress Precision Exception TMP[31:0] 2-M *{ROUND(2M*SRC[31:0], SPE, RC_source, RC)}; // ROUND() treats SRC and 2M as standard binary FP values TMP[31:0] SRC[31:0] – TMP[31:0]; // subtraction under the same RC,SPE controls RETURN TMP[31:0]; // binary encoded FP with biased exponent and normalized significand } 5-508 Vol. 2C VREDUCESS—Perform a Reduction Transformation on a Scalar Float32 Value INSTRUCTION SET REFERENCE, V-Z VREDUCESS IF k1[0] or *no writemask* THEN DEST[31:0] ReduceArgumentSP(SRC2[31:0], imm8[7:0]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[31:0] remains unchanged* ELSE ; zeroing-masking THEN DEST[31:0] = 0 FI; FI; DEST[127:32] SRC1[127:32] DEST[MAX_VL-1:128] 0 Intel C/C++ Compiler Intrinsic Equivalent VREDUCESS __m128 _mm_mask_reduce_ss( __m128 a, __m128 b, int imm, int sae) VREDUCESS __m128 _mm_mask_reduce_ss(__m128 s, __mmask16 k, __m128 a, __m128 b, int imm, int sae) VREDUCESS __m128 _mm_maskz_reduce_ss(__mmask16 k, __m128 a, __m128 b, int imm, int sae) SIMD Floating-Point Exceptions Invalid, Precision If SPE is enabled, precision exception is not reported (regardless of MXCSR exception mask). Other Exceptions See Exceptions Type E3. VREDUCESS—Perform a Reduction Transformation on a Scalar Float32 Value Vol. 
2C 5-509
INSTRUCTION SET REFERENCE, V-Z
VRNDSCALEPD—Round Packed Float64 Values To Include A Given Number Of Fraction Bits
Opcode/Instruction: EVEX.128.66.0F3A.W1 09 /r ib VRNDSCALEPD xmm1 {k1}{z}, xmm2/m128/m64bcst, imm8
Op/En: FV. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512VL AVX512F.
Description: Rounds packed double-precision floating-point values in xmm2/m128/m64bcst to a number of fraction bits specified by the imm8 field. Stores the result in the xmm1 register. Under writemask.
Opcode/Instruction: EVEX.256.66.0F3A.W1 09 /r ib VRNDSCALEPD ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8
Op/En: FV. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512VL AVX512F.
Description: Rounds packed double-precision floating-point values in ymm2/m256/m64bcst to a number of fraction bits specified by the imm8 field. Stores the result in the ymm1 register. Under writemask.
Opcode/Instruction: EVEX.512.66.0F3A.W1 09 /r ib VRNDSCALEPD zmm1 {k1}{z}, zmm2/m512/m64bcst{sae}, imm8
Op/En: FV. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F.
Description: Rounds packed double-precision floating-point values in zmm2/m512/m64bcst to a number of fraction bits specified by the imm8 field. Stores the result in the zmm1 register using writemask k1.
Instruction Operand Encoding
Op/En: FV. Operand 1: ModRM:reg (w). Operand 2: ModRM:r/m (r). Operand 3: Imm8. Operand 4: NA.
Description
Round the double-precision floating-point values in the source operand by the rounding mode specified in the immediate operand (see Figure 5-29) and place the result in the destination operand.
The destination operand (the first operand) is a ZMM/YMM/XMM register conditionally updated according to the writemask. The source operand (the second operand) can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-bit memory location.
The rounding process rounds the input to an integral value, plus a number of fraction bits specified by imm8[7:4] (to be included in the result), and returns the result as a double-precision floating-point value. Note that no overflow is induced while executing this instruction (although the source is scaled by the imm8[7:4] value).
The immediate operand also specifies control fields for the rounding operation. Three bit fields are defined and shown in the "Immediate Control Description" figure below. Bit 3 of the immediate byte controls the processor behavior for a precision exception, bit 2 selects the source of rounding mode control, and bits 1:0 specify a non-sticky rounding-mode value (the immediate control figure below lists the encoded values for the rounding-mode field).
The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an SNaN, it will be converted to a QNaN. If DAZ is set to '1', denormals will be converted to zero before rounding.
The sign of the result of this instruction is preserved, including the sign of zero.
The formula of the operation on each data element for VRNDSCALEPD is
ROUND(x) = 2^-M * Round_to_INT(x * 2^M, round_ctrl), round_ctrl = imm[3:0]; M = imm[7:4];
The operation of x * 2^M is computed as if the exponent range is unlimited (i.e., no overflow ever occurs).
5-510 Vol. 2C VRNDSCALEPD—Round Packed Float64 Values To Include A Given Number Of Fraction Bits
INSTRUCTION SET REFERENCE, V-Z
VRNDSCALEPD is a more general form of the VEX-encoded VROUNDPD instruction. In VROUNDPD, the formula of the operation on each element is
ROUND(x) = Round_to_INT(x, round_ctrl), round_ctrl = imm[3:0];
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
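For example, rounding each element to the nearest multiple of 2^-4 while suppressing the precision exception uses imm8 = 48H (M = 4, SPE = 1, RNE). A minimal sketch with the packed intrinsic listed later in this entry; the helper name round_to_sixteenths is an assumption:

#include <immintrin.h>

// imm8 = 0x48: imm8[7:4] = 4 fraction bits kept, imm8[3] = 1 (suppress #PE),
// imm8[2] = 0 (use imm8[1:0]), imm8[1:0] = 00b (round to nearest even).
__m512d round_to_sixteenths(__m512d a)
{
    return _mm512_roundscale_pd(a, 0x48);
}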
Figure 5-29. Imm8 Controls for VRNDSCALEPD/SD/PS/SS
imm8[7:4]: Fixed point length (number of fixed points to preserve).
imm8[3]: SPE (Suppress Precision Exception). Imm8[3] = 0b: use MXCSR exception mask; Imm8[3] = 1b: suppress.
imm8[2]: RS (Round Select). Imm8[2] = 0b: use Imm8[1:0]; Imm8[2] = 1b: use MXCSR.
imm8[1:0]: Round Control Override. 00b: round nearest even; 01b: round down; 10b: round up; 11b: truncate.
Handling of special cases of input values is listed in Table 5-22.
Table 5-22. VRNDSCALEPD/SD/PS/SS Special Cases
Input value | Returned value
Src1 = ±INF | Src1
Src1 = ±NAN | Src1 converted to QNAN
Src1 = ±0 | Src1
VRNDSCALEPD—Round Packed Float64 Values To Include A Given Number Of Fraction Bits Vol. 2C 5-511
INSTRUCTION SET REFERENCE, V-Z
Operation
RoundToIntegerDP(SRC[63:0], imm8[7:0])
{
    if (imm8[2] = 1)
        rounding_direction ← MXCSR.RC ; get round control from MXCSR
    else
        rounding_direction ← imm8[1:0] ; get round control from imm8[1:0]
    FI
    M ← imm8[7:4] ; get the scaling factor
    case (rounding_direction)
    00: TMP[63:0] ← round_to_nearest_even_integer(2^M * SRC[63:0])
    01: TMP[63:0] ← round_to_equal_or_smaller_integer(2^M * SRC[63:0])
    10: TMP[63:0] ← round_to_equal_or_larger_integer(2^M * SRC[63:0])
    11: TMP[63:0] ← round_to_nearest_smallest_magnitude_integer(2^M * SRC[63:0])
    ESAC
    Dest[63:0] ← 2^-M * TMP[63:0] ; scale back down by 2^-M
    if (imm8[3] = 0) Then ; check SPE
        if (SRC[63:0] != Dest[63:0]) Then ; check precision lost
            set_precision() ; set #PE
        FI;
    FI;
    return(Dest[63:0])
}
5-512 Vol. 2C VRNDSCALEPD—Round Packed Float64 Values To Include A Given Number Of Fraction Bits
INSTRUCTION SET REFERENCE, V-Z
VRNDSCALEPD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF *src is a memory operand*
    THEN TMP_SRC ← BROADCAST64(SRC, VL, k1)
    ELSE TMP_SRC ← SRC
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← RoundToIntegerDP(TMP_SRC[i+63:i], imm8[7:0])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI;
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0
Intel C/C++ Compiler Intrinsic Equivalent
VRNDSCALEPD __m512d _mm512_roundscale_pd( __m512d a, int imm);
VRNDSCALEPD __m512d _mm512_roundscale_round_pd( __m512d a, int imm, int sae);
VRNDSCALEPD __m512d _mm512_mask_roundscale_pd(__m512d s, __mmask8 k, __m512d a, int imm);
VRNDSCALEPD __m512d _mm512_mask_roundscale_round_pd(__m512d s, __mmask8 k, __m512d a, int imm, int sae);
VRNDSCALEPD __m512d _mm512_maskz_roundscale_pd( __mmask8 k, __m512d a, int imm);
VRNDSCALEPD __m512d _mm512_maskz_roundscale_round_pd( __mmask8 k, __m512d a, int imm, int sae);
VRNDSCALEPD __m256d _mm256_roundscale_pd( __m256d a, int imm);
VRNDSCALEPD __m256d _mm256_mask_roundscale_pd(__m256d s, __mmask8 k, __m256d a, int imm);
VRNDSCALEPD __m256d _mm256_maskz_roundscale_pd( __mmask8 k, __m256d a, int imm);
VRNDSCALEPD __m128d _mm_roundscale_pd( __m128d a, int imm);
VRNDSCALEPD __m128d _mm_mask_roundscale_pd(__m128d s, __mmask8 k, __m128d a, int imm);
VRNDSCALEPD __m128d _mm_maskz_roundscale_pd( __mmask8 k, __m128d a, int imm);
SIMD Floating-Point Exceptions
Invalid, Precision
If SPE is enabled, the precision exception is not reported (regardless of the MXCSR exception mask).
Other Exceptions
See Exceptions Type E2.
VRNDSCALEPD—Round Packed Float64 Values To Include A Given Number Of Fraction Bits Vol.
2C 5-513 INSTRUCTION SET REFERENCE, V-Z VRNDSCALESD—Round Scalar Float64 Value To Include A Given Number Of Fraction Bits Opcode/ Instruction Op / En EVEX.NDS.LIG.66.0F3A.W1 0B /r ib VRNDSCALESD xmm1 {k1}{z}, xmm2, xmm3/m64{sae}, imm8 T1S 64/32 bit Mode Support V/V CPUID Feature Flag AVX512F Description Rounds scalar double-precision floating-point value in xmm3/m64 to a number of fraction bits specified by the imm8 field. Stores the result in xmm1 register. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 T1S ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) Imm8 Description Rounds a double-precision floating-point value in the low quadword (see Figure 5-29) element the second source operand (the third operand) by the rounding mode specified in the immediate operand and places the result in the corresponding element of the destination operand (the third operand) according to the writemask. The quadword element at bits 127:64 of the destination is copied from the first source operand (the second operand). The destination and first source operands are XMM registers, the 2nd source operand can be an XMM register or memory location. Bits MAX_VL-1:128 of the destination register are cleared. The rounding process rounds the input to an integral value, plus number bits of fraction that are specified by imm8[7:4] (to be included in the result) and returns the result as a double-precision floating-point value. It should be noticed that no overflow is induced while executing this instruction (although the source is scaled by the imm8[7:4] value). The immediate operand also specifies control fields for the rounding operation, three bit fields are defined and shown in the “Immediate Control Description” figure below. Bit 3 of the immediate byte controls the processor behavior for a precision exception, bit 2 selects the source of rounding mode control. Bits 1:0 specify a non-sticky rounding-mode value (Immediate control table below lists the encoded values for rounding-mode field). The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an SNaN then it will be converted to a QNaN. If DAZ is set to ‘1 then denormals will be converted to zero before rounding. The sign of the result of this instruction is preserved, including the sign of zero. The formula of the operation for VRNDSCALESD is ROUND(x) = 2-M*Round_to_INT(x*2M, round_ctrl), round_ctrl = imm[3:0]; M=imm[7:4]; The operation of x*2M is computed as if the exponent range is unlimited (i.e. no overflow ever occurs). VRNDSCALESD is a more general form of the VEX-encoded VROUNDSD instruction. In VROUNDSD, the formula of the operation is ROUND(x) = Round_to_INT(x, round_ctrl), round_ctrl = imm[3:0]; EVEX encoded version: The source operand is a XMM register or a 64-bit memory location. The destination operand is a XMM register. Handling of special case of input values are listed in Table 5-22. 5-514 Vol. 
2C VRNDSCALESD—Round Scalar Float64 Value To Include A Given Number Of Fraction Bits INSTRUCTION SET REFERENCE, V-Z Operation RoundToIntegerDP(SRC[63:0], imm8[7:0]) { if (imm8[2] = 1) rounding_direction MXCSR:RC ; get round control from MXCSR else rounding_direction imm8[1:0] ; get round control from imm8[1:0] FI M imm8[7:4] ; get the scaling factor case (rounding_direction) 00: TMP[63:0] round_to_nearest_even_integer(2M*SRC[63:0]) 01: TMP[63:0] round_to_equal_or_smaller_integer(2M*SRC[63:0]) 10: TMP[63:0] round_to_equal_or_larger_integer(2M*SRC[63:0]) 11: TMP[63:0] round_to_nearest_smallest_magnitude_integer(2M*SRC[63:0]) ESAC Dest[63:0] 2-M* TMP[63:0] ; scale down back to 2-M if (imm8[3] = 0) Then ; check SPE if (SRC[63:0] != Dest[63:0]) Then ; check precision lost set_precision() ; set #PE FI; FI; return(Dest[63:0]) } VRNDSCALESD (EVEX encoded version) IF k1[0] or *no writemask* THEN DEST[63:0] RoundToIntegerDP(SRC2[63:0], Zero_upper_imm[7:0]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[63:0] remains unchanged* ELSE ; zeroing-masking THEN DEST[63:0] 0 FI; FI; DEST[127:64] SRC1[127:64] DEST[MAX_VL-1:128] 0 Intel C/C++ Compiler Intrinsic Equivalent VRNDSCALESD __m128d _mm_roundscale_sd ( __m128d a, __m128d b, int imm); VRNDSCALESD __m128d _mm_roundscale_round_sd ( __m128d a, __m128d b, int imm, int sae); VRNDSCALESD __m128d _mm_mask_roundscale_sd (__m128d s, __mmask8 k, __m128d a, __m128d b, int imm); VRNDSCALESD __m128d _mm_mask_roundscale_round_sd (__m128d s, __mmask8 k, __m128d a, __m128d b, int imm, int sae); VRNDSCALESD __m128d _mm_maskz_roundscale_sd ( __mmask8 k, __m128d a, __m128d b, int imm); VRNDSCALESD __m128d _mm_maskz_roundscale_round_sd ( __mmask8 k, __m128d a, __m128d b, int imm, int sae); SIMD Floating-Point Exceptions Invalid, Precision If SPE is enabled, precision exception is not reported (regardless of MXCSR exception mask). Other Exceptions See Exceptions Type E3. VRNDSCALESD—Round Scalar Float64 Value To Include A Given Number Of Fraction Bits Vol. 2C 5-515 INSTRUCTION SET REFERENCE, V-Z VRNDSCALEPS—Round Packed Float32 Values To Include A Given Number Of Fraction Bits Opcode/ Instruction Op / En FV 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512F EVEX.128.66.0F3A.W0 08 /r ib VRNDSCALEPS xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8 EVEX.256.66.0F3A.W0 08 /r ib VRNDSCALEPS ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8 FV V/V AVX512VL AVX512F EVEX.512.66.0F3A.W0 08 /r ib VRNDSCALEPS zmm1 {k1}{z}, zmm2/m512/m32bcst{sae}, imm8 FV V/V AVX512F Description Rounds packed single-precision floating point values in xmm2/m128/m32bcst to a number of fraction bits specified by the imm8 field. Stores the result in xmm1 register. Under writemask. Rounds packed single-precision floating point values in ymm2/m256/m32bcst to a number of fraction bits specified by the imm8 field. Stores the result in ymm1 register. Under writemask. Rounds packed single-precision floating-point values in zmm2/m512/m32bcst to a number of fraction bits specified by the imm8 field. Stores the result in zmm1 register using writemask. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 FV ModRM:reg (w) ModRM:r/m (r) Imm8 NA Description Round the single-precision floating-point values in the source operand by the rounding mode specified in the immediate operand (see Figure 5-29) and places the result in the destination operand. The destination operand (the first operand) is a ZMM register conditionally updated according to the writemask. 
The source operand (the second operand) can be a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 32-bit memory location. The rounding process rounds the input to an integral value, plus number bits of fraction that are specified by imm8[7:4] (to be included in the result) and returns the result as a single-precision floating-point value. It should be noticed that no overflow is induced while executing this instruction (although the source is scaled by the imm8[7:4] value). The immediate operand also specifies control fields for the rounding operation, three bit fields are defined and shown in the “Immediate Control Description” figure below. Bit 3 of the immediate byte controls the processor behavior for a precision exception, bit 2 selects the source of rounding mode control. Bits 1:0 specify a non-sticky rounding-mode value (Immediate control table below lists the encoded values for rounding-mode field). The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an SNaN then it will be converted to a QNaN. If DAZ is set to ‘1 then denormals will be converted to zero before rounding. The sign of the result of this instruction is preserved, including the sign of zero. The formula of the operation on each data element for VRNDSCALEPS is ROUND(x) = 2-M*Round_to_INT(x*2M, round_ctrl), round_ctrl = imm[3:0]; M=imm[7:4]; The operation of x*2M is computed as if the exponent range is unlimited (i.e. no overflow ever occurs). VRNDSCALEPS is a more general form of the VEX-encoded VROUNDPS instruction. In VROUNDPS, the formula of the operation on each element is ROUND(x) = Round_to_INT(x, round_ctrl), round_ctrl = imm[3:0]; 5-516 Vol. 2C VRNDSCALEPS—Round Packed Float32 Values To Include A Given Number Of Fraction Bits INSTRUCTION SET REFERENCE, V-Z Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD. Handling of special case of input values are listed in Table 5-22. Operation RoundToIntegerSP(SRC[31:0], imm8[7:0]) { if (imm8[2] = 1) rounding_direction MXCSR:RC ; get round control from MXCSR else rounding_direction imm8[1:0] ; get round control from imm8[1:0] FI M imm8[7:4] ; get the scaling factor case (rounding_direction) 00: TMP[31:0] round_to_nearest_even_integer(2M*SRC[31:0]) 01: TMP[31:0] round_to_equal_or_smaller_integer(2M*SRC[31:0]) 10: TMP[31:0] round_to_equal_or_larger_integer(2M*SRC[31:0]) 11: TMP[31:0] round_to_nearest_smallest_magnitude_integer(2M*SRC[31:0]) ESAC; Dest[31:0] 2-M* TMP[31:0] ; scale down back to 2-M if (imm8[3] = 0) Then ; check SPE if (SRC[31:0] != Dest[31:0]) Then ; check precision lost set_precision() ; set #PE FI; FI; return(Dest[31:0]) } VRNDSCALEPS (EVEX encoded versions) (KL, VL) = (4, 128), (8, 256), (16, 512) IF *src is a memory operand* THEN TMP_SRC BROADCAST32(SRC, VL, k1) ELSE TMP_SRC SRC FI; FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] RoundToIntegerSP(TMP_SRC[i+31:i]), imm8[7:0]) ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE ; zeroing-masking DEST[i+31:i] 0 FI; FI; ENDFOR; DEST[MAX_VL-1:VL] 0 VRNDSCALEPS—Round Packed Float32 Values To Include A Given Number Of Fraction Bits Vol. 
Intel C/C++ Compiler Intrinsic Equivalent
VRNDSCALEPS __m512 _mm512_roundscale_ps( __m512 a, int imm);
VRNDSCALEPS __m512 _mm512_roundscale_round_ps( __m512 a, int imm, int sae);
VRNDSCALEPS __m512 _mm512_mask_roundscale_ps(__m512 s, __mmask16 k, __m512 a, int imm);
VRNDSCALEPS __m512 _mm512_mask_roundscale_round_ps(__m512 s, __mmask16 k, __m512 a, int imm, int sae);
VRNDSCALEPS __m512 _mm512_maskz_roundscale_ps( __mmask16 k, __m512 a, int imm);
VRNDSCALEPS __m512 _mm512_maskz_roundscale_round_ps( __mmask16 k, __m512 a, int imm, int sae);
VRNDSCALEPS __m256 _mm256_roundscale_ps( __m256 a, int imm);
VRNDSCALEPS __m256 _mm256_mask_roundscale_ps(__m256 s, __mmask8 k, __m256 a, int imm);
VRNDSCALEPS __m256 _mm256_maskz_roundscale_ps( __mmask8 k, __m256 a, int imm);
VRNDSCALEPS __m128 _mm_roundscale_ps( __m128 a, int imm);
VRNDSCALEPS __m128 _mm_mask_roundscale_ps(__m128 s, __mmask8 k, __m128 a, int imm);
VRNDSCALEPS __m128 _mm_maskz_roundscale_ps( __mmask8 k, __m128 a, int imm);

SIMD Floating-Point Exceptions
Invalid, Precision
If SPE is enabled, the precision exception is not reported (regardless of the MXCSR exception mask).

Other Exceptions
See Exceptions Type E2.

VRNDSCALESS—Round Scalar Float32 Value To Include A Given Number Of Fraction Bits

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.LIG.66.0F3A.W0 0A /r ib VRNDSCALESS xmm1 {k1}{z}, xmm2, xmm3/m32{sae}, imm8 | T1S | V/V | AVX512F | Rounds scalar single-precision floating-point value in xmm3/m32 to a number of fraction bits specified by the imm8 field. Stores the result in xmm1 register under writemask.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
T1S | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description
Rounds the single-precision floating-point value in the low doubleword element of the second source operand (the third operand) by the rounding mode specified in the immediate operand (see Figure 5-29) and places the result in the corresponding element of the destination operand (the first operand) according to the writemask. The doubleword elements at bits 127:32 of the destination are copied from the first source operand (the second operand).
The destination and first source operands are XMM registers; the second source operand can be an XMM register or a memory location. Bits MAX_VL-1:128 of the destination register are cleared.
The rounding process rounds the input to an integral value, plus the number of fraction bits specified by imm8[7:4] (to be included in the result), and returns the result as a single-precision floating-point value. Note that no overflow is induced while executing this instruction (even though the source is scaled by the imm8[7:4] value).
The immediate operand also specifies control fields for the rounding operation; three bit fields are defined and shown in the “Immediate Control Description” figure below. Bit 3 of the immediate byte controls the processor behavior for a precision exception, bit 2 selects the source of rounding mode control, and bits 1:0 specify a non-sticky rounding-mode value (the immediate control table below lists the encoded values for the rounding-mode field).
The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an SNaN, it will be converted to a QNaN.
If DAZ is set to ‘1’, denormals will be converted to zero before rounding.
The sign of the result of this instruction is preserved, including the sign of zero.
The formula of the operation for VRNDSCALESS is
    ROUND(x) = 2^-M * Round_to_INT(x * 2^M, round_ctrl),
    round_ctrl = imm[3:0]; M = imm[7:4];
The operation of x * 2^M is computed as if the exponent range is unlimited (i.e., no overflow ever occurs).
VRNDSCALESS is a more general form of the VEX-encoded VROUNDSS instruction. In VROUNDSS, the formula of the operation on each element is
    ROUND(x) = Round_to_INT(x, round_ctrl), round_ctrl = imm[3:0];
EVEX encoded version: The source operand is an XMM register or a 32-bit memory location. The destination operand is an XMM register.
Handling of special-case input values is listed in Table 5-22.

Operation
RoundToIntegerSP(SRC[31:0], imm8[7:0]) {
    if (imm8[2] = 1)
        rounding_direction ← MXCSR:RC          ; get round control from MXCSR
    else
        rounding_direction ← imm8[1:0]         ; get round control from imm8[1:0]
    FI;
    M ← imm8[7:4]                              ; get the scaling factor
    case (rounding_direction)
        00: TMP[31:0] ← round_to_nearest_even_integer(2^M * SRC[31:0])
        01: TMP[31:0] ← round_to_equal_or_smaller_integer(2^M * SRC[31:0])
        10: TMP[31:0] ← round_to_equal_or_larger_integer(2^M * SRC[31:0])
        11: TMP[31:0] ← round_to_nearest_smallest_magnitude_integer(2^M * SRC[31:0])
    ESAC;
    Dest[31:0] ← 2^-M * TMP[31:0]              ; scale down back by 2^-M
    if (imm8[3] = 0) Then                      ; check SPE
        if (SRC[31:0] != Dest[31:0]) Then      ; check if precision has been lost
            set_precision()                    ; set #PE
        FI;
    FI;
    return(Dest[31:0])
}

VRNDSCALESS (EVEX encoded version)
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← RoundToIntegerSP(SRC2[31:0], Zero_upper_imm[7:0])
    ELSE
        IF *merging-masking*                   ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE                               ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAX_VL-1:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VRNDSCALESS __m128 _mm_roundscale_ss ( __m128 a, __m128 b, int imm);
VRNDSCALESS __m128 _mm_roundscale_round_ss ( __m128 a, __m128 b, int imm, int sae);
VRNDSCALESS __m128 _mm_mask_roundscale_ss (__m128 s, __mmask8 k, __m128 a, __m128 b, int imm);
VRNDSCALESS __m128 _mm_mask_roundscale_round_ss (__m128 s, __mmask8 k, __m128 a, __m128 b, int imm, int sae);
VRNDSCALESS __m128 _mm_maskz_roundscale_ss ( __mmask8 k, __m128 a, __m128 b, int imm);
VRNDSCALESS __m128 _mm_maskz_roundscale_round_ss ( __mmask8 k, __m128 a, __m128 b, int imm, int sae);

SIMD Floating-Point Exceptions
Invalid, Precision
If SPE is enabled, the precision exception is not reported (regardless of the MXCSR exception mask).

Other Exceptions
See Exceptions Type E3.

VRSQRT14PD—Compute Approximate Reciprocals of Square Roots of Packed Float64 Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W1 4E /r VRSQRT14PD xmm1 {k1}{z}, xmm2/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Computes the approximate reciprocal square roots of the packed double-precision floating-point values in xmm2/m128/m64bcst and stores the results in xmm1. Under writemask.
EVEX.256.66.0F38.W1 4E /r VRSQRT14PD ymm1 {k1}{z}, ymm2/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Computes the approximate reciprocal square roots of the packed double-precision floating-point values in ymm2/m256/m64bcst and stores the results in ymm1. Under writemask.
EVEX.512.66.0F38.W1 4E /r VRSQRT14PD zmm1 {k1}{z}, zmm2/m512/m64bcst | FV | V/V | AVX512F | Computes the approximate reciprocal square roots of the packed double-precision floating-point values in zmm2/m512/m64bcst and stores the results in zmm1 under writemask.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
This instruction performs a SIMD computation of the approximate reciprocals of the square roots of the eight packed double-precision floating-point values in the source operand (the second operand) and stores the packed double-precision floating-point results in the destination operand (the first operand) according to the writemask. The maximum relative error for this approximation is less than 2^-14.
EVEX.512 encoded version: The source operand can be a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 64-bit memory location. The destination operand is a ZMM register, conditionally updated using writemask k1.
EVEX.256 encoded version: The source operand is a YMM register, a 256-bit memory location, or a 256-bit vector broadcasted from a 64-bit memory location. The destination operand is a YMM register, conditionally updated using writemask k1.
EVEX.128 encoded version: The source operand is an XMM register, a 128-bit memory location, or a 128-bit vector broadcasted from a 64-bit memory location. The destination operand is an XMM register, conditionally updated using writemask k1.
The VRSQRT14PD instruction is not affected by the rounding control bits in the MXCSR register. When a source value is a 0.0, an ∞ with the sign of the source value is returned. When the source operand is +∞, +ZERO is returned. A denormal source value is treated as zero only if the DAZ bit is set in MXCSR; otherwise it is treated correctly and the approximation is performed with the specified masked response. When a source value is a negative value (other than 0.0), a floating-point QNaN_indefinite is returned. When a source value is an SNaN or QNaN, the SNaN is converted to a QNaN or the source QNaN is returned.
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
A numerically exact implementation of VRSQRT14xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.

Operation
VRSQRT14PD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC *is memory*)
                THEN DEST[i+63:i] ← APPROXIMATE(1.0/SQRT(SRC[63:0]))
                ELSE DEST[i+63:i] ← APPROXIMATE(1.0/SQRT(SRC[i+63:i]))
            FI;
        ELSE
            IF *merging-masking*               ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE                           ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI;
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0
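As a usage illustration (not part of the reference material), the 512-bit intrinsic form listed later in this entry can be compared against a scalar 1.0/sqrt(x) to observe the stated 2^-14 relative-error bound. The sketch assumes an AVX512F-capable processor and a compiler providing <immintrin.h>.

#include <immintrin.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    __m512d x = _mm512_set_pd(1.0, 2.0, 3.0, 4.0, 0.5, 10.0, 100.0, 7.0);
    __m512d r = _mm512_rsqrt14_pd(x);        /* VRSQRT14PD */
    double xs[8], rs[8];
    _mm512_storeu_pd(xs, x);
    _mm512_storeu_pd(rs, r);
    for (int i = 0; i < 8; i++) {
        double exact  = 1.0 / sqrt(xs[i]);
        double relerr = fabs(rs[i] - exact) / exact;
        printf("x=%g approx=%.17g relerr=%.3g (bound 2^-14=%.3g)\n",
               xs[i], rs[i], relerr, ldexp(1.0, -14));
    }
    return 0;
}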
Table 5-23. VRSQRT14PD Special Cases
Input value    | Result value    | Comments
Any denormal   | Normal          | Cannot generate overflow
X = 2^-2n      | 2^n             |
X < 0          | QNaN_Indefinite | Including -INF
X = -0         | -INF            |
X = +0         | +INF            |
X = +INF       | +0              |

Intel C/C++ Compiler Intrinsic Equivalent
VRSQRT14PD __m512d _mm512_rsqrt14_pd( __m512d a);
VRSQRT14PD __m512d _mm512_mask_rsqrt14_pd(__m512d s, __mmask8 k, __m512d a);
VRSQRT14PD __m512d _mm512_maskz_rsqrt14_pd( __mmask8 k, __m512d a);
VRSQRT14PD __m256d _mm256_rsqrt14_pd( __m256d a);
VRSQRT14PD __m256d _mm256_mask_rsqrt14_pd(__m256d s, __mmask8 k, __m256d a);
VRSQRT14PD __m256d _mm256_maskz_rsqrt14_pd( __mmask8 k, __m256d a);
VRSQRT14PD __m128d _mm_rsqrt14_pd( __m128d a);
VRSQRT14PD __m128d _mm_mask_rsqrt14_pd(__m128d s, __mmask8 k, __m128d a);
VRSQRT14PD __m128d _mm_maskz_rsqrt14_pd( __mmask8 k, __m128d a);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type E4.

VRSQRT14SD—Compute Approximate Reciprocal of Square Root of Scalar Float64 Value

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.LIG.66.0F38.W1 4F /r VRSQRT14SD xmm1 {k1}{z}, xmm2, xmm3/m64 | T1S | V/V | AVX512F | Computes the approximate reciprocal square root of the scalar double-precision floating-point value in xmm3/m64 and stores the result in the low quadword element of xmm1 using writemask k1. Bits[127:64] of xmm2 are copied to xmm1[127:64].

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
T1S | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description
Computes the approximate reciprocal of the square root of the scalar double-precision floating-point value in the low quadword element of the second source operand (the third operand) and stores the result in the low quadword element of the destination operand (the first operand) according to the writemask. The maximum relative error for this approximation is less than 2^-14.
The second source operand can be an XMM register or a 64-bit memory location. The destination operand is an XMM register. Bits (127:64) of the XMM register destination are copied from the corresponding bits of the first source operand. Bits (MAX_VL-1:128) of the destination register are zeroed.
The VRSQRT14SD instruction is not affected by the rounding control bits in the MXCSR register. When a source value is a 0.0, an ∞ with the sign of the source value is returned. When the source operand is +∞, +ZERO is returned. A denormal source value is treated as zero only if the DAZ bit is set in MXCSR; otherwise it is treated correctly and the approximation is performed with the specified masked response. When a source value is a negative value (other than 0.0), a floating-point QNaN_indefinite is returned. When a source value is an SNaN or QNaN, the SNaN is converted to a QNaN or the source QNaN is returned.
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.
A numerically exact implementation of VRSQRT14xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.
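The special-value rules just described (tabulated in Table 5-24 below) can be summarized by a minimal scalar C model. This sketch is illustrative only: it assumes DAZ is clear, ignores the writemask and the upper-element copy, and the helper name rsqrt14_model is hypothetical.

#include <math.h>

/* Illustrative scalar model of the VRSQRT14SD special cases:
   +/-0 -> INF with the source sign, +INF -> +0, negative -> QNaN,
   NaN inputs are quieted/propagated. */
static double rsqrt14_model(double x)
{
    if (isnan(x)) return x + x;                  /* SNaN quieted, QNaN returned */
    if (x == 0.0) return copysign(INFINITY, x);  /* -0 -> -INF, +0 -> +INF */
    if (x < 0.0)  return nan("");                /* QNaN (payload differs from QNaN_indefinite) */
    if (isinf(x)) return +0.0;                   /* +INF -> +0 */
    return 1.0 / sqrt(x);                        /* exact; hardware is within 2^-14 */
}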
Operation
VRSQRT14SD (EVEX version)
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← APPROXIMATE(1.0/SQRT(SRC2[63:0]))
    ELSE
        IF *merging-masking*                   ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE                               ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAX_VL-1:128] ← 0

Table 5-24. VRSQRT14SD Special Cases
Input value    | Result value    | Comments
Any denormal   | Normal          | Cannot generate overflow
X = 2^-2n      | 2^n             |
X < 0          | QNaN_Indefinite | Including -INF
X = -0         | -INF            |
X = +0         | +INF            |
X = +INF       | +0              |

Intel C/C++ Compiler Intrinsic Equivalent
VRSQRT14SD __m128d _mm_rsqrt14_sd( __m128d a, __m128d b);
VRSQRT14SD __m128d _mm_mask_rsqrt14_sd(__m128d s, __mmask8 k, __m128d a, __m128d b);
VRSQRT14SD __m128d _mm_maskz_rsqrt14_sd( __mmask8 k, __m128d a, __m128d b);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type E5.

VRSQRT14PS—Compute Approximate Reciprocals of Square Roots of Packed Float32 Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 4E /r VRSQRT14PS xmm1 {k1}{z}, xmm2/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Computes the approximate reciprocal square roots of the packed single-precision floating-point values in xmm2/m128/m32bcst and stores the results in xmm1. Under writemask.
EVEX.256.66.0F38.W0 4E /r VRSQRT14PS ymm1 {k1}{z}, ymm2/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Computes the approximate reciprocal square roots of the packed single-precision floating-point values in ymm2/m256/m32bcst and stores the results in ymm1. Under writemask.
EVEX.512.66.0F38.W0 4E /r VRSQRT14PS zmm1 {k1}{z}, zmm2/m512/m32bcst | FV | V/V | AVX512F | Computes the approximate reciprocal square roots of the packed single-precision floating-point values in zmm2/m512/m32bcst and stores the results in zmm1. Under writemask.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
This instruction performs a SIMD computation of the approximate reciprocals of the square roots of 16 packed single-precision floating-point values in the source operand (the second operand) and stores the packed single-precision floating-point results in the destination operand (the first operand) according to the writemask. The maximum relative error for this approximation is less than 2^-14.
EVEX.512 encoded version: The source operand can be a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM register, conditionally updated using writemask k1.
EVEX.256 encoded version: The source operand is a YMM register, a 256-bit memory location, or a 256-bit vector broadcasted from a 32-bit memory location. The destination operand is a YMM register, conditionally updated using writemask k1.
EVEX.128 encoded version: The source operand is an XMM register, a 128-bit memory location, or a 128-bit vector broadcasted from a 32-bit memory location. The destination operand is an XMM register, conditionally updated using writemask k1.
The VRSQRT14PS instruction is not affected by the rounding control bits in the MXCSR register. When a source value is a 0.0, an ∞ with the sign of the source value is returned. When the source operand is +∞, +ZERO is returned.
A denormal source value is treated as zero only if the DAZ bit is set in MXCSR; otherwise it is treated correctly and the approximation is performed with the specified masked response. When a source value is a negative value (other than 0.0), a floating-point QNaN_indefinite is returned. When a source value is an SNaN or QNaN, the SNaN is converted to a QNaN or the source QNaN is returned.
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
A numerically exact implementation of VRSQRT14xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.

Operation
VRSQRT14PS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC *is memory*)
                THEN DEST[i+31:i] ← APPROXIMATE(1.0/SQRT(SRC[31:0]))
                ELSE DEST[i+31:i] ← APPROXIMATE(1.0/SQRT(SRC[i+31:i]))
            FI;
        ELSE
            IF *merging-masking*               ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE                           ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI;
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0

Table 5-25. VRSQRT14PS Special Cases
Input value    | Result value    | Comments
Any denormal   | Normal          | Cannot generate overflow
X = 2^-2n      | 2^n             |
X < 0          | QNaN_Indefinite | Including -INF
X = -0         | -INF            |
X = +0         | +INF            |
X = +INF       | +0              |

Intel C/C++ Compiler Intrinsic Equivalent
VRSQRT14PS __m512 _mm512_rsqrt14_ps( __m512 a);
VRSQRT14PS __m512 _mm512_mask_rsqrt14_ps(__m512 s, __mmask16 k, __m512 a);
VRSQRT14PS __m512 _mm512_maskz_rsqrt14_ps( __mmask16 k, __m512 a);
VRSQRT14PS __m256 _mm256_rsqrt14_ps( __m256 a);
VRSQRT14PS __m256 _mm256_mask_rsqrt14_ps(__m256 s, __mmask8 k, __m256 a);
VRSQRT14PS __m256 _mm256_maskz_rsqrt14_ps( __mmask8 k, __m256 a);
VRSQRT14PS __m128 _mm_rsqrt14_ps( __m128 a);
VRSQRT14PS __m128 _mm_mask_rsqrt14_ps(__m128 s, __mmask8 k, __m128 a);
VRSQRT14PS __m128 _mm_maskz_rsqrt14_ps( __mmask8 k, __m128 a);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type E4.

VRSQRT14SS—Compute Approximate Reciprocal of Square Root of Scalar Float32 Value

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.LIG.66.0F38.W0 4F /r VRSQRT14SS xmm1 {k1}{z}, xmm2, xmm3/m32 | T1S | V/V | AVX512F | Computes the approximate reciprocal square root of the scalar single-precision floating-point value in xmm3/m32 and stores the result in the low doubleword element of xmm1 using writemask k1. Bits[127:32] of xmm2 are copied to xmm1[127:32].

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
T1S | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description
Computes the approximate reciprocal of the square root of the scalar single-precision floating-point value in the low doubleword element of the second source operand (the third operand) and stores the result in the low doubleword element of the destination operand (the first operand) according to the writemask. The maximum relative error for this approximation is less than 2^-14.
The second source operand can be an XMM register or a 32-bit memory location. The destination operand is an XMM register.
Bits (127:32) of the XMM register destination are copied from the corresponding bits of the first source operand. Bits (MAX_VL-1:128) of the destination register are zeroed.
The VRSQRT14SS instruction is not affected by the rounding control bits in the MXCSR register. When a source value is a 0.0, an ∞ with the sign of the source value is returned. When the source operand is +∞, +ZERO is returned. A denormal source value is treated as zero only if the DAZ bit is set in MXCSR; otherwise it is treated correctly and the approximation is performed with the specified masked response. When a source value is a negative value (other than 0.0), a floating-point QNaN_indefinite is returned. When a source value is an SNaN or QNaN, the SNaN is converted to a QNaN or the source QNaN is returned.
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.
A numerically exact implementation of VRSQRT14xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.

Operation
VRSQRT14SS (EVEX version)
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← APPROXIMATE(1.0/SQRT(SRC2[31:0]))
    ELSE
        IF *merging-masking*                   ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE                               ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAX_VL-1:128] ← 0

Table 5-26. VRSQRT14SS Special Cases
Input value    | Result value    | Comments
Any denormal   | Normal          | Cannot generate overflow
X = 2^-2n      | 2^n             |
X < 0          | QNaN_Indefinite | Including -INF
X = -0         | -INF            |
X = +0         | +INF            |
X = +INF       | +0              |

Intel C/C++ Compiler Intrinsic Equivalent
VRSQRT14SS __m128 _mm_rsqrt14_ss( __m128 a, __m128 b);
VRSQRT14SS __m128 _mm_mask_rsqrt14_ss(__m128 s, __mmask8 k, __m128 a, __m128 b);
VRSQRT14SS __m128 _mm_maskz_rsqrt14_ss( __mmask8 k, __m128 a, __m128 b);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type E5.

VRSQRT28PD—Approximation to the Reciprocal Square Root of Packed Double-Precision Floating-Point Values with Less Than 2^-28 Relative Error

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.512.66.0F38.W1 CC /r VRSQRT28PD zmm1 {k1}{z}, zmm2/m512/m64bcst {sae} | FV | V/V | AVX512ER | Computes approximations to the reciprocal square root (<2^-28 relative error) of the packed double-precision floating-point values from zmm2/m512/m64bcst and stores the result in zmm1 with writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
Computes the reciprocal square root of the float64 values in the source operand (the second operand) and stores the results to the destination operand (the first operand). The approximate reciprocal is evaluated with less than 2^-28 of maximum relative error.
If any source element is NaN, the quietized NaN source value is returned for that element. Negative (non-zero) source numbers, as well as -∞, return the canonical NaN and set the Invalid Flag (#I). A value of -0 must return -∞ and set the DivByZero flag (#Z). Negative numbers should return NaN and set the Invalid flag (#I); note, however, that the instruction flushes input denormals to zero of the same sign, so negative denormals return -∞ and set the DivByZero flag.
The source operand is a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 64-bit memory location. The destination operand is a ZMM register, conditionally updated using writemask k1.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
A numerically exact implementation of VRSQRT28xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.

Operation
VRSQRT28PD (EVEX encoded versions)
(KL, VL) = (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC *is memory*)
                THEN DEST[i+63:i] ← (1.0/SQRT(SRC[63:0]))
                ELSE DEST[i+63:i] ← (1.0/SQRT(SRC[i+63:i]))
            FI;
        ELSE
            IF *merging-masking*               ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE                           ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI;
    FI;
ENDFOR;

Table 5-27. VRSQRT28PD Special Cases
Input value                  | Result value    | Comments
NaN                          | QNaN(input)     | If (SRC = SNaN) then #I
X = 2^-2n                    | 2^n             |
X < 0                        | QNaN_Indefinite | Including -INF
X = -0 or negative denormal  | -INF            | #Z
X = +0 or positive denormal  | +INF            | #Z
X = +INF                     | +0              |

Intel C/C++ Compiler Intrinsic Equivalent
VRSQRT28PD __m512d _mm512_rsqrt28_round_pd(__m512d a, int sae);
VRSQRT28PD __m512d _mm512_mask_rsqrt28_round_pd(__m512d s, __mmask8 m, __m512d a, int sae);
VRSQRT28PD __m512d _mm512_maskz_rsqrt28_round_pd(__mmask8 m, __m512d a, int sae);

SIMD Floating-Point Exceptions
Invalid (if SNaN input), Divide-by-zero

Other Exceptions
See Exceptions Type E2.

VRSQRT28SD—Approximation to the Reciprocal Square Root of Scalar Double-Precision Floating-Point Value with Less Than 2^-28 Relative Error

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.LIG.66.0F38.W1 CD /r VRSQRT28SD xmm1 {k1}{z}, xmm2, xmm3/m64 {sae} | T1S | V/V | AVX512ER | Computes the approximate reciprocal square root (<2^-28 relative error) of the scalar double-precision floating-point value from xmm3/m64 and stores the result in xmm1 with writemask k1. Also, the upper double-precision floating-point value (bits[127:64]) from xmm2 is copied to xmm1[127:64].

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
T1S | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description
Computes the reciprocal square root of the low float64 value in the second source operand (the third operand) and stores the result to the destination operand (the first operand). The approximate reciprocal square root is evaluated with less than 2^-28 of maximum relative error. The result is written into the low float64 element of xmm1 according to the writemask k1. Bits 127:64 of the destination are copied from the corresponding bits of the first source operand (the second operand).
If any source element is NaN, the quietized NaN source value is returned for that element. Negative (non-zero) source numbers, as well as -∞, return the canonical NaN and set the Invalid Flag (#I). A value of -0 must return -∞ and set the DivByZero flag (#Z). Negative numbers should return NaN and set the Invalid flag (#I); note, however, that the instruction flushes input denormals to zero of the same sign, so negative denormals return -∞ and set the DivByZero flag.
The first source operand is an XMM register. The second source operand is an XMM register or a 64-bit memory location. The destination operand is an XMM register.
A numerically exact implementation of VRSQRT28xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.

Operation
VRSQRT28SD (EVEX encoded version)
IF k1[0] OR *no writemask*
    THEN DEST[63:0] ← (1.0/SQRT(SRC[63:0]))
    ELSE
        IF *merging-masking*                   ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE                               ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAX_VL-1:128] ← 0

Table 5-28. VRSQRT28SD Special Cases
Input value                  | Result value    | Comments
NaN                          | QNaN(input)     | If (SRC = SNaN) then #I
X = 2^-2n                    | 2^n             |
X < 0                        | QNaN_Indefinite | Including -INF
X = -0 or negative denormal  | -INF            | #Z
X = +0 or positive denormal  | +INF            | #Z
X = +INF                     | +0              |

Intel C/C++ Compiler Intrinsic Equivalent
VRSQRT28SD __m128d _mm_rsqrt28_round_sd(__m128d a, __m128d b, int sae);
VRSQRT28SD __m128d _mm_mask_rsqrt28_round_sd(__m128d s, __mmask8 m, __m128d a, __m128d b, int sae);
VRSQRT28SD __m128d _mm_maskz_rsqrt28_round_sd( __mmask8 m, __m128d a, __m128d b, int sae);

SIMD Floating-Point Exceptions
Invalid (if SNaN input), Divide-by-zero

Other Exceptions
See Exceptions Type E3.

VRSQRT28PS—Approximation to the Reciprocal Square Root of Packed Single-Precision Floating-Point Values with Less Than 2^-28 Relative Error

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.512.66.0F38.W0 CC /r VRSQRT28PS zmm1 {k1}{z}, zmm2/m512/m32bcst {sae} | FV | V/V | AVX512ER | Computes approximations to the reciprocal square root (<2^-28 relative error) of the packed single-precision floating-point values from zmm2/m512/m32bcst and stores the result in zmm1 with writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description
Computes the reciprocal square root of the float32 values in the source operand (the second operand) and stores the results to the destination operand (the first operand). The approximate reciprocal is evaluated with less than 2^-28 of maximum relative error prior to final rounding. The final result is rounded to < 2^-23 relative error before being written to the destination.
If any source element is NaN, the quietized NaN source value is returned for that element. Negative (non-zero) source numbers, as well as -∞, return the canonical NaN and set the Invalid Flag (#I). A value of -0 must return -∞ and set the DivByZero flag (#Z). Negative numbers should return NaN and set the Invalid flag (#I); note, however, that the instruction flushes input denormals to zero of the same sign, so negative denormals return -∞ and set the DivByZero flag.
The source operand is a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM register, conditionally updated using writemask k1.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
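As a usage note (an assumption, not text from this entry): the VRSQRT28PS intrinsic listed later in this entry takes an sae argument, and passing _MM_FROUND_NO_EXC corresponds to the {sae} form. AVX512ER is implemented only on Intel Xeon Phi x200 (Knights Landing) class processors, so this sketch is target dependent.

#include <immintrin.h>

/* Illustrative wrapper: approximate 1/sqrt per element with < 2^-28
   relative error before final rounding. Requires AVX512ER. */
__m512 approx_rsqrt28_ps(__m512 x)
{
    return _mm512_rsqrt28_round_ps(x, _MM_FROUND_NO_EXC); /* {sae} */
}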
A numerically exact implementation of VRSQRT28xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.

Operation
VRSQRT28PS (EVEX encoded versions)
(KL, VL) = (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC *is memory*)
                THEN DEST[i+31:i] ← (1.0/SQRT(SRC[31:0]))
                ELSE DEST[i+31:i] ← (1.0/SQRT(SRC[i+31:i]))
            FI;
        ELSE
            IF *merging-masking*               ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE                           ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI;
    FI;
ENDFOR;

Table 5-29. VRSQRT28PS Special Cases
Input value                  | Result value    | Comments
NaN                          | QNaN(input)     | If (SRC = SNaN) then #I
X = 2^-2n                    | 2^n             |
X < 0                        | QNaN_Indefinite | Including -INF
X = -0 or negative denormal  | -INF            | #Z
X = +0 or positive denormal  | +INF            | #Z
X = +INF                     | +0              |

Intel C/C++ Compiler Intrinsic Equivalent
VRSQRT28PS __m512 _mm512_rsqrt28_round_ps(__m512 a, int sae);
VRSQRT28PS __m512 _mm512_mask_rsqrt28_round_ps(__m512 s, __mmask16 m, __m512 a, int sae);
VRSQRT28PS __m512 _mm512_maskz_rsqrt28_round_ps(__mmask16 m, __m512 a, int sae);

SIMD Floating-Point Exceptions
Invalid (if SNaN input), Divide-by-zero

Other Exceptions
See Exceptions Type E2.

VRSQRT28SS—Approximation to the Reciprocal Square Root of Scalar Single-Precision Floating-Point Value with Less Than 2^-28 Relative Error

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.LIG.66.0F38.W0 CD /r VRSQRT28SS xmm1 {k1}{z}, xmm2, xmm3/m32 {sae} | T1S | V/V | AVX512ER | Computes the approximate reciprocal square root (<2^-28 relative error) of the scalar single-precision floating-point value from xmm3/m32 and stores the result in xmm1 with writemask k1. Also, the upper 3 single-precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32].

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
T1S | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description
Computes the reciprocal square root of the low float32 value in the second source operand (the third operand) and stores the result to the destination operand (the first operand). The approximate reciprocal square root is evaluated with less than 2^-28 of maximum relative error prior to final rounding. The final result is rounded to < 2^-23 relative error before being written to the low float32 element of the destination according to the writemask k1. Bits 127:32 of the destination are copied from the corresponding bits of the first source operand (the second operand).
If any source element is NaN, the quietized NaN source value is returned for that element. Negative (non-zero) source numbers, as well as -∞, return the canonical NaN and set the Invalid Flag (#I). A value of -0 must return -∞ and set the DivByZero flag (#Z). Negative numbers should return NaN and set the Invalid flag (#I); note, however, that the instruction flushes input denormals to zero of the same sign, so negative denormals return -∞ and set the DivByZero flag.
The first source operand is an XMM register. The second source operand is an XMM register or a 32-bit memory location. The destination operand is an XMM register.
A numerically exact implementation of VRSQRT28xx can be found at https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.

Operation
VRSQRT28SS (EVEX encoded version)
IF k1[0] OR *no writemask*
    THEN DEST[31:0] ← (1.0/SQRT(SRC[31:0]))
    ELSE
        IF *merging-masking*                   ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE                               ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAX_VL-1:128] ← 0

Table 5-30. VRSQRT28SS Special Cases
Input value                  | Result value    | Comments
NaN                          | QNaN(input)     | If (SRC = SNaN) then #I
X = 2^-2n                    | 2^n             |
X < 0                        | QNaN_Indefinite | Including -INF
X = -0 or negative denormal  | -INF            | #Z
X = +0 or positive denormal  | +INF            | #Z
X = +INF                     | +0              |

Intel C/C++ Compiler Intrinsic Equivalent
VRSQRT28SS __m128 _mm_rsqrt28_round_ss(__m128 a, __m128 b, int sae);
VRSQRT28SS __m128 _mm_mask_rsqrt28_round_ss(__m128 s, __mmask8 m, __m128 a, __m128 b, int sae);
VRSQRT28SS __m128 _mm_maskz_rsqrt28_round_ss(__mmask8 m, __m128 a, __m128 b, int sae);

SIMD Floating-Point Exceptions
Invalid (if SNaN input), Divide-by-zero

Other Exceptions
See Exceptions Type E3.

VSCALEFPD—Scale Packed Float64 Values With Float64 Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.66.0F38.W1 2C /r VSCALEFPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Scale the packed double-precision floating-point values in xmm2 using values from xmm3/m128/m64bcst. Under writemask k1.
EVEX.NDS.256.66.0F38.W1 2C /r VSCALEFPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Scale the packed double-precision floating-point values in ymm2 using values from ymm3/m256/m64bcst. Under writemask k1.
EVEX.NDS.512.66.0F38.W1 2C /r VSCALEFPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} | FV | V/V | AVX512F | Scale the packed double-precision floating-point values in zmm2 using values from zmm3/m512/m64bcst. Under writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description
Performs a floating-point scale of the packed double-precision floating-point values in the first source operand by multiplying them by 2 to the power of the double-precision floating-point values in the second source operand. The equation of this operation is given by:
    zmm1 := zmm2 * 2^floor(zmm3).
Floor(zmm3) means the maximum integer value ≤ zmm3.
If the result cannot be represented in double precision, then the proper overflow response (for positive scaling operand), or the proper underflow response (for negative scaling operand), is issued. The overflow and underflow responses are dependent on the rounding mode (for IEEE-compliant rounding), as well as on other settings in MXCSR (exception mask bits, FTZ bit), and on the SAE bit.
The first source operand is a ZMM/YMM/XMM register. The second source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
Handling of special-case input values is listed in Table 5-31 and Table 5-32.
Table 5-31. VSCALEFPD/SD/PS/SS Special Cases
Src1 \ Src2   | ±NaN       | +Inf             | -Inf             | 0/Denorm/Norm  | Set IE
±QNaN         | QNaN(Src1) | +INF             | +0               | QNaN(Src1)     | IF either source is SNaN
±SNaN         | QNaN(Src1) | QNaN(Src1)       | QNaN(Src1)       | QNaN(Src1)     | YES
±Inf          | QNaN(Src2) | Src1             | QNaN_Indefinite  | Src1           | IF Src2 is SNaN or -INF
±0            | QNaN(Src2) | QNaN_Indefinite  | Src1             | Src1           | IF Src2 is SNaN or +INF
Denorm/Norm   | QNaN(Src2) | ±INF (Src1 sign) | ±0 (Src1 sign)   | Compute Result | IF Src2 is SNaN

Table 5-32. Additional VSCALEFPD/SD Special Cases
Special Case        | Returned value                              | Faults
|result| < 2^-1074  | ±0 or ±Min-Denormal (Src1 sign)             | Underflow
|result| ≥ 2^1024   | ±INF (Src1 sign) or ±Max-normal (Src1 sign) | Overflow

Operation
SCALE(SRC1, SRC2)
{
    TMP_SRC2 ← SRC2
    TMP_SRC1 ← SRC1
    IF (SRC2 is denormal AND MXCSR.DAZ) THEN TMP_SRC2 ← 0
    IF (SRC1 is denormal AND MXCSR.DAZ) THEN TMP_SRC1 ← 0
    /* SRC2 is a 64-bit floating-point value */
    DEST[63:0] ← TMP_SRC1[63:0] * POW(2, Floor(TMP_SRC2[63:0]))
}

VSCALEFPD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1) AND (SRC2 *is register*)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[i+63:i] ← SCALE(SRC1[i+63:i], SRC2[63:0])
                ELSE DEST[i+63:i] ← SCALE(SRC1[i+63:i], SRC2[i+63:i])
            FI;
        ELSE
            IF *merging-masking*               ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE                           ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI;
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VSCALEFPD __m512d _mm512_scalef_round_pd(__m512d a, __m512d b, int);
VSCALEFPD __m512d _mm512_mask_scalef_round_pd(__m512d s, __mmask8 k, __m512d a, __m512d b, int);
VSCALEFPD __m512d _mm512_maskz_scalef_round_pd(__mmask8 k, __m512d a, __m512d b, int);
VSCALEFPD __m256d _mm256_scalef_round_pd(__m256d a, __m256d b, int);
VSCALEFPD __m256d _mm256_mask_scalef_round_pd(__m256d s, __mmask8 k, __m256d a, __m256d b, int);
VSCALEFPD __m256d _mm256_maskz_scalef_round_pd(__mmask8 k, __m256d a, __m256d b, int);
VSCALEFPD __m128d _mm_scalef_round_pd(__m128d a, __m128d b, int);
VSCALEFPD __m128d _mm_mask_scalef_round_pd(__m128d s, __mmask8 k, __m128d a, __m128d b, int);
VSCALEFPD __m128d _mm_maskz_scalef_round_pd(__mmask8 k, __m128d a, __m128d b, int);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal (for Src1). Denormal is not reported for Src2.

Other Exceptions
See Exceptions Type E2.

VSCALEFSD—Scale Scalar Float64 Values With Float64 Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.LIG.66.0F38.W1 2D /r VSCALEFSD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | T1S | V/V | AVX512F | Scale the scalar double-precision floating-point value in xmm2 using the value from xmm3/m64. Under writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
T1S | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description
Performs a floating-point scale of the scalar double-precision floating-point value in the first source operand by multiplying it by 2 to the power of the double-precision floating-point value in the second source operand. The equation of this operation is given by:
    xmm1 := xmm2 * 2^floor(xmm3).
Floor(xmm3) means the maximum integer value ≤ xmm3.
If the result cannot be represented in double precision, then the proper overflow response (for positive scaling operand), or the proper underflow response (for negative scaling operand), is issued. The overflow and underflow responses are dependent on the rounding mode (for IEEE-compliant rounding), as well as on other settings in MXCSR (exception mask bits, FTZ bit), and on the SAE bit.
EVEX encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a memory location. The destination operand is an XMM register conditionally updated with writemask k1.
Handling of special-case input values is listed in Table 5-31 and Table 5-32.

Operation
SCALE(SRC1, SRC2)
{
    ; Check for denormal operands
    TMP_SRC2 ← SRC2
    TMP_SRC1 ← SRC1
    IF (SRC2 is denormal AND MXCSR.DAZ) THEN TMP_SRC2 ← 0
    IF (SRC1 is denormal AND MXCSR.DAZ) THEN TMP_SRC1 ← 0
    /* SRC2 is a 64-bit floating-point value */
    DEST[63:0] ← TMP_SRC1[63:0] * POW(2, Floor(TMP_SRC2[63:0]))
}

VSCALEFSD (EVEX encoded version)
IF (EVEX.b = 1) AND (SRC2 *is a register*)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] OR *no writemask*
    THEN DEST[63:0] ← SCALE(SRC1[63:0], SRC2[63:0])
    ELSE
        IF *merging-masking*                   ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE                               ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAX_VL-1:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VSCALEFSD __m128d _mm_scalef_round_sd(__m128d a, __m128d b, int);
VSCALEFSD __m128d _mm_mask_scalef_round_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, int);
VSCALEFSD __m128d _mm_maskz_scalef_round_sd(__mmask8 k, __m128d a, __m128d b, int);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal (for Src1). Denormal is not reported for Src2.

Other Exceptions
See Exceptions Type E3.

VSCALEFPS—Scale Packed Float32 Values With Float32 Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.128.66.0F38.W0 2C /r VSCALEFPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Scale the packed single-precision floating-point values in xmm2 using values from xmm3/m128/m32bcst. Under writemask k1.
EVEX.NDS.256.66.0F38.W0 2C /r VSCALEFPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Scale the packed single-precision floating-point values in ymm2 using values from ymm3/m256/m32bcst. Under writemask k1.
EVEX.NDS.512.66.0F38.W0 2C /r VSCALEFPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | FV | V/V | AVX512F | Scale the packed single-precision floating-point values in zmm2 using floating-point values from zmm3/m512/m32bcst. Under writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description
Performs a floating-point scale of the packed single-precision floating-point values in the first source operand by multiplying them by 2 to the power of the float32 values in the second source operand. The equation of this operation is given by:
    zmm1 := zmm2 * 2^floor(zmm3).
Floor(zmm3) means the maximum integer value ≤ zmm3.
If the result cannot be represented in single precision, then the proper overflow response (for positive scaling operand), or the proper underflow response (for negative scaling operand), is issued.
The overflow and underflow responses are dependent on the rounding mode (for IEEE-compliant rounding), as well as on other settings in MXCSR (exception mask bits, FTZ bit), and on the SAE bit.
EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM register conditionally updated with writemask k1.
EVEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register, a 256-bit memory location, or a 256-bit vector broadcasted from a 32-bit memory location. The destination operand is a YMM register, conditionally updated using writemask k1.
EVEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register, a 128-bit memory location, or a 128-bit vector broadcasted from a 32-bit memory location. The destination operand is an XMM register, conditionally updated using writemask k1.
Handling of special-case input values is listed in Table 5-31 and Table 5-33.

Table 5-33. Additional VSCALEFPS/SS Special Cases
Special Case       | Returned value                              | Faults
|result| < 2^-149  | ±0 or ±Min-Denormal (Src1 sign)             | Underflow
|result| ≥ 2^128   | ±INF (Src1 sign) or ±Max-normal (Src1 sign) | Overflow

Operation
SCALE(SRC1, SRC2)
{
    ; Check for denormal operands
    TMP_SRC2 ← SRC2
    TMP_SRC1 ← SRC1
    IF (SRC2 is denormal AND MXCSR.DAZ) THEN TMP_SRC2 ← 0
    IF (SRC1 is denormal AND MXCSR.DAZ) THEN TMP_SRC1 ← 0
    /* SRC2 is a 32-bit floating-point value */
    DEST[31:0] ← TMP_SRC1[31:0] * POW(2, Floor(TMP_SRC2[31:0]))
}

VSCALEFPS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1) AND (SRC2 *is register*)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[i+31:i] ← SCALE(SRC1[i+31:i], SRC2[31:0])
                ELSE DEST[i+31:i] ← SCALE(SRC1[i+31:i], SRC2[i+31:i])
            FI;
        ELSE
            IF *merging-masking*               ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE                           ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI;
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VSCALEFPS __m512 _mm512_scalef_round_ps(__m512 a, __m512 b, int);
VSCALEFPS __m512 _mm512_mask_scalef_round_ps(__m512 s, __mmask16 k, __m512 a, __m512 b, int);
VSCALEFPS __m512 _mm512_maskz_scalef_round_ps(__mmask16 k, __m512 a, __m512 b, int);
VSCALEFPS __m256 _mm256_scalef_round_ps(__m256 a, __m256 b, int);
VSCALEFPS __m256 _mm256_mask_scalef_round_ps(__m256 s, __mmask8 k, __m256 a, __m256 b, int);
VSCALEFPS __m256 _mm256_maskz_scalef_round_ps(__mmask8 k, __m256 a, __m256 b, int);
VSCALEFPS __m128 _mm_scalef_round_ps(__m128 a, __m128 b, int);
VSCALEFPS __m128 _mm_mask_scalef_round_ps(__m128 s, __mmask8 k, __m128 a, __m128 b, int);
VSCALEFPS __m128 _mm_maskz_scalef_round_ps(__mmask8 k, __m128 a, __m128 b, int);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal (for Src1). Denormal is not reported for Src2.

Other Exceptions
See Exceptions Type E2.
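The SCALE() helper shared by the VSCALEF entries above can be modeled in scalar C. This is a minimal sketch, assuming default MXCSR settings: it omits the DAZ handling and the NaN/∞ cases of Table 5-31, and the helper name scalef_model is hypothetical.

#include <math.h>

/* Illustrative model of SCALE(): dest = src1 * 2^floor(src2).
   pow(2.0, e) yields the IEEE overflow/underflow responses the text
   describes when the product is not representable. */
static double scalef_model(double src1, double src2)
{
    double e = floor(src2);        /* maximum integer value <= src2 */
    return src1 * pow(2.0, e);
}
/* Example: scalef_model(3.0, 1.5) = 3.0 * 2^1 = 6.0. */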
VSCALEFSS—Scale Scalar Float32 Value With Float32 Value

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.NDS.LIG.66.0F38.W0 2D /r VSCALEFSS xmm1 {k1}{z}, xmm2, xmm3/m32{er} | T1S | V/V | AVX512F | Scale the scalar single-precision floating-point value in xmm2 using the floating-point value from xmm3/m32. Under writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
T1S | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

Description
Performs a floating-point scale of the scalar single-precision floating-point value in the first source operand by multiplying it by 2 to the power of the float32 value in the second source operand. The equation of this operation is given by:
    xmm1 := xmm2 * 2^floor(xmm3).
Floor(xmm3) means the maximum integer value ≤ xmm3.
If the result cannot be represented in single precision, then the proper overflow response (for positive scaling operand), or the proper underflow response (for negative scaling operand), is issued. The overflow and underflow responses are dependent on the rounding mode (for IEEE-compliant rounding), as well as on other settings in MXCSR (exception mask bits, FTZ bit), and on the SAE bit.
EVEX encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a memory location. The destination operand is an XMM register conditionally updated with writemask k1.
Handling of special-case input values is listed in Table 5-31 and Table 5-33.

Operation
SCALE(SRC1, SRC2)
{
    ; Check for denormal operands
    TMP_SRC2 ← SRC2
    TMP_SRC1 ← SRC1
    IF (SRC2 is denormal AND MXCSR.DAZ) THEN TMP_SRC2 ← 0
    IF (SRC1 is denormal AND MXCSR.DAZ) THEN TMP_SRC1 ← 0
    /* SRC2 is a 32-bit floating-point value */
    DEST[31:0] ← TMP_SRC1[31:0] * POW(2, Floor(TMP_SRC2[31:0]))
}

VSCALEFSS (EVEX encoded version)
IF (EVEX.b = 1) AND (SRC2 *is a register*)
    THEN SET_RM(EVEX.RC);
    ELSE SET_RM(MXCSR.RM);
FI;
IF k1[0] OR *no writemask*
    THEN DEST[31:0] ← SCALE(SRC1[31:0], SRC2[31:0])
    ELSE
        IF *merging-masking*                   ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE                               ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAX_VL-1:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VSCALEFSS __m128 _mm_scalef_round_ss(__m128 a, __m128 b, int);
VSCALEFSS __m128 _mm_mask_scalef_round_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, int);
VSCALEFSS __m128 _mm_maskz_scalef_round_ss(__mmask8 k, __m128 a, __m128 b, int);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal (for Src1). Denormal is not reported for Src2.

Other Exceptions
See Exceptions Type E3.
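As a packed usage illustration of the preceding VSCALEF entries (an assumption, not from the manual's text): _MM_FROUND_CUR_DIRECTION selects the rounding mode currently in MXCSR, matching the non-{er} behavior. Assumes AVX512F and <immintrin.h>.

#include <immintrin.h>

/* Illustrative: multiply each element of a by 2^floor(b[i]), e.g., to
   apply per-element power-of-two scale factors without computing pow(). */
__m512d apply_pow2_scale(__m512d a, __m512d b)
{
    return _mm512_scalef_round_pd(a, b, _MM_FROUND_CUR_DIRECTION);
}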
VSCATTERDPS/VSCATTERDPD/VSCATTERQPS/VSCATTERQPD—Scatter Packed Single, Packed Double with Signed Dword and Qword Indices

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 A2 /vsib VSCATTERDPS vm32x {k1}, xmm1 | T1S | V/V | AVX512VL AVX512F | Using signed dword indices, scatter single-precision floating-point values to memory using writemask k1.
EVEX.256.66.0F38.W0 A2 /vsib VSCATTERDPS vm32y {k1}, ymm1 | T1S | V/V | AVX512VL AVX512F | Using signed dword indices, scatter single-precision floating-point values to memory using writemask k1.
EVEX.512.66.0F38.W0 A2 /vsib VSCATTERDPS vm32z {k1}, zmm1 | T1S | V/V | AVX512F | Using signed dword indices, scatter single-precision floating-point values to memory using writemask k1.
EVEX.128.66.0F38.W1 A2 /vsib VSCATTERDPD vm32x {k1}, xmm1 | T1S | V/V | AVX512VL AVX512F | Using signed dword indices, scatter double-precision floating-point values to memory using writemask k1.
EVEX.256.66.0F38.W1 A2 /vsib VSCATTERDPD vm32x {k1}, ymm1 | T1S | V/V | AVX512VL AVX512F | Using signed dword indices, scatter double-precision floating-point values to memory using writemask k1.
EVEX.512.66.0F38.W1 A2 /vsib VSCATTERDPD vm32y {k1}, zmm1 | T1S | V/V | AVX512F | Using signed dword indices, scatter double-precision floating-point values to memory using writemask k1.
EVEX.128.66.0F38.W0 A3 /vsib VSCATTERQPS vm64x {k1}, xmm1 | T1S | V/V | AVX512VL AVX512F | Using signed qword indices, scatter single-precision floating-point values to memory using writemask k1.
EVEX.256.66.0F38.W0 A3 /vsib VSCATTERQPS vm64y {k1}, xmm1 | T1S | V/V | AVX512VL AVX512F | Using signed qword indices, scatter single-precision floating-point values to memory using writemask k1.
EVEX.512.66.0F38.W0 A3 /vsib VSCATTERQPS vm64z {k1}, ymm1 | T1S | V/V | AVX512F | Using signed qword indices, scatter single-precision floating-point values to memory using writemask k1.
EVEX.128.66.0F38.W1 A3 /vsib VSCATTERQPD vm64x {k1}, xmm1 | T1S | V/V | AVX512VL AVX512F | Using signed qword indices, scatter double-precision floating-point values to memory using writemask k1.
EVEX.256.66.0F38.W1 A3 /vsib VSCATTERQPD vm64y {k1}, ymm1 | T1S | V/V | AVX512VL AVX512F | Using signed qword indices, scatter double-precision floating-point values to memory using writemask k1.
EVEX.512.66.0F38.W1 A3 /vsib VSCATTERQPD vm64z {k1}, zmm1 | T1S | V/V | AVX512F | Using signed qword indices, scatter double-precision floating-point values to memory using writemask k1.

Instruction Operand Encoding
Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
T1S | BaseReg (R): VSIB:base, VectorReg(R): VSIB:index | ModRM:reg (r) | NA | NA

Description
Stores up to 16 elements (8 elements for qword indices) from the doubleword/quadword vector in the source register to the memory locations pointed to by base address BASE_ADDR and index vector VINDEX, with scale SCALE. The elements are specified via the VSIB (i.e., the index register is a vector register, holding packed indices). Elements will only be stored if their corresponding mask bit is one. The entire mask register will be set to zero by this instruction unless it triggers an exception.
This instruction can be suspended by an exception if at least one element is already scattered (i.e., if the exception is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination (memory) and the mask register (k1) are partially updated.
If any traps or interrupts are pending from already scattered elements, they will be delivered in lieu of the exception; in this case, EFLAGS.RF is set to one so an instruction breakpoint is not re-triggered when the instruction is continued.
Note that:
• Only writes to overlapping vector indices are guaranteed to be ordered with respect to each other (from LSB to MSB of the source registers). Note that this also includes partially overlapping vector indices. Writes that are not overlapped may happen in any order. Memory ordering with other instructions follows the Intel-64 memory ordering model. Note that this does not account for non-overlapping indices that map into the same physical address locations.
• If two or more destination indices completely overlap, the “earlier” write(s) may be skipped.
• Elements may be scattered in any order, but faults must be delivered in a right-to-left order; thus, elements to the left of a faulting one may be scattered before the fault is delivered. A given implementation of this instruction is repeatable: given the same input values and architectural state, the same set of elements to the left of the faulting one will be scattered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all elements closer to the LSB of the source zmm will be completed (and non-faulting). Individual elements closer to the MSB may or may not be completed. If a given element triggers multiple faults, they are delivered in the conventional order.
• Not valid with 16-bit effective addresses. Will deliver a #UD fault.
• If this instruction overwrites itself and then takes a fault, only a subset of elements may be completed before the fault is delivered (as described above). If the fault handler completes and attempts to re-execute this instruction, the new instruction will be executed, and the scatter will not complete.
Note that the presence of the VSIB byte is enforced in this instruction. Hence, the instruction will #UD fault if ModRM.rm is different than 100b.
This instruction has special disp8*N and alignment rules. N is considered to be the size of a single vector element.
The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are ignored.
The instruction will #UD fault if the k0 mask register is specified.

Operation
BASE_ADDR stands for the memory operand base address (a GPR); may not exist.
VINDEX stands for the memory operand vector of indices (a ZMM register).
SCALE stands for the memory operand scalar (1, 2, 4 or 8).
DISP is the optional 1, 2 or 4 byte displacement.
VSCATTERDPS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN MEM[BASE_ADDR + SignExtend(VINDEX[i+31:i]) * SCALE + DISP] ← SRC[i+31:i]
            k1[j] ← 0
    FI;
ENDFOR
k1[MAX_KL-1:KL] ← 0
VSCATTERDPD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    k ← j * 32
    IF k1[j] OR *no writemask*
        THEN MEM[BASE_ADDR + SignExtend(VINDEX[k+31:k]) * SCALE + DISP] ← SRC[i+63:i]
            k1[j] ← 0
    FI;
ENDFOR
k1[MAX_KL-1:KL] ← 0
VSCATTERQPS (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    k ← j * 64
    IF k1[j] OR *no writemask*
        THEN MEM[BASE_ADDR + (VINDEX[k+63:k]) * SCALE + DISP] ← SRC[i+31:i]
            k1[j] ← 0
    FI;
ENDFOR
k1[MAX_KL-1:KL] ← 0
VSCATTERQPD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN MEM[BASE_ADDR + (VINDEX[i+63:i]) * SCALE + DISP] ← SRC[i+63:i]
            k1[j] ← 0
    FI;
ENDFOR
k1[MAX_KL-1:KL] ← 0
Intel C/C++ Compiler Intrinsic Equivalent
VSCATTERDPD void _mm512_i32scatter_pd(void * base, __m256i vdx, __m512d a, int scale);
VSCATTERDPD void _mm512_mask_i32scatter_pd(void * base, __mmask8 k, __m256i vdx, __m512d a, int scale);
VSCATTERDPS void _mm512_i32scatter_ps(void * base, __m512i vdx, __m512 a, int scale);
VSCATTERDPS void _mm512_mask_i32scatter_ps(void * base, __mmask16 k, __m512i vdx, __m512 a, int scale);
VSCATTERQPD void _mm512_i64scatter_pd(void * base, __m512i vdx, __m512d a, int scale);
VSCATTERQPD void _mm512_mask_i64scatter_pd(void * base, __mmask8 k, __m512i vdx, __m512d a, int scale);
VSCATTERQPS void _mm512_i64scatter_ps(void * base, __m512i vdx, __m256 a, int scale);
VSCATTERQPS void _mm512_mask_i64scatter_ps(void * base, __mmask8 k, __m512i vdx, __m256 a, int scale);
VSCATTERDPD void _mm256_i32scatter_pd(void * base, __m128i vdx, __m256d a, int scale);
VSCATTERDPD void _mm256_mask_i32scatter_pd(void * base, __mmask8 k, __m128i vdx, __m256d a, int scale);
VSCATTERDPS void _mm256_i32scatter_ps(void * base, __m256i vdx, __m256 a, int scale);
VSCATTERDPS void _mm256_mask_i32scatter_ps(void * base, __mmask8 k, __m256i vdx, __m256 a, int scale);
VSCATTERQPD void _mm256_i64scatter_pd(void * base, __m256i vdx, __m256d a, int scale);
VSCATTERQPD void _mm256_mask_i64scatter_pd(void * base, __mmask8 k, __m256i vdx, __m256d a, int scale);
VSCATTERQPS void _mm256_i64scatter_ps(void * base, __m256i vdx, __m128 a, int scale);
VSCATTERQPS void _mm256_mask_i64scatter_ps(void * base, __mmask8 k, __m256i vdx, __m128 a, int scale);
VSCATTERDPD void _mm_i32scatter_pd(void * base, __m128i vdx, __m128d a, int scale);
VSCATTERDPD void _mm_mask_i32scatter_pd(void * base, __mmask8 k, __m128i vdx, __m128d a, int scale);
VSCATTERDPS void _mm_i32scatter_ps(void * base, __m128i vdx, __m128 a, int scale);
VSCATTERDPS void _mm_mask_i32scatter_ps(void * base, __mmask8 k, __m128i vdx, __m128 a, int scale);
VSCATTERQPD void _mm_i64scatter_pd(void * base, __m128i vdx, __m128d a, int scale);
VSCATTERQPD void _mm_mask_i64scatter_pd(void * base, __mmask8 k, __m128i vdx, __m128d a, int scale);
VSCATTERQPS void _mm_i64scatter_ps(void * base, __m128i vdx, __m128 a, int scale);
VSCATTERQPS void _mm_mask_i64scatter_ps(void * base, __mmask8 k, __m128i vdx, __m128 a, int scale);
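As a brief usage sketch of the 512-bit forms listed above (illustrative function and variable names, assuming a compiler with AVX-512F support, e.g., built with -mavx512f):

#include <immintrin.h>
#include <stdint.h>

/* Scatter 16 floats: out[idx[j]] = a[j] for j = 0..15. */
void scatter16_ps(float *out, const int32_t *idx, const float *a)
{
    __m512i vindex = _mm512_loadu_si512((const void *)idx);
    __m512  v      = _mm512_loadu_ps(a);
    _mm512_i32scatter_ps(out, vindex, v, 4);   /* scale 4 = sizeof(float) */
}

/* Masked variant: only elements whose bit is set in k are stored. */
void scatter16_ps_masked(float *out, const int32_t *idx, const float *a,
                         __mmask16 k)
{
    _mm512_mask_i32scatter_ps(out, k, _mm512_loadu_si512((const void *)idx),
                              _mm512_loadu_ps(a), 4);
}

Note that, per the ordering rules above, when two lanes carry the same index the "earlier" write may be skipped, so duplicate indices should be avoided unless only the last write matters.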
SIMD Floating-Point Exceptions
Invalid, Overflow, Underflow, Precision, Denormal
Other Exceptions
See Exceptions Type E12.
VSCATTERPF0DPS/VSCATTERPF0QPS/VSCATTERPF0DPD/VSCATTERPF0QPD—Sparse Prefetch Packed SP/DP Data Values with Signed Dword, Signed Qword Indices Using T0 Hint with Intent to Write
Opcode/Instruction Op/En 64/32 bit Mode Support CPUID Feature Flag Description
EVEX.512.66.0F38.W0 C6 /5 /vsib VSCATTERPF0DPS vm32z {k1} T1S V/V AVX512PF Using signed dword indices, prefetch sparse byte memory locations containing single-precision data using writemask k1 and T0 hint with intent to write.
EVEX.512.66.0F38.W0 C7 /5 /vsib VSCATTERPF0QPS vm64z {k1} T1S V/V AVX512PF Using signed qword indices, prefetch sparse byte memory locations containing single-precision data using writemask k1 and T0 hint with intent to write.
EVEX.512.66.0F38.W1 C6 /5 /vsib VSCATTERPF0DPD vm32y {k1} T1S V/V AVX512PF Using signed dword indices, prefetch sparse byte memory locations containing double-precision data using writemask k1 and T0 hint with intent to write.
EVEX.512.66.0F38.W1 C7 /5 /vsib VSCATTERPF0QPD vm64z {k1} T1S V/V AVX512PF Using signed qword indices, prefetch sparse byte memory locations containing double-precision data using writemask k1 and T0 hint with intent to write.
Instruction Operand Encoding
Op/En: T1S; Operand 1: BaseReg (R): VSIB:base, VectorReg (R): VSIB:index; Operand 2: NA; Operand 3: NA; Operand 4: NA
Description
The instruction conditionally prefetches up to sixteen 32-bit or eight 64-bit data elements. The elements are specified via the VSIB (i.e., the index register is a ZMM register, holding packed indices). Elements will only be prefetched if their corresponding mask bit is one. Cache lines will be brought into the exclusive state (RFO) at the level specified by a locality hint (T0):
• T0 (temporal data)—prefetch data into the first level cache.
[PS data] For dword indices, the instruction will prefetch sixteen memory locations. For qword indices, the instruction will prefetch eight values.
[PD data] For dword and qword indices, the instruction will prefetch eight memory locations.
Note that: (1) The prefetches may happen in any order (or not at all). The instruction is a hint. (2) The mask is left unchanged. (3) Not valid with 16-bit effective addresses. Will deliver a #UD fault. (4) No FP nor memory faults may be produced by this instruction. (5) Prefetches do not handle cache line splits. (6) A #UD is signaled if the memory operand is encoded without the SIB byte.
Operation
BASE_ADDR stands for the memory operand base address (a GPR); may not exist
VINDEX stands for the memory operand vector of indices (a vector register)
SCALE stands for the memory operand scalar (1, 2, 4 or 8)
DISP is the optional 1, 2 or 4 byte displacement
PREFETCH(mem, Level, State) Prefetches a byte memory location pointed to by ‘mem’ into the cache level specified by ‘Level’; a request for exclusive ownership is done if ‘State’ is 1. The prefetch ignores cache line splits. This operation is considered a hint for the processor and may be skipped depending on implementation.
VSCATTERPF0DPS/VSCATTERPF0QPS/VSCATTERPF0DPD/VSCATTERPF0QPD—Sparse Prefetch Packed SP/DP Data Values with Vol.
2C 5-551 INSTRUCTION SET REFERENCE, V-Z VSCATTERPF0DPS (EVEX encoded version) (KL, VL) = (16, 512) FOR j 0 TO KL-1 i j * 32 IF k1[j] Prefetch( [BASE_ADDR + SignExtend(VINDEX[i+31:i]) * SCALE + DISP], Level=0, RFO = 1) FI; ENDFOR VSCATTERPF0DPD (EVEX encoded version) (KL, VL) = (8, 512) FOR j 0 TO KL-1 i j * 64 k j * 32 IF k1[j] Prefetch( [BASE_ADDR + SignExtend(VINDEX[k+31:k]) * SCALE + DISP], Level=0, RFO = 1) FI; ENDFOR VSCATTERPF0QPS (EVEX encoded version) (KL, VL) = (8, 256) FOR j 0 TO KL-1 i j * 64 IF k1[j] Prefetch( [BASE_ADDR + SignExtend(VINDEX[i+63:i]) * SCALE + DISP], Level=0, RFO = 1) FI; ENDFOR VSCATTERPF0QPD (EVEX encoded version) (KL, VL) = (8, 512) FOR j 0 TO KL-1 i j * 64 k j * 64 IF k1[j] Prefetch( [BASE_ADDR + SignExtend(VINDEX[k+63:k]) * SCALE + DISP], Level=0, RFO = 1) FI; ENDFOR Intel C/C++ Compiler Intrinsic Equivalent VSCATTERPF0DPD void _mm512_prefetch_i32scatter_pd(void *base, __m256i vdx, int scale, int hint); VSCATTERPF0DPD void _mm512_mask_prefetch_i32scatter_pd(void *base, __mmask8 m, __m256i vdx, int scale, int hint); VSCATTERPF0DPS void _mm512_prefetch_i32scatter_ps(void *base, __m512i vdx, int scale, int hint); VSCATTERPF0DPS void _mm512_mask_prefetch_i32scatter_ps(void *base, __mmask16 m, __m512i vdx, int scale, int hint); VSCATTERPF0QPD void _mm512_prefetch_i64scatter_pd(void * base, __m512i vdx, int scale, int hint); VSCATTERPF0QPD void _mm512_mask_prefetch_i64scatter_pd(void * base, __mmask8 m, __m512i vdx, int scale, int hint); VSCATTERPF0QPS void _mm512_prefetch_i64scatter_ps(void * base, __m512i vdx, int scale, int hint); VSCATTERPF0QPS void _mm512_mask_prefetch_i64scatter_ps(void * base, __mmask8 m, __m512i vdx, int scale, int hint); SIMD Floating-Point Exceptions None Other Exceptions See Exceptions Type E12NP. 5-552 Vol. 2C VSCATTERPF0DPS/VSCATTERPF0QPS/VSCATTERPF0DPD/VSCATTERPF0QPD—Sparse Prefetch Packed SP/DP Data Values with INSTRUCTION SET REFERENCE, V-Z VSCATTERPF1DPS/VSCATTERPF1QPS/VSCATTERPF1DPD/VSCATTERPF1QPD—Sparse Prefetch Packed SP/DP Data Values with Signed Dword, Signed Qword Indices Using T1 Hint with Intent to Write T1S 64/32 bit Mode Support V/V CPUID Feature Flag AVX512PF EVEX.512.66.0F38.W0 C7 /6 /vsib VSCATTERPF1QPS vm64z {k1} T1S V/V AVX512PF EVEX.512.66.0F38.W1 C6 /6 /vsib VSCATTERPF1DPD vm32y {k1} T1S V/V AVX512PF EVEX.512.66.0F38.W1 C7 /6 /vsib VSCATTERPF1QPD vm64z {k1} T1S V/V AVX512PF Opcode/ Instruction Op/ En EVEX.512.66.0F38.W0 C6 /6 /vsib VSCATTERPF1DPS vm32z {k1} Description Using signed dword indices, prefetch sparse byte memory locations containing single-precision data using writemask k1 and T1 hint with intent to write. Using signed qword indices, prefetch sparse byte memory locations containing single-precision data using writemask k1 and T1 hint with intent to write. Using signed dword indices, prefetch sparse byte memory locations containing double-precision data using writemask k1 and T1 hint with intent to write. Using signed qword indices, prefetch sparse byte memory locations containing double-precision data using writemask k1 and T1 hint with intent to write. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 T1S BaseReg (R): VSIB:base, VectorReg(R): VSIB:index NA NA NA Description The instruction conditionally prefetches up to sixteen 32-bit or eight 64-bit integer byte data elements. The elements are specified via the VSIB (i.e., the index register is an zmm, holding packed indices). Elements will only be prefetched if their corresponding mask bit is one. 
cache lines will be brought into exclusive state (RFO) specified by a locality hint (T1): • T1 (temporal data)—prefetch data into the second level cache. [PS data] For dword indices, the instruction will prefetch sixteen memory locations. For qword indices, the instruction will prefetch eight values. [PD data] For dword and qword indices, the instruction will prefetch eight memory locations. Note that: (1) The prefetches may happen in any order (or not at all). The instruction is a hint. (2) The mask is left unchanged. (3) Not valid with 16-bit effective addresses. Will deliver a #UD fault. (4) No FP nor memory faults may be produced by this instruction. (5) Prefetches do not handle cache line splits (6) A #UD is signaled if the memory operand is encoded without the SIB byte. Operation BASE_ADDR stands for the memory operand base address (a GPR); may not exist VINDEX stands for the memory operand vector of indices (a vector register) SCALE stands for the memory operand scalar (1, 2, 4 or 8) DISP is the optional 1, 2 or 4 byte displacement PREFETCH(mem, Level, State) Prefetches a byte memory location pointed by ‘mem’ into the cache level specified by ‘Level’; a request for exclusive/ownership is done if ‘State’ is 1. Note that the memory location ignore cache line splits. This operation is considered a hint for the processor and may be skipped depending on implementation. VSCATTERPF1DPS/VSCATTERPF1QPS/VSCATTERPF1DPD/VSCATTERPF1QPD—Sparse Prefetch Packed SP/DP Data Values with Vol. 2C 5-553 INSTRUCTION SET REFERENCE, V-Z VSCATTERPF1DPS (EVEX encoded version) (KL, VL) = (16, 512) FOR j 0 TO KL-1 i j * 32 IF k1[j] Prefetch( [BASE_ADDR + SignExtend(VINDEX[i+31:i]) * SCALE + DISP], Level=1, RFO = 1) FI; ENDFOR VSCATTERPF1DPD (EVEX encoded version) (KL, VL) = (8, 512) FOR j 0 TO KL-1 i j * 64 k j * 32 IF k1[j] Prefetch( [BASE_ADDR + SignExtend(VINDEX[k+31:k]) * SCALE + DISP], Level=1, RFO = 1) FI; ENDFOR VSCATTERPF1QPS (EVEX encoded version) (KL, VL) = (8, 512) FOR j 0 TO KL-1 i j * 64 IF k1[j] Prefetch( [BASE_ADDR + SignExtend(VINDEX[i+63:i]) * SCALE + DISP], Level=1, RFO = 1) FI; ENDFOR VSCATTERPF1QPD (EVEX encoded version) (KL, VL) = (8, 512) FOR j 0 TO KL-1 i j * 64 k j * 64 IF k1[j] Prefetch( [BASE_ADDR + SignExtend(VINDEX[k+63:k]) * SCALE + DISP], Level=1, RFO = 1) FI; ENDFOR Intel C/C++ Compiler Intrinsic Equivalent VSCATTERPF1DPD void _mm512_prefetch_i32scatter_pd(void *base, __m256i vdx, int scale, int hint); VSCATTERPF1DPD void _mm512_mask_prefetch_i32scatter_pd(void *base, __mmask8 m, __m256i vdx, int scale, int hint); VSCATTERPF1DPS void _mm512_prefetch_i32scatter_ps(void *base, __m512i vdx, int scale, int hint); VSCATTERPF1DPS void _mm512_mask_prefetch_i32scatter_ps(void *base, __mmask16 m, __m512i vdx, int scale, int hint); VSCATTERPF1QPD void _mm512_prefetch_i64scatter_pd(void * base, __m512i vdx, int scale, int hint); VSCATTERPF1QPD void _mm512_mask_prefetch_i64scatter_pd(void * base, __mmask8 m, __m512i vdx, int scale, int hint); VSCATTERPF1QPS void _mm512_prefetch_i64scatter_ps(void *base, __m512i vdx, int scale, int hint); VSCATTERPF1QPS void _mm512_mask_prefetch_i64scatter_ps(void *base, __mmask8 m, __m512i vdx, int scale, int hint); SIMD Floating-Point Exceptions None Other Exceptions See Exceptions Type E12NP. 5-554 Vol. 
2C VSCATTERPF1DPS/VSCATTERPF1QPS/VSCATTERPF1DPD/VSCATTERPF1QPD—Sparse Prefetch Packed SP/DP Data Values with INSTRUCTION SET REFERENCE, V-Z VSHUFF32x4/VSHUFF64x2/VSHUFI32x4/VSHUFI64x2—Shuffle Packed Values at 128-bit Granularity Opcode/ Instruction Op / En FV 64/32 bit Mode Support V/V CPUID Feature Flag AVX512VL AVX512F EVEX.NDS.256.66.0F3A.W0 23 /r ib VSHUFF32X4 ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst, imm8 EVEX.NDS.512.66.0F3A.W0 23 /r ib VSHUFF32x4 zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst, imm8 FV V/V AVX512F EVEX.NDS.256.66.0F3A.W1 23 /r ib VSHUFF64X2 ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst, imm8 FV V/V AVX512VL AVX512F EVEX.NDS.512.66.0F3A.W1 23 /r ib VSHUFF64x2 zmm1{k1}{z}, zmm2, zmm3/m512/m64bcst, imm8 FV V/V AVX512F EVEX.NDS.256.66.0F3A.W0 43 /r ib VSHUFI32X4 ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst, imm8 EVEX.NDS.512.66.0F3A.W0 43 /r ib VSHUFI32x4 zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst, imm8 EVEX.NDS.256.66.0F3A.W1 43 /r ib VSHUFI64X2 ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst, imm8 EVEX.NDS.512.66.0F3A.W1 43 /r ib VSHUFI64x2 zmm1{k1}{z}, zmm2, zmm3/m512/m64bcst, imm8 FV V/V AVX512VL AVX512F FV V/V AVX512F FV V/V AVX512VL AVX512F FV V/V AVX512F Description Shuffle 128-bit packed single-precision floating-point values selected by imm8 from ymm2 and ymm3/m256/m32bcst and place results in ymm1 subject to writemask k1. Shuffle 128-bit packed single-precision floating-point values selected by imm8 from zmm2 and zmm3/m512/m32bcst and place results in zmm1 subject to writemask k1. Shuffle 128-bit packed double-precision floating-point values selected by imm8 from ymm2 and ymm3/m256/m64bcst and place results in ymm1 subject to writemask k1. Shuffle 128-bit packed double-precision floating-point values selected by imm8 from zmm2 and zmm3/m512/m64bcst and place results in zmm1 subject to writemask k1. Shuffle 128-bit packed double-word values selected by imm8 from ymm2 and ymm3/m256/m32bcst and place results in ymm1 subject to writemask k1. Shuffle 128-bit packed double-word values selected by imm8 from zmm2 and zmm3/m512/m32bcst and place results in zmm1 subject to writemask k1. Shuffle 128-bit packed quad-word values selected by imm8 from ymm2 and ymm3/m256/m64bcst and place results in ymm1 subject to writemask k1. Shuffle 128-bit packed quad-word values selected by imm8 from zmm2 and zmm3/m512/m64bcst and place results in zmm1 subject to writemask k1. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 FV ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA Description 256-bit Version: Moves one of the two 128-bit packed single-precision floating-point values from the first source operand (second operand) into the low 128-bit of the destination operand (first operand); moves one of the two packed 128-bit floating-point values from the second source operand (third operand) into the high 128-bit of the destination operand. The selector operand (third operand) determines which values are moved to the destination operand. 512-bit Version: Moves two of the four 128-bit packed single-precision floating-point values from the first source operand (second operand) into the low 256-bit of each double qword of the destination operand (first operand); moves two of the four packed 128-bit floating-point values from the second source operand (third operand) into the high 256-bit of the destination operand. The selector operand (third operand) determines which values are moved to the destination operand. The first source operand is a vector register. 
The second source operand can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 32/64-bit memory location. The destination operand is a vector register. The writemask updates the destination operand with the granularity of 32/64-bit data elements. VSHUFF32x4/VSHUFF64x2/VSHUFI32x4/VSHUFI64x2—Shuffle Packed Values at 128-bit Granularity Vol. 2C 5-555 INSTRUCTION SET REFERENCE, V-Z Operation Select2(SRC, control) { CASE (control[0]) OF 0: TMP SRC[127:0]; 1: TMP SRC[255:128]; ESAC; RETURN TMP } Select4(SRC, control) { CASE (control[1:0]) OF 0: TMP SRC[127:0]; 1: TMP SRC[255:128]; 2: TMP SRC[383:256]; 3: TMP SRC[511:384]; ESAC; RETURN TMP } VSHUFF32x4 (EVEX versions) (KL, VL) = (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN TMP_SRC2[i+31:i] SRC2[31:0] ELSE TMP_SRC2[i+31:i] SRC2[i+31:i] FI; ENDFOR; IF VL = 256 TMP_DEST[127:0] Select2(SRC1[255:0], imm8[0]); TMP_DEST[255:128] Select2(SRC2[255:0], imm8[1]); FI; IF VL = 512 TMP_DEST[127:0] Select4(SRC1[511:0], imm8[1:0]); TMP_DEST[255:128] Select4(SRC1[511:0], imm8[3:2]); TMP_DEST[383:256] Select4(TMP_SRC2[511:0], imm8[5:4]); TMP_DEST[511:384] Select4(TMP_SRC2[511:0], imm8[7:6]); FI; FOR j 0 TO KL-1 i j * 32 IF k1[j] OR *no writemask* THEN DEST[i+31:i] TMP_DEST[i+31:i] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+31:i] remains unchanged* ELSE *zeroing-masking* ; zeroing-masking THEN DEST[i+31:i] 0 FI; FI; ENDFOR DEST[MAX_VL-1:VL] 0 5-556 Vol. 2C VSHUFF32x4/VSHUFF64x2/VSHUFI32x4/VSHUFI64x2—Shuffle Packed Values at 128-bit Granularity INSTRUCTION SET REFERENCE, V-Z VSHUFF64x2 (EVEX 512-bit version) (KL, VL) = (4, 256), (8, 512) FOR j 0 TO KL-1 i j * 64 IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN TMP_SRC2[i+63:i] SRC2[63:0] ELSE TMP_SRC2[i+63:i] SRC2[i+63:i] FI; ENDFOR; IF VL = 256 TMP_DEST[127:0] Select2(SRC1[255:0], imm8[0]); TMP_DEST[255:128] Select2(SRC2[255:0], imm8[1]); FI; IF VL = 512 TMP_DEST[127:0] Select4(SRC1[511:0], imm8[1:0]); TMP_DEST[255:128] Select4(SRC1[511:0], imm8[3:2]); TMP_DEST[383:256] Select4(TMP_SRC2[511:0], imm8[5:4]); TMP_DEST[511:384] Select4(TMP_SRC2[511:0], imm8[7:6]); FI; FOR j 0 TO KL-1 i j * 64 IF k1[j] OR *no writemask* THEN DEST[i+63:i] TMP_DEST[i+63:i] ELSE IF *merging-masking* ; merging-masking THEN *DEST[i+63:i] remains unchanged* ELSE *zeroing-masking* ; zeroing-masking THEN DEST[i+63:i] 0 FI FI; ENDFOR DEST[MAX_VL-1:VL] 0 VSHUFI32x4 (EVEX 512-bit version) (KL, VL) = (8, 256), (16, 512) FOR j 0 TO KL-1 i j * 32 IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN TMP_SRC2[i+31:i] SRC2[31:0] ELSE TMP_SRC2[i+31:i] SRC2[i+31:i] FI; ENDFOR; IF VL = 256 TMP_DEST[127:0] Select2(SRC1[255:0], imm8[0]); TMP_DEST[255:128] Select2(SRC2[255:0], imm8[1]); FI; IF VL = 512 TMP_DEST[127:0] Select4(SRC1[511:0], imm8[1:0]); TMP_DEST[255:128] Select4(SRC1[511:0], imm8[3:2]); TMP_DEST[383:256] Select4(TMP_SRC2[511:0], imm8[5:4]); TMP_DEST[511:384] Select4(TMP_SRC2[511:0], imm8[7:6]); FI; FOR j 0 TO KL-1 i j * 32 VSHUFF32x4/VSHUFF64x2/VSHUFI32x4/VSHUFI64x2—Shuffle Packed Values at 128-bit Granularity Vol. 
2C 5-557
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i]
        ELSE IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                THEN DEST[i+31:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
VSHUFI64x2 (EVEX 512-bit version)
(KL, VL) = (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF (EVEX.b = 1) AND (SRC2 *is memory*)
        THEN TMP_SRC2[i+63:i] ← SRC2[63:0]
        ELSE TMP_SRC2[i+63:i] ← SRC2[i+63:i]
    FI;
ENDFOR;
IF VL = 256
    TMP_DEST[127:0] ← Select2(SRC1[255:0], imm8[0]);
    TMP_DEST[255:128] ← Select2(SRC2[255:0], imm8[1]);
FI;
IF VL = 512
    TMP_DEST[127:0] ← Select4(SRC1[511:0], imm8[1:0]);
    TMP_DEST[255:128] ← Select4(SRC1[511:0], imm8[3:2]);
    TMP_DEST[383:256] ← Select4(TMP_SRC2[511:0], imm8[5:4]);
    TMP_DEST[511:384] ← Select4(TMP_SRC2[511:0], imm8[7:6]);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i]
        ELSE IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                THEN DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
Intel C/C++ Compiler Intrinsic Equivalent
VSHUFI32x4 __m512i _mm512_shuffle_i32x4(__m512i a, __m512i b, int imm);
VSHUFI32x4 __m512i _mm512_mask_shuffle_i32x4(__m512i s, __mmask16 k, __m512i a, __m512i b, int imm);
VSHUFI32x4 __m512i _mm512_maskz_shuffle_i32x4( __mmask16 k, __m512i a, __m512i b, int imm);
VSHUFI32x4 __m256i _mm256_shuffle_i32x4(__m256i a, __m256i b, int imm);
VSHUFI32x4 __m256i _mm256_mask_shuffle_i32x4(__m256i s, __mmask8 k, __m256i a, __m256i b, int imm);
VSHUFI32x4 __m256i _mm256_maskz_shuffle_i32x4( __mmask8 k, __m256i a, __m256i b, int imm);
VSHUFF32x4 __m512 _mm512_shuffle_f32x4(__m512 a, __m512 b, int imm);
VSHUFF32x4 __m512 _mm512_mask_shuffle_f32x4(__m512 s, __mmask16 k, __m512 a, __m512 b, int imm);
VSHUFF32x4 __m512 _mm512_maskz_shuffle_f32x4( __mmask16 k, __m512 a, __m512 b, int imm);
VSHUFI64x2 __m512i _mm512_shuffle_i64x2(__m512i a, __m512i b, int imm);
VSHUFI64x2 __m512i _mm512_mask_shuffle_i64x2(__m512i s, __mmask8 k, __m512i a, __m512i b, int imm);
VSHUFI64x2 __m512i _mm512_maskz_shuffle_i64x2( __mmask8 k, __m512i a, __m512i b, int imm);
VSHUFF64x2 __m512d _mm512_shuffle_f64x2(__m512d a, __m512d b, int imm);
VSHUFF64x2 __m512d _mm512_mask_shuffle_f64x2(__m512d s, __mmask8 k, __m512d a, __m512d b, int imm);
VSHUFF64x2 __m512d _mm512_maskz_shuffle_f64x2( __mmask8 k, __m512d a, __m512d b, int imm);
SIMD Floating-Point Exceptions
None
Other Exceptions
See Exceptions Type E4NF.
#UD If EVEX.L’L = 0 for VSHUFF32x4/VSHUFF64x2.
VTESTPD/VTESTPS—Packed Bit Test
Opcode/Instruction Op/En 64/32 bit Mode Support CPUID Feature Flag Description
VEX.128.66.0F38.W0 0E /r VTESTPS xmm1, xmm2/m128 RM V/V AVX Set ZF and CF depending on sign bit AND and ANDN of packed single-precision floating-point sources.
VEX.256.66.0F38.W0 0E /r VTESTPS ymm1, ymm2/m256 RM V/V AVX Set ZF and CF depending on sign bit AND and ANDN of packed single-precision floating-point sources.
VEX.128.66.0F38.W0 0F /r VTESTPD xmm1, xmm2/m128 RM V/V AVX Set ZF and CF depending on sign bit AND and ANDN of packed double-precision floating-point sources.
VEX.256.66.0F38.W0 0F /r VTESTPD ymm1, ymm2/m256 RM V/V AVX Set ZF and CF depending on sign bit AND and ANDN of packed double-precision floating-point sources.
Instruction Operand Encoding
Op/En: RM; Operand 1: ModRM:reg (r); Operand 2: ModRM:r/m (r); Operand 3: NA; Operand 4: NA
Description
VTESTPS performs a bitwise comparison of all the sign bits of the packed single-precision elements in the first source operand and corresponding sign bits in the second source operand. If the AND of the source sign bits with the dest sign bits produces all zeros, the ZF is set; else the ZF is clear. If the AND of the source sign bits with the inverted dest sign bits produces all zeros, the CF is set; else the CF is clear. An attempt to execute VTESTPS with VEX.W=1 will cause #UD.
VTESTPD performs a bitwise comparison of all the sign bits of the double-precision elements in the first source operand and corresponding sign bits in the second source operand. If the AND of the source sign bits with the dest sign bits produces all zeros, the ZF is set; else the ZF is clear. If the AND of the source sign bits with the inverted dest sign bits produces all zeros, the CF is set; else the CF is clear. An attempt to execute VTESTPD with VEX.W=1 will cause #UD.
The first source register is specified by the ModR/M reg field.
128-bit version: The first source register is an XMM register. The second source register can be an XMM register or a 128-bit memory location. The destination register is not modified.
VEX.256 encoded version: The first source register is a YMM register. The second source register can be a YMM register or a 256-bit memory location. The destination register is not modified.
Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Operation
VTESTPS (128-bit version)
TEMP[127:0] ← SRC[127:0] AND DEST[127:0]
IF (TEMP[31] = TEMP[63] = TEMP[95] = TEMP[127] = 0)
    THEN ZF ← 1;
    ELSE ZF ← 0;
TEMP[127:0] ← SRC[127:0] AND NOT DEST[127:0]
IF (TEMP[31] = TEMP[63] = TEMP[95] = TEMP[127] = 0)
    THEN CF ← 1;
    ELSE CF ← 0;
DEST (unmodified)
AF ← OF ← PF ← SF ← 0;
VTESTPS (VEX.256 encoded version)
TEMP[255:0] ← SRC[255:0] AND DEST[255:0]
IF (TEMP[31] = TEMP[63] = TEMP[95] = TEMP[127] = TEMP[159] = TEMP[191] = TEMP[223] = TEMP[255] = 0)
    THEN ZF ← 1;
    ELSE ZF ← 0;
TEMP[255:0] ← SRC[255:0] AND NOT DEST[255:0]
IF (TEMP[31] = TEMP[63] = TEMP[95] = TEMP[127] = TEMP[159] = TEMP[191] = TEMP[223] = TEMP[255] = 0)
    THEN CF ← 1;
    ELSE CF ← 0;
DEST (unmodified)
AF ← OF ← PF ← SF ← 0;
VTESTPD (128-bit version)
TEMP[127:0] ← SRC[127:0] AND DEST[127:0]
IF (TEMP[63] = TEMP[127] = 0)
    THEN ZF ← 1;
    ELSE ZF ← 0;
TEMP[127:0] ← SRC[127:0] AND NOT DEST[127:0]
IF (TEMP[63] = TEMP[127] = 0)
    THEN CF ← 1;
    ELSE CF ← 0;
DEST (unmodified)
AF ← OF ← PF ← SF ← 0;
VTESTPD (VEX.256 encoded version)
TEMP[255:0] ← SRC[255:0] AND DEST[255:0]
IF (TEMP[63] = TEMP[127] = TEMP[191] = TEMP[255] = 0)
    THEN ZF ← 1;
    ELSE ZF ← 0;
TEMP[255:0] ← SRC[255:0] AND NOT DEST[255:0]
IF (TEMP[63] = TEMP[127] = TEMP[191] = TEMP[255] = 0)
    THEN CF ← 1;
    ELSE CF ← 0;
DEST (unmodified)
AF ← OF ← PF ← SF ← 0;
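The flag results map directly onto the testz/testc intrinsic families listed below. As a minimal sketch (assuming AVX support; the function name is illustrative), ZF = 1 exactly when no sign bit selected by one operand is set in the other:

#include <immintrin.h>

/* Returns 1 if no element of v has its sign bit set (all non-negative,
   treating -0.0 and negative NaNs as negative). With an all-ones second
   operand, VTESTPD's ZF reports whether the AND of the sign bits is zero. */
int all_nonnegative_pd(__m256d v)
{
    __m256d all_ones = _mm256_castsi256_pd(_mm256_set1_epi32(-1));
    return _mm256_testz_pd(v, all_ones);
}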
Intel C/C++ Compiler Intrinsic Equivalent
VTESTPS int _mm256_testz_ps (__m256 s1, __m256 s2);
int _mm256_testc_ps (__m256 s1, __m256 s2);
int _mm256_testnzc_ps (__m256 s1, __m256 s2);
int _mm_testz_ps (__m128 s1, __m128 s2);
int _mm_testc_ps (__m128 s1, __m128 s2);
int _mm_testnzc_ps (__m128 s1, __m128 s2);
VTESTPD int _mm256_testz_pd (__m256d s1, __m256d s2);
int _mm256_testc_pd (__m256d s1, __m256d s2);
int _mm256_testnzc_pd (__m256d s1, __m256d s2);
int _mm_testz_pd (__m128d s1, __m128d s2);
int _mm_testc_pd (__m128d s1, __m128d s2);
int _mm_testnzc_pd (__m128d s1, __m128d s2);
Flags Affected
The OF, AF, PF, SF flags are cleared and the ZF, CF flags are set according to the operation.
SIMD Floating-Point Exceptions
None.
Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.vvvv ≠ 1111B.
If VEX.W = 1 for VTESTPS or VTESTPD.
VZEROALL—Zero All YMM Registers
Opcode/Instruction Op/En 64/32 bit Mode Support CPUID Feature Flag Description
VEX.256.0F.WIG 77 VZEROALL NP V/V AVX Zero all YMM registers.
Instruction Operand Encoding
Op/En: NP; Operand 1: NA; Operand 2: NA; Operand 3: NA; Operand 4: NA
Description
The instruction zeros the contents of all XMM and YMM registers.
Note: VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD. In Compatibility and legacy 32-bit mode only the lower 8 registers are modified.
Operation
VZEROALL (VEX.256 encoded version)
IF (64-bit mode)
    YMM0[VLMAX-1:0] ← 0
    YMM1[VLMAX-1:0] ← 0
    YMM2[VLMAX-1:0] ← 0
    YMM3[VLMAX-1:0] ← 0
    YMM4[VLMAX-1:0] ← 0
    YMM5[VLMAX-1:0] ← 0
    YMM6[VLMAX-1:0] ← 0
    YMM7[VLMAX-1:0] ← 0
    YMM8[VLMAX-1:0] ← 0
    YMM9[VLMAX-1:0] ← 0
    YMM10[VLMAX-1:0] ← 0
    YMM11[VLMAX-1:0] ← 0
    YMM12[VLMAX-1:0] ← 0
    YMM13[VLMAX-1:0] ← 0
    YMM14[VLMAX-1:0] ← 0
    YMM15[VLMAX-1:0] ← 0
ELSE
    YMM0[VLMAX-1:0] ← 0
    YMM1[VLMAX-1:0] ← 0
    YMM2[VLMAX-1:0] ← 0
    YMM3[VLMAX-1:0] ← 0
    YMM4[VLMAX-1:0] ← 0
    YMM5[VLMAX-1:0] ← 0
    YMM6[VLMAX-1:0] ← 0
    YMM7[VLMAX-1:0] ← 0
    YMM8-15: Unmodified
FI
Intel C/C++ Compiler Intrinsic Equivalent
VZEROALL: _mm256_zeroall()
SIMD Floating-Point Exceptions
None.
Other Exceptions
See Exceptions Type 8.
VZEROUPPER—Zero Upper Bits of YMM Registers
Opcode/Instruction Op/En 64/32 bit Mode Support CPUID Feature Flag Description
VEX.128.0F.WIG 77 VZEROUPPER NP V/V AVX Zero upper 128 bits of all YMM registers.
Instruction Operand Encoding
Op/En: NP; Operand 1: NA; Operand 2: NA; Operand 3: NA; Operand 4: NA
Description
The instruction zeros the bits in position 128 and higher of all YMM registers. The lower 128 bits of the registers (the corresponding XMM registers) are unmodified.
This instruction is recommended when transitioning between AVX and legacy SSE code; it will eliminate performance penalties caused by false dependencies.
Note: VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD. In Compatibility and legacy 32-bit mode only the lower 8 registers are modified.
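For example, a routine that uses 256-bit AVX internally can execute VZEROUPPER before returning to callers that may contain legacy SSE code. A minimal sketch (illustrative names; modern compilers normally insert VZEROUPPER at such boundaries automatically):

#include <immintrin.h>

/* Horizontal sum of 8 floats using AVX, then clear the YMM upper
   halves before returning to possibly non-VEX (legacy SSE) code. */
float sum8(const float *p)
{
    __m256 v  = _mm256_loadu_ps(p);
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 hi = _mm256_extractf128_ps(v, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);       /* SSE3 horizontal adds */
    s = _mm_hadd_ps(s, s);
    float r = _mm_cvtss_f32(s);
    _mm256_zeroupper();          /* VZEROUPPER: avoid false dependencies */
    return r;
}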
Operation VZEROUPPER IF (64-bit mode) YMM0[VLMAX-1:128] 0 YMM1[VLMAX-1:128] 0 YMM2[VLMAX-1:128] 0 YMM3[VLMAX-1:128] 0 YMM4[VLMAX-1:128] 0 YMM5[VLMAX-1:128] 0 YMM6[VLMAX-1:128] 0 YMM7[VLMAX-1:128] 0 YMM8[VLMAX-1:128] 0 YMM9[VLMAX-1:128] 0 YMM10[VLMAX-1:128] 0 YMM11[VLMAX-1:128] 0 YMM12[VLMAX-1:128] 0 YMM13[VLMAX-1:128] 0 YMM14[VLMAX-1:128] 0 YMM15[VLMAX-1:128] 0 ELSE YMM0[VLMAX-1:128] 0 YMM1[VLMAX-1:128] 0 YMM2[VLMAX-1:128] 0 YMM3[VLMAX-1:128] 0 YMM4[VLMAX-1:128] 0 YMM5[VLMAX-1:128] 0 YMM6[VLMAX-1:128] 0 YMM7[VLMAX-1:128] 0 YMM8-15: unmodified FI VZEROUPPER—Zero Upper Bits of YMM Registers Vol. 2C 5-565 INSTRUCTION SET REFERENCE, V-Z Intel C/C++ Compiler Intrinsic Equivalent VZEROUPPER: _mm256_zeroupper() SIMD Floating-Point Exceptions None. Other Exceptions See Exceptions Type 8. 5-566 Vol. 2C VZEROUPPER—Zero Upper Bits of YMM Registers INSTRUCTION SET REFERENCE, V-Z WAIT/FWAIT—Wait Opcode Instruction Op/ En 64-Bit Mode Compat/ Description Leg Mode 9B WAIT NP Valid Valid Check pending unmasked floating-point exceptions. 9B FWAIT NP Valid Valid Check pending unmasked floating-point exceptions. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 NP NA NA NA NA Description Causes the processor to check for and handle pending, unmasked, floating-point exceptions before proceeding. (FWAIT is an alternate mnemonic for WAIT.) This instruction is useful for synchronizing exceptions in critical sections of code. Coding a WAIT instruction after a floating-point instruction ensures that any unmasked floating-point exceptions the instruction may raise are handled before the processor can modify the instruction’s results. See the section titled “Floating-Point Exception Synchronization” in Chapter 8 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for more information on using the WAIT/FWAIT instruction. This instruction’s operation is the same in non-64-bit modes and 64-bit mode. Operation CheckForPendingUnmaskedFloatingPointExceptions; FPU Flags Affected The C0, C1, C2, and C3 flags are undefined. Floating-Point Exceptions None. Protected Mode Exceptions #NM If CR0.MP[bit 1] = 1 and CR0.TS[bit 3] = 1. #UD If the LOCK prefix is used. Real-Address Mode Exceptions Same exceptions as in protected mode. Virtual-8086 Mode Exceptions Same exceptions as in protected mode. Compatibility Mode Exceptions Same exceptions as in protected mode. 64-Bit Mode Exceptions Same exceptions as in protected mode. WAIT/FWAIT—Wait Vol. 2C 5-567 INSTRUCTION SET REFERENCE, V-Z WBINVD—Write Back and Invalidate Cache Opcode Instruction Op/ En 64-Bit Mode Compat/ Description Leg Mode 0F 09 WBINVD NP Valid Valid Write back and flush Internal caches; initiate writing-back and flushing of external caches. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 NP NA NA NA NA Description Writes back all modified cache lines in the processor’s internal cache to main memory and invalidates (flushes) the internal caches. The instruction then issues a special-function bus cycle that directs external caches to also write back modified data and another bus cycle to indicate that the external caches should be invalidated. After executing this instruction, the processor does not wait for the external caches to complete their write-back and flushing operations before proceeding with instruction execution. It is the responsibility of hardware to respond to the cache write-back and flush signals. 
The amount of time or cycles for WBINVD to complete will vary due to size and other factors of different cache hierarchies. As a consequence, the use of the WBINVD instruction can have an impact on logical processor interrupt/event response time. Additional information of WBINVD behavior in a cache hierarchy with hierarchical sharing topology can be found in Chapter 2 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A. The WBINVD instruction is a privileged instruction. When the processor is running in protected mode, the CPL of a program or procedure must be 0 to execute this instruction. This instruction is also a serializing instruction (see “Serializing Instructions” in Chapter 8 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A). In situations where cache coherency with main memory is not a concern, software can use the INVD instruction. This instruction’s operation is the same in non-64-bit modes and 64-bit mode. IA-32 Architecture Compatibility The WBINVD instruction is implementation dependent, and its function may be implemented differently on future Intel 64 and IA-32 processors. The instruction is not supported on IA-32 processors earlier than the Intel486 processor. Operation WriteBack(InternalCaches); Flush(InternalCaches); SignalWriteBack(ExternalCaches); SignalFlush(ExternalCaches); Continue; (* Continue execution *) Flags Affected None. Protected Mode Exceptions #GP(0) If the current privilege level is not 0. #UD If the LOCK prefix is used. 5-568 Vol. 2C WBINVD—Write Back and Invalidate Cache INSTRUCTION SET REFERENCE, V-Z Real-Address Mode Exceptions #UD If the LOCK prefix is used. Virtual-8086 Mode Exceptions #GP(0) WBINVD cannot be executed at the virtual-8086 mode. Compatibility Mode Exceptions Same exceptions as in protected mode. 64-Bit Mode Exceptions Same exceptions as in protected mode. WBINVD—Write Back and Invalidate Cache Vol. 2C 5-569 INSTRUCTION SET REFERENCE, V-Z WRFSBASE/WRGSBASE—Write FS/GS Segment Base Opcode/ Instruction Op/ En 64/32bit Mode CPUID Fea- Description ture Flag F3 0F AE /2 WRFSBASE r32 M V/I FSGSBASE Load the FS base address with the 32-bit value in the source register. F3 REX.W 0F AE /2 WRFSBASE r64 M V/I FSGSBASE Load the FS base address with the 64-bit value in the source register. F3 0F AE /3 WRGSBASE r32 M V/I FSGSBASE Load the GS base address with the 32-bit value in the source register. F3 REX.W 0F AE /3 WRGSBASE r64 M V/I FSGSBASE Load the GS base address with the 64-bit value in the source register. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 M ModRM:r/m (r) NA NA NA Description Loads the FS or GS segment base address with the general-purpose register indicated by the modR/M:r/m field. The source operand may be either a 32-bit or a 64-bit general-purpose register. The REX.W prefix indicates the operand size is 64 bits. If no REX.W prefix is used, the operand size is 32 bits; the upper 32 bits of the source register are ignored and upper 32 bits of the base address (for FS or GS) are cleared. This instruction is supported only in 64-bit mode. Operation FS/GS segment base address ← SRC; Flags Affected None C/C++ Compiler Intrinsic Equivalent WRFSBASE: void _writefsbase_u32( unsigned int ); WRFSBASE: _writefsbase_u64( unsigned __int64 ); WRGSBASE: void _writegsbase_u32( unsigned int ); WRGSBASE: _writegsbase_u64( unsigned __int64 ); Protected Mode Exceptions #UD The WRFSBASE and WRGSBASE instructions are not recognized in protected mode. 
Real-Address Mode Exceptions #UD The WRFSBASE and WRGSBASE instructions are not recognized in real-address mode. Virtual-8086 Mode Exceptions #UD The WRFSBASE and WRGSBASE instructions are not recognized in virtual-8086 mode. Compatibility Mode Exceptions #UD 5-570 Vol. 2C The WRFSBASE and WRGSBASE instructions are not recognized in compatibility mode. WRFSBASE/WRGSBASE—Write FS/GS Segment Base INSTRUCTION SET REFERENCE, V-Z 64-Bit Mode Exceptions #UD If the LOCK prefix is used. If CR4.FSGSBASE[bit 16] = 0. If CPUID.07H.0H:EBX.FSGSBASE[bit 0] = 0 #GP(0) If the source register contains a non-canonical address. WRFSBASE/WRGSBASE—Write FS/GS Segment Base Vol. 2C 5-571 INSTRUCTION SET REFERENCE, V-Z WRMSR—Write to Model Specific Register Opcode Instruction Op/ En 64-Bit Mode Compat/ Description Leg Mode 0F 30 WRMSR NP Valid Valid Write the value in EDX:EAX to MSR specified by ECX. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 NP NA NA NA NA Description Writes the contents of registers EDX:EAX into the 64-bit model specific register (MSR) specified in the ECX register. (On processors that support the Intel 64 architecture, the high-order 32 bits of RCX are ignored.) The contents of the EDX register are copied to high-order 32 bits of the selected MSR and the contents of the EAX register are copied to low-order 32 bits of the MSR. (On processors that support the Intel 64 architecture, the high-order 32 bits of each of RAX and RDX are ignored.) Undefined or reserved bits in an MSR should be set to values previously read. This instruction must be executed at privilege level 0 or in real-address mode; otherwise, a general protection exception #GP(0) is generated. Specifying a reserved or unimplemented MSR address in ECX will also cause a general protection exception. The processor will also generate a general protection exception if software attempts to write to bits in a reserved MSR. When the WRMSR instruction is used to write to an MTRR, the TLBs are invalidated. This includes global entries (see “Translation Lookaside Buffers (TLBs)” in Chapter 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A). MSRs control functions for testability, execution tracing, performance-monitoring and machine check errors. Chapter 35, “Model-Specific Registers (MSRs)”, in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3C, lists all MSRs that can be written with this instruction and their addresses. Note that each processor family has its own set of MSRs. The WRMSR instruction is a serializing instruction (see “Serializing Instructions” in Chapter 8 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A). Note that WRMSR to the IA32_TSC_DEADLINE MSR (MSR index 6E0H) and the X2APIC MSRs (MSR indices 802H to 83FH) are not serializing. The CPUID instruction should be used to determine whether MSRs are supported (CPUID.01H:EDX[5] = 1) before using this instruction. IA-32 Architecture Compatibility The MSRs and the ability to read them with the WRMSR instruction were introduced into the IA-32 architecture with the Pentium processor. Execution of this instruction by an IA-32 processor earlier than the Pentium processor results in an invalid opcode exception #UD. Operation MSR[ECX] ← EDX:EAX; Flags Affected None. 5-572 Vol. 2C WRMSR—Write to Model Specific Register INSTRUCTION SET REFERENCE, V-Z Protected Mode Exceptions #GP(0) If the current privilege level is not 0. 
If the value in ECX specifies a reserved or unimplemented MSR address.
If the value in EDX:EAX sets bits that are reserved in the MSR specified by ECX.
If the source register contains a non-canonical address and ECX specifies one of the following MSRs: IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE, IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP.
#UD If the LOCK prefix is used.
Real-Address Mode Exceptions
#GP If the value in ECX specifies a reserved or unimplemented MSR address.
If the value in EDX:EAX sets bits that are reserved in the MSR specified by ECX.
If the source register contains a non-canonical address and ECX specifies one of the following MSRs: IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE, IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP.
#UD If the LOCK prefix is used.
Virtual-8086 Mode Exceptions
#GP(0) The WRMSR instruction is not recognized in virtual-8086 mode.
Compatibility Mode Exceptions
Same exceptions as in protected mode.
64-Bit Mode Exceptions
Same exceptions as in protected mode.
WRPKRU—Write Data to User Page Key Register
Opcode/Instruction Op/En 64/32 bit Mode Support CPUID Feature Flag Description
0F 01 EF WRPKRU NP V/V OSPKE Writes EAX into PKRU.
Instruction Operand Encoding
Op/En: NP; Operand 1: NA; Operand 2: NA; Operand 3: NA; Operand 4: NA
Description
Writes the value of EAX into PKRU. ECX and EDX must be 0 when WRPKRU is executed; otherwise, a general-protection exception (#GP) occurs.
WRPKRU can be executed only if CR4.PKE = 1; otherwise, an invalid-opcode exception (#UD) occurs. Software can discover the value of CR4.PKE by examining CPUID.(EAX=07H,ECX=0H):ECX.OSPKE [bit 4].
On processors that support the Intel 64 Architecture, the high-order 32 bits of RCX, RDX and RAX are ignored.
Operation
IF (ECX = 0 AND EDX = 0)
    THEN PKRU ← EAX;
    ELSE #GP(0);
FI;
Flags Affected
None.
C/C++ Compiler Intrinsic Equivalent
WRPKRU: void _wrpkru(uint32_t);
Protected Mode Exceptions
#GP(0) If ECX ≠ 0.
If EDX ≠ 0.
#UD If the LOCK prefix is used.
If CR4.PKE = 0.
Real-Address Mode Exceptions
Same exceptions as in protected mode.
Virtual-8086 Mode Exceptions
Same exceptions as in protected mode.
Compatibility Mode Exceptions
Same exceptions as in protected mode.
64-Bit Mode Exceptions
Same exceptions as in protected mode.
XACQUIRE/XRELEASE — Hardware Lock Elision Prefix Hints
Opcode/Instruction 64/32bit Mode Support CPUID Feature Flag Description
F2 XACQUIRE V/V HLE (see Note 1) A hint used with an “XACQUIRE-enabled” instruction to start lock elision on the instruction memory operand address.
F3 XRELEASE V/V HLE A hint used with an “XRELEASE-enabled” instruction to end lock elision on the instruction memory operand address.
NOTES:
1. Software is not required to check the HLE feature flag to use XACQUIRE or XRELEASE, as they are treated as a regular prefix if the HLE feature flag reports 0.
Description
The XACQUIRE prefix is a hint to start lock elision on the memory address specified by the instruction and the XRELEASE prefix is a hint to end lock elision on the memory address specified by the instruction.
The XACQUIRE prefix hint can only be used with the following instructions (these instructions are also referred to as XACQUIRE-enabled when used with the XACQUIRE prefix): • Instructions with an explicit LOCK prefix (F0H) prepended to forms of the instruction where the destination operand is a memory operand: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. • The XCHG instruction either with or without the presence of the LOCK prefix. The XRELEASE prefix hint can only be used with the following instructions (also referred to as XRELEASE-enabled when used with the XRELEASE prefix): • Instructions with an explicit LOCK prefix (F0H) prepended to forms of the instruction where the destination operand is a memory operand: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. • • The XCHG instruction either with or without the presence of the LOCK prefix. The “MOV mem, reg” (Opcode 88H/89H) and “MOV mem, imm” (Opcode C6H/C7H) instructions. In these cases, the XRELEASE is recognized without the presence of the LOCK prefix. The lock variables must satisfy the guidelines described in Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, Section 16.3.3, for elision to be successful, otherwise an HLE abort may be signaled. If an encoded byte sequence that meets XACQUIRE/XRELEASE requirements includes both prefixes, then the HLE semantic is determined by the prefix byte that is placed closest to the instruction opcode. For example, an F3F2C6 will not be treated as a XRELEASE-enabled instruction since the F2H (XACQUIRE) is closest to the instruction opcode C6. Similarly, an F2F3F0 prefixed instruction will be treated as a XRELEASE-enabled instruction since F3H (XRELEASE) is closest to the instruction opcode. XACQUIRE/XRELEASE — Hardware Lock Elision Prefix Hints Vol. 2C 5-575 INSTRUCTION SET REFERENCE, V-Z Intel 64 and IA-32 Compatibility The effect of the XACQUIRE/XRELEASE prefix hint is the same in non-64-bit modes and in 64-bit mode. For instructions that do not support the XACQUIRE hint, the presence of the F2H prefix behaves the same way as prior hardware, according to • • • • REPNE/REPNZ semantics for string instructions, Serve as SIMD prefix for legacy SIMD instructions operating on XMM register Cause #UD if prepending the VEX prefix. Undefined for non-string instructions or other situations. For instructions that do not support the XRELEASE hint, the presence of the F3H prefix behaves the same way as in prior hardware, according to • • • • REP/REPE/REPZ semantics for string instructions, Serve as SIMD prefix for legacy SIMD instructions operating on XMM register Cause #UD if prepending the VEX prefix. Undefined for non-string instructions or other situations. 
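As a usage sketch, a spin lock can opt into lock elision by prefixing its acquire and release operations as described above. The fragment below uses GCC/Clang inline-assembly syntax; the lock variable and function names are illustrative and not part of this manual:

#include <stdint.h>

static volatile uint32_t lock_word = 0;   /* 0 = free, 1 = held */

/* XACQUIRE (F2H) on a locked XCHG starts lock elision; on hardware
   without HLE the prefix is ignored and this is a plain spin lock. */
static void hle_lock(void)
{
    uint32_t prev = 1;
    do {
        __asm__ __volatile__("xacquire lock xchgl %0, %1"
                             : "+r"(prev), "+m"(lock_word)
                             :
                             : "memory");
    } while (prev != 0);          /* retry while the lock was held */
}

/* XRELEASE (F3H) is recognized on MOV mem, imm without a LOCK prefix. */
static void hle_unlock(void)
{
    __asm__ __volatile__("xrelease movl $0, %0"
                         : "+m"(lock_word)
                         :
                         : "memory");
}

GCC also exposes the same hints through its __atomic builtins with the __ATOMIC_HLE_ACQUIRE and __ATOMIC_HLE_RELEASE flags, which avoids hand-written assembly.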
Operation XACQUIRE IF XACQUIRE-enabled instruction THEN IF (HLE_NEST_COUNT < MAX_HLE_NEST_COUNT) THEN HLE_NEST_COUNT++ IF (HLE_NEST_COUNT = 1) THEN HLE_ACTIVE ← 1 IF 64-bit mode THEN restartRIP ← instruction pointer of the XACQUIRE-enabled instruction ELSE restartEIP ← instruction pointer of the XACQUIRE-enabled instruction FI; Enter HLE Execution (* record register state, start tracking memory state *) FI; (* HLE_NEST_COUNT = 1*) IF ElisionBufferAvailable THEN Allocate elision buffer Record address and data for forwarding and commit checking Perform elision ELSE Perform lock acquire operation transactionally but without elision FI; ELSE (* HLE_NEST_COUNT = MAX_HLE_NEST_COUNT *) GOTO HLE_ABORT_PROCESSING FI; ELSE Treat instruction as non-XACQUIRE F2H prefixed legacy instruction FI; 5-576 Vol. 2C XACQUIRE/XRELEASE — Hardware Lock Elision Prefix Hints INSTRUCTION SET REFERENCE, V-Z XRELEASE IF XRELEASE-enabled instruction THEN IF (HLE_NEST_COUNT > 0) THEN HLE_NEST_COUNT-IF lock address matches in elision buffer THEN IF lock satisfies address and value requirements THEN Deallocate elision buffer ELSE GOTO HLE_ABORT_PROCESSING FI; FI; IF (HLE_NEST_COUNT = 0) THEN IF NoAllocatedElisionBuffer THEN Try to commit transactional execution IF fail to commit transactional execution THEN GOTO HLE_ABORT_PROCESSING; ELSE (* commit success *) HLE_ACTIVE ← 0 FI; ELSE GOTO HLE_ABORT_PROCESSING FI; FI; FI; (* HLE_NEST_COUNT > 0 *) ELSE Treat instruction as non-XRELEASE F3H prefixed legacy instruction FI; (* For any HLE abort condition encountered during HLE execution *) HLE_ABORT_PROCESSING: HLE_ACTIVE ← 0 HLE_NEST_COUNT ← 0 Restore architectural register state Discard memory updates performed in transaction Free any allocated lock elision buffers IF 64-bit mode THEN RIP ← restartRIP ELSE EIP ← restartEIP FI; Execute and retire instruction at RIP (or EIP) and ignore any HLE hint END XACQUIRE/XRELEASE — Hardware Lock Elision Prefix Hints Vol. 2C 5-577 INSTRUCTION SET REFERENCE, V-Z SIMD Floating-Point Exceptions None Other Exceptions #GP(0) 5-578 Vol. 2C If the use of prefix causes instruction length to exceed 15 bytes. XACQUIRE/XRELEASE — Hardware Lock Elision Prefix Hints INSTRUCTION SET REFERENCE, V-Z XABORT — Transactional Abort Opcode/Instruction Op/ En C6 F8 ib XABORT imm8 A 64/32bit Mode Support V/V CPUID Feature Flag RTM Description Causes an RTM abort if in RTM execution Instruction Operand Encoding Op/En Operand 1 Operand2 Operand3 Operand4 A imm8 NA NA NA Description XABORT forces an RTM abort. Following an RTM abort, the logical processor resumes execution at the fallback address computed through the outermost XBEGIN instruction. The EAX register is updated to reflect an XABORT instruction caused the abort, and the imm8 argument will be provided in bits 31:24 of EAX. Operation XABORT IF RTM_ACTIVE = 0 THEN Treat as NOP; ELSE GOTO RTM_ABORT_PROCESSING; FI; (* For any RTM abort condition encountered during RTM execution *) RTM_ABORT_PROCESSING: Restore architectural register state; Discard memory updates performed in transaction; Update EAX with status and XABORT argument; RTM_NEST_COUNT ← 0; RTM_ACTIVE ← 0; IF 64-bit Mode THEN RIP ← fallbackRIP; ELSE EIP ← fallbackEIP; FI; END Flags Affected None Intel C/C++ Compiler Intrinsic Equivalent XABORT: void _xabort( unsigned int); SIMD Floating-Point Exceptions None XABORT — Transactional Abort Vol. 2C 5-579 INSTRUCTION SET REFERENCE, V-Z Other Exceptions #UD CPUID.(EAX=7, ECX=0):EBX.RTM[bit 11] = 0. If LOCK prefix is used. 5-580 Vol. 
2C XABORT — Transactional Abort INSTRUCTION SET REFERENCE, V-Z XADD—Exchange and Add Opcode Instruction Op/ En 64-Bit Mode Compat/ Description Leg Mode 0F C0 /r XADD r/m8, r8 MR Valid Valid Exchange r8 and r/m8; load sum into r/m8. REX + 0F C0 /r XADD r/m8*, r8* MR Valid N.E. Exchange r8 and r/m8; load sum into r/m8. 0F C1 /r XADD r/m16, r16 MR Valid Valid Exchange r16 and r/m16; load sum into r/m16. 0F C1 /r XADD r/m32, r32 MR Valid Valid Exchange r32 and r/m32; load sum into r/m32. REX.W + 0F C1 /r XADD r/m64, r64 MR Valid N.E. Exchange r64 and r/m64; load sum into r/m64. NOTES: * In 64-bit mode, r/m8 can not be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH. Instruction Operand Encoding Op/En Operand 1 Operand 2 Operand 3 Operand 4 MR ModRM:r/m (r, w) ModRM:reg (r, w) NA NA Description Exchanges the first operand (destination operand) with the second operand (source operand), then loads the sum of the two values into the destination operand. The destination operand can be a register or a memory location; the source operand is a register. In 64-bit mode, the instruction’s default operation size is 32 bits. Using a REX prefix in the form of REX.R permits access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits. This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. IA-32 Architecture Compatibility IA-32 processors earlier than the Intel486 processor do not recognize this instruction. If this instruction is used, you should provide an equivalent code sequence that runs on earlier processors. Operation TEMP ← SRC + DEST; SRC ← DEST; DEST ← TEMP; Flags Affected The CF, PF, AF, SF, ZF, and OF flags are set according to the result of the addition, which is stored in the destination operand. Protected Mode Exceptions #GP(0) If the destination is located in a non-writable segment. If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. If the DS, ES, FS, or GS register contains a NULL segment selector. #SS(0) If a memory operand effective address is outside the SS segment limit. #PF(fault-code) If a page fault occurs. #AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3. #UD If the LOCK prefix is used but the destination is not a memory operand. XADD—Exchange and Add Vol. 2C 5-581 INSTRUCTION SET REFERENCE, V-Z Real-Address Mode Exceptions #GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. #SS If a memory operand effective address is outside the SS segment limit. #UD If the LOCK prefix is used but the destination is not a memory operand. Virtual-8086 Mode Exceptions #GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. #SS(0) If a memory operand effective address is outside the SS segment limit. #PF(fault-code) If a page fault occurs. #AC(0) If alignment checking is enabled and an unaligned memory reference is made. #UD If the LOCK prefix is used but the destination is not a memory operand. Compatibility Mode Exceptions Same exceptions as in protected mode. 64-Bit Mode Exceptions #SS(0) If a memory address referencing the SS segment is in a non-canonical form. #GP(0) If the memory address is in a non-canonical form. #PF(fault-code) If a page fault occurs. 
#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3. #UD If the LOCK prefix is used but the destination is not a memory operand. 5-582 Vol. 2C XADD—Exchange and Add INSTRUCTION SET REFERENCE, V-Z XBEGIN — Transactional Begin Opcode/Instruction Op/ En A 64/32bit Mode Support V/V CPUID Feature Flag RTM C7 F8 XBEGIN rel16 C7 F8 XBEGIN rel32 A V/V RTM Description Specifies the start of an RTM region. Provides a 16-bit relative offset to compute the address of the fallback instruction address at which execution resumes following an RTM abort. Specifies the start of an RTM region. Provides a 32-bit relative offset to compute the address of the fallback instruction address at which execution resumes following an RTM abort. Instruction Operand Encoding Op/En Operand 1 Operand2 Operand3 Operand4 A Offset NA NA NA Description The XBEGIN instruction specifies the start of an RTM code region. If the logical processor was not already in transactional execution, then the XBEGIN instruction causes the logical processor to transition into transactional execution. The XBEGIN instruction that transitions the logical processor into transactional execution is referred to as the outermost XBEGIN instruction. The instruction also specifies a relative offset to compute the address of the fallback code path following a transactional abort. On an RTM abort, the logical processor discards all architectural register and memory updates performed during the RTM execution and restores architectural state to that corresponding to the outermost XBEGIN instruction. The fallback address following an abort is computed from the outermost XBEGIN instruction. Operation XBEGIN IF RTM_NEST_COUNT < MAX_RTM_NEST_COUNT THEN RTM_NEST_COUNT++ IF RTM_NEST_COUNT = 1 THEN IF 64-bit Mode THEN fallbackRIP ← RIP + SignExtend64(IMM) (* RIP is instruction following XBEGIN instruction *) ELSE fallbackEIP ← EIP + SignExtend32(IMM) (* EIP is instruction following XBEGIN instruction *) FI; IF (64-bit mode) THEN IF (fallbackRIP is not canonical) THEN #GP(0) FI; ELSE IF (fallbackEIP outside code segment limit) THEN #GP(0) FI; FI; RTM_ACTIVE ← 1 Enter RTM Execution (* record register state, start tracking memory state*) FI; (* RTM_NEST_COUNT = 1 *) XBEGIN — Transactional Begin Vol. 2C 5-583 INSTRUCTION SET REFERENCE, V-Z ELSE (* RTM_NEST_COUNT = MAX_RTM_NEST_COUNT *) GOTO RTM_ABORT_PROCESSING FI; (* For any RTM abort condition encountered during RTM execution *) RTM_ABORT_PROCESSING: Restore architectural register state Discard memory updates performed in transaction Update EAX with status RTM_NEST_COUNT ← 0 RTM_ACTIVE ← 0 IF 64-bit mode THEN RIP ← fallbackRIP ELSE EIP ← fallbackEIP FI; END Flags Affected None Intel C/C++ Compiler Intrinsic Equivalent XBEGIN: unsigned int _xbegin( void ); SIMD Floating-Point Exceptions None Protected Mode Exceptions #UD CPUID.(EAX=7, ECX=0):EBX.RTM[bit 11]=0. If LOCK prefix is used. #GP(0) If the fallback address is outside the CS segment. Real-Address Mode Exceptions #GP(0) #UD If the fallback address is outside the address space 0000H and FFFFH. CPUID.(EAX=7, ECX=0):EBX.RTM[bit 11]=0. If LOCK prefix is used. Virtual-8086 Mode Exceptions #GP(0) #UD If the fallback address is outside the address space 0000H and FFFFH. CPUID.(EAX=7, ECX=0):EBX.RTM[bit 11]=0. If LOCK prefix is used. Compatibility Mode Exceptions Same exceptions as in protected mode. 5-584 Vol. 
XCHG—Exchange Register/Memory with Register

Opcode Instruction Op/En 64-Bit Mode Compat/Leg Mode Description
90+rw XCHG AX, r16 O Valid Valid Exchange r16 with AX.
90+rw XCHG r16, AX O Valid Valid Exchange AX with r16.
90+rd XCHG EAX, r32 O Valid Valid Exchange r32 with EAX.
REX.W + 90+rd XCHG RAX, r64 O Valid N.E. Exchange r64 with RAX.
90+rd XCHG r32, EAX O Valid Valid Exchange EAX with r32.
REX.W + 90+rd XCHG r64, RAX O Valid N.E. Exchange RAX with r64.
86 /r XCHG r/m8, r8 MR Valid Valid Exchange r8 (byte register) with byte from r/m8.
REX + 86 /r XCHG r/m8*, r8* MR Valid N.E. Exchange r8 (byte register) with byte from r/m8.
86 /r XCHG r8, r/m8 RM Valid Valid Exchange byte from r/m8 with r8 (byte register).
REX + 86 /r XCHG r8*, r/m8* RM Valid N.E. Exchange byte from r/m8 with r8 (byte register).
87 /r XCHG r/m16, r16 MR Valid Valid Exchange r16 with word from r/m16.
87 /r XCHG r16, r/m16 RM Valid Valid Exchange word from r/m16 with r16.
87 /r XCHG r/m32, r32 MR Valid Valid Exchange r32 with doubleword from r/m32.
REX.W + 87 /r XCHG r/m64, r64 MR Valid N.E. Exchange r64 with quadword from r/m64.
87 /r XCHG r32, r/m32 RM Valid Valid Exchange doubleword from r/m32 with r32.
REX.W + 87 /r XCHG r64, r/m64 RM Valid N.E. Exchange quadword from r/m64 with r64.
NOTES: * In 64-bit mode, r/m8 cannot be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.

Instruction Operand Encoding
Op/En Operand 1 Operand 2 Operand 3 Operand 4
O AX/EAX/RAX (r, w) opcode + rd (r, w) NA NA
O opcode + rd (r, w) AX/EAX/RAX (r, w) NA NA
MR ModRM:r/m (r, w) ModRM:reg (r) NA NA
RM ModRM:reg (w) ModRM:r/m (r) NA NA

Description
Exchanges the contents of the destination (first) and source (second) operands. The operands can be two general-purpose registers or a register and a memory location. If a memory operand is referenced, the processor's locking protocol is automatically implemented for the duration of the exchange operation, regardless of the presence or absence of the LOCK prefix or of the value of the IOPL. (See the LOCK prefix description in this chapter for more information on the locking protocol.)
This instruction is useful for implementing semaphores or similar data structures for process synchronization. (See "Bus Locking" in Chapter 8 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A, for more information on bus locking.)
The XCHG instruction can also be used instead of the BSWAP instruction for 16-bit operands.
In 64-bit mode, the instruction's default operation size is 32 bits. Using a REX prefix in the form of REX.R permits access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.

NOTE
XCHG (E)AX, (E)AX (encoded instruction byte is 90H) is an alias for NOP regardless of data size prefixes, including REX.W.

Operation
TEMP ← DEST;
DEST ← SRC;
SRC ← TEMP;

Flags Affected
None.

Protected Mode Exceptions
#GP(0) If either operand is in a non-writable segment. If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. If the DS, ES, FS, or GS register contains a NULL segment selector.
#SS(0) If a memory operand effective address is outside the SS segment limit.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.
#UD If the LOCK prefix is used but the destination is not a memory operand.

Real-Address Mode Exceptions
#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS If a memory operand effective address is outside the SS segment limit.
#UD If the LOCK prefix is used but the destination is not a memory operand.

Virtual-8086 Mode Exceptions
#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS(0) If a memory operand effective address is outside the SS segment limit.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made.
#UD If the LOCK prefix is used but the destination is not a memory operand.

Compatibility Mode Exceptions
Same exceptions as in protected mode.

64-Bit Mode Exceptions
#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.
#UD If the LOCK prefix is used but the destination is not a memory operand.
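Example
This C sketch is an editor-added illustration, not manual text. The __atomic_exchange_n builtin (a GCC/Clang facility, assumed here) compiles to XCHG with a memory operand, which is locked implicitly as the Description notes; that makes it suitable for a simple test-and-set spinlock.

    static volatile int lock_word;   /* 0 = free, 1 = held */

    void acquire(void)
    {
        /* XCHG atomically writes 1 and returns the previous value;
           spin until the previous value was 0 (lock was free). */
        while (__atomic_exchange_n(&lock_word, 1, __ATOMIC_ACQUIRE) != 0)
            ;
    }

    void release(void)
    {
        __atomic_store_n(&lock_word, 0, __ATOMIC_RELEASE);
    }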
XEND — Transactional End

Opcode/Instruction Op/En 64/32bit Mode Support CPUID Feature Flag Description
0F 01 D5 XEND A V/V RTM Specifies the end of an RTM code region.

Instruction Operand Encoding
Op/En Operand 1 Operand 2 Operand 3 Operand 4
A NA NA NA NA

Description
The instruction marks the end of an RTM code region. If this corresponds to the outermost scope (that is, including this XEND instruction, the number of XBEGIN instructions is the same as the number of XEND instructions), the logical processor will attempt to commit the logical processor state atomically. If the commit fails, the logical processor will rollback all architectural register and memory updates performed during the RTM execution. The logical processor will resume execution at the fallback address computed from the outermost XBEGIN instruction. The EAX register is updated to reflect RTM abort information.
XEND executed outside a transactional region will cause a #GP (General Protection Fault).

Operation
XEND
IF (RTM_ACTIVE = 0) THEN
  SIGNAL #GP
ELSE
  RTM_NEST_COUNT--
  IF (RTM_NEST_COUNT = 0) THEN
    Try to commit transaction
    IF fail to commit transactional execution
      THEN GOTO RTM_ABORT_PROCESSING;
      ELSE (* commit success *) RTM_ACTIVE ← 0
    FI;
  FI;
FI;
(* For any RTM abort condition encountered during RTM execution *)
RTM_ABORT_PROCESSING:
  Restore architectural register state
  Discard memory updates performed in transaction
  Update EAX with status
  RTM_NEST_COUNT ← 0
  RTM_ACTIVE ← 0
  IF 64-bit Mode
    THEN RIP ← fallbackRIP
    ELSE EIP ← fallbackEIP
  FI;
END

Flags Affected
None

Intel C/C++ Compiler Intrinsic Equivalent
XEND: void _xend( void );

SIMD Floating-Point Exceptions
None

Other Exceptions
#UD CPUID.(EAX=7, ECX=0):EBX.RTM[bit 11] = 0. If LOCK or 66H or F2H or F3H prefix is used.
#GP(0) If RTM_ACTIVE = 0.
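Example
The retry loop below is an editor-added sketch, not manual text. It assumes GCC/Clang <immintrin.h>, whose _XABORT_RETRY macro names the EAX abort-status bit indicating that the transaction may succeed on retry; transactional_work is hypothetical.

    #include <immintrin.h>

    extern void transactional_work(void);

    int try_transactionally(int max_retries)
    {
        for (int i = 0; i < max_retries; i++) {
            unsigned int status = _xbegin();
            if (status == _XBEGIN_STARTED) {
                transactional_work();
                _xend();                     /* commit the outermost region */
                return 1;
            }
            if (!(status & _XABORT_RETRY))   /* abort status says retrying is futile */
                break;
        }
        return 0;                            /* caller should fall back to a lock */
    }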
XGETBV—Get Value of Extended Control Register

Opcode Instruction Op/En 64-Bit Mode Compat/Leg Mode Description
0F 01 D0 XGETBV NP Valid Valid Reads an XCR specified by ECX into EDX:EAX.

Instruction Operand Encoding
Op/En Operand 1 Operand 2 Operand 3 Operand 4
NP NA NA NA NA

Description
Reads the contents of the extended control register (XCR) specified in the ECX register into registers EDX:EAX. (On processors that support the Intel 64 architecture, the high-order 32 bits of RCX are ignored.) The EDX register is loaded with the high-order 32 bits of the XCR and the EAX register is loaded with the low-order 32 bits. (On processors that support the Intel 64 architecture, the high-order 32 bits of each of RAX and RDX are cleared.) If fewer than 64 bits are implemented in the XCR being read, the values returned to EDX:EAX in unimplemented bit locations are undefined.
XCR0 is supported on any processor that supports the XGETBV instruction. If CPUID.(EAX=0DH,ECX=1):EAX.XG1[bit 2] = 1, executing XGETBV with ECX = 1 returns in EDX:EAX the logical AND of XCR0 and the current value of the XINUSE state-component bitmap. This allows software to discover the state of the init optimization used by XSAVEOPT and XSAVES. See Chapter 13, “Managing State Using the XSAVE Feature Set,” in Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.
Use of any other value for ECX results in a general-protection (#GP) exception.

Operation
EDX:EAX ← XCR[ECX];

Flags Affected
None.

Intel C/C++ Compiler Intrinsic Equivalent
XGETBV: unsigned __int64 _xgetbv( unsigned int);

Protected Mode Exceptions
#GP(0) If an invalid XCR is specified in ECX (includes ECX = 1 if CPUID.(EAX=0DH,ECX=1):EAX.XG1[bit 2] = 0).
#UD If CPUID.01H:ECX.XSAVE[bit 26] = 0. If CR4.OSXSAVE[bit 18] = 0. If the LOCK prefix is used. If 66H, F3H or F2H prefix is used.

Real-Address Mode Exceptions
#GP(0) If an invalid XCR is specified in ECX (includes ECX = 1 if CPUID.(EAX=0DH,ECX=1):EAX.XG1[bit 2] = 0).
#UD If CPUID.01H:ECX.XSAVE[bit 26] = 0. If CR4.OSXSAVE[bit 18] = 0. If the LOCK prefix is used. If 66H, F3H or F2H prefix is used.

Virtual-8086 Mode Exceptions
Same exceptions as in protected mode.

Compatibility Mode Exceptions
Same exceptions as in protected mode.

64-Bit Mode Exceptions
Same exceptions as in protected mode.
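Example
The following C sketch is editor-added. It uses the _xgetbv intrinsic listed above with ECX = 0 to select XCR0; real code must first confirm CPUID.01H:ECX.OSXSAVE[bit 27] = 1 before executing XGETBV, since the instruction raises #UD otherwise.

    #include <immintrin.h>

    /* Returns nonzero if the OS has enabled both SSE (XCR0[1]) and AVX (XCR0[2])
       state management, the usual precondition for executing AVX instructions. */
    int os_enabled_avx_state(void)
    {
        unsigned long long xcr0 = _xgetbv(0);
        return (xcr0 & 0x6) == 0x6;
    }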
XLAT/XLATB—Table Look-up Translation

Opcode Instruction Op/En 64-Bit Mode Compat/Leg Mode Description
D7 XLAT m8 NP Valid Valid Set AL to memory byte DS:[(E)BX + unsigned AL].
D7 XLATB NP Valid Valid Set AL to memory byte DS:[(E)BX + unsigned AL].
REX.W + D7 XLATB NP Valid N.E. Set AL to memory byte [RBX + unsigned AL].

Instruction Operand Encoding
Op/En Operand 1 Operand 2 Operand 3 Operand 4
NP NA NA NA NA

Description
Locates a byte entry in a table in memory, using the contents of the AL register as a table index, then copies the contents of the table entry back into the AL register. The index in the AL register is treated as an unsigned integer. The XLAT and XLATB instructions get the base address of the table in memory from either the DS:EBX or the DS:BX registers (depending on the address-size attribute of the instruction, 32 or 16, respectively). (The DS segment may be overridden with a segment override prefix.)
At the assembly-code level, two forms of this instruction are allowed: the “explicit-operand” form and the “no-operand” form. The explicit-operand form (specified with the XLAT mnemonic) allows the base address of the table to be specified explicitly with a symbol. This explicit-operands form is provided to allow documentation; however, note that the documentation provided by this form can be misleading. That is, the symbol does not have to specify the correct base address. The base address is always specified by the DS:(E)BX registers, which must be loaded correctly before the XLAT instruction is executed.
The no-operands form (XLATB) provides a “short form” of the XLAT instructions. Here also the processor assumes that the DS:(E)BX registers contain the base address of the table.
In 64-bit mode, operation is similar to that in legacy or compatibility mode. AL is used to specify the table index (the operand size is fixed at 8 bits). RBX, however, is used to specify the table’s base address. See the summary chart at the beginning of this section for encoding data and limits.

Operation
IF AddressSize = 16
  THEN AL ← (DS:BX + ZeroExtend(AL));
ELSE IF (AddressSize = 32)
  THEN AL ← (DS:EBX + ZeroExtend(AL));
ELSE (* AddressSize = 64 *)
  AL ← (RBX + ZeroExtend(AL));
FI;

Flags Affected
None.

Protected Mode Exceptions
#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. If the DS, ES, FS, or GS register contains a NULL segment selector.
#SS(0) If a memory operand effective address is outside the SS segment limit.
#PF(fault-code) If a page fault occurs.
#UD If the LOCK prefix is used.

Real-Address Mode Exceptions
#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS If a memory operand effective address is outside the SS segment limit.
#UD If the LOCK prefix is used.

Virtual-8086 Mode Exceptions
#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS(0) If a memory operand effective address is outside the SS segment limit.
#PF(fault-code) If a page fault occurs.
#UD If the LOCK prefix is used.

Compatibility Mode Exceptions
Same exceptions as in protected mode.

64-Bit Mode Exceptions
#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
#PF(fault-code) If a page fault occurs.
#UD If the LOCK prefix is used.
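Example
The few lines of C below are an editor-added functional model of XLATB, not manual text: AL is an unsigned index into a 256-byte table whose base address the program keeps in (R/E)BX.

    /* AL ← [(R/E)BX + ZeroExtend(AL)] */
    unsigned char xlatb_model(const unsigned char table[256], unsigned char al)
    {
        return table[al];
    }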
XOR—Logical Exclusive OR

Opcode Instruction Op/En 64-Bit Mode Compat/Leg Mode Description
34 ib XOR AL, imm8 I Valid Valid AL XOR imm8.
35 iw XOR AX, imm16 I Valid Valid AX XOR imm16.
35 id XOR EAX, imm32 I Valid Valid EAX XOR imm32.
REX.W + 35 id XOR RAX, imm32 I Valid N.E. RAX XOR imm32 (sign-extended).
80 /6 ib XOR r/m8, imm8 MI Valid Valid r/m8 XOR imm8.
REX + 80 /6 ib XOR r/m8*, imm8 MI Valid N.E. r/m8 XOR imm8.
81 /6 iw XOR r/m16, imm16 MI Valid Valid r/m16 XOR imm16.
81 /6 id XOR r/m32, imm32 MI Valid Valid r/m32 XOR imm32.
REX.W + 81 /6 id XOR r/m64, imm32 MI Valid N.E. r/m64 XOR imm32 (sign-extended).
83 /6 ib XOR r/m16, imm8 MI Valid Valid r/m16 XOR imm8 (sign-extended).
83 /6 ib XOR r/m32, imm8 MI Valid Valid r/m32 XOR imm8 (sign-extended).
REX.W + 83 /6 ib XOR r/m64, imm8 MI Valid N.E. r/m64 XOR imm8 (sign-extended).
30 /r XOR r/m8, r8 MR Valid Valid r/m8 XOR r8.
REX + 30 /r XOR r/m8*, r8* MR Valid N.E. r/m8 XOR r8.
31 /r XOR r/m16, r16 MR Valid Valid r/m16 XOR r16.
31 /r XOR r/m32, r32 MR Valid Valid r/m32 XOR r32.
REX.W + 31 /r XOR r/m64, r64 MR Valid N.E. r/m64 XOR r64.
32 /r XOR r8, r/m8 RM Valid Valid r8 XOR r/m8.
REX + 32 /r XOR r8*, r/m8* RM Valid N.E. r8 XOR r/m8.
33 /r XOR r16, r/m16 RM Valid Valid r16 XOR r/m16.
33 /r XOR r32, r/m32 RM Valid Valid r32 XOR r/m32.
REX.W + 33 /r XOR r64, r/m64 RM Valid N.E. r64 XOR r/m64.
NOTES: * In 64-bit mode, r/m8 cannot be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.

Instruction Operand Encoding
Op/En Operand 1 Operand 2 Operand 3 Operand 4
I AL/AX/EAX/RAX imm8/16/32 NA NA
MI ModRM:r/m (r, w) imm8/16/32 NA NA
MR ModRM:r/m (r, w) ModRM:reg (r) NA NA
RM ModRM:reg (r, w) ModRM:r/m (r) NA NA

Description
Performs a bitwise exclusive OR (XOR) operation on the destination (first) and source (second) operands and stores the result in the destination operand location. The source operand can be an immediate, a register, or a memory location; the destination operand can be a register or a memory location. (However, two memory operands cannot be used in one instruction.) Each bit of the result is 1 if the corresponding bits of the operands are different; each bit is 0 if the corresponding bits are the same.
This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically.
In 64-bit mode, using a REX prefix in the form of REX.R permits access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.

Operation
DEST ← DEST XOR SRC;

Flags Affected
The OF and CF flags are cleared; the SF, ZF, and PF flags are set according to the result. The state of the AF flag is undefined.

Protected Mode Exceptions
#GP(0) If the destination operand points to a non-writable segment. If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. If the DS, ES, FS, or GS register contains a NULL segment selector.
#SS(0) If a memory operand effective address is outside the SS segment limit.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.
#UD If the LOCK prefix is used but the destination is not a memory operand.

Real-Address Mode Exceptions
#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS If a memory operand effective address is outside the SS segment limit.
#UD If the LOCK prefix is used but the destination is not a memory operand.

Virtual-8086 Mode Exceptions
#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS(0) If a memory operand effective address is outside the SS segment limit.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made.
#UD If the LOCK prefix is used but the destination is not a memory operand.

Compatibility Mode Exceptions
Same exceptions as in protected mode.

64-Bit Mode Exceptions
#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.
#UD If the LOCK prefix is used but the destination is not a memory operand.
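Example
This editor-added C sketch illustrates the most common XOR idioms: toggling selected bits, and the fact that any value XORed with itself is 0, which is why compilers emit XOR reg, reg to clear a register (also clearing CF and OF, as Flags Affected notes).

    /* Flip exactly the bits selected by mask; bits where mask is 0 are unchanged. */
    unsigned int toggle_bits(unsigned int v, unsigned int mask)
    {
        return v ^ mask;   /* compiles to an XOR instruction */
    }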
XORPD—Bitwise Logical XOR of Packed Double Precision Floating-Point Values

Opcode/Instruction Op/En 64/32 bit Mode Support CPUID Feature Flag Description
66 0F 57 /r XORPD xmm1, xmm2/m128 RM V/V SSE2 Return the bitwise logical XOR of packed double-precision floating-point values in xmm1 and xmm2/mem.
VEX.NDS.128.66.0F.WIG 57 /r VXORPD xmm1, xmm2, xmm3/m128 RVM V/V AVX Return the bitwise logical XOR of packed double-precision floating-point values in xmm2 and xmm3/mem.
VEX.NDS.256.66.0F.WIG 57 /r VXORPD ymm1, ymm2, ymm3/m256 RVM V/V AVX Return the bitwise logical XOR of packed double-precision floating-point values in ymm2 and ymm3/mem.
EVEX.NDS.128.66.0F.W1 57 /r VXORPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst FV V/V AVX512VL AVX512DQ Return the bitwise logical XOR of packed double-precision floating-point values in xmm2 and xmm3/m128/m64bcst subject to writemask k1.
EVEX.NDS.256.66.0F.W1 57 /r VXORPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst FV V/V AVX512VL AVX512DQ Return the bitwise logical XOR of packed double-precision floating-point values in ymm2 and ymm3/m256/m64bcst subject to writemask k1.
EVEX.NDS.512.66.0F.W1 57 /r VXORPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst FV V/V AVX512DQ Return the bitwise logical XOR of packed double-precision floating-point values in zmm2 and zmm3/m512/m64bcst subject to writemask k1.

Instruction Operand Encoding
Op/En Operand 1 Operand 2 Operand 3 Operand 4
RM ModRM:reg (r, w) ModRM:r/m (r) NA NA
RVM ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) NA
FV ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA

Description
Performs a bitwise logical XOR of the two, four or eight packed double-precision floating-point values from the first source operand and the second source operand, and stores the result in the destination operand.
EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand can be a ZMM register or a vector memory location. The destination operand is a ZMM register conditionally updated with writemask k1.
VEX.256 and EVEX.256 encoded versions: The first source operand is a YMM register. The second source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register (conditionally updated with writemask k1 in case of EVEX). The upper bits (MAX_VL-1:256) of the corresponding ZMM register destination are zeroed.
VEX.128 and EVEX.128 encoded versions: The first source operand is an XMM register. The second source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register (conditionally updated with writemask k1 in case of EVEX). The upper bits (MAX_VL-1:128) of the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAX_VL-1:128) of the corresponding register destination are unmodified.

Operation
VXORPD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
  i ← j * 64
  IF k1[j] OR *no writemask*
    THEN
      IF (EVEX.b == 1) AND (SRC2 *is memory*)
        THEN DEST[i+63:i] ← SRC1[i+63:i] BITWISE XOR SRC2[63:0];
        ELSE DEST[i+63:i] ← SRC1[i+63:i] BITWISE XOR SRC2[i+63:i];
      FI;
    ELSE
      IF *merging-masking* (* merging-masking *)
        THEN *DEST[i+63:i] remains unchanged*
        ELSE (* zeroing-masking *) DEST[i+63:i] ← 0
      FI
  FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VXORPD (VEX.256 encoded version)
DEST[63:0] ← SRC1[63:0] BITWISE XOR SRC2[63:0]
DEST[127:64] ← SRC1[127:64] BITWISE XOR SRC2[127:64]
DEST[191:128] ← SRC1[191:128] BITWISE XOR SRC2[191:128]
DEST[255:192] ← SRC1[255:192] BITWISE XOR SRC2[255:192]
DEST[MAX_VL-1:256] ← 0

VXORPD (VEX.128 encoded version)
DEST[63:0] ← SRC1[63:0] BITWISE XOR SRC2[63:0]
DEST[127:64] ← SRC1[127:64] BITWISE XOR SRC2[127:64]
DEST[MAX_VL-1:128] ← 0

XORPD (128-bit Legacy SSE version)
DEST[63:0] ← DEST[63:0] BITWISE XOR SRC[63:0]
DEST[127:64] ← DEST[127:64] BITWISE XOR SRC[127:64]
DEST[MAX_VL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VXORPD __m512d _mm512_xor_pd (__m512d a, __m512d b);
VXORPD __m512d _mm512_mask_xor_pd (__m512d src, __mmask8 m, __m512d a, __m512d b);
VXORPD __m512d _mm512_maskz_xor_pd (__mmask8 m, __m512d a, __m512d b);
VXORPD __m256d _mm256_xor_pd (__m256d a, __m256d b);
VXORPD __m256d _mm256_mask_xor_pd (__m256d src, __mmask8 m, __m256d a, __m256d b);
VXORPD __m256d _mm256_maskz_xor_pd (__mmask8 m, __m256d a, __m256d b);
XORPD __m128d _mm_xor_pd (__m128d a, __m128d b);
VXORPD __m128d _mm_mask_xor_pd (__m128d src, __mmask8 m, __m128d a, __m128d b);
VXORPD __m128d _mm_maskz_xor_pd (__mmask8 m, __m128d a, __m128d b);

SIMD Floating-Point Exceptions
None

Other Exceptions
Non-EVEX-encoded instructions, see Exceptions Type 4.
EVEX-encoded instructions, see Exceptions Type E4.
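Example
The sketch below is editor-added. It uses the _mm_xor_pd intrinsic listed above; XORing with the 8000000000000000H sign-bit mask (written portably as -0.0) negates both packed doubles, a classic XORPD use.

    #include <emmintrin.h>   /* SSE2 */

    __m128d negate_pd(__m128d x)
    {
        /* XORPD with the packed sign-bit mask flips only bit 63 of each element. */
        return _mm_xor_pd(x, _mm_set1_pd(-0.0));
    }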
XORPS—Bitwise Logical XOR of Packed Single Precision Floating-Point Values

Opcode/Instruction Op/En 64/32 bit Mode Support CPUID Feature Flag Description
0F 57 /r XORPS xmm1, xmm2/m128 RM V/V SSE Return the bitwise logical XOR of packed single-precision floating-point values in xmm1 and xmm2/mem.
VEX.NDS.128.0F.WIG 57 /r VXORPS xmm1, xmm2, xmm3/m128 RVM V/V AVX Return the bitwise logical XOR of packed single-precision floating-point values in xmm2 and xmm3/mem.
VEX.NDS.256.0F.WIG 57 /r VXORPS ymm1, ymm2, ymm3/m256 RVM V/V AVX Return the bitwise logical XOR of packed single-precision floating-point values in ymm2 and ymm3/mem.
EVEX.NDS.128.0F.W0 57 /r VXORPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst FV V/V AVX512VL AVX512DQ Return the bitwise logical XOR of packed single-precision floating-point values in xmm2 and xmm3/m128/m32bcst subject to writemask k1.
EVEX.NDS.256.0F.W0 57 /r VXORPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst FV V/V AVX512VL AVX512DQ Return the bitwise logical XOR of packed single-precision floating-point values in ymm2 and ymm3/m256/m32bcst subject to writemask k1.
EVEX.NDS.512.0F.W0 57 /r VXORPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst FV V/V AVX512DQ Return the bitwise logical XOR of packed single-precision floating-point values in zmm2 and zmm3/m512/m32bcst subject to writemask k1.

Instruction Operand Encoding
Op/En Operand 1 Operand 2 Operand 3 Operand 4
RM ModRM:reg (r, w) ModRM:r/m (r) NA NA
RVM ModRM:reg (w) VEX.vvvv (r) ModRM:r/m (r) NA
FV ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) NA

Description
Performs a bitwise logical XOR of the four, eight or sixteen packed single-precision floating-point values from the first source operand and the second source operand, and stores the result in the destination operand.
EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand can be a ZMM register or a vector memory location. The destination operand is a ZMM register conditionally updated with writemask k1.
VEX.256 and EVEX.256 encoded versions: The first source operand is a YMM register. The second source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register (conditionally updated with writemask k1 in case of EVEX). The upper bits (MAX_VL-1:256) of the corresponding ZMM register destination are zeroed.
VEX.128 and EVEX.128 encoded versions: The first source operand is an XMM register. The second source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register (conditionally updated with writemask k1 in case of EVEX). The upper bits (MAX_VL-1:128) of the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAX_VL-1:128) of the corresponding register destination are unmodified.

Operation
VXORPS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
  i ← j * 32
  IF k1[j] OR *no writemask*
    THEN
      IF (EVEX.b == 1) AND (SRC2 *is memory*)
        THEN DEST[i+31:i] ← SRC1[i+31:i] BITWISE XOR SRC2[31:0];
        ELSE DEST[i+31:i] ← SRC1[i+31:i] BITWISE XOR SRC2[i+31:i];
      FI;
    ELSE
      IF *merging-masking* (* merging-masking *)
        THEN *DEST[i+31:i] remains unchanged*
        ELSE (* zeroing-masking *) DEST[i+31:i] ← 0
      FI
  FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VXORPS (VEX.256 encoded version)
DEST[31:0] ← SRC1[31:0] BITWISE XOR SRC2[31:0]
DEST[63:32] ← SRC1[63:32] BITWISE XOR SRC2[63:32]
DEST[95:64] ← SRC1[95:64] BITWISE XOR SRC2[95:64]
DEST[127:96] ← SRC1[127:96] BITWISE XOR SRC2[127:96]
DEST[159:128] ← SRC1[159:128] BITWISE XOR SRC2[159:128]
DEST[191:160] ← SRC1[191:160] BITWISE XOR SRC2[191:160]
DEST[223:192] ← SRC1[223:192] BITWISE XOR SRC2[223:192]
DEST[255:224] ← SRC1[255:224] BITWISE XOR SRC2[255:224]
DEST[MAX_VL-1:256] ← 0

VXORPS (VEX.128 encoded version)
DEST[31:0] ← SRC1[31:0] BITWISE XOR SRC2[31:0]
DEST[63:32] ← SRC1[63:32] BITWISE XOR SRC2[63:32]
DEST[95:64] ← SRC1[95:64] BITWISE XOR SRC2[95:64]
DEST[127:96] ← SRC1[127:96] BITWISE XOR SRC2[127:96]
DEST[MAX_VL-1:128] ← 0

XORPS (128-bit Legacy SSE version)
DEST[31:0] ← SRC1[31:0] BITWISE XOR SRC2[31:0]
DEST[63:32] ← SRC1[63:32] BITWISE XOR SRC2[63:32]
DEST[95:64] ← SRC1[95:64] BITWISE XOR SRC2[95:64]
DEST[127:96] ← SRC1[127:96] BITWISE XOR SRC2[127:96]
DEST[MAX_VL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VXORPS __m512 _mm512_xor_ps (__m512 a, __m512 b);
VXORPS __m512 _mm512_mask_xor_ps (__m512 src, __mmask16 m, __m512 a, __m512 b);
VXORPS __m512 _mm512_maskz_xor_ps (__mmask16 m, __m512 a, __m512 b);
VXORPS __m256 _mm256_xor_ps (__m256 a, __m256 b);
VXORPS __m256 _mm256_mask_xor_ps (__m256 src, __mmask8 m, __m256 a, __m256 b);
VXORPS __m256 _mm256_maskz_xor_ps (__mmask8 m, __m256 a, __m256 b);
XORPS __m128 _mm_xor_ps (__m128 a, __m128 b);
VXORPS __m128 _mm_mask_xor_ps (__m128 src, __mmask8 m, __m128 a, __m128 b);
VXORPS __m128 _mm_maskz_xor_ps (__mmask8 m, __m128 a, __m128 b);

SIMD Floating-Point Exceptions
None

Other Exceptions
Non-EVEX-encoded instructions, see Exceptions Type 4.
EVEX-encoded instructions, see Exceptions Type E4.

XRSTOR—Restore Processor Extended States

Opcode Instruction Op/En 64-Bit Mode Compat/Leg Mode Description
0F AE /5 XRSTOR mem M Valid Valid Restore state components specified by EDX:EAX from mem.
REX.W+ 0F AE /5 XRSTOR64 mem M Valid N.E. Restore state components specified by EDX:EAX from mem.

Instruction Operand Encoding
Op/En Operand 1 Operand 2 Operand 3 Operand 4
M ModRM:r/m (r) NA NA NA

Description
Performs a full or partial restore of processor state components from the XSAVE area located at the memory address specified by the source operand. The implicit EDX:EAX register pair specifies a 64-bit instruction mask. The specific state components restored correspond to the bits set in the requested-feature bitmap (RFBM), which is the logical-AND of EDX:EAX and XCR0.
The format of the XSAVE area is detailed in Section 13.4, “XSAVE Area,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.
Section 13.8, “Operation of XRSTOR,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 provides a detailed description of the operation of the XRSTOR instruction. The following items provide a high-level outline:
• Execution of XRSTOR may take one of two forms: standard and compacted. Bit 63 of the XCOMP_BV field in the XSAVE header determines which form is used: value 0 specifies the standard form, while value 1 specifies the compacted form.
• If RFBM[i] = 0, XRSTOR does not update state component i (see Note 1 below).
• If RFBM[i] = 1 and bit i is clear in the XSTATE_BV field in the XSAVE header, XRSTOR initializes state component i.
• If RFBM[i] = 1 and XSTATE_BV[i] = 1, XRSTOR loads state component i from the XSAVE area.
• XRSTOR loads the internal value XRSTOR_INFO, which may be used to optimize a subsequent execution of XSAVEOPT or XSAVES.
• Immediately following an execution of XRSTOR, the processor tracks as in-use (not in initial configuration) any state component i for which RFBM[i] = 1 and XSTATE_BV[i] = 1; it tracks as modified any state component i for which RFBM[i] = 0.
NOTES: 1. There is an exception if RFBM[1] = 0 and RFBM[2] = 1. In this case, the standard form of XRSTOR will load MXCSR from memory, even though MXCSR is part of state component 1 — SSE. The compacted form of XRSTOR does not make this exception.
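Example
This editor-added sketch shows the EVEX masked form through the _mm512_mask_xor_ps intrinsic listed above (AVX512DQ support assumed, e.g. -mavx512dq). Elements whose k1 bit is 0 take src's value, matching the merging-masking branch of the Operation pseudocode.

    #include <immintrin.h>

    __m512 negate_ps_masked(__m512 src, __mmask16 k1, __m512 x)
    {
        /* VXORPS zmm {k1}: flip the sign bit only in lanes selected by k1. */
        return _mm512_mask_xor_ps(src, k1, x, _mm512_set1_ps(-0.0f));
    }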
The standard form of XRSTOR treats MXCSR (which is part of state component 1 — SSE) differently from the XMM registers. If either form attempts to load MXCSR with an illegal value, a general-protection exception (#GP) occurs.
Use of a source operand not aligned to 64-byte boundary (for 64-bit and 32-bit modes) results in a general-protection (#GP) exception. In 64-bit mode, the upper 32 bits of RDX and RAX are ignored.

Operation
RFBM ← XCR0 AND EDX:EAX; /* bitwise logical AND */
COMPMASK ← XCOMP_BV field from XSAVE header;
RSTORMASK ← XSTATE_BV field from XSAVE header;
IF in VMX non-root operation
  THEN VMXNR ← 1;
  ELSE VMXNR ← 0;
FI;
LAXA ← linear address of XSAVE area;
IF COMPMASK[63] = 0
THEN /* Standard form of XRSTOR */
  IF RFBM[0] = 1
    THEN IF RSTORMASK[0] = 1
      THEN load x87 state from legacy region of XSAVE area;
      ELSE initialize x87 state;
    FI;
  FI;
  IF RFBM[1] = 1
    THEN IF RSTORMASK[1] = 1
      THEN load XMM registers from legacy region of XSAVE area;
      ELSE set all XMM registers to 0;
    FI;
  FI;
  IF RFBM[2] = 1
    THEN IF RSTORMASK[2] = 1
      THEN load AVX state from extended region (standard format) of XSAVE area;
      ELSE initialize AVX state;
    FI;
  FI;
  IF RFBM[1] = 1 or RFBM[2] = 1
    THEN load MXCSR from legacy region of XSAVE area;
  FI;
ELSE /* Compacted form of XRSTOR */
  IF CPUID.(EAX=0DH,ECX=1):EAX.XSAVEC[bit 1] = 0
    THEN /* compacted form not supported */ #GP(0);
  FI;
  IF RFBM[0] = 1
    THEN IF RSTORMASK[0] = 1
      THEN load x87 state from legacy region of XSAVE area;
      ELSE initialize x87 state;
    FI;
  FI;
  IF RFBM[1] = 1
    THEN IF RSTORMASK[1] = 1
      THEN load SSE state from legacy region of XSAVE area;
      ELSE initialize SSE state;
    FI;
  FI;
  IF RFBM[2] = 1
    THEN IF RSTORMASK[2] = 1
      THEN load AVX state from extended region (compacted format) of XSAVE area;
      ELSE initialize AVX state;
    FI;
  FI;
FI;
XRSTOR_INFO ← CPL,VMXNR,LAXA,COMPMASK;

Flags Affected
None.

Intel C/C++ Compiler Intrinsic Equivalent
XRSTOR: void _xrstor( void * , unsigned __int64);
XRSTOR64: void _xrstor64( void * , unsigned __int64);

Protected Mode Exceptions
#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. If a memory operand is not aligned on a 64-byte boundary, regardless of segment. If bit 63 of the XCOMP_BV field of the XSAVE header is 1 and CPUID.(EAX=0DH,ECX=1):EAX.XSAVEC[bit 1] = 0. If the standard form is executed and a bit in XCR0 is 0 and the corresponding bit in the XSTATE_BV field of the XSAVE header is 1. If the standard form is executed and bytes 23:8 of the XSAVE header are not all zero. If the compacted form is executed and a bit in XCR0 is 0 and the corresponding bit in the XCOMP_BV field of the XSAVE header is 1. If the compacted form is executed and a bit in the XCOMP_BV field in the XSAVE header is 0 and the corresponding bit in the XSTATE_BV field is 1. If the compacted form is executed and bytes 63:16 of the XSAVE header are not all zero. If attempting to write any reserved bits of the MXCSR register with 1.
#SS(0) If a memory operand effective address is outside the SS segment limit.
#PF(fault-code) If a page fault occurs.
#NM If CR0.TS[bit 3] = 1.
#UD If CPUID.01H:ECX.XSAVE[bit 26] = 0.
If CR4.OSXSAVE[bit 18] = 0. If any of the LOCK, 66H, F3H or F2H prefixes is used.
#AC If this exception is disabled a general protection exception (#GP) is signaled if the memory operand is not aligned on a 16-byte boundary, as described above. If the alignment check exception (#AC) is enabled (and the CPL is 3), signaling of #AC is not guaranteed and may vary with implementation, as follows. In all implementations where #AC is not signaled, a general protection exception is signaled in its place. In addition, the width of the alignment check may also vary with implementation. For instance, for a given implementation, an alignment check exception might be signaled for a 2-byte misalignment, whereas a general protection exception might be signaled for all other misalignments (4-, 8-, or 16-byte misalignments).

Real-Address Mode Exceptions
#GP If a memory operand is not aligned on a 64-byte boundary, regardless of segment. If any part of the operand lies outside the effective address space from 0 to FFFFH. If bit 63 of the XCOMP_BV field of the XSAVE header is 1 and CPUID.(EAX=0DH,ECX=1):EAX.XSAVEC[bit 1] = 0. If the standard form is executed and a bit in XCR0 is 0 and the corresponding bit in the XSTATE_BV field of the XSAVE header is 1. If the standard form is executed and bytes 23:8 of the XSAVE header are not all zero. If the compacted form is executed and a bit in XCR0 is 0 and the corresponding bit in the XCOMP_BV field of the XSAVE header is 1. If the compacted form is executed and a bit in the XCOMP_BV field in the XSAVE header is 0 and the corresponding bit in the XSTATE_BV field is 1. If the compacted form is executed and bytes 63:16 of the XSAVE header are not all zero. If attempting to write any reserved bits of the MXCSR register with 1.
#NM If CR0.TS[bit 3] = 1.
#UD If CPUID.01H:ECX.XSAVE[bit 26] = 0. If CR4.OSXSAVE[bit 18] = 0. If any of the LOCK, 66H, F3H or F2H prefixes is used.

Virtual-8086 Mode Exceptions
Same exceptions as in protected mode.

Compatibility Mode Exceptions
Same exceptions as in protected mode.

64-Bit Mode Exceptions
#GP(0) If a memory address is in a non-canonical form. If a memory operand is not aligned on a 64-byte boundary, regardless of segment. If bit 63 of the XCOMP_BV field of the XSAVE header is 1 and CPUID.(EAX=0DH,ECX=1):EAX.XSAVEC[bit 1] = 0. If the standard form is executed and a bit in XCR0 is 0 and the corresponding bit in the XSTATE_BV field of the XSAVE header is 1. If the standard form is executed and bytes 23:8 of the XSAVE header are not all zero. If the compacted form is executed and a bit in XCR0 is 0 and the corresponding bit in the XCOMP_BV field of the XSAVE header is 1. If the compacted form is executed and a bit in the XCOMP_BV field in the XSAVE header is 0 and the corresponding bit in the XSTATE_BV field is 1. If the compacted form is executed and bytes 63:16 of the XSAVE header are not all zero. If attempting to write any reserved bits of the MXCSR register with 1.
#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#PF(fault-code) If a page fault occurs.
#NM If CR0.TS[bit 3] = 1.
#UD If CPUID.01H:ECX.XSAVE[bit 26] = 0. If CR4.OSXSAVE[bit 18] = 0. If any of the LOCK, 66H, F3H or F2H prefixes is used.
#AC If this exception is disabled a general protection exception (#GP) is signaled if the memory operand is not aligned on a 16-byte boundary, as described above.
If the alignment check exception (#AC) is enabled (and the CPL is 3), signaling of #AC is not guaranteed and may vary with implementation, as follows. In all implementations where #AC is not signaled, a general protection exception is signaled in its place. In addition, the width of the alignment check may also vary with implementation. For instance, for a given implementation, an alignment check exception might be signaled for a 2-byte misalignment, whereas a general protection exception might be signaled for all other misalignments (4-, 8-, or 16-byte misalignments).

XRSTORS—Restore Processor Extended States Supervisor

Opcode Instruction Op/En 64-Bit Mode Compat/Leg Mode Description
0F C7 /3 XRSTORS mem M Valid Valid Restore state components specified by EDX:EAX from mem.
REX.W+ 0F C7 /3 XRSTORS64 mem M Valid N.E. Restore state components specified by EDX:EAX from mem.

Instruction Operand Encoding
Op/En Operand 1 Operand 2 Operand 3 Operand 4
M ModRM:r/m (r) NA NA NA

Description
Performs a full or partial restore of processor state components from the XSAVE area located at the memory address specified by the source operand. The implicit EDX:EAX register pair specifies a 64-bit instruction mask. The specific state components restored correspond to the bits set in the requested-feature bitmap (RFBM), which is the logical-AND of EDX:EAX and the logical-OR of XCR0 with the IA32_XSS MSR. XRSTORS may be executed only if CPL = 0.
The format of the XSAVE area is detailed in Section 13.4, “XSAVE Area,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.
Section 13.12, “Operation of XRSTORS,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 provides a detailed description of the operation of the XRSTOR instruction. The following items provide a high-level outline:
• Execution of XRSTORS is similar to that of the compacted form of XRSTOR; XRSTORS cannot restore from an XSAVE area in which the extended region is in the standard format (see Section 13.4.3, “Extended Region of an XSAVE Area”).
• XRSTORS differs from XRSTOR in that it can restore state components corresponding to bits set in the IA32_XSS MSR.
• If RFBM[i] = 0, XRSTORS does not update state component i.
• If RFBM[i] = 1 and bit i is clear in the XSTATE_BV field in the XSAVE header, XRSTORS initializes state component i.
• If RFBM[i] = 1 and XSTATE_BV[i] = 1, XRSTORS loads state component i from the XSAVE area.
• If XRSTORS attempts to load MXCSR with an illegal value, a general-protection exception (#GP) occurs.
• XRSTORS loads the internal value XRSTOR_INFO, which may be used to optimize a subsequent execution of XSAVEOPT or XSAVES.
• Immediately following an execution of XRSTORS, the processor tracks as in-use (not in initial configuration) any state component i for which RFBM[i] = 1 and XSTATE_BV[i] = 1; it tracks as modified any state component i for which RFBM[i] = 0.
Use of a source operand not aligned to 64-byte boundary (for 64-bit and 32-bit modes) results in a general-protection (#GP) exception. In 64-bit mode, the upper 32 bits of RDX and RAX are ignored.

Operation
RFBM ← (XCR0 OR IA32_XSS) AND EDX:EAX; /* bitwise logical OR and AND */
COMPMASK ← XCOMP_BV field from XSAVE header;
RSTORMASK ← XSTATE_BV field from XSAVE header;
IF in VMX non-root operation
  THEN VMXNR ← 1;
  ELSE VMXNR ← 0;
FI;
LAXA ← linear address of XSAVE area;
IF RFBM[0] = 1
  THEN IF RSTORMASK[0] = 1
    THEN load x87 state from legacy region of XSAVE area;
    ELSE initialize x87 state;
  FI;
FI;
IF RFBM[1] = 1
  THEN IF RSTORMASK[1] = 1
    THEN load SSE state from legacy region of XSAVE area;
    ELSE initialize SSE state;
  FI;
FI;
IF RFBM[2] = 1
  THEN IF RSTORMASK[2] = 1
    THEN load AVX state from extended region (compacted format) of XSAVE area;
    ELSE initialize AVX state;
  FI;
FI;
XRSTOR_INFO ← CPL,VMXNR,LAXA,COMPMASK;

Flags Affected
None.

Intel C/C++ Compiler Intrinsic Equivalent
XRSTORS: void _xrstors( void * , unsigned __int64);
XRSTORS64: void _xrstors64( void * , unsigned __int64);

Protected Mode Exceptions
#GP(0) If CPL > 0. If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. If a memory operand is not aligned on a 64-byte boundary, regardless of segment. If bit 63 of the XCOMP_BV field of the XSAVE header is 0. If a bit in XCR0 is 0 and the corresponding bit in the XCOMP_BV field of the XSAVE header is 1. If a bit in the XCOMP_BV field in the XSAVE header is 0 and the corresponding bit in the XSTATE_BV field is 1. If bytes 63:16 of the XSAVE header are not all zero. If attempting to write any reserved bits of the MXCSR register with 1.
#SS(0) If a memory operand effective address is outside the SS segment limit.
#PF(fault-code) If a page fault occurs.
#NM If CR0.TS[bit 3] = 1.
#UD If CPUID.01H:ECX.XSAVE[bit 26] = 0 or CPUID.(EAX=0DH,ECX=1):EAX.XSS[bit 3] = 0. If CR4.OSXSAVE[bit 18] = 0. If any of the LOCK, 66H, F3H or F2H prefixes is used.
#AC If this exception is disabled a general protection exception (#GP) is signaled if the memory operand is not aligned on a 16-byte boundary, as described above. If the alignment check exception (#AC) is enabled (and the CPL is 3), signaling of #AC is not guaranteed and may vary with implementation, as follows. In all implementations where #AC is not signaled, a #GP is signaled in its place. In addition, the width of the alignment check may also vary with implementation. For instance, for a given implementation, an alignment check exception might be signaled for a 2-byte misalignment, whereas a #GP might be signaled for all other misalignments (4-, 8-, or 16-byte misalignments).

Real-Address Mode Exceptions
#GP If a memory operand is not aligned on a 64-byte boundary, regardless of segment. If any part of the operand lies outside the effective address space from 0 to FFFFH. If bit 63 of the XCOMP_BV field of the XSAVE header is 0. If a bit in XCR0 is 0 and the corresponding bit in the XCOMP_BV field of the XSAVE header is 1. If a bit in the XCOMP_BV field in the XSAVE header is 0 and the corresponding bit in the XSTATE_BV field is 1. If bytes 63:16 of the XSAVE header are not all zero. If attempting to write any reserved bits of the MXCSR register with 1.
#NM If CR0.TS[bit 3] = 1.
#UD If CPUID.01H:ECX.XSAVE[bit 26] = 0 or CPUID.(EAX=0DH,ECX=1):EAX.XSS[bit 3] = 0. If CR4.OSXSAVE[bit 18] = 0. If any of the LOCK, 66H, F3H or F2H prefixes is used.

Virtual-8086 Mode Exceptions
Same exceptions as in protected mode.

Compatibility Mode Exceptions
Same exceptions as in protected mode.

64-Bit Mode Exceptions
#GP(0) If CPL > 0. If a memory address is in a non-canonical form. If a memory operand is not aligned on a 64-byte boundary, regardless of segment.
If bit 63 of the XCOMP_BV field of the XSAVE header is 0. If a bit in XCR0 is 0 and the corresponding bit in the XCOMP_BV field of the XSAVE header is 1. If a bit in the XCOMP_BV field in the XSAVE header is 0 and the corresponding bit in the XSTATE_BV field is 1. If bytes 63:16 of the XSAVE header are not all zero. If attempting to write any reserved bits of the MXCSR register with 1.
#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#PF(fault-code) If a page fault occurs.
#NM If CR0.TS[bit 3] = 1.
#UD If CPUID.01H:ECX.XSAVE[bit 26] = 0 or CPUID.(EAX=0DH,ECX=1):EAX.XSS[bit 3] = 0. If CR4.OSXSAVE[bit 18] = 0. If any of the LOCK, 66H, F3H or F2H prefixes is used.
#AC If this exception is disabled a general protection exception (#GP) is signaled if the memory operand is not aligned on a 16-byte boundary, as described above. If the alignment check exception (#AC) is enabled (and the CPL is 3), signaling of #AC is not guaranteed and may vary with implementation, as follows. In all implementations where #AC is not signaled, a general protection exception is signaled in its place. In addition, the width of the alignment check may also vary with implementation. For instance, for a given implementation, an alignment check exception might be signaled for a 2-byte misalignment, whereas a general protection exception might be signaled for all other misalignments (4-, 8-, or 16-byte misalignments).

XSAVE—Save Processor Extended States

Opcode Instruction Op/En 64-Bit Mode Compat/Leg Mode Description
0F AE /4 XSAVE mem M Valid Valid Save state components specified by EDX:EAX to mem.
REX.W+ 0F AE /4 XSAVE64 mem M Valid N.E. Save state components specified by EDX:EAX to mem.

Instruction Operand Encoding
Op/En Operand 1 Operand 2 Operand 3 Operand 4
M ModRM:r/m (w) NA NA NA

Description
Performs a full or partial save of processor state components to the XSAVE area located at the memory address specified by the destination operand. The implicit EDX:EAX register pair specifies a 64-bit instruction mask. The specific state components saved correspond to the bits set in the requested-feature bitmap (RFBM), which is the logical-AND of EDX:EAX and XCR0.
The format of the XSAVE area is detailed in Section 13.4, “XSAVE Area,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.
Section 13.7, “Operation of XSAVE,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 provides a detailed description of the operation of the XSAVE instruction. The following items provide a high-level outline:
• XSAVE saves state component i if and only if RFBM[i] = 1 (see Note 1 below).
• XSAVE reads the XSTATE_BV field of the XSAVE header (see Section 13.4.2, “XSAVE Header”) and writes a modified value back to memory as follows. If RFBM[i] = 1, XSAVE writes XSTATE_BV[i] with the value of XINUSE[i]. (XINUSE is a bitmap by which the processor tracks the status of various state components. See Section 13.6, “Processor Tracking of XSAVE-Managed State.”) If RFBM[i] = 0, XSAVE writes XSTATE_BV[i] with the value that it read from memory (it does not modify the bit). XSAVE does not write to any part of the XSAVE header other than the XSTATE_BV field.
• XSAVE always uses the standard format of the extended region of the XSAVE area (see Section 13.4.3, “Extended Region of an XSAVE Area”).
NOTES: 1. An exception is made for MXCSR and MXCSR_MASK, which belong to state component 1 — SSE. XSAVE saves these values to memory if either RFBM[1] or RFBM[2] is 1.
• XSAVE does not modify bytes 511:464 of the legacy region of the XSAVE area (see Section 13.4.1, “Legacy Region of an XSAVE Area”).
Use of a destination operand not aligned to 64-byte boundary (in either 64-bit or 32-bit modes) results in a general-protection (#GP) exception. In 64-bit mode, the upper 32 bits of RDX and RAX are ignored.

Operation
RFBM ← XCR0 AND EDX:EAX; /* bitwise logical AND */
OLD_BV ← XSTATE_BV field from XSAVE header;
IF RFBM[0] = 1
  THEN store x87 state into legacy region of XSAVE area;
FI;
IF RFBM[1] = 1
  THEN store XMM registers into legacy region of XSAVE area;
FI;
IF RFBM[2] = 1
  THEN store AVX state into extended region of XSAVE area;
FI;
IF RFBM[1] = 1 or RFBM[2] = 1
  THEN store MXCSR and MXCSR_MASK into legacy region of XSAVE area;
FI;
XSTATE_BV field in XSAVE header ← (OLD_BV AND ~RFBM) OR (XINUSE AND RFBM);

Flags Affected
None.

Intel C/C++ Compiler Intrinsic Equivalent
XSAVE: void _xsave( void * , unsigned __int64);
XSAVE64: void _xsave64( void * , unsigned __int64);

Protected Mode Exceptions
#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. If a memory operand is not aligned on a 64-byte boundary, regardless of segment.
#SS(0) If a memory operand effective address is outside the SS segment limit.
#PF(fault-code) If a page fault occurs.
#NM If CR0.TS[bit 3] = 1.
#UD If CPUID.01H:ECX.XSAVE[bit 26] = 0. If CR4.OSXSAVE[bit 18] = 0. If any of the LOCK, 66H, F3H or F2H prefixes is used.
#AC If this exception is disabled a general protection exception (#GP) is signaled if the memory operand is not aligned on a 16-byte boundary, as described above. If the alignment check exception (#AC) is enabled (and the CPL is 3), signaling of #AC is not guaranteed and may vary with implementation, as follows. In all implementations where #AC is not signaled, a general protection exception is signaled in its place. In addition, the width of the alignment check may also vary with implementation. For instance, for a given implementation, an alignment check exception might be signaled for a 2-byte misalignment, whereas a general protection exception might be signaled for all other misalignments (4-, 8-, or 16-byte misalignments).

Real-Address Mode Exceptions
#GP If a memory operand is not aligned on a 64-byte boundary, regardless of segment. If any part of the operand lies outside the effective address space from 0 to FFFFH.
#NM If CR0.TS[bit 3] = 1.
#UD If CPUID.01H:ECX.XSAVE[bit 26] = 0. If CR4.OSXSAVE[bit 18] = 0. If any of the LOCK, 66H, F3H or F2H prefixes is used.

Virtual-8086 Mode Exceptions
Same exceptions as in protected mode.

Compatibility Mode Exceptions
Same exceptions as in protected mode.

64-Bit Mode Exceptions
#GP(0) If the memory address is in a non-canonical form. If a memory operand is not aligned on a 64-byte boundary, regardless of segment.
#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#PF(fault-code) If a page fault occurs.
#NM If CR0.TS[bit 3] = 1.
#UD If CPUID.01H:ECX.XSAVE[bit 26] = 0. If CR4.OSXSAVE[bit 18] = 0. If any of the LOCK, 66H, F3H or F2H prefixes is used.
#AC If this exception is disabled a general protection exception (#GP) is signaled if the memory operand is not aligned on a 16-byte boundary, as described above. If the alignment check exception (#AC) is enabled (and the CPL is 3), signaling of #AC is not guaranteed and may vary with implementation, as follows. In all implementations where #AC is not signaled, a general protection exception is signaled in its place. In addition, the width of the alignment check may also vary with implementation. For instance, for a given implementation, an alignment check exception might be signaled for a 2-byte misalignment, whereas a general protection exception might be signaled for all other misalignments (4-, 8-, or 16-byte misalignments).

XSAVEC—Save Processor Extended States with Compaction

Opcode Instruction Op/En 64-Bit Mode Compat/Leg Mode Description
0F C7 /4 XSAVEC mem M Valid Valid Save state components specified by EDX:EAX to mem with compaction.
REX.W+ 0F C7 /4 XSAVEC64 mem M Valid N.E. Save state components specified by EDX:EAX to mem with compaction.

Instruction Operand Encoding
Op/En Operand 1 Operand 2 Operand 3 Operand 4
M ModRM:r/m (w) NA NA NA

Description
Performs a full or partial save of processor state components to the XSAVE area located at the memory address specified by the destination operand. The implicit EDX:EAX register pair specifies a 64-bit instruction mask. The specific state components saved correspond to the bits set in the requested-feature bitmap (RFBM), which is the logical-AND of EDX:EAX and XCR0.
The format of the XSAVE area is detailed in Section 13.4, “XSAVE Area,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.
Section 13.10, “Operation of XSAVEC,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 provides a detailed description of the operation of the XSAVEC instruction. The following items provide a high-level outline:
• Execution of XSAVEC is similar to that of XSAVE. XSAVEC differs from XSAVE in that it uses compaction and that it may use the init optimization.
• XSAVEC saves state component i if and only if RFBM[i] = 1 and XINUSE[i] = 1 (see Note 1 below). (XINUSE is a bitmap by which the processor tracks the status of various state components. See Section 13.6, “Processor Tracking of XSAVE-Managed State.”)
• XSAVEC does not modify bytes 511:464 of the legacy region of the XSAVE area (see Section 13.4.1, “Legacy Region of an XSAVE Area”).
• XSAVEC writes the logical AND of RFBM and XINUSE to the XSTATE_BV field of the XSAVE header (see Notes 2 and 3 below; also see Section 13.4.2, “XSAVE Header”). XSAVEC sets bit 63 of the XCOMP_BV field and sets bits 62:0 of that field to RFBM[62:0]. XSAVEC does not write to any parts of the XSAVE header other than the XSTATE_BV and XCOMP_BV fields.
• XSAVEC always uses the compacted format of the extended region of the XSAVE area (see Section 13.4.3, “Extended Region of an XSAVE Area”).
NOTES:
1. There is an exception for state component 1 (SSE). MXCSR is part of SSE state, but XINUSE[1] may be 0 even if MXCSR does not have its initial value of 1F80H. In this case, XSAVEC saves SSE state as long as RFBM[1] = 1.
2. Unlike XSAVE and XSAVEOPT, XSAVEC clears bits in the XSTATE_BV field that correspond to bits that are clear in RFBM.
3. There is an exception for state component 1 (SSE). MXCSR is part of SSE state, but XINUSE[1] may be 0 even if MXCSR does not have its initial value of 1F80H. In this case, XSAVEC sets XSTATE_BV[1] to 1 as long as RFBM[1] = 1.
Use of a destination operand not aligned to 64-byte boundary (in either 64-bit or 32-bit modes) results in a general-protection (#GP) exception. In 64-bit mode, the upper 32 bits of RDX and RAX are ignored.

Operation
RFBM ← XCR0 AND EDX:EAX; /* bitwise logical AND */
COMPMASK ← RFBM OR 80000000_00000000H;
IF RFBM[0] = 1 and XINUSE[0] = 1
  THEN store x87 state into legacy region of XSAVE area;
FI;
IF RFBM[1] = 1 and (XINUSE[1] = 1 or MXCSR ≠ 1F80H)
  THEN store SSE state into legacy region of XSAVE area;
FI;
IF RFBM[2] = 1 AND XINUSE[2] = 1
  THEN store AVX state into extended region of XSAVE area;
FI;
XSTATE_BV field in XSAVE header ← XINUSE AND RFBM; (* if MXCSR does not have its initial value of 1F80H, XSAVEC sets XSTATE_BV[1] to 1 as long as RFBM[1] = 1, regardless of the value of XINUSE[1] *)
XCOMP_BV field in XSAVE header ← COMPMASK;

Flags Affected
None.

Intel C/C++ Compiler Intrinsic Equivalent
XSAVEC: void _xsavec( void * , unsigned __int64);
XSAVEC64: void _xsavec64( void * , unsigned __int64);

Protected Mode Exceptions
#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. If a memory operand is not aligned on a 64-byte boundary, regardless of segment.
#SS(0) If a memory operand effective address is outside the SS segment limit.
#PF(fault-code) If a page fault occurs.
#NM If CR0.TS[bit 3] = 1.
#UD If CPUID.01H:ECX.XSAVE[bit 26] = 0 or CPUID.(EAX=0DH,ECX=1):EAX.XSAVEC[bit 1] = 0. If CR4.OSXSAVE[bit 18] = 0. If any of the LOCK, 66H, F3H or F2H prefixes is used.
#AC If this exception is disabled a general protection exception (#GP) is signaled if the memory operand is not aligned on a 16-byte boundary, as described above. If the alignment check exception (#AC) is enabled (and the CPL is 3), signaling of #AC is not guaranteed and may vary with implementation, as follows. In all implementations where #AC is not signaled, a general protection exception is signaled in its place. In addition, the width of the alignment check may also vary with implementation. For instance, for a given implementation, an alignment check exception might be signaled for a 2-byte misalignment, whereas a general protection exception might be signaled for all other misalignments (4-, 8-, or 16-byte misalignments).

Real-Address Mode Exceptions
#GP If a memory operand is not aligned on a 64-byte boundary, regardless of segment. If any part of the operand lies outside the effective address space from 0 to FFFFH.
#NM If CR0.TS[bit 3] = 1.
#UD If CPUID.01H:ECX.XSAVE[bit 26] = 0 or CPUID.(EAX=0DH,ECX=1):EAX.XSAVEC[bit 1] = 0. If CR4.OSXSAVE[bit 18] = 0. If any of the LOCK, 66H, F3H or F2H prefixes is used.

Virtual-8086 Mode Exceptions
Same exceptions as in protected mode.

Compatibility Mode Exceptions
Same exceptions as in protected mode.

64-Bit Mode Exceptions
#GP(0) If the memory address is in a non-canonical form. If a memory operand is not aligned on a 64-byte boundary, regardless of segment.
#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#PF(fault-code) If a page fault occurs.
#NM If CR0.TS[bit 3] = 1.
#UD If CPUID.01H:ECX.XSAVE[bit 26] = 0 or CPUID.(EAX=0DH,ECX=1):EAX.XSAVEC[bit 1] = 0. If CR4.OSXSAVE[bit 18] = 0. If any of the LOCK, 66H, F3H or F2H prefixes is used.
#AC	If this exception is disabled, a general-protection exception (#GP) is signaled if the memory operand is not aligned on a 64-byte boundary, as described above. If the alignment-check exception (#AC) is enabled (and the CPL is 3), signaling of #AC is not guaranteed and may vary with implementation, as follows. In all implementations where #AC is not signaled, a general-protection exception is signaled in its place. In addition, the width of the alignment check may also vary with implementation. For instance, for a given implementation, an alignment-check exception might be signaled for a 2-byte misalignment, whereas a general-protection exception might be signaled for all other misalignments (4-, 8-, or 16-byte misalignments).

XSAVEOPT—Save Processor Extended States Optimized

Opcode/Instruction                Op/En  64/32 bit Mode Support  CPUID Feature Flag  Description
0F AE /6 XSAVEOPT mem             M      V/V                     XSAVEOPT            Save state components specified by EDX:EAX to mem, optimizing if possible.
REX.W + 0F AE /6 XSAVEOPT64 mem   M      V/V                     XSAVEOPT            Save state components specified by EDX:EAX to mem, optimizing if possible.

Instruction Operand Encoding
Op/En  Operand 1      Operand 2  Operand 3  Operand 4
M      ModRM:r/m (w)  NA         NA         NA

Description
Performs a full or partial save of processor state components to the XSAVE area located at the memory address specified by the destination operand. The implicit EDX:EAX register pair specifies a 64-bit instruction mask. The specific state components saved correspond to the bits set in the requested-feature bitmap (RFBM), which is the logical-AND of EDX:EAX and XCR0.
The format of the XSAVE area is detailed in Section 13.4, “XSAVE Area,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.
Section 13.9, “Operation of XSAVEOPT,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 provides a detailed description of the operation of the XSAVEOPT instruction. The following items provide a high-level outline:
•	Execution of XSAVEOPT is similar to that of XSAVE. XSAVEOPT differs from XSAVE in that it may use the init and modified optimizations (it does not use compaction; see the last item below). The performance of XSAVEOPT will be equal to or better than that of XSAVE.
•	XSAVEOPT saves state component i only if RFBM[i] = 1 and XINUSE[i] = 1 (see Note 1 below). (XINUSE is a bitmap by which the processor tracks the status of various state components. See Section 13.6, “Processor Tracking of XSAVE-Managed State.”) Even if both bits are 1, XSAVEOPT may optimize and not save state component i if (1) state component i has not been modified since the last execution of XRSTOR or XRSTORS; and (2) this execution of XSAVEOPT corresponds to that last execution of XRSTOR or XRSTORS as determined by the internal value XRSTOR_INFO (see the Operation section below).
•	XSAVEOPT does not modify bytes 511:464 of the legacy region of the XSAVE area (see Section 13.4.1, “Legacy Region of an XSAVE Area”).
•	XSAVEOPT reads the XSTATE_BV field of the XSAVE header (see Section 13.4.2, “XSAVE Header”) and writes a modified value back to memory as follows. If RFBM[i] = 1, XSAVEOPT writes XSTATE_BV[i] with the value of XINUSE[i]. If RFBM[i] = 0, XSAVEOPT writes XSTATE_BV[i] with the value that it read from memory (it does not modify the bit). XSAVEOPT does not write to any part of the XSAVE header other than the XSTATE_BV field.
•	XSAVEOPT always uses the standard format of the extended region of the XSAVE area (see Section 13.4.3, “Extended Region of an XSAVE Area”).

NOTE: There is an exception made for MXCSR and MXCSR_MASK, which belong to state component 1 — SSE. XSAVEOPT always saves these to memory if RFBM[1] = 1 or RFBM[2] = 1, regardless of the value of XINUSE.

Use of a destination operand not aligned to a 64-byte boundary (in either 64-bit or 32-bit modes) results in a general-protection (#GP) exception. In 64-bit mode, the upper 32 bits of RDX and RAX are ignored.

Operation

RFBM ← XCR0 AND EDX:EAX; /* bitwise logical AND */
OLD_BV ← XSTATE_BV field from XSAVE header;
IF in VMX non-root operation
    THEN VMXNR ← 1;
    ELSE VMXNR ← 0;
FI;
LAXA ← linear address of XSAVE area;
COMPMASK ← 00000000_00000000H;
IF XRSTOR_INFO = (CPL, VMXNR, LAXA, COMPMASK)
    THEN MODOPT ← 1;
    ELSE MODOPT ← 0;
FI;
IF RFBM[0] = 1 AND XINUSE[0] = 1
    THEN store x87 state into legacy region of XSAVE area;
    /* might avoid saving if x87 state is not modified and MODOPT = 1 */
FI;
IF RFBM[1] = 1 AND XINUSE[1] = 1
    THEN store XMM registers into legacy region of XSAVE area;
    /* might avoid saving if XMM registers are not modified and MODOPT = 1 */
FI;
IF RFBM[2] = 1 AND XINUSE[2] = 1
    THEN store AVX state into extended region of XSAVE area;
    /* might avoid saving if AVX state is not modified and MODOPT = 1 */
FI;
IF RFBM[1] = 1 OR RFBM[2] = 1
    THEN store MXCSR and MXCSR_MASK into legacy region of XSAVE area;
FI;
XSTATE_BV field in XSAVE header ← (OLD_BV AND ~RFBM) OR (XINUSE AND RFBM);

Flags Affected
None.

Intel C/C++ Compiler Intrinsic Equivalent
XSAVEOPT: void _xsaveopt( void * , unsigned __int64);
XSAVEOPT64: void _xsaveopt64( void * , unsigned __int64);

Protected Mode Exceptions
#GP(0)	If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
	If a memory operand is not aligned on a 64-byte boundary, regardless of segment.
#SS(0)	If a memory operand effective address is outside the SS segment limit.
#PF(fault-code)	If a page fault occurs.
#NM	If CR0.TS[bit 3] = 1.
#UD	If CPUID.01H:ECX.XSAVE[bit 26] = 0 or CPUID.(EAX=0DH,ECX=1):EAX.XSAVEOPT[bit 0] = 0.
	If CR4.OSXSAVE[bit 18] = 0.
	If the LOCK prefix is used.
	If the 66H, F3H, or F2H prefix is used.

Real-Address Mode Exceptions
#GP	If a memory operand is not aligned on a 64-byte boundary, regardless of segment.
	If any part of the operand lies outside the effective address space from 0 to FFFFH.
#NM	If CR0.TS[bit 3] = 1.
#UD	If CPUID.01H:ECX.XSAVE[bit 26] = 0 or CPUID.(EAX=0DH,ECX=1):EAX.XSAVEOPT[bit 0] = 0.
	If CR4.OSXSAVE[bit 18] = 0.
	If the LOCK prefix is used.
	If the 66H, F3H, or F2H prefix is used.

Virtual-8086 Mode Exceptions
Same exceptions as in protected mode.

Compatibility Mode Exceptions
Same exceptions as in protected mode.

64-Bit Mode Exceptions
#SS(0)	If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)	If the memory address is in a non-canonical form.
	If a memory operand is not aligned on a 64-byte boundary, regardless of segment.
#PF(fault-code)	If a page fault occurs.
#NM	If CR0.TS[bit 3] = 1.
#UD	If CPUID.01H:ECX.XSAVE[bit 26] = 0 or CPUID.(EAX=0DH,ECX=1):EAX.XSAVEOPT[bit 0] = 0.
	If CR4.OSXSAVE[bit 18] = 0.
	If the LOCK prefix is used.
	If the 66H, F3H, or F2H prefix is used.
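The header-update rule at the end of the Operation section above is a simple bit merge. The following fragment is an editorial illustration of that rule (the function name and variable names are ours, not Intel's):

    #include <stdint.h>

    /* Bits selected by RFBM take the corresponding XINUSE values;
       bits outside RFBM keep the values read from XSTATE_BV. */
    static uint64_t xsaveopt_xstate_bv(uint64_t old_bv, uint64_t xinuse,
                                       uint64_t rfbm)
    {
        return (old_bv & ~rfbm) | (xinuse & rfbm);
    }

For example, with old_bv = 7H, rfbm = 4H (AVX only), and xinuse = 3H (AVX in its initial configuration), the result is 3H: bit 2 is cleared to reflect XINUSE[2] = 0, while bits 1:0 are written back unchanged.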
XSAVES—Save Processor Extended States Supervisor

Opcode              Instruction   Op/En  64-Bit Mode  Compat/Leg Mode  Description
0F C7 /5            XSAVES mem    M      Valid        Valid            Save state components specified by EDX:EAX to mem with compaction, optimizing if possible.
REX.W + 0F C7 /5    XSAVES64 mem  M      Valid        N.E.             Save state components specified by EDX:EAX to mem with compaction, optimizing if possible.

Instruction Operand Encoding
Op/En  Operand 1      Operand 2  Operand 3  Operand 4
M      ModRM:r/m (w)  NA         NA         NA

Description
Performs a full or partial save of processor state components to the XSAVE area located at the memory address specified by the destination operand. The implicit EDX:EAX register pair specifies a 64-bit instruction mask. The specific state components saved correspond to the bits set in the requested-feature bitmap (RFBM), the logical-AND of EDX:EAX and the logical-OR of XCR0 with the IA32_XSS MSR. XSAVES may be executed only if CPL = 0.
The format of the XSAVE area is detailed in Section 13.4, “XSAVE Area,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.
Section 13.11, “Operation of XSAVES,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 provides a detailed description of the operation of the XSAVES instruction. The following items provide a high-level outline:
•	Execution of XSAVES is similar to that of XSAVEC. XSAVES differs from XSAVEC in that it can save state components corresponding to bits set in the IA32_XSS MSR and that it may use the modified optimization.
•	XSAVES saves state component i only if RFBM[i] = 1 and XINUSE[i] = 1 (see Note 1 below). (XINUSE is a bitmap by which the processor tracks the status of various state components. See Section 13.6, “Processor Tracking of XSAVE-Managed State.”) Even if both bits are 1, XSAVES may optimize and not save state component i if (1) state component i has not been modified since the last execution of XRSTOR or XRSTORS; and (2) this execution of XSAVES corresponds to that last execution of XRSTOR or XRSTORS as determined by XRSTOR_INFO (see the Operation section below).
•	XSAVES does not modify bytes 511:464 of the legacy region of the XSAVE area (see Section 13.4.1, “Legacy Region of an XSAVE Area”).
•	XSAVES writes the logical AND of RFBM and XINUSE to the XSTATE_BV field of the XSAVE header (see Note 2 below, and Section 13.4.2, “XSAVE Header”). XSAVES sets bit 63 of the XCOMP_BV field and sets bits 62:0 of that field to RFBM[62:0]. XSAVES does not write to any parts of the XSAVE header other than the XSTATE_BV and XCOMP_BV fields.
•	XSAVES always uses the compacted format of the extended region of the XSAVE area (see Section 13.4.3, “Extended Region of an XSAVE Area”).

NOTES:
1. There is an exception for state component 1 (SSE). MXCSR is part of SSE state, but XINUSE[1] may be 0 even if MXCSR does not have its initial value of 1F80H. In this case, the init optimization does not apply and XSAVES will save SSE state as long as RFBM[1] = 1 and the modified optimization is not being applied.
2. There is an exception for state component 1 (SSE). MXCSR is part of SSE state, but XINUSE[1] may be 0 even if MXCSR does not have its initial value of 1F80H. In this case, XSAVES sets XSTATE_BV[1] to 1 as long as RFBM[1] = 1.

Use of a destination operand not aligned to a 64-byte boundary (in either 64-bit or 32-bit modes) results in a general-protection (#GP) exception. In 64-bit mode, the upper 32 bits of RDX and RAX are ignored.
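As an editorial illustration of how RFBM is formed for XSAVES, the sketch below combines XCR0 (readable via XGETBV, exposed as the standard _xgetbv intrinsic) with the IA32_XSS MSR. Reading IA32_XSS requires RDMSR at CPL = 0, so rdmsr_ia32_xss() is a hypothetical kernel helper, not a standard intrinsic; we take the IA32_XSS MSR index to be 0DA0H.

    #include <immintrin.h>
    #include <stdint.h>

    /* Hypothetical CPL-0 helper: executes RDMSR with ECX = 0DA0H
       (IA32_XSS) and returns EDX:EAX as one 64-bit value. */
    extern uint64_t rdmsr_ia32_xss(void);

    static uint64_t xsaves_rfbm(uint64_t instruction_mask)
    {
        uint64_t xcr0 = _xgetbv(0);  /* XGETBV with ECX = 0 reads XCR0 */
        return (xcr0 | rdmsr_ia32_xss()) & instruction_mask;
    }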
Operation

RFBM ← (XCR0 OR IA32_XSS) AND EDX:EAX; /* bitwise logical OR and AND */
IF in VMX non-root operation
    THEN VMXNR ← 1;
    ELSE VMXNR ← 0;
FI;
LAXA ← linear address of XSAVE area;
COMPMASK ← RFBM OR 80000000_00000000H;
IF XRSTOR_INFO = (CPL, VMXNR, LAXA, COMPMASK)
    THEN MODOPT ← 1;
    ELSE MODOPT ← 0;
FI;
IF RFBM[0] = 1 AND XINUSE[0] = 1
    THEN store x87 state into legacy region of XSAVE area;
    /* might avoid saving if x87 state is not modified and MODOPT = 1 */
FI;
IF RFBM[1] = 1 AND (XINUSE[1] = 1 OR MXCSR ≠ 1F80H)
    THEN store SSE state into legacy region of XSAVE area;
    /* might avoid saving if SSE state is not modified and MODOPT = 1 */
FI;
IF RFBM[2] = 1 AND XINUSE[2] = 1
    THEN store AVX state into extended region of XSAVE area;
    /* might avoid saving if AVX state is not modified and MODOPT = 1 */
FI;
XSTATE_BV field in XSAVE header ← XINUSE AND RFBM;
XCOMP_BV field in XSAVE header ← COMPMASK;

NOTE: If MXCSR does not have its initial value of 1F80H, XSAVES sets XSTATE_BV[1] to 1 as long as RFBM[1] = 1, regardless of the value of XINUSE[1].

Flags Affected
None.

Intel C/C++ Compiler Intrinsic Equivalent
XSAVES: void _xsaves( void * , unsigned __int64);
XSAVES64: void _xsaves64( void * , unsigned __int64);

Protected Mode Exceptions
#GP(0)	If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
	If a memory operand is not aligned on a 64-byte boundary, regardless of segment.
#SS(0)	If a memory operand effective address is outside the SS segment limit.
#PF(fault-code)	If a page fault occurs.
#NM	If CR0.TS[bit 3] = 1.
#UD	If CPUID.01H:ECX.XSAVE[bit 26] = 0 or CPUID.(EAX=0DH,ECX=1):EAX.XSS[bit 3] = 0.
	If CR4.OSXSAVE[bit 18] = 0.
	If any of the LOCK, 66H, F3H, or F2H prefixes is used.
#AC	If this exception is disabled, a general-protection exception (#GP) is signaled if the memory operand is not aligned on a 64-byte boundary, as described above. If the alignment-check exception (#AC) is enabled (and the CPL is 3), signaling of #AC is not guaranteed and may vary with implementation, as follows. In all implementations where #AC is not signaled, a general-protection exception is signaled in its place. In addition, the width of the alignment check may also vary with implementation. For instance, for a given implementation, an alignment-check exception might be signaled for a 2-byte misalignment, whereas a general-protection exception might be signaled for all other misalignments (4-, 8-, or 16-byte misalignments).

Real-Address Mode Exceptions
#GP	If a memory operand is not aligned on a 64-byte boundary, regardless of segment.
	If any part of the operand lies outside the effective address space from 0 to FFFFH.
#NM	If CR0.TS[bit 3] = 1.
#UD	If CPUID.01H:ECX.XSAVE[bit 26] = 0 or CPUID.(EAX=0DH,ECX=1):EAX.XSS[bit 3] = 0.
	If CR4.OSXSAVE[bit 18] = 0.
	If any of the LOCK, 66H, F3H, or F2H prefixes is used.

Virtual-8086 Mode Exceptions
Same exceptions as in protected mode.

Compatibility Mode Exceptions
Same exceptions as in protected mode.

64-Bit Mode Exceptions
#GP(0)	If the memory address is in a non-canonical form.
	If a memory operand is not aligned on a 64-byte boundary, regardless of segment.
#SS(0)	If a memory address referencing the SS segment is in a non-canonical form.
#PF(fault-code)	If a page fault occurs.
#NM	If CR0.TS[bit 3] = 1.
#UD	If CPUID.01H:ECX.XSAVE[bit 26] = 0 or CPUID.(EAX=0DH,ECX=1):EAX.XSS[bit 3] = 0.
	If CR4.OSXSAVE[bit 18] = 0.
	If any of the LOCK, 66H, F3H, or F2H prefixes is used.
#AC	If this exception is disabled, a general-protection exception (#GP) is signaled if the memory operand is not aligned on a 64-byte boundary, as described above. If the alignment-check exception (#AC) is enabled (and the CPL is 3), signaling of #AC is not guaranteed and may vary with implementation, as follows. In all implementations where #AC is not signaled, a general-protection exception is signaled in its place. In addition, the width of the alignment check may also vary with implementation. For instance, for a given implementation, an alignment-check exception might be signaled for a 2-byte misalignment, whereas a general-protection exception might be signaled for all other misalignments (4-, 8-, or 16-byte misalignments).

XSETBV—Set Extended Control Register

Opcode     Instruction  Op/En  64-Bit Mode  Compat/Leg Mode  Description
0F 01 D1   XSETBV       NP     Valid        Valid            Write the value in EDX:EAX to the XCR specified by ECX.

Instruction Operand Encoding
Op/En  Operand 1  Operand 2  Operand 3  Operand 4
NP     NA         NA         NA         NA

Description
Writes the contents of registers EDX:EAX into the 64-bit extended control register (XCR) specified in the ECX register. (On processors that support the Intel 64 architecture, the high-order 32 bits of RCX are ignored.) The contents of the EDX register are copied to the high-order 32 bits of the selected XCR, and the contents of the EAX register are copied to the low-order 32 bits of the XCR. (On processors that support the Intel 64 architecture, the high-order 32 bits of each of RAX and RDX are ignored.) Undefined or reserved bits in an XCR should be set to values previously read.
This instruction must be executed at privilege level 0 or in real-address mode; otherwise, a general-protection exception #GP(0) is generated. Specifying a reserved or unimplemented XCR in ECX also causes a general-protection exception, as does an attempt to write to reserved bits in an XCR.
Currently, only XCR0 is supported. Thus, all other values of ECX are reserved and will cause a #GP(0). Note that bit 0 of XCR0 (corresponding to x87 state) must be set to 1; the instruction causes a #GP(0) if an attempt is made to clear this bit. In addition, the instruction causes a #GP(0) if an attempt is made to set XCR0[2] (AVX state) while clearing XCR0[1] (SSE state); it is necessary to set both bits to use AVX instructions. See Section 13.3, “Enabling the XSAVE Feature Set and XSAVE-Enabled Features,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.

Operation

XCR[ECX] ← EDX:EAX;

Flags Affected
None.

Intel C/C++ Compiler Intrinsic Equivalent
XSETBV: void _xsetbv( unsigned int, unsigned __int64);
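The sketch below is an editorial addition showing how system software might use the intrinsic above at CPL 0 to enable x87, SSE, and AVX state. It assumes CR4.OSXSAVE has already been set (otherwise XSETBV raises #UD) and a compiler exposing _xgetbv/_xsetbv in <immintrin.h>. The mask 7H keeps XCR0[0] set and never sets XCR0[2] without XCR0[1], so it respects both #GP(0) conditions described above.

    #include <immintrin.h>
    #include <stdint.h>

    /* CPL-0 only: XSETBV faults with #GP(0) at any other privilege
       level. ECX = 0 selects XCR0, the only XCR currently defined. */
    static void enable_x87_sse_avx(void)
    {
        uint64_t xcr0 = _xgetbv(0);    /* read the current XCR0 value */
        _xsetbv(0, xcr0 | 0x7);        /* set bits 2:0: x87, SSE, AVX */
    }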
Protected Mode Exceptions
#GP(0)	If the current privilege level is not 0.
	If an invalid XCR is specified in ECX.
	If the value in EDX:EAX sets bits that are reserved in the XCR specified by ECX.
	If an attempt is made to clear bit 0 of XCR0.
	If an attempt is made to set XCR0[2:1] to 10b.
#UD	If CPUID.01H:ECX.XSAVE[bit 26] = 0.
	If CR4.OSXSAVE[bit 18] = 0.
	If the LOCK prefix is used.
	If the 66H, F3H, or F2H prefix is used.

Real-Address Mode Exceptions
#GP	If an invalid XCR is specified in ECX.
	If the value in EDX:EAX sets bits that are reserved in the XCR specified by ECX.
	If an attempt is made to clear bit 0 of XCR0.
	If an attempt is made to set XCR0[2:1] to 10b.
#UD	If CPUID.01H:ECX.XSAVE[bit 26] = 0.
	If CR4.OSXSAVE[bit 18] = 0.
	If the LOCK prefix is used.
	If the 66H, F3H, or F2H prefix is used.

Virtual-8086 Mode Exceptions
#GP(0)	The XSETBV instruction is not recognized in virtual-8086 mode.

Compatibility Mode Exceptions
Same exceptions as in protected mode.

64-Bit Mode Exceptions
Same exceptions as in protected mode.

XTEST—Test If In Transactional Execution

Opcode/Instruction  Op/En  64/32bit Mode Support  CPUID Feature Flag  Description
0F 01 D6 XTEST      A      V/V                    HLE or RTM          Test if executing in a transactional region.

Instruction Operand Encoding
Op/En  Operand 1  Operand 2  Operand 3  Operand 4
A      NA         NA         NA         NA

Description
The XTEST instruction queries the transactional execution status. If the instruction executes inside a transactionally executing RTM region or a transactionally executing HLE region, then the ZF flag is cleared; otherwise, it is set.

Operation

XTEST
IF (RTM_ACTIVE = 1 OR HLE_ACTIVE = 1)
    THEN ZF ← 0;
    ELSE ZF ← 1;
FI;

Flags Affected
The ZF flag is cleared if the instruction is executed transactionally; otherwise it is set to 1. The CF, OF, SF, PF, and AF flags are cleared.

Intel C/C++ Compiler Intrinsic Equivalent
XTEST: int _xtest( void );

SIMD Floating-Point Exceptions
None.

Other Exceptions
#UD	If CPUID.(EAX=7, ECX=0):EBX.HLE[bit 4] = 0 and CPUID.(EAX=7, ECX=0):EBX.RTM[bit 11] = 0.
	If the LOCK, 66H, F2H, or F3H prefix is used.
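As an editorial illustration of the ZF behavior described above, the program below uses the _xtest intrinsic together with the standard RTM intrinsics _xbegin and _xend from <immintrin.h>. It assumes an RTM-capable processor and a compiler with RTM support enabled (e.g., -mrtm); the result is captured inside the region and printed only after _xend, since a system call inside a transaction would abort it.

    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        printf("outside a transaction: %d\n", _xtest());  /* prints 0 */

        if (_xbegin() == _XBEGIN_STARTED) {
            int in_tx = _xtest();  /* nonzero: ZF = 0, transactional */
            _xend();
            printf("inside the transaction: %d\n", in_tx);
        } else {
            printf("transaction aborted\n");
        }
        return 0;
    }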