Posted to issues@orc.apache.org by "wpleonardo (via GitHub)" <gi...@apache.org> on 2023/04/18 04:29:30 UTC

[GitHub] [orc] wpleonardo opened a new pull request, #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

wpleonardo opened a new pull request, #1375:
URL: https://github.com/apache/orc/pull/1375

   ### What changes were proposed in this pull request?
   The original ORC Rle-bit-packing decoder unpacks values one at a time. Intel AVX-512 provides 512-bit vector operations, so the same data can be unpacked with far fewer CPU instructions, which makes AVX-512 vector decode much faster. In the functional micro-performance test, I estimate that AVX-512 vector decode brings an average 6X to 7X latency improvement, comparing the vector function vectorUnpackX with the original Rle-bit-packing decode function plainUnpackLongs. In real workloads, users store large amounts of data in the ORC format and need to decode hundreds or thousands of bytes at a time, so AVX-512 vector decode will make this processing more efficient.
   
   In practice, ORC data values are usually no wider than 32 bits, so this PR provides vector decode for bit widths up to 32 bits. For 8-bit, 16-bit, and other widths that are multiples of 8, the performance improvement is relatively small compared with widths that are not multiples of 8.
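
For reference, the one-value-at-a-time decode described above can be sketched as follows. This is a minimal illustration, not the ORC implementation: `unpackScalar` is a hypothetical helper, and it assumes values are packed most-significant-bit first with no padding between them.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of ORC): decode `count` values of `bitWidth`
// bits each from `src`, one value at a time.
// Assumptions: MSB-first packing, 1 <= bitWidth <= 32, and `src` holds at
// least count * bitWidth bits.
inline std::vector<int64_t> unpackScalar(const std::vector<uint8_t>& src, uint32_t bitWidth,
                                         size_t count) {
  std::vector<int64_t> out;
  out.reserve(count);
  size_t bitPos = 0;  // absolute bit offset into src
  for (size_t i = 0; i < count; ++i) {
    uint64_t value = 0;
    // Every output value costs bitWidth shift/mask steps; this per-value
    // work is what a 512-bit vector path amortizes across many values.
    for (uint32_t b = 0; b < bitWidth; ++b, ++bitPos) {
      const uint64_t bit = (src[bitPos / 8] >> (7 - bitPos % 8)) & 1u;
      value = (value << 1) | bit;
    }
    out.push_back(static_cast<int64_t>(value));
  }
  return out;
}
```

Widths that are multiples of 8 never straddle a byte boundary in this layout, which is consistent with the smaller gain reported for them.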
   
   Intel AVX512 instructions official link:
   https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
   
   1. Added a cmake option named "BUILD_ENABLE_AVX512" to enable or disable this feature at build time.
   The default value of BUILD_ENABLE_AVX512 is OFF.
   For example, cmake .. -DCMAKE_BUILD_TYPE=release -DBUILD_ENABLE_AVX512=ON
   This builds the ORC library with AVX512 bit-unpacking enabled.
   2. Added the macro "ORC_HAVE_RUNTIME_AVX512" to control whether this feature's code is compiled into ORC.
   3. Added the file "CpuInfoUtil.cc" to dynamically detect whether the current platform supports AVX-512. When ORC is built with AVX-512 enabled but runs on a platform that doesn't support AVX-512, it uses the original bit-packing decode function instead of AVX-512 vector decode.
   4. Added the functions "vectorUnpackX" to decode X-bit values in place of the original function plainUnpackLongs.
   5. Added the test cases "RleV2BitUnpackAvx512Test" in the new test file TestRleVectorDecoder.cc to verify AVX-512 vector decode of N-bit values.
   6. Modified the function plainUnpackLongs by adding an output parameter uint64_t& startBit. This parameter stores the number of bits left over after unpacking.
   7. AVX-512 vector decode processes 512 bits of data in each unpacking step. If the data being unpacked is long enough, almost all of it can be processed by AVX-512. If the data length (or block size) is shorter than 512 bits, AVX-512 is not used and decoding falls back to the original one-by-one unpacking.
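
The fallback rules in items 3 and 7 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `cpuSupportsAvx512`, `WorkSplit`, and `splitWork` are hypothetical names, the CPU check uses the GCC/Clang builtin `__builtin_cpu_supports` rather than CpuInfoUtil, only the AVX-512 Foundation subset is tested, and the vector path is assumed to consume whole 512-bit blocks.

```cpp
#include <cstdint>

// Hypothetical runtime check. On GCC/Clang for x86 this queries CPUID;
// on other compilers or architectures it conservatively reports "no".
inline bool cpuSupportsAvx512() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
  return __builtin_cpu_supports("avx512f") != 0;
#else
  return false;
#endif
}

struct WorkSplit {
  uint64_t vectorValues;  // values handed to the 512-bit path
  uint64_t scalarValues;  // values left for one-by-one decode
};

constexpr uint64_t kVectorBits = 512;

// Hypothetical split: whole 512-bit blocks go to the vector path; inputs
// shorter than one block, or platforms without AVX-512, use the scalar path.
inline WorkSplit splitWork(uint64_t numValues, uint32_t bitWidth, bool hasAvx512) {
  const uint64_t totalBits = numValues * bitWidth;
  if (!hasAvx512 || bitWidth == 0 || totalBits < kVectorBits) {
    return {0, numValues};
  }
  const uint64_t blocks = totalBits / kVectorBits;
  const uint64_t vectorValues = blocks * kVectorBits / bitWidth;
  return {vectorValues, numValues - vectorValues};
}
```

A caller would evaluate `cpuSupportsAvx512()` once, pass the result to `splitWork`, and send the `scalarValues` remainder through the original decoder.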
   
   Add new files:
   
   New Files | File Purpose
   -- | --
   CpuInfoUtil.hh .cc | Dynamically detects whether the current platform supports AVX-512. If it does, AVX-512 vector decode is used; if not, the original decode functions are used.
   BitUnpackerAvx512.hh | Contains the new macros, arrays, and unions that AVX-512 vector decode needs.
   BpackingAvx512.hh .cc | Contains the AVX512 bit-unpacking functions for 1~32 bit data.
   BpackingDefault.hh .cc | Contains the default bit-unpacking functions.
   Dispatch.hh | Contains the dynamic dispatch according to the available DispatchLevel.
   TestRleVectorDecoder.cc | New test cases for unit and functional testing of this feature.
   
   ### Why are the changes needed?
   This improves the performance of Rle-bit-packing decode. In the functional micro-performance test, I estimate that AVX-512 vector decode brings an average 6X to 7X latency improvement, comparing the vector function vectorUnpackX with the original Rle-bit-packing decode function plainUnpackLongs.
   Intel improves CPU performance every year, and users analyze ORC data on newer platforms. AVX512 instructions have been supported since the Intel SKX platform, 6 years ago. Upgrading ORC data unpacking to use this widely available CPU feature keeps ORC up to date.
   
   ### How to enable AVX512 Bit-unpacking?
   1. Enable the cmake option BUILD_ENABLE_AVX512 to build the ORC library with AVX512 enabled.
   cmake .. -DCMAKE_BUILD_TYPE=release -DBUILD_ENABLE_AVX512=ON
   2. Set the environment variable when using the ORC library:
   export ORC_USER_SIMD_LEVEL=AVX512
   (Note: This variable accepts only 2 values, "AVX512" and "none", and the value is case-insensitive.)
   If ORC_USER_SIMD_LEVEL=none is set, AVX512 bit-unpacking is disabled.
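
As a rough sketch of how such a case-insensitive setting could be read (not the parsing code in this PR): `SimdLevel`, `parseSimdLevel`, and `simdLevelFromEnv` are hypothetical names, and returning an empty optional for an unset or unrecognized value is an assumption, since the description does not say what happens in those cases.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <optional>
#include <string>

enum class SimdLevel { None, Avx512 };

// Hypothetical parser: case-insensitive match of "AVX512" or "none".
// Any other text yields std::nullopt so the caller can choose a default.
inline std::optional<SimdLevel> parseSimdLevel(std::string text) {
  std::transform(text.begin(), text.end(), text.begin(),
                 [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
  if (text == "AVX512") return SimdLevel::Avx512;
  if (text == "NONE") return SimdLevel::None;
  return std::nullopt;
}

// Reads ORC_USER_SIMD_LEVEL from the environment; an unset variable also
// yields std::nullopt (an assumption of this sketch).
inline std::optional<SimdLevel> simdLevelFromEnv() {
  const char* raw = std::getenv("ORC_USER_SIMD_LEVEL");
  if (raw == nullptr) return std::nullopt;
  return parseSimdLevel(raw);
}
```

With this sketch, `export ORC_USER_SIMD_LEVEL=avx512` and `export ORC_USER_SIMD_LEVEL=AVX512` are treated the same way.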
   
   ### How was this patch tested?
   I created a new test file, TestRleVectorDecoder.cc, which contains the test cases below. To run them, build with the cmake option -DBUILD_ENABLE_AVX512=ON on a platform that supports AVX-512. Every test case contains 2 scenarios:
   1. The blockSize increases from 1 to 10000, and the data length is 10240;
   2. The blockSize increases from 1000 to 10000, and the data length increases from 1000 to 70000.
   The test cases take a while to run, so I added a progress bar to each one.
   Here is sample progress bar output from one test case:
   [ RUN      ] OrcTest/RleVectorTest.RleV2_basic_vector_decode_10bit/1
   10bit Test 1st Part:[OK][#################################################################################][100%]
   10bit Test 2nd Part:[OK][#################################################################################][100%]
   For the main vector function vectorUnpackX, test code coverage reaches 100%.
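
The two scenarios above vary the decode block size over a fixed or growing input length. A minimal round-trip check in that spirit might look like the sketch below. It is not the code in TestRleVectorDecoder.cc: `packBits`, `readValue`, and `roundTripOk` are hypothetical helpers, MSB-first packing is assumed, and a scalar reference decoder stands in for the vector path.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical MSB-first packer, used only to build test input.
inline std::vector<uint8_t> packBits(const std::vector<uint64_t>& values, uint32_t bitWidth) {
  std::vector<uint8_t> out((values.size() * bitWidth + 7) / 8, 0);
  size_t bitPos = 0;
  for (uint64_t v : values) {
    for (uint32_t b = bitWidth; b-- > 0; ++bitPos) {
      if ((v >> b) & 1u) out[bitPos / 8] |= static_cast<uint8_t>(0x80u >> (bitPos % 8));
    }
  }
  return out;
}

// Hypothetical reference decoder: reads one value starting at bit `bitPos`.
inline uint64_t readValue(const std::vector<uint8_t>& src, size_t bitPos, uint32_t bitWidth) {
  uint64_t value = 0;
  for (uint32_t b = 0; b < bitWidth; ++b, ++bitPos) {
    value = (value << 1) | ((src[bitPos / 8] >> (7 - bitPos % 8)) & 1u);
  }
  return value;
}

// Packs `count` pseudo-random values of `bitWidth` bits (1..32), decodes them
// in blocks of `blockSize`, and reports whether every value survived.
inline bool roundTripOk(uint32_t bitWidth, size_t count, size_t blockSize) {
  if (blockSize == 0 || bitWidth == 0 || bitWidth > 32) return false;
  const uint64_t mask = (1ULL << bitWidth) - 1;
  std::vector<uint64_t> input(count);
  for (size_t i = 0; i < count; ++i) input[i] = (i * 2654435761ULL) & mask;
  const std::vector<uint8_t> packed = packBits(input, bitWidth);

  for (size_t start = 0; start < count; start += blockSize) {
    const size_t end = std::min(count, start + blockSize);
    for (size_t i = start; i < end; ++i) {
      if (readValue(packed, i * bitWidth, bitWidth) != input[i]) return false;
    }
  }
  return true;
}
```

A real test would replace the inner `readValue` loop with a call into the decoder under test and sweep `blockSize` and `count` over the ranges listed in the scenarios.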
   
   
   New Testcases | Test Data Bit Size
   -- | --
   RleV2_basic_vector_decode_1bit | 1bit
   RleV2_basic_vector_decode_2bit | 2bit
   RleV2_basic_vector_decode_3bit | 3bit
   RleV2_basic_vector_decode_4bit | 4bit
   RleV2_basic_vector_decode_5bit | 5bit
   RleV2_basic_vector_decode_6bit | 6bit
   RleV2_basic_vector_decode_7bit | 7bit
   RleV2_basic_vector_decode_9bit | 9bit
   RleV2_basic_vector_decode_10bit | 10bit
   RleV2_basic_vector_decode_11bit | 11bit
   RleV2_basic_vector_decode_12bit | 12bit
   RleV2_basic_vector_decode_13bit | 13bit
   RleV2_basic_vector_decode_14bit | 14bit
   RleV2_basic_vector_decode_15bit | 15bit
   RleV2_basic_vector_decode_16bit | 16bit
   RleV2_basic_vector_decode_17bit | 17bit
   RleV2_basic_vector_decode_18bit | 18bit
   RleV2_basic_vector_decode_19bit | 19bit
   RleV2_basic_vector_decode_20bit | 20bit
   RleV2_basic_vector_decode_21bit | 21bit
   RleV2_basic_vector_decode_22bit | 22bit
   RleV2_basic_vector_decode_23bit | 23bit
   RleV2_basic_vector_decode_24bit | 24bit
   RleV2_basic_vector_decode_26bit | 26bit
   RleV2_basic_vector_decode_28bit | 28bit
   RleV2_basic_vector_decode_30bit | 30bit
   RleV2_basic_vector_decode_32bit | 32bit
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1163489688


##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,2724 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"
+#include "CpuInfoUtil.hh"
+#include "RLEv2.hh"
+
+namespace orc {
+  UnpackAvx512::UnpackAvx512(RleDecoderV2* dec) : decoder(dec), unpackDefault(UnpackDefault(dec)) {
+    // PASS
+  }
+
+  UnpackAvx512::~UnpackAvx512() {
+    // PASS
+  }
+
+  inline void UnpackAvx512::alignHeaderBoundary(uint64_t& startBit, uint64_t& bufMoveByteLen,
+                                                uint64_t& bufRestByteLen, uint64_t& len,
+                                                uint32_t& bitWidth, uint64_t& tailBitLen,
+                                                uint32_t& backupByteLen, uint64_t& numElements,
+                                                bool& resetBuf, const uint8_t*& srcPtr,
+                                                int64_t*& dstPtr, uint32_t bitMaxSize) {
+    if (startBit != 0) {
+      bufMoveByteLen +=
+          moveLen(len * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH, ORC_VECTOR_BYTE_WIDTH);
+    } else {
+      bufMoveByteLen += moveLen(len * bitWidth, ORC_VECTOR_BYTE_WIDTH);
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      numElements = len;
+      resetBuf = false;
+      len -= numElements;
+    } else {
+      if (startBit != 0) {
+        numElements =
+            (bufRestByteLen * ORC_VECTOR_BYTE_WIDTH + ORC_VECTOR_BYTE_WIDTH - startBit) / bitWidth;
+        len -= numElements;
+        tailBitLen = fmod(bufRestByteLen * ORC_VECTOR_BYTE_WIDTH + ORC_VECTOR_BYTE_WIDTH - startBit,
+                          bitWidth);
+        resetBuf = true;
+      } else {
+        numElements = (bufRestByteLen * ORC_VECTOR_BYTE_WIDTH) / bitWidth;
+        len -= numElements;
+        tailBitLen = fmod(bufRestByteLen * ORC_VECTOR_BYTE_WIDTH, bitWidth);
+        resetBuf = true;
+      }

Review Comment:
   Fixed.



[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1444982057

   Hi @wgtmac, I found an error on macOS. I'm fixing it now. Thanks. 


[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1451736803

   Hi @wgtmac, the first CI test case failed with the following error:
     /usr/bin/git -c protocol.version=2 fetch --no-tags --prune --progress --no-recurse-submodules --depth=1 origin +0669b46b6652da4b9662655025ab78cf466f1428:refs/remotes/pull/1375/merge
     Error: fatal: unable to access 'https://github.com/apache/orc/': The requested URL returned error: 429
   
   All of the other test cases passed. It seems there was a network error while the first test case was running.
   Could you help me rerun the CI test? Thank you very much!
   
   Meanwhile, I'm working on the new GitHub action.


[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1144371049


##########
c++/src/BpackingAvx512.hh:
##########
@@ -0,0 +1,89 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKINGAVX512_HH
+#define ORC_BPACKINGAVX512_HH
+
+#include <stdlib.h>

Review Comment:
   Done.



[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1449475273

   > That's fine. Please make the CIs happy.
   > 
   > One more thing, could you please add one or more Github actions with platform equipped with AVX512? In that way, we can have this new feature tested automatically. Current workflows are here: https://github.com/apache/orc/blob/main/.github/workflows/build_and_test.yml
   > 
   > Please check this for reference: https://github.com/marketplace/actions/run-on-architecture
   
   OK, I will do it and fix the CI errors.


[GitHub] [orc] wgtmac commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1491353070

   @wpleonardo Please make the CI happy. Thanks!


[GitHub] [orc] wgtmac commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by GitBox <gi...@apache.org>.

wgtmac commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1383144045

   > May I ask a question about the clang-format errors in TestRleVectorDecoder.cc? I have already used clang-format -style=google to format TestRleVectorDecoder.cc, but I still get clang-format errors in CI. Do we use -style=google in clang-format, or another style? Thank you very much!
   
   The clang-format we use is defined here: https://github.com/apache/orc/blob/main/.clang-format. You can simply use `clang-format -i TestRleVectorDecoder.cc` to format it automatically.


[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1169460079


##########
c++/src/BitUnpackerAvx512.hh:
##########
@@ -0,0 +1,488 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BIT_UNPACKER_AVX512_HH
+#define ORC_BIT_UNPACKER_AVX512_HH
+
+// Mingw-w64 defines strcasecmp in string.h
+#if defined(_WIN32) && !defined(strcasecmp)
+#include <string.h>
+#define strcasecmp stricmp
+#else
+#include <strings.h>
+#endif
+
+#include <immintrin.h>
+#include <cstdint>
+#include <vector>
+
+namespace orc {
+#define ORC_VECTOR_BITS_2_BYTE(x) \
+  (((x) + 7u) >> 3u) /**< Convert a number of bits to a number of bytes */
+#define ORC_VECTOR_ONE_64U (1ULL)
+#define ORC_VECTOR_MAX_16U 0xFFFF     /**< Max value for uint16_t */
+#define ORC_VECTOR_MAX_32U 0xFFFFFFFF /**< Max value for uint32_t */
+#define ORC_VECTOR_BYTE_WIDTH 8u      /**< Byte width in bits */
+#define ORC_VECTOR_WORD_WIDTH 16u     /**< Word width in bits */
+#define ORC_VECTOR_DWORD_WIDTH 32u    /**< Dword width in bits */
+#define ORC_VECTOR_QWORD_WIDTH 64u    /**< Qword width in bits */
+#define ORC_VECTOR_BIT_MASK(x) \
+  ((ORC_VECTOR_ONE_64U << (x)) - 1u) /**< Bit mask below bit position */
+
+#define ORC_VECTOR_BITS_2_WORD(x) \
+  (((x) + 15u) >> 4u) /**< Convert a number of bits to a number of words */
+#define ORC_VECTOR_BITS_2_DWORD(x) \
+  (((x) + 31u) >> 5u) /**< Convert a number of bits to a number of double words */
+
+  // ------------------------------------ 3u -----------------------------------------
+  static uint8_t shuffleIdxTable3u_0[64] = {

Review Comment:
   Fixed



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1138652526


##########
c++/src/CpuInfoUtil.cc:
##########
@@ -0,0 +1,581 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "CpuInfoUtil.hh"
+
+#ifdef __APPLE__
+#include <sys/sysctl.h>
+#endif
+
+#ifndef _MSC_VER
+#include <unistd.h>
+#endif
+
+#ifdef _WIN32
+#define NOMINMAX
+#include <Windows.h>
+#include <intrin.h>
+#endif
+
+#include <algorithm>
+#include <array>
+#include <bitset>
+#include <cctype>
+#include <cerrno>
+#include <cstdint>
+#include <fstream>
+#include <memory>
+#include <optional>
+#include <sstream>
+#include <string>
+#include <thread>
+#include <vector>

Review Comment:
   OK, thank you very much for reminding me.
   Deleting the headers below:
   <cctype>
   <cerrno>
   <memory>
   <sstream>



[GitHub] [orc] wgtmac commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1139739087


##########
cmake_modules/ConfigSimdLevel.cmake:
##########
@@ -0,0 +1,92 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

Review Comment:
   ```suggestion
   # Licensed to the Apache Software Foundation (ASF) under one
   # or more contributor license agreements.  See the NOTICE file
   # distributed with this work for additional information
   # regarding copyright ownership.  The ASF licenses this file
   # to you under the Apache License, Version 2.0 (the
   # "License"); you may not use this file except in compliance
   # with the License.  You may obtain a copy of the License at
   #
   #   http://www.apache.org/licenses/LICENSE-2.0
   #
   # Unless required by applicable law or agreed to in writing,
   # software distributed under the License is distributed on an
   # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   # KIND, either express or implied.  See the License for the
   # specific language governing permissions and limitations
   # under the License.
   ```



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1138652526


##########
c++/src/CpuInfoUtil.cc:
##########
@@ -0,0 +1,581 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "CpuInfoUtil.hh"
+
+#ifdef __APPLE__
+#include <sys/sysctl.h>
+#endif
+
+#ifndef _MSC_VER
+#include <unistd.h>
+#endif
+
+#ifdef _WIN32
+#define NOMINMAX
+#include <Windows.h>
+#include <intrin.h>
+#endif
+
+#include <algorithm>
+#include <array>
+#include <bitset>
+#include <cctype>
+#include <cerrno>
+#include <cstdint>
+#include <fstream>
+#include <memory>
+#include <optional>
+#include <sstream>
+#include <string>
+#include <thread>
+#include <vector>

Review Comment:
   OK, thank you very much for reminding me.
   Deleting the headers below:
   cctype
   cerrno
   memory



[GitHub] [orc] dongjoon-hyun commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1415257863

   According to the CI status, I converted this PR from `Draft` to `Normal` PR.


Re: [PR] ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode [orc]

Posted by "taiyang-li (via GitHub)" <gi...@apache.org>.

taiyang-li commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1754699373

   @wpleonardo Do we have any performance benchmarks for this PR?   @alexey-milovidov Maybe you are interested in it. 
   
   I tried to use this feature in ClickHouse (https://github.com/clickHouse/ClickHouse), but did not see any performance improvement. 
   
   Q: `select *  from  file('/data1/clickhouse_official/data/user_files/bigolive_audience_stats_orc.orc') format Null;`
   
   With AVX512: 
   ```
   0 rows in set. Elapsed: 3.659 sec. Processed 1.13 million rows, 486.19 MB (308.68 thousand rows/s., 132.88 MB/s.)
   0 rows in set. Elapsed: 3.653 sec. Processed 1.20 million rows, 517.87 MB (329.40 thousand rows/s., 141.76 MB/s.)
   0 rows in set. Elapsed: 3.719 sec. Processed 1.13 million rows, 486.19 MB (303.70 thousand rows/s., 130.74 MB/s.)
   ```
   
   Without AVX512
   ```
   0 rows in set. Elapsed: 3.565 sec. Processed 1.13 million rows, 486.19 MB (316.81 thousand rows/s., 136.38 MB/s.)
   0 rows in set. Elapsed: 3.540 sec. Processed 1.20 million rows, 517.87 MB (339.91 thousand rows/s., 146.28 MB/s.)
   0 rows in set. Elapsed: 3.681 sec. Processed 1.20 million rows, 517.87 MB (326.90 thousand rows/s., 140.69 MB/s.)
   ``` 
   
   About the test orc file: 
   ```
   $ du -sh bigolive_audience_stats_orc.orc                                                     
   505M	bigolive_audience_stats_orc.orc
   
   
   $ orc-metadata ./bigolive_audience_stats_orc.orc                           
   { "name": "./bigolive_audience_stats_orc.orc",
     "type": "struct<reporttime:bigint,appid:bigint,uid:bigint,platform:int,nettype:int,clientversioncode:bigint,sdkversioncode:bigint,statid:string,statversion:int,countrycode:string,language:string,model:string,osversion:string,channel:string,heartcount:int,msgcount:int,giftcount:int,barragecount:int,gid:string,entrytype:int,prefetchedms:int,linkdstate:int,networkavailable:int,starttimestamp:bigint,sessionlogints:int,medialogints:int,sdkboundts:int,msconnectedts:int,vsconnectedts:int,firstiframets:int,ownerstatus:int,stopreason:int,totaltime:int,cpuusageavg:int,memusageavg:int,backgroundtotal:bigint,foregroundtotal:bigint,firstvideopackts:int,firstvoicerecvts:int,firstvoiceplayts:int,firstiframeassemblets:int,uiinitts:int,uiloadedts:int,uiappearedts:int,setvideoviewts:int,blurviewdimissts:int,preparesdkinqueuets:int,preparesdkexects:int,startsdkinqueuets:int,startsdkexects:int,sdkjoinchannelinqueuets:int,sdkjoinchannelexects:int,lastsdkleavechannelinqueuets:int,lastsdkleavechannelexects:int,unused_1:int,unused_2:int,setvideoviewinqueuets:int,setvideoviewexects:int,livetype:int,audiostatus:int,firstiframesize:bigint,firstiframedecodetime:bigint,extras:bigint,entrancetype:int,entrancemode:int,mclientip:bigint,mnc:bigint,mcc:bigint,vsipsuccess:bigint,msipsuccess:bigint,vsipfail:bigint,msipfail:bigint,mediaflag:bigint,dispatchid:string,proxyflag:int,redirectcount:int,directorrescode:int,subentrancetab:string,logininfolist:array<struct<strategy:bigint,ip:bigint,loginStat:bigint,reserve1:bigint,reserve2:bigint>>,playcentertype:int,videomutetype:bigint,owneruid:bigint,extra:string>",
     "rows": 1203317,
     "stripe count": 12,
     "format": "0.12", "writer version": "future - 9",
     "compression": "snappy", "compression block": 65536,
     "file length": 529207118,
     "content": 529182229, "stripe stats": 21150, "footer": 3712, "postscript": 26,
     "row index stride": 10000,
     "user metadata": {
       "org.apache.spark.version": "3.3.2"
     },
     "stripes": [
       { "stripe": 0, "rows": 117760,
         "offset": 3, "length": 50876922,
         "index": 23728, "data": 50851823, "footer": 1371
       },
       { "stripe": 1, "rows": 117760,
         "offset": 50876925, "length": 50948680,
         "index": 23679, "data": 50923619, "footer": 1382
       },
       { "stripe": 2, "rows": 62050,
         "offset": 101825605, "length": 26902880,
         "index": 15322, "data": 26886211, "footer": 1347
       },
       { "stripe": 3, "rows": 117760,
         "offset": 128728485, "length": 50474083,
         "index": 24110, "data": 50448601, "footer": 1372
       },
       { "stripe": 4, "rows": 117760,
         "offset": 179202568, "length": 50413042,
         "index": 23858, "data": 50387825, "footer": 1359
       },
       { "stripe": 5, "rows": 63570,
         "offset": 229615610, "length": 27504277,
         "index": 14890, "data": 27488029, "footer": 1358
       },
       { "stripe": 6, "rows": 117760,
         "offset": 268435456, "length": 50981984,
         "index": 24191, "data": 50956424, "footer": 1369
       },
       { "stripe": 7, "rows": 117760,
         "offset": 319417440, "length": 51017894,
         "index": 23792, "data": 50992731, "footer": 1371
       },
       { "stripe": 8, "rows": 61720,
         "offset": 370435334, "length": 26840720,
         "index": 15246, "data": 26824109, "footer": 1365
       },
       { "stripe": 9, "rows": 117760,
         "offset": 397276054, "length": 49971095,
         "index": 23487, "data": 49946233, "footer": 1375
       },
       { "stripe": 10, "rows": 117760,
         "offset": 447247149, "length": 50259825,
         "index": 24090, "data": 50234369, "footer": 1366
       },
       { "stripe": 11, "rows": 73897,
         "offset": 497506974, "length": 31675255,
         "index": 16948, "data": 31656952, "footer": 1355
       }
     ]
   }
   
   ```
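
To put the two groups of timings above side by side, the sketch below averages the three elapsed times in each group. The numbers are copied from the runs quoted above; `mean`, `kWithAvx512`, `kWithoutAvx512`, and `slowdownRatio` are illustrative names, not part of ORC or ClickHouse.

```cpp
#include <array>
#include <numeric>

// Illustrative helper: arithmetic mean of three timings, in seconds.
inline double mean(const std::array<double, 3>& xs) {
  return std::accumulate(xs.begin(), xs.end(), 0.0) / static_cast<double>(xs.size());
}

// Elapsed seconds copied from the six runs quoted in the comment above.
const std::array<double, 3> kWithAvx512 = {{3.659, 3.653, 3.719}};
const std::array<double, 3> kWithoutAvx512 = {{3.565, 3.540, 3.681}};

// A ratio above 1 means the AVX512 build took longer on average.
inline double slowdownRatio() {
  return mean(kWithAvx512) / mean(kWithoutAvx512);
}
```

The means come out at roughly 3.677 s with AVX512 and 3.595 s without, a ratio of about 1.023. With only three runs per configuration, and a different row count reported in some runs, a gap of this size may well be noise rather than a real regression.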



[GitHub] [orc] stiga-huang commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "stiga-huang (via GitHub)" <gi...@apache.org>.

stiga-huang commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1097305535


##########
CMakeLists.txt:
##########
@@ -157,6 +172,139 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "powerpc|ppc")
+    set(ORC_CPU_FLAG "ppc")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "s390x")
+    set(ORC_CPU_FLAG "s390x")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "riscv64")
+    set(ORC_CPU_FLAG "riscv64")
+  else()
+    message(FATAL_ERROR "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_SSE4_2_FLAG "")
+    set(ORC_AVX2_FLAG "/arch:AVX2")
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+    set(CXX_SUPPORTS_SSE4_2 TRUE)
+  else()
+    set(ORC_SSE4_2_FLAG "-msse4.2")
+    set(ORC_AVX2_FLAG "-march=haswell")
+    # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
+    set(ORC_AVX512_FLAG "-march=native -mbmi2")
+    # Append the avx2/avx512 subset option also, fix issue ORC-9877 for homebrew-cpp
+    set(ORC_AVX2_FLAG "${ORC_AVX2_FLAG} -mavx2")
+    set(ORC_AVX512_FLAG
+        "${ORC_AVX512_FLAG} -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mavx512vbmi")
+  endif()
+  check_cxx_compiler_flag(${ORC_AVX512_FLAG} CXX_SUPPORTS_AVX512)
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")
+    check_cxx_source_compiles("
+      #ifdef _MSC_VER
+      #include <intrin.h>
+      #else
+      #include <immintrin.h>
+      #endif
+
+      int main() {
+        __m512i mask = _mm512_set1_epi32(0x1);
+        char out[32];
+        _mm512_storeu_si512(out, mask);
+        return 0;
+      }"
+      CXX_SUPPORTS_AVX512)
+    set(CMAKE_REQUIRED_FLAGS ${OLD_CMAKE_REQURED_FLAGS})
+  endif()
+  # Runtime SIMD level it can get from compiler and ORC_RUNTIME_SIMD_LEVEL
+  if(CXX_SUPPORTS_SSE4_2 AND ORC_RUNTIME_SIMD_LEVEL MATCHES
+                             "^(SSE4_2|AVX2|AVX512|MAX)$")
+    set(ORC_HAVE_RUNTIME_SSE4_2 ON)
+    set(ORC_SIMD_LEVEL "SSE4_2")
+    add_definitions(-DORC_HAVE_RUNTIME_SSE4_2)
+  endif()
+  if(CXX_SUPPORTS_AVX2 AND ORC_RUNTIME_SIMD_LEVEL MATCHES "^(AVX2|AVX512|MAX)$")
+    set(ORC_HAVE_RUNTIME_AVX2 ON)
+    set(ORC_SIMD_LEVEL "AVX2")
+    add_definitions(-DORC_HAVE_RUNTIME_AVX2 -DORC_HAVE_RUNTIME_BMI2)
+  endif()
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512 AND ORC_RUNTIME_SIMD_LEVEL MATCHES "^(AVX512|MAX)$")
+    message(STATUS "Enable the AVX512 vector decode of bit-packing")
+    set(ORC_HAVE_RUNTIME_AVX512 ON)
+    set(ORC_SIMD_LEVEL "AVX512")
+    add_definitions(-DORC_HAVE_RUNTIME_AVX512 -DORC_HAVE_RUNTIME_BMI2)
+  else ()
+    message(STATUS "Disable the AVX512 vector decode of bit-packing")

Review Comment:
   Can we also print the values of BUILD_ENABLE_AVX512, CXX_SUPPORTS_AVX512, and ORC_RUNTIME_SIMD_LEVEL, so we know why it's disabled? That would make debugging easier.




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1090092521


##########
CMakeLists.txt:
##########
@@ -67,6 +67,10 @@ option(BUILD_CPP_ENABLE_METRICS
     "Enable the metrics collection at compile phase"
     OFF)
 
+option(BUILD_ENABLE_AVX512
+    "Enable AVX512 vector decode of bit-packing"
+    OFF)

Review Comment:
   No, I just wanted to check the CI test results with the AVX512 feature disabled, because I don't have access to these different platforms and orc_test has only been run successfully on my local platform (CentOS 8).
   In the end, this option will be set to "ON", following your suggestions above. Thanks.




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1139540587


##########
c++/src/CpuInfoUtil.cc:
##########
@@ -0,0 +1,581 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "CpuInfoUtil.hh"
+
+#ifdef __APPLE__
+#include <sys/sysctl.h>
+#endif
+
+#ifndef _MSC_VER
+#include <unistd.h>
+#endif
+
+#ifdef _WIN32
+#define NOMINMAX
+#include <Windows.h>
+#include <intrin.h>
+#endif
+
+#include <algorithm>
+#include <array>
+#include <bitset>
+#include <cctype>
+#include <cerrno>
+#include <cstdint>
+#include <fstream>
+#include <memory>
+#include <optional>
+#include <sstream>
+#include <string>
+#include <thread>
+#include <vector>

Review Comment:
   I also deleted the code for the Arm and PowerPC platforms in [c++/src/CpuInfoUtil.cc].




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1139533686


##########
cmake_modules/ConfigSimdLevel.cmake:
##########
@@ -0,0 +1,105 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+INCLUDE(CheckCXXCompilerFlag)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_SIMD_LEVEL)
+  set(ORC_SIMD_LEVEL
+      "DEFAULT"
+      CACHE STRING "Compile time SIMD optimization level")
+endif()
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(STATUS "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+  else()
+    # "arch=native" selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine.
+    # Using -march=native enables all instruction subsets supported by the local machine.
+    # Using -mtune=native produces code optimized for the local machine under the constraints of the selected instruction set.
+    set(ORC_AVX512_FLAG "-march=native -mtune=native")
+  endif()
+
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")
+    CHECK_CXX_SOURCE_COMPILES("
+      #ifdef _MSC_VER
+      #include <intrin.h>
+      #else
+      #include <immintrin.h>
+      #endif
+
+      int main() {
+        __m512i mask = _mm512_set1_epi32(0x1);
+        char out[32];
+        _mm512_storeu_si512(out, mask);
+        return 0;
+      }"
+      CXX_SUPPORTS_AVX512)
+    set(CMAKE_REQUIRED_FLAGS ${OLD_CMAKE_REQURED_FLAGS})
+  endif()
+
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512 AND NOT MSVC)

Review Comment:
   Thank you for the reminder. The same check works on Windows when clang is used, so I removed "NOT MSVC".
   https://github.com/wpleonardo/orc/blob/440d6d159e356b6d2c863ef4bac9dde9a7977e99/cmake_modules/ConfigSimdLevel.cmake#L66




[GitHub] [orc] wgtmac commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1429331452

   > @wgtmac May I ask a question about your suggestion: "The options to enable AVX512 at compile and runtime are still confusing. The goal is to add one cmake option to control if AVX512 should be compiled. At runtime, we simply need an environment variable to enable/disable AVX512 at runtime if AVX512 has been compiled. We don't need to do the same thing as Apache Arrow because it has a different context."
   > 
   > Currently, I added a cmake option "BUILD_ENABLE_AVX512" to control if AVX512 code should be compiled. (For example, cmake command: cmake .. -DCMAKE_BUILD_TYPE=debug -DBUILD_JAVA=OFF -DBUILD_ENABLE_AVX512=ON) Then, if BUILD_ENABLE_AVX512 is on, a macro "ORC_HAVE_RUNTIME_AVX512" will be defined in the cmake process. It will enable the build of AVX512 code in the make process.
   > 
   > I also created an environment variable "ORC_USER_SIMD_LEVEL", whose value can be "none" or "avx512", to determine whether AVX512 is enabled or disabled at runtime.
   > 
   > Which part is confusing? Is the environment variable name "ORC_USER_SIMD_LEVEL" too similar to the macro "ORC_HAVE_RUNTIME_AVX512", or is it something else? Thank you very much!
   
   From the statement above, adding `BUILD_ENABLE_AVX512` and `ORC_USER_SIMD_LEVEL` is enough. `ORC_SIMD_LEVEL` is redundant.
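
The split described above (one CMake option deciding whether the AVX512 code is compiled, one environment variable deciding whether it is used at run time) might look roughly like the sketch below. `ORC_HAVE_RUNTIME_AVX512` and `ORC_USER_SIMD_LEVEL` are the names used in this thread; `SimdLevel`, `parseSimdLevel`, `avx512CompiledIn`, `selectSimdLevel`, and `currentSimdLevel` are hypothetical helpers. The case-insensitive matching and the fallback to "none" when the variable is unset are assumptions, and the CPU capability check a real implementation would also need is left out.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical enum for this sketch; not taken from the ORC sources.
enum class SimdLevel { None, Avx512 };

// Maps the text of ORC_USER_SIMD_LEVEL to a level. A null pointer (variable
// unset) or an unrecognized value falls back to None; that default is an assumption.
inline SimdLevel parseSimdLevel(const char* value) {
  if (value == nullptr) return SimdLevel::None;
  std::string s(value);
  std::transform(s.begin(), s.end(), s.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  return s == "avx512" ? SimdLevel::Avx512 : SimdLevel::None;
}

// True only when the build defined the macro, i.e. the AVX512 code was compiled.
inline bool avx512CompiledIn() {
#ifdef ORC_HAVE_RUNTIME_AVX512
  return true;
#else
  return false;
#endif
}

// The runtime request only matters if the code path exists in the binary.
inline SimdLevel selectSimdLevel(bool compiledIn, const char* envValue) {
  if (!compiledIn) return SimdLevel::None;
  return parseSimdLevel(envValue);
}

// Convenience wrapper that reads the real environment.
inline SimdLevel currentSimdLevel() {
  return selectSimdLevel(avx512CompiledIn(), std::getenv("ORC_USER_SIMD_LEVEL"));
}
```

Keeping the compile-time and run-time decisions in two separate functions is what makes a third setting such as `ORC_SIMD_LEVEL` unnecessary in this sketch: each knob has exactly one place where it is consulted.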



[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1428944374

   > Thanks for contributing this!
   > 
   > Can we make the existing tests also run on the new avx512 options? At least for tests in TestRleDecoder.cc (e.g. bitSize1Direct), they also provide coverage for different bit sizes.
   
   Yes, I think we can run the existing test cases with the new AVX512 options.



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1107117827


##########
CMakeLists.txt:
##########
@@ -157,6 +172,102 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(FATAL_ERROR "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+    set(CXX_SUPPORTS_SSE4_2 TRUE)
+  else()
+    # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
+    set(ORC_AVX512_FLAG "-march=native -mbmi2")
+    set(ORC_AVX512_FLAG
+        "${ORC_AVX512_FLAG} -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mavx512vbmi")
+  endif()
+  check_cxx_compiler_flag(${ORC_AVX512_FLAG} CXX_SUPPORTS_AVX512)
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")

Review Comment:
   The CMAKE_REQUIRED_FLAGS variable is described in the CMake documentation:
   https://cmake.org/cmake/help/latest/module/CheckCXXSourceCompiles.html
   So is there no need to change CMAKE_REQUIRED_FLAGS? Is my understanding correct?




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1107113508


##########
CMakeLists.txt:
##########
@@ -157,6 +172,102 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(FATAL_ERROR "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+    set(CXX_SUPPORTS_SSE4_2 TRUE)
+  else()
+    # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
+    set(ORC_AVX512_FLAG "-march=native -mbmi2")

Review Comment:
   Fixed.



##########
CMakeLists.txt:
##########
@@ -157,6 +172,102 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(FATAL_ERROR "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+    set(CXX_SUPPORTS_SSE4_2 TRUE)

Review Comment:
   Fixed




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1141578060


##########
.github/workflows/build_and_test.yml:
##########
@@ -91,6 +91,49 @@ jobs:
         cmake --build . --config Debug
         ctest -C Debug --output-on-failure
 
+  simdUbuntu:
+    name: "SIMD programming using C++ intrinsic functions on ${{ matrix.os }}"
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os:
+          - ubuntu-22.04
+        cxx:
+          - clang++
+    env:
+      ORC_USER_SIMD_LEVEL: avx512
+    steps:
+    - name: Checkout
+      uses: actions/checkout@v2
+    - name: "Test"
+      run: |
+        mkdir -p ~/.m2
+        mkdir build
+        cd build
+        cmake -DBUILD_JAVA=OFF -DBUILD_ENABLE_AVX512=ON ..
+        make package test-out
+
+  simdWindows:
+    name: "SIMD programming using C++ intrinsic functions on Windows"
+    runs-on: windows-2019
+    env:
+      ORC_USER_SIMD_LEVEL: avx512

Review Comment:
   Done



##########
.github/workflows/build_and_test.yml:
##########
@@ -91,6 +91,49 @@ jobs:
         cmake --build . --config Debug
         ctest -C Debug --output-on-failure
 
+  simdUbuntu:
+    name: "SIMD programming using C++ intrinsic functions on ${{ matrix.os }}"
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os:
+          - ubuntu-22.04
+        cxx:
+          - clang++
+    env:
+      ORC_USER_SIMD_LEVEL: avx512

Review Comment:
   Done




[GitHub] [orc] stiga-huang commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "stiga-huang (via GitHub)" <gi...@apache.org>.

stiga-huang commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1163675489


##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,2694 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"
+#include "CpuInfoUtil.hh"
+#include "RLEv2.hh"
+
+namespace orc {
+  UnpackAvx512::UnpackAvx512(RleDecoderV2* dec) : decoder(dec), unpackDefault(UnpackDefault(dec)) {
+    // PASS
+  }
+
+  UnpackAvx512::~UnpackAvx512() {
+    // PASS
+  }
+
+  inline void UnpackAvx512::alignHeaderBoundary(const uint32_t bitWidth, const uint32_t bitMaxSize,
+                                                uint64_t& startBit, uint64_t& bufMoveByteLen,
+                                                uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint64_t& tailBitLen, uint32_t& backupByteLen,
+                                                uint64_t& numElements, bool& resetBuf,
+                                                const uint8_t*& srcPtr, int64_t*& dstPtr) {
+    if (startBit != 0) {
+      bufMoveByteLen +=
+          moveByteLen(remainingNumElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+    } else {
+      bufMoveByteLen += moveByteLen(remainingNumElements * bitWidth);
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      numElements = remainingNumElements;
+      resetBuf = false;
+      remainingNumElements = 0;
+    } else {
+      uint64_t leadingBits = 0;
+      if (startBit != 0) leadingBits = ORC_VECTOR_BYTE_WIDTH - startBit;
+      uint64_t bufRestBitLen = bufRestByteLen * ORC_VECTOR_BYTE_WIDTH + leadingBits;
+      numElements = bufRestBitLen / bitWidth;
+      remainingNumElements -= numElements;
+      tailBitLen = fmod(bufRestBitLen, bitWidth);
+      resetBuf = true;
+    }
+
+    if (tailBitLen != 0) {
+      backupByteLen = tailBitLen / ORC_VECTOR_BYTE_WIDTH;
+      tailBitLen = 0;
+    }
+
+    if (startBit > 0) {
+      uint32_t align = getAlign(startBit, bitWidth, bitMaxSize);
+      if (align > numElements) {
+        align = numElements;
+      }
+      if (align != 0) {
+        bufMoveByteLen -= moveByteLen(align * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+        plainUnpackLongs(dstPtr, 0, align, bitWidth, startBit);
+        srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+        bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+        dstPtr += align;
+        numElements -= align;
+      }
+    }
+  }
+
+  inline void UnpackAvx512::alignTailerBoundary(const uint32_t bitWidth, uint64_t& startBit,
+                                                uint64_t& bufMoveByteLen, uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint32_t& backupByteLen, uint64_t& numElements,
+                                                bool& resetBuf, const uint8_t*& srcPtr,
+                                                int64_t*& dstPtr) {
+    if (numElements > 0) {
+      if (startBit != 0) {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+      } else {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth);
+      }
+      plainUnpackLongs(dstPtr, 0, numElements, bitWidth, startBit);
+      srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+      dstPtr += numElements;
+      bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufMoveByteLen,
+                                resetBuf, backupByteLen);
+      return;
+    }
+
+    if (backupByteLen != 0) {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                resetBuf, backupByteLen);
+      plainUnpackLongs(dstPtr, 0, 1, bitWidth, startBit);
+      dstPtr++;
+      backupByteLen = 0;
+      remainingNumElements--;
+    } else {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                resetBuf, backupByteLen);

Review Comment:
   Lines 114-115 duplicate lines 107-108: both branches begin with the same `decoder->resetBufferStart(...)` call. It can be hoisted out of the if/else.
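
   A minimal sketch of the suggested hoisting, not the PR's actual code. `MockDecoder`, `tailBranchy` and `tailHoisted` are hypothetical names, and the mock's `resetBufferStart` drops the two pointer arguments of the real method so that the example stays self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for RleDecoderV2: it only records each reset request.
struct MockDecoder {
  std::vector<uint64_t> resets;
  void resetBufferStart(uint64_t len, bool /*resetBuf*/, uint32_t /*backupLen*/) {
    resets.push_back(len);
  }
};

// Shape of the code under review: the same call is repeated in both branches.
inline uint64_t tailBranchy(MockDecoder& dec, uint64_t bufRestByteLen, bool resetBuf,
                            uint32_t& backupByteLen, uint64_t remaining) {
  assert(remaining > 0);
  if (backupByteLen != 0) {
    dec.resetBufferStart(bufRestByteLen, resetBuf, backupByteLen);
    backupByteLen = 0;  // the element straddling the buffer edge is consumed here
    remaining--;
  } else {
    dec.resetBufferStart(bufRestByteLen, resetBuf, backupByteLen);
  }
  return remaining;
}

// Suggested shape: the shared call runs once, before the branch.
inline uint64_t tailHoisted(MockDecoder& dec, uint64_t bufRestByteLen, bool resetBuf,
                            uint32_t& backupByteLen, uint64_t remaining) {
  assert(remaining > 0);
  dec.resetBufferStart(bufRestByteLen, resetBuf, backupByteLen);
  if (backupByteLen != 0) {
    backupByteLen = 0;
    remaining--;
  }
  return remaining;
}
```

   Both shapes pass the same `backupByteLen` to the reset call, because the branch only clears it afterwards. In the real function the `plainUnpackLongs(dstPtr, 0, 1, bitWidth, startBit)` call would stay inside the if-branch.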



##########
c++/src/RLEv2.hh:
##########
@@ -166,6 +166,20 @@ namespace orc {
 
     void next(int16_t* data, uint64_t numValues, const char* notNull) override;
 
+    unsigned char readByte(char** bufStart, char** bufEnd);
+
+    /**
+     * Most hotspot of this function locates in saving stack, so inline this function to have
+     * performance gain.
+     */
+    inline void resetBufferStart(char** bufStart, char** bufEnd, uint64_t len, bool resetBuf,
+                                 uint32_t backupLen);
+
+    char* bufferStart;
+    char* bufferEnd;
+    uint32_t bitsLeft;  // Used by readLongs when bitSize < 8
+    uint32_t curByte;   // Used by anything that uses readLongs

Review Comment:
   I think we can still keep these fields private by exposing accessor methods such as bufferStart() and bufferLength(). Most uses of `bufferEnd` are just `decoder->bufferEnd - decoder->bufferStart`.
   
   `bitsLeft` and `curByte` are used in `BpackingAvx512` and `BpackingDefault`. We can declare those classes as friends.
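
   A rough sketch of that layout, assuming simplified classes. `DecoderSketch` and `UnpackSketch` are hypothetical stand-ins for `RleDecoderV2` and the unpacker classes; the trailing underscores and the body of `readByte` are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

class UnpackSketch;  // hypothetical stand-in for the unpacker classes

class DecoderSketch {
 public:
  DecoderSketch(char* start, char* end) : bufferStart_(start), bufferEnd_(end) {}

  // Read-only accessors replace direct access to the raw pointers.
  const char* bufferStart() const { return bufferStart_; }
  uint64_t bufferLength() const {
    assert(bufferEnd_ >= bufferStart_);
    return static_cast<uint64_t>(bufferEnd_ - bufferStart_);
  }

 private:
  friend class UnpackSketch;  // only the unpacker may touch the bit-level state

  char* bufferStart_;
  char* bufferEnd_;
  uint32_t bitsLeft_ = 0;
  uint32_t curByte_ = 0;
};

class UnpackSketch {
 public:
  explicit UnpackSketch(DecoderSketch* dec) : decoder_(dec) {}

  // Consumes one byte and updates the decoder's private state via friendship.
  unsigned char readByte() {
    assert(decoder_->bufferLength() > 0);
    decoder_->curByte_ = static_cast<unsigned char>(*decoder_->bufferStart_++);
    decoder_->bitsLeft_ = 8;
    return static_cast<unsigned char>(decoder_->curByte_);
  }

  uint32_t bitsLeft() const { return decoder_->bitsLeft_; }

 private:
  DecoderSketch* decoder_;
};
```

   With this shape, callers that only need the remaining byte count use `bufferLength()` instead of subtracting the two pointers themselves.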



##########
c++/src/BitUnpackerAvx512.hh:
##########
@@ -0,0 +1,488 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BIT_UNPACKER_AVX512_HH
+#define ORC_BIT_UNPACKER_AVX512_HH
+
+// Mingw-w64 defines strcasecmp in string.h
+#if defined(_WIN32) && !defined(strcasecmp)
+#include <string.h>
+#define strcasecmp stricmp
+#else
+#include <strings.h>
+#endif
+
+#include <immintrin.h>
+#include <cstdint>
+#include <vector>
+
+namespace orc {
+#define ORC_VECTOR_BITS_2_BYTE(x) \
+  (((x) + 7u) >> 3u) /**< Convert a number of bits to a number of bytes */
+#define ORC_VECTOR_ONE_64U (1ULL)
+#define ORC_VECTOR_MAX_16U 0xFFFF     /**< Max value for uint16_t */
+#define ORC_VECTOR_MAX_32U 0xFFFFFFFF /**< Max value for uint32_t */
+#define ORC_VECTOR_BYTE_WIDTH 8u      /**< Byte width in bits */
+#define ORC_VECTOR_WORD_WIDTH 16u     /**< Word width in bits */
+#define ORC_VECTOR_DWORD_WIDTH 32u    /**< Dword width in bits */
+#define ORC_VECTOR_QWORD_WIDTH 64u    /**< Qword width in bits */
+#define ORC_VECTOR_BIT_MASK(x) \
+  ((ORC_VECTOR_ONE_64U << (x)) - 1u) /**< Bit mask below bit position */
+
+#define ORC_VECTOR_BITS_2_WORD(x) \
+  (((x) + 15u) >> 4u) /**< Convert a number of bits to a number of words */
+#define ORC_VECTOR_BITS_2_DWORD(x) \
+  (((x) + 31u) >> 5u) /**< Convert a number of bits to a number of double words */
+
+  // ------------------------------------ 3u -----------------------------------------
+  static uint8_t shuffleIdxTable3u_0[64] = {

Review Comment:
   Let's add `const` to these lookup tables, since they are never modified.
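
   For illustration, a small sketch of a `const` table. `shuffleIdxSketch` is a hypothetical 8-entry table, not one of the PR's 64-byte tables, and `lookupShuffleIdx` is likewise made up.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical 8-entry table standing in for the 64-byte shuffle tables.
// `const` lets the compiler reject accidental writes at compile time.
static const uint8_t shuffleIdxSketch[8] = {0, 1, 1, 2, 2, 3, 3, 4};

// The elements are const-qualified, so they cannot be assigned to.
static_assert(
    std::is_const<std::remove_reference<decltype(shuffleIdxSketch[0])>::type>::value,
    "table entries must be read-only");

// Readers are unaffected: the index is wrapped into the table and looked up.
inline uint8_t lookupShuffleIdx(unsigned i) {
  return shuffleIdxSketch[i & 7u];
}
```

   Loads such as `_mm512_loadu_si512` take a pointer to const, so the existing call sites should not need casts after the change.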



##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,2694 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"
+#include "CpuInfoUtil.hh"
+#include "RLEv2.hh"
+
+namespace orc {
+  UnpackAvx512::UnpackAvx512(RleDecoderV2* dec) : decoder(dec), unpackDefault(UnpackDefault(dec)) {
+    // PASS
+  }
+
+  UnpackAvx512::~UnpackAvx512() {
+    // PASS
+  }
+
+  inline void UnpackAvx512::alignHeaderBoundary(const uint32_t bitWidth, const uint32_t bitMaxSize,
+                                                uint64_t& startBit, uint64_t& bufMoveByteLen,
+                                                uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint64_t& tailBitLen, uint32_t& backupByteLen,
+                                                uint64_t& numElements, bool& resetBuf,
+                                                const uint8_t*& srcPtr, int64_t*& dstPtr) {
+    if (startBit != 0) {
+      bufMoveByteLen +=
+          moveByteLen(remainingNumElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+    } else {
+      bufMoveByteLen += moveByteLen(remainingNumElements * bitWidth);
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      numElements = remainingNumElements;
+      resetBuf = false;
+      remainingNumElements = 0;
+    } else {
+      uint64_t leadingBits = 0;
+      if (startBit != 0) leadingBits = ORC_VECTOR_BYTE_WIDTH - startBit;
+      uint64_t bufRestBitLen = bufRestByteLen * ORC_VECTOR_BYTE_WIDTH + leadingBits;
+      numElements = bufRestBitLen / bitWidth;
+      remainingNumElements -= numElements;
+      tailBitLen = fmod(bufRestBitLen, bitWidth);
+      resetBuf = true;
+    }
+
+    if (tailBitLen != 0) {
+      backupByteLen = tailBitLen / ORC_VECTOR_BYTE_WIDTH;
+      tailBitLen = 0;
+    }
+
+    if (startBit > 0) {
+      uint32_t align = getAlign(startBit, bitWidth, bitMaxSize);
+      if (align > numElements) {
+        align = numElements;
+      }
+      if (align != 0) {
+        bufMoveByteLen -= moveByteLen(align * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+        plainUnpackLongs(dstPtr, 0, align, bitWidth, startBit);
+        srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+        bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+        dstPtr += align;
+        numElements -= align;
+      }
+    }
+  }
+
+  inline void UnpackAvx512::alignTailerBoundary(const uint32_t bitWidth, uint64_t& startBit,
+                                                uint64_t& bufMoveByteLen, uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint32_t& backupByteLen, uint64_t& numElements,
+                                                bool& resetBuf, const uint8_t*& srcPtr,
+                                                int64_t*& dstPtr) {
+    if (numElements > 0) {
+      if (startBit != 0) {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+      } else {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth);
+      }
+      plainUnpackLongs(dstPtr, 0, numElements, bitWidth, startBit);
+      srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+      dstPtr += numElements;
+      bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufMoveByteLen,
+                                resetBuf, backupByteLen);
+      return;
+    }
+
+    if (backupByteLen != 0) {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                resetBuf, backupByteLen);
+      plainUnpackLongs(dstPtr, 0, 1, bitWidth, startBit);
+      dstPtr++;
+      backupByteLen = 0;
+      remainingNumElements--;
+    } else {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                resetBuf, backupByteLen);
+    }
+
+    bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bufMoveByteLen = 0;
+    srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+  }
+
+  void UnpackAvx512::vectorUnpack1(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 1;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __m512i reverseMask1u = _mm512_loadu_si512(reverseMaskTable1u);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          uint64_t src_64 = *reinterpret_cast<uint64_t*>(const_cast<uint8_t*>(srcPtr));
+          // convert mask to 512-bit register. 0 --> 0x00, 1 --> 0xFF
+          __m512i srcmm = _mm512_movm_epi8(src_64);
+          // make 0x00 --> 0x00, 0xFF --> 0x01
+          srcmm = _mm512_abs_epi8(srcmm);
+          srcmm = _mm512_shuffle_epi8(srcmm, reverseMask1u);
+          _mm512_storeu_si512(simdPtr, srcmm);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack2(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 2;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_MAX_16U;         // first 16 bytes (64 elements)
+        __m512i parse_mask = _mm512_set1_epi16(0x0303);  // 2 times 1 then (8 - 2) times 0
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm3 = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          __m512i srcmm0, srcmm1, srcmm2, tmpmm;
+
+          srcmm2 = _mm512_srli_epi16(srcmm3, 2);
+          srcmm1 = _mm512_srli_epi16(srcmm3, 4);
+          srcmm0 = _mm512_srli_epi16(srcmm3, 6);
+
+          // turn 2 bitWidth into 8 by zeroing 3 of each 4 elements.
+          // move them into their places
+          // srcmm0: a e i m 0 0 0 0 0 0 0 0 0 0 0 0
+          // srcmm1: b f j n 0 0 0 0 0 0 0 0 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm0, srcmm1);        // ab ef 00 00 00 00 00 00
+          srcmm0 = _mm512_unpackhi_epi8(srcmm0, srcmm1);       // ij mn 00 00 00 00 00 00
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x00);  // ab ef ab ef ij mn ij mn
+
+          // srcmm2: c g k o 0 0 0 0 0 0 0 0 0 0 0 0
+          // srcmm3: d h l p 0 0 0 0 0 0 0 0 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm2, srcmm3);        // cd gh 00 00 00 00 00 00
+          srcmm1 = _mm512_unpackhi_epi8(srcmm2, srcmm3);       // kl op 00 00 00 00 00 00
+          srcmm1 = _mm512_shuffle_i64x2(tmpmm, srcmm1, 0x00);  // cd gh cd gh kl op kl op
+
+          tmpmm = _mm512_unpacklo_epi16(srcmm0, srcmm1);        // abcd abcd ijkl ijkl
+          srcmm0 = _mm512_unpackhi_epi16(srcmm0, srcmm1);       // efgh efgh mnop mnop
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x88);   // abcd ijkl efgh mnop
+          srcmm0 = _mm512_shuffle_i64x2(srcmm0, srcmm0, 0xD8);  // abcd efgh ijkl mnop
+
+          srcmm0 = _mm512_and_si512(srcmm0, parse_mask);
+
+          _mm512_storeu_si512(simdPtr, srcmm0);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack3(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 3;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable3u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable3u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable3u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable3u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable3u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack4(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 4;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_MAX_32U;        // first 32 bytes (64 elements)
+        __m512i parseMask = _mm512_set1_epi16(0x0F0F);  // 4 times 1 then (8 - 4) times 0
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm0, srcmm1, tmpmm;
+
+          srcmm1 = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm0 = _mm512_srli_epi16(srcmm1, 4);
+
+          // move elements into their places
+          // srcmm0: a c e g 0 0 0 0
+          // srcmm1: b d f h 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm0, srcmm1);         // ab ef 00 00
+          srcmm0 = _mm512_unpackhi_epi8(srcmm0, srcmm1);        // cd gh 00 00
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x44);   // ab ef cd gh
+          srcmm0 = _mm512_shuffle_i64x2(srcmm0, srcmm0, 0xD8);  // ab cd ef gh
+
+          // turn 4 bitWidth into 8 by zeroing 4 of each 8 bits.
+          srcmm0 = _mm512_and_si512(srcmm0, parseMask);
+
+          _mm512_storeu_si512(simdPtr, srcmm0);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack5(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 5;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable5u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable5u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable5u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable5u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable5u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack6(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 6;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable6u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable6u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable6u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable6u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable6u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi32(permutexIdx, srcmm);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack7(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 7;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable7u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable7u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable7u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable7u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable7u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack9(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 9;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable9u_0);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable9u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable9u_1);
+
+        __m512i shiftMaskPtr[3];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable9u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable9u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable9u_2);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable9u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[2]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 7);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack10(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 10;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable10u_0);
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable10u);
+        __m512i shiftMask = _mm512_loadu_si512(shiftTable10u);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm;
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          zmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+          zmm = _mm512_shuffle_epi8(zmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm = _mm512_srlv_epi16(zmm, shiftMask);
+          zmm = _mm512_and_si512(zmm, parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack11(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 11;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverse_mask_16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable11u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable11u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable11u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable11u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable11u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable11u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable11u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable11u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable11u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4u);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 5);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverse_mask_16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack12(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 12;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable12u_0);
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable12u);
+        __m512i shiftMask = _mm512_loadu_si512(shiftTable12u);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm;
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          zmm = _mm512_permutexvar_epi32(permutexIdx, srcmm);
+          zmm = _mm512_shuffle_epi8(zmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm = _mm512_srlv_epi16(zmm, shiftMask);
+          zmm = _mm512_and_si512(zmm, parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack13(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 13;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverse_mask_16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable13u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable13u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable13u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable13u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable13u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable13u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable13u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable13u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable13u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 3);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverse_mask_16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack14(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 14;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable14u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable14u_1);
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable14u);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable14u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable14u_1);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack15(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 15;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable15u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable15u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable15u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable15u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable15u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable15u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable15u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable15u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable15u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 1);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack16(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 16;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = len;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    int64_t* dstPtr = data + offset;
+    bool resetBuf = false;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      bufMoveByteLen += moveByteLen(len * bitWidth);
+
+      if (bufMoveByteLen <= bufRestByteLen) {
+        numElements = len;
+        resetBuf = false;
+      } else {
+        numElements = bufRestByteLen * ORC_VECTOR_BYTE_WIDTH / bitWidth;
+        len -= numElements;
+        tailBitLen = fmod(bufRestByteLen * ORC_VECTOR_BYTE_WIDTH, bitWidth);
+        resetBuf = true;
+      }
+
+      if (tailBitLen != 0) {
+        backupByteLen = tailBitLen / ORC_VECTOR_BYTE_WIDTH;
+        tailBitLen = 0;
+      }
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __m512i reverse_mask_16u = _mm512_loadu_si512(reverseMaskTable16u);
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm = _mm512_loadu_si512(srcPtr);
+          srcmm = _mm512_shuffle_epi8(srcmm, reverse_mask_16u);
+          _mm512_storeu_si512(simdPtr, srcmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      if (numElements > 0) {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth);
+        unpackDefault.unrolledUnpack16(dstPtr, 0, numElements);
+        srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+        dstPtr += numElements;
+        bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+      }
+
+      if (bufMoveByteLen <= bufRestByteLen) {
+        decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufMoveByteLen,
+                                  resetBuf, backupByteLen);
+        return;
+      }
+
+      if (backupByteLen != 0) {
+        decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                  resetBuf, backupByteLen);
+        ;

Review Comment:
   Remove the extra `;`.
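
   For context on the `vectorUnpack16` hunk above, here is a scalar sketch of what the 16-bit fast path appears to compute. It assumes that `reverseMaskTable16u` swaps the two bytes of every 16-bit lane (the table is not shown in this hunk) and that the packed values are stored big-endian. `unpack16Scalar` is a hypothetical name used only for illustration, not a function in the PR.

   ```cpp
   #include <cstddef>
   #include <cstdint>
   #include <vector>

   // Hypothetical scalar reference: read `count` big-endian 16-bit values from
   // `src` and widen each one to int64_t, mirroring the byte swap followed by
   // the widening std::copy in the vector path.
   inline std::vector<int64_t> unpack16Scalar(const uint8_t* src, size_t count) {
     std::vector<int64_t> out(count);
     for (size_t i = 0; i < count; ++i) {
       // The high byte comes first in the stream (assumed big-endian layout).
       uint16_t value = static_cast<uint16_t>((src[2 * i] << 8) | src[2 * i + 1]);
       out[i] = static_cast<int64_t>(value);
     }
     return out;
   }
   ```

   A reference like this could be used in a unit test to compare against the AVX-512 output for the same input bytes.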



##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,2694 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"
+#include "CpuInfoUtil.hh"
+#include "RLEv2.hh"
+
+namespace orc {
+  UnpackAvx512::UnpackAvx512(RleDecoderV2* dec) : decoder(dec), unpackDefault(UnpackDefault(dec)) {
+    // PASS
+  }
+
+  UnpackAvx512::~UnpackAvx512() {
+    // PASS
+  }
+
+  inline void UnpackAvx512::alignHeaderBoundary(const uint32_t bitWidth, const uint32_t bitMaxSize,
+                                                uint64_t& startBit, uint64_t& bufMoveByteLen,
+                                                uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint64_t& tailBitLen, uint32_t& backupByteLen,
+                                                uint64_t& numElements, bool& resetBuf,
+                                                const uint8_t*& srcPtr, int64_t*& dstPtr) {
+    if (startBit != 0) {
+      bufMoveByteLen +=
+          moveByteLen(remainingNumElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+    } else {
+      bufMoveByteLen += moveByteLen(remainingNumElements * bitWidth);
+    }

Review Comment:
   This can also be simplified:
   ```cpp
       uint64_t numBits = remainingNumElements * bitWidth;
       if (startBit != 0) {
         numBits += startBit - ORC_VECTOR_BYTE_WIDTH;
       }
       bufMoveByteLen += moveByteLen(numBits);
   ```
   BTW, should we assign `bufMoveByteLen` directly instead of incrementing it?
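   
   As a quick sanity check on the simplification above, here is a minimal scalar sketch that puts the two shapes side by side. `kBitsPerByte` and `bytesForBits` are hypothetical stand-ins for `ORC_VECTOR_BYTE_WIDTH` and `moveByteLen`: it assumes the former is 8 and the latter rounds a bit count up to whole bytes, neither of which is shown in this hunk.
   
   ```cpp
   #include <cstdint>
   
   // Assumption: stands in for ORC_VECTOR_BYTE_WIDTH (8 bits per byte).
   constexpr uint64_t kBitsPerByte = 8;
   
   // Hypothetical stand-in for moveByteLen(): bytes needed to hold numBits, rounded up.
   inline uint64_t bytesForBits(uint64_t numBits) {
     return numBits / kBitsPerByte + (numBits % kBitsPerByte != 0 ? 1 : 0);
   }
   
   // Shape used in the patch: two branches, each calling the helper.
   inline uint64_t moveLenBranchy(uint64_t remaining, uint32_t bitWidth, uint64_t startBit) {
     if (startBit != 0) {
       return bytesForBits(remaining * bitWidth + startBit - kBitsPerByte);
     }
     return bytesForBits(remaining * bitWidth);
   }
   
   // Shape suggested in the review: adjust the bit count, then call the helper once.
   inline uint64_t moveLenSimplified(uint64_t remaining, uint32_t bitWidth, uint64_t startBit) {
     uint64_t numBits = remaining * bitWidth;
     if (startBit != 0) {
       numBits += startBit - kBitsPerByte;  // wraps, then wraps back on the addition
     }
     return bytesForBits(numBits);
   }
   ```
   
   Unsigned arithmetic is modular, so the intermediate wrap in `startBit - kBitsPerByte` is harmless: both functions return the same value for every input.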



##########
c++/src/RLEv2.hh:
##########
@@ -220,17 +221,36 @@ namespace orc {
 
     const std::unique_ptr<SeekableInputStream> inputStream;
     const bool isSigned;
-
     unsigned char firstByte;
-    uint64_t runLength;  // Length of the current run
-    uint64_t runRead;    // Number of returned values of the current run
-    const char* bufferStart;
-    const char* bufferEnd;
-    uint32_t bitsLeft;                  // Used by readLongs when bitSize < 8
-    uint32_t curByte;                   // Used by anything that uses readLongs
+    uint64_t runLength;                 // Length of the current run
+    uint64_t runRead;                   // Number of returned values of the current run
     DataBuffer<int64_t> unpackedPatch;  // Used by PATCHED_BASE
     DataBuffer<int64_t> literals;       // Values of the current run
   };
+
+  inline void RleDecoderV2::resetBufferStart(char** bufStart, char** bufEnd, uint64_t len,
+                                             bool resetBuf, uint32_t backupByteLen) {
+    uint64_t remainingLen = *bufEnd - *bufStart;
+    int bufferLength = 0;
+    const void* bufferPointer = nullptr;
+
+    if (backupByteLen != 0) {
+      inputStream->BackUp(backupByteLen);
+    }
+
+    if (len >= remainingLen && resetBuf == true) {

Review Comment:
   Replace `resetBuf == true` with `resetBuf`



##########
c++/src/BpackingAvx512.hh:
##########
@@ -0,0 +1,103 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKINGAVX512_HH
+#define ORC_BPACKINGAVX512_HH
+
+#include <cstdint>
+#include <cstdlib>
+
+#include "BpackingDefault.hh"
+
+namespace orc {
+
+#define VECTOR_UNPACK_8BIT_MAX_NUM 64
+#define VECTOR_UNPACK_16BIT_MAX_NUM 32
+#define VECTOR_UNPACK_32BIT_MAX_NUM 16
+#define UNPACK_8Bit_MAX_SIZE 8
+#define UNPACK_16Bit_MAX_SIZE 16
+#define UNPACK_32Bit_MAX_SIZE 32
+
+  class RleDecoderV2;
+
+  class UnpackAvx512 {
+   public:
+    UnpackAvx512(RleDecoderV2* dec);
+    ~UnpackAvx512();
+
+    void vectorUnpack1(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack2(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack3(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack4(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack5(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack6(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack7(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack9(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack10(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack11(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack12(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack13(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack14(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack15(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack16(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack17(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack18(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack19(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack20(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack21(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack22(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack23(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack24(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack26(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack28(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack30(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack32(int64_t* data, uint64_t offset, uint64_t len);
+
+    void plainUnpackLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs,
+                          uint64_t& startBit);
+
+    inline void alignHeaderBoundary(const uint32_t bitWidth, const uint32_t bitMaxSize,
+                                    uint64_t& startBit, uint64_t& bufMoveByteLen,
+                                    uint64_t& bufRestByteLen, uint64_t& remainingNumElements,
+                                    uint64_t& tailBitLen, uint32_t& backupByteLen,
+                                    uint64_t& numElements, bool& resetBuf, const uint8_t*& srcPtr,
+                                    int64_t*& dstPtr);
+
+    inline void alignTailerBoundary(const uint32_t bitWidth, uint64_t& startBit,
+                                    uint64_t& bufMoveByteLen, uint64_t& bufRestByteLen,
+                                    uint64_t& remainingNumElements, uint32_t& backupByteLen,
+                                    uint64_t& numElements, bool& resetBuf, const uint8_t*& srcPtr,
+                                    int64_t*& dstPtr);
+
+   private:
+    RleDecoderV2* decoder;
+    UnpackDefault unpackDefault;
+
+    // Used by vectorially bit-unpacking data

Review Comment:
   "vectorially" -> "vectorized" ?



##########
c++/src/RLEv2.hh:
##########
@@ -220,17 +221,36 @@ namespace orc {
 
     const std::unique_ptr<SeekableInputStream> inputStream;
     const bool isSigned;
-
     unsigned char firstByte;
-    uint64_t runLength;  // Length of the current run
-    uint64_t runRead;    // Number of returned values of the current run
-    const char* bufferStart;
-    const char* bufferEnd;
-    uint32_t bitsLeft;                  // Used by readLongs when bitSize < 8
-    uint32_t curByte;                   // Used by anything that uses readLongs
+    uint64_t runLength;                 // Length of the current run
+    uint64_t runRead;                   // Number of returned values of the current run
     DataBuffer<int64_t> unpackedPatch;  // Used by PATCHED_BASE
     DataBuffer<int64_t> literals;       // Values of the current run
   };
+
+  inline void RleDecoderV2::resetBufferStart(char** bufStart, char** bufEnd, uint64_t len,

Review Comment:
   All the call sites pass `&decoder->bufferStart, &decoder->bufferEnd` as the first two arguments. Since these are fields of the decoder itself, I think we can drop both parameters.
   
   The method signature can be simplified to
   ```cpp
   inline void RleDecoderV2::resetBufferStart(uint64_t moveByteLen, bool resetBuf, uint32_t backupByteLen)
   ```
   Each call site can then be simplified to
   ```cpp
   decoder->resetBufferStart(bufMoveByteLen, resetBuf, backupByteLen);
   ```
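   
   To illustrate the idea, here is a minimal sketch built around a hypothetical `ToyDecoder` (not the real `RleDecoderV2`). It models only the two buffer pointers and clamps the advance to the bytes left. It leaves out the `BackUp` call and whatever the `resetBuf` branch does, since this hunk does not show that part in full.
   
   ```cpp
   #include <cstdint>
   
   // Hypothetical miniature decoder used only to contrast the two signatures.
   class ToyDecoder {
    public:
     ToyDecoder(const char* start, const char* end) : bufferStart(start), bufferEnd(end) {}
   
     // Before: the caller hands the decoder's own fields back to it.
     void resetBufferStartOld(const char** bufStart, const char** bufEnd, uint64_t len) {
       uint64_t remainingLen = static_cast<uint64_t>(*bufEnd - *bufStart);
       *bufStart += (len < remainingLen) ? len : remainingLen;
     }
   
     // After: the method reads and updates its members directly.
     void resetBufferStart(uint64_t len) {
       uint64_t remainingLen = static_cast<uint64_t>(bufferEnd - bufferStart);
       bufferStart += (len < remainingLen) ? len : remainingLen;
     }
   
     // Bytes still available in the current buffer.
     uint64_t remaining() const {
       return static_cast<uint64_t>(bufferEnd - bufferStart);
     }
   
     const char* bufferStart;
     const char* bufferEnd;
   };
   ```
   
   With the shorter form a caller can no longer pass a mismatched pair of pointers by mistake, and the fields would not have to be reachable from `UnpackAvx512` just to make this call.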



##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,2694 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"
+#include "CpuInfoUtil.hh"
+#include "RLEv2.hh"
+
+namespace orc {
+  UnpackAvx512::UnpackAvx512(RleDecoderV2* dec) : decoder(dec), unpackDefault(UnpackDefault(dec)) {
+    // PASS
+  }
+
+  UnpackAvx512::~UnpackAvx512() {
+    // PASS
+  }
+
+  inline void UnpackAvx512::alignHeaderBoundary(const uint32_t bitWidth, const uint32_t bitMaxSize,
+                                                uint64_t& startBit, uint64_t& bufMoveByteLen,
+                                                uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint64_t& tailBitLen, uint32_t& backupByteLen,
+                                                uint64_t& numElements, bool& resetBuf,
+                                                const uint8_t*& srcPtr, int64_t*& dstPtr) {
+    if (startBit != 0) {
+      bufMoveByteLen +=
+          moveByteLen(remainingNumElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+    } else {
+      bufMoveByteLen += moveByteLen(remainingNumElements * bitWidth);
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      numElements = remainingNumElements;
+      resetBuf = false;
+      remainingNumElements = 0;
+    } else {
+      uint64_t leadingBits = 0;
+      if (startBit != 0) leadingBits = ORC_VECTOR_BYTE_WIDTH - startBit;
+      uint64_t bufRestBitLen = bufRestByteLen * ORC_VECTOR_BYTE_WIDTH + leadingBits;
+      numElements = bufRestBitLen / bitWidth;
+      remainingNumElements -= numElements;
+      tailBitLen = fmod(bufRestBitLen, bitWidth);
+      resetBuf = true;
+    }
+
+    if (tailBitLen != 0) {
+      backupByteLen = tailBitLen / ORC_VECTOR_BYTE_WIDTH;
+      tailBitLen = 0;
+    }
+
+    if (startBit > 0) {
+      uint32_t align = getAlign(startBit, bitWidth, bitMaxSize);
+      if (align > numElements) {
+        align = numElements;
+      }
+      if (align != 0) {
+        bufMoveByteLen -= moveByteLen(align * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+        plainUnpackLongs(dstPtr, 0, align, bitWidth, startBit);
+        srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+        bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+        dstPtr += align;
+        numElements -= align;
+      }
+    }
+  }
+
+  inline void UnpackAvx512::alignTailerBoundary(const uint32_t bitWidth, uint64_t& startBit,
+                                                uint64_t& bufMoveByteLen, uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint32_t& backupByteLen, uint64_t& numElements,
+                                                bool& resetBuf, const uint8_t*& srcPtr,
+                                                int64_t*& dstPtr) {
+    if (numElements > 0) {
+      if (startBit != 0) {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+      } else {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth);
+      }
+      plainUnpackLongs(dstPtr, 0, numElements, bitWidth, startBit);
+      srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+      dstPtr += numElements;
+      bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufMoveByteLen,
+                                resetBuf, backupByteLen);
+      return;
+    }
+
+    if (backupByteLen != 0) {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                resetBuf, backupByteLen);
+      plainUnpackLongs(dstPtr, 0, 1, bitWidth, startBit);
+      dstPtr++;
+      backupByteLen = 0;
+      remainingNumElements--;
+    } else {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                resetBuf, backupByteLen);
+    }
+
+    bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bufMoveByteLen = 0;
+    srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+  }
+
+  void UnpackAvx512::vectorUnpack1(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 1;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __m512i reverseMask1u = _mm512_loadu_si512(reverseMaskTable1u);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          uint64_t src_64 = *reinterpret_cast<uint64_t*>(const_cast<uint8_t*>(srcPtr));
+          // convert mask to 512-bit register. 0 --> 0x00, 1 --> 0xFF
+          __m512i srcmm = _mm512_movm_epi8(src_64);
+          // make 0x00 --> 0x00, 0xFF --> 0x01
+          srcmm = _mm512_abs_epi8(srcmm);
+          srcmm = _mm512_shuffle_epi8(srcmm, reverseMask1u);
+          _mm512_storeu_si512(simdPtr, srcmm);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack2(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 2;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_MAX_16U;         // first 16 bytes (64 elements)
+        __m512i parse_mask = _mm512_set1_epi16(0x0303);  // 2 times 1 then (8 - 2) times 0
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm3 = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          __m512i srcmm0, srcmm1, srcmm2, tmpmm;
+
+          srcmm2 = _mm512_srli_epi16(srcmm3, 2);
+          srcmm1 = _mm512_srli_epi16(srcmm3, 4);
+          srcmm0 = _mm512_srli_epi16(srcmm3, 6);
+
+          // turn 2 bitWidth into 8 by zeroing 3 of each 4 elements.
+          // move them into their places
+          // srcmm0: a e i m 0 0 0 0 0 0 0 0 0 0 0 0
+          // srcmm1: b f j n 0 0 0 0 0 0 0 0 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm0, srcmm1);        // ab ef 00 00 00 00 00 00
+          srcmm0 = _mm512_unpackhi_epi8(srcmm0, srcmm1);       // ij mn 00 00 00 00 00 00
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x00);  // ab ef ab ef ij mn ij mn
+
+          // srcmm2: c g k o 0 0 0 0 0 0 0 0 0 0 0 0
+          // srcmm3: d h l p 0 0 0 0 0 0 0 0 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm2, srcmm3);        // cd gh 00 00 00 00 00 00
+          srcmm1 = _mm512_unpackhi_epi8(srcmm2, srcmm3);       // kl op 00 00 00 00 00 00
+          srcmm1 = _mm512_shuffle_i64x2(tmpmm, srcmm1, 0x00);  // cd gh cd gh kl op kl op
+
+          tmpmm = _mm512_unpacklo_epi16(srcmm0, srcmm1);        // abcd abcd ijkl ijkl
+          srcmm0 = _mm512_unpackhi_epi16(srcmm0, srcmm1);       // efgh efgh mnop mnop
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x88);   // abcd ijkl efgh mnop
+          srcmm0 = _mm512_shuffle_i64x2(srcmm0, srcmm0, 0xD8);  // abcd efgh ijkl mnop
+
+          srcmm0 = _mm512_and_si512(srcmm0, parse_mask);
+
+          _mm512_storeu_si512(simdPtr, srcmm0);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack3(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 3;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable3u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable3u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable3u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable3u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable3u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack4(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 4;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_MAX_32U;        // first 32 bytes (64 elements)
+        __m512i parseMask = _mm512_set1_epi16(0x0F0F);  // 4 times 1 then (8 - 4) times 0
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm0, srcmm1, tmpmm;
+
+          srcmm1 = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm0 = _mm512_srli_epi16(srcmm1, 4);
+
+          // move elements into their places
+          // srcmm0: a c e g 0 0 0 0
+          // srcmm1: b d f h 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm0, srcmm1);         // ab ef 00 00
+          srcmm0 = _mm512_unpackhi_epi8(srcmm0, srcmm1);        // cd gh 00 00
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x44);   // ab ef cd gh
+          srcmm0 = _mm512_shuffle_i64x2(srcmm0, srcmm0, 0xD8);  // ab cd ef gh
+
+          // turn 4 bitWidth into 8 by zeroing 4 of each 8 bits.
+          srcmm0 = _mm512_and_si512(srcmm0, parseMask);
+
+          _mm512_storeu_si512(simdPtr, srcmm0);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack5(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 5;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable5u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable5u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable5u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable5u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable5u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack6(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 6;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable6u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable6u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable6u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable6u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable6u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi32(permutexIdx, srcmm);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack7(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 7;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable7u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable7u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable7u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable7u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable7u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack9(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 9;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable9u_0);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable9u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable9u_1);
+
+        __m512i shiftMaskPtr[3];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable9u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable9u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable9u_2);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable9u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[2]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 7);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack10(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 10;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable10u_0);
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable10u);
+        __m512i shiftMask = _mm512_loadu_si512(shiftTable10u);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm;
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          zmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+          zmm = _mm512_shuffle_epi8(zmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm = _mm512_srlv_epi16(zmm, shiftMask);
+          zmm = _mm512_and_si512(zmm, parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack11(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 11;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverse_mask_16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable11u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable11u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable11u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable11u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable11u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable11u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable11u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable11u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable11u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4u);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 5);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverse_mask_16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack12(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 12;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable12u_0);
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable12u);
+        __m512i shiftMask = _mm512_loadu_si512(shiftTable12u);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm;
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          zmm = _mm512_permutexvar_epi32(permutexIdx, srcmm);
+          zmm = _mm512_shuffle_epi8(zmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm = _mm512_srlv_epi16(zmm, shiftMask);
+          zmm = _mm512_and_si512(zmm, parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack13(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 13;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverse_mask_16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable13u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable13u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable13u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable13u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable13u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable13u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable13u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable13u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable13u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 3);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverse_mask_16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack14(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 14;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable14u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable14u_1);
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable14u);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable14u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable14u_1);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack15(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 15;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable15u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable15u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable15u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable15u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable15u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable15u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable15u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable15u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable15u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 1);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack16(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 16;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = len;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    int64_t* dstPtr = data + offset;
+    bool resetBuf = false;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      bufMoveByteLen += moveByteLen(len * bitWidth);
+
+      if (bufMoveByteLen <= bufRestByteLen) {
+        numElements = len;
+        resetBuf = false;

Review Comment:
   Shouldn't we set `len` to 0 here?
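
To make the question concrete, here is a minimal sketch of the bookkeeping pattern, written as a standalone model and not as ORC code. The function `drainInChunks` and its parameters `remaining` and `bufCapacity` are hypothetical names: `remaining` stands in for `len`, and `bufCapacity` for the number of elements the current buffer can supply. The rest of `vectorUnpack16` is not visible in this hunk, so later code may already update `len`; the sketch only shows why the "everything fits" branch needs to zero the counter somewhere for `while (len > 0)` to terminate, as `alignHeaderBoundary` does with `remainingNumElements = 0`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the loop bookkeeping, not ORC code.
// `remaining` stands in for `len`; `bufCapacity` is the number of elements
// one buffer can supply. Returns how many loop iterations were needed.
uint64_t drainInChunks(uint64_t remaining, uint64_t bufCapacity) {
  assert(bufCapacity > 0);  // a zero-capacity buffer could never make progress
  const uint64_t total = remaining;
  uint64_t consumed = 0;
  uint64_t iterations = 0;
  while (remaining > 0) {
    uint64_t numElements = 0;
    if (remaining <= bufCapacity) {
      // Everything left fits in the current buffer: take it all and
      // zero the counter so the loop condition becomes false.
      numElements = remaining;
      remaining = 0;
    } else {
      // Only part fits: take what the buffer holds and keep the rest
      // for the next iteration (after a refill in a real decoder).
      numElements = bufCapacity;
      remaining -= numElements;
    }
    consumed += numElements;  // a real decoder would unpack numElements values here
    ++iterations;
  }
  assert(consumed == total);
  return iterations;
}
```

If the `remaining = 0` assignment is dropped from the first branch and nothing later in the loop body decrements the counter, the loop in this model never exits.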



##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,2694 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"
+#include "CpuInfoUtil.hh"
+#include "RLEv2.hh"
+
+namespace orc {
+  UnpackAvx512::UnpackAvx512(RleDecoderV2* dec) : decoder(dec), unpackDefault(UnpackDefault(dec)) {
+    // PASS
+  }
+
+  UnpackAvx512::~UnpackAvx512() {
+    // PASS
+  }
+
+  inline void UnpackAvx512::alignHeaderBoundary(const uint32_t bitWidth, const uint32_t bitMaxSize,
+                                                uint64_t& startBit, uint64_t& bufMoveByteLen,
+                                                uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint64_t& tailBitLen, uint32_t& backupByteLen,
+                                                uint64_t& numElements, bool& resetBuf,
+                                                const uint8_t*& srcPtr, int64_t*& dstPtr) {
+    if (startBit != 0) {
+      bufMoveByteLen +=
+          moveByteLen(remainingNumElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+    } else {
+      bufMoveByteLen += moveByteLen(remainingNumElements * bitWidth);
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      numElements = remainingNumElements;
+      resetBuf = false;
+      remainingNumElements = 0;
+    } else {
+      uint64_t leadingBits = 0;
+      if (startBit != 0) leadingBits = ORC_VECTOR_BYTE_WIDTH - startBit;
+      uint64_t bufRestBitLen = bufRestByteLen * ORC_VECTOR_BYTE_WIDTH + leadingBits;
+      numElements = bufRestBitLen / bitWidth;
+      remainingNumElements -= numElements;
+      tailBitLen = bufRestBitLen % bitWidth;
+      resetBuf = true;
+    }
+
+    if (tailBitLen != 0) {
+      backupByteLen = tailBitLen / ORC_VECTOR_BYTE_WIDTH;
+      tailBitLen = 0;
+    }
+
+    if (startBit > 0) {
+      uint32_t align = getAlign(startBit, bitWidth, bitMaxSize);
+      if (align > numElements) {
+        align = numElements;
+      }
+      if (align != 0) {
+        bufMoveByteLen -= moveByteLen(align * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+        plainUnpackLongs(dstPtr, 0, align, bitWidth, startBit);
+        srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+        bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+        dstPtr += align;
+        numElements -= align;
+      }
+    }
+  }
+
+  inline void UnpackAvx512::alignTailerBoundary(const uint32_t bitWidth, uint64_t& startBit,
+                                                uint64_t& bufMoveByteLen, uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint32_t& backupByteLen, uint64_t& numElements,
+                                                bool& resetBuf, const uint8_t*& srcPtr,
+                                                int64_t*& dstPtr) {
+    if (numElements > 0) {
+      if (startBit != 0) {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+      } else {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth);
+      }
+      plainUnpackLongs(dstPtr, 0, numElements, bitWidth, startBit);
+      srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+      dstPtr += numElements;
+      bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufMoveByteLen,
+                                resetBuf, backupByteLen);
+      return;
+    }
+
+    if (backupByteLen != 0) {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                resetBuf, backupByteLen);
+      plainUnpackLongs(dstPtr, 0, 1, bitWidth, startBit);
+      dstPtr++;
+      backupByteLen = 0;
+      remainingNumElements--;
+    } else {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                resetBuf, backupByteLen);
+    }
+
+    bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bufMoveByteLen = 0;
+    srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+  }
+
+  void UnpackAvx512::vectorUnpack1(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 1;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __m512i reverseMask1u = _mm512_loadu_si512(reverseMaskTable1u);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          uint64_t src_64 = *reinterpret_cast<uint64_t*>(const_cast<uint8_t*>(srcPtr));
+          // convert mask to 512-bit register. 0 --> 0x00, 1 --> 0xFF
+          __m512i srcmm = _mm512_movm_epi8(src_64);
+          // make 0x00 --> 0x00, 0xFF --> 0x01
+          srcmm = _mm512_abs_epi8(srcmm);
+          srcmm = _mm512_shuffle_epi8(srcmm, reverseMask1u);
+          _mm512_storeu_si512(simdPtr, srcmm);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack2(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 2;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_MAX_16U;         // first 16 bytes (64 elements)
+        __m512i parse_mask = _mm512_set1_epi16(0x0303);  // 2 times 1 then (8 - 2) times 0
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm3 = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          __m512i srcmm0, srcmm1, srcmm2, tmpmm;
+
+          srcmm2 = _mm512_srli_epi16(srcmm3, 2);
+          srcmm1 = _mm512_srli_epi16(srcmm3, 4);
+          srcmm0 = _mm512_srli_epi16(srcmm3, 6);
+
+          // turn 2 bitWidth into 8 by zeroing 3 of each 4 elements.
+          // move them into their places
+          // srcmm0: a e i m 0 0 0 0 0 0 0 0 0 0 0 0
+          // srcmm1: b f j n 0 0 0 0 0 0 0 0 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm0, srcmm1);        // ab ef 00 00 00 00 00 00
+          srcmm0 = _mm512_unpackhi_epi8(srcmm0, srcmm1);       // ij mn 00 00 00 00 00 00
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x00);  // ab ef ab ef ij mn ij mn
+
+          // srcmm2: c g k o 0 0 0 0 0 0 0 0 0 0 0 0
+          // srcmm3: d h l p 0 0 0 0 0 0 0 0 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm2, srcmm3);        // cd gh 00 00 00 00 00 00
+          srcmm1 = _mm512_unpackhi_epi8(srcmm2, srcmm3);       // kl op 00 00 00 00 00 00
+          srcmm1 = _mm512_shuffle_i64x2(tmpmm, srcmm1, 0x00);  // cd gh cd gh kl op kl op
+
+          tmpmm = _mm512_unpacklo_epi16(srcmm0, srcmm1);        // abcd abcd ijkl ijkl
+          srcmm0 = _mm512_unpackhi_epi16(srcmm0, srcmm1);       // efgh efgh mnop mnop
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x88);   // abcd ijkl efgh mnop
+          srcmm0 = _mm512_shuffle_i64x2(srcmm0, srcmm0, 0xD8);  // abcd efgh ijkl mnop
+
+          srcmm0 = _mm512_and_si512(srcmm0, parse_mask);
+
+          _mm512_storeu_si512(simdPtr, srcmm0);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack3(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 3;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable3u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable3u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable3u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable3u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable3u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffle so that zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack4(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 4;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_MAX_32U;        // first 32 bytes (64 elements)
+        __m512i parseMask = _mm512_set1_epi16(0x0F0F);  // 4 times 1 then (8 - 4) times 0
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm0, srcmm1, tmpmm;
+
+          srcmm1 = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm0 = _mm512_srli_epi16(srcmm1, 4);
+
+          // move elements into their places
+          // srcmm0: a c e g 0 0 0 0
+          // srcmm1: b d f h 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm0, srcmm1);         // ab ef 00 00
+          srcmm0 = _mm512_unpackhi_epi8(srcmm0, srcmm1);        // cd gh 00 00
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x44);   // ab ef cd gh
+          srcmm0 = _mm512_shuffle_i64x2(srcmm0, srcmm0, 0xD8);  // ab cd ef gh
+
+          // turn 4 bitWidth into 8 by zeroing 4 of each 8 bits.
+          srcmm0 = _mm512_and_si512(srcmm0, parseMask);
+
+          _mm512_storeu_si512(simdPtr, srcmm0);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack5(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 5;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable5u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable5u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable5u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable5u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable5u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffle so that zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack6(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 6;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable6u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable6u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable6u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable6u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable6u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi32(permutexIdx, srcmm);
+
+          // shuffle so that zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack7(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 7;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable7u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable7u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable7u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable7u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable7u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffle so that zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack9(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 9;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable9u_0);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable9u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable9u_1);
+
+        __m512i shiftMaskPtr[3];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable9u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable9u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable9u_2);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable9u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[2]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permute so that zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 7);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack10(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 10;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable10u_0);
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable10u);
+        __m512i shiftMask = _mm512_loadu_si512(shiftTable10u);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm;
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          zmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+          zmm = _mm512_shuffle_epi8(zmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm = _mm512_srlv_epi16(zmm, shiftMask);
+          zmm = _mm512_and_si512(zmm, parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack11(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 11;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverse_mask_16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable11u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable11u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable11u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable11u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable11u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable11u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable11u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable11u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable11u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4u);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 5);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverse_mask_16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack12(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 12;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable12u_0);
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable12u);
+        __m512i shiftMask = _mm512_loadu_si512(shiftTable12u);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm;
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          zmm = _mm512_permutexvar_epi32(permutexIdx, srcmm);
+          zmm = _mm512_shuffle_epi8(zmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm = _mm512_srlv_epi16(zmm, shiftMask);
+          zmm = _mm512_and_si512(zmm, parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack13(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 13;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverse_mask_16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable13u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable13u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable13u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable13u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable13u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable13u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable13u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable13u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable13u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 3);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverse_mask_16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack14(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 14;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable14u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable14u_1);
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable14u);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable14u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable14u_1);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack15(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 15;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable15u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable15u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable15u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable15u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable15u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable15u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable15u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable15u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable15u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 1);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack16(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 16;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = len;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    int64_t* dstPtr = data + offset;
+    bool resetBuf = false;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      bufMoveByteLen += moveByteLen(len * bitWidth);
+
+      if (bufMoveByteLen <= bufRestByteLen) {
+        numElements = len;
+        resetBuf = false;
+      } else {
+        numElements = bufRestByteLen * ORC_VECTOR_BYTE_WIDTH / bitWidth;
+        len -= numElements;
+        tailBitLen = fmod(bufRestByteLen * ORC_VECTOR_BYTE_WIDTH, bitWidth);
+        resetBuf = true;
+      }
+
+      if (tailBitLen != 0) {
+        backupByteLen = tailBitLen / ORC_VECTOR_BYTE_WIDTH;
+        tailBitLen = 0;
+      }
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __m512i reverse_mask_16u = _mm512_loadu_si512(reverseMaskTable16u);
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm = _mm512_loadu_si512(srcPtr);
+          srcmm = _mm512_shuffle_epi8(srcmm, reverse_mask_16u);
+          _mm512_storeu_si512(simdPtr, srcmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      if (numElements > 0) {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth);
+        unpackDefault.unrolledUnpack16(dstPtr, 0, numElements);
+        srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+        dstPtr += numElements;
+        bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+      }
+
+      if (bufMoveByteLen <= bufRestByteLen) {
+        decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufMoveByteLen,
+                                  resetBuf, backupByteLen);
+        return;

Review Comment:
   Is it correct to return here? It seems this code was copied from `alignTailerBoundary()`.
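
A reduced sketch of the concern, using hypothetical names (`State`, `tailHelper`, `decodeWithHelper`, `decodeInlined`) rather than the ORC types. Inside `alignTailerBoundary()` the `return` only leaves the helper, and the caller's `while (len > 0)` loop then ends because `alignHeaderBoundary()` has already set `remainingNumElements` to 0 on that path. In `vectorUnpack16` the same statement is inlined, so it leaves the whole function. The quoted hunk does not zero `len` in the `bufMoveByteLen <= bufRestByteLen` branch, which suggests the loop may rely on this `return` to terminate. That reading is an assumption drawn from the quoted lines only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the decoder state; these are not ORC types.
struct State {
  uint64_t remaining;   // elements still to decode (plays the role of `len`)
  uint64_t iterations;  // number of times the loop body ran
};

// Helper variant: `return` leaves only the helper. The caller's loop ends
// because `remaining` was zeroed before returning.
inline void tailHelper(State& s) {
  if (s.remaining <= 8) {
    s.remaining = 0;
    return;  // control goes back to the caller's while loop
  }
  s.remaining -= 8;
}

inline uint64_t decodeWithHelper(uint64_t len) {
  State s{len, 0};
  while (s.remaining > 0) {
    ++s.iterations;
    tailHelper(s);
  }
  return s.iterations;
}

// Inlined variant: the same `return` now leaves the whole function, so the
// loop terminates even though `remaining` is never zeroed on this path.
inline uint64_t decodeInlined(uint64_t len) {
  State s{len, 0};
  while (s.remaining > 0) {
    ++s.iterations;
    if (s.remaining <= 8) {
      return s.iterations;  // exits the function, not just one iteration
    }
    s.remaining -= 8;
  }
  return s.iterations;
}
```

With these toy definitions both variants run the loop body the same number of times, so the early return is not necessarily wrong. It does make termination depend on a detail that the helper-based functions do not share, which may be worth a comment in the code.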



##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,2694 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"
+#include "CpuInfoUtil.hh"
+#include "RLEv2.hh"
+
+namespace orc {
+  UnpackAvx512::UnpackAvx512(RleDecoderV2* dec) : decoder(dec), unpackDefault(UnpackDefault(dec)) {
+    // PASS
+  }
+
+  UnpackAvx512::~UnpackAvx512() {
+    // PASS
+  }
+
+  inline void UnpackAvx512::alignHeaderBoundary(const uint32_t bitWidth, const uint32_t bitMaxSize,
+                                                uint64_t& startBit, uint64_t& bufMoveByteLen,
+                                                uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint64_t& tailBitLen, uint32_t& backupByteLen,
+                                                uint64_t& numElements, bool& resetBuf,
+                                                const uint8_t*& srcPtr, int64_t*& dstPtr) {
+    if (startBit != 0) {
+      bufMoveByteLen +=
+          moveByteLen(remainingNumElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+    } else {
+      bufMoveByteLen += moveByteLen(remainingNumElements * bitWidth);
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      numElements = remainingNumElements;
+      resetBuf = false;
+      remainingNumElements = 0;
+    } else {
+      uint64_t leadingBits = 0;
+      if (startBit != 0) leadingBits = ORC_VECTOR_BYTE_WIDTH - startBit;
+      uint64_t bufRestBitLen = bufRestByteLen * ORC_VECTOR_BYTE_WIDTH + leadingBits;
+      numElements = bufRestBitLen / bitWidth;
+      remainingNumElements -= numElements;
+      tailBitLen = fmod(bufRestBitLen, bitWidth);
+      resetBuf = true;
+    }
+
+    if (tailBitLen != 0) {
+      backupByteLen = tailBitLen / ORC_VECTOR_BYTE_WIDTH;
+      tailBitLen = 0;
+    }
+
+    if (startBit > 0) {
+      uint32_t align = getAlign(startBit, bitWidth, bitMaxSize);
+      if (align > numElements) {
+        align = numElements;
+      }
+      if (align != 0) {
+        bufMoveByteLen -= moveByteLen(align * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+        plainUnpackLongs(dstPtr, 0, align, bitWidth, startBit);
+        srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+        bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+        dstPtr += align;
+        numElements -= align;
+      }
+    }
+  }
+
+  inline void UnpackAvx512::alignTailerBoundary(const uint32_t bitWidth, uint64_t& startBit,
+                                                uint64_t& bufMoveByteLen, uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint32_t& backupByteLen, uint64_t& numElements,
+                                                bool& resetBuf, const uint8_t*& srcPtr,
+                                                int64_t*& dstPtr) {
+    if (numElements > 0) {
+      if (startBit != 0) {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+      } else {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth);
+      }
+      plainUnpackLongs(dstPtr, 0, numElements, bitWidth, startBit);
+      srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+      dstPtr += numElements;
+      bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufMoveByteLen,
+                                resetBuf, backupByteLen);
+      return;
+    }
+
+    if (backupByteLen != 0) {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                resetBuf, backupByteLen);
+      plainUnpackLongs(dstPtr, 0, 1, bitWidth, startBit);
+      dstPtr++;
+      backupByteLen = 0;
+      remainingNumElements--;
+    } else {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                resetBuf, backupByteLen);
+    }
+
+    bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bufMoveByteLen = 0;
+    srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+  }
+
+  void UnpackAvx512::vectorUnpack1(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 1;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __m512i reverseMask1u = _mm512_loadu_si512(reverseMaskTable1u);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          uint64_t src64 = *reinterpret_cast<const uint64_t*>(srcPtr);
+          // expand each of the 64 mask bits into one byte: 0 --> 0x00, 1 --> 0xFF
+          __m512i srcmm = _mm512_movm_epi8(src64);
+          // map 0x00 --> 0x00 and 0xFF --> 0x01 (absolute value of -1)
+          srcmm = _mm512_abs_epi8(srcmm);
+          srcmm = _mm512_shuffle_epi8(srcmm, reverseMask1u);
+          _mm512_storeu_si512(simdPtr, srcmm);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack2(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 2;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_MAX_16U;         // first 16 bytes (64 elements)
+        __m512i parseMask = _mm512_set1_epi16(0x0303);  // low 2 bits of each byte set
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm3 = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          __m512i srcmm0, srcmm1, srcmm2, tmpmm;
+
+          srcmm2 = _mm512_srli_epi16(srcmm3, 2);
+          srcmm1 = _mm512_srli_epi16(srcmm3, 4);
+          srcmm0 = _mm512_srli_epi16(srcmm3, 6);
+
+          // widen 2-bit elements to 8 bits: each shifted copy supplies 1 of every 4 elements;
+          // interleave the copies so the elements land in order
+          // srcmm0: a e i m 0 0 0 0 0 0 0 0 0 0 0 0
+          // srcmm1: b f j n 0 0 0 0 0 0 0 0 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm0, srcmm1);        // ab ef 00 00 00 00 00 00
+          srcmm0 = _mm512_unpackhi_epi8(srcmm0, srcmm1);       // ij mn 00 00 00 00 00 00
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x00);  // ab ef ab ef ij mn ij mn
+
+          // srcmm2: c g k o 0 0 0 0 0 0 0 0 0 0 0 0
+          // srcmm3: d h l p 0 0 0 0 0 0 0 0 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm2, srcmm3);        // cd gh 00 00 00 00 00 00
+          srcmm1 = _mm512_unpackhi_epi8(srcmm2, srcmm3);       // kl op 00 00 00 00 00 00
+          srcmm1 = _mm512_shuffle_i64x2(tmpmm, srcmm1, 0x00);  // cd gh cd gh kl op kl op
+
+          tmpmm = _mm512_unpacklo_epi16(srcmm0, srcmm1);        // abcd abcd ijkl ijkl
+          srcmm0 = _mm512_unpackhi_epi16(srcmm0, srcmm1);       // efgh efgh mnop mnop
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x88);   // abcd ijkl efgh mnop
+          srcmm0 = _mm512_shuffle_i64x2(srcmm0, srcmm0, 0xD8);  // abcd efgh ijkl mnop
+
+          srcmm0 = _mm512_and_si512(srcmm0, parseMask);
+
+          _mm512_storeu_si512(simdPtr, srcmm0);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack3(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 3;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable3u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable3u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable3u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable3u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable3u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffle even-indexed elements into zmm[0] and odd-indexed ones into zmm[1]
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shift each element so it starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // merge: even bytes from zmm[0], odd bytes from zmm[1]
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack4(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 4;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_MAX_32U;        // first 32 bytes (64 elements)
+        __m512i parseMask = _mm512_set1_epi16(0x0F0F);  // low 4 bits of each byte set
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm0, srcmm1, tmpmm;
+
+          srcmm1 = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm0 = _mm512_srli_epi16(srcmm1, 4);
+
+          // move elements into their places
+          // srcmm0: a c e g 0 0 0 0
+          // srcmm1: b d f h 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm0, srcmm1);         // ab ef 00 00
+          srcmm0 = _mm512_unpackhi_epi8(srcmm0, srcmm1);        // cd gh 00 00
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x44);   // ab ef cd gh
+          srcmm0 = _mm512_shuffle_i64x2(srcmm0, srcmm0, 0xD8);  // ab cd ef gh
+
+          // widen 4-bit elements to 8 bits by clearing the upper 4 bits of each byte
+          srcmm0 = _mm512_and_si512(srcmm0, parseMask);
+
+          _mm512_storeu_si512(simdPtr, srcmm0);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack5(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 5;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable5u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable5u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable5u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable5u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable5u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffle even-indexed elements into zmm[0] and odd-indexed ones into zmm[1]
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shift each element so it starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // merge: even bytes from zmm[0], odd bytes from zmm[1]
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack6(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 6;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable6u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable6u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable6u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable6u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable6u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi32(permutexIdx, srcmm);
+
+          // shuffle even-indexed elements into zmm[0] and odd-indexed ones into zmm[1]
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shift each element so it starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // merge: even bytes from zmm[0], odd bytes from zmm[1]
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack7(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 7;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable7u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable7u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable7u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable7u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable7u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffle even-indexed elements into zmm[0] and odd-indexed ones into zmm[1]
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shift each element so it starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // merge: even bytes from zmm[0], odd bytes from zmm[1]
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack9(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 9;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable9u_0);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable9u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable9u_1);
+
+        __m512i shiftMaskPtr[3];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable9u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable9u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable9u_2);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable9u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr);
+
+          // shift each element down to the start of its word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[2]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permute even-indexed elements into zmm[0] and odd-indexed ones into zmm[1]
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shift each element so it starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // merge: even words from zmm[0], odd words from zmm[1]
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 7);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack10(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 10;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable10u_0);
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable10u);
+        __m512i shiftMask = _mm512_loadu_si512(shiftTable10u);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm;
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          zmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+          zmm = _mm512_shuffle_epi8(zmm, shuffleIdxPtr);
+
+          // shift each element down to the start of its word
+          zmm = _mm512_srlv_epi16(zmm, shiftMask);
+          zmm = _mm512_and_si512(zmm, parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack11(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 11;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverse_mask_16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable11u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable11u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable11u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable11u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable11u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable11u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable11u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable11u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable11u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffle even-indexed elements into zmm[0] and odd-indexed ones into zmm[1]
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shift each element so it starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // merge: even words from zmm[0], odd words from zmm[1]
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permute so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shift each element so it is aligned to the start of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 5);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverse_mask_16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack12(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 12;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable12u_0);
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable12u);
+        __m512i shiftMask = _mm512_loadu_si512(shiftTable12u);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm;
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          zmm = _mm512_permutexvar_epi32(permutexIdx, srcmm);
+          zmm = _mm512_shuffle_epi8(zmm, shuffleIdxPtr);
+
+          // shift each element so it is aligned to the start of its word
+          zmm = _mm512_srlv_epi16(zmm, shiftMask);
+          zmm = _mm512_and_si512(zmm, parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack13(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 13;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverse_mask_16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable13u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable13u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable13u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable13u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable13u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable13u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable13u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable13u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable13u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffle so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shift each element so it is aligned to the start of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permute so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shift each element so it is aligned to the start of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 3);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverse_mask_16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack14(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 14;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable14u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable14u_1);
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable14u);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable14u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable14u_1);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffle so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shift each element so it is aligned to the start of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack15(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 15;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable15u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable15u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable15u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable15u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable15u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable15u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable15u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable15u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable15u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffle so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shift each element so it is aligned to the start of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permute so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shift each element so it is aligned to the start of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 1);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack16(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 16;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = len;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    int64_t* dstPtr = data + offset;
+    bool resetBuf = false;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      bufMoveByteLen += moveByteLen(len * bitWidth);
+
+      if (bufMoveByteLen <= bufRestByteLen) {
+        numElements = len;
+        resetBuf = false;
+      } else {
+        numElements = bufRestByteLen * ORC_VECTOR_BYTE_WIDTH / bitWidth;
+        len -= numElements;
+        tailBitLen = (bufRestByteLen * ORC_VECTOR_BYTE_WIDTH) % bitWidth;
+        resetBuf = true;
+      }
+
+      if (tailBitLen != 0) {
+        backupByteLen = tailBitLen / ORC_VECTOR_BYTE_WIDTH;
+        tailBitLen = 0;
+      }
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __m512i reverseMask16u = _mm512_loadu_si512(reverseMaskTable16u);
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm = _mm512_loadu_si512(srcPtr);
+          srcmm = _mm512_shuffle_epi8(srcmm, reverseMask16u);
+          _mm512_storeu_si512(simdPtr, srcmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      if (numElements > 0) {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth);
+        unpackDefault.unrolledUnpack16(dstPtr, 0, numElements);
+        srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+        dstPtr += numElements;
+        bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+      }
+
+      if (bufMoveByteLen <= bufRestByteLen) {
+        decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufMoveByteLen,
+                                  resetBuf, backupByteLen);
+        return;
+      }
+
+      if (backupByteLen != 0) {
+        decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                  resetBuf, backupByteLen);
+        unpackDefault.unrolledUnpack16(dstPtr, 0, 1);
+        dstPtr++;
+        backupByteLen = 0;
+        len--;
+      } else {
+        decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                  resetBuf, backupByteLen);
+      }
+
+      bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+      bufMoveByteLen = 0;
+      srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack17(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 17;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_32Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(bitWidth);
+        __m512i parseMask0 = _mm512_set1_epi32(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask32u = _mm512_loadu_si512(reverseMaskTable32u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable17u_0);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable17u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable17u_1);
+
+        __m512i shiftMaskPtr[3];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable17u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable17u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable17u_2);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable17u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr);
+
+          // shift each element so it is aligned to the start of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+
+        if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permute so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_permutexvar_epi32(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi32(permutexIdxPtr[1], srcmm);
+
+          // shift each element so it is aligned to the start of its word
+          zmm[0] = _mm512_srlv_epi64(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi64(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi32(zmm[0], 0xAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi32(zmm[0], 15);
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask32u);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack18(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 18;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_32Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+        __mmask16 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_DWORD(bitWidth * 16));
+        __m512i parseMask0 = _mm512_set1_epi32(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask32u = _mm512_loadu_si512(reverseMaskTable32u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable18u_0);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable18u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable18u_1);
+
+        __m512i shiftMaskPtr[3];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable18u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable18u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable18u_2);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable18u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr);
+
+          // shift each element so it is aligned to the start of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+
+        if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi32(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permute so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_permutexvar_epi32(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi32(permutexIdxPtr[1], srcmm);
+
+          // shift each element so it is aligned to the start of its word
+          zmm[0] = _mm512_srlv_epi64(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi64(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi32(zmm[0], 0xAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi32(zmm[0], 14);
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask32u);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack19(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 19;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_32Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(bitWidth);
+        __m512i parseMask0 = _mm512_set1_epi32(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask32u = _mm512_loadu_si512(reverseMaskTable32u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable19u_0);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable19u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable19u_1);
+
+        __m512i shiftMaskPtr[3];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable19u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable19u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable19u_2);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable19u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+
+        if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi32(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi32(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi64(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi64(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi32(zmm[0], 0xAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi32(zmm[0], 13);
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask32u);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack20(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 20;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_32Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+        __mmask16 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_DWORD(bitWidth * 16));
+        __m512i parseMask0 = _mm512_set1_epi32(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable20u_0);
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable20u);
+        __m512i shiftMask = _mm512_loadu_si512(shiftTable20u);
+
+        while (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm;
+
+          srcmm = _mm512_maskz_loadu_epi32(readMask, srcPtr);
+
+          zmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+          zmm = _mm512_shuffle_epi8(zmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm = _mm512_srlv_epi32(zmm, shiftMask);
+          zmm = _mm512_and_si512(zmm, parseMask0);
+
+          _mm512_storeu_si512(vectorBuf, zmm);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack21(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 21;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_32Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(bitWidth);
+        __m512i parseMask0 = _mm512_set1_epi32(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask32u = _mm512_loadu_si512(reverseMaskTable32u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable21u_0);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable21u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable21u_1);
+
+        __m512i shiftMaskPtr[3];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable21u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable21u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable21u_2);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable21u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+
+        if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi32(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi32(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi64(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi64(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi32(zmm[0], 0xAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi32(zmm[0], 11);
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask32u);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack22(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 22;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_32Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+        __mmask16 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_DWORD(bitWidth * 16));
+        __m512i parseMask0 = _mm512_set1_epi32(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask32u = _mm512_loadu_si512(reverseMaskTable32u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable22u_0);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable22u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable22u_1);
+
+        __m512i shiftMaskPtr[3];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable22u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable22u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable22u_2);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable22u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+
+        if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi32(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi32(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi32(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi64(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi64(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi32(zmm[0], 0xAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi32(zmm[0], 10);
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask32u);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack23(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 23;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_32Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(bitWidth);
+        __m512i parseMask0 = _mm512_set1_epi32(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask32u = _mm512_loadu_si512(reverseMaskTable32u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable23u_0);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable23u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable23u_1);
+
+        __m512i shiftMaskPtr[3];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable23u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable23u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable23u_2);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable23u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+
+        if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi32(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi32(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi64(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi64(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi32(zmm[0], 0xAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi32(zmm[0], 9);
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask32u);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack24(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 24;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      bufMoveByteLen += moveByteLen(len * bitWidth);
+
+      if (bufMoveByteLen <= bufRestByteLen) {
+        numElements = len;
+        resetBuf = false;

Review Comment:
   Same question here. Should we set `len` to 0? I think this code is the same as the code in vectorUnpack16(). It is a simplification of `alignHeaderBoundary()`. To reduce duplicated code, we can make `alignHeaderBoundary()` and `alignTailerBoundary()` template functions with a template argument `bool hasBitOffset`. E.g.
   
   ```cpp
     template<bool hasBitOffset>
     inline void UnpackAvx512::alignHeaderBoundary(const uint32_t bitWidth, const uint32_t bitMaxSize,
                                                   uint64_t& startBit, uint64_t& bufMoveByteLen,
                                                   uint64_t& bufRestByteLen,
                                                   uint64_t& remainingNumElements,
                                                   uint64_t& tailBitLen, uint32_t& backupByteLen,
                                                   uint64_t& numElements, bool& resetBuf,
                                                   const uint8_t*& srcPtr, int64_t*& dstPtr) {
       if (hasBitOffset && startBit != 0) {
         bufMoveByteLen +=
             moveByteLen(remainingNumElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
       } else {
         bufMoveByteLen += moveByteLen(remainingNumElements * bitWidth);
       }
       ...
   ```
   At call sites, use `alignHeaderBoundary<true>()` or `alignHeaderBoundary<false>()`. The compiler will remove the dead code when `hasBitOffset` is `false`.
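
The bool-template suggestion above can be illustrated in isolation. The following is only a minimal sketch, not code from the PR: `bytesForBits` and `bytesNeeded` are hypothetical names, `bytesForBits` is an assumed stand-in for `moveByteLen()` (rounding bits up to whole bytes), and the constant 8 stands in for `ORC_VECTOR_BYTE_WIDTH`. It shows how one template covers both the bit-offset and the byte-aligned case, with the offset branch folded away at compile time when the argument is `false`.

```cpp
#include <cstdint>

// Hypothetical stand-in for the PR's moveByteLen(): rounds a bit count up to whole bytes.
inline uint64_t bytesForBits(uint64_t numBits) {
  return (numBits + 7) / 8;
}

// Sketch of the bool-template idea: how many source bytes the next run needs.
// HasBitOffset is a compile-time constant, so for HasBitOffset == false the
// condition below is always false and the compiler can drop the first branch.
template <bool HasBitOffset>
inline uint64_t bytesNeeded(uint64_t numElements, uint32_t bitWidth, uint64_t startBit) {
  if (HasBitOffset && startBit != 0) {
    // Mirrors the expression in the review comment, with 8 assumed for
    // ORC_VECTOR_BYTE_WIDTH. Callers must ensure the sum is at least 8,
    // otherwise the unsigned subtraction wraps around.
    return bytesForBits(numElements * bitWidth + startBit - 8);
  }
  return bytesForBits(numElements * bitWidth);
}
```

A byte-aligned width such as 24 would call `bytesNeeded<false>(len, 24, 0)`, while a width such as 19 would call `bytesNeeded<true>(len, 19, startBit)`. Both share one definition, which is the duplication the comment is trying to remove.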



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] stiga-huang commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "stiga-huang (via GitHub)" <gi...@apache.org>.

stiga-huang commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1147227057


##########
c++/src/RLEv2.hh:
##########
@@ -189,23 +203,10 @@ namespace orc {
       resetReadLongs();
     }
 
-    unsigned char readByte();
-
     int64_t readLongBE(uint64_t bsz);
     int64_t readVslong();
     uint64_t readVulong();
-    void readLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs);
-    void plainUnpackLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs);
-
-    void unrolledUnpack4(int64_t* data, uint64_t offset, uint64_t len);
-    void unrolledUnpack8(int64_t* data, uint64_t offset, uint64_t len);
-    void unrolledUnpack16(int64_t* data, uint64_t offset, uint64_t len);
-    void unrolledUnpack24(int64_t* data, uint64_t offset, uint64_t len);
-    void unrolledUnpack32(int64_t* data, uint64_t offset, uint64_t len);
-    void unrolledUnpack40(int64_t* data, uint64_t offset, uint64_t len);
-    void unrolledUnpack48(int64_t* data, uint64_t offset, uint64_t len);
-    void unrolledUnpack56(int64_t* data, uint64_t offset, uint64_t len);
-    void unrolledUnpack64(int64_t* data, uint64_t offset, uint64_t len);
+    int readLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs);

Review Comment:
   If we do need to change the return type from `void` to `int`, could you leave a comment explaining what the return value means?




[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1149843312


##########
.github/workflows/build_and_test.yml:
##########
@@ -91,6 +91,49 @@ jobs:
         cmake --build . --config Debug
         ctest -C Debug --output-on-failure
 
+  simdUbuntu:
+    name: "SIMD programming using C++ intrinsic functions on ${{ matrix.os }}"
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os:
+          - ubuntu-22.04
+        cxx:
+          - clang++
+    env:
+      ORC_USER_SIMD_LEVEL: AVX512
+    steps:
+    - name: Checkout
+      uses: actions/checkout@v2
+    - name: "Test"
+      run: |
+        mkdir -p ~/.m2
+        mkdir build
+        cd build
+        cmake -DBUILD_JAVA=OFF -DBUILD_ENABLE_AVX512=ON ..
+        make package test-out
+
+  simdWindows:

Review Comment:
   If the difference is only two lines, shall we merge this into the existing `windows` GitHub Action job?
   ```
   env:
     ORC_USER_SIMD_LEVEL: AVX512
   ```
   ```
   -DBUILD_ENABLE_AVX512=ON
   ```




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1150087753


##########
.github/workflows/build_and_test.yml:
##########
@@ -91,6 +91,49 @@ jobs:
         cmake --build . --config Debug
         ctest -C Debug --output-on-failure
 
+  simdUbuntu:
+    name: "SIMD programming using C++ intrinsic functions on ${{ matrix.os }}"
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os:
+          - ubuntu-22.04
+        cxx:
+          - clang++
+    env:
+      ORC_USER_SIMD_LEVEL: AVX512
+    steps:
+    - name: Checkout
+      uses: actions/checkout@v2
+    - name: "Test"
+      run: |
+        mkdir -p ~/.m2
+        mkdir build
+        cd build
+        cmake -DBUILD_JAVA=OFF -DBUILD_ENABLE_AVX512=ON ..
+        make package test-out
+
+  simdWindows:

Review Comment:
   Hi @dongjoon-hyun @wgtmac, I have already integrated the SIMD CI test into the original Windows CI test.
   https://github.com/wpleonardo/orc/blob/f6b28da5b5fa4eab3917513a7309c0f6a5d96af2/.github/workflows/build_and_test.yml#L76
   Currently, if the SIMD CI test runs on a machine that doesn't support AVX512, it falls back to the default path without AVX512, so the CI test cases should always pass.
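
The fallback behaviour described in this reply can be sketched as a small runtime check. This is a hedged illustration, not the PR's dispatch code: the function names are hypothetical, the treatment of an unset `ORC_USER_SIMD_LEVEL` as "do not use AVX512" is an assumption, and only the AVX-512 Foundation bit is tested (intrinsics such as `_mm512_shuffle_epi8` also need AVX512BW, so a real check would cover more feature bits). `__builtin_cpu_supports` is a GCC/Clang builtin for x86; other toolchains simply report no support here.

```cpp
#include <cstdlib>
#include <cstring>

// Assumption: true when the CPU reports AVX-512 Foundation support.
// Non-x86 targets and other compilers conservatively report "unsupported".
inline bool cpuSupportsAvx512() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
  return __builtin_cpu_supports("avx512f") != 0;
#else
  return false;
#endif
}

// Pure policy function so the decision can be tested without AVX-512 hardware:
// the vector path is taken only if the user asked for it AND the CPU has it.
inline bool shouldUseAvx512(const char* userLevel, bool cpuHasAvx512) {
  const bool requested = userLevel != nullptr && std::strcmp(userLevel, "AVX512") == 0;
  return requested && cpuHasAvx512;
}

// Combines the environment variable set in the CI job with the hardware check.
inline bool useAvx512Path() {
  return shouldUseAvx512(std::getenv("ORC_USER_SIMD_LEVEL"), cpuSupportsAvx512());
}
```

On a CI runner without AVX512, `useAvx512Path()` returns `false` even when `ORC_USER_SIMD_LEVEL=AVX512` is exported, so the tests exercise the default decode path and still pass. The consequence is that a green run on such a runner does not show the vector path was executed.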




[GitHub] [orc] wgtmac commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1133091124


##########
.github/workflows/build_and_test.yml:
##########
@@ -91,6 +91,50 @@ jobs:
         cmake --build . --config Debug
         ctest -C Debug --output-on-failure
 
+  simdUbuntu:
+    name: "SIMD programming using C++ intrinsic functions on ${{ matrix.os }}"
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os:
+          - ubuntu-20.04

Review Comment:
   I think `ubuntu-22.04` is enough, as we do not have access to AVX512 on either platform. That way we can save some resources.



##########
cmake_modules/ConfigSimdLevel.cmake:
##########
@@ -0,0 +1,105 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+INCLUDE(CheckCXXCompilerFlag)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_SIMD_LEVEL)
+  set(ORC_SIMD_LEVEL
+      "DEFAULT"
+      CACHE STRING "Compile time SIMD optimization level")
+endif()
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(STATUS "Unknown system processor")

Review Comment:
   ```suggestion
     else()
       message(STATUS "Unsupported system processor for SIMD optimization")
   ```



##########
cmake_modules/ConfigSimdLevel.cmake:
##########
@@ -0,0 +1,105 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+INCLUDE(CheckCXXCompilerFlag)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_SIMD_LEVEL)
+  set(ORC_SIMD_LEVEL
+      "DEFAULT"
+      CACHE STRING "Compile time SIMD optimization level")
+endif()
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(STATUS "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+  else()
+    # "arch=native" selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine.
+    # Using -march=native enables all instruction subsets supported by the local machine.
+    # Using -mtune=native produces code optimized for the local machine under the constraints of the selected instruction set.
+    set(ORC_AVX512_FLAG "-march=native -mtune=native")
+  endif()
+
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")
+    CHECK_CXX_SOURCE_COMPILES("
+      #ifdef _MSC_VER
+      #include <intrin.h>
+      #else
+      #include <immintrin.h>
+      #endif
+
+      int main() {
+        __m512i mask = _mm512_set1_epi32(0x1);
+        char out[32];
+        _mm512_storeu_si512(out, mask);
+        return 0;
+      }"
+      CXX_SUPPORTS_AVX512)
+    set(CMAKE_REQUIRED_FLAGS ${OLD_CMAKE_REQURED_FLAGS})
+  endif()
+
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512 AND NOT MSVC)
+    execute_process(COMMAND grep flags /proc/cpuinfo
+                    COMMAND head -1
+                    OUTPUT_VARIABLE flags_ver)
+    message(STATUS "CPU ${flags_ver}")
+  endif()
+
+  message(STATUS "BUILD_ENABLE_AVX512: ${BUILD_ENABLE_AVX512}")
+  # Runtime SIMD level it can get from compiler
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512)
+    message(STATUS "Enable the AVX512 vector decode of bit-packing, compiler support AVX512")
+    set(ORC_HAVE_RUNTIME_AVX512 ON)
+    set(ORC_SIMD_LEVEL "AVX512")
+    add_definitions(-DORC_HAVE_RUNTIME_AVX512)
+  elseif(BUILD_ENABLE_AVX512 AND NOT CXX_SUPPORTS_AVX512)
+    message(FATAL_ERROR "AVX512 required but compiler doesn't support it, failed to enable AVX512.")
+    set(ORC_HAVE_RUNTIME_AVX512 OFF)
+  elseif(NOT BUILD_ENABLE_AVX512)

Review Comment:
   ```suggestion
     else()
   ```
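
   A plain `else()` is safe here because the three conditions are exhaustive: whenever the first two are false, `NOT BUILD_ENABLE_AVX512` is necessarily true. A small C++ mirror of the chain (the function names are made up for illustration) checks this over all four combinations:

   ```cpp
   // Mirrors the if/elseif/elseif chain; returns which branch runs, or 0 if none.
   int branchTaken(bool buildEnableAvx512, bool cxxSupportsAvx512) {
     if (buildEnableAvx512 && cxxSupportsAvx512) {
       return 1;  // AVX512 enabled
     } else if (buildEnableAvx512 && !cxxSupportsAvx512) {
       return 2;  // requested, but the compiler cannot do it
     } else if (!buildEnableAvx512) {
       return 3;  // not requested
     }
     return 0;  // would mean the chain has a gap
   }

   // True when every combination lands in some branch, i.e. the last
   // condition behaves exactly like a plain else.
   bool lastBranchIsCatchAll() {
     for (int b = 0; b < 2; ++b) {
       for (int c = 0; c < 2; ++c) {
         if (branchTaken(b != 0, c != 0) == 0) {
           return false;
         }
       }
     }
     return true;
   }
   ```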



##########
cmake_modules/ConfigSimdLevel.cmake:
##########
@@ -0,0 +1,105 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+INCLUDE(CheckCXXCompilerFlag)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_SIMD_LEVEL)
+  set(ORC_SIMD_LEVEL
+      "DEFAULT"
+      CACHE STRING "Compile time SIMD optimization level")
+endif()
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(STATUS "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+  else()
+    # "arch=native" selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine.
+    # Using -march=native enables all instruction subsets supported by the local machine.
+    # Using -mtune=native produces code optimized for the local machine under the constraints of the selected instruction set.
+    set(ORC_AVX512_FLAG "-march=native -mtune=native")
+  endif()
+
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")
+    CHECK_CXX_SOURCE_COMPILES("
+      #ifdef _MSC_VER
+      #include <intrin.h>
+      #else
+      #include <immintrin.h>
+      #endif
+
+      int main() {
+        __m512i mask = _mm512_set1_epi32(0x1);
+        char out[32];
+        _mm512_storeu_si512(out, mask);
+        return 0;
+      }"
+      CXX_SUPPORTS_AVX512)
+    set(CMAKE_REQUIRED_FLAGS ${OLD_CMAKE_REQURED_FLAGS})
+  endif()
+
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512 AND NOT MSVC)
+    execute_process(COMMAND grep flags /proc/cpuinfo
+                    COMMAND head -1
+                    OUTPUT_VARIABLE flags_ver)
+    message(STATUS "CPU ${flags_ver}")
+  endif()
+
+  message(STATUS "BUILD_ENABLE_AVX512: ${BUILD_ENABLE_AVX512}")
+  # Runtime SIMD level it can get from compiler
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512)
+    message(STATUS "Enable the AVX512 vector decode of bit-packing, compiler support AVX512")
+    set(ORC_HAVE_RUNTIME_AVX512 ON)
+    set(ORC_SIMD_LEVEL "AVX512")
+    add_definitions(-DORC_HAVE_RUNTIME_AVX512)
+  elseif(BUILD_ENABLE_AVX512 AND NOT CXX_SUPPORTS_AVX512)
+    message(FATAL_ERROR "AVX512 required but compiler doesn't support it, failed to enable AVX512.")
+    set(ORC_HAVE_RUNTIME_AVX512 OFF)
+  elseif(NOT BUILD_ENABLE_AVX512)
+    set(ORC_HAVE_RUNTIME_AVX512 OFF)
+    message(STATUS "Disable the AVX512 vector decode of bit-packing")
+  endif()
+  if(ORC_SIMD_LEVEL STREQUAL "DEFAULT")
+    set(ORC_SIMD_LEVEL "NONE")
+  endif()
+
+  # Enable additional instruction sets if they are supported
+  if(MINGW)

Review Comment:
   It seems that `MINGW` will not be supported according to line 46. Should we remove it here?



##########
c++/src/CpuInfoUtil.cc:
##########
@@ -0,0 +1,581 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "CpuInfoUtil.hh"
+
+#ifdef __APPLE__
+#include <sys/sysctl.h>
+#endif
+
+#ifndef _MSC_VER
+#include <unistd.h>
+#endif
+
+#ifdef _WIN32
+#define NOMINMAX
+#include <Windows.h>
+#include <intrin.h>
+#endif
+
+#include <algorithm>
+#include <array>
+#include <bitset>
+#include <cctype>
+#include <cerrno>
+#include <cstdint>
+#include <fstream>
+#include <memory>
+#include <optional>
+#include <sstream>
+#include <string>
+#include <thread>
+#include <vector>

Review Comment:
   Can we trim some unused headers?



##########
c++/src/CpuInfoUtil.hh:
##########
@@ -0,0 +1,109 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_CPUINFOUTIL_HH
+#define ORC_CPUINFOUTIL_HH
+
+#include <cstdint>
+#include <memory>
+#include <string>
+
+namespace orc {
+
+  /**
+   * CpuInfo is an interface to query for cpu information at runtime.  The caller can
+   * ask for the sizes of the caches and what hardware features are supported.
+   * On Linux, this information is pulled from a couple of sys files (/proc/cpuinfo and
+   * /sys/devices)
+   */
+  class CpuInfo {

Review Comment:
   This file comes mostly from https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/cpu_info.h and it contains many unused code paths. Should we tailor it to match our needs, or add a comment explicitly saying that we are borrowing the code from there? I am not sure about the risk here @dongjoon-hyun @williamhyun
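
   To make the "tailor it" option concrete, here is a rough sketch of how small the interface could be if it only had to answer the AVX512 question. `MiniCpuInfo`, its flag value and its public constructor are hypothetical and exist only for this illustration; the detection relies on the GCC/Clang builtin `__builtin_cpu_supports` and conservatively reports "not supported" on other compilers or architectures.

   ```cpp
   #include <cstdint>

   // Hypothetical trimmed-down CPU query class: only what AVX-512 dispatch needs.
   class MiniCpuInfo {
    public:
     static constexpr std::int64_t AVX512F = 1;  // illustrative flag value

     // Process-wide instance; detection runs once on first use.
     static const MiniCpuInfo& instance() {
       static const MiniCpuInfo info(detect());
       return info;
     }

     // True when all requested feature bits are present.
     bool isSupported(std::int64_t flags) const {
       return (features_ & flags) == flags;
     }

     // Public in this sketch so the flag logic can be exercised without hardware.
     explicit MiniCpuInfo(std::int64_t features) : features_(features) {}

    private:
     static std::int64_t detect() {
   #if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
       // GCC and Clang builtin; other toolchains fall through to "no features".
       return __builtin_cpu_supports("avx512f") ? AVX512F : 0;
   #else
       return 0;
   #endif
     }

     std::int64_t features_;
   };
   ```

   Cache sizes, vendor strings and the non-x86 branches would all drop out under this approach; whether that is preferable to keeping the borrowed file intact with an attribution comment is the open question above.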



##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,4318 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"

Review Comment:
   ```suggestion
   #include "BitUnpackerAvx512.hh"
   #include "BpackingAvx512.hh"
   ```



##########
cmake_modules/ConfigSimdLevel.cmake:
##########
@@ -0,0 +1,105 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+INCLUDE(CheckCXXCompilerFlag)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_SIMD_LEVEL)
+  set(ORC_SIMD_LEVEL
+      "DEFAULT"
+      CACHE STRING "Compile time SIMD optimization level")
+endif()
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(STATUS "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+  else()
+    # "arch=native" selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine.
+    # Using -march=native enables all instruction subsets supported by the local machine.
+    # Using -mtune=native produces code optimized for the local machine under the constraints of the selected instruction set.
+    set(ORC_AVX512_FLAG "-march=native -mtune=native")
+  endif()
+
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")
+    CHECK_CXX_SOURCE_COMPILES("
+      #ifdef _MSC_VER
+      #include <intrin.h>
+      #else
+      #include <immintrin.h>
+      #endif
+
+      int main() {
+        __m512i mask = _mm512_set1_epi32(0x1);
+        char out[32];
+        _mm512_storeu_si512(out, mask);
+        return 0;
+      }"
+      CXX_SUPPORTS_AVX512)
+    set(CMAKE_REQUIRED_FLAGS ${OLD_CMAKE_REQURED_FLAGS})
+  endif()
+
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512 AND NOT MSVC)
+    execute_process(COMMAND grep flags /proc/cpuinfo
+                    COMMAND head -1
+                    OUTPUT_VARIABLE flags_ver)
+    message(STATUS "CPU ${flags_ver}")
+  endif()
+
+  message(STATUS "BUILD_ENABLE_AVX512: ${BUILD_ENABLE_AVX512}")
+  # Runtime SIMD level it can get from compiler
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512)
+    message(STATUS "Enable the AVX512 vector decode of bit-packing, compiler support AVX512")
+    set(ORC_HAVE_RUNTIME_AVX512 ON)
+    set(ORC_SIMD_LEVEL "AVX512")
+    add_definitions(-DORC_HAVE_RUNTIME_AVX512)
+  elseif(BUILD_ENABLE_AVX512 AND NOT CXX_SUPPORTS_AVX512)
+    message(FATAL_ERROR "AVX512 required but compiler doesn't support it, failed to enable AVX512.")
+    set(ORC_HAVE_RUNTIME_AVX512 OFF)
+  elseif(NOT BUILD_ENABLE_AVX512)

Review Comment:
   Will this ever be hit? If not, we can remove it.



##########
CMakeLists.txt:
##########
@@ -169,6 +173,9 @@ enable_testing()
 
 INCLUDE(CheckSourceCompiles)
 INCLUDE(ThirdpartyToolchain)
+if (BUILD_ENABLE_AVX512 AND NOT APPLE)

Review Comment:
   Could you add a comment explaining why `APPLE` is not supported here?



##########
cmake_modules/ConfigSimdLevel.cmake:
##########
@@ -0,0 +1,105 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+INCLUDE(CheckCXXCompilerFlag)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_SIMD_LEVEL)
+  set(ORC_SIMD_LEVEL
+      "DEFAULT"
+      CACHE STRING "Compile time SIMD optimization level")
+endif()
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(STATUS "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+  else()
+    # "arch=native" selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine.
+    # Using -march=native enables all instruction subsets supported by the local machine.
+    # Using -mtune=native produces code optimized for the local machine under the constraints of the selected instruction set.
+    set(ORC_AVX512_FLAG "-march=native -mtune=native")
+  endif()
+
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")
+    CHECK_CXX_SOURCE_COMPILES("
+      #ifdef _MSC_VER
+      #include <intrin.h>
+      #else
+      #include <immintrin.h>
+      #endif
+
+      int main() {
+        __m512i mask = _mm512_set1_epi32(0x1);
+        char out[32];
+        _mm512_storeu_si512(out, mask);
+        return 0;
+      }"
+      CXX_SUPPORTS_AVX512)
+    set(CMAKE_REQUIRED_FLAGS ${OLD_CMAKE_REQURED_FLAGS})
+  endif()
+
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512 AND NOT MSVC)
+    execute_process(COMMAND grep flags /proc/cpuinfo
+                    COMMAND head -1
+                    OUTPUT_VARIABLE flags_ver)
+    message(STATUS "CPU ${flags_ver}")
+  endif()
+
+  message(STATUS "BUILD_ENABLE_AVX512: ${BUILD_ENABLE_AVX512}")
+  # Runtime SIMD level it can get from compiler
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512)
+    message(STATUS "Enable the AVX512 vector decode of bit-packing, compiler support AVX512")

Review Comment:
   ```suggestion
       message(STATUS "Enabled the AVX512 for RLE bit-unpacking")
   ```



##########
cmake_modules/ConfigSimdLevel.cmake:
##########
@@ -0,0 +1,105 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+INCLUDE(CheckCXXCompilerFlag)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_SIMD_LEVEL)
+  set(ORC_SIMD_LEVEL
+      "DEFAULT"
+      CACHE STRING "Compile time SIMD optimization level")
+endif()
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(STATUS "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+  else()
+    # "arch=native" selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine.
+    # Using -march=native enables all instruction subsets supported by the local machine.
+    # Using -mtune=native produces code optimized for the local machine under the constraints of the selected instruction set.
+    set(ORC_AVX512_FLAG "-march=native -mtune=native")
+  endif()
+
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")
+    CHECK_CXX_SOURCE_COMPILES("
+      #ifdef _MSC_VER
+      #include <intrin.h>
+      #else
+      #include <immintrin.h>
+      #endif
+
+      int main() {
+        __m512i mask = _mm512_set1_epi32(0x1);
+        char out[32];
+        _mm512_storeu_si512(out, mask);
+        return 0;
+      }"
+      CXX_SUPPORTS_AVX512)
+    set(CMAKE_REQUIRED_FLAGS ${OLD_CMAKE_REQURED_FLAGS})
+  endif()
+
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512 AND NOT MSVC)

Review Comment:
   What will happen if we use clang on Windows?



##########
cmake_modules/ConfigSimdLevel.cmake:
##########
@@ -0,0 +1,105 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+INCLUDE(CheckCXXCompilerFlag)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_SIMD_LEVEL)
+  set(ORC_SIMD_LEVEL
+      "DEFAULT"
+      CACHE STRING "Compile time SIMD optimization level")
+endif()
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(STATUS "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+  else()
+    # "arch=native" selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine.
+    # Using -march=native enables all instruction subsets supported by the local machine.
+    # Using -mtune=native produces code optimized for the local machine under the constraints of the selected instruction set.
+    set(ORC_AVX512_FLAG "-march=native -mtune=native")
+  endif()
+
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")
+    CHECK_CXX_SOURCE_COMPILES("
+      #ifdef _MSC_VER
+      #include <intrin.h>
+      #else
+      #include <immintrin.h>
+      #endif
+
+      int main() {
+        __m512i mask = _mm512_set1_epi32(0x1);
+        char out[32];
+        _mm512_storeu_si512(out, mask);
+        return 0;
+      }"
+      CXX_SUPPORTS_AVX512)
+    set(CMAKE_REQUIRED_FLAGS ${OLD_CMAKE_REQURED_FLAGS})
+  endif()
+
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512 AND NOT MSVC)
+    execute_process(COMMAND grep flags /proc/cpuinfo
+                    COMMAND head -1
+                    OUTPUT_VARIABLE flags_ver)
+    message(STATUS "CPU ${flags_ver}")
+  endif()
+
+  message(STATUS "BUILD_ENABLE_AVX512: ${BUILD_ENABLE_AVX512}")
+  # Runtime SIMD level it can get from compiler
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512)
+    message(STATUS "Enable the AVX512 vector decode of bit-packing, compiler support AVX512")
+    set(ORC_HAVE_RUNTIME_AVX512 ON)
+    set(ORC_SIMD_LEVEL "AVX512")
+    add_definitions(-DORC_HAVE_RUNTIME_AVX512)
+  elseif(BUILD_ENABLE_AVX512 AND NOT CXX_SUPPORTS_AVX512)
+    message(FATAL_ERROR "AVX512 required but compiler doesn't support it, failed to enable AVX512.")
+    set(ORC_HAVE_RUNTIME_AVX512 OFF)
+  elseif(NOT BUILD_ENABLE_AVX512)
+    set(ORC_HAVE_RUNTIME_AVX512 OFF)
+    message(STATUS "Disable the AVX512 vector decode of bit-packing")
+  endif()
+  if(ORC_SIMD_LEVEL STREQUAL "DEFAULT")
+    set(ORC_SIMD_LEVEL "NONE")
+  endif()
+
+  # Enable additional instruction sets if they are supported
+  if(MINGW)
+    # Enable _xgetbv() intrinsic to query OS support for ZMM register saves
+    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -mxsave")
+  endif()
+  if(ORC_SIMD_LEVEL STREQUAL "AVX512")
+    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${ORC_AVX512_FLAG}")
+  elseif(NOT ORC_SIMD_LEVEL STREQUAL "NONE")
+    message(WARNING "ORC_SIMD_LEVEL=${ORC_SIMD_LEVEL} not supported by x86.")

Review Comment:
   Do we actually need this `elseif()`, or can it be removed?



##########
c++/src/CMakeLists.txt:
##########
@@ -184,7 +184,11 @@ set(SOURCE_FILES
   Timezone.cc
   TypeImpl.cc
   Vector.cc
-  Writer.cc)
+  Writer.cc
+  CpuInfoUtil.cc
+  BpackingDefault.cc
+  BpackingAvx512.cc
+  Bpacking.cc)

Review Comment:
   Can we append these files to `SOURCE_FILES` only when `BUILD_ENABLE_AVX512` is true?
   Then we probably do not need to add many macros to disable compilation in the source files.



##########
c++/src/Bpacking.hh:
##########
@@ -0,0 +1,37 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKING_HH
+#define ORC_BPACKING_HH
+
+#include <stdint.h>

Review Comment:
   Can we use `cstdint`?



##########
c++/src/RleDecoderV2.cc:
##########
@@ -17,26 +17,31 @@
  */
 
 #include "Adaptor.hh"
+#include "Bpacking.hh"
 #include "Compression.hh"
+#include "Dispatch.hh"
 #include "RLEV2Util.hh"
 #include "RLEv2.hh"
 #include "Utils.hh"
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+#include "BpackingAvx512.hh"
+#endif

Review Comment:
   Why is this header required here?



##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,4318 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#if defined(ORC_HAVE_RUNTIME_AVX512)

Review Comment:
   This macro is not required if we append `BpackingAvx512.cc` to `SOURCE_FILES` in `CMakeLists.txt` only when AVX512 is enabled.



##########
c++/src/CpuInfoUtil.cc:
##########
@@ -0,0 +1,581 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "CpuInfoUtil.hh"
+
+#ifdef __APPLE__
+#include <sys/sysctl.h>
+#endif
+
+#ifndef _MSC_VER
+#include <unistd.h>
+#endif
+
+#ifdef _WIN32
+#define NOMINMAX
+#include <Windows.h>
+#include <intrin.h>
+#endif
+
+#include <algorithm>
+#include <array>
+#include <bitset>
+#include <cctype>
+#include <cerrno>
+#include <cstdint>
+#include <fstream>
+#include <memory>
+#include <optional>
+#include <sstream>
+#include <string>
+#include <thread>
+#include <vector>
+
+#include "orc/Exceptions.hh"
+
+#undef CPUINFO_ARCH_X86
+#undef CPUINFO_ARCH_ARM
+#undef CPUINFO_ARCH_PPC
+
+#if defined(__i386) || defined(_M_IX86) || defined(__x86_64__) || defined(_M_X64)
+#define CPUINFO_ARCH_X86
+#elif defined(_M_ARM64) || defined(__aarch64__) || defined(__arm64__)
+#define CPUINFO_ARCH_ARM
+#elif defined(__PPC64__) || defined(__PPC64LE__) || defined(__ppc64__) || defined(__powerpc64__)
+#define CPUINFO_ARCH_PPC
+#endif
+
+#ifndef ORC_HAVE_RUNTIME_AVX512
+#define UNUSED(x) (void)(x)
+#endif
+
+namespace orc {
+
+  namespace {
+
+    constexpr int kCacheLevels = static_cast<int>(CpuInfo::CacheLevel::Last) + 1;
+
+    //============================== OS Dependent ==============================//
+
+#if defined(_WIN32)
+    //------------------------------ WINDOWS ------------------------------//
+    void OsRetrieveCacheSize(std::array<int64_t, kCacheLevels>* cache_sizes) {
+      PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer = nullptr;
+      PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer_position = nullptr;
+      DWORD buffer_size = 0;
+      size_t offset = 0;
+      typedef BOOL(WINAPI * GetLogicalProcessorInformationFuncPointer)(void*, void*);
+      GetLogicalProcessorInformationFuncPointer func_pointer =
+          (GetLogicalProcessorInformationFuncPointer)GetProcAddress(
+              GetModuleHandle("kernel32"), "GetLogicalProcessorInformation");
+
+      if (!func_pointer) {
+        throw ParseError("Failed to find procedure GetLogicalProcessorInformation");
+      }
+
+      // Get buffer size
+      if (func_pointer(buffer, &buffer_size) && GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
+        throw ParseError("Failed to get size of processor information buffer");
+      }
+
+      buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)malloc(buffer_size);
+      if (!buffer) {
+        return;
+      }
+
+      if (!func_pointer(buffer, &buffer_size)) {
+        free(buffer);
+        throw ParseError("Failed to get processor information");
+      }
+
+      buffer_position = buffer;
+      while (offset + sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION) <= buffer_size) {
+        if (RelationCache == buffer_position->Relationship) {
+          PCACHE_DESCRIPTOR cache = &buffer_position->Cache;
+          if (cache->Level >= 1 && cache->Level <= kCacheLevels) {
+            const int64_t current = (*cache_sizes)[cache->Level - 1];
+            (*cache_sizes)[cache->Level - 1] = std::max<int64_t>(current, cache->Size);
+          }
+        }
+        offset += sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
+        buffer_position++;
+      }
+
+      free(buffer);
+    }
+
+#if defined(CPUINFO_ARCH_X86)
+    // On x86, get CPU features by cpuid, https://en.wikipedia.org/wiki/CPUID
+
+#if defined(__MINGW64_VERSION_MAJOR) && __MINGW64_VERSION_MAJOR < 5
+    void __cpuidex(int CPUInfo[4], int function_id, int subfunction_id) {
+      __asm__ __volatile__("cpuid"
+                           : "=a"(CPUInfo[0]), "=b"(CPUInfo[1]), "=c"(CPUInfo[2]), "=d"(CPUInfo[3])
+                           : "a"(function_id), "c"(subfunction_id));
+    }
+
+    int64_t _xgetbv(int xcr) {
+      int out = 0;
+      __asm__ __volatile__("xgetbv" : "=a"(out) : "c"(xcr) : "%edx");
+      return out;
+    }
+#endif  // MINGW
+
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      int register_EAX_id = 1;
+      int highest_valid_id = 0;
+      int highest_extended_valid_id = 0;
+      std::bitset<32> features_ECX;
+      std::array<int, 4> cpu_info;
+
+      // Get highest valid id
+      __cpuid(cpu_info.data(), 0);
+      highest_valid_id = cpu_info[0];
+      // HEX of "GenuineIntel": 47656E75 696E6549 6E74656C
+      // HEX of "AuthenticAMD": 41757468 656E7469 63414D44
+      if (cpu_info[1] == 0x756e6547 && cpu_info[3] == 0x49656e69 && cpu_info[2] == 0x6c65746e) {
+        *vendor = CpuInfo::Vendor::Intel;
+      } else if (cpu_info[1] == 0x68747541 && cpu_info[3] == 0x69746e65 &&
+                 cpu_info[2] == 0x444d4163) {
+        *vendor = CpuInfo::Vendor::AMD;
+      }
+
+      if (highest_valid_id <= register_EAX_id) {
+        return;
+      }
+
+      // EAX=1: Processor Info and Feature Bits
+      __cpuidex(cpu_info.data(), register_EAX_id, 0);
+      features_ECX = cpu_info[2];
+
+      // Get highest extended id
+      __cpuid(cpu_info.data(), 0x80000000);
+      highest_extended_valid_id = cpu_info[0];
+
+      // Retrieve CPU model name
+      if (highest_extended_valid_id >= static_cast<int>(0x80000004)) {
+        model_name->clear();
+        for (int i = 0x80000002; i <= static_cast<int>(0x80000004); ++i) {
+          __cpuidex(cpu_info.data(), i, 0);
+          *model_name += std::string(reinterpret_cast<char*>(cpu_info.data()), sizeof(cpu_info));
+        }
+      }
+
+      bool zmm_enabled = false;
+      if (features_ECX[27]) {  // OSXSAVE
+        // Query if the OS supports saving ZMM registers when switching contexts
+        int64_t xcr0 = _xgetbv(0);
+        zmm_enabled = (xcr0 & 0xE0) == 0xE0;
+      }
+
+      if (features_ECX[9]) *hardware_flags |= CpuInfo::SSSE3;
+      if (features_ECX[19]) *hardware_flags |= CpuInfo::SSE4_1;
+      if (features_ECX[20]) *hardware_flags |= CpuInfo::SSE4_2;
+      if (features_ECX[23]) *hardware_flags |= CpuInfo::POPCNT;
+      if (features_ECX[28]) *hardware_flags |= CpuInfo::AVX;
+
+      // cpuid with EAX=7, ECX=0: Extended Features
+      register_EAX_id = 7;
+      if (highest_valid_id > register_EAX_id) {
+        __cpuidex(cpu_info.data(), register_EAX_id, 0);
+        std::bitset<32> features_EBX = cpu_info[1];
+
+        if (features_EBX[3]) *hardware_flags |= CpuInfo::BMI1;
+        if (features_EBX[5]) *hardware_flags |= CpuInfo::AVX2;
+        if (features_EBX[8]) *hardware_flags |= CpuInfo::BMI2;
+        if (zmm_enabled) {
+          if (features_EBX[16]) *hardware_flags |= CpuInfo::AVX512F;
+          if (features_EBX[17]) *hardware_flags |= CpuInfo::AVX512DQ;
+          if (features_EBX[28]) *hardware_flags |= CpuInfo::AVX512CD;
+          if (features_EBX[30]) *hardware_flags |= CpuInfo::AVX512BW;
+          if (features_EBX[31]) *hardware_flags |= CpuInfo::AVX512VL;
+        }
+      }
+    }
+#elif defined(CPUINFO_ARCH_ARM)
+    // Windows on Arm
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      *hardware_flags |= CpuInfo::ASIMD;
+      // TODO: vendor, model_name
+    }
+#endif
+
+#elif defined(__APPLE__)
+    //------------------------------ MACOS ------------------------------//
+    std::optional<int64_t> IntegerSysCtlByName(const char* name) {
+      size_t len = sizeof(int64_t);
+      int64_t data = 0;
+      if (sysctlbyname(name, &data, &len, nullptr, 0) == 0) {
+        return data;
+      }
+      // ENOENT is the official errno value for non-existing sysctl's,
+      // but EINVAL and ENOTSUP have been seen in the wild.
+      if (errno != ENOENT && errno != EINVAL && errno != ENOTSUP) {
+        std::ostringstream ss;
+        ss << "sysctlbyname failed for '" << name << "'";
+        throw ParseError(ss.str());
+      }
+      return std::nullopt;
+    }
+
+    void OsRetrieveCacheSize(std::array<int64_t, kCacheLevels>* cache_sizes) {
+      static_assert(kCacheLevels >= 3, "");
+      auto c = IntegerSysCtlByName("hw.l1dcachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[0] = *c;
+      }
+      c = IntegerSysCtlByName("hw.l2cachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[1] = *c;
+      }
+      c = IntegerSysCtlByName("hw.l3cachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[2] = *c;
+      }
+    }
+
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      // hardware_flags
+      struct SysCtlCpuFeature {
+        const char* name;
+        int64_t flag;
+      };
+      std::vector<SysCtlCpuFeature> features = {
+#if defined(CPUINFO_ARCH_X86)
+        {"hw.optional.sse4_2",
+         CpuInfo::SSSE3 | CpuInfo::SSE4_1 | CpuInfo::SSE4_2 | CpuInfo::POPCNT},
+        {"hw.optional.avx1_0", CpuInfo::AVX},
+        {"hw.optional.avx2_0", CpuInfo::AVX2},
+        {"hw.optional.bmi1", CpuInfo::BMI1},
+        {"hw.optional.bmi2", CpuInfo::BMI2},
+        {"hw.optional.avx512f", CpuInfo::AVX512F},
+        {"hw.optional.avx512cd", CpuInfo::AVX512CD},
+        {"hw.optional.avx512dq", CpuInfo::AVX512DQ},
+        {"hw.optional.avx512bw", CpuInfo::AVX512BW},
+        {"hw.optional.avx512vl", CpuInfo::AVX512VL},
+#elif defined(CPUINFO_ARCH_ARM)
+        // ARM64 (note that this is exposed under Rosetta as well)
+        {"hw.optional.neon", CpuInfo::ASIMD},
+#endif
+      };
+      for (const auto& feature : features) {
+        auto v = IntegerSysCtlByName(feature.name);
+        if (v.value_or(0)) {
+          *hardware_flags |= feature.flag;
+        }
+      }
+
+      // TODO: vendor, model_name

Review Comment:
   Is this TODO actionable, or can it be removed? The same goes for the TODOs below.



##########
c++/test/CMakeLists.txt:
##########
@@ -42,6 +42,7 @@ add_executable (orc-test
   TestReader.cc
   TestRleDecoder.cc
   TestRleEncoder.cc
+  TestRleVectorDecoder.cc

Review Comment:
   Ditto: only append this file when the option is enabled.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode [orc]

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1761397725

   > @wpleonardo still find no improvement if just select int64 type columns.
   > 
   > Q: `select reporttime,appid,uid,clientversioncode,sdkversioncode,starttimestamp,backgroundtotal,foregroundtotal,firstiframesize,firstiframedecodetime,extras,mclientip,mnc,mcc,vsipsuccess,msipsuccess,vsipfail,msipfail,mediaflag,videomutetype,owneruid from file('lz4_new_bigolive_audience_stats_orc.orc') format Null;`
   > 
   > without avx512:
   > 
   > ```
   > localhost:9001, queries: 20, QPS: 2.256, RPS: 2715210.217, MiB/s: 4092.049, result RPS: 0.000, result MiB/s: 0.000.
   > 
   > 0.000%		0.421 sec.	
   > 10.000%		0.423 sec.	
   > 20.000%		0.425 sec.	
   > 30.000%		0.429 sec.	
   > 40.000%		0.433 sec.	
   > 50.000%		0.440 sec.	
   > 60.000%		0.440 sec.	
   > 70.000%		0.442 sec.	
   > 80.000%		0.443 sec.	
   > 90.000%		0.456 sec.	
   > 95.000%		0.457 sec.	
   > 99.000%		0.464 sec.	
   > 99.900%		0.464 sec.	
   > 99.990%		0.464 sec.	
   > ```
   > 
   > with avx512
   > 
   > ```
   > localhost:9001, queries: 20, QPS: 2.216, RPS: 2665968.958, MiB/s: 4017.839, result RPS: 0.000, result MiB/s: 0.000.
   > 
   > 0.000%		0.423 sec.	
   > 10.000%		0.429 sec.	
   > 20.000%		0.431 sec.	
   > 30.000%		0.434 sec.	
   > 40.000%		0.438 sec.	
   > 50.000%		0.442 sec.	
   > 60.000%		0.448 sec.	
   > 70.000%		0.451 sec.	
   > 80.000%		0.453 sec.	
   > 90.000%		0.469 sec.	
   > 95.000%		0.473 sec.	
   > 99.000%		0.482 sec.	
   > 99.900%		0.482 sec.	
   > 99.990%		0.482 sec.	
   > ```
   
   Could you debug your program to check whether ORC is using AVX512 bit-unpacking, for example by checking whether the function "BitUnpackAVX512::readLongs" is invoked when you execute the query?
   If ORC is using AVX512 bit-unpacking, please then run "perf top" to check what proportion of the hotspots falls in the AVX512 bit-unpacking functions, for example the "vectorUnpackX" functions.



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1092940634


##########
c++/src/VectorDecoder.hh:
##########
@@ -0,0 +1,488 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef VECTOR_DECODER_HH

Review Comment:
   Yes, this is specific to AVX512. I will rename it following your suggestion.




[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1415194814

   > > > Do you know why this PR causes `ILLEGAL` failures?
   > > > ```
   > > > 75% tests passed, 2 tests failed out of 8
   > > > 
   > > > Total Test time (real) = 545.23 sec
   > > > 
   > > > The following tests FAILED:
   > > > 	  1 - orc-test (ILLEGAL)
   > > > 	  8 - tool-test (ILLEGAL)
   > > > ```
   > > 
   > > 
   > > All checks have already passed. Thanks.
   > 
   > Please let me know when it is ready for review again. Thanks!
   
   OK, I'm refactoring the AVX512 function unrolledUnpackVector1 and the default part, following your suggestions above. Thank you.



[GitHub] [orc] dongjoon-hyun commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1408085363

   I approved the workflow to run again.



[GitHub] [orc] wgtmac commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1449370607

   That's fine. Please make the CIs happy.
   
   One more thing: could you please add one or more GitHub Actions jobs that run on a platform equipped with AVX512? That way, we can have this new feature tested automatically. The current workflows are here: https://github.com/apache/orc/blob/main/.github/workflows/build_and_test.yml
   
   Please check this for reference: https://github.com/marketplace/actions/run-on-architecture



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1148907631


##########
README.md:
##########
@@ -93,3 +93,15 @@ To build only the C++ library:
 % make test-out
 
 ```
+
+To build the C++ library with AVX512 enabling:
+```shell
+ENV parameter ORC_USER_SIMD_LEVEL is to switch "AVX512" and "NONE" at the running time.

Review Comment:
   Changed to the text below; please check it. Thank you very much!
   
   The CMake option BUILD_ENABLE_AVX512 can be set to "ON" or "OFF" (the default) at compile time. It determines whether the AVX512 SIMD level is compiled into the binaries.
   The environment variable ORC_USER_SIMD_LEVEL can be set to "AVX512" or "NONE" (the default) at run time. It selects the SIMD level used to dispatch code that can apply SIMD optimizations.
   Note that if ORC_USER_SIMD_LEVEL is set to "NONE" at run time, AVX512 will not take effect even if BUILD_ENABLE_AVX512 was set to "ON" at compile time.
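
The interaction between the compile-time option and the run-time environment variable can be sketched in C++ as follows. This is only an illustration under stated assumptions, not the code in the PR: the helper names `toUpper`, `shouldUseAvx512` and `avx512EnabledAtRuntime` are hypothetical, and the case-insensitive comparison is an assumption. Only the macro ORC_HAVE_RUNTIME_AVX512 and the variable ORC_USER_SIMD_LEVEL come from this thread. A real dispatcher would also need to confirm that the CPU supports AVX512.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper: return an upper-cased copy of the input string.
std::string toUpper(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(),
                 [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
  return s;
}

// Hypothetical helper: AVX512 is used only if it was compiled in AND the user
// asked for it. An unset variable is treated as "NONE", the documented default.
// Accepting any letter case is an assumption of this sketch.
bool shouldUseAvx512(bool compiledWithAvx512, const char* userSimdLevel) {
  if (!compiledWithAvx512 || userSimdLevel == nullptr) {
    return false;
  }
  return toUpper(userSimdLevel) == "AVX512";
}

// Hypothetical wrapper: combine the build-time macro with the environment.
bool avx512EnabledAtRuntime() {
#if defined(ORC_HAVE_RUNTIME_AVX512)
  const bool compiled = true;
#else
  const bool compiled = false;
#endif
  return shouldUseAvx512(compiled, std::getenv("ORC_USER_SIMD_LEVEL"));
}
```

Keeping the decision in a pure function such as `shouldUseAvx512` makes each combination of the two settings easy to check without rebuilding or changing the environment.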




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1169459804


##########
c++/src/RLEv2.hh:
##########
@@ -166,6 +166,20 @@ namespace orc {
 
     void next(int16_t* data, uint64_t numValues, const char* notNull) override;
 
+    unsigned char readByte(char** bufStart, char** bufEnd);
+
+    /**
+     * Most hotspot of this function locates in saving stack, so inline this function to have
+     * performance gain.
+     */
+    inline void resetBufferStart(char** bufStart, char** bufEnd, uint64_t len, bool resetBuf,
+                                 uint32_t backupLen);
+
+    char* bufferStart;
+    char* bufferEnd;
+    uint32_t bitsLeft;  // Used by readLongs when bitSize < 8
+    uint32_t curByte;   // Used by anything that uses readLongs

Review Comment:
   I have changed these members back to private and added set/get functions to access them.
   Fixed.
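
A minimal sketch of that pattern, for illustration only. The class name `DecoderStateSketch` and the accessor names are hypothetical and are not taken from the PR; only the four member names come from the diff above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the decoder state; not the real orc::RleDecoderV2.
class DecoderStateSketch {
 public:
  // Read access for helper classes such as an AVX512 unpacker.
  char* getBufferStart() const { return bufferStart; }
  char* getBufferEnd() const { return bufferEnd; }
  uint32_t getBitsLeft() const { return bitsLeft; }
  uint32_t getCurByte() const { return curByte; }

  // Write access, so the members themselves can stay private.
  void setBufferStart(char* start) { bufferStart = start; }
  void setBufferEnd(char* end) { bufferEnd = end; }
  void setBitsLeft(uint32_t bits) { bitsLeft = bits; }
  void setCurByte(uint32_t byte) { curByte = byte; }

 private:
  char* bufferStart = nullptr;
  char* bufferEnd = nullptr;
  uint32_t bitsLeft = 0;  // Used by readLongs when bitSize < 8
  uint32_t curByte = 0;   // Used by anything that uses readLongs
};
```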




[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1519229254

   Hi @wgtmac @dongjoon-hyun, the CI tests passed. Do you have any other comments? Thank you very much!



[GitHub] [orc] wgtmac commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1517444841

   Thanks, @stiga-huang and @wpleonardo!



[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1517823536

   Just fixed an AVX512 flag check issue on the Windows platform.
   In the Windows CI test, the test machine doesn't have AVX512 CPU flags, but the checking code in the CMake file failed to detect this. The reason is that
   check_cxx_compiler_flag("/arch:AVX512" COMPILER_SUPPORT_AVX512)
   only checks whether the use of AVX512 instructions can be enabled in the compiler (https://learn.microsoft.com/en-us/cpp/build/reference/arch-x64?view=msvc-170), not whether the CPU has AVX512 flags.
   So, I changed the checking code to
   check_cxx_compiler_flag("-mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw" COMPILER_SUPPORT_AVX512)
   This is intended to verify directly whether the current CPU has AVX512 instructions.
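
For comparison, whether the CPU running the process actually reports AVX512 can also be asked at run time. The sketch below is a hedged illustration, not the PR's approach: the PR's CpuInfoUtil.cc queries cpuid directly, whereas this uses the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, and falls back to `false` on other compilers or architectures. The function name `cpuHasAvx512` is hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: returns true only if the CPU running this process reports
// all five AVX512 subsets named by the -mavx512* flags in the check above.
bool cpuHasAvx512() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();  // Initializes the CPU feature data used by the builtins
  return __builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512cd") &&
         __builtin_cpu_supports("avx512vl") && __builtin_cpu_supports("avx512dq") &&
         __builtin_cpu_supports("avx512bw");
#else
  // Assumption of this sketch: report "not supported" when we cannot ask.
  return false;
#endif
}
```

A result of `false` here means the AVX512 code path should not be dispatched on this machine, even if the binary was built with AVX512 support.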



[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1429277096

   @wgtmac May I ask a question about your suggestion:
   "The options to enable AVX512 at compile and runtime are still confusing. The goal is to add one cmake option to control if AVX512 should be compiled. At runtime, we simply need an environment variable to enable/disable AVX512 at runtime if AVX512 has been compiled. We don't need to do the same thing as Apache Arrow because it has a different context."
   
   Currently, I have added a cmake option "BUILD_ENABLE_AVX512" to control whether the AVX512 code is compiled.
   (For example, cmake command: cmake .. -DCMAKE_BUILD_TYPE=debug -DBUILD_JAVA=OFF -DBUILD_ENABLE_AVX512=ON)
   Then, if BUILD_ENABLE_AVX512 is ON, the macro "ORC_HAVE_RUNTIME_AVX512" is defined during the cmake process. It enables the build of the AVX512 code during the make process.
   
   I also created an environment variable "ORC_USER_SIMD_LEVEL", whose value can be "none" or "avx512", to enable or disable AVX512 at run time.
   
   Which part of this process is confusing? Is the environment variable name "ORC_USER_SIMD_LEVEL" too similar to the macro "ORC_HAVE_RUNTIME_AVX512", or is it something else?
   Thank you very much!



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1148180787


##########
c++/src/RLEv2.hh:
##########
@@ -189,23 +203,10 @@ namespace orc {
       resetReadLongs();
     }
 
-    unsigned char readByte();
-
     int64_t readLongBE(uint64_t bsz);
     int64_t readVslong();
     uint64_t readVulong();
-    void readLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs);
-    void plainUnpackLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs);
-
-    void unrolledUnpack4(int64_t* data, uint64_t offset, uint64_t len);
-    void unrolledUnpack8(int64_t* data, uint64_t offset, uint64_t len);
-    void unrolledUnpack16(int64_t* data, uint64_t offset, uint64_t len);
-    void unrolledUnpack24(int64_t* data, uint64_t offset, uint64_t len);
-    void unrolledUnpack32(int64_t* data, uint64_t offset, uint64_t len);
-    void unrolledUnpack40(int64_t* data, uint64_t offset, uint64_t len);
-    void unrolledUnpack48(int64_t* data, uint64_t offset, uint64_t len);
-    void unrolledUnpack56(int64_t* data, uint64_t offset, uint64_t len);
-    void unrolledUnpack64(int64_t* data, uint64_t offset, uint64_t len);
+    int readLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs);

Review Comment:
   Changed the RleDecoderV2::readLongs return type back to void.




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1138649541


##########
c++/src/Bpacking.hh:
##########
@@ -0,0 +1,37 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKING_HH
+#define ORC_BPACKING_HH
+
+#include <stdint.h>

Review Comment:
   OK, changed




[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1431352085

   Hi @wgtmac @coderex2522, I just modified the code following your suggestions; please check it. Thank you very much for your help!
   1. Modified the CMakeLists: deleted the aarch64 and ORC_RUNTIME_SIMD_LEVEL parts, and changed the content of the printed messages.
   2. Modified the content and style of the printed messages for BUILD_ENABLE_AVX512, CXX_SUPPORTS_AVX512, ORC_HAVE_RUNTIME_AVX512 and ORC_SIMD_LEVEL, and deleted the print of CXX_SUPPORTS_AVX512.
   Below is the information printed during the cmake process:
   -- System processor: x86_64
   -- Performing Test CXX_SUPPORTS_AVX512
   -- Performing Test CXX_SUPPORTS_AVX512 - Success
   -- BUILD_ENABLE_AVX512: ON
   -- Enable the AVX512 vector decode of bit-packing, compiler support AVX512
   -- ORC_HAVE_RUNTIME_AVX512: ON, ORC_SIMD_LEVEL: AVX512
   
   -- System processor: x86_64
   -- Performing Test CXX_SUPPORTS_AVX512
   -- Performing Test CXX_SUPPORTS_AVX512 - Success
   -- BUILD_ENABLE_AVX512: OFF
   -- Disable the AVX512 vector decode of bit-packing
   -- ORC_HAVE_RUNTIME_AVX512: OFF, ORC_SIMD_LEVEL: NONE
   3. Separated the AVX512 configuration from the CMakeLists into a new cmake module file, "cmake_modules/ConfigSimdLevel.cmake".
   4. The default value of BUILD_ENABLE_AVX512 is still ON. Do we need to change it back to OFF?
   5. Modified the style of the code comments.
   6. Deleted message(FATAL_ERROR "Unknown system processor") to avoid breaking the build process.



[GitHub] [orc] wgtmac commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1142898582


##########
c++/src/Bpacking.hh:
##########
@@ -0,0 +1,32 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKING_HH
+#define ORC_BPACKING_HH
+
+#include <cstdint>
+
+namespace orc {
+  class BitUnpack {
+   public:
+    static int readLongs(RleDecoderV2* decoder, int64_t* data, uint64_t offset, uint64_t len,

Review Comment:
   nit: add a forward declaration for `RleDecoderV2`. I am not sure whether some compilers will complain.
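
A minimal sketch of what such a forward declaration could look like. The quoted hunk cuts the signature off after `uint64_t len,`, so the final `fbs` parameter is an assumption borrowed from `RleDecoderV2::readLongs` elsewhere in this thread. The stub body exists only so the sketch compiles on its own; it is not the PR's implementation.

```cpp
#include <cassert>
#include <cstdint>

namespace orc {
  // Forward declaration: an incomplete type is enough here because the header
  // only uses RleDecoderV2 through a pointer parameter.
  class RleDecoderV2;

  class BitUnpack {
   public:
    // The trailing `fbs` parameter is assumed; the original line is truncated.
    static int readLongs(RleDecoderV2* decoder, int64_t* data, uint64_t offset, uint64_t len,
                         uint64_t fbs);
  };

  // Placeholder definition so this sketch links; the real body lives in the PR.
  int BitUnpack::readLongs(RleDecoderV2*, int64_t*, uint64_t, uint64_t, uint64_t) {
    return 0;
  }
}  // namespace orc
```

With the forward declaration in place, the header no longer depends on another header having declared `RleDecoderV2` first.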



##########
c++/src/BpackingAvx512.hh:
##########
@@ -0,0 +1,89 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKINGAVX512_HH
+#define ORC_BPACKINGAVX512_HH
+
+#include <stdlib.h>

Review Comment:
   Can we use cstdlib?



##########
c++/src/BpackingAvx512.hh:
##########
@@ -0,0 +1,89 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKINGAVX512_HH
+#define ORC_BPACKINGAVX512_HH
+
+#include <stdlib.h>
+#include <cstdint>
+
+#include "BpackingDefault.hh"
+
+namespace orc {
+
+#define MAX_VECTOR_BUF_8BIT_LENGTH 64
+#define MAX_VECTOR_BUF_16BIT_LENGTH 32
+#define MAX_VECTOR_BUF_32BIT_LENGTH 16
+
+  class UnpackAvx512 {
+   public:
+    UnpackAvx512(RleDecoderV2* dec);
+    ~UnpackAvx512();
+
+    void vectorUnpack1(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack2(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack3(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack4(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack5(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack6(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack7(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack9(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack10(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack11(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack12(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack13(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack14(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack15(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack16(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack17(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack18(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack19(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack20(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack21(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack22(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack23(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack24(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack26(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack28(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack30(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack32(int64_t* data, uint64_t offset, uint64_t len);
+
+    void plainUnpackLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs,
+                          uint64_t& startBit);
+
+   private:
+    RleDecoderV2* decoder;
+    UnpackDefault unpackDefault;
+
+    // Used by vectorially 1~8 bit-unpacking data
+    uint8_t vectorBuf8[MAX_VECTOR_BUF_8BIT_LENGTH + 1];
+    // Used by vectorially 9~16 bit-unpacking data
+    uint16_t vectorBuf16[MAX_VECTOR_BUF_16BIT_LENGTH + 1];
+    // Used by vectorially 17~32 bit-unpacking data
+    uint32_t vectorBuf32[MAX_VECTOR_BUF_32BIT_LENGTH + 1];

Review Comment:
   Can we consolidate them into a single buffer to reduce memory consumption? For example we can use the buffer of largest length and make three separate pointers to point to the buffer for different bit widths.
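
A rough sketch of the consolidation idea, assuming only one element width is in use at any time. It uses a union instead of three separate pointers into a byte buffer, which is a swapped-in variant chosen to sidestep aliasing concerns; the names `sketch`, `VectorBuf`, `kMaxBuf8`, `kMaxBuf16`, `kMaxBuf32`, and `roundTrip32` are hypothetical and not part of the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace sketch {
  // Same capacities as the MAX_VECTOR_BUF_*_LENGTH macros in the diff.
  constexpr std::size_t kMaxBuf8 = 64;
  constexpr std::size_t kMaxBuf16 = 32;
  constexpr std::size_t kMaxBuf32 = 16;

  // One storage area viewed at three element widths.
  // Only the member that was written last should be read.
  union VectorBuf {
    uint8_t as8[kMaxBuf8 + 1];     // 1~8 bit values
    uint16_t as16[kMaxBuf16 + 1];  // 9~16 bit values
    uint32_t as32[kMaxBuf32 + 1];  // 17~32 bit values
  };

  // The union is smaller than the three separate arrays combined.
  static_assert(sizeof(VectorBuf) < sizeof(uint8_t[kMaxBuf8 + 1]) +
                                        sizeof(uint16_t[kMaxBuf16 + 1]) +
                                        sizeof(uint32_t[kMaxBuf32 + 1]),
                "consolidated buffer should save memory");

  // Writes a value through the 32-bit view and reads it back through the same view.
  inline uint32_t roundTrip32(VectorBuf& buf, std::size_t idx, uint32_t value) {
    assert(idx <= kMaxBuf32);
    buf.as32[idx] = value;
    return buf.as32[idx];
  }
}  // namespace sketch
```

The saving per decoder instance is small (the three arrays total roughly 200 bytes), so whether this is worth the extra care about which view is active is a judgment call.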



##########
c++/src/BpackingDefault.hh:
##########
@@ -0,0 +1,60 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKINGDEFAULT_HH
+#define ORC_BPACKINGDEFAULT_HH
+
+#include <stdlib.h>

Review Comment:
   ditto



##########
c++/src/CpuInfoUtil.hh:
##########
@@ -0,0 +1,113 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * @file CpuInfoUtil.hh code borrowing from

Review Comment:
   ```suggestion
    * @file CpuInfoUtil.hh is from Apache Arrow as of 2023-03-21
   ```



##########
c++/src/CpuInfoUtil.hh:
##########
@@ -0,0 +1,113 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * @file CpuInfoUtil.hh code borrowing from

Review Comment:
   We can remove the link below because it may change in the future.



##########
c++/src/BpackingAvx512.hh:
##########
@@ -0,0 +1,89 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKINGAVX512_HH
+#define ORC_BPACKINGAVX512_HH
+
+#include <stdlib.h>
+#include <cstdint>
+
+#include "BpackingDefault.hh"
+
+namespace orc {
+
+#define MAX_VECTOR_BUF_8BIT_LENGTH 64
+#define MAX_VECTOR_BUF_16BIT_LENGTH 32
+#define MAX_VECTOR_BUF_32BIT_LENGTH 16
+
+  class UnpackAvx512 {
+   public:
+    UnpackAvx512(RleDecoderV2* dec);

Review Comment:
   same here, add forward declaration of RleDecoderV2



##########
c++/src/BpackingDefault.hh:
##########
@@ -0,0 +1,60 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKINGDEFAULT_HH
+#define ORC_BPACKINGDEFAULT_HH
+
+#include <stdlib.h>
+#include <cstdint>
+
+#include "RLEv2.hh"
+
+#include "Bpacking.hh"
+

Review Comment:
   ```suggestion
   #include "Bpacking.hh"
   #include "RLEv2.hh"
   ```



##########
c++/src/BpackingDefault.hh:
##########
@@ -0,0 +1,60 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKINGDEFAULT_HH
+#define ORC_BPACKINGDEFAULT_HH
+
+#include <stdlib.h>
+#include <cstdint>
+
+#include "RLEv2.hh"
+
+#include "Bpacking.hh"
+

Review Comment:
   BTW, we can also remove RLEv2.hh here and use a forward declaration.



##########
c++/src/CpuInfoUtil.hh:
##########
@@ -0,0 +1,113 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * @file CpuInfoUtil.hh code borrowing from
+ * https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/cpu_info.h
+ * @file CpuInfoUtil.cc code borrowing from
+ * https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/cpu_info.cc

Review Comment:
   Move this disclaimer to the cc file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1441247394

> @wpleonardo It seems that the CI check does not provide sufficient error message which makes the debugging painful. Please check out the docker files provided here: https://github.com/apache/orc/tree/main/docker. Hope it helps.

Hi @wgtmac, following your suggestion, I ran the CI tests in Docker containers for the different platforms, but I can't reproduce the failing CI test cases locally.
The orc-test and tool-test, which failed in your CI run, passed on all of these platforms (ubuntu22, ubuntu20, ubuntu18, debian10_jdk-11, and CentOS 7). I couldn't find a macOS Docker image at https://hub.docker.com/r/apache/orc-dev/tags?page=1&ordering=-name
In my own runs I disabled the Java build (cmake .. -DBUILD_JAVA=OFF && make package test-out), because the Java parts passed in your CI run and, in my opinion, should be unrelated to the C++ code.
(I ran these tests against my own repo, https://github.com/wpleonardo/orc.git, with the environment variable "ORC_USER_SIMD_LEVEL=avx512" exported.
The AVX512 information is printed during the cmake step, so I'm sure my own repo was used for the CI test.)

Do you have any idea about it? Thank you very much!


[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1144373438


##########
c++/src/BpackingDefault.hh:
##########
@@ -0,0 +1,60 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKINGDEFAULT_HH
+#define ORC_BPACKINGDEFAULT_HH
+
+#include <stdlib.h>
+#include <cstdint>
+
+#include "RLEv2.hh"
+
+#include "Bpacking.hh"
+

Review Comment:
   Done. Already removed RLEv2.hh here
   https://github.com/wpleonardo/orc/blob/f053f9c73bf13fe29aff95cfe4cb71857c57da07/c%2B%2B/src/BpackingDefault.hh#L28




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1163489533


##########
c++/src/BitUnpackerAvx512.hh:
##########
@@ -0,0 +1,488 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BIT_UNPACKER_AVX512_HH
+#define ORC_BIT_UNPACKER_AVX512_HH
+
+// Mingw-w64 defines strcasecmp in string.h
+#if defined(_WIN32) && !defined(strcasecmp)
+#include <string.h>
+#define strcasecmp stricmp
+#else
+#include <strings.h>
+#endif
+
+#include <immintrin.h>
+#include <cstdint>
+#include <vector>
+
+namespace orc {
+#define ORC_VECTOR_BITS_2_BYTE(x) \
+  (((x) + 7u) >> 3u) /**< Convert a number of bits to a number of bytes */
+#define ORC_VECTOR_ONE_64U (1ULL)
+#define ORC_VECTOR_MAX_16U 0xFFFF     /**< Max value for uint16_t */
+#define ORC_VECTOR_MAX_32U 0xFFFFFFFF /**< Max value for uint32_t */
+#define ORC_VECTOR_BYTE_WIDTH 8u      /**< Byte width in bits */
+#define ORC_VECTOR_WORD_WIDTH 16u     /**< Word width in bits */
+#define ORC_VECTOR_DWORD_WIDTH 32u    /**< Dword width in bits */
+#define ORC_VECTOR_QWORD_WIDTH 64u    /**< Qword width in bits */
+#define ORC_VECTOR_BIT_MASK(x) \
+  ((ORC_VECTOR_ONE_64U << (x)) - 1u) /**< Bit mask below bit position */
+
+#define ORC_VECTOR_BITS_2_WORD(x) \
+  (((x) + 15u) >> 4u) /**< Convert a number of bits to a number of words */
+#define ORC_VECTOR_BITS_2_DWORD(x) \
+  (((x) + 31u) >> 5u) /**< Convert a number of bits to a number of double words */
+
+  // ------------------------------------ 3u -----------------------------------------
+  static uint8_t shuffleIdxTable3u_0[64] = {
+      1u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 4u, 3u, 5u, 4u, 6u, 5u, 1u, 0u, 1u, 0u, 2u, 1u,
+      3u, 2u, 4u, 3u, 4u, 3u, 5u, 4u, 6u, 5u, 1u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 4u, 3u,
+      5u, 4u, 6u, 5u, 1u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 4u, 3u, 5u, 4u, 6u, 5u};
+  static uint8_t shuffleIdxTable3u_1[64] = {
+      0u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u, 0u, 0u, 1u, 0u, 2u, 1u,
+      3u, 2u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u, 0u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 3u, 2u, 4u, 3u,
+      5u, 4u, 6u, 5u, 0u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u};
+  static uint16_t shiftTable3u_0[32] = {13u, 7u,  9u,  11u, 13u, 7u,  9u,  11u, 13u, 7u,  9u,
+                                        11u, 13u, 7u,  9u,  11u, 13u, 7u,  9u,  11u, 13u, 7u,
+                                        9u,  11u, 13u, 7u,  9u,  11u, 13u, 7u,  9u,  11u};
+  static uint16_t shiftTable3u_1[32] = {6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u,
+                                        0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u,
+                                        2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u};
+  static uint16_t permutexIdxTable3u[32] = {0u,  1u,  2u,  0x0, 0x0, 0x0, 0x0, 0x0, 3u,  4u,  5u,
+                                            0x0, 0x0, 0x0, 0x0, 0x0, 6u,  7u,  8u,  0x0, 0x0, 0x0,
+                                            0x0, 0x0, 9u,  10u, 11u, 0x0, 0x0, 0x0, 0x0, 0x0};
+
+  // ------------------------------------ 5u -----------------------------------------
+  static uint8_t shuffleIdxTable5u_0[64] = {
+      1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u,
+      4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u,
+      8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u};
+  static uint8_t shuffleIdxTable5u_1[64] = {
+      1u, 0u, 2u,  1u, 3u, 2u, 5u, 4u, 6u,  5u, 7u, 6u, 8u, 7u, 10u, 9u, 1u, 0u, 2u,  1u, 3u, 2u,
+      5u, 4u, 6u,  5u, 7u, 6u, 8u, 7u, 10u, 9u, 1u, 0u, 2u, 1u, 3u,  2u, 5u, 4u, 6u,  5u, 7u, 6u,
+      8u, 7u, 10u, 9u, 1u, 0u, 2u, 1u, 3u,  2u, 5u, 4u, 6u, 5u, 7u,  6u, 8u, 7u, 10u, 9u};
+  static uint16_t shiftTable5u_0[32] = {11u, 9u,  7u,  5u, 11u, 9u,  7u,  5u, 11u, 9u,  7u,
+                                        5u,  11u, 9u,  7u, 5u,  11u, 9u,  7u, 5u,  11u, 9u,
+                                        7u,  5u,  11u, 9u, 7u,  5u,  11u, 9u, 7u,  5u};
+  static uint16_t shiftTable5u_1[32] = {2u, 4u, 6u, 0u, 2u, 4u, 6u, 0u, 2u, 4u, 6u,
+                                        0u, 2u, 4u, 6u, 0u, 2u, 4u, 6u, 0u, 2u, 4u,
+                                        6u, 0u, 2u, 4u, 6u, 0u, 2u, 4u, 6u, 0u};
+  static uint16_t permutexIdxTable5u[32] = {0u,  1u,  2u,  3u,  4u,  0x0, 0x0, 0x0, 5u,  6u,  7u,
+                                            8u,  9u,  0x0, 0x0, 0x0, 10u, 11u, 12u, 13u, 14u, 0x0,
+                                            0x0, 0x0, 15u, 16u, 17u, 18u, 19u, 0x0, 0x0, 0x0};
+
+  // ------------------------------------ 6u -----------------------------------------
+  static uint8_t shuffleIdxTable6u_0[64] = {
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u};
+  static uint8_t shuffleIdxTable6u_1[64] = {
+      1u, 0u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 9u, 8u, 10u, 9u, 12u, 11u,
+      1u, 0u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 9u, 8u, 10u, 9u, 12u, 11u,
+      1u, 0u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 9u, 8u, 10u, 9u, 12u, 11u,
+      1u, 0u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 9u, 8u, 10u, 9u, 12u, 11u};
+  static uint16_t shiftTable6u_0[32] = {10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u,  10u,
+                                        6u,  10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u,
+                                        10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u};
+  static uint16_t shiftTable6u_1[32] = {4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u,
+                                        0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u,
+                                        4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u};
+  static uint32_t permutexIdxTable6u[16] = {0u, 1u, 2u, 0x0, 3u, 4u,  5u,  0x0,
+                                            6u, 7u, 8u, 0x0, 9u, 10u, 11u, 0x0};
+
+  // ------------------------------------ 7u -----------------------------------------
+  static uint8_t shuffleIdxTable7u_0[64] = {
+      1u, 0u, 2u, 1u, 4u, 3u, 6u, 5u, 8u, 7u, 9u, 8u, 11u, 10u, 13u, 12u,
+      1u, 0u, 2u, 1u, 4u, 3u, 6u, 5u, 8u, 7u, 9u, 8u, 11u, 10u, 13u, 12u,
+      1u, 0u, 2u, 1u, 4u, 3u, 6u, 5u, 8u, 7u, 9u, 8u, 11u, 10u, 13u, 12u,
+      1u, 0u, 2u, 1u, 4u, 3u, 6u, 5u, 8u, 7u, 9u, 8u, 11u, 10u, 13u, 12u};
+  static uint8_t shuffleIdxTable7u_1[64] = {
+      1u, 0u, 3u, 2u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 12u, 11u, 14u, 13u,
+      1u, 0u, 3u, 2u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 12u, 11u, 14u, 13u,
+      1u, 0u, 3u, 2u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 12u, 11u, 14u, 13u,
+      1u, 0u, 3u, 2u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 12u, 11u, 14u, 13u};
+  static uint16_t shiftTable7u_0[32] = {9u, 3u, 5u, 7u, 9u, 3u, 5u, 7u, 9u, 3u, 5u,
+                                        7u, 9u, 3u, 5u, 7u, 9u, 3u, 5u, 7u, 9u, 3u,
+                                        5u, 7u, 9u, 3u, 5u, 7u, 9u, 3u, 5u, 7u};
+  static uint16_t shiftTable7u_1[32] = {6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u,
+                                        0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u,
+                                        2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u};
+  static uint16_t permutexIdxTable7u[32] = {0u,  1u,  2u,  3u,  4u,  5u,  6u,  0x0, 7u,  8u,  9u,
+                                            10u, 11u, 12u, 13u, 0x0, 14u, 15u, 16u, 17u, 18u, 19u,
+                                            20u, 0x0, 21u, 22u, 23u, 24u, 25u, 26u, 27u, 0x0};
+
+  // ------------------------------------ 9u -----------------------------------------
+  static uint16_t permutexIdxTable9u_0[32] = {0u,  1u,  1u,  2u,  2u,  3u,  3u,  4u,  4u,  5u,  5u,
+                                              6u,  6u,  7u,  7u,  8u,  9u,  10u, 10u, 11u, 11u, 12u,
+                                              12u, 13u, 13u, 14u, 14u, 15u, 15u, 16u, 16u, 17u};
+  static uint16_t permutexIdxTable9u_1[32] = {0u,  1u,  1u,  2u,  2u,  3u,  3u,  4u,  5u,  6u,  6u,
+                                              7u,  7u,  8u,  8u,  9u,  9u,  10u, 10u, 11u, 11u, 12u,
+                                              12u, 13u, 14u, 15u, 15u, 16u, 16u, 17u, 17u, 18u};
+  static uint32_t shiftTable9u_0[16] = {0u, 2u, 4u, 6u, 8u, 10u, 12u, 14u,
+                                        0u, 2u, 4u, 6u, 8u, 10u, 12u, 14u};
+  static uint32_t shiftTable9u_1[16] = {7u, 5u, 3u, 1u, 15u, 13u, 11u, 9u,
+                                        7u, 5u, 3u, 1u, 15u, 13u, 11u, 9u};
+
+  static uint8_t shuffleIdxTable9u_0[64] = {
+      1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u, 7u, 6u, 8u, 7u, 1u, 0u, 2u, 1u, 3u, 2u,
+      4u, 3u, 5u, 4u, 6u, 5u, 7u, 6u, 8u, 7u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u,
+      7u, 6u, 8u, 7u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u, 7u, 6u, 8u, 7u};
+  static uint16_t shiftTable9u_2[32] = {7u, 6u, 5u, 4u, 3u, 2u, 1u, 0u, 7u, 6u, 5u,
+                                        4u, 3u, 2u, 1u, 0u, 7u, 6u, 5u, 4u, 3u, 2u,
+                                        1u, 0u, 7u, 6u, 5u, 4u, 3u, 2u, 1u, 0u};
+  static uint64_t gatherIdxTable9u[8] = {0u, 8u, 9u, 17u, 18u, 26u, 27u, 35u};
+
+  // ------------------------------------ 10u -----------------------------------------
+  static uint8_t shuffleIdxTable10u_0[64] = {
+      1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u,
+      4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u,
+      8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u};
+  static uint16_t shiftTable10u[32] = {6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u,
+                                       0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u,
+                                       2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u};
+  static uint16_t permutexIdxTable10u[32] = {0u,  1u,  2u,  3u,  4u,  0x0, 0x0, 0x0, 5u,  6u,  7u,
+                                             8u,  9u,  0x0, 0x0, 0x0, 10u, 11u, 12u, 13u, 14u, 0x0,
+                                             0x0, 0x0, 15u, 16u, 17u, 18u, 19u, 0x0, 0x0, 0x0};
+
+  // ------------------------------------ 11u -----------------------------------------
+  static uint16_t permutexIdxTable11u_0[32] = {
+      0u,  1u,  1u,  2u,  2u,  3u,  4u,  5u,  5u,  6u,  6u,  7u,  8u,  9u,  9u,  10u,
+      11u, 12u, 12u, 13u, 13u, 14u, 15u, 16u, 16u, 17u, 17u, 18u, 19u, 20u, 20u, 21u};
+  static uint16_t permutexIdxTable11u_1[32] = {
+      0u,  1u,  2u,  3u,  3u,  4u,  4u,  5u,  6u,  7u,  7u,  8u,  8u,  9u,  10u, 11u,
+      11u, 12u, 13u, 14u, 14u, 15u, 15u, 16u, 17u, 18u, 18u, 19u, 19u, 20u, 21u, 22u};
+  static uint32_t shiftTable11u_0[16] = {0u, 6u, 12u, 2u, 8u, 14u, 4u, 10u,
+                                         0u, 6u, 12u, 2u, 8u, 14u, 4u, 10u};
+  static uint32_t shiftTable11u_1[16] = {5u, 15u, 9u, 3u, 13u, 7u, 1u, 11u,
+                                         5u, 15u, 9u, 3u, 13u, 7u, 1u, 11u};
+
+  static uint8_t shuffleIdxTable11u_0[64] = {
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u};
+  static uint8_t shuffleIdxTable11u_1[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 8u, 7u, 6u, 0u, 11u, 10u, 9u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 8u, 7u, 6u, 0u, 11u, 10u, 9u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 8u, 7u, 6u, 0u, 11u, 10u, 9u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 8u, 7u, 6u, 0u, 11u, 10u, 9u, 0u};
+  static uint32_t shiftTable11u_2[16] = {21u, 15u, 17u, 19u, 21u, 15u, 17u, 19u,
+                                         21u, 15u, 17u, 19u, 21u, 15u, 17u, 19u};
+  static uint32_t shiftTable11u_3[16] = {6u, 4u, 10u, 8u, 6u, 4u, 10u, 8u,
+                                         6u, 4u, 10u, 8u, 6u, 4u, 10u, 8u};
+  static uint64_t gatherIdxTable11u[8] = {0u, 8u, 11u, 19u, 22u, 30u, 33u, 41u};
+
+  // ------------------------------------ 12u -----------------------------------------
+  static uint8_t shuffleIdxTable12u_0[64] = {
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u};
+  static uint16_t shiftTable12u[32] = {4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u,
+                                       0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u,
+                                       4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u};
+  static uint32_t permutexIdxTable12u[16] = {0u, 1u, 2u, 0x0, 3u, 4u,  5u,  0x0,
+                                             6u, 7u, 8u, 0x0, 9u, 10u, 11u, 0x0};
+
+  // ------------------------------------ 13u -----------------------------------------
+  static uint16_t permutexIdxTable13u_0[32] = {
+      0u,  1u,  1u,  2u,  3u,  4u,  4u,  5u,  6u,  7u,  8u,  9u,  9u,  10u, 11u, 12u,
+      13u, 14u, 14u, 15u, 16u, 17u, 17u, 18u, 19u, 20u, 21u, 22u, 22u, 23u, 24u, 25u};
+  static uint16_t permutexIdxTable13u_1[32] = {
+      0u,  1u,  2u,  3u,  4u,  5u,  5u,  6u,  7u,  8u,  8u,  9u,  10u, 11u, 12u, 13u,
+      13u, 14u, 15u, 16u, 17u, 18u, 18u, 19u, 20u, 21u, 21u, 22u, 23u, 24u, 25u, 26u};
+  static uint32_t shiftTable13u_0[16] = {0u, 10u, 4u, 14u, 8u, 2u, 12u, 6u,
+                                         0u, 10u, 4u, 14u, 8u, 2u, 12u, 6u};
+  static uint32_t shiftTable13u_1[16] = {3u, 9u, 15u, 5u, 11u, 1u, 7u, 13u,
+                                         3u, 9u, 15u, 5u, 11u, 1u, 7u, 13u};
+
+  static uint8_t shuffleIdxTable13u_0[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u};
+  static uint8_t shuffleIdxTable13u_1[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 10u, 9u, 8u, 0u, 13u, 12u, 11u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 10u, 9u, 8u, 0u, 13u, 12u, 11u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 10u, 9u, 8u, 0u, 13u, 12u, 11u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 10u, 9u, 8u, 0u, 13u, 12u, 11u, 0u};
+  static uint32_t shiftTable13u_2[16] = {19u, 17u, 15u, 13u, 19u, 17u, 15u, 13u,
+                                         19u, 17u, 15u, 13u, 19u, 17u, 15u, 13u};
+  static uint32_t shiftTable13u_3[16] = {10u, 12u, 6u, 8u, 10u, 12u, 6u, 8u,
+                                         10u, 12u, 6u, 8u, 10u, 12u, 6u, 8u};
+  static uint64_t gatherIdxTable13u[8] = {0u, 8u, 13u, 21u, 26u, 34u, 39u, 47u};
+
+  // ------------------------------------ 14u -----------------------------------------
+  static uint8_t shuffleIdxTable14u_0[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u};
+  static uint8_t shuffleIdxTable14u_1[64] = {
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 10u, 9u, 8u, 0u, 14u, 13u, 12u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 10u, 9u, 8u, 0u, 14u, 13u, 12u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 10u, 9u, 8u, 0u, 14u, 13u, 12u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 10u, 9u, 8u, 0u, 14u, 13u, 12u, 0u};
+  static uint32_t shiftTable14u_0[16] = {18u, 14u, 18u, 14u, 18u, 14u, 18u, 14u,
+                                         18u, 14u, 18u, 14u, 18u, 14u, 18u, 14u};
+  static uint32_t shiftTable14u_1[16] = {12u, 8u, 12u, 8u, 12u, 8u, 12u, 8u,
+                                         12u, 8u, 12u, 8u, 12u, 8u, 12u, 8u};
+  static uint16_t permutexIdxTable14u[32] = {0u,  1u,  2u,  3u,  4u,  5u,  6u,  0x0, 7u,  8u,  9u,
+                                             10u, 11u, 12u, 13u, 0x0, 14u, 15u, 16u, 17u, 18u, 19u,
+                                             20u, 0x0, 21u, 22u, 23u, 24u, 25u, 26u, 27u, 0x0};
+
+  // ------------------------------------ 15u -----------------------------------------
+  static uint16_t permutexIdxTable15u_0[32] = {
+      0u,  1u,  1u,  2u,  3u,  4u,  5u,  6u,  7u,  8u,  9u,  10u, 11u, 12u, 13u, 14u,
+      15u, 16u, 16u, 17u, 18u, 19u, 20u, 21u, 22u, 23u, 24u, 25u, 26u, 27u, 28u, 29u};
+  static uint16_t permutexIdxTable15u_1[32] = {
+      0u,  1u,  2u,  3u,  4u,  5u,  6u,  7u,  8u,  9u,  10u, 11u, 12u, 13u, 14u, 15u,
+      15u, 16u, 17u, 18u, 19u, 20u, 21u, 22u, 23u, 24u, 25u, 26u, 27u, 28u, 29u, 30u};
+  static uint32_t shiftTable15u_0[16] = {0u, 14u, 12u, 10u, 8u, 6u, 4u, 2u,
+                                         0u, 14u, 12u, 10u, 8u, 6u, 4u, 2u};
+  static uint32_t shiftTable15u_1[16] = {1u, 3u, 5u, 7u, 9u, 11u, 13u, 15u,
+                                         1u, 3u, 5u, 7u, 9u, 11u, 13u, 15u};
+
+  static uint8_t shuffleIdxTable15u_0[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 14u, 13u, 12u, 11u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 14u, 13u, 12u, 11u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 14u, 13u, 12u, 11u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 14u, 13u, 12u, 11u};
+  static uint8_t shuffleIdxTable15u_1[64] = {
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 11u, 10u, 9u, 0u, 15u, 14u, 13u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 11u, 10u, 9u, 0u, 15u, 14u, 13u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 11u, 10u, 9u, 0u, 15u, 14u, 13u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 11u, 10u, 9u, 0u, 15u, 14u, 13u, 0u};
+  static uint32_t shiftTable15u_2[16] = {17u, 11u, 13u, 15u, 17u, 11u, 13u, 15u,
+                                         17u, 11u, 13u, 15u, 17u, 11u, 13u, 15u};
+  static uint32_t shiftTable15u_3[16] = {14u, 12u, 10u, 8u, 14u, 12u, 10u, 8u,
+                                         14u, 12u, 10u, 8u, 14u, 12u, 10u, 8u};
+  static uint64_t gatherIdxTable15u[8] = {0u, 8u, 15u, 23u, 30u, 38u, 45u, 53u};
+
+  // ------------------------------------ 17u -----------------------------------------
+  static uint32_t permutexIdxTable17u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               4u, 5u, 5u, 6u, 6u, 7u, 7u, 8u};
+  static uint32_t permutexIdxTable17u_1[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               4u, 5u, 5u, 6u, 6u, 7u, 7u, 8u};
+  static uint64_t shiftTable17u_0[8] = {0u, 2u, 4u, 6u, 8u, 10u, 12u, 14u};
+  static uint64_t shiftTable17u_1[8] = {15u, 13u, 11u, 9u, 7u, 5u, 3u, 1u};
+
+  static uint8_t shuffleIdxTable17u_0[64] = {
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u,
+      3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u,
+      9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u};
+  static uint32_t shiftTable17u_2[16] = {15u, 14u, 13u, 12u, 11u, 10u, 9u, 8u,
+                                         15u, 14u, 13u, 12u, 11u, 10u, 9u, 8u};
+  static uint64_t gatherIdxTable17u[8] = {0u, 8u, 8u, 16u, 17u, 25u, 25u, 33u};
+
+  // ------------------------------------ 18u -----------------------------------------
+  static uint32_t permutexIdxTable18u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               4u, 5u, 5u, 6u, 6u, 7u, 7u, 8u};
+  static uint32_t permutexIdxTable18u_1[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               5u, 6u, 6u, 7u, 7u, 8u, 8u, 9u};
+  static uint64_t shiftTable18u_0[8] = {0u, 4u, 8u, 12u, 16u, 20u, 24u, 28u};
+  static uint64_t shiftTable18u_1[8] = {14u, 10u, 6u, 2u, 30u, 26u, 22u, 18u};
+
+  static uint8_t shuffleIdxTable18u_0[64] = {
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u,
+      3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u,
+      9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u};
+  static uint32_t shiftTable18u_2[16] = {14u, 12u, 10u, 8u, 14u, 12u, 10u, 8u,
+                                         14u, 12u, 10u, 8u, 14u, 12u, 10u, 8u};
+  static uint64_t gatherIdxTable18u[8] = {0u, 8u, 9u, 17u, 18u, 26u, 27u, 35u};
+
+  // ------------------------------------ 19u -----------------------------------------
+  static uint32_t permutexIdxTable19u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               4u, 5u, 5u, 6u, 7u, 8u, 8u, 9u};
+  static uint32_t permutexIdxTable19u_1[16] = {0u, 1u, 1u, 2u, 2u, 3u, 4u, 5u,
+                                               5u, 6u, 6u, 7u, 7u, 8u, 8u, 9u};
+  static uint64_t shiftTable19u_0[8] = {0u, 6u, 12u, 18u, 24u, 30u, 4u, 10u};
+  static uint64_t shiftTable19u_1[8] = {13u, 7u, 1u, 27u, 21u, 15u, 9u, 3u};
+
+  static uint8_t shuffleIdxTable19u_0[64] = {
+      3u,  2u, 1u, 0u, 5u, 4u, 3u,  2u, 7u, 6u, 5u, 4u, 10u, 9u, 8u, 7u, 3u,  2u, 1u, 0u, 5u, 4u,
+      3u,  2u, 8u, 7u, 6u, 5u, 10u, 9u, 8u, 7u, 3u, 2u, 1u,  0u, 5u, 4u, 3u,  2u, 7u, 6u, 5u, 4u,
+      10u, 9u, 8u, 7u, 3u, 2u, 1u,  0u, 5u, 4u, 3u, 2u, 8u,  7u, 6u, 5u, 10u, 9u, 8u, 7u};
+  static uint32_t shiftTable19u_2[16] = {13u, 10u, 7u, 12u, 9u, 6u, 11u, 8u,
+                                         13u, 10u, 7u, 12u, 9u, 6u, 11u, 8u};
+  static uint64_t gatherIdxTable19u[8] = {0u, 8u, 9u, 17u, 19u, 27u, 28u, 36u};
+
+  // ------------------------------------ 20u -----------------------------------------
+  static uint8_t shuffleIdxTable20u_0[64] = {
+      3u,  2u, 1u, 0u, 5u, 4u, 3u,  2u, 8u, 7u, 6u, 5u, 10u, 9u, 8u, 7u, 3u,  2u, 1u, 0u, 5u, 4u,
+      3u,  2u, 8u, 7u, 6u, 5u, 10u, 9u, 8u, 7u, 3u, 2u, 1u,  0u, 5u, 4u, 3u,  2u, 8u, 7u, 6u, 5u,
+      10u, 9u, 8u, 7u, 3u, 2u, 1u,  0u, 5u, 4u, 3u, 2u, 8u,  7u, 6u, 5u, 10u, 9u, 8u, 7u};
+  static uint32_t shiftTable20u[16] = {12u, 8u, 12u, 8u, 12u, 8u, 12u, 8u,
+                                       12u, 8u, 12u, 8u, 12u, 8u, 12u, 8u};
+  static uint16_t permutexIdxTable20u[32] = {0u,  1u,  2u,  3u,  4u,  0x0, 0x0, 0x0, 5u,  6u,  7u,
+                                             8u,  9u,  0x0, 0x0, 0x0, 10u, 11u, 12u, 13u, 14u, 0x0,
+                                             0x0, 0x0, 15u, 16u, 17u, 18u, 19u, 0x0, 0x0, 0x0};
+
+  // ------------------------------------ 21u -----------------------------------------
+  static uint32_t permutexIdxTable21u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               5u, 6u, 6u, 7u, 7u, 8u, 9u, 10u};
+  static uint32_t permutexIdxTable21u_1[16] = {0u, 1u, 1u, 2u, 3u, 4u, 4u, 5u,
+                                               5u, 6u, 7u, 8u, 8u, 9u, 9u, 10u};
+  static uint64_t shiftTable21u_0[8] = {0u, 10u, 20u, 30u, 8u, 18u, 28u, 6u};
+  static uint64_t shiftTable21u_1[8] = {11u, 1u, 23u, 13u, 3u, 25u, 15u, 5u};
+
+  static uint8_t shuffleIdxTable21u_0[64] = {
+      3u,  2u, 1u, 0u, 5u, 4u, 3u,  2u,  8u, 7u, 6u, 5u, 10u, 9u, 8u, 7u, 3u,  2u,  1u, 0u, 6u, 5u,
+      4u,  3u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u, 3u, 2u, 1u,  0u, 5u, 4u, 3u,  2u,  8u, 7u, 6u, 5u,
+      10u, 9u, 8u, 7u, 3u, 2u, 1u,  0u,  6u, 5u, 4u, 3u, 8u,  7u, 6u, 5u, 11u, 10u, 9u, 8u};
+  static uint32_t shiftTable21u_2[16] = {11u, 6u, 9u, 4u, 7u, 10u, 5u, 8u,
+                                         11u, 6u, 9u, 4u, 7u, 10u, 5u, 8u};
+  static uint64_t gatherIdxTable21u[8] = {0u, 8u, 10u, 18u, 21u, 29u, 31u, 39u};
+
+  // ------------------------------------ 22u -----------------------------------------
+  static uint32_t permutexIdxTable22u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 4u, 5u,
+                                               5u, 6u, 6u, 7u, 8u, 9u, 9u, 10u};
+  static uint32_t permutexIdxTable22u_1[16] = {0u, 1u, 2u, 3u, 3u, 4u, 4u,  5u,
+                                               6u, 7u, 7u, 8u, 8u, 9u, 10u, 11u};
+  static uint64_t shiftTable22u_0[8] = {0u, 12u, 24u, 4u, 16u, 28u, 8u, 20u};
+  static uint64_t shiftTable22u_1[8] = {10u, 30u, 18u, 6u, 26u, 14u, 2u, 22u};
+
+  static uint8_t shuffleIdxTable22u_0[64] = {
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u};
+  static uint32_t shiftTable22u_2[16] = {10u, 4u, 6u, 8u, 10u, 4u, 6u, 8u,
+                                         10u, 4u, 6u, 8u, 10u, 4u, 6u, 8u};
+  static uint64_t gatherIdxTable22u[8] = {0u, 8u, 11u, 19u, 22u, 30u, 33u, 41u};
+
+  // ------------------------------------ 23u -----------------------------------------
+  static uint32_t permutexIdxTable23u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 4u,  5u,
+                                               5u, 6u, 7u, 8u, 8u, 9u, 10u, 11u};
+  static uint32_t permutexIdxTable23u_1[16] = {0u, 1u, 2u, 3u, 3u, 4u,  5u,  6u,
+                                               6u, 7u, 7u, 8u, 9u, 10u, 10u, 11u};
+  static uint64_t shiftTable23u_0[8] = {0u, 14u, 28u, 10u, 24u, 6u, 20u, 2u};
+  static uint64_t shiftTable23u_1[8] = {9u, 27u, 13u, 31u, 17u, 3u, 21u, 7u};
+
+  static uint8_t shuffleIdxTable23u_0[64] = {
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u,  8u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u,  8u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u};
+  static uint32_t shiftTable23u_2[16] = {9u, 2u, 3u, 4u, 5u, 6u, 7u, 8u,
+                                         9u, 2u, 3u, 4u, 5u, 6u, 7u, 8u};
+  static uint64_t gatherIdxTable23u[8] = {0u, 8u, 11u, 19u, 23u, 31u, 34u, 42u};
+
+  // ------------------------------------ 24u -----------------------------------------
+  static uint8_t shuffleIdxTable24u_0[64] = {
+      2u, 1u, 0u, 0xFF, 5u, 4u, 3u, 0xFF, 8u, 7u, 6u, 0xFF, 11u, 10u, 9u, 0xFF,
+      2u, 1u, 0u, 0xFF, 5u, 4u, 3u, 0xFF, 8u, 7u, 6u, 0xFF, 11u, 10u, 9u, 0xFF,
+      2u, 1u, 0u, 0xFF, 5u, 4u, 3u, 0xFF, 8u, 7u, 6u, 0xFF, 11u, 10u, 9u, 0xFF,
+      2u, 1u, 0u, 0xFF, 5u, 4u, 3u, 0xFF, 8u, 7u, 6u, 0xFF, 11u, 10u, 9u, 0xFF};
+  static uint32_t permutexIdxTable24u[16] = {0u, 1u, 2u, 0x0, 3u, 4u,  5u,  0x0,
+                                             6u, 7u, 8u, 0x0, 9u, 10u, 11u, 0x0};
+
+  // ------------------------------------ 26u -----------------------------------------
+  static uint32_t permutexIdxTable26u_0[16] = {0u, 1u, 1u, 2u, 3u, 4u,  4u,  5u,
+                                               6u, 7u, 8u, 9u, 9u, 10u, 11u, 12u};
+  static uint32_t permutexIdxTable26u_1[16] = {0u, 1u, 2u, 3u, 4u,  5u,  5u,  6u,
+                                               7u, 8u, 8u, 9u, 10u, 11u, 12u, 13u};
+  static uint64_t shiftTable26u_0[8] = {0u, 20u, 8u, 28u, 16u, 4u, 24u, 12u};
+  static uint64_t shiftTable26u_1[8] = {6u, 18u, 30u, 10u, 22u, 2u, 14u, 26u};
+
+  static uint8_t shuffleIdxTable26u_0[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u};
+  static uint32_t shiftTable26u_2[16] = {6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u,
+                                         6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u};
+  static uint64_t gatherIdxTable26u[8] = {0u, 8u, 13u, 21u, 26u, 34u, 39u, 47u};
+
+  // ------------------------------------ 28u -----------------------------------------
+  static uint8_t shuffleIdxTable28u_0[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u};
+  static uint32_t shiftTable28u[16] = {4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u,
+                                       4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u};
+  static uint16_t permutexIdxTable28u[32] = {0u,  1u,  2u,  3u,  4u,  5u,  6u,  0x0, 7u,  8u,  9u,
+                                             10u, 11u, 12u, 13u, 0x0, 14u, 15u, 16u, 17u, 18u, 19u,
+                                             20u, 0x0, 21u, 22u, 23u, 24u, 25u, 26u, 27u, 0x0};
+
+  // ------------------------------------ 30u -----------------------------------------
+  static uint32_t permutexIdxTable30u_0[16] = {0u, 1u, 1u, 2u,  3u,  4u,  5u,  6u,
+                                               7u, 8u, 9u, 10u, 11u, 12u, 13u, 14u};
+  static uint32_t permutexIdxTable30u_1[16] = {0u, 1u, 2u,  3u,  4u,  5u,  6u,  7u,
+                                               8u, 9u, 10u, 11u, 12u, 13u, 14u, 15u};
+  static uint64_t shiftTable30u_0[8] = {0u, 28u, 24u, 20u, 16u, 12u, 8u, 4u};
+  static uint64_t shiftTable30u_1[8] = {2u, 6u, 10u, 14u, 18u, 22u, 26u, 30u};
+
+  static uint8_t shuffleIdxTable30u_0[64] = {
+      0u, 0u, 0u, 4u, 3u, 2u, 1u, 0u, 0u, 0u, 0u, 11u, 10u, 9u, 8u, 7u,
+      0u, 0u, 0u, 4u, 3u, 2u, 1u, 0u, 0u, 0u, 0u, 11u, 10u, 9u, 8u, 7u,
+      0u, 0u, 0u, 4u, 3u, 2u, 1u, 0u, 0u, 0u, 0u, 11u, 10u, 9u, 8u, 7u,
+      0u, 0u, 0u, 4u, 3u, 2u, 1u, 0u, 0u, 0u, 0u, 11u, 10u, 9u, 8u, 7u};
+  static uint8_t shuffleIdxTable30u_1[64] = {
+      7u, 6u, 5u, 4u, 3u, 0u, 0u, 0u, 15u, 14u, 13u, 12u, 11u, 0u, 0u, 0u,
+      7u, 6u, 5u, 4u, 3u, 0u, 0u, 0u, 15u, 14u, 13u, 12u, 11u, 0u, 0u, 0u,
+      7u, 6u, 5u, 4u, 3u, 0u, 0u, 0u, 15u, 14u, 13u, 12u, 11u, 0u, 0u, 0u,
+      7u, 6u, 5u, 4u, 3u, 0u, 0u, 0u, 15u, 14u, 13u, 12u, 11u, 0u, 0u, 0u};
+  static uint64_t shiftTable30u_2[8] = {34u, 30u, 34u, 30u, 34u, 30u, 34u, 30u};
+  static uint64_t shiftTable30u_3[8] = {28u, 24u, 28u, 24u, 28u, 24u, 28u, 24u};
+  static uint64_t gatherIdxTable30u[8] = {0u, 8u, 15u, 23u, 30u, 38u, 45u, 53u};
+
+  static uint64_t nibbleReverseTable[8] = {
+      0x0E060A020C040800, 0x0F070B030D050901, 0x0E060A020C040800, 0x0F070B030D050901,
+      0x0E060A020C040800, 0x0F070B030D050901, 0x0E060A020C040800, 0x0F070B030D050901};
+
+  static uint64_t reverseMaskTable1u[8] = {
+      0x0001020304050607, 0x08090A0B0C0D0E0F, 0x1011121314151617, 0x18191A1B1C1D1E1F,
+      0x2021222324252627, 0x28292A2B2C2D2E2F, 0x3031323334353637, 0x38393A3B3C3D3E3F};
+
+  static uint64_t reverseMaskTable16u[8] = {
+      0x0607040502030001, 0x0E0F0C0D0A0B0809, 0x1617141512131011, 0x1E1F1C1D1A1B1819,
+      0x2627242522232021, 0x2E2F2C2D2A2B2829, 0x3637343532333031, 0x3E3F3C3D3A3B3839};
+
+  static uint64_t reverseMaskTable32u[8] = {
+      0x0405060700010203, 0x0C0D0E0F08090A0B, 0x1415161710111213, 0x1C1D1E1F18191A1B,
+      0x2425262720212223, 0x2C2D2E2F28292A2B, 0x3435363730313233, 0x3C3D3E3F38393A3B};
+
+  inline uint32_t getAlign(uint32_t startBit, uint32_t base, uint32_t bitSize) {
+    uint32_t remnant = bitSize - startBit;
+    uint32_t retValue = 0xFFFFFFFF;
+    for (uint32_t i = 0u; i < bitSize; ++i) {
+      uint32_t testValue = (i * base) % bitSize;
+      if (testValue == remnant) {
+        retValue = i;
+        break;
+      }
+    }
+    return retValue;
+  }
+
+  inline uint64_t moveLen(uint64_t x, uint64_t y) {

Review Comment:
   1. rename function moveLen to moveByteLen
   2. delete the 2nd parameter
   3. rename the 1st parameter
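
   A minimal sketch of what the renamed helper could look like once the second parameter is dropped. The body of `moveLen` is not shown in this hunk, so the rounding behaviour below is an assumption inferred from the call sites (which pass a bit count and add the result to `bufMoveByteLen`). `kBitsPerByte` and the parameter name `numBits` are hypothetical and stand in for `ORC_VECTOR_BYTE_WIDTH` and whatever name the author picks.

   ```cpp
   #include <cstdint>

   // Hypothetical stand-in for the ORC_VECTOR_BYTE_WIDTH constant used at the call sites.
   constexpr uint64_t kBitsPerByte = 8;

   // Assumed behaviour: the number of whole bytes needed to hold numBits bits,
   // rounding up when numBits is not a multiple of 8.
   inline uint64_t moveByteLen(uint64_t numBits) {
     uint64_t byteLen = numBits / kBitsPerByte;
     if (numBits % kBitsPerByte != 0) {
       ++byteLen;
     }
     return byteLen;
   }
   ```

   With this shape, a call such as `moveLen(len * bitWidth, ORC_VECTOR_BYTE_WIDTH)` would become `moveByteLen(len * bitWidth)`.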



##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,2724 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"
+#include "CpuInfoUtil.hh"
+#include "RLEv2.hh"
+
+namespace orc {
+  UnpackAvx512::UnpackAvx512(RleDecoderV2* dec) : decoder(dec), unpackDefault(UnpackDefault(dec)) {
+    // PASS
+  }
+
+  UnpackAvx512::~UnpackAvx512() {
+    // PASS
+  }
+
+  inline void UnpackAvx512::alignHeaderBoundary(uint64_t& startBit, uint64_t& bufMoveByteLen,
+                                                uint64_t& bufRestByteLen, uint64_t& len,
+                                                uint32_t& bitWidth, uint64_t& tailBitLen,
+                                                uint32_t& backupByteLen, uint64_t& numElements,
+                                                bool& resetBuf, const uint8_t*& srcPtr,
+                                                int64_t*& dstPtr, uint32_t bitMaxSize) {
+    if (startBit != 0) {
+      bufMoveByteLen +=
+          moveLen(len * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH, ORC_VECTOR_BYTE_WIDTH);
+    } else {
+      bufMoveByteLen += moveLen(len * bitWidth, ORC_VECTOR_BYTE_WIDTH);
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      numElements = len;
+      resetBuf = false;
+      len -= numElements;

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1138652526


##########
c++/src/CpuInfoUtil.cc:
##########
@@ -0,0 +1,581 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "CpuInfoUtil.hh"
+
+#ifdef __APPLE__
+#include <sys/sysctl.h>
+#endif
+
+#ifndef _MSC_VER
+#include <unistd.h>
+#endif
+
+#ifdef _WIN32
+#define NOMINMAX
+#include <Windows.h>
+#include <intrin.h>
+#endif
+
+#include <algorithm>
+#include <array>
+#include <bitset>
+#include <cctype>
+#include <cerrno>
+#include <cstdint>
+#include <fstream>
+#include <memory>
+#include <optional>
+#include <sstream>
+#include <string>
+#include <thread>
+#include <vector>

Review Comment:
   OK, thank you very much for the reminder.
   I have removed the following includes:
   #include <cctype>
   #include <cerrno>
   #include <memory>
   #include <sstream>



##########
c++/src/CpuInfoUtil.cc:
##########
@@ -0,0 +1,581 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "CpuInfoUtil.hh"
+
+#ifdef __APPLE__
+#include <sys/sysctl.h>
+#endif
+
+#ifndef _MSC_VER
+#include <unistd.h>
+#endif
+
+#ifdef _WIN32
+#define NOMINMAX
+#include <Windows.h>
+#include <intrin.h>
+#endif
+
+#include <algorithm>
+#include <array>
+#include <bitset>
+#include <cctype>
+#include <cerrno>
+#include <cstdint>
+#include <fstream>
+#include <memory>
+#include <optional>
+#include <sstream>
+#include <string>
+#include <thread>
+#include <vector>
+
+#include "orc/Exceptions.hh"
+
+#undef CPUINFO_ARCH_X86
+#undef CPUINFO_ARCH_ARM
+#undef CPUINFO_ARCH_PPC
+
+#if defined(__i386) || defined(_M_IX86) || defined(__x86_64__) || defined(_M_X64)
+#define CPUINFO_ARCH_X86
+#elif defined(_M_ARM64) || defined(__aarch64__) || defined(__arm64__)
+#define CPUINFO_ARCH_ARM
+#elif defined(__PPC64__) || defined(__PPC64LE__) || defined(__ppc64__) || defined(__powerpc64__)
+#define CPUINFO_ARCH_PPC
+#endif
+
+#ifndef ORC_HAVE_RUNTIME_AVX512
+#define UNUSED(x) (void)(x)
+#endif
+
+namespace orc {
+
+  namespace {
+
+    constexpr int kCacheLevels = static_cast<int>(CpuInfo::CacheLevel::Last) + 1;
+
+    //============================== OS Dependent ==============================//
+
+#if defined(_WIN32)
+    //------------------------------ WINDOWS ------------------------------//
+    void OsRetrieveCacheSize(std::array<int64_t, kCacheLevels>* cache_sizes) {
+      PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer = nullptr;
+      PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer_position = nullptr;
+      DWORD buffer_size = 0;
+      size_t offset = 0;
+      typedef BOOL(WINAPI * GetLogicalProcessorInformationFuncPointer)(void*, void*);
+      GetLogicalProcessorInformationFuncPointer func_pointer =
+          (GetLogicalProcessorInformationFuncPointer)GetProcAddress(
+              GetModuleHandle("kernel32"), "GetLogicalProcessorInformation");
+
+      if (!func_pointer) {
+        throw ParseError("Failed to find procedure GetLogicalProcessorInformation");
+      }
+
+      // Get buffer size
+      if (func_pointer(buffer, &buffer_size) && GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
+        throw ParseError("Failed to get size of processor information buffer");
+      }
+
+      buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)malloc(buffer_size);
+      if (!buffer) {
+        return;
+      }
+
+      if (!func_pointer(buffer, &buffer_size)) {
+        free(buffer);
+        throw ParseError("Failed to get processor information");
+      }
+
+      buffer_position = buffer;
+      while (offset + sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION) <= buffer_size) {
+        if (RelationCache == buffer_position->Relationship) {
+          PCACHE_DESCRIPTOR cache = &buffer_position->Cache;
+          if (cache->Level >= 1 && cache->Level <= kCacheLevels) {
+            const int64_t current = (*cache_sizes)[cache->Level - 1];
+            (*cache_sizes)[cache->Level - 1] = std::max<int64_t>(current, cache->Size);
+          }
+        }
+        offset += sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
+        buffer_position++;
+      }
+
+      free(buffer);
+    }
+
+#if defined(CPUINFO_ARCH_X86)
+    // On x86, get CPU features by cpuid, https://en.wikipedia.org/wiki/CPUID
+
+#if defined(__MINGW64_VERSION_MAJOR) && __MINGW64_VERSION_MAJOR < 5
+    void __cpuidex(int CPUInfo[4], int function_id, int subfunction_id) {
+      __asm__ __volatile__("cpuid"
+                           : "=a"(CPUInfo[0]), "=b"(CPUInfo[1]), "=c"(CPUInfo[2]), "=d"(CPUInfo[3])
+                           : "a"(function_id), "c"(subfunction_id));
+    }
+
+    int64_t _xgetbv(int xcr) {
+      int out = 0;
+      __asm__ __volatile__("xgetbv" : "=a"(out) : "c"(xcr) : "%edx");
+      return out;
+    }
+#endif  // MINGW
+
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      int register_EAX_id = 1;
+      int highest_valid_id = 0;
+      int highest_extended_valid_id = 0;
+      std::bitset<32> features_ECX;
+      std::array<int, 4> cpu_info;
+
+      // Get highest valid id
+      __cpuid(cpu_info.data(), 0);
+      highest_valid_id = cpu_info[0];
+      // HEX of "GenuineIntel": 47656E75 696E6549 6E74656C
+      // HEX of "AuthenticAMD": 41757468 656E7469 63414D44
+      if (cpu_info[1] == 0x756e6547 && cpu_info[3] == 0x49656e69 && cpu_info[2] == 0x6c65746e) {
+        *vendor = CpuInfo::Vendor::Intel;
+      } else if (cpu_info[1] == 0x68747541 && cpu_info[3] == 0x69746e65 &&
+                 cpu_info[2] == 0x444d4163) {
+        *vendor = CpuInfo::Vendor::AMD;
+      }
+
+      if (highest_valid_id <= register_EAX_id) {
+        return;
+      }
+
+      // EAX=1: Processor Info and Feature Bits
+      __cpuidex(cpu_info.data(), register_EAX_id, 0);
+      features_ECX = cpu_info[2];
+
+      // Get highest extended id
+      __cpuid(cpu_info.data(), 0x80000000);
+      highest_extended_valid_id = cpu_info[0];
+
+      // Retrieve CPU model name
+      if (highest_extended_valid_id >= static_cast<int>(0x80000004)) {
+        model_name->clear();
+        for (int i = 0x80000002; i <= static_cast<int>(0x80000004); ++i) {
+          __cpuidex(cpu_info.data(), i, 0);
+          *model_name += std::string(reinterpret_cast<char*>(cpu_info.data()), sizeof(cpu_info));
+        }
+      }
+
+      bool zmm_enabled = false;
+      if (features_ECX[27]) {  // OSXSAVE
+        // Query if the OS supports saving ZMM registers when switching contexts
+        int64_t xcr0 = _xgetbv(0);
+        zmm_enabled = (xcr0 & 0xE0) == 0xE0;
+      }
+
+      if (features_ECX[9]) *hardware_flags |= CpuInfo::SSSE3;
+      if (features_ECX[19]) *hardware_flags |= CpuInfo::SSE4_1;
+      if (features_ECX[20]) *hardware_flags |= CpuInfo::SSE4_2;
+      if (features_ECX[23]) *hardware_flags |= CpuInfo::POPCNT;
+      if (features_ECX[28]) *hardware_flags |= CpuInfo::AVX;
+
+      // cpuid with EAX=7, ECX=0: Extended Features
+      register_EAX_id = 7;
+      if (highest_valid_id > register_EAX_id) {
+        __cpuidex(cpu_info.data(), register_EAX_id, 0);
+        std::bitset<32> features_EBX = cpu_info[1];
+
+        if (features_EBX[3]) *hardware_flags |= CpuInfo::BMI1;
+        if (features_EBX[5]) *hardware_flags |= CpuInfo::AVX2;
+        if (features_EBX[8]) *hardware_flags |= CpuInfo::BMI2;
+        if (zmm_enabled) {
+          if (features_EBX[16]) *hardware_flags |= CpuInfo::AVX512F;
+          if (features_EBX[17]) *hardware_flags |= CpuInfo::AVX512DQ;
+          if (features_EBX[28]) *hardware_flags |= CpuInfo::AVX512CD;
+          if (features_EBX[30]) *hardware_flags |= CpuInfo::AVX512BW;
+          if (features_EBX[31]) *hardware_flags |= CpuInfo::AVX512VL;
+        }
+      }
+    }
+#elif defined(CPUINFO_ARCH_ARM)
+    // Windows on Arm
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      *hardware_flags |= CpuInfo::ASIMD;
+      // TODO: vendor, model_name
+    }
+#endif
+
+#elif defined(__APPLE__)
+    //------------------------------ MACOS ------------------------------//
+    std::optional<int64_t> IntegerSysCtlByName(const char* name) {
+      size_t len = sizeof(int64_t);
+      int64_t data = 0;
+      if (sysctlbyname(name, &data, &len, nullptr, 0) == 0) {
+        return data;
+      }
+      // ENOENT is the official errno value for non-existing sysctl's,
+      // but EINVAL and ENOTSUP have been seen in the wild.
+      if (errno != ENOENT && errno != EINVAL && errno != ENOTSUP) {
+        std::ostringstream ss;
+        ss << "sysctlbyname failed for '" << name << "'";
+        throw ParseError(ss.str());
+      }
+      return std::nullopt;
+    }
+
+    void OsRetrieveCacheSize(std::array<int64_t, kCacheLevels>* cache_sizes) {
+      static_assert(kCacheLevels >= 3, "");
+      auto c = IntegerSysCtlByName("hw.l1dcachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[0] = *c;
+      }
+      c = IntegerSysCtlByName("hw.l2cachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[1] = *c;
+      }
+      c = IntegerSysCtlByName("hw.l3cachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[2] = *c;
+      }
+    }
+
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      // hardware_flags
+      struct SysCtlCpuFeature {
+        const char* name;
+        int64_t flag;
+      };
+      std::vector<SysCtlCpuFeature> features = {
+#if defined(CPUINFO_ARCH_X86)
+        {"hw.optional.sse4_2",
+         CpuInfo::SSSE3 | CpuInfo::SSE4_1 | CpuInfo::SSE4_2 | CpuInfo::POPCNT},
+        {"hw.optional.avx1_0", CpuInfo::AVX},
+        {"hw.optional.avx2_0", CpuInfo::AVX2},
+        {"hw.optional.bmi1", CpuInfo::BMI1},
+        {"hw.optional.bmi2", CpuInfo::BMI2},
+        {"hw.optional.avx512f", CpuInfo::AVX512F},
+        {"hw.optional.avx512cd", CpuInfo::AVX512CD},
+        {"hw.optional.avx512dq", CpuInfo::AVX512DQ},
+        {"hw.optional.avx512bw", CpuInfo::AVX512BW},
+        {"hw.optional.avx512vl", CpuInfo::AVX512VL},
+#elif defined(CPUINFO_ARCH_ARM)
+        // ARM64 (note that this is exposed under Rosetta as well)
+        {"hw.optional.neon", CpuInfo::ASIMD},
+#endif
+      };
+      for (const auto& feature : features) {
+        auto v = IntegerSysCtlByName(feature.name);
+        if (v.value_or(0)) {
+          *hardware_flags |= feature.flag;
+        }
+      }
+
+      // TODO: vendor, model_name

Review Comment:
   Done.




[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1472980645

   Hi @wgtmac, I have changed the code to address all of your comments above. However, the CI check reports a license error for cmake_modules/ConfigSimdLevel.cmake. What does a valid license header need to look like?
   https://github.com/wpleonardo/orc/blob/440d6d159e356b6d2c863ef4bac9dde9a7977e99/cmake_modules/ConfigSimdLevel.cmake#L1
   
   ERROR the following files don't have a valid license header: 
   cmake_modules/ConfigSimdLevel.cmake 
   ERROR one or more files does not have a valid license header 
   Error: Process completed with exit code 1.



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1133924093


##########
cmake_modules/ConfigSimdLevel.cmake:
##########
@@ -0,0 +1,105 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+INCLUDE(CheckCXXCompilerFlag)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_SIMD_LEVEL)
+  set(ORC_SIMD_LEVEL
+      "DEFAULT"
+      CACHE STRING "Compile time SIMD optimization level")
+endif()
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(STATUS "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+  else()
+    # "arch=native" selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine.
+    # Using -march=native enables all instruction subsets supported by the local machine.
+    # Using -mtune=native produces code optimized for the local machine under the constraints of the selected instruction set.
+    set(ORC_AVX512_FLAG "-march=native -mtune=native")
+  endif()
+
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")
+    CHECK_CXX_SOURCE_COMPILES("
+      #ifdef _MSC_VER
+      #include <intrin.h>
+      #else
+      #include <immintrin.h>
+      #endif
+
+      int main() {
+        __m512i mask = _mm512_set1_epi32(0x1);
+        char out[32];
+        _mm512_storeu_si512(out, mask);
+        return 0;
+      }"
+      CXX_SUPPORTS_AVX512)
+    set(CMAKE_REQUIRED_FLAGS ${OLD_CMAKE_REQURED_FLAGS})
+  endif()
+
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512 AND NOT MSVC)
+    execute_process(COMMAND grep flags /proc/cpuinfo
+                    COMMAND head -1
+                    OUTPUT_VARIABLE flags_ver)
+    message(STATUS "CPU ${flags_ver}")
+  endif()
+
+  message(STATUS "BUILD_ENABLE_AVX512: ${BUILD_ENABLE_AVX512}")
+  # Runtime SIMD level it can get from compiler
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512)
+    message(STATUS "Enable the AVX512 vector decode of bit-packing, compiler support AVX512")
+    set(ORC_HAVE_RUNTIME_AVX512 ON)
+    set(ORC_SIMD_LEVEL "AVX512")
+    add_definitions(-DORC_HAVE_RUNTIME_AVX512)
+  elseif(BUILD_ENABLE_AVX512 AND NOT CXX_SUPPORTS_AVX512)
+    message(FATAL_ERROR "AVX512 required but compiler doesn't support it, failed to enable AVX512.")
+    set(ORC_HAVE_RUNTIME_AVX512 OFF)
+  elseif(NOT BUILD_ENABLE_AVX512)
+    set(ORC_HAVE_RUNTIME_AVX512 OFF)
+    message(STATUS "Disable the AVX512 vector decode of bit-packing")
+  endif()
+  if(ORC_SIMD_LEVEL STREQUAL "DEFAULT")
+    set(ORC_SIMD_LEVEL "NONE")
+  endif()
+
+  # Enable additional instruction sets if they are supported
+  if(MINGW)
+    # Enable _xgetbv() intrinsic to query OS support for ZMM register saves
+    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -mxsave")
+  endif()
+  if(ORC_SIMD_LEVEL STREQUAL "AVX512")
+    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${ORC_AVX512_FLAG}")
+  elseif(NOT ORC_SIMD_LEVEL STREQUAL "NONE")
+    message(WARNING "ORC_SIMD_LEVEL=${ORC_SIMD_LEVEL} not supported by x86.")

Review Comment:
   Removed.




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1169566643


##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,2694 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"
+#include "CpuInfoUtil.hh"
+#include "RLEv2.hh"
+
+namespace orc {
+  UnpackAvx512::UnpackAvx512(RleDecoderV2* dec) : decoder(dec), unpackDefault(UnpackDefault(dec)) {
+    // PASS
+  }
+
+  UnpackAvx512::~UnpackAvx512() {
+    // PASS
+  }
+
+  inline void UnpackAvx512::alignHeaderBoundary(const uint32_t bitWidth, const uint32_t bitMaxSize,
+                                                uint64_t& startBit, uint64_t& bufMoveByteLen,
+                                                uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint64_t& tailBitLen, uint32_t& backupByteLen,
+                                                uint64_t& numElements, bool& resetBuf,
+                                                const uint8_t*& srcPtr, int64_t*& dstPtr) {
+    if (startBit != 0) {
+      bufMoveByteLen +=
+          moveByteLen(remainingNumElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+    } else {
+      bufMoveByteLen += moveByteLen(remainingNumElements * bitWidth);
+    }

Review Comment:
   Fixed. I feel it is more natural to calculate the moving offset first and then calculate the result.




[GitHub] [orc] wgtmac commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1090349354


##########
c++/src/RleDecoderV2.cc:
##########
@@ -67,6 +91,147 @@ namespace orc {
   }
 
   void RleDecoderV2::readLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs) {
+    uint64_t startBit = 0;
+#if ENABLE_AVX512

Review Comment:
   I mean we can use an environment variable like `std::getenv("ENABLE_RUNTIME_AVX512")` as a toggle.




[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1407952803

   > @wpleonardo I'd suggest apply `clang-format -i source_file` to all files that you have changed or added to make the format check happy. You can also set up your IDEs to do it automatically. AFAIK, VSCode or CLion support it.
   > 
   > For the failure on a specific platform, we can probably disable it in the cmake config first.
   
   Thank you very much for your suggestions. I have just used clang-format to fix the code style of all the files in my PR. The other suggestions are in progress now.



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1091648119


##########
c++/src/RleDecoderV2.cc:
##########
@@ -67,6 +91,147 @@ namespace orc {
   }
 
   void RleDecoderV2::readLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs) {
+    uint64_t startBit = 0;
+#if ENABLE_AVX512

Review Comment:
   Added an environment variable, "ENABLE_RUNTIME_AVX512", to turn the AVX512 feature on or off at runtime.
   If ENABLE_RUNTIME_AVX512 is set to on or ON, AVX512 is enabled at runtime; with any other value, the feature is disabled at runtime.
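
   A minimal sketch of such a runtime toggle, assuming that only the values "on" and "ON" enable the feature, as described above. The helper names `isOnValue` and `runtimeAvx512Enabled` are hypothetical and are not taken from the PR:

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: true only for the exact values "on" or "ON".
// A null pointer (variable not set) and any other value count as "off".
inline bool isOnValue(const char* value) {
  if (value == nullptr) {
    return false;
  }
  const std::string v(value);
  return v == "on" || v == "ON";
}

// Hypothetical wrapper: reads the environment variable named in this thread.
// std::getenv returns nullptr when the variable is not set.
inline bool runtimeAvx512Enabled() {
  return isOnValue(std::getenv("ENABLE_RUNTIME_AVX512"));
}
```

   In this sketch a caller would combine `runtimeAvx512Enabled()` with the CPU feature detection, so that the vectorized path is taken only when both the hardware and the user allow it.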




[GitHub] [orc] coderex2522 commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "coderex2522 (via GitHub)" <gi...@apache.org>.

coderex2522 commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1105358196


##########
c++/src/CpuInfoUtil.hh:
##########
@@ -0,0 +1,110 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_CPUINFOUTIL_HH
+#define ORC_CPUINFOUTIL_HH
+
+#include <cstdint>
+#include <memory>
+#include <string>
+
+namespace orc {
+
+  /// CpuInfo is an interface to query for cpu information at runtime.  The caller can
+  /// ask for the sizes of the caches and what hardware features are supported.
+  /// On Linux, this information is pulled from a couple of sys files (/proc/cpuinfo and
+  /// /sys/devices)
+  class CpuInfo {
+   public:
+    ~CpuInfo();
+
+    /// x86 features
+    static constexpr int64_t SSSE3 = (1LL << 0);
+    static constexpr int64_t SSE4_1 = (1LL << 1);
+    static constexpr int64_t SSE4_2 = (1LL << 2);
+    static constexpr int64_t POPCNT = (1LL << 3);
+    static constexpr int64_t AVX = (1LL << 4);
+    static constexpr int64_t AVX2 = (1LL << 5);
+    static constexpr int64_t AVX512F = (1LL << 6);
+    static constexpr int64_t AVX512CD = (1LL << 7);
+    static constexpr int64_t AVX512VL = (1LL << 8);
+    static constexpr int64_t AVX512DQ = (1LL << 9);
+    static constexpr int64_t AVX512BW = (1LL << 10);
+    static constexpr int64_t AVX512 = AVX512F | AVX512CD | AVX512VL | AVX512DQ | AVX512BW;
+    static constexpr int64_t BMI1 = (1LL << 11);
+    static constexpr int64_t BMI2 = (1LL << 12);
+
+    /// Arm features
+    static constexpr int64_t ASIMD = (1LL << 32);
+
+    /// Cache enums for L1 (data), L2 and L3
+    enum class CacheLevel { L1 = 0, L2, L3, Last = L3 };
+
+    /// CPU vendors
+    enum class Vendor { Unknown, Intel, AMD };
+
+    static const CpuInfo* GetInstance();
+
+    /// Returns all the flags for this cpu
+    int64_t hardwareFlags() const;
+
+    /// Returns the number of cores (including hyper-threaded) on this machine.
+    int numCores() const;
+
+    /// Returns the vendor of the cpu.
+    Vendor vendor() const;
+
+    /// Returns the model name of the cpu (e.g. Intel i7-2600)
+    const std::string& modelName() const;
+
+    /// Returns the size of the cache in KB at this cache level
+    int64_t CacheSize(CacheLevel level) const;
+
+    /// \brief Returns whether or not the given feature is enabled.
+    ///
+    /// IsSupported() is true if IsDetected() is also true and the feature
+    /// wasn't disabled by the user (for example by setting the ORC_USER_SIMD_LEVEL
+    /// environment variable).
+    bool IsSupported(int64_t flags) const;
+
+    /// Returns whether or not the given feature is available on the CPU.
+    bool IsDetected(int64_t flags) const;
+
+    /// Determine if the CPU meets the minimum CPU requirements and if not, issue an error
+    /// and terminate.
+    void VerifyCpuRequirements() const;
+
+    /// Toggle a hardware feature on and off.  It is not valid to turn on a feature
+    /// that the underlying hardware cannot support. This is useful for testing.
+    // void EnableFeature(int64_t flag, bool enable);

Review Comment:
   Will the EnableFeature function be used? If not, please remove it.
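
   For reference, the toggle described by the commented-out declaration could be sketched as follows. This is an illustration under assumptions, not the implementation from the PR or from Arrow: the `FeatureFlags` struct, its members and the `kAvx512*` constants are hypothetical stand-ins for `CpuInfo`, and a request to enable a feature that was not detected is simply ignored here.

```cpp
#include <cstdint>

// Bit positions mirror the constants declared in the header above.
constexpr int64_t kAvx512F = 1LL << 6;
constexpr int64_t kAvx512BW = 1LL << 10;

// Hypothetical stand-in for CpuInfo, for illustration only.
struct FeatureFlags {
  int64_t detected;  // features reported by the hardware
  int64_t enabled;   // detected features that have not been switched off

  // True only if every requested bit was detected on the CPU.
  bool isDetected(int64_t flags) const {
    return (detected & flags) == flags;
  }

  // True only if every requested bit is both detected and still enabled.
  bool isSupported(int64_t flags) const {
    return (detected & enabled & flags) == flags;
  }

  // Turning a feature on never exceeds what the hardware detected;
  // turning it off clears the bit unconditionally.
  void enableFeature(int64_t flag, bool enable) {
    if (enable) {
      enabled |= (detected & flag);
    } else {
      enabled &= ~flag;
    }
  }
};
```

   Keeping the detected and enabled sets separate is what lets a test switch a feature off and back on again without losing the original hardware information.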




[GitHub] [orc] wgtmac commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1451931822

   No worries. @wpleonardo 



[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by GitBox <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1383140489

   May I ask a question about the clang-format error for the file TestRleVectorDecoder.cc?
   I have already used clang-format -style=google to format TestRleVectorDecoder.cc, but I still get clang-format errors in CI. Do we use -style=google with clang-format, or another style?
   Thank you very much!



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1141576276


##########
c++/src/BpackingDefault.hh:
##########
@@ -0,0 +1,61 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKINGDEFAULT_HH
+#define ORC_BPACKINGDEFAULT_HH
+
+#include <stdlib.h>
+#include <cstdint>
+
+#include "Bpacking.hh"
+#include "RLEv2.hh"
+#include "io/InputStream.hh"
+#include "io/OutputStream.hh"

Review Comment:
   Done




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1148199567


##########
c++/src/CpuInfoUtil.cc:
##########
@@ -0,0 +1,545 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * @file CpuInfoUtil.cc is from Apache Arrow as of 2023-03-21
+ */
+
+#include "CpuInfoUtil.hh"
+
+#ifdef __APPLE__
+#include <sys/sysctl.h>
+#endif
+
+#ifndef _MSC_VER
+#include <unistd.h>
+#endif
+
+#ifdef _WIN32
+#define NOMINMAX
+#include <Windows.h>
+#include <intrin.h>
+#endif
+
+#include <algorithm>
+#include <array>
+#include <bitset>
+#include <cstdint>
+#include <fstream>
+#include <optional>
+#include <sstream>
+#include <string>
+#include <thread>
+#include <vector>
+
+#include "orc/Exceptions.hh"
+
+#undef CPUINFO_ARCH_X86
+
+#if defined(__i386) || defined(_M_IX86) || defined(__x86_64__) || defined(_M_X64)
+#define CPUINFO_ARCH_X86
+#endif
+
+#ifndef ORC_HAVE_RUNTIME_AVX512
+#define UNUSED(x) (void)(x)
+#endif
+
+namespace orc {
+
+  namespace {
+
+    constexpr int kCacheLevels = static_cast<int>(CpuInfo::CacheLevel::Last) + 1;
+
+    //============================== OS Dependent ==============================//
+
+#if defined(_WIN32)
+    //------------------------------ WINDOWS ------------------------------//
+    void OsRetrieveCacheSize(std::array<int64_t, kCacheLevels>* cache_sizes) {
+      PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer = nullptr;
+      PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer_position = nullptr;
+      DWORD buffer_size = 0;
+      size_t offset = 0;
+      typedef BOOL(WINAPI * GetLogicalProcessorInformationFuncPointer)(void*, void*);
+      GetLogicalProcessorInformationFuncPointer func_pointer =
+          (GetLogicalProcessorInformationFuncPointer)GetProcAddress(
+              GetModuleHandle("kernel32"), "GetLogicalProcessorInformation");
+
+      if (!func_pointer) {
+        throw ParseError("Failed to find procedure GetLogicalProcessorInformation");
+      }
+
+      // Get buffer size
+      if (func_pointer(buffer, &buffer_size) && GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
+        throw ParseError("Failed to get size of processor information buffer");
+      }
+
+      buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)malloc(buffer_size);
+      if (!buffer) {
+        return;
+      }
+
+      if (!func_pointer(buffer, &buffer_size)) {
+        free(buffer);
+        throw ParseError("Failed to get processor information");
+      }
+
+      buffer_position = buffer;
+      while (offset + sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION) <= buffer_size) {
+        if (RelationCache == buffer_position->Relationship) {
+          PCACHE_DESCRIPTOR cache = &buffer_position->Cache;
+          if (cache->Level >= 1 && cache->Level <= kCacheLevels) {
+            const int64_t current = (*cache_sizes)[cache->Level - 1];
+            (*cache_sizes)[cache->Level - 1] = std::max<int64_t>(current, cache->Size);
+          }
+        }
+        offset += sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
+        buffer_position++;
+      }
+
+      free(buffer);
+    }
+
+#if defined(CPUINFO_ARCH_X86)
+    // On x86, get CPU features by cpuid, https://en.wikipedia.org/wiki/CPUID
+
+#if defined(__MINGW64_VERSION_MAJOR) && __MINGW64_VERSION_MAJOR < 5
+    void __cpuidex(int CPUInfo[4], int function_id, int subfunction_id) {
+      __asm__ __volatile__("cpuid"
+                           : "=a"(CPUInfo[0]), "=b"(CPUInfo[1]), "=c"(CPUInfo[2]), "=d"(CPUInfo[3])
+                           : "a"(function_id), "c"(subfunction_id));
+    }
+
+    int64_t _xgetbv(int xcr) {
+      int out = 0;
+      __asm__ __volatile__("xgetbv" : "=a"(out) : "c"(xcr) : "%edx");
+      return out;
+    }
+#endif  // MINGW
+
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      int register_EAX_id = 1;
+      int highest_valid_id = 0;
+      int highest_extended_valid_id = 0;
+      std::bitset<32> features_ECX;
+      std::array<int, 4> cpu_info;
+
+      // Get highest valid id
+      __cpuid(cpu_info.data(), 0);
+      highest_valid_id = cpu_info[0];
+      // HEX of "GenuineIntel": 47656E75 696E6549 6E74656C
+      // HEX of "AuthenticAMD": 41757468 656E7469 63414D44
+      if (cpu_info[1] == 0x756e6547 && cpu_info[3] == 0x49656e69 && cpu_info[2] == 0x6c65746e) {
+        *vendor = CpuInfo::Vendor::Intel;
+      } else if (cpu_info[1] == 0x68747541 && cpu_info[3] == 0x69746e65 &&
+                 cpu_info[2] == 0x444d4163) {
+        *vendor = CpuInfo::Vendor::AMD;
+      }
+
+      if (highest_valid_id <= register_EAX_id) {
+        return;
+      }
+
+      // EAX=1: Processor Info and Feature Bits
+      __cpuidex(cpu_info.data(), register_EAX_id, 0);
+      features_ECX = cpu_info[2];
+
+      // Get highest extended id
+      __cpuid(cpu_info.data(), 0x80000000);
+      highest_extended_valid_id = cpu_info[0];
+
+      // Retrieve CPU model name
+      if (highest_extended_valid_id >= static_cast<int>(0x80000004)) {
+        model_name->clear();
+        for (int i = 0x80000002; i <= static_cast<int>(0x80000004); ++i) {
+          __cpuidex(cpu_info.data(), i, 0);
+          *model_name += std::string(reinterpret_cast<char*>(cpu_info.data()), sizeof(cpu_info));
+        }
+      }
+
+      bool zmm_enabled = false;
+      if (features_ECX[27]) {  // OSXSAVE
+        // Query if the OS supports saving ZMM registers when switching contexts
+        int64_t xcr0 = _xgetbv(0);
+        zmm_enabled = (xcr0 & 0xE0) == 0xE0;
+      }
+
+      if (features_ECX[9]) *hardware_flags |= CpuInfo::SSSE3;
+      if (features_ECX[19]) *hardware_flags |= CpuInfo::SSE4_1;
+      if (features_ECX[20]) *hardware_flags |= CpuInfo::SSE4_2;
+      if (features_ECX[23]) *hardware_flags |= CpuInfo::POPCNT;
+      if (features_ECX[28]) *hardware_flags |= CpuInfo::AVX;
+
+      // cpuid with EAX=7, ECX=0: Extended Features
+      register_EAX_id = 7;
+      if (highest_valid_id > register_EAX_id) {
+        __cpuidex(cpu_info.data(), register_EAX_id, 0);
+        std::bitset<32> features_EBX = cpu_info[1];
+
+        if (features_EBX[3]) *hardware_flags |= CpuInfo::BMI1;
+        if (features_EBX[5]) *hardware_flags |= CpuInfo::AVX2;
+        if (features_EBX[8]) *hardware_flags |= CpuInfo::BMI2;
+        if (zmm_enabled) {
+          if (features_EBX[16]) *hardware_flags |= CpuInfo::AVX512F;
+          if (features_EBX[17]) *hardware_flags |= CpuInfo::AVX512DQ;
+          if (features_EBX[28]) *hardware_flags |= CpuInfo::AVX512CD;
+          if (features_EBX[30]) *hardware_flags |= CpuInfo::AVX512BW;
+          if (features_EBX[31]) *hardware_flags |= CpuInfo::AVX512VL;
+        }
+      }
+    }
+#endif
+
+#elif defined(__APPLE__)
+    //------------------------------ MACOS ------------------------------//
+    std::optional<int64_t> IntegerSysCtlByName(const char* name) {
+      size_t len = sizeof(int64_t);
+      int64_t data = 0;
+      if (sysctlbyname(name, &data, &len, nullptr, 0) == 0) {
+        return data;
+      }
+      // ENOENT is the official errno value for non-existing sysctl's,
+      // but EINVAL and ENOTSUP have been seen in the wild.
+      if (errno != ENOENT && errno != EINVAL && errno != ENOTSUP) {
+        std::ostringstream ss;
+        ss << "sysctlbyname failed for '" << name << "'";
+        throw ParseError(ss.str());
+      }
+      return std::nullopt;
+    }
+
+    void OsRetrieveCacheSize(std::array<int64_t, kCacheLevels>* cache_sizes) {
+      static_assert(kCacheLevels >= 3, "");
+      auto c = IntegerSysCtlByName("hw.l1dcachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[0] = *c;
+      }
+      c = IntegerSysCtlByName("hw.l2cachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[1] = *c;
+      }
+      c = IntegerSysCtlByName("hw.l3cachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[2] = *c;
+      }
+    }
+
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      // hardware_flags
+      struct SysCtlCpuFeature {
+        const char* name;
+        int64_t flag;
+      };
+      std::vector<SysCtlCpuFeature> features = {
+#if defined(CPUINFO_ARCH_X86)
+        {"hw.optional.sse4_2",
+         CpuInfo::SSSE3 | CpuInfo::SSE4_1 | CpuInfo::SSE4_2 | CpuInfo::POPCNT},
+        {"hw.optional.avx1_0", CpuInfo::AVX},
+        {"hw.optional.avx2_0", CpuInfo::AVX2},
+        {"hw.optional.bmi1", CpuInfo::BMI1},
+        {"hw.optional.bmi2", CpuInfo::BMI2},
+        {"hw.optional.avx512f", CpuInfo::AVX512F},
+        {"hw.optional.avx512cd", CpuInfo::AVX512CD},
+        {"hw.optional.avx512dq", CpuInfo::AVX512DQ},
+        {"hw.optional.avx512bw", CpuInfo::AVX512BW},
+        {"hw.optional.avx512vl", CpuInfo::AVX512VL},
+#endif
+      };
+      for (const auto& feature : features) {
+        auto v = IntegerSysCtlByName(feature.name);
+        if (v.value_or(0)) {
+          *hardware_flags |= feature.flag;
+        }
+      }
+
+      // TODO: vendor, model_name
+      *vendor = CpuInfo::Vendor::Unknown;
+      *model_name = "Unknown";
+    }
+
+#else
+    //------------------------------ LINUX ------------------------------//
+    // Get cache size, return 0 on error
+    int64_t LinuxGetCacheSize(int level) {
+      // get cache size by sysconf()
+#ifdef _SC_LEVEL1_DCACHE_SIZE
+      const int kCacheSizeConf[] = {
+          _SC_LEVEL1_DCACHE_SIZE,
+          _SC_LEVEL2_CACHE_SIZE,
+          _SC_LEVEL3_CACHE_SIZE,
+      };
+      static_assert(sizeof(kCacheSizeConf) / sizeof(kCacheSizeConf[0]) == kCacheLevels, "");
+
+      errno = 0;
+      const int64_t cache_size = sysconf(kCacheSizeConf[level]);
+      if (errno == 0 && cache_size > 0) {
+        return cache_size;
+      }
+#endif
+
+      // get cache size from sysfs if sysconf() fails or not supported
+      const char* kCacheSizeSysfs[] = {
+          "/sys/devices/system/cpu/cpu0/cache/index0/size",  // l1d (index1 is l1i)
+          "/sys/devices/system/cpu/cpu0/cache/index2/size",  // l2
+          "/sys/devices/system/cpu/cpu0/cache/index3/size",  // l3
+      };
+      static_assert(sizeof(kCacheSizeSysfs) / sizeof(kCacheSizeSysfs[0]) == kCacheLevels, "");
+
+      std::ifstream cacheinfo(kCacheSizeSysfs[level], std::ios::in);
+      if (!cacheinfo) {
+        return 0;
+      }
+      // cacheinfo is one line like: 65536, 64K, 1M, etc.
+      uint64_t size = 0;
+      char unit = '\0';
+      cacheinfo >> size >> unit;
+      if (unit == 'K') {
+        size <<= 10;
+      } else if (unit == 'M') {
+        size <<= 20;
+      } else if (unit == 'G') {
+        size <<= 30;
+      } else if (unit != '\0') {
+        return 0;
+      }
+      return static_cast<int64_t>(size);
+    }
+
+    // Helper function to parse for hardware flags from /proc/cpuinfo
+    // values contains a list of space-separated flags.  check to see if the flags we
+    // care about are present.
+    // Returns a bitmap of flags.
+    int64_t LinuxParseCpuFlags(const std::string& values) {
+      const struct {
+        std::string name;
+        int64_t flag;
+      } flag_mappings[] = {
+#if defined(CPUINFO_ARCH_X86)
+        {"ssse3", CpuInfo::SSSE3},
+        {"sse4_1", CpuInfo::SSE4_1},
+        {"sse4_2", CpuInfo::SSE4_2},
+        {"popcnt", CpuInfo::POPCNT},
+        {"avx", CpuInfo::AVX},
+        {"avx2", CpuInfo::AVX2},
+        {"avx512f", CpuInfo::AVX512F},
+        {"avx512cd", CpuInfo::AVX512CD},
+        {"avx512vl", CpuInfo::AVX512VL},
+        {"avx512dq", CpuInfo::AVX512DQ},
+        {"avx512bw", CpuInfo::AVX512BW},
+        {"bmi1", CpuInfo::BMI1},
+        {"bmi2", CpuInfo::BMI2},
+#endif
+      };
+      const int64_t num_flags = sizeof(flag_mappings) / sizeof(flag_mappings[0]);
+
+      int64_t flags = 0;
+      for (int i = 0; i < num_flags; ++i) {
+        if (values.find(flag_mappings[i].name) != std::string::npos) {
+          flags |= flag_mappings[i].flag;
+        }
+      }
+      return flags;
+    }
+
+    void OsRetrieveCacheSize(std::array<int64_t, kCacheLevels>* cache_sizes) {
+      for (int i = 0; i < kCacheLevels; ++i) {
+        const int64_t cache_size = LinuxGetCacheSize(i);
+        if (cache_size > 0) {
+          (*cache_sizes)[i] = cache_size;
+        }
+      }
+    }
+
+    static constexpr bool IsWhitespace(char c) {
+      return c == ' ' || c == '\t';
+    }
+
+    std::string TrimString(std::string value) {
+      size_t ltrim_chars = 0;
+      while (ltrim_chars < value.size() && IsWhitespace(value[ltrim_chars])) {
+        ++ltrim_chars;
+      }
+      value.erase(0, ltrim_chars);
+      size_t rtrim_chars = 0;
+      while (rtrim_chars < value.size() && IsWhitespace(value[value.size() - 1 - rtrim_chars])) {
+        ++rtrim_chars;
+      }
+      value.erase(value.size() - rtrim_chars, rtrim_chars);
+      return value;
+    }
+
+    // Read from /proc/cpuinfo
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      std::ifstream cpuinfo("/proc/cpuinfo", std::ios::in);
+      while (cpuinfo) {
+        std::string line;
+        std::getline(cpuinfo, line);
+        const size_t colon = line.find(':');
+        if (colon != std::string::npos) {
+          const std::string name = TrimString(line.substr(0, colon - 1));
+          const std::string value = TrimString(line.substr(colon + 1, std::string::npos));
+          if (name.compare("flags") == 0 || name.compare("Features") == 0) {
+            *hardware_flags |= LinuxParseCpuFlags(value);
+          } else if (name.compare("model name") == 0) {
+            *model_name = value;
+          } else if (name.compare("vendor_id") == 0) {
+            if (value.compare("GenuineIntel") == 0) {
+              *vendor = CpuInfo::Vendor::Intel;
+            } else if (value.compare("AuthenticAMD") == 0) {
+              *vendor = CpuInfo::Vendor::AMD;
+            }
+          }
+        }
+      }
+    }
+#endif  // WINDOWS, MACOS, LINUX
+
+    //============================== Arch Dependent ==============================//
+
+#if defined(CPUINFO_ARCH_X86)
+    //------------------------------ X86_64 ------------------------------//
+    bool ArchParseUserSimdLevel(const std::string& simd_level, int64_t* hardware_flags) {
+      enum {
+        USER_SIMD_NONE,
+        USER_SIMD_AVX512,
+        USER_SIMD_MAX,
+      };
+
+      int level = USER_SIMD_MAX;
+      // Parse the level
+      if (simd_level == "AVX512") {
+        level = USER_SIMD_AVX512;
+      } else if (simd_level == "NONE") {
+        level = USER_SIMD_NONE;
+      } else {
+        return false;
+      }
+
+      // Disable feature as the level
+      if (level < USER_SIMD_AVX512) {
+        *hardware_flags &= ~CpuInfo::AVX512;
+      }
+      return true;
+    }
+
+    void ArchVerifyCpuRequirements(const CpuInfo* ci) {
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+      if (!ci->isDetected(CpuInfo::AVX512)) {
+        throw ParseError("CPU does not support the Supplemental AVX512 instruction set");
+      }
+#else
+      UNUSED(ci);
+#endif
+    }
+
+#endif  // X86
+
+  }  // namespace
+
+  struct CpuInfo::Impl {
+    int64_t hardware_flags = 0;
+    int numCores = 0;
+    int64_t original_hardware_flags = 0;
+    Vendor vendor = Vendor::Unknown;
+    std::string model_name = "Unknown";
+    std::array<int64_t, kCacheLevels> cache_sizes{};
+
+    Impl() {
+      OsRetrieveCacheSize(&cache_sizes);
+      OsRetrieveCpuInfo(&hardware_flags, &vendor, &model_name);
+      original_hardware_flags = hardware_flags;
+      numCores = std::max(static_cast<int>(std::thread::hardware_concurrency()), 1);
+
+      // parse user simd level
+      const auto maybe_env_var = std::getenv("ORC_USER_SIMD_LEVEL");

Review Comment:
   Already updated the PR description and added a section titled "How to enable AVX512 Bit-unpacking?"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] dongjoon-hyun commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1479983592

   Thank you for improving the PR.



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1138687810


##########
c++/src/CMakeLists.txt:
##########
@@ -184,7 +184,11 @@ set(SOURCE_FILES
   Timezone.cc
   TypeImpl.cc
   Vector.cc
-  Writer.cc)
+  Writer.cc
+  CpuInfoUtil.cc
+  BpackingDefault.cc
+  BpackingAvx512.cc
+  Bpacking.cc)

Review Comment:
   https://github.com/wpleonardo/orc/blob/fe5b6c7a29721bb5d8c4699a0b072d64555d600d/c%2B%2B/src/CMakeLists.txt#L196
   The list of source files to build now changes when BUILD_ENABLE_AVX512 is enabled. I also changed the function definitions of BitUnpack::readLongs, BitUnpackDefault::readLongs, and BitUnpackAVX512::readLongs.
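
   The split between a default and an AVX512 readLongs suggests a runtime dispatch between two decoders. Below is a minimal sketch of that idea, not the PR's actual code: the names unpackDefault, unpackVectorPlaceholder, and selectUnpack are hypothetical, the most-significant-bit-first bit order is an assumption of the sketch, and the "vector" function simply forwards to the scalar one so that the example runs on any CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar reference decoder: reads `count` values of `bitWidth` bits each,
// most significant bit first (bit order is an assumption of this sketch).
inline void unpackDefault(const uint8_t* src, int64_t* dst, size_t count, uint32_t bitWidth) {
  assert(bitWidth >= 1 && bitWidth <= 64);
  uint64_t bitPos = 0;
  for (size_t i = 0; i < count; ++i) {
    uint64_t value = 0;
    for (uint32_t b = 0; b < bitWidth; ++b, ++bitPos) {
      // Extract one bit from the current byte, counting from the top bit down.
      const uint32_t bit = (src[bitPos / 8] >> (7 - bitPos % 8)) & 1u;
      value = (value << 1) | bit;
    }
    dst[i] = static_cast<int64_t>(value);
  }
}

// Hypothetical placeholder for the vectorized decoder. It forwards to the
// scalar code so the sketch runs anywhere; a real build would use AVX-512
// intrinsics here and compile this file only when the feature is enabled.
inline void unpackVectorPlaceholder(const uint8_t* src, int64_t* dst, size_t count,
                                    uint32_t bitWidth) {
  unpackDefault(src, dst, count, bitWidth);
}

using UnpackFn = void (*)(const uint8_t*, int64_t*, size_t, uint32_t);

// Pick an implementation once, based on a capability flag from the caller.
inline UnpackFn selectUnpack(bool avx512Available) {
  return avx512Available ? &unpackVectorPlaceholder : &unpackDefault;
}
```

   Whichever function is selected, both decoders must produce identical output for the same input, which is what a parameterized test over the two paths would check.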




[GitHub] [orc] wgtmac commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1149951617


##########
.github/workflows/build_and_test.yml:
##########
@@ -91,6 +91,49 @@ jobs:
         cmake --build . --config Debug
         ctest -C Debug --output-on-failure
 
+  simdUbuntu:
+    name: "SIMD programming using C++ intrinsic functions on ${{ matrix.os }}"
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os:
+          - ubuntu-22.04
+        cxx:
+          - clang++
+    env:
+      ORC_USER_SIMD_LEVEL: AVX512
+    steps:
+    - name: Checkout
+      uses: actions/checkout@v2
+    - name: "Test"
+      run: |
+        mkdir -p ~/.m2
+        mkdir build
+        cd build
+        cmake -DBUILD_JAVA=OFF -DBUILD_ENABLE_AVX512=ON ..
+        make package test-out
+
+  simdWindows:

Review Comment:
   The downside is that we don't know whether the Windows build will have AVX512 support. We should at least make sure the AVX512-disabled code path is always covered.
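
   One way to keep the AVX512-disabled path covered regardless of the runner's CPU is to mask the AVX512 bits out of the detected flags, as ArchParseUserSimdLevel in the diff does for the "NONE" level. The sketch below mirrors that logic with illustrative flag constants; kAvx2, kAvx512F, kAvx512BW, applyUserSimdLevel, and useVectorPath are hypothetical names, not the CpuInfo API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Illustrative flag bits only; the real CpuInfo constants are not reproduced here.
constexpr int64_t kAvx2 = 1 << 0;
constexpr int64_t kAvx512F = 1 << 1;
constexpr int64_t kAvx512BW = 1 << 2;
constexpr int64_t kAvx512Mask = kAvx512F | kAvx512BW;

// Applies a user-requested SIMD level to the detected flags.
// "AVX512" leaves the flags untouched, "NONE" clears every AVX512 bit,
// and any other string is rejected without modifying the flags.
inline bool applyUserSimdLevel(const std::string& level, int64_t* flags) {
  assert(flags != nullptr);
  if (level == "AVX512") {
    return true;
  }
  if (level == "NONE") {
    *flags &= ~kAvx512Mask;
    return true;
  }
  return false;
}

// The vector path is taken only when every required AVX512 bit is present.
inline bool useVectorPath(int64_t flags) {
  return (flags & kAvx512Mask) == kAvx512Mask;
}
```

   Under this scheme, a CI job that runs the test suite once with the level forced to NONE would exercise the fallback decoder even on hardware that supports AVX512.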




[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1486194915

   > Could you use more up-to-date link instead of the following?
   > 
   > * https://github.com/apple/darwin-xnu/blob/0a798f6738bc1db01281fc08ae024145e84df927/osfmk/i386/fpu.c#L176
   > 
   > Specifically, can we use the following link instead?
   > 
   > * https://github.com/apple/darwin-xnu/blob/2ff845c2e033bd0ff64b5b6aa6063a1f8f65aa32/osfmk/i386/fpu.c#L174
   
   Done.



[GitHub] [orc] wgtmac commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1436244200

   @wpleonardo It seems that the CI check does not provide sufficient error messages, which makes debugging painful. Please check out the Docker files provided here: https://github.com/apache/orc/tree/main/docker. Hope it helps.



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1092924747


##########
c++/test/TestRleVectorDecoder.cc:
##########
@@ -0,0 +1,608 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <inttypes.h>
+
+#include <cstdlib>
+
+#include "MemoryOutputStream.hh"
+#include "RLEv2.hh"
+#include "wrap/gtest-wrapper.h"
+#include "wrap/orc-proto-wrapper.hh"
+
+#ifdef __clang__
+DIAGNOSTIC_IGNORE("-Wmissing-variable-declarations")
+#endif
+
+namespace orc {
+
+  using ::testing::TestWithParam;
+  using ::testing::Values;
+
+  const int DEFAULT_MEM_STREAM_SIZE = 1024 * 1024;  // 1M
+
+  class RleVectorTest : public TestWithParam<bool> {

Review Comment:
   OK, changed as you suggested.




[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1491510135

   > @wpleonardo Please make the CI happy. Thanks!
   
   Just added "shell: bash" to the Windows CI test so that the CI commands run within bash. This is an attempt to fix the previous syntax error in the Windows CI test.



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1163489027


##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,2724 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"
+#include "CpuInfoUtil.hh"
+#include "RLEv2.hh"
+
+namespace orc {
+  UnpackAvx512::UnpackAvx512(RleDecoderV2* dec) : decoder(dec), unpackDefault(UnpackDefault(dec)) {
+    // PASS
+  }
+
+  UnpackAvx512::~UnpackAvx512() {
+    // PASS
+  }
+
+  inline void UnpackAvx512::alignHeaderBoundary(uint64_t& startBit, uint64_t& bufMoveByteLen,
+                                                uint64_t& bufRestByteLen, uint64_t& len,
+                                                uint32_t& bitWidth, uint64_t& tailBitLen,
+                                                uint32_t& backupByteLen, uint64_t& numElements,
+                                                bool& resetBuf, const uint8_t*& srcPtr,
+                                                int64_t*& dstPtr, uint32_t bitMaxSize) {

Review Comment:
   Thank you very much for your comments!
   1. Renamed "len" to "remainingNumElements".
   2. Moved the parameters bitWidth and bitMaxSize to the beginning of the functions' parameter lists.
   3. Changed the types of bitWidth and bitMaxSize to const.




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1148756381


##########
c++/src/CpuInfoUtil.cc:
##########
@@ -0,0 +1,545 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * @file CpuInfoUtil.cc is from Apache Arrow as of 2023-03-21
+ */
+
+#include "CpuInfoUtil.hh"
+
+#ifdef __APPLE__
+#include <sys/sysctl.h>
+#endif
+
+#ifndef _MSC_VER
+#include <unistd.h>
+#endif
+
+#ifdef _WIN32
+#define NOMINMAX
+#include <Windows.h>
+#include <intrin.h>
+#endif
+
+#include <algorithm>
+#include <array>
+#include <bitset>
+#include <cstdint>
+#include <fstream>
+#include <optional>
+#include <sstream>
+#include <string>
+#include <thread>
+#include <vector>
+
+#include "orc/Exceptions.hh"
+
+#undef CPUINFO_ARCH_X86
+
+#if defined(__i386) || defined(_M_IX86) || defined(__x86_64__) || defined(_M_X64)
+#define CPUINFO_ARCH_X86
+#endif
+
+#ifndef ORC_HAVE_RUNTIME_AVX512
+#define UNUSED(x) (void)(x)
+#endif
+
+namespace orc {
+
+  namespace {
+
+    constexpr int kCacheLevels = static_cast<int>(CpuInfo::CacheLevel::Last) + 1;
+
+    //============================== OS Dependent ==============================//
+
+#if defined(_WIN32)
+    //------------------------------ WINDOWS ------------------------------//
+    void OsRetrieveCacheSize(std::array<int64_t, kCacheLevels>* cache_sizes) {
+      PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer = nullptr;
+      PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer_position = nullptr;
+      DWORD buffer_size = 0;
+      size_t offset = 0;
+      typedef BOOL(WINAPI * GetLogicalProcessorInformationFuncPointer)(void*, void*);
+      GetLogicalProcessorInformationFuncPointer func_pointer =
+          (GetLogicalProcessorInformationFuncPointer)GetProcAddress(
+              GetModuleHandle("kernel32"), "GetLogicalProcessorInformation");
+
+      if (!func_pointer) {
+        throw ParseError("Failed to find procedure GetLogicalProcessorInformation");
+      }
+
+      // Get buffer size
+      if (func_pointer(buffer, &buffer_size) && GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
+        throw ParseError("Failed to get size of processor information buffer");
+      }
+
+      buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)malloc(buffer_size);
+      if (!buffer) {
+        return;
+      }
+
+      if (!func_pointer(buffer, &buffer_size)) {
+        free(buffer);
+        throw ParseError("Failed to get processor information");
+      }
+
+      buffer_position = buffer;
+      while (offset + sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION) <= buffer_size) {
+        if (RelationCache == buffer_position->Relationship) {
+          PCACHE_DESCRIPTOR cache = &buffer_position->Cache;
+          if (cache->Level >= 1 && cache->Level <= kCacheLevels) {
+            const int64_t current = (*cache_sizes)[cache->Level - 1];
+            (*cache_sizes)[cache->Level - 1] = std::max<int64_t>(current, cache->Size);
+          }
+        }
+        offset += sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
+        buffer_position++;
+      }
+
+      free(buffer);
+    }
+
+#if defined(CPUINFO_ARCH_X86)
+    // On x86, get CPU features by cpuid, https://en.wikipedia.org/wiki/CPUID
+
+#if defined(__MINGW64_VERSION_MAJOR) && __MINGW64_VERSION_MAJOR < 5
+    void __cpuidex(int CPUInfo[4], int function_id, int subfunction_id) {
+      __asm__ __volatile__("cpuid"
+                           : "=a"(CPUInfo[0]), "=b"(CPUInfo[1]), "=c"(CPUInfo[2]), "=d"(CPUInfo[3])
+                           : "a"(function_id), "c"(subfunction_id));
+    }
+
+    int64_t _xgetbv(int xcr) {
+      int out = 0;
+      __asm__ __volatile__("xgetbv" : "=a"(out) : "c"(xcr) : "%edx");
+      return out;
+    }
+#endif  // MINGW
+
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      int register_EAX_id = 1;
+      int highest_valid_id = 0;
+      int highest_extended_valid_id = 0;
+      std::bitset<32> features_ECX;
+      std::array<int, 4> cpu_info;
+
+      // Get highest valid id
+      __cpuid(cpu_info.data(), 0);
+      highest_valid_id = cpu_info[0];
+      // HEX of "GenuineIntel": 47656E75 696E6549 6E74656C
+      // HEX of "AuthenticAMD": 41757468 656E7469 63414D44
+      if (cpu_info[1] == 0x756e6547 && cpu_info[3] == 0x49656e69 && cpu_info[2] == 0x6c65746e) {
+        *vendor = CpuInfo::Vendor::Intel;
+      } else if (cpu_info[1] == 0x68747541 && cpu_info[3] == 0x69746e65 &&
+                 cpu_info[2] == 0x444d4163) {
+        *vendor = CpuInfo::Vendor::AMD;
+      }
+
+      if (highest_valid_id <= register_EAX_id) {
+        return;
+      }
+
+      // EAX=1: Processor Info and Feature Bits
+      __cpuidex(cpu_info.data(), register_EAX_id, 0);
+      features_ECX = cpu_info[2];
+
+      // Get highest extended id
+      __cpuid(cpu_info.data(), 0x80000000);
+      highest_extended_valid_id = cpu_info[0];
+
+      // Retrieve CPU model name
+      if (highest_extended_valid_id >= static_cast<int>(0x80000004)) {
+        model_name->clear();
+        for (int i = 0x80000002; i <= static_cast<int>(0x80000004); ++i) {
+          __cpuidex(cpu_info.data(), i, 0);
+          *model_name += std::string(reinterpret_cast<char*>(cpu_info.data()), sizeof(cpu_info));
+        }
+      }
+
+      bool zmm_enabled = false;
+      if (features_ECX[27]) {  // OSXSAVE
+        // Query if the OS supports saving ZMM registers when switching contexts
+        int64_t xcr0 = _xgetbv(0);
+        zmm_enabled = (xcr0 & 0xE0) == 0xE0;
+      }
+
+      if (features_ECX[9]) *hardware_flags |= CpuInfo::SSSE3;
+      if (features_ECX[19]) *hardware_flags |= CpuInfo::SSE4_1;
+      if (features_ECX[20]) *hardware_flags |= CpuInfo::SSE4_2;
+      if (features_ECX[23]) *hardware_flags |= CpuInfo::POPCNT;
+      if (features_ECX[28]) *hardware_flags |= CpuInfo::AVX;
+
+      // cpuid with EAX=7, ECX=0: Extended Features
+      register_EAX_id = 7;
+      if (highest_valid_id > register_EAX_id) {
+        __cpuidex(cpu_info.data(), register_EAX_id, 0);
+        std::bitset<32> features_EBX = cpu_info[1];
+
+        if (features_EBX[3]) *hardware_flags |= CpuInfo::BMI1;
+        if (features_EBX[5]) *hardware_flags |= CpuInfo::AVX2;
+        if (features_EBX[8]) *hardware_flags |= CpuInfo::BMI2;
+        if (zmm_enabled) {
+          if (features_EBX[16]) *hardware_flags |= CpuInfo::AVX512F;
+          if (features_EBX[17]) *hardware_flags |= CpuInfo::AVX512DQ;
+          if (features_EBX[28]) *hardware_flags |= CpuInfo::AVX512CD;
+          if (features_EBX[30]) *hardware_flags |= CpuInfo::AVX512BW;
+          if (features_EBX[31]) *hardware_flags |= CpuInfo::AVX512VL;
+        }
+      }
+    }
+#endif
+
+#elif defined(__APPLE__)
+    //------------------------------ MACOS ------------------------------//
+    std::optional<int64_t> IntegerSysCtlByName(const char* name) {
+      size_t len = sizeof(int64_t);
+      int64_t data = 0;
+      if (sysctlbyname(name, &data, &len, nullptr, 0) == 0) {
+        return data;
+      }
+      // ENOENT is the official errno value for non-existing sysctl's,
+      // but EINVAL and ENOTSUP have been seen in the wild.
+      if (errno != ENOENT && errno != EINVAL && errno != ENOTSUP) {
+        std::ostringstream ss;
+        ss << "sysctlbyname failed for '" << name << "'";
+        throw ParseError(ss.str());
+      }
+      return std::nullopt;
+    }
+
+    void OsRetrieveCacheSize(std::array<int64_t, kCacheLevels>* cache_sizes) {
+      static_assert(kCacheLevels >= 3, "");
+      auto c = IntegerSysCtlByName("hw.l1dcachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[0] = *c;
+      }
+      c = IntegerSysCtlByName("hw.l2cachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[1] = *c;
+      }
+      c = IntegerSysCtlByName("hw.l3cachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[2] = *c;
+      }
+    }
+
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      // hardware_flags
+      struct SysCtlCpuFeature {
+        const char* name;
+        int64_t flag;
+      };
+      std::vector<SysCtlCpuFeature> features = {
+#if defined(CPUINFO_ARCH_X86)
+        {"hw.optional.sse4_2",
+         CpuInfo::SSSE3 | CpuInfo::SSE4_1 | CpuInfo::SSE4_2 | CpuInfo::POPCNT},
+        {"hw.optional.avx1_0", CpuInfo::AVX},
+        {"hw.optional.avx2_0", CpuInfo::AVX2},
+        {"hw.optional.bmi1", CpuInfo::BMI1},
+        {"hw.optional.bmi2", CpuInfo::BMI2},
+        {"hw.optional.avx512f", CpuInfo::AVX512F},
+        {"hw.optional.avx512cd", CpuInfo::AVX512CD},
+        {"hw.optional.avx512dq", CpuInfo::AVX512DQ},
+        {"hw.optional.avx512bw", CpuInfo::AVX512BW},
+        {"hw.optional.avx512vl", CpuInfo::AVX512VL},
+#endif
+      };
+      for (const auto& feature : features) {
+        auto v = IntegerSysCtlByName(feature.name);
+        if (v.value_or(0)) {
+          *hardware_flags |= feature.flag;
+        }
+      }
+
+      // TODO: vendor, model_name
+      *vendor = CpuInfo::Vendor::Unknown;
+      *model_name = "Unknown";
+    }
+
+#else
+    //------------------------------ LINUX ------------------------------//
+    // Get cache size, return 0 on error
+    int64_t LinuxGetCacheSize(int level) {
+      // get cache size by sysconf()
+#ifdef _SC_LEVEL1_DCACHE_SIZE
+      const int kCacheSizeConf[] = {
+          _SC_LEVEL1_DCACHE_SIZE,
+          _SC_LEVEL2_CACHE_SIZE,
+          _SC_LEVEL3_CACHE_SIZE,
+      };
+      static_assert(sizeof(kCacheSizeConf) / sizeof(kCacheSizeConf[0]) == kCacheLevels, "");
+
+      errno = 0;
+      const int64_t cache_size = sysconf(kCacheSizeConf[level]);
+      if (errno == 0 && cache_size > 0) {
+        return cache_size;
+      }
+#endif
+
+      // get cache size from sysfs if sysconf() fails or not supported
+      const char* kCacheSizeSysfs[] = {
+          "/sys/devices/system/cpu/cpu0/cache/index0/size",  // l1d (index1 is l1i)
+          "/sys/devices/system/cpu/cpu0/cache/index2/size",  // l2
+          "/sys/devices/system/cpu/cpu0/cache/index3/size",  // l3
+      };
+      static_assert(sizeof(kCacheSizeSysfs) / sizeof(kCacheSizeSysfs[0]) == kCacheLevels, "");
+
+      std::ifstream cacheinfo(kCacheSizeSysfs[level], std::ios::in);
+      if (!cacheinfo) {
+        return 0;
+      }
+      // cacheinfo is one line like: 65536, 64K, 1M, etc.
+      uint64_t size = 0;
+      char unit = '\0';
+      cacheinfo >> size >> unit;
+      if (unit == 'K') {
+        size <<= 10;
+      } else if (unit == 'M') {
+        size <<= 20;
+      } else if (unit == 'G') {
+        size <<= 30;
+      } else if (unit != '\0') {
+        return 0;
+      }
+      return static_cast<int64_t>(size);
+    }
+
+    // Helper function to parse for hardware flags from /proc/cpuinfo
+    // values contains a list of space-separated flags.  check to see if the flags we
+    // care about are present.
+    // Returns a bitmap of flags.
+    int64_t LinuxParseCpuFlags(const std::string& values) {
+      const struct {
+        std::string name;
+        int64_t flag;
+      } flag_mappings[] = {
+#if defined(CPUINFO_ARCH_X86)
+        {"ssse3", CpuInfo::SSSE3},
+        {"sse4_1", CpuInfo::SSE4_1},
+        {"sse4_2", CpuInfo::SSE4_2},
+        {"popcnt", CpuInfo::POPCNT},
+        {"avx", CpuInfo::AVX},
+        {"avx2", CpuInfo::AVX2},
+        {"avx512f", CpuInfo::AVX512F},
+        {"avx512cd", CpuInfo::AVX512CD},
+        {"avx512vl", CpuInfo::AVX512VL},
+        {"avx512dq", CpuInfo::AVX512DQ},
+        {"avx512bw", CpuInfo::AVX512BW},
+        {"bmi1", CpuInfo::BMI1},
+        {"bmi2", CpuInfo::BMI2},
+#endif
+      };
+      const int64_t num_flags = sizeof(flag_mappings) / sizeof(flag_mappings[0]);
+
+      int64_t flags = 0;
+      for (int i = 0; i < num_flags; ++i) {
+        if (values.find(flag_mappings[i].name) != std::string::npos) {
+          flags |= flag_mappings[i].flag;
+        }
+      }
+      return flags;
+    }
+
+    void OsRetrieveCacheSize(std::array<int64_t, kCacheLevels>* cache_sizes) {
+      for (int i = 0; i < kCacheLevels; ++i) {
+        const int64_t cache_size = LinuxGetCacheSize(i);
+        if (cache_size > 0) {
+          (*cache_sizes)[i] = cache_size;
+        }
+      }
+    }
+
+    static constexpr bool IsWhitespace(char c) {
+      return c == ' ' || c == '\t';
+    }
+
+    std::string TrimString(std::string value) {
+      size_t ltrim_chars = 0;
+      while (ltrim_chars < value.size() && IsWhitespace(value[ltrim_chars])) {
+        ++ltrim_chars;
+      }
+      value.erase(0, ltrim_chars);
+      size_t rtrim_chars = 0;
+      while (rtrim_chars < value.size() && IsWhitespace(value[value.size() - 1 - rtrim_chars])) {
+        ++rtrim_chars;
+      }
+      value.erase(value.size() - rtrim_chars, rtrim_chars);
+      return value;
+    }
+
+    // Read from /proc/cpuinfo
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      std::ifstream cpuinfo("/proc/cpuinfo", std::ios::in);
+      while (cpuinfo) {
+        std::string line;
+        std::getline(cpuinfo, line);
+        const size_t colon = line.find(':');
+        if (colon != std::string::npos) {
+          const std::string name = TrimString(line.substr(0, colon - 1));
+          const std::string value = TrimString(line.substr(colon + 1, std::string::npos));
+          if (name.compare("flags") == 0 || name.compare("Features") == 0) {
+            *hardware_flags |= LinuxParseCpuFlags(value);
+          } else if (name.compare("model name") == 0) {
+            *model_name = value;
+          } else if (name.compare("vendor_id") == 0) {
+            if (value.compare("GenuineIntel") == 0) {
+              *vendor = CpuInfo::Vendor::Intel;
+            } else if (value.compare("AuthenticAMD") == 0) {
+              *vendor = CpuInfo::Vendor::AMD;
+            }
+          }
+        }
+      }
+    }
+#endif  // WINDOWS, MACOS, LINUX
+
+    //============================== Arch Dependent ==============================//
+
+#if defined(CPUINFO_ARCH_X86)
+    //------------------------------ X86_64 ------------------------------//
+    bool ArchParseUserSimdLevel(const std::string& simd_level, int64_t* hardware_flags) {
+      enum {
+        USER_SIMD_NONE,
+        USER_SIMD_AVX512,
+        USER_SIMD_MAX,
+      };
+
+      int level = USER_SIMD_MAX;
+      // Parse the level
+      if (simd_level == "AVX512") {
+        level = USER_SIMD_AVX512;
+      } else if (simd_level == "NONE") {
+        level = USER_SIMD_NONE;
+      } else {
+        return false;
+      }
+
+      // Disable feature as the level
+      if (level < USER_SIMD_AVX512) {
+        *hardware_flags &= ~CpuInfo::AVX512;
+      }
+      return true;
+    }
+
+    void ArchVerifyCpuRequirements(const CpuInfo* ci) {
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+      if (!ci->isDetected(CpuInfo::AVX512)) {
+        throw ParseError("CPU does not support the Supplemental AVX512 instruction set");
+      }
+#else
+      UNUSED(ci);
+#endif
+    }
+
+#endif  // X86
+
+  }  // namespace
+
+  struct CpuInfo::Impl {
+    int64_t hardware_flags = 0;
+    int numCores = 0;
+    int64_t original_hardware_flags = 0;
+    Vendor vendor = Vendor::Unknown;
+    std::string model_name = "Unknown";
+    std::array<int64_t, kCacheLevels> cache_sizes{};
+
+    Impl() {
+      OsRetrieveCacheSize(&cache_sizes);
+      OsRetrieveCpuInfo(&hardware_flags, &vendor, &model_name);
+      original_hardware_flags = hardware_flags;
+      numCores = std::max(static_cast<int>(std::thread::hardware_concurrency()), 1);
+
+      // parse user simd level
+      const auto maybe_env_var = std::getenv("ORC_USER_SIMD_LEVEL");

Review Comment:
   This has already been added as the last line of the root README.md. Please check it, thanks.
   https://github.com/wpleonardo/orc/blob/main/README.md
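
For readers of this thread, the effect of the user SIMD level on the detected flags can be sketched as below. This is only an illustration under stated assumptions: the bit values are copied from the CpuInfo constants quoted elsewhere in this thread, while `applyUserSimdLevel` and the `k...` names are hypothetical and do not exist in the PR. It mirrors the masking step in the quoted ArchParseUserSimdLevel, where a level below AVX512 clears the AVX512 bits.

```cpp
#include <cstdint>
#include <string>

// Bit positions as in the CpuInfo constants quoted in this thread.
constexpr int64_t kAvx2 = (1LL << 5);
constexpr int64_t kAvx512F = (1LL << 6);
constexpr int64_t kAvx512CD = (1LL << 7);
constexpr int64_t kAvx512VL = (1LL << 8);
constexpr int64_t kAvx512DQ = (1LL << 9);
constexpr int64_t kAvx512BW = (1LL << 10);
constexpr int64_t kAvx512 = kAvx512F | kAvx512CD | kAvx512VL | kAvx512DQ | kAvx512BW;

// Hypothetical helper (not part of the PR): returns the flags that remain
// usable after applying the user-selected level. "NONE" masks out every
// AVX512 bit; any other value leaves the detected flags untouched here.
int64_t applyUserSimdLevel(int64_t detectedFlags, const std::string& level) {
  if (level == "NONE") {
    return detectedFlags & ~kAvx512;
  }
  return detectedFlags;
}
```

With `level` set to "NONE", a CPU that reports AVX2 and AVX512 keeps only the AVX2 bit, so the AVX512 decode path would not be selected.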



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1519077759

   check_cxx_source_runs hangs on the Windows platform when the CPU doesn't have the AVX512 flags.
   So I changed check_cxx_source_runs back to check_cxx_source_compiles, and added "grep avx512f /proc/cpuinfo" to check whether the CPU has the AVX512 flags.
   https://github.com/wpleonardo/orc/blob/1f2085e68ff4e691fb178080ec0c53e5b37286ea/cmake_modules/ConfigSimdLevel.cmake#L79
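
The same check that "grep avx512f /proc/cpuinfo" performs at configure time can be approximated in C++ as follows. This is a minimal sketch, not code from the PR: `hasCpuFlag` is a hypothetical helper, and it assumes the caller has already extracted the value of a "flags" line from /proc/cpuinfo.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of the PR): returns true when `flag` appears
// as a whole, whitespace-separated token in a /proc/cpuinfo "flags" value.
bool hasCpuFlag(const std::string& flagsValue, const std::string& flag) {
  std::istringstream tokens(flagsValue);
  std::string token;
  while (tokens >> token) {
    if (token == flag) {
      return true;
    }
  }
  return false;
}
```

Unlike grep, which matches substrings, this sketch compares whole tokens, so a longer flag name that merely contains "avx512f" would not count as a match.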



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1097448595


##########
CMakeLists.txt:
##########
@@ -157,6 +172,139 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "powerpc|ppc")
+    set(ORC_CPU_FLAG "ppc")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "s390x")
+    set(ORC_CPU_FLAG "s390x")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "riscv64")
+    set(ORC_CPU_FLAG "riscv64")
+  else()
+    message(FATAL_ERROR "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_SSE4_2_FLAG "")
+    set(ORC_AVX2_FLAG "/arch:AVX2")
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+    set(CXX_SUPPORTS_SSE4_2 TRUE)
+  else()
+    set(ORC_SSE4_2_FLAG "-msse4.2")
+    set(ORC_AVX2_FLAG "-march=haswell")
+    # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
+    set(ORC_AVX512_FLAG "-march=native -mbmi2")
+    # Append the avx2/avx512 subset option also, fix issue ORC-9877 for homebrew-cpp
+    set(ORC_AVX2_FLAG "${ORC_AVX2_FLAG} -mavx2")
+    set(ORC_AVX512_FLAG
+        "${ORC_AVX512_FLAG} -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mavx512vbmi")
+  endif()
+  check_cxx_compiler_flag(${ORC_AVX512_FLAG} CXX_SUPPORTS_AVX512)
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")
+    check_cxx_source_compiles("
+      #ifdef _MSC_VER
+      #include <intrin.h>
+      #else
+      #include <immintrin.h>
+      #endif
+
+      int main() {
+        __m512i mask = _mm512_set1_epi32(0x1);
+        char out[32];
+        _mm512_storeu_si512(out, mask);
+        return 0;
+      }"
+      CXX_SUPPORTS_AVX512)
+    set(CMAKE_REQUIRED_FLAGS ${OLD_CMAKE_REQURED_FLAGS})
+  endif()
+  # Runtime SIMD level it can get from compiler and ORC_RUNTIME_SIMD_LEVEL
+  if(CXX_SUPPORTS_SSE4_2 AND ORC_RUNTIME_SIMD_LEVEL MATCHES
+                             "^(SSE4_2|AVX2|AVX512|MAX)$")
+    set(ORC_HAVE_RUNTIME_SSE4_2 ON)
+    set(ORC_SIMD_LEVEL "SSE4_2")
+    add_definitions(-DORC_HAVE_RUNTIME_SSE4_2)
+  endif()
+  if(CXX_SUPPORTS_AVX2 AND ORC_RUNTIME_SIMD_LEVEL MATCHES "^(AVX2|AVX512|MAX)$")
+    set(ORC_HAVE_RUNTIME_AVX2 ON)
+    set(ORC_SIMD_LEVEL "AVX2")
+    add_definitions(-DORC_HAVE_RUNTIME_AVX2 -DORC_HAVE_RUNTIME_BMI2)
+  endif()
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512 AND ORC_RUNTIME_SIMD_LEVEL MATCHES "^(AVX512|MAX)$")
+    message(STATUS "Enable the AVX512 vector decode of bit-packing")
+    set(ORC_HAVE_RUNTIME_AVX512 ON)
+    set(ORC_SIMD_LEVEL "AVX512")
+    add_definitions(-DORC_HAVE_RUNTIME_AVX512 -DORC_HAVE_RUNTIME_BMI2)
+  else ()
+    message(STATUS "Disable the AVX512 vector decode of bit-packing")

Review Comment:
   OK, I will do it following your suggestions.




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1141576958


##########
c++/src/BpackingAvx512.hh:
##########
@@ -0,0 +1,93 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKINGAVX512_HH
+#define ORC_BPACKINGAVX512_HH
+
+#include <stdlib.h>
+#include <cstdint>
+
+#include "BpackingDefault.hh"
+#include "Dispatch.hh"
+#include "RLEv2.hh"
+#include "io/InputStream.hh"
+#include "io/OutputStream.hh"

Review Comment:
   Done




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1144372776


##########
c++/src/BpackingDefault.hh:
##########
@@ -0,0 +1,60 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKINGDEFAULT_HH
+#define ORC_BPACKINGDEFAULT_HH
+
+#include <stdlib.h>

Review Comment:
   Done




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1144373951


##########
c++/src/CpuInfoUtil.hh:
##########
@@ -0,0 +1,113 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * @file CpuInfoUtil.hh code borrowing from

Review Comment:
   Done




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1139852652


##########
.github/workflows/build_and_test.yml:
##########
@@ -91,6 +91,49 @@ jobs:
         cmake --build . --config Debug
         ctest -C Debug --output-on-failure
 
+  simdUbuntu:
+    name: "SIMD programming using C++ intrinsic functions on ${{ matrix.os }}"
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os:
+          - ubuntu-22.04
+        cxx:
+          - clang++
+    env:
+      ORC_USER_SIMD_LEVEL: avx512

Review Comment:
   No, it is case-sensitive.
   I will change it to uppercase here.
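
The case sensitivity follows from the exact string comparisons in the quoted ArchParseUserSimdLevel. A minimal sketch of that behavior, assuming the environment value is compared as-is with no case normalization beforehand (`UserSimdLevel` and `parseUserSimdLevel` are hypothetical names used only for illustration):

```cpp
#include <string>

// Hypothetical stand-in for the PR's level enum; names are illustrative.
enum class UserSimdLevel { None, Avx512, Invalid };

// Exact, case-sensitive comparison, mirroring the string checks in the quoted
// ArchParseUserSimdLevel: only "AVX512" and "NONE" are recognized.
UserSimdLevel parseUserSimdLevel(const std::string& level) {
  if (level == "AVX512") {
    return UserSimdLevel::Avx512;
  }
  if (level == "NONE") {
    return UserSimdLevel::None;
  }
  return UserSimdLevel::Invalid;
}
```

Under this assumption, the lowercase value "avx512" from the workflow file falls through to the invalid branch, which is why the setting needs to be uppercase.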




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1138648353


##########
cmake_modules/ConfigSimdLevel.cmake:
##########
@@ -0,0 +1,105 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+INCLUDE(CheckCXXCompilerFlag)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_SIMD_LEVEL)
+  set(ORC_SIMD_LEVEL
+      "DEFAULT"
+      CACHE STRING "Compile time SIMD optimization level")
+endif()
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(STATUS "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+  else()
+    # "arch=native" selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine.
+    # Using -march=native enables all instruction subsets supported by the local machine.
+    # Using -mtune=native produces code optimized for the local machine under the constraints of the selected instruction set.
+    set(ORC_AVX512_FLAG "-march=native -mtune=native")
+  endif()
+
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")
+    CHECK_CXX_SOURCE_COMPILES("
+      #ifdef _MSC_VER
+      #include <intrin.h>
+      #else
+      #include <immintrin.h>
+      #endif
+
+      int main() {
+        __m512i mask = _mm512_set1_epi32(0x1);
+        char out[32];
+        _mm512_storeu_si512(out, mask);
+        return 0;
+      }"
+      CXX_SUPPORTS_AVX512)
+    set(CMAKE_REQUIRED_FLAGS ${OLD_CMAKE_REQURED_FLAGS})
+  endif()
+
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512 AND NOT MSVC)
+    execute_process(COMMAND grep flags /proc/cpuinfo
+                    COMMAND head -1
+                    OUTPUT_VARIABLE flags_ver)
+    message(STATUS "CPU ${flags_ver}")
+  endif()
+
+  message(STATUS "BUILD_ENABLE_AVX512: ${BUILD_ENABLE_AVX512}")
+  # Runtime SIMD level it can get from compiler
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512)
+    message(STATUS "Enable the AVX512 vector decode of bit-packing, compiler support AVX512")
+    set(ORC_HAVE_RUNTIME_AVX512 ON)
+    set(ORC_SIMD_LEVEL "AVX512")
+    add_definitions(-DORC_HAVE_RUNTIME_AVX512)
+  elseif(BUILD_ENABLE_AVX512 AND NOT CXX_SUPPORTS_AVX512)
+    message(FATAL_ERROR "AVX512 required but compiler doesn't support it, failed to enable AVX512.")
+    set(ORC_HAVE_RUNTIME_AVX512 OFF)
+  elseif(NOT BUILD_ENABLE_AVX512)
+    set(ORC_HAVE_RUNTIME_AVX512 OFF)
+    message(STATUS "Disable the AVX512 vector decode of bit-packing")
+  endif()
+  if(ORC_SIMD_LEVEL STREQUAL "DEFAULT")
+    set(ORC_SIMD_LEVEL "NONE")
+  endif()
+
+  # Enable additional instruction sets if they are supported
+  if(MINGW)

Review Comment:
   Done.




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1138646562


##########
cmake_modules/ConfigSimdLevel.cmake:
##########
@@ -0,0 +1,105 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+INCLUDE(CheckCXXCompilerFlag)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_SIMD_LEVEL)
+  set(ORC_SIMD_LEVEL
+      "DEFAULT"
+      CACHE STRING "Compile time SIMD optimization level")
+endif()
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(STATUS "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+  else()
+    # "arch=native" selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine.
+    # Using -march=native enables all instruction subsets supported by the local machine.
+    # Using -mtune=native produces code optimized for the local machine under the constraints of the selected instruction set.
+    set(ORC_AVX512_FLAG "-march=native -mtune=native")
+  endif()
+
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")
+    CHECK_CXX_SOURCE_COMPILES("
+      #ifdef _MSC_VER
+      #include <intrin.h>
+      #else
+      #include <immintrin.h>
+      #endif
+
+      int main() {
+        __m512i mask = _mm512_set1_epi32(0x1);
+        char out[32];
+        _mm512_storeu_si512(out, mask);
+        return 0;
+      }"
+      CXX_SUPPORTS_AVX512)
+    set(CMAKE_REQUIRED_FLAGS ${OLD_CMAKE_REQURED_FLAGS})
+  endif()
+
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512 AND NOT MSVC)
+    execute_process(COMMAND grep flags /proc/cpuinfo
+                    COMMAND head -1
+                    OUTPUT_VARIABLE flags_ver)
+    message(STATUS "CPU ${flags_ver}")
+  endif()
+
+  message(STATUS "BUILD_ENABLE_AVX512: ${BUILD_ENABLE_AVX512}")
+  # Runtime SIMD level it can get from compiler
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512)
+    message(STATUS "Enable the AVX512 vector decode of bit-packing, compiler support AVX512")

Review Comment:
   Done.




[GitHub] [orc] coderex2522 commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "coderex2522 (via GitHub)" <gi...@apache.org>.

coderex2522 commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1105379569


##########
c++/src/CpuInfoUtil.hh:
##########
@@ -0,0 +1,110 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_CPUINFOUTIL_HH
+#define ORC_CPUINFOUTIL_HH
+
+#include <cstdint>
+#include <memory>
+#include <string>
+
+namespace orc {
+
+  /// CpuInfo is an interface to query for cpu information at runtime.  The caller can
+  /// ask for the sizes of the caches and what hardware features are supported.
+  /// On Linux, this information is pulled from a couple of sys files (/proc/cpuinfo and
+  /// /sys/devices)
+  class CpuInfo {
+   public:
+    ~CpuInfo();
+
+    /// x86 features
+    static constexpr int64_t SSSE3 = (1LL << 0);
+    static constexpr int64_t SSE4_1 = (1LL << 1);
+    static constexpr int64_t SSE4_2 = (1LL << 2);
+    static constexpr int64_t POPCNT = (1LL << 3);
+    static constexpr int64_t AVX = (1LL << 4);
+    static constexpr int64_t AVX2 = (1LL << 5);
+    static constexpr int64_t AVX512F = (1LL << 6);
+    static constexpr int64_t AVX512CD = (1LL << 7);
+    static constexpr int64_t AVX512VL = (1LL << 8);
+    static constexpr int64_t AVX512DQ = (1LL << 9);
+    static constexpr int64_t AVX512BW = (1LL << 10);
+    static constexpr int64_t AVX512 = AVX512F | AVX512CD | AVX512VL | AVX512DQ | AVX512BW;
+    static constexpr int64_t BMI1 = (1LL << 11);
+    static constexpr int64_t BMI2 = (1LL << 12);
+
+    /// Arm features
+    static constexpr int64_t ASIMD = (1LL << 32);
+
+    /// Cache enums for L1 (data), L2 and L3
+    enum class CacheLevel { L1 = 0, L2, L3, Last = L3 };
+
+    /// CPU vendors
+    enum class Vendor { Unknown, Intel, AMD };
+
+    static const CpuInfo* GetInstance();
+
+    /// Returns all the flags for this cpu
+    int64_t hardwareFlags() const;
+
+    /// Returns the number of cores (including hyper-threaded) on this machine.
+    int numCores() const;
+
+    /// Returns the vendor of the cpu.
+    Vendor vendor() const;
+
+    /// Returns the model name of the cpu (e.g. Intel i7-2600)
+    const std::string& modelName() const;
+
+    /// Returns the size of the cache in KB at this cache level
+    int64_t CacheSize(CacheLevel level) const;
+
+    /// \brief Returns whether or not the given feature is enabled.
+    ///
+    /// IsSupported() is true if IsDetected() is also true and the feature
+    /// wasn't disabled by the user (for example by setting the ORC_USER_SIMD_LEVEL
+    /// environment variable).
+    bool IsSupported(int64_t flags) const;
+
+    /// Returns whether or not the given feature is available on the CPU.
+    bool IsDetected(int64_t flags) const;
+
+    /// Determine if the CPU meets the minimum CPU requirements and if not, issue an error
+    /// and terminate.
+    void VerifyCpuRequirements() const;
+
+    /// Toggle a hardware feature on and off.  It is not valid to turn on a feature
+    /// that the underlying hardware cannot support. This is useful for testing.
+    // void EnableFeature(int64_t flag, bool enable);
+
+    bool HasEfficientBmi2() const {
+      // BMI2 (pext, pdep) is only efficient on Intel X86 processors.
+      return vendor() == Vendor::Intel && IsSupported(BMI2);
+    }
+
+   private:
+    CpuInfo();
+
+    struct Impl;
+    std::unique_ptr<Impl> impl_;

Review Comment:
   If class CpuInfo is just an interface, I suggest removing the impl_ variable from class CpuInfo.
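
For context, the `struct Impl; std::unique_ptr<Impl> impl_;` pair in the quoted header is the usual pimpl arrangement. The sketch below shows that pattern in isolation; `FeatureInfo` is a hypothetical, cut-down class and the flag value is made up, so it should not be read as the PR's CpuInfo.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical, cut-down class; it is not the PR's CpuInfo.
class FeatureInfo {
 public:
  FeatureInfo();
  ~FeatureInfo();  // defined after Impl is complete, as unique_ptr<Impl> needs
  bool isSupported(int64_t flags) const;

 private:
  struct Impl;                  // only declared in the "header" part
  std::unique_ptr<Impl> impl_;  // the kind of member the comment refers to
};

// --- what would normally live in the .cc file ---
struct FeatureInfo::Impl {
  int64_t hardwareFlags = (1LL << 6);  // made-up value for illustration
};

FeatureInfo::FeatureInfo() : impl_(new Impl) {}
FeatureInfo::~FeatureInfo() = default;

// True only when every requested bit is present in the stored flags.
bool FeatureInfo::isSupported(int64_t flags) const {
  return (impl_->hardwareFlags & flags) == flags;
}
```

The trade-off under discussion is that the pimpl member keeps platform-specific state out of the header, whereas a pure interface would carry no data members at all.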




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1107113229


##########
CMakeLists.txt:
##########
@@ -157,6 +172,102 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(FATAL_ERROR "Unknown system processor")

Review Comment:
   Created a new CMake module, "cmake_modules/ConfigSimdLevel.cmake", to configure AVX512.




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1169463985


##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,2694 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"
+#include "CpuInfoUtil.hh"
+#include "RLEv2.hh"
+
+namespace orc {
+  UnpackAvx512::UnpackAvx512(RleDecoderV2* dec) : decoder(dec), unpackDefault(UnpackDefault(dec)) {
+    // PASS
+  }
+
+  UnpackAvx512::~UnpackAvx512() {
+    // PASS
+  }
+
+  inline void UnpackAvx512::alignHeaderBoundary(const uint32_t bitWidth, const uint32_t bitMaxSize,
+                                                uint64_t& startBit, uint64_t& bufMoveByteLen,
+                                                uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint64_t& tailBitLen, uint32_t& backupByteLen,
+                                                uint64_t& numElements, bool& resetBuf,
+                                                const uint8_t*& srcPtr, int64_t*& dstPtr) {
+    if (startBit != 0) {
+      bufMoveByteLen +=
+          moveByteLen(remainingNumElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+    } else {
+      bufMoveByteLen += moveByteLen(remainingNumElements * bitWidth);
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      numElements = remainingNumElements;
+      resetBuf = false;
+      remainingNumElements = 0;
+    } else {
+      uint64_t leadingBits = 0;
+      if (startBit != 0) leadingBits = ORC_VECTOR_BYTE_WIDTH - startBit;
+      uint64_t bufRestBitLen = bufRestByteLen * ORC_VECTOR_BYTE_WIDTH + leadingBits;
+      numElements = bufRestBitLen / bitWidth;
+      remainingNumElements -= numElements;
+      tailBitLen = fmod(bufRestBitLen, bitWidth);
+      resetBuf = true;
+    }
+
+    if (tailBitLen != 0) {
+      backupByteLen = tailBitLen / ORC_VECTOR_BYTE_WIDTH;
+      tailBitLen = 0;
+    }
+
+    if (startBit > 0) {
+      uint32_t align = getAlign(startBit, bitWidth, bitMaxSize);
+      if (align > numElements) {
+        align = numElements;
+      }
+      if (align != 0) {
+        bufMoveByteLen -= moveByteLen(align * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+        plainUnpackLongs(dstPtr, 0, align, bitWidth, startBit);
+        srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+        bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+        dstPtr += align;
+        numElements -= align;
+      }
+    }
+  }
+
+  inline void UnpackAvx512::alignTailerBoundary(const uint32_t bitWidth, uint64_t& startBit,
+                                                uint64_t& bufMoveByteLen, uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint32_t& backupByteLen, uint64_t& numElements,
+                                                bool& resetBuf, const uint8_t*& srcPtr,
+                                                int64_t*& dstPtr) {
+    if (numElements > 0) {
+      if (startBit != 0) {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+      } else {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth);
+      }
+      plainUnpackLongs(dstPtr, 0, numElements, bitWidth, startBit);
+      srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+      dstPtr += numElements;
+      bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufMoveByteLen,
+                                resetBuf, backupByteLen);
+      return;
+    }
+
+    if (backupByteLen != 0) {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                resetBuf, backupByteLen);
+      plainUnpackLongs(dstPtr, 0, 1, bitWidth, startBit);
+      dstPtr++;
+      backupByteLen = 0;
+      remainingNumElements--;
+    } else {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                resetBuf, backupByteLen);
+    }
+
+    bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bufMoveByteLen = 0;
+    srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+  }
+
+  void UnpackAvx512::vectorUnpack1(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 1;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __m512i reverseMask1u = _mm512_loadu_si512(reverseMaskTable1u);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          uint64_t src_64 = *reinterpret_cast<const uint64_t*>(srcPtr);
+          // convert mask to 512-bit register. 0 --> 0x00, 1 --> 0xFF
+          __m512i srcmm = _mm512_movm_epi8(src_64);
+          // make 0x00 --> 0x00, 0xFF --> 0x01
+          srcmm = _mm512_abs_epi8(srcmm);
+          srcmm = _mm512_shuffle_epi8(srcmm, reverseMask1u);
+          _mm512_storeu_si512(simdPtr, srcmm);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack2(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 2;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_MAX_16U;         // first 16 bytes (64 elements)
+        __m512i parse_mask = _mm512_set1_epi16(0x0303);  // 2 times 1 then (8 - 2) times 0
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm3 = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          __m512i srcmm0, srcmm1, srcmm2, tmpmm;
+
+          srcmm2 = _mm512_srli_epi16(srcmm3, 2);
+          srcmm1 = _mm512_srli_epi16(srcmm3, 4);
+          srcmm0 = _mm512_srli_epi16(srcmm3, 6);
+
+          // widen each 2-bit element to 8 bits (the extra bits are cleared by parse_mask below)
+          // and move the elements into their places
+          // srcmm0: a e i m 0 0 0 0 0 0 0 0 0 0 0 0
+          // srcmm1: b f j n 0 0 0 0 0 0 0 0 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm0, srcmm1);        // ab ef 00 00 00 00 00 00
+          srcmm0 = _mm512_unpackhi_epi8(srcmm0, srcmm1);       // ij mn 00 00 00 00 00 00
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x00);  // ab ef ab ef ij mn ij mn
+
+          // srcmm2: c g k o 0 0 0 0 0 0 0 0 0 0 0 0
+          // srcmm3: d h l p 0 0 0 0 0 0 0 0 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm2, srcmm3);        // cd gh 00 00 00 00 00 00
+          srcmm1 = _mm512_unpackhi_epi8(srcmm2, srcmm3);       // kl op 00 00 00 00 00 00
+          srcmm1 = _mm512_shuffle_i64x2(tmpmm, srcmm1, 0x00);  // cd gh cd gh kl op kl op
+
+          tmpmm = _mm512_unpacklo_epi16(srcmm0, srcmm1);        // abcd abcd ijkl ijkl
+          srcmm0 = _mm512_unpackhi_epi16(srcmm0, srcmm1);       // efgh efgh mnop mnop
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x88);   // abcd ijkl efgh mnop
+          srcmm0 = _mm512_shuffle_i64x2(srcmm0, srcmm0, 0xD8);  // abcd efgh ijkl mnop
+
+          srcmm0 = _mm512_and_si512(srcmm0, parse_mask);
+
+          _mm512_storeu_si512(simdPtr, srcmm0);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack3(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 3;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable3u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable3u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable3u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable3u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable3u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffle so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shift the elements so that each is aligned to the start of its word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack4(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 4;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_MAX_32U;        // first 32 bytes (64 elements)
+        __m512i parseMask = _mm512_set1_epi16(0x0F0F);  // 4 times 1 then (8 - 4) times 0
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm0, srcmm1, tmpmm;
+
+          srcmm1 = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm0 = _mm512_srli_epi16(srcmm1, 4);
+
+          // move elements into their places
+          // srcmm0: a c e g 0 0 0 0
+          // srcmm1: b d f h 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm0, srcmm1);         // ab ef 00 00
+          srcmm0 = _mm512_unpackhi_epi8(srcmm0, srcmm1);        // cd gh 00 00
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x44);   // ab ef cd gh
+          srcmm0 = _mm512_shuffle_i64x2(srcmm0, srcmm0, 0xD8);  // ab cd ef gh
+
+          // widen each 4-bit element to 8 bits by clearing the upper 4 bits of every byte
+          srcmm0 = _mm512_and_si512(srcmm0, parseMask);
+
+          _mm512_storeu_si512(simdPtr, srcmm0);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack5(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 5;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable5u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable5u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable5u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable5u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable5u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffle so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shift the elements so that each is aligned to the start of its word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack6(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 6;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable6u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable6u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable6u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable6u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable6u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi32(permutexIdx, srcmm);
+
+          // shuffle so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shift the elements so that each is aligned to the start of its word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack7(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 7;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable7u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable7u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable7u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable7u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable7u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffle so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shift the elements so that each is aligned to the start of its word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack9(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 9;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable9u_0);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable9u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable9u_1);
+
+        __m512i shiftMaskPtr[3];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable9u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable9u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable9u_2);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable9u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr);
+
+          // shift the elements so that each is aligned to the start of its word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[2]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permute so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shift the elements so that each is aligned to the start of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 7);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack10(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 10;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable10u_0);
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable10u);
+        __m512i shiftMask = _mm512_loadu_si512(shiftTable10u);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm;
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          zmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+          zmm = _mm512_shuffle_epi8(zmm, shuffleIdxPtr);
+
+          // shift the elements so that each is aligned to the start of its word
+          zmm = _mm512_srlv_epi16(zmm, shiftMask);
+          zmm = _mm512_and_si512(zmm, parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack11(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 11;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverse_mask_16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable11u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable11u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable11u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable11u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable11u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable11u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable11u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable11u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable11u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffle so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shift the elements so that each is aligned to the start of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4u);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 5);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverse_mask_16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack12(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 12;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable12u_0);
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable12u);
+        __m512i shiftMask = _mm512_loadu_si512(shiftTable12u);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm;
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          zmm = _mm512_permutexvar_epi32(permutexIdx, srcmm);
+          zmm = _mm512_shuffle_epi8(zmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm = _mm512_srlv_epi16(zmm, shiftMask);
+          zmm = _mm512_and_si512(zmm, parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack13(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 13;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverse_mask_16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable13u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable13u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable13u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable13u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable13u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable13u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable13u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable13u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable13u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 3);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverse_mask_16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack14(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 14;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable14u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable14u_1);
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable14u);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable14u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable14u_1);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack15(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 15;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable15u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable15u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable15u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable15u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable15u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable15u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable15u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable15u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable15u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 1);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack16(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 16;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = len;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    int64_t* dstPtr = data + offset;
+    bool resetBuf = false;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      bufMoveByteLen += moveByteLen(len * bitWidth);
+
+      if (bufMoveByteLen <= bufRestByteLen) {
+        numElements = len;
+        resetBuf = false;
+      } else {
+        numElements = bufRestByteLen * ORC_VECTOR_BYTE_WIDTH / bitWidth;
+        len -= numElements;
+        tailBitLen = fmod(bufRestByteLen * ORC_VECTOR_BYTE_WIDTH, bitWidth);
+        resetBuf = true;
+      }
+
+      if (tailBitLen != 0) {
+        backupByteLen = tailBitLen / ORC_VECTOR_BYTE_WIDTH;
+        tailBitLen = 0;
+      }
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __m512i reverse_mask_16u = _mm512_loadu_si512(reverseMaskTable16u);
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm = _mm512_loadu_si512(srcPtr);
+          srcmm = _mm512_shuffle_epi8(srcmm, reverse_mask_16u);
+          _mm512_storeu_si512(simdPtr, srcmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      if (numElements > 0) {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth);
+        unpackDefault.unrolledUnpack16(dstPtr, 0, numElements);
+        srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+        dstPtr += numElements;
+        bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+      }
+
+      if (bufMoveByteLen <= bufRestByteLen) {
+        decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufMoveByteLen,
+                                  resetBuf, backupByteLen);
+        return;

Review Comment:
   This means that the number of bytes still to be processed (bufMoveByteLen) does not exceed the number of bytes remaining in the buffer (bufRestByteLen), so all of the data is in the current buffer. After unpacking and resetting the buffer start, we can return directly.
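
   As a rough illustration of that check (a sketch, not code from the PR): the names bytesForBits, UnpackPlan and planUnpack are hypothetical, and bytesForBits only assumes that the PR's moveByteLen rounds a bit count up to whole bytes.

```cpp
#include <cstdint>

// Hypothetical helper (assumption): bytes needed to hold `bits` bits, rounded up.
inline uint64_t bytesForBits(uint64_t bits) {
  return (bits + 7) / 8;
}

// Hypothetical result type: what can be decoded from the current buffer.
struct UnpackPlan {
  uint64_t numElements;  // values that fit in the bytes still in the buffer
  bool needRefill;       // true when the run continues in the next buffer
};

inline UnpackPlan planUnpack(uint64_t len, uint32_t bitWidth, uint64_t bufRestByteLen) {
  const uint64_t bufMoveByteLen = bytesForBits(len * bitWidth);
  if (bufMoveByteLen <= bufRestByteLen) {
    // Everything requested is already in the buffer: decode, advance, return.
    return {len, false};
  }
  // Only part of the run is available: decode whole values, then refill.
  return {bufRestByteLen * 8 / bitWidth, true};
}
```

   With 16-bit values and 64 bytes left, a request for 32 values fits exactly and needs no refill, while a request for 100 values decodes 32 and then has to fetch the next buffer.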



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1173239709


##########
c++/src/RLEv2.hh:
##########
@@ -220,17 +251,40 @@ namespace orc {
 
     const std::unique_ptr<SeekableInputStream> inputStream;
     const bool isSigned;
-
     unsigned char firstByte;
-    uint64_t runLength;  // Length of the current run
-    uint64_t runRead;    // Number of returned values of the current run
-    const char* bufferStart;
-    const char* bufferEnd;
-    uint32_t bitsLeft;                  // Used by readLongs when bitSize < 8
-    uint32_t curByte;                   // Used by anything that uses readLongs
+    char* bufferStart;
+    char* bufferEnd;
+    uint64_t runLength;                 // Length of the current run
+    uint64_t runRead;                   // Number of returned values of the current run
+    uint32_t bitsLeft;  		// Used by readLongs when bitSize < 8
+    uint32_t curByte;   		// Used by anything that uses readLongs
     DataBuffer<int64_t> unpackedPatch;  // Used by PATCHED_BASE
     DataBuffer<int64_t> literals;       // Values of the current run
   };
+
+  inline void RleDecoderV2::resetBufferStart(uint64_t len, bool resetBuf, uint32_t backupByteLen) {
+    char* bufStart = getBufStart();
+    uint64_t remainingLen = bufLength();
+    int bufferLength = 0;
+    const void* bufferPointer = nullptr;
+
+    if (backupByteLen != 0) {
+      inputStream->BackUp(backupByteLen);
+    }
+
+    if (len >= remainingLen && resetBuf) {
+      if (!inputStream->Next(&bufferPointer, &bufferLength)) {
+        throw ParseError("bad read in RleDecoderV2::resetBufferStart");
+      }
+    }
+
+    if (bufferPointer == nullptr) {
+      setBufStart(bufStart + len);
+    } else {
+      setBufStart(const_cast<char*>(static_cast<const char*>(bufferPointer)));

Review Comment:
   Sorry for that. Fixed.




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1107120154


##########
CMakeLists.txt:
##########
@@ -87,6 +91,17 @@ if (BUILD_POSITION_INDEPENDENT_LIB)
   set(CMAKE_POSITION_INDEPENDENT_CODE ON)
 endif ()
 
+if(NOT DEFINED ORC_SIMD_LEVEL)

Review Comment:
   Fixed. Deleted ORC_RUNTIME_SIMD_LEVEL.



##########
c++/src/BitUnpackerAvx512.hh:
##########
@@ -0,0 +1,488 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef VECTOR_DECODER_HH

Review Comment:
   Fixed.




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1109828191


##########
c++/test/TestRleVectorDecoder.cc:
##########
@@ -106,7 +106,12 @@ namespace orc {
     int32_t lpad = offset * BARWIDTH / total;
     int32_t rpad = BARWIDTH - lpad;
 
-    printf("\r%s:%3d%% [%.*s%*s] [%ld /%ld]", testName, val, lpad, BARSTR, rpad, "", offset, total);
+#ifdef __APPLE__
+    printf("\r%s:%3d%% [%.*s%*s] [%lld/%lld]", testName, val, lpad, BARSTR, rpad, "", offset,

Review Comment:
   OK, thank you for your reminder.
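
   The platform split in the hunk above is presumably needed because a 64-bit integer is `long` on typical 64-bit Linux but `long long` on macOS. One portable alternative, shown only as a sketch and not as what the PR does, is the PRId64 macro from <cinttypes>. The function name formatProgress is hypothetical, and the sketch assumes offset and total are int64_t.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: formats the "[offset/total]" part of a progress line
// without a per-platform #ifdef. PRId64 expands to the correct printf length
// modifier for int64_t on the target platform.
inline std::string formatProgress(int64_t offset, int64_t total) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "[%" PRId64 "/%" PRId64 "]", offset, total);
  return std::string(buf);
}
```

   The same macro can be spliced into the existing printf format string, so a single call would serve both platforms.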




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1141576220


##########
c++/src/RleDecoderV2.cc:
##########
@@ -17,26 +17,32 @@
  */
 
 #include "Adaptor.hh"
+// #include "Bpacking.hh"

Review Comment:
   Done.




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1144370809


##########
c++/src/Bpacking.hh:
##########
@@ -0,0 +1,32 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKING_HH
+#define ORC_BPACKING_HH
+
+#include <cstdint>
+
+namespace orc {
+  class BitUnpack {
+   public:
+    static int readLongs(RleDecoderV2* decoder, int64_t* data, uint64_t offset, uint64_t len,

Review Comment:
   Done.
   https://github.com/wpleonardo/orc/blob/f053f9c73bf13fe29aff95cfe4cb71857c57da07/c%2B%2B/src/Bpacking.hh#L25




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1144373730


##########
c++/src/CpuInfoUtil.hh:
##########
@@ -0,0 +1,113 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * @file CpuInfoUtil.hh code borrowing from
+ * https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/cpu_info.h
+ * @file CpuInfoUtil.cc code borrowing from
+ * https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/cpu_info.cc

Review Comment:
   Done




[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1455297617

   Hi @wgtmac, I have created a new CI action workflow in .github/workflows/build_and_test.yml. Below is the detailed information about this change:
   1. Added 2 new workflows: simdUbuntu (ubuntu-20.04 & ubuntu-22.04) and simdWindows (windows-2019). In these new workflows, the option BUILD_ENABLE_AVX512 is set to "ON" in the cmake command, and the environment variable ORC_USER_SIMD_LEVEL is set to avx512.
   2. The default value of the cmake option BUILD_ENABLE_AVX512 is changed back to "OFF". This means the original CI action workflow does not enable the AVX512 feature; it is enabled only in the new workflows.
   3. If users set the cmake option BUILD_ENABLE_AVX512=ON and the current machine doesn't support AVX512, cmake reports a fatal error and the configuration process stops.
   4. I found that not all of our community CI test machines support the AVX512 CPU flags, so it would be better to make sure the newly added workflows run on machines that support AVX512.
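
The conditions listed above (build option, CPU support, and the ORC_USER_SIMD_LEVEL setting) can be sketched in C++ roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `SimdLevel`, `parseSimdLevel`, `effectiveSimdLevel`, and `simdLevelFromEnvironment` are hypothetical names, the environment variable is assumed to act as an upper bound on the level used, and CPU support is passed in as a flag instead of being detected.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical SIMD levels; the enum used by the PR may differ.
enum class SimdLevel { NONE, AVX512 };

// Map the text of the environment variable to a level.
// Assumption: an unset or unrecognized value means "no user restriction".
inline SimdLevel parseSimdLevel(const char* value) {
  if (value == nullptr) {
    return SimdLevel::AVX512;
  }
  const std::string text(value);
  if (text == "none" || text == "NONE") {
    return SimdLevel::NONE;
  }
  return SimdLevel::AVX512;
}

// AVX-512 is selected only when the build enabled it, the CPU reports it,
// and the user has not capped the level below it.
inline SimdLevel effectiveSimdLevel(bool builtWithAvx512, bool cpuHasAvx512, SimdLevel userCap) {
  if (builtWithAvx512 && cpuHasAvx512 && userCap == SimdLevel::AVX512) {
    return SimdLevel::AVX512;
  }
  return SimdLevel::NONE;
}

// Convenience wrapper that reads ORC_USER_SIMD_LEVEL from the environment.
inline SimdLevel simdLevelFromEnvironment(bool builtWithAvx512, bool cpuHasAvx512) {
  const SimdLevel cap = parseSimdLevel(std::getenv("ORC_USER_SIMD_LEVEL"));
  return effectiveSimdLevel(builtWithAvx512, cpuHasAvx512, cap);
}

// Usage example: a machine without AVX-512 falls back even if the build enabled it.
inline void exampleSimdLevelChecks() {
  assert(effectiveSimdLevel(true, false, SimdLevel::AVX512) == SimdLevel::NONE);
  assert(effectiveSimdLevel(true, true, SimdLevel::AVX512) == SimdLevel::AVX512);
}
```

Keeping the three inputs separate makes it clear why a CI runner without AVX512 flags would not exercise the vectorized path even when the workflow sets both the cmake option and the environment variable.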



[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1537023637

   Thank you very much for your help, Gang! ^_^
   
   
   B&R,
   Wang Peng
   
   
   
   
   
   
   
   At 2023-05-06 09:52:08, "Gang Wu" ***@***.***> wrote:
   
   @wgtmac approved this pull request.
   
   I will merge it by the end of this week if no further comment.
   



[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1407234458

   > Gentle ping, @wpleonardo .
   
   Sorry, I have been on holiday for the past few days. I will be back at work and will follow up on your suggestions in the next few days.
   Thank you very much!



[GitHub] [orc] wgtmac commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by GitBox <gi...@apache.org>.

wgtmac commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1378885855

   Welcome to the Apache ORC community! @wpleonardo 
   
   This feature looks promising. Will take a look this week.
   
   cc @stiga-huang @coderex2522 



[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1429238452

   Thank you very much for your suggestions. I'm fixing it now. 



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1138645109


##########
CMakeLists.txt:
##########
@@ -169,6 +173,9 @@ enable_testing()
 
 INCLUDE(CheckSourceCompiles)
 INCLUDE(ThirdpartyToolchain)
+if (BUILD_ENABLE_AVX512 AND NOT APPLE)

Review Comment:
   OK, Added.



##########
cmake_modules/ConfigSimdLevel.cmake:
##########
@@ -0,0 +1,105 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+INCLUDE(CheckCXXCompilerFlag)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_SIMD_LEVEL)
+  set(ORC_SIMD_LEVEL
+      "DEFAULT"
+      CACHE STRING "Compile time SIMD optimization level")
+endif()
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(STATUS "Unknown system processor")

Review Comment:
   Done.




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1138671692


##########
c++/src/RleDecoderV2.cc:
##########
@@ -17,26 +17,31 @@
  */
 
 #include "Adaptor.hh"
+#include "Bpacking.hh"
 #include "Compression.hh"
+#include "Dispatch.hh"
 #include "RLEV2Util.hh"
 #include "RLEv2.hh"
 #include "Utils.hh"
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+#include "BpackingAvx512.hh"
+#endif

Review Comment:
   https://github.com/wpleonardo/orc/blob/fe5b6c7a29721bb5d8c4699a0b072d64555d600d/c%2B%2B/src/RleDecoderV2.cc#L78
   The function linked above has to decide which implementation is added to the vector. When AVX512 is enabled, this header file is needed because it provides the AVX512 bit-unpacking function.
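
A rough sketch of why the conditional include matters: the candidate list can only contain an AVX512 entry if that function is visible at compile time. The names below (`Candidate`, `candidates`, `resolve`, `unpackDefault`, `unpackAvx512`, `DispatchLevel`) are hypothetical stand-ins chosen for illustration and are not taken from the PR; only the macro ORC_HAVE_RUNTIME_AVX512 comes from the diff.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the real unpacking routines.
inline int unpackDefault(const uint8_t*, int64_t*, uint64_t) { return 0; }
#if defined(ORC_HAVE_RUNTIME_AVX512)
// Only visible when the build enables AVX-512, mirroring the guarded include.
inline int unpackAvx512(const uint8_t*, int64_t*, uint64_t) { return 1; }
#endif

using UnpackFn = int (*)(const uint8_t*, int64_t*, uint64_t);

enum class DispatchLevel { NONE = 0, AVX512 = 1 };

struct Candidate {
  DispatchLevel level;
  UnpackFn fn;
};

// Build the candidate list; the AVX-512 entry exists only if it was compiled in.
inline std::vector<Candidate> candidates() {
  std::vector<Candidate> list;
  list.push_back({DispatchLevel::NONE, &unpackDefault});
#if defined(ORC_HAVE_RUNTIME_AVX512)
  list.push_back({DispatchLevel::AVX512, &unpackAvx512});
#endif
  return list;
}

// Pick the highest-level candidate that does not exceed what the CPU supports.
inline UnpackFn resolve(const std::vector<Candidate>& list, DispatchLevel cpuLevel) {
  assert(!list.empty());
  const Candidate* best = nullptr;
  for (const Candidate& c : list) {
    if (c.level <= cpuLevel && (best == nullptr || c.level > best->level)) {
      best = &c;
    }
  }
  assert(best != nullptr);  // the default entry always qualifies
  return best->fn;
}
```

With this shape, a build without the macro still compiles and always resolves to the default routine, while a build with it chooses at run time.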




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1138687810


##########
c++/src/CMakeLists.txt:
##########
@@ -184,7 +184,11 @@ set(SOURCE_FILES
   Timezone.cc
   TypeImpl.cc
   Vector.cc
-  Writer.cc)
+  Writer.cc
+  CpuInfoUtil.cc
+  BpackingDefault.cc
+  BpackingAvx512.cc
+  Bpacking.cc)

Review Comment:
   I have already changed which source files are built when BUILD_ENABLE_AVX512 is true, and also changed the definitions of the functions BitUnpack::readLongs, BitUnpackDefault::readLongs, and BitUnpackAVX512::readLongs.
   https://github.com/wpleonardo/orc/blob/fe5b6c7a29721bb5d8c4699a0b072d64555d600d/c%2B%2B/src/RleDecoderV2.cc#L75
   https://github.com/wpleonardo/orc/blob/fe5b6c7a29721bb5d8c4699a0b072d64555d600d/c%2B%2B/src/Bpacking.hh#L29
   https://github.com/wpleonardo/orc/blob/fe5b6c7a29721bb5d8c4699a0b072d64555d600d/c%2B%2B/src/BpackingDefault.hh#L55
   https://github.com/wpleonardo/orc/blob/fe5b6c7a29721bb5d8c4699a0b072d64555d600d/c%2B%2B/src/BpackingAvx512.hh#L87




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1138687810


##########
c++/src/CMakeLists.txt:
##########
@@ -184,7 +184,11 @@ set(SOURCE_FILES
   Timezone.cc
   TypeImpl.cc
   Vector.cc
-  Writer.cc)
+  Writer.cc
+  CpuInfoUtil.cc
+  BpackingDefault.cc
+  BpackingAvx512.cc
+  Bpacking.cc)

Review Comment:
   https://github.com/wpleonardo/orc/blob/fe5b6c7a29721bb5d8c4699a0b072d64555d600d/c%2B%2B/src/CMakeLists.txt#L196
   I have already changed which source files are built when BUILD_ENABLE_AVX512 is true, and also changed the definitions of the functions BitUnpack::readLongs, BitUnpackDefault::readLongs, and BitUnpackAVX512::readLongs.
   https://github.com/wpleonardo/orc/blob/fe5b6c7a29721bb5d8c4699a0b072d64555d600d/c%2B%2B/src/RleDecoderV2.cc#L75




[GitHub] [orc] wgtmac commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1472985002

   I added a license check workflow yesterday. Simply copying the license header from other source files to the new files you mentioned should solve the problem.



[GitHub] [orc] wgtmac commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1139740147


##########
.github/workflows/build_and_test.yml:
##########
@@ -91,6 +91,49 @@ jobs:
         cmake --build . --config Debug
         ctest -C Debug --output-on-failure
 
+  simdUbuntu:
+    name: "SIMD programming using C++ intrinsic functions on ${{ matrix.os }}"
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os:
+          - ubuntu-22.04
+        cxx:
+          - clang++
+    env:
+      ORC_USER_SIMD_LEVEL: avx512

Review Comment:
   Is it case-insensitive? I prefer uppercase here.
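
Whether the value is case-insensitive depends on how the library parses it. One common approach, shown here as an assumption and not as what this PR does, is to normalize the string to uppercase before comparing. `toUpperAscii` and `isAvx512Level` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: return an uppercase copy of an ASCII string.
inline std::string toUpperAscii(std::string text) {
  std::transform(text.begin(), text.end(), text.begin(),
                 [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
  return text;
}

// True when the user-supplied value names the AVX-512 level, ignoring case.
inline bool isAvx512Level(const std::string& value) {
  return toUpperAscii(value) == "AVX512";
}

// Usage example: both spellings from the workflow discussion are accepted.
inline void exampleCaseChecks() {
  assert(isAvx512Level("avx512"));
  assert(isAvx512Level("AVX512"));
}
```

If the parser compares the raw string instead, only one spelling would be accepted, so the workflow value would have to match it exactly.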



##########
.github/workflows/build_and_test.yml:
##########
@@ -91,6 +91,49 @@ jobs:
         cmake --build . --config Debug
         ctest -C Debug --output-on-failure
 
+  simdUbuntu:
+    name: "SIMD programming using C++ intrinsic functions on ${{ matrix.os }}"
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os:
+          - ubuntu-22.04
+        cxx:
+          - clang++
+    env:
+      ORC_USER_SIMD_LEVEL: avx512
+    steps:
+    - name: Checkout
+      uses: actions/checkout@v2
+    - name: "Test"
+      run: |
+        mkdir -p ~/.m2
+        mkdir build
+        cd build
+        cmake -DBUILD_JAVA=OFF -DBUILD_ENABLE_AVX512=ON ..
+        make package test-out
+
+  simdWindows:
+    name: "SIMD programming using C++ intrinsic functions on Windows"
+    runs-on: windows-2019
+    env:
+      ORC_USER_SIMD_LEVEL: avx512

Review Comment:
   ditto



##########
c++/src/BpackingAvx512.hh:
##########
@@ -0,0 +1,93 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKINGAVX512_HH
+#define ORC_BPACKINGAVX512_HH
+
+#include <stdlib.h>
+#include <cstdint>
+
+#include "BpackingDefault.hh"
+#include "Dispatch.hh"
+#include "RLEv2.hh"

Review Comment:
   We'd better use forward declarations and remove unnecessary includes.
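
As a minimal sketch of the suggestion: a header that only passes a class by pointer or reference can forward-declare it instead of including its header. Everything below lives in a hypothetical `sketch` namespace, and `BitUnpackSketch` and the simplified `RleDecoderV2` body are illustrative stand-ins, not the real ORC classes.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {
  // Forward declaration: sufficient for pointer parameters, no header needed.
  class RleDecoderV2;

  // What the header would contain: only the incomplete type is required here.
  class BitUnpackSketch {
   public:
    static uint64_t readLongs(RleDecoderV2* decoder, int64_t* data, uint64_t offset,
                              uint64_t len);
  };

  // What the .cc file would see after including the full definition.
  // This simplified decoder just yields 0, 1, 2, ... for illustration.
  class RleDecoderV2 {
   public:
    int64_t next() { return counter_++; }

   private:
    int64_t counter_ = 0;
  };

  // The definition needs the complete type because it calls a member function.
  inline uint64_t BitUnpackSketch::readLongs(RleDecoderV2* decoder, int64_t* data,
                                             uint64_t offset, uint64_t len) {
    assert(decoder != nullptr && data != nullptr);
    for (uint64_t i = 0; i < len; ++i) {
      data[offset + i] = decoder->next();
    }
    return len;
  }
}  // namespace sketch
```

The benefit is that files including the header no longer pull in the decoder's header transitively, which shortens rebuilds when that header changes.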



##########
c++/src/BitUnpackerAvx512.hh:
##########
@@ -0,0 +1,488 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BIT_UNPACKER_AVX512_HH
+#define ORC_BIT_UNPACKER_AVX512_HH
+
+// Mingw-w64 defines strcasecmp in string.h
+#if defined(_WIN32) && !defined(strcasecmp)
+#include <string.h>
+#define strcasecmp stricmp
+#else
+#include <strings.h>
+#endif
+
+#include <immintrin.h>
+#include <cstdint>
+#include <vector>
+
+namespace orc {
+#define ORC_VECTOR_BITS_2_BYTE(x) \
+  (((x) + 7u) >> 3u) /**< Convert a number of bits to a number of bytes */
+#define ORC_VECTOR_ONE_64U (1ULL)
+#define ORC_VECTOR_MAX_16U 0xFFFF     /**< Max value for uint16_t */
+#define ORC_VECTOR_MAX_32U 0xFFFFFFFF /**< Max value for uint32_t */
+#define ORC_VECTOR_BYTE_WIDTH 8u      /**< Byte width in bits */
+#define ORC_VECTOR_WORD_WIDTH 16u     /**< Word width in bits */
+#define ORC_VECTOR_DWORD_WIDTH 32u    /**< Dword width in bits */
+#define ORC_VECTOR_QWORD_WIDTH 64u    /**< Qword width in bits */
+#define ORC_VECTOR_BIT_MASK(x) \
+  ((ORC_VECTOR_ONE_64U << (x)) - 1u) /**< Bit mask below bit position */
+
+#define ORC_VECTOR_BITS_2_WORD(x) \
+  (((x) + 15u) >> 4u) /**< Convert a number of bits to a number of words */
+#define ORC_VECTOR_BITS_2_DWORD(x) \
+  (((x) + 31u) >> 5u) /**< Convert a number of bits to a number of double words */
+
+  // ------------------------------------ 3u -----------------------------------------
+  static uint8_t shuffleIdxTable3u_0[64] = {
+      1u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 4u, 3u, 5u, 4u, 6u, 5u, 1u, 0u, 1u, 0u, 2u, 1u,
+      3u, 2u, 4u, 3u, 4u, 3u, 5u, 4u, 6u, 5u, 1u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 4u, 3u,
+      5u, 4u, 6u, 5u, 1u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 4u, 3u, 5u, 4u, 6u, 5u};
+  static uint8_t shuffleIdxTable3u_1[64] = {
+      0u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u, 0u, 0u, 1u, 0u, 2u, 1u,
+      3u, 2u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u, 0u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 3u, 2u, 4u, 3u,
+      5u, 4u, 6u, 5u, 0u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u};
+  static uint16_t shiftTable3u_0[32] = {13u, 7u,  9u,  11u, 13u, 7u,  9u,  11u, 13u, 7u,  9u,
+                                        11u, 13u, 7u,  9u,  11u, 13u, 7u,  9u,  11u, 13u, 7u,
+                                        9u,  11u, 13u, 7u,  9u,  11u, 13u, 7u,  9u,  11u};
+  static uint16_t shiftTable3u_1[32] = {6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u,
+                                        0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u,
+                                        2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u};
+  static uint16_t permutexIdxTable3u[32] = {0u,  1u,  2u,  0x0, 0x0, 0x0, 0x0, 0x0, 3u,  4u,  5u,
+                                            0x0, 0x0, 0x0, 0x0, 0x0, 6u,  7u,  8u,  0x0, 0x0, 0x0,
+                                            0x0, 0x0, 9u,  10u, 11u, 0x0, 0x0, 0x0, 0x0, 0x0};
+
+  // ------------------------------------ 5u -----------------------------------------
+  static uint8_t shuffleIdxTable5u_0[64] = {
+      1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u,
+      4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u,
+      8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u};
+  static uint8_t shuffleIdxTable5u_1[64] = {
+      1u, 0u, 2u,  1u, 3u, 2u, 5u, 4u, 6u,  5u, 7u, 6u, 8u, 7u, 10u, 9u, 1u, 0u, 2u,  1u, 3u, 2u,
+      5u, 4u, 6u,  5u, 7u, 6u, 8u, 7u, 10u, 9u, 1u, 0u, 2u, 1u, 3u,  2u, 5u, 4u, 6u,  5u, 7u, 6u,
+      8u, 7u, 10u, 9u, 1u, 0u, 2u, 1u, 3u,  2u, 5u, 4u, 6u, 5u, 7u,  6u, 8u, 7u, 10u, 9u};
+  static uint16_t shiftTable5u_0[32] = {11u, 9u,  7u,  5u, 11u, 9u,  7u,  5u, 11u, 9u,  7u,
+                                        5u,  11u, 9u,  7u, 5u,  11u, 9u,  7u, 5u,  11u, 9u,
+                                        7u,  5u,  11u, 9u, 7u,  5u,  11u, 9u, 7u,  5u};
+  static uint16_t shiftTable5u_1[32] = {2u, 4u, 6u, 0u, 2u, 4u, 6u, 0u, 2u, 4u, 6u,
+                                        0u, 2u, 4u, 6u, 0u, 2u, 4u, 6u, 0u, 2u, 4u,
+                                        6u, 0u, 2u, 4u, 6u, 0u, 2u, 4u, 6u, 0u};
+  static uint16_t permutexIdxTable5u[32] = {0u,  1u,  2u,  3u,  4u,  0x0, 0x0, 0x0, 5u,  6u,  7u,
+                                            8u,  9u,  0x0, 0x0, 0x0, 10u, 11u, 12u, 13u, 14u, 0x0,
+                                            0x0, 0x0, 15u, 16u, 17u, 18u, 19u, 0x0, 0x0, 0x0};
+
+  // ------------------------------------ 6u -----------------------------------------
+  static uint8_t shuffleIdxTable6u_0[64] = {
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u};
+  static uint8_t shuffleIdxTable6u_1[64] = {
+      1u, 0u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 9u, 8u, 10u, 9u, 12u, 11u,
+      1u, 0u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 9u, 8u, 10u, 9u, 12u, 11u,
+      1u, 0u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 9u, 8u, 10u, 9u, 12u, 11u,
+      1u, 0u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 9u, 8u, 10u, 9u, 12u, 11u};
+  static uint16_t shiftTable6u_0[32] = {10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u,  10u,
+                                        6u,  10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u,
+                                        10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u};
+  static uint16_t shiftTable6u_1[32] = {4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u,
+                                        0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u,
+                                        4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u};
+  static uint32_t permutexIdxTable6u[16] = {0u, 1u, 2u, 0x0, 3u, 4u,  5u,  0x0,
+                                            6u, 7u, 8u, 0x0, 9u, 10u, 11u, 0x0};
+
+  // ------------------------------------ 7u -----------------------------------------
+  static uint8_t shuffleIdxTable7u_0[64] = {
+      1u, 0u, 2u, 1u, 4u, 3u, 6u, 5u, 8u, 7u, 9u, 8u, 11u, 10u, 13u, 12u,
+      1u, 0u, 2u, 1u, 4u, 3u, 6u, 5u, 8u, 7u, 9u, 8u, 11u, 10u, 13u, 12u,
+      1u, 0u, 2u, 1u, 4u, 3u, 6u, 5u, 8u, 7u, 9u, 8u, 11u, 10u, 13u, 12u,
+      1u, 0u, 2u, 1u, 4u, 3u, 6u, 5u, 8u, 7u, 9u, 8u, 11u, 10u, 13u, 12u};
+  static uint8_t shuffleIdxTable7u_1[64] = {
+      1u, 0u, 3u, 2u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 12u, 11u, 14u, 13u,
+      1u, 0u, 3u, 2u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 12u, 11u, 14u, 13u,
+      1u, 0u, 3u, 2u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 12u, 11u, 14u, 13u,
+      1u, 0u, 3u, 2u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 12u, 11u, 14u, 13u};
+  static uint16_t shiftTable7u_0[32] = {9u, 3u, 5u, 7u, 9u, 3u, 5u, 7u, 9u, 3u, 5u,
+                                        7u, 9u, 3u, 5u, 7u, 9u, 3u, 5u, 7u, 9u, 3u,
+                                        5u, 7u, 9u, 3u, 5u, 7u, 9u, 3u, 5u, 7u};
+  static uint16_t shiftTable7u_1[32] = {6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u,
+                                        0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u,
+                                        2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u};
+  static uint16_t permutexIdxTable7u[32] = {0u,  1u,  2u,  3u,  4u,  5u,  6u,  0x0, 7u,  8u,  9u,
+                                            10u, 11u, 12u, 13u, 0x0, 14u, 15u, 16u, 17u, 18u, 19u,
+                                            20u, 0x0, 21u, 22u, 23u, 24u, 25u, 26u, 27u, 0x0};
+
+  // ------------------------------------ 9u -----------------------------------------
+  static uint16_t permutexIdxTable9u_0[32] = {0u,  1u,  1u,  2u,  2u,  3u,  3u,  4u,  4u,  5u,  5u,
+                                              6u,  6u,  7u,  7u,  8u,  9u,  10u, 10u, 11u, 11u, 12u,
+                                              12u, 13u, 13u, 14u, 14u, 15u, 15u, 16u, 16u, 17u};
+  static uint16_t permutexIdxTable9u_1[32] = {0u,  1u,  1u,  2u,  2u,  3u,  3u,  4u,  5u,  6u,  6u,
+                                              7u,  7u,  8u,  8u,  9u,  9u,  10u, 10u, 11u, 11u, 12u,
+                                              12u, 13u, 14u, 15u, 15u, 16u, 16u, 17u, 17u, 18u};
+  static uint32_t shiftTable9u_0[16] = {0u, 2u, 4u, 6u, 8u, 10u, 12u, 14u,
+                                        0u, 2u, 4u, 6u, 8u, 10u, 12u, 14u};
+  static uint32_t shiftTable9u_1[16] = {7u, 5u, 3u, 1u, 15u, 13u, 11u, 9u,
+                                        7u, 5u, 3u, 1u, 15u, 13u, 11u, 9u};
+
+  static uint8_t shuffleIdxTable9u_0[64] = {
+      1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u, 7u, 6u, 8u, 7u, 1u, 0u, 2u, 1u, 3u, 2u,
+      4u, 3u, 5u, 4u, 6u, 5u, 7u, 6u, 8u, 7u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u,
+      7u, 6u, 8u, 7u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u, 7u, 6u, 8u, 7u};
+  static uint16_t shiftTable9u_2[32] = {7u, 6u, 5u, 4u, 3u, 2u, 1u, 0u, 7u, 6u, 5u,
+                                        4u, 3u, 2u, 1u, 0u, 7u, 6u, 5u, 4u, 3u, 2u,
+                                        1u, 0u, 7u, 6u, 5u, 4u, 3u, 2u, 1u, 0u};
+  static uint64_t gatherIdxTable9u[8] = {0u, 8u, 9u, 17u, 18u, 26u, 27u, 35u};
+
+  // ------------------------------------ 10u -----------------------------------------
+  static uint8_t shuffleIdxTable10u_0[64] = {
+      1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u,
+      4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u,
+      8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u};
+  static uint16_t shiftTable10u[32] = {6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u,
+                                       0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u,
+                                       2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u};
+  static uint16_t permutexIdxTable10u[32] = {0u,  1u,  2u,  3u,  4u,  0x0, 0x0, 0x0, 5u,  6u,  7u,
+                                             8u,  9u,  0x0, 0x0, 0x0, 10u, 11u, 12u, 13u, 14u, 0x0,
+                                             0x0, 0x0, 15u, 16u, 17u, 18u, 19u, 0x0, 0x0, 0x0};
+
+  // ------------------------------------ 11u -----------------------------------------
+  static uint16_t permutexIdxTable11u_0[32] = {
+      0u,  1u,  1u,  2u,  2u,  3u,  4u,  5u,  5u,  6u,  6u,  7u,  8u,  9u,  9u,  10u,
+      11u, 12u, 12u, 13u, 13u, 14u, 15u, 16u, 16u, 17u, 17u, 18u, 19u, 20u, 20u, 21u};
+  static uint16_t permutexIdxTable11u_1[32] = {
+      0u,  1u,  2u,  3u,  3u,  4u,  4u,  5u,  6u,  7u,  7u,  8u,  8u,  9u,  10u, 11u,
+      11u, 12u, 13u, 14u, 14u, 15u, 15u, 16u, 17u, 18u, 18u, 19u, 19u, 20u, 21u, 22u};
+  static uint32_t shiftTable11u_0[16] = {0u, 6u, 12u, 2u, 8u, 14u, 4u, 10u,
+                                         0u, 6u, 12u, 2u, 8u, 14u, 4u, 10u};
+  static uint32_t shiftTable11u_1[16] = {5u, 15u, 9u, 3u, 13u, 7u, 1u, 11u,
+                                         5u, 15u, 9u, 3u, 13u, 7u, 1u, 11u};
+
+  static uint8_t shuffleIdxTable11u_0[64] = {
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u};
+  static uint8_t shuffleIdxTable11u_1[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 8u, 7u, 6u, 0u, 11u, 10u, 9u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 8u, 7u, 6u, 0u, 11u, 10u, 9u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 8u, 7u, 6u, 0u, 11u, 10u, 9u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 8u, 7u, 6u, 0u, 11u, 10u, 9u, 0u};
+  static uint32_t shiftTable11u_2[16] = {21u, 15u, 17u, 19u, 21u, 15u, 17u, 19u,
+                                         21u, 15u, 17u, 19u, 21u, 15u, 17u, 19u};
+  static uint32_t shiftTable11u_3[16] = {6u, 4u, 10u, 8u, 6u, 4u, 10u, 8u,
+                                         6u, 4u, 10u, 8u, 6u, 4u, 10u, 8u};
+  static uint64_t gatherIdxTable11u[8] = {0u, 8u, 11u, 19u, 22u, 30u, 33u, 41u};
+
+  // ------------------------------------ 12u -----------------------------------------
+  static uint8_t shuffleIdxTable12u_0[64] = {
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u};
+  static uint16_t shiftTable12u[32] = {4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u,
+                                       0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u,
+                                       4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u};
+  static uint32_t permutexIdxTable12u[16] = {0u, 1u, 2u, 0x0, 3u, 4u,  5u,  0x0,
+                                             6u, 7u, 8u, 0x0, 9u, 10u, 11u, 0x0};
+
+  // ------------------------------------ 13u -----------------------------------------
+  static uint16_t permutexIdxTable13u_0[32] = {
+      0u,  1u,  1u,  2u,  3u,  4u,  4u,  5u,  6u,  7u,  8u,  9u,  9u,  10u, 11u, 12u,
+      13u, 14u, 14u, 15u, 16u, 17u, 17u, 18u, 19u, 20u, 21u, 22u, 22u, 23u, 24u, 25u};
+  static uint16_t permutexIdxTable13u_1[32] = {
+      0u,  1u,  2u,  3u,  4u,  5u,  5u,  6u,  7u,  8u,  8u,  9u,  10u, 11u, 12u, 13u,
+      13u, 14u, 15u, 16u, 17u, 18u, 18u, 19u, 20u, 21u, 21u, 22u, 23u, 24u, 25u, 26u};
+  static uint32_t shiftTable13u_0[16] = {0u, 10u, 4u, 14u, 8u, 2u, 12u, 6u,
+                                         0u, 10u, 4u, 14u, 8u, 2u, 12u, 6u};
+  static uint32_t shiftTable13u_1[16] = {3u, 9u, 15u, 5u, 11u, 1u, 7u, 13u,
+                                         3u, 9u, 15u, 5u, 11u, 1u, 7u, 13u};
+
+  static uint8_t shuffleIdxTable13u_0[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u};
+  static uint8_t shuffleIdxTable13u_1[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 10u, 9u, 8u, 0u, 13u, 12u, 11u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 10u, 9u, 8u, 0u, 13u, 12u, 11u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 10u, 9u, 8u, 0u, 13u, 12u, 11u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 10u, 9u, 8u, 0u, 13u, 12u, 11u, 0u};
+  static uint32_t shiftTable13u_2[16] = {19u, 17u, 15u, 13u, 19u, 17u, 15u, 13u,
+                                         19u, 17u, 15u, 13u, 19u, 17u, 15u, 13u};
+  static uint32_t shiftTable13u_3[16] = {10u, 12u, 6u, 8u, 10u, 12u, 6u, 8u,
+                                         10u, 12u, 6u, 8u, 10u, 12u, 6u, 8u};
+  static uint64_t gatherIdxTable13u[8] = {0u, 8u, 13u, 21u, 26u, 34u, 39u, 47u};
+
+  // ------------------------------------ 14u -----------------------------------------
+  static uint8_t shuffleIdxTable14u_0[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u};
+  static uint8_t shuffleIdxTable14u_1[64] = {
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 10u, 9u, 8u, 0u, 14u, 13u, 12u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 10u, 9u, 8u, 0u, 14u, 13u, 12u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 10u, 9u, 8u, 0u, 14u, 13u, 12u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 10u, 9u, 8u, 0u, 14u, 13u, 12u, 0u};
+  static uint32_t shiftTable14u_0[16] = {18u, 14u, 18u, 14u, 18u, 14u, 18u, 14u,
+                                         18u, 14u, 18u, 14u, 18u, 14u, 18u, 14u};
+  static uint32_t shiftTable14u_1[16] = {12u, 8u, 12u, 8u, 12u, 8u, 12u, 8u,
+                                         12u, 8u, 12u, 8u, 12u, 8u, 12u, 8u};
+  static uint16_t permutexIdxTable14u[32] = {0u,  1u,  2u,  3u,  4u,  5u,  6u,  0x0, 7u,  8u,  9u,
+                                             10u, 11u, 12u, 13u, 0x0, 14u, 15u, 16u, 17u, 18u, 19u,
+                                             20u, 0x0, 21u, 22u, 23u, 24u, 25u, 26u, 27u, 0x0};
+
+  // ------------------------------------ 15u -----------------------------------------
+  static uint16_t permutexIdxTable15u_0[32] = {
+      0u,  1u,  1u,  2u,  3u,  4u,  5u,  6u,  7u,  8u,  9u,  10u, 11u, 12u, 13u, 14u,
+      15u, 16u, 16u, 17u, 18u, 19u, 20u, 21u, 22u, 23u, 24u, 25u, 26u, 27u, 28u, 29u};
+  static uint16_t permutexIdxTable15u_1[32] = {
+      0u,  1u,  2u,  3u,  4u,  5u,  6u,  7u,  8u,  9u,  10u, 11u, 12u, 13u, 14u, 15u,
+      15u, 16u, 17u, 18u, 19u, 20u, 21u, 22u, 23u, 24u, 25u, 26u, 27u, 28u, 29u, 30u};
+  static uint32_t shiftTable15u_0[16] = {0u, 14u, 12u, 10u, 8u, 6u, 4u, 2u,
+                                         0u, 14u, 12u, 10u, 8u, 6u, 4u, 2u};
+  static uint32_t shiftTable15u_1[16] = {1u, 3u, 5u, 7u, 9u, 11u, 13u, 15u,
+                                         1u, 3u, 5u, 7u, 9u, 11u, 13u, 15u};
+
+  static uint8_t shuffleIdxTable15u_0[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 14u, 13u, 12u, 11u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 14u, 13u, 12u, 11u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 14u, 13u, 12u, 11u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 14u, 13u, 12u, 11u};
+  static uint8_t shuffleIdxTable15u_1[64] = {
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 11u, 10u, 9u, 0u, 15u, 14u, 13u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 11u, 10u, 9u, 0u, 15u, 14u, 13u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 11u, 10u, 9u, 0u, 15u, 14u, 13u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 11u, 10u, 9u, 0u, 15u, 14u, 13u, 0u};
+  static uint32_t shiftTable15u_2[16] = {17u, 11u, 13u, 15u, 17u, 11u, 13u, 15u,
+                                         17u, 11u, 13u, 15u, 17u, 11u, 13u, 15u};
+  static uint32_t shiftTable15u_3[16] = {14u, 12u, 10u, 8u, 14u, 12u, 10u, 8u,
+                                         14u, 12u, 10u, 8u, 14u, 12u, 10u, 8u};
+  static uint64_t gatherIdxTable15u[8] = {0u, 8u, 15u, 23u, 30u, 38u, 45u, 53u};
+
+  // ------------------------------------ 17u -----------------------------------------
+  static uint32_t permutexIdxTable17u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               4u, 5u, 5u, 6u, 6u, 7u, 7u, 8u};
+  static uint32_t permutexIdxTable17u_1[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               4u, 5u, 5u, 6u, 6u, 7u, 7u, 8u};
+  static uint64_t shiftTable17u_0[8] = {0u, 2u, 4u, 6u, 8u, 10u, 12u, 14u};
+  static uint64_t shiftTable17u_1[8] = {15u, 13u, 11u, 9u, 7u, 5u, 3u, 1u};
+
+  static uint8_t shuffleIdxTable17u_0[64] = {
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u,
+      3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u,
+      9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u};
+  static uint32_t shiftTable17u_2[16] = {15u, 14u, 13u, 12u, 11u, 10u, 9u, 8u,
+                                         15u, 14u, 13u, 12u, 11u, 10u, 9u, 8u};
+  static uint64_t gatherIdxTable17u[8] = {0u, 8u, 8u, 16u, 17u, 25u, 25u, 33u};
+
+  // ------------------------------------ 18u -----------------------------------------
+  static uint32_t permutexIdxTable18u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               4u, 5u, 5u, 6u, 6u, 7u, 7u, 8u};
+  static uint32_t permutexIdxTable18u_1[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               5u, 6u, 6u, 7u, 7u, 8u, 8u, 9u};
+  static uint64_t shiftTable18u_0[8] = {0u, 4u, 8u, 12u, 16u, 20u, 24u, 28u};
+  static uint64_t shiftTable18u_1[8] = {14u, 10u, 6u, 2u, 30u, 26u, 22u, 18u};
+
+  static uint8_t shuffleIdxTable18u_0[64] = {
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u,
+      3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u,
+      9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u};
+  static uint32_t shiftTable18u_2[16] = {14u, 12u, 10u, 8u, 14u, 12u, 10u, 8u,
+                                         14u, 12u, 10u, 8u, 14u, 12u, 10u, 8u};
+  static uint64_t gatherIdxTable18u[8] = {0u, 8u, 9u, 17u, 18u, 26u, 27u, 35u};
+
+  // ------------------------------------ 19u -----------------------------------------
+  static uint32_t permutexIdxTable19u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               4u, 5u, 5u, 6u, 7u, 8u, 8u, 9u};
+  static uint32_t permutexIdxTable19u_1[16] = {0u, 1u, 1u, 2u, 2u, 3u, 4u, 5u,
+                                               5u, 6u, 6u, 7u, 7u, 8u, 8u, 9u};
+  static uint64_t shiftTable19u_0[8] = {0u, 6u, 12u, 18u, 24u, 30u, 4u, 10u};
+  static uint64_t shiftTable19u_1[8] = {13u, 7u, 1u, 27u, 21u, 15u, 9u, 3u};
+
+  static uint8_t shuffleIdxTable19u_0[64] = {
+      3u,  2u, 1u, 0u, 5u, 4u, 3u,  2u, 7u, 6u, 5u, 4u, 10u, 9u, 8u, 7u, 3u,  2u, 1u, 0u, 5u, 4u,
+      3u,  2u, 8u, 7u, 6u, 5u, 10u, 9u, 8u, 7u, 3u, 2u, 1u,  0u, 5u, 4u, 3u,  2u, 7u, 6u, 5u, 4u,
+      10u, 9u, 8u, 7u, 3u, 2u, 1u,  0u, 5u, 4u, 3u, 2u, 8u,  7u, 6u, 5u, 10u, 9u, 8u, 7u};
+  static uint32_t shiftTable19u_2[16] = {13u, 10u, 7u, 12u, 9u, 6u, 11u, 8u,
+                                         13u, 10u, 7u, 12u, 9u, 6u, 11u, 8u};
+  static uint64_t gatherIdxTable19u[8] = {0u, 8u, 9u, 17u, 19u, 27u, 28u, 36u};
+
+  // ------------------------------------ 20u -----------------------------------------
+  static uint8_t shuffleIdxTable20u_0[64] = {
+      3u,  2u, 1u, 0u, 5u, 4u, 3u,  2u, 8u, 7u, 6u, 5u, 10u, 9u, 8u, 7u, 3u,  2u, 1u, 0u, 5u, 4u,
+      3u,  2u, 8u, 7u, 6u, 5u, 10u, 9u, 8u, 7u, 3u, 2u, 1u,  0u, 5u, 4u, 3u,  2u, 8u, 7u, 6u, 5u,
+      10u, 9u, 8u, 7u, 3u, 2u, 1u,  0u, 5u, 4u, 3u, 2u, 8u,  7u, 6u, 5u, 10u, 9u, 8u, 7u};
+  static uint32_t shiftTable20u[16] = {12u, 8u, 12u, 8u, 12u, 8u, 12u, 8u,
+                                       12u, 8u, 12u, 8u, 12u, 8u, 12u, 8u};
+  static uint16_t permutexIdxTable20u[32] = {0u,  1u,  2u,  3u,  4u,  0x0, 0x0, 0x0, 5u,  6u,  7u,
+                                             8u,  9u,  0x0, 0x0, 0x0, 10u, 11u, 12u, 13u, 14u, 0x0,
+                                             0x0, 0x0, 15u, 16u, 17u, 18u, 19u, 0x0, 0x0, 0x0};
+
+  // ------------------------------------ 21u -----------------------------------------
+  static uint32_t permutexIdxTable21u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               5u, 6u, 6u, 7u, 7u, 8u, 9u, 10u};
+  static uint32_t permutexIdxTable21u_1[16] = {0u, 1u, 1u, 2u, 3u, 4u, 4u, 5u,
+                                               5u, 6u, 7u, 8u, 8u, 9u, 9u, 10u};
+  static uint64_t shiftTable21u_0[8] = {0u, 10u, 20u, 30u, 8u, 18u, 28u, 6u};
+  static uint64_t shiftTable21u_1[8] = {11u, 1u, 23u, 13u, 3u, 25u, 15u, 5u};
+
+  static uint8_t shuffleIdxTable21u_0[64] = {
+      3u,  2u, 1u, 0u, 5u, 4u, 3u,  2u,  8u, 7u, 6u, 5u, 10u, 9u, 8u, 7u, 3u,  2u,  1u, 0u, 6u, 5u,
+      4u,  3u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u, 3u, 2u, 1u,  0u, 5u, 4u, 3u,  2u,  8u, 7u, 6u, 5u,
+      10u, 9u, 8u, 7u, 3u, 2u, 1u,  0u,  6u, 5u, 4u, 3u, 8u,  7u, 6u, 5u, 11u, 10u, 9u, 8u};
+  static uint32_t shiftTable21u_2[16] = {11u, 6u, 9u, 4u, 7u, 10u, 5u, 8u,
+                                         11u, 6u, 9u, 4u, 7u, 10u, 5u, 8u};
+  static uint64_t gatherIdxTable21u[8] = {0u, 8u, 10u, 18u, 21u, 29u, 31u, 39u};
+
+  // ------------------------------------ 22u -----------------------------------------
+  static uint32_t permutexIdxTable22u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 4u, 5u,
+                                               5u, 6u, 6u, 7u, 8u, 9u, 9u, 10u};
+  static uint32_t permutexIdxTable22u_1[16] = {0u, 1u, 2u, 3u, 3u, 4u, 4u,  5u,
+                                               6u, 7u, 7u, 8u, 8u, 9u, 10u, 11u};
+  static uint64_t shiftTable22u_0[8] = {0u, 12u, 24u, 4u, 16u, 28u, 8u, 20u};
+  static uint64_t shiftTable22u_1[8] = {10u, 30u, 18u, 6u, 26u, 14u, 2u, 22u};
+
+  static uint8_t shuffleIdxTable22u_0[64] = {
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u};
+  static uint32_t shiftTable22u_2[16] = {10u, 4u, 6u, 8u, 10u, 4u, 6u, 8u,
+                                         10u, 4u, 6u, 8u, 10u, 4u, 6u, 8u};
+  static uint64_t gatherIdxTable22u[8] = {0u, 8u, 11u, 19u, 22u, 30u, 33u, 41u};
+
+  // ------------------------------------ 23u -----------------------------------------
+  static uint32_t permutexIdxTable23u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 4u,  5u,
+                                               5u, 6u, 7u, 8u, 8u, 9u, 10u, 11u};
+  static uint32_t permutexIdxTable23u_1[16] = {0u, 1u, 2u, 3u, 3u, 4u,  5u,  6u,
+                                               6u, 7u, 7u, 8u, 9u, 10u, 10u, 11u};
+  static uint64_t shiftTable23u_0[8] = {0u, 14u, 28u, 10u, 24u, 6u, 20u, 2u};
+  static uint64_t shiftTable23u_1[8] = {9u, 27u, 13u, 31u, 17u, 3u, 21u, 7u};
+
+  static uint8_t shuffleIdxTable23u_0[64] = {
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u,  8u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u,  8u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u};
+  static uint32_t shiftTable23u_2[16] = {9u, 2u, 3u, 4u, 5u, 6u, 7u, 8u,
+                                         9u, 2u, 3u, 4u, 5u, 6u, 7u, 8u};
+  static uint64_t gatherIdxTable23u[8] = {0u, 8u, 11u, 19u, 23u, 31u, 34u, 42u};
+
+  // ------------------------------------ 24u -----------------------------------------
+  static uint8_t shuffleIdxTable24u_0[64] = {
+      2u, 1u, 0u, 0xFF, 5u, 4u, 3u, 0xFF, 8u, 7u, 6u, 0xFF, 11u, 10u, 9u, 0xFF,
+      2u, 1u, 0u, 0xFF, 5u, 4u, 3u, 0xFF, 8u, 7u, 6u, 0xFF, 11u, 10u, 9u, 0xFF,
+      2u, 1u, 0u, 0xFF, 5u, 4u, 3u, 0xFF, 8u, 7u, 6u, 0xFF, 11u, 10u, 9u, 0xFF,
+      2u, 1u, 0u, 0xFF, 5u, 4u, 3u, 0xFF, 8u, 7u, 6u, 0xFF, 11u, 10u, 9u, 0xFF};
+  static uint32_t permutexIdxTable24u[16] = {0u, 1u, 2u, 0x0, 3u, 4u,  5u,  0x0,
+                                             6u, 7u, 8u, 0x0, 9u, 10u, 11u, 0x0};
+
+  // ------------------------------------ 26u -----------------------------------------
+  static uint32_t permutexIdxTable26u_0[16] = {0u, 1u, 1u, 2u, 3u, 4u,  4u,  5u,
+                                               6u, 7u, 8u, 9u, 9u, 10u, 11u, 12u};
+  static uint32_t permutexIdxTable26u_1[16] = {0u, 1u, 2u, 3u, 4u,  5u,  5u,  6u,
+                                               7u, 8u, 8u, 9u, 10u, 11u, 12u, 13u};
+  static uint64_t shiftTable26u_0[8] = {0u, 20u, 8u, 28u, 16u, 4u, 24u, 12u};
+  static uint64_t shiftTable26u_1[8] = {6u, 18u, 30u, 10u, 22u, 2u, 14u, 26u};
+
+  static uint8_t shuffleIdxTable26u_0[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u};
+  static uint32_t shiftTable26u_2[16] = {6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u,
+                                         6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u};
+  static uint64_t gatherIdxTable26u[8] = {0u, 8u, 13u, 21u, 26u, 34u, 39u, 47u};
+
+  // ------------------------------------ 28u -----------------------------------------
+  static uint8_t shuffleIdxTable28u_0[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u};
+  static uint32_t shiftTable28u[16] = {4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u,
+                                       4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u};
+  static uint16_t permutexIdxTable28u[32] = {0u,  1u,  2u,  3u,  4u,  5u,  6u,  0x0, 7u,  8u,  9u,
+                                             10u, 11u, 12u, 13u, 0x0, 14u, 15u, 16u, 17u, 18u, 19u,
+                                             20u, 0x0, 21u, 22u, 23u, 24u, 25u, 26u, 27u, 0x0};
+
+  // ------------------------------------ 30u -----------------------------------------
+  static uint32_t permutexIdxTable30u_0[16] = {0u, 1u, 1u, 2u,  3u,  4u,  5u,  6u,
+                                               7u, 8u, 9u, 10u, 11u, 12u, 13u, 14u};
+  static uint32_t permutexIdxTable30u_1[16] = {0u, 1u, 2u,  3u,  4u,  5u,  6u,  7u,
+                                               8u, 9u, 10u, 11u, 12u, 13u, 14u, 15u};
+  static uint64_t shiftTable30u_0[8] = {0u, 28u, 24u, 20u, 16u, 12u, 8u, 4u};
+  static uint64_t shiftTable30u_1[8] = {2u, 6u, 10u, 14u, 18u, 22u, 26u, 30u};
+
+  static uint8_t shuffleIdxTable30u_0[64] = {
+      0u, 0u, 0u, 4u, 3u, 2u, 1u, 0u, 0u, 0u, 0u, 11u, 10u, 9u, 8u, 7u,
+      0u, 0u, 0u, 4u, 3u, 2u, 1u, 0u, 0u, 0u, 0u, 11u, 10u, 9u, 8u, 7u,
+      0u, 0u, 0u, 4u, 3u, 2u, 1u, 0u, 0u, 0u, 0u, 11u, 10u, 9u, 8u, 7u,
+      0u, 0u, 0u, 4u, 3u, 2u, 1u, 0u, 0u, 0u, 0u, 11u, 10u, 9u, 8u, 7u};
+  static uint8_t shuffleIdxTable30u_1[64] = {
+      7u, 6u, 5u, 4u, 3u, 0u, 0u, 0u, 15u, 14u, 13u, 12u, 11u, 0u, 0u, 0u,
+      7u, 6u, 5u, 4u, 3u, 0u, 0u, 0u, 15u, 14u, 13u, 12u, 11u, 0u, 0u, 0u,
+      7u, 6u, 5u, 4u, 3u, 0u, 0u, 0u, 15u, 14u, 13u, 12u, 11u, 0u, 0u, 0u,
+      7u, 6u, 5u, 4u, 3u, 0u, 0u, 0u, 15u, 14u, 13u, 12u, 11u, 0u, 0u, 0u};
+  static uint64_t shiftTable30u_2[8] = {34u, 30u, 34u, 30u, 34u, 30u, 34u, 30u};
+  static uint64_t shiftTable30u_3[8] = {28u, 24u, 28u, 24u, 28u, 24u, 28u, 24u};
+  static uint64_t gatherIdxTable30u[8] = {0u, 8u, 15u, 23u, 30u, 38u, 45u, 53u};
+
+  static uint64_t nibbleReverseTable[8] = {
+      0x0E060A020C040800, 0x0F070B030D050901, 0x0E060A020C040800, 0x0F070B030D050901,
+      0x0E060A020C040800, 0x0F070B030D050901, 0x0E060A020C040800, 0x0F070B030D050901};
+
+  static uint64_t reverseMaskTable1u[8] = {
+      0x0001020304050607, 0x08090A0B0C0D0E0F, 0x1011121314151617, 0x18191A1B1C1D1E1F,
+      0x2021222324252627, 0x28292A2B2C2D2E2F, 0x3031323334353637, 0x38393A3B3C3D3E3F};
+
+  static uint64_t reverseMaskTable16u[8] = {
+      0x0607040502030001, 0x0E0F0C0D0A0B0809, 0x1617141512131011, 0x1E1F1C1D1A1B1819,
+      0x2627242522232021, 0x2E2F2C2D2A2B2829, 0x3637343532333031, 0x3E3F3C3D3A3B3839};
+
+  static uint64_t reverseMaskTable32u[8] = {
+      0x0405060700010203, 0x0C0D0E0F08090A0B, 0x1415161710111213, 0x1C1D1E1F18191A1B,
+      0x2425262720212223, 0x2C2D2E2F28292A2B, 0x3435363730313233, 0x3C3D3E3F38393A3B};
+
+  uint32_t getAlign(uint32_t start_bit, uint32_t base, uint32_t bitsize) {

Review Comment:
   ```suggestion
     inline uint32_t getAlign(uint32_t start_bit, uint32_t base, uint32_t bitsize) {
   ```



##########
c++/src/BpackingDefault.hh:
##########
@@ -0,0 +1,61 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKINGDEFAULT_HH
+#define ORC_BPACKINGDEFAULT_HH
+
+#include <stdlib.h>
+#include <cstdint>
+
+#include "Bpacking.hh"
+#include "RLEv2.hh"
+#include "io/InputStream.hh"
+#include "io/OutputStream.hh"

Review Comment:
   Ditto



##########
c++/src/BpackingAvx512.hh:
##########
@@ -0,0 +1,93 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKINGAVX512_HH
+#define ORC_BPACKINGAVX512_HH
+
+#include <stdlib.h>
+#include <cstdint>
+
+#include "BpackingDefault.hh"
+#include "Dispatch.hh"
+#include "RLEv2.hh"
+#include "io/InputStream.hh"
+#include "io/OutputStream.hh"

Review Comment:
   Are these includes required, or can they be removed?



##########
c++/test/TestRleVectorDecoder.cc:
##########
@@ -0,0 +1,561 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <cstdlib>
+
+#include "MemoryOutputStream.hh"
+#include "RLEv2.hh"
+#include "wrap/gtest-wrapper.h"
+#include "wrap/orc-proto-wrapper.hh"
+
+#ifdef __clang__
+DIAGNOSTIC_IGNORE("-Wmissing-variable-declarations")
+#endif
+
+namespace orc {
+  using ::testing::TestWithParam;
+  using ::testing::Values;
+
+  const int DEFAULT_MEM_STREAM_SIZE = 1024 * 1024;  // 1M
+  const char finish = '#';
+  std::string flags = "-\\|/";
+
+  class RleV2BitUnpackAvx512Test : public TestWithParam<bool> {
+    virtual void SetUp();
+
+   protected:
+    bool alignBitpacking;
+    std::unique_ptr<RleEncoder> getEncoder(RleVersion version, MemoryOutputStream& memStream,
+                                           bool isSigned);
+
+    void runExampleTest(int64_t* inputData, uint64_t inputLength, unsigned char* expectedOutput,
+                        uint64_t outputLength);
+
+    void runTest(RleVersion version, uint64_t numValues, int64_t start, int64_t delta, bool random,
+                 bool isSigned, uint8_t bitWidth, uint64_t blockSize = 0, uint64_t numNulls = 0);
+  };
+
+  void vectorDecodeAndVerify(RleVersion version, const MemoryOutputStream& memStream, int64_t* data,
+                             uint64_t numValues, const char* notNull, uint64_t blockSize,
+                             bool isSinged) {
+    std::unique_ptr<RleDecoder> decoder =
+        createRleDecoder(std::unique_ptr<SeekableArrayInputStream>(new SeekableArrayInputStream(
+                             memStream.getData(), memStream.getLength(), blockSize)),
+                         isSinged, version, *getDefaultPool(), getDefaultReaderMetrics());
+
+    int64_t* decodedData = new int64_t[numValues];
+    decoder->next(decodedData, numValues, notNull);
+
+    for (uint64_t i = 0; i < numValues; ++i) {
+      if (!notNull || notNull[i]) {
+        EXPECT_EQ(data[i], decodedData[i]);
+      }
+    }
+
+    delete[] decodedData;
+  }
+
+  void RleV2BitUnpackAvx512Test::SetUp() {
+    alignBitpacking = GetParam();
+  }
+
+  void generateDataFolBits(uint64_t numValues, int64_t start, int64_t delta, bool random,

Review Comment:
   ```suggestion
     void generateDataForBits(uint64_t numValues, int64_t start, int64_t delta, bool random,
   ```



##########
c++/test/CMakeLists.txt:
##########
@@ -18,6 +18,10 @@ include_directories(
 
 set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX17_FLAGS} ${WARN_FLAGS}")
 
+if(BUILD_ENABLE_AVX512)
+  set(SIMD_TEST TestRleVectorDecoder.cc)

Review Comment:
   ```suggestion
     set(SIMD_TEST_SRCS TestRleVectorDecoder.cc)
   ```



##########
c++/src/CMakeLists.txt:
##########
@@ -184,13 +184,21 @@ set(SOURCE_FILES
   Timezone.cc
   TypeImpl.cc
   Vector.cc
-  Writer.cc)
+  Writer.cc
+  CpuInfoUtil.cc
+  BpackingDefault.cc)

Review Comment:
   Why is `CpuInfoUtil.cc` always required?



##########
c++/src/Bpacking.hh:
##########
@@ -0,0 +1,34 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKING_HH
+#define ORC_BPACKING_HH
+
+#include <cstdint>
+
+#include "RLEv2.hh"

Review Comment:
   Can we use a forward declaration for `RleDecoderV2` instead of including `RLEv2.hh` in the header?
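
   As a rough illustration of that idea (a minimal sketch, not the PR's actual code): the header only needs the name `RleDecoderV2` as long as it stores a pointer and keeps member bodies out of line. Everything in the `sketch` namespace below is hypothetical, including `BitUnpack`, `remainingBytes`, and the stand-in `RleDecoderV2` definition with its two buffer pointers.

   ```cpp
   #include <cstdint>

   // --- Roughly what a slimmed-down header could contain (hypothetical) ---
   namespace sketch {
     class RleDecoderV2;  // forward declaration; no #include "RLEv2.hh" needed here

     class BitUnpack {
      public:
       explicit BitUnpack(RleDecoderV2* dec) : decoder(dec) {}
       // Declared only: the body dereferences the decoder, so it needs the
       // complete type and therefore belongs in the .cc file.
       uint64_t remainingBytes() const;

      private:
       RleDecoderV2* decoder;  // a pointer to an incomplete type is allowed
     };
   }  // namespace sketch

   // --- Roughly what the .cc file sees once it includes the full header ---
   namespace sketch {
     // Stand-in for the real class; only the two pointers used below are assumed.
     class RleDecoderV2 {
      public:
       const char* bufferStart = nullptr;
       const char* bufferEnd = nullptr;
     };

     uint64_t BitUnpack::remainingBytes() const {
       return static_cast<uint64_t>(decoder->bufferEnd - decoder->bufferStart);
     }
   }  // namespace sketch
   ```

   With this layout, only the `.cc` files that actually dereference the decoder have to include `RLEv2.hh`, which keeps the header cheap to include elsewhere.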



##########
c++/src/RleDecoderV2.cc:
##########
@@ -17,26 +17,32 @@
  */
 
 #include "Adaptor.hh"
+// #include "Bpacking.hh"

Review Comment:
   Remove it



##########
c++/src/CMakeLists.txt:
##########
@@ -184,13 +184,21 @@ set(SOURCE_FILES
   Timezone.cc
   TypeImpl.cc
   Vector.cc
-  Writer.cc)
+  Writer.cc
+  CpuInfoUtil.cc
+  BpackingDefault.cc)

Review Comment:
   Please sort them alphabetically.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1139933154


##########
c++/src/CMakeLists.txt:
##########
@@ -184,13 +184,21 @@ set(SOURCE_FILES
   Timezone.cc
   TypeImpl.cc
   Vector.cc
-  Writer.cc)
+  Writer.cc
+  CpuInfoUtil.cc
+  BpackingDefault.cc)

Review Comment:
   Sorry, that was my mistake. I have already removed CpuInfoUtil.cc from the general source build.




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1092915841


##########
CMakeLists.txt:
##########
@@ -157,6 +172,139 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "powerpc|ppc")
+    set(ORC_CPU_FLAG "ppc")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "s390x")
+    set(ORC_CPU_FLAG "s390x")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "riscv64")
+    set(ORC_CPU_FLAG "riscv64")
+  else()
+    message(FATAL_ERROR "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_SSE4_2_FLAG "")
+    set(ORC_AVX2_FLAG "/arch:AVX2")
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+    set(CXX_SUPPORTS_SSE4_2 TRUE)
+  else()
+    set(ORC_SSE4_2_FLAG "-msse4.2")
+    set(ORC_AVX2_FLAG "-march=haswell")
+    # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
+    set(ORC_AVX512_FLAG "-march=native -mbmi2")
+    # Append the avx2/avx512 subset option also, fix issue ORC-9877 for homebrew-cpp

Review Comment:
   Sorry for the bad reference.




[GitHub] [orc] wpleonardo closed pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo closed pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode
URL: https://github.com/apache/orc/pull/1375



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1169462976


##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,2694 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"
+#include "CpuInfoUtil.hh"
+#include "RLEv2.hh"
+
+namespace orc {
+  UnpackAvx512::UnpackAvx512(RleDecoderV2* dec) : decoder(dec), unpackDefault(UnpackDefault(dec)) {
+    // PASS
+  }
+
+  UnpackAvx512::~UnpackAvx512() {
+    // PASS
+  }
+
+  inline void UnpackAvx512::alignHeaderBoundary(const uint32_t bitWidth, const uint32_t bitMaxSize,
+                                                uint64_t& startBit, uint64_t& bufMoveByteLen,
+                                                uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint64_t& tailBitLen, uint32_t& backupByteLen,
+                                                uint64_t& numElements, bool& resetBuf,
+                                                const uint8_t*& srcPtr, int64_t*& dstPtr) {
+    if (startBit != 0) {
+      bufMoveByteLen +=
+          moveByteLen(remainingNumElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+    } else {
+      bufMoveByteLen += moveByteLen(remainingNumElements * bitWidth);
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      numElements = remainingNumElements;
+      resetBuf = false;
+      remainingNumElements = 0;
+    } else {
+      uint64_t leadingBits = 0;
+      if (startBit != 0) leadingBits = ORC_VECTOR_BYTE_WIDTH - startBit;
+      uint64_t bufRestBitLen = bufRestByteLen * ORC_VECTOR_BYTE_WIDTH + leadingBits;
+      numElements = bufRestBitLen / bitWidth;
+      remainingNumElements -= numElements;
+      tailBitLen = fmod(bufRestBitLen, bitWidth);
+      resetBuf = true;
+    }
+
+    if (tailBitLen != 0) {
+      backupByteLen = tailBitLen / ORC_VECTOR_BYTE_WIDTH;
+      tailBitLen = 0;
+    }
+
+    if (startBit > 0) {
+      uint32_t align = getAlign(startBit, bitWidth, bitMaxSize);
+      if (align > numElements) {
+        align = numElements;
+      }
+      if (align != 0) {
+        bufMoveByteLen -= moveByteLen(align * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+        plainUnpackLongs(dstPtr, 0, align, bitWidth, startBit);
+        srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+        bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+        dstPtr += align;
+        numElements -= align;
+      }
+    }
+  }
+
+  inline void UnpackAvx512::alignTailerBoundary(const uint32_t bitWidth, uint64_t& startBit,
+                                                uint64_t& bufMoveByteLen, uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint32_t& backupByteLen, uint64_t& numElements,
+                                                bool& resetBuf, const uint8_t*& srcPtr,
+                                                int64_t*& dstPtr) {
+    if (numElements > 0) {
+      if (startBit != 0) {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+      } else {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth);
+      }
+      plainUnpackLongs(dstPtr, 0, numElements, bitWidth, startBit);
+      srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+      dstPtr += numElements;
+      bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufMoveByteLen,
+                                resetBuf, backupByteLen);
+      return;
+    }
+
+    if (backupByteLen != 0) {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                resetBuf, backupByteLen);
+      plainUnpackLongs(dstPtr, 0, 1, bitWidth, startBit);
+      dstPtr++;
+      backupByteLen = 0;
+      remainingNumElements--;
+    } else {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                resetBuf, backupByteLen);
+    }
+
+    bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bufMoveByteLen = 0;
+    srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+  }
+
+  void UnpackAvx512::vectorUnpack1(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 1;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __m512i reverseMask1u = _mm512_loadu_si512(reverseMaskTable1u);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          uint64_t src_64 = *reinterpret_cast<uint64_t*>(const_cast<uint8_t*>(srcPtr));
+          // Expand each of the 64 mask bits into one byte: 0 --> 0x00, 1 --> 0xFF.
+          __m512i srcmm = _mm512_movm_epi8(src_64);
+          // abs(-1) == 1, so 0x00 stays 0x00 and 0xFF becomes 0x01.
+          srcmm = _mm512_abs_epi8(srcmm);
+          srcmm = _mm512_shuffle_epi8(srcmm, reverseMask1u);
+          _mm512_storeu_si512(simdPtr, srcmm);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack2(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 2;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_MAX_16U;         // load the first 16 bytes (64 elements)
+        __m512i parse_mask = _mm512_set1_epi16(0x0303);  // keep the low 2 bits of every byte
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm3 = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          __m512i srcmm0, srcmm1, srcmm2, tmpmm;
+
+          srcmm2 = _mm512_srli_epi16(srcmm3, 2);
+          srcmm1 = _mm512_srli_epi16(srcmm3, 4);
+          srcmm0 = _mm512_srli_epi16(srcmm3, 6);
+
+          // Widen each 2-bit element to 8 bits: every shifted copy carries one of each 4
+          // elements in its low bits. Interleave the copies to restore the element order.
+          // srcmm0: a e i m 0 0 0 0 0 0 0 0 0 0 0 0
+          // srcmm1: b f j n 0 0 0 0 0 0 0 0 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm0, srcmm1);        // ab ef 00 00 00 00 00 00
+          srcmm0 = _mm512_unpackhi_epi8(srcmm0, srcmm1);       // ij mn 00 00 00 00 00 00
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x00);  // ab ef ab ef ij mn ij mn
+
+          // srcmm2: c g k o 0 0 0 0 0 0 0 0 0 0 0 0
+          // srcmm3: d h l p 0 0 0 0 0 0 0 0 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm2, srcmm3);        // cd gh 00 00 00 00 00 00
+          srcmm1 = _mm512_unpackhi_epi8(srcmm2, srcmm3);       // kl op 00 00 00 00 00 00
+          srcmm1 = _mm512_shuffle_i64x2(tmpmm, srcmm1, 0x00);  // cd gh cd gh kl op kl op
+
+          tmpmm = _mm512_unpacklo_epi16(srcmm0, srcmm1);        // abcd abcd ijkl ijkl
+          srcmm0 = _mm512_unpackhi_epi16(srcmm0, srcmm1);       // efgh efgh mnop mnop
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x88);   // abcd ijkl efgh mnop
+          srcmm0 = _mm512_shuffle_i64x2(srcmm0, srcmm0, 0xD8);  // abcd efgh ijkl mnop
+
+          srcmm0 = _mm512_and_si512(srcmm0, parse_mask);
+
+          _mm512_storeu_si512(simdPtr, srcmm0);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack3(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 3;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable3u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable3u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable3u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable3u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable3u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // Shuffle even-indexed elements into zmm[0] and odd-indexed ones into zmm[1].
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // Shift each element into position within its 16-bit word.
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // Merge: even bytes come from zmm[0], odd bytes from zmm[1].
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack4(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 4;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_MAX_32U;        // first 32 bytes (64 elements)
+        __m512i parseMask = _mm512_set1_epi16(0x0F0F);  // keep the low 4 bits of every byte
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm0, srcmm1, tmpmm;
+
+          srcmm1 = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm0 = _mm512_srli_epi16(srcmm1, 4);
+
+          // move elements into their places
+          // srcmm0: a c e g 0 0 0 0
+          // srcmm1: b d f h 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm0, srcmm1);         // ab ef 00 00
+          srcmm0 = _mm512_unpackhi_epi8(srcmm0, srcmm1);        // cd gh 00 00
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x44);   // ab ef cd gh
+          srcmm0 = _mm512_shuffle_i64x2(srcmm0, srcmm0, 0xD8);  // ab cd ef gh
+
+          // Widen each 4-bit element to 8 bits by clearing the high 4 bits of every byte.
+          srcmm0 = _mm512_and_si512(srcmm0, parseMask);
+
+          _mm512_storeu_si512(simdPtr, srcmm0);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack5(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 5;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable5u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable5u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable5u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable5u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable5u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // Shuffle even-indexed elements into zmm[0] and odd-indexed ones into zmm[1].
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // Shift each element into position within its 16-bit word.
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // Merge: even bytes come from zmm[0], odd bytes from zmm[1].
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack6(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 6;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable6u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable6u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable6u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable6u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable6u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi32(permutexIdx, srcmm);
+
+          // Shuffle even-indexed elements into zmm[0] and odd-indexed ones into zmm[1].
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // Shift each element into position within its 16-bit word.
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // Merge: even bytes come from zmm[0], odd bytes from zmm[1].
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack7(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 7;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable7u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable7u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable7u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable7u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable7u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // Shuffle even-indexed elements into zmm[0] and odd-indexed ones into zmm[1].
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // Shift each element into position within its 16-bit word.
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // Merge: even bytes come from zmm[0], odd bytes from zmm[1].
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack9(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 9;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable9u_0);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable9u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable9u_1);
+
+        __m512i shiftMaskPtr[3];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable9u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable9u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable9u_2);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable9u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr);
+
+          // Shift each element down to the start of its 16-bit word.
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[2]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          // Reverse the bit order within every byte using a per-nibble table lookup.
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // Permute even-indexed elements into zmm[0] and odd-indexed ones into zmm[1].
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // Shift each element into position within its 32-bit lane.
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // Merge: even 16-bit words come from zmm[0], odd ones from zmm[1].
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 7);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack10(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 10;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable10u_0);
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable10u);
+        __m512i shiftMask = _mm512_loadu_si512(shiftTable10u);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm;
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          zmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+          zmm = _mm512_shuffle_epi8(zmm, shuffleIdxPtr);
+
+          // Shift each element down to the start of its 16-bit word.
+          zmm = _mm512_srlv_epi16(zmm, shiftMask);
+          zmm = _mm512_and_si512(zmm, parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack11(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 11;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverse_mask_16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable11u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable11u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable11u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable11u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable11u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable11u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable11u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable11u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable11u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // Shuffle even-indexed elements into zmm[0] and odd-indexed ones into zmm[1].
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // Shift each element into position within its 32-bit lane.
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // Merge: even 16-bit words come from zmm[0], odd ones from zmm[1].
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4u);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 5);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverse_mask_16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack12(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 12;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable12u_0);
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable12u);
+        __m512i shiftMask = _mm512_loadu_si512(shiftTable12u);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm;
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          zmm = _mm512_permutexvar_epi32(permutexIdx, srcmm);
+          zmm = _mm512_shuffle_epi8(zmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm = _mm512_srlv_epi16(zmm, shiftMask);
+          zmm = _mm512_and_si512(zmm, parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack13(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 13;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverse_mask_16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable13u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable13u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable13u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable13u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable13u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable13u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable13u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable13u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable13u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 3);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverse_mask_16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack14(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 14;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable14u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable14u_1);
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable14u);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable14u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable14u_1);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack15(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 15;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable15u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable15u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable15u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable15u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable15u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable15u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable15u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable15u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable15u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 1);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack16(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 16;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = len;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    int64_t* dstPtr = data + offset;
+    bool resetBuf = false;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      bufMoveByteLen += moveByteLen(len * bitWidth);
+
+      if (bufMoveByteLen <= bufRestByteLen) {
+        numElements = len;
+        resetBuf = false;

Review Comment:
   Actually, when numElements = len, the remaining data is processed in this pass, so even if len is not set to 0 the function still returns after unpacking.
   Anyway, I have already deleted this code and now use the function alignHeaderBoundary instead.
   
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode [orc]

Posted by "taiyang-li (via GitHub)" <gi...@apache.org>.

taiyang-li commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1758990062

   @wpleonardo I still see no improvement when selecting only int-type columns.
   
   Q: `select reporttime,appid,uid,platform,nettype,clientversioncode,sdkversioncode,statversion,heartcount,msgcount,giftcount,barragecount,entrytype,prefetchedms,linkdstate,networkavailable,starttimestamp,sessionlogints,medialogints,sdkboundts,msconnectedts,vsconnectedts,firstiframets,ownerstatus,stopreason,totaltime,cpuusageavg,memusageavg,backgroundtotal,foregroundtotal,firstvideopackts,firstvoicerecvts,firstvoiceplayts,firstiframeassemblets,uiinitts,uiloadedts,uiappearedts,setvideoviewts,blurviewdimissts,preparesdkinqueuets,preparesdkexects,startsdkinqueuets,startsdkexects,sdkjoinchannelinqueuets,sdkjoinchannelexects,lastsdkleavechannelinqueuets,lastsdkleavechannelexects,unused_1,unused_2,setvideoviewinqueuets,setvideoviewexects,livetype,audiostatus,firstiframesize,firstiframedecodetime,extras,entrancetype,entrancemode,mclientip,mnc,mcc,vsipsuccess,msipsuccess,vsipfail,msipfail,mediaflag,proxyflag,redirectcount,directorrescode,playcentertype,videomutetype,owneruid from file('test.orc') format Null;`
   
   With AVX-512:
   ```
   0 rows in set. Elapsed: 1.629 sec. Processed 1.20 million rows, 1.90 GB (738.86 thousand rows/s., 1.17 GB/s.)
   0 rows in set. Elapsed: 1.698 sec. Processed 1.20 million rows, 1.90 GB (708.46 thousand rows/s., 1.12 GB/s.)
   0 rows in set. Elapsed: 1.572 sec. Processed 1.20 million rows, 1.90 GB (765.62 thousand rows/s., 1.21 GB/s.)
   ```
   
   Without AVX-512:
   ```
   0 rows in set. Elapsed: 1.403 sec. Processed 1.20 million rows, 1.90 GB (857.57 thousand rows/s., 1.36 GB/s.)
   0 rows in set. Elapsed: 1.505 sec. Processed 1.20 million rows, 1.90 GB (799.62 thousand rows/s., 1.26 GB/s.)
   0 rows in set. Elapsed: 1.414 sec. Processed 1.20 million rows, 1.90 GB (851.23 thousand rows/s., 1.35 GB/s.)
   ```
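
   As a rough way to read the timings above, the mean elapsed time of each group and the relative difference between them can be computed as below. This is only a sketch: the helper names `meanSeconds` and `relativeSlowdown` are made up for illustration and are not part of ORC or of the benchmark that produced the numbers.

   ```cpp
   #include <cassert>
   #include <numeric>
   #include <vector>

   // Arithmetic mean of elapsed times, in seconds. (Hypothetical helper.)
   inline double meanSeconds(const std::vector<double>& runs) {
     assert(!runs.empty());
     return std::accumulate(runs.begin(), runs.end(), 0.0) / static_cast<double>(runs.size());
   }

   // Relative slowdown of `candidate` against `baseline`:
   // 0.10 means the candidate took 10% longer on average. (Hypothetical helper.)
   inline double relativeSlowdown(const std::vector<double>& candidate,
                                  const std::vector<double>& baseline) {
     return meanSeconds(candidate) / meanSeconds(baseline) - 1.0;
   }
   ```

   Fed with the three runs of each configuration, `{1.629, 1.698, 1.572}` and `{1.403, 1.505, 1.414}`, the means are about 1.633 s and 1.441 s, so the AVX-512 build is roughly 13% slower in this sample. Three runs per side is a small sample, so the figure is indicative only.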
   



[GitHub] [orc] stiga-huang commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "stiga-huang (via GitHub)" <gi...@apache.org>.

stiga-huang commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1161720614


##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,2724 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"
+#include "CpuInfoUtil.hh"
+#include "RLEv2.hh"
+
+namespace orc {
+  UnpackAvx512::UnpackAvx512(RleDecoderV2* dec) : decoder(dec), unpackDefault(UnpackDefault(dec)) {
+    // PASS
+  }
+
+  UnpackAvx512::~UnpackAvx512() {
+    // PASS
+  }
+
+  inline void UnpackAvx512::alignHeaderBoundary(uint64_t& startBit, uint64_t& bufMoveByteLen,
+                                                uint64_t& bufRestByteLen, uint64_t& len,
+                                                uint32_t& bitWidth, uint64_t& tailBitLen,
+                                                uint32_t& backupByteLen, uint64_t& numElements,
+                                                bool& resetBuf, const uint8_t*& srcPtr,
+                                                int64_t*& dstPtr, uint32_t bitMaxSize) {

Review Comment:
   Some suggestions on this long parameter list:
   - Several parameters describe a length. Rename `len` to something more meaningful, e.g. `remainingNumElements`.
   - `bitWidth` is never modified here. Pass it by value as `uint32_t bitWidth`, or as `const uint32_t bitWidth`, to avoid changing it unintentionally.
   - Put input parameters (`bitWidth`, `bitMaxSize`) before output parameters, following the Google C++ style guide:
   https://google.github.io/styleguide/cppguide.html#Inputs_and_Outputs
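
   To illustrate the suggestions above, here is a minimal sketch of how such a signature could look. It is not the PR's code: `alignHeaderBoundarySketch` and `kBitsPerByte` are hypothetical names, and the body models only the simple `startBit == 0` bookkeeping so that the example stays self-contained.

   ```cpp
   #include <cassert>
   #include <cstdint>

   // Hypothetical constant standing in for ORC_VECTOR_BYTE_WIDTH.
   constexpr uint64_t kBitsPerByte = 8;

   // Inputs come first and are passed by value as const;
   // parameters the function updates come last, by reference.
   inline void alignHeaderBoundarySketch(const uint32_t bitWidth, const uint64_t bufRestByteLen,
                                         uint64_t& remainingNumElements, uint64_t& numElements,
                                         bool& resetBuf) {
     assert(bitWidth > 0);
     // Bytes needed to hold all remaining values, rounded up.
     const uint64_t neededBytes =
         (remainingNumElements * bitWidth + kBitsPerByte - 1) / kBitsPerByte;
     if (neededBytes <= bufRestByteLen) {
       // Everything fits in the current buffer.
       numElements = remainingNumElements;
       resetBuf = false;
     } else {
       // Only the whole values that fit can be decoded before the buffer is refilled.
       numElements = bufRestByteLen * kBitsPerByte / bitWidth;
       resetBuf = true;
     }
     remainingNumElements -= numElements;
   }
   ```

   With the inputs grouped at the front, a call site reads as "given this bit width and this many buffered bytes, update these counters", and the compiler rejects any accidental write to `bitWidth`.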
   
   



##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,2724 @@
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"
+#include "CpuInfoUtil.hh"
+#include "RLEv2.hh"
+
+namespace orc {
+  UnpackAvx512::UnpackAvx512(RleDecoderV2* dec) : decoder(dec), unpackDefault(UnpackDefault(dec)) {
+    // PASS
+  }
+
+  UnpackAvx512::~UnpackAvx512() {
+    // PASS
+  }
+
+  inline void UnpackAvx512::alignHeaderBoundary(uint64_t& startBit, uint64_t& bufMoveByteLen,
+                                                uint64_t& bufRestByteLen, uint64_t& len,
+                                                uint32_t& bitWidth, uint64_t& tailBitLen,
+                                                uint32_t& backupByteLen, uint64_t& numElements,
+                                                bool& resetBuf, const uint8_t*& srcPtr,
+                                                int64_t*& dstPtr, uint32_t bitMaxSize) {
+    if (startBit != 0) {
+      bufMoveByteLen +=
+          moveLen(len * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH, ORC_VECTOR_BYTE_WIDTH);
+    } else {
+      bufMoveByteLen += moveLen(len * bitWidth, ORC_VECTOR_BYTE_WIDTH);
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      numElements = len;
+      resetBuf = false;
+      len -= numElements;
+    } else {
+      if (startBit != 0) {
+        numElements =
+            (bufRestByteLen * ORC_VECTOR_BYTE_WIDTH + ORC_VECTOR_BYTE_WIDTH - startBit) / bitWidth;
+        len -= numElements;
+        tailBitLen = fmod(bufRestByteLen * ORC_VECTOR_BYTE_WIDTH + ORC_VECTOR_BYTE_WIDTH - startBit,
+                          bitWidth);
+        resetBuf = true;
+      } else {
+        numElements = (bufRestByteLen * ORC_VECTOR_BYTE_WIDTH) / bitWidth;
+        len -= numElements;
+        tailBitLen = fmod(bufRestByteLen * ORC_VECTOR_BYTE_WIDTH, bitWidth);
+        resetBuf = true;
+      }

Review Comment:
   These two branches are nearly identical. We can simplify them to
   ```cpp
         uint64_t leadingBits = 0;
         if (startBit != 0) leadingBits = ORC_VECTOR_BYTE_WIDTH - startBit;
         uint64_t bufRestBitLen = bufRestByteLen * ORC_VECTOR_BYTE_WIDTH + leadingBits;
         numElements = bufRestBitLen / bitWidth;
         len -= numElements;
         tailBitLen = fmod(bufRestBitLen, bitWidth);
         resetBuf = true;
   ```
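
   The equivalence of the two forms can be checked with a small standalone sketch. The names `Split`, `splitTwoBranch`, `splitSinglePath` and `kByteWidth` are hypothetical and exist only for this illustration. The sketch also swaps `fmod` for the integer `%` operator, which is an assumption on my side: it gives the same remainder for these unsigned values without a round trip through `double`.

   ```cpp
   #include <cassert>
   #include <cstdint>

   // Hypothetical stand-in for ORC_VECTOR_BYTE_WIDTH (8 bits per byte).
   constexpr uint64_t kByteWidth = 8;

   struct Split {
     uint64_t numElements;  // whole values available in the rest of the buffer
     uint64_t tailBitLen;   // leftover bits of a value that spans two buffers
   };

   // Two-branch form, mirroring the structure of the code under review.
   inline Split splitTwoBranch(uint64_t bufRestByteLen, uint64_t startBit, uint32_t bitWidth) {
     assert(bitWidth > 0 && startBit < kByteWidth);
     if (startBit != 0) {
       const uint64_t bits = bufRestByteLen * kByteWidth + kByteWidth - startBit;
       return {bits / bitWidth, bits % bitWidth};
     }
     const uint64_t bits = bufRestByteLen * kByteWidth;
     return {bits / bitWidth, bits % bitWidth};
   }

   // Single-path form following the suggestion: fold the partial byte into leadingBits.
   inline Split splitSinglePath(uint64_t bufRestByteLen, uint64_t startBit, uint32_t bitWidth) {
     assert(bitWidth > 0 && startBit < kByteWidth);
     const uint64_t leadingBits = (startBit != 0) ? kByteWidth - startBit : 0;
     const uint64_t bufRestBitLen = bufRestByteLen * kByteWidth + leadingBits;
     return {bufRestBitLen / bitWidth, bufRestBitLen % bitWidth};
   }
   ```

   For example, with 10 bytes left, `startBit = 3` and `bitWidth = 12`, both forms see 85 bits and report 7 whole values plus 1 tail bit.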



##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,2724 @@
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"
+#include "CpuInfoUtil.hh"
+#include "RLEv2.hh"
+
+namespace orc {
+  UnpackAvx512::UnpackAvx512(RleDecoderV2* dec) : decoder(dec), unpackDefault(UnpackDefault(dec)) {
+    // PASS
+  }
+
+  UnpackAvx512::~UnpackAvx512() {
+    // PASS
+  }
+
+  inline void UnpackAvx512::alignHeaderBoundary(uint64_t& startBit, uint64_t& bufMoveByteLen,
+                                                uint64_t& bufRestByteLen, uint64_t& len,
+                                                uint32_t& bitWidth, uint64_t& tailBitLen,
+                                                uint32_t& backupByteLen, uint64_t& numElements,
+                                                bool& resetBuf, const uint8_t*& srcPtr,
+                                                int64_t*& dstPtr, uint32_t bitMaxSize) {
+    if (startBit != 0) {
+      bufMoveByteLen +=
+          moveLen(len * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH, ORC_VECTOR_BYTE_WIDTH);
+    } else {
+      bufMoveByteLen += moveLen(len * bitWidth, ORC_VECTOR_BYTE_WIDTH);
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      numElements = len;
+      resetBuf = false;
+      len -= numElements;

Review Comment:
   `len` can be set to 0 directly.



##########
c++/src/BitUnpackerAvx512.hh:
##########
@@ -0,0 +1,488 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BIT_UNPACKER_AVX512_HH
+#define ORC_BIT_UNPACKER_AVX512_HH
+
+// Mingw-w64 defines strcasecmp in string.h
+#if defined(_WIN32) && !defined(strcasecmp)
+#include <string.h>
+#define strcasecmp stricmp
+#else
+#include <strings.h>
+#endif
+
+#include <immintrin.h>
+#include <cstdint>
+#include <vector>
+
+namespace orc {
+#define ORC_VECTOR_BITS_2_BYTE(x) \
+  (((x) + 7u) >> 3u) /**< Convert a number of bits to a number of bytes */
+#define ORC_VECTOR_ONE_64U (1ULL)
+#define ORC_VECTOR_MAX_16U 0xFFFF     /**< Max value for uint16_t */
+#define ORC_VECTOR_MAX_32U 0xFFFFFFFF /**< Max value for uint32_t */
+#define ORC_VECTOR_BYTE_WIDTH 8u      /**< Byte width in bits */
+#define ORC_VECTOR_WORD_WIDTH 16u     /**< Word width in bits */
+#define ORC_VECTOR_DWORD_WIDTH 32u    /**< Dword width in bits */
+#define ORC_VECTOR_QWORD_WIDTH 64u    /**< Qword width in bits */
+#define ORC_VECTOR_BIT_MASK(x) \
+  ((ORC_VECTOR_ONE_64U << (x)) - 1u) /**< Bit mask below bit position */
+
+#define ORC_VECTOR_BITS_2_WORD(x) \
+  (((x) + 15u) >> 4u) /**< Convert a number of bits to a number of words */
+#define ORC_VECTOR_BITS_2_DWORD(x) \
+  (((x) + 31u) >> 5u) /**< Convert a number of bits to a number of double words */
+
+  // ------------------------------------ 3u -----------------------------------------
+  static uint8_t shuffleIdxTable3u_0[64] = {
+      1u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 4u, 3u, 5u, 4u, 6u, 5u, 1u, 0u, 1u, 0u, 2u, 1u,
+      3u, 2u, 4u, 3u, 4u, 3u, 5u, 4u, 6u, 5u, 1u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 4u, 3u,
+      5u, 4u, 6u, 5u, 1u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 4u, 3u, 5u, 4u, 6u, 5u};
+  static uint8_t shuffleIdxTable3u_1[64] = {
+      0u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u, 0u, 0u, 1u, 0u, 2u, 1u,
+      3u, 2u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u, 0u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 3u, 2u, 4u, 3u,
+      5u, 4u, 6u, 5u, 0u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u};
+  static uint16_t shiftTable3u_0[32] = {13u, 7u,  9u,  11u, 13u, 7u,  9u,  11u, 13u, 7u,  9u,
+                                        11u, 13u, 7u,  9u,  11u, 13u, 7u,  9u,  11u, 13u, 7u,
+                                        9u,  11u, 13u, 7u,  9u,  11u, 13u, 7u,  9u,  11u};
+  static uint16_t shiftTable3u_1[32] = {6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u,
+                                        0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u,
+                                        2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u};
+  static uint16_t permutexIdxTable3u[32] = {0u,  1u,  2u,  0x0, 0x0, 0x0, 0x0, 0x0, 3u,  4u,  5u,
+                                            0x0, 0x0, 0x0, 0x0, 0x0, 6u,  7u,  8u,  0x0, 0x0, 0x0,
+                                            0x0, 0x0, 9u,  10u, 11u, 0x0, 0x0, 0x0, 0x0, 0x0};
+
+  // ------------------------------------ 5u -----------------------------------------
+  static uint8_t shuffleIdxTable5u_0[64] = {
+      1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u,
+      4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u,
+      8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u};
+  static uint8_t shuffleIdxTable5u_1[64] = {
+      1u, 0u, 2u,  1u, 3u, 2u, 5u, 4u, 6u,  5u, 7u, 6u, 8u, 7u, 10u, 9u, 1u, 0u, 2u,  1u, 3u, 2u,
+      5u, 4u, 6u,  5u, 7u, 6u, 8u, 7u, 10u, 9u, 1u, 0u, 2u, 1u, 3u,  2u, 5u, 4u, 6u,  5u, 7u, 6u,
+      8u, 7u, 10u, 9u, 1u, 0u, 2u, 1u, 3u,  2u, 5u, 4u, 6u, 5u, 7u,  6u, 8u, 7u, 10u, 9u};
+  static uint16_t shiftTable5u_0[32] = {11u, 9u,  7u,  5u, 11u, 9u,  7u,  5u, 11u, 9u,  7u,
+                                        5u,  11u, 9u,  7u, 5u,  11u, 9u,  7u, 5u,  11u, 9u,
+                                        7u,  5u,  11u, 9u, 7u,  5u,  11u, 9u, 7u,  5u};
+  static uint16_t shiftTable5u_1[32] = {2u, 4u, 6u, 0u, 2u, 4u, 6u, 0u, 2u, 4u, 6u,
+                                        0u, 2u, 4u, 6u, 0u, 2u, 4u, 6u, 0u, 2u, 4u,
+                                        6u, 0u, 2u, 4u, 6u, 0u, 2u, 4u, 6u, 0u};
+  static uint16_t permutexIdxTable5u[32] = {0u,  1u,  2u,  3u,  4u,  0x0, 0x0, 0x0, 5u,  6u,  7u,
+                                            8u,  9u,  0x0, 0x0, 0x0, 10u, 11u, 12u, 13u, 14u, 0x0,
+                                            0x0, 0x0, 15u, 16u, 17u, 18u, 19u, 0x0, 0x0, 0x0};
+
+  // ------------------------------------ 6u -----------------------------------------
+  static uint8_t shuffleIdxTable6u_0[64] = {
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u};
+  static uint8_t shuffleIdxTable6u_1[64] = {
+      1u, 0u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 9u, 8u, 10u, 9u, 12u, 11u,
+      1u, 0u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 9u, 8u, 10u, 9u, 12u, 11u,
+      1u, 0u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 9u, 8u, 10u, 9u, 12u, 11u,
+      1u, 0u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 9u, 8u, 10u, 9u, 12u, 11u};
+  static uint16_t shiftTable6u_0[32] = {10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u,  10u,
+                                        6u,  10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u,
+                                        10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u};
+  static uint16_t shiftTable6u_1[32] = {4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u,
+                                        0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u,
+                                        4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u};
+  static uint32_t permutexIdxTable6u[16] = {0u, 1u, 2u, 0x0, 3u, 4u,  5u,  0x0,
+                                            6u, 7u, 8u, 0x0, 9u, 10u, 11u, 0x0};
+
+  // ------------------------------------ 7u -----------------------------------------
+  static uint8_t shuffleIdxTable7u_0[64] = {
+      1u, 0u, 2u, 1u, 4u, 3u, 6u, 5u, 8u, 7u, 9u, 8u, 11u, 10u, 13u, 12u,
+      1u, 0u, 2u, 1u, 4u, 3u, 6u, 5u, 8u, 7u, 9u, 8u, 11u, 10u, 13u, 12u,
+      1u, 0u, 2u, 1u, 4u, 3u, 6u, 5u, 8u, 7u, 9u, 8u, 11u, 10u, 13u, 12u,
+      1u, 0u, 2u, 1u, 4u, 3u, 6u, 5u, 8u, 7u, 9u, 8u, 11u, 10u, 13u, 12u};
+  static uint8_t shuffleIdxTable7u_1[64] = {
+      1u, 0u, 3u, 2u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 12u, 11u, 14u, 13u,
+      1u, 0u, 3u, 2u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 12u, 11u, 14u, 13u,
+      1u, 0u, 3u, 2u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 12u, 11u, 14u, 13u,
+      1u, 0u, 3u, 2u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 12u, 11u, 14u, 13u};
+  static uint16_t shiftTable7u_0[32] = {9u, 3u, 5u, 7u, 9u, 3u, 5u, 7u, 9u, 3u, 5u,
+                                        7u, 9u, 3u, 5u, 7u, 9u, 3u, 5u, 7u, 9u, 3u,
+                                        5u, 7u, 9u, 3u, 5u, 7u, 9u, 3u, 5u, 7u};
+  static uint16_t shiftTable7u_1[32] = {6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u,
+                                        0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u,
+                                        2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u};
+  static uint16_t permutexIdxTable7u[32] = {0u,  1u,  2u,  3u,  4u,  5u,  6u,  0x0, 7u,  8u,  9u,
+                                            10u, 11u, 12u, 13u, 0x0, 14u, 15u, 16u, 17u, 18u, 19u,
+                                            20u, 0x0, 21u, 22u, 23u, 24u, 25u, 26u, 27u, 0x0};
+
+  // ------------------------------------ 9u -----------------------------------------
+  static uint16_t permutexIdxTable9u_0[32] = {0u,  1u,  1u,  2u,  2u,  3u,  3u,  4u,  4u,  5u,  5u,
+                                              6u,  6u,  7u,  7u,  8u,  9u,  10u, 10u, 11u, 11u, 12u,
+                                              12u, 13u, 13u, 14u, 14u, 15u, 15u, 16u, 16u, 17u};
+  static uint16_t permutexIdxTable9u_1[32] = {0u,  1u,  1u,  2u,  2u,  3u,  3u,  4u,  5u,  6u,  6u,
+                                              7u,  7u,  8u,  8u,  9u,  9u,  10u, 10u, 11u, 11u, 12u,
+                                              12u, 13u, 14u, 15u, 15u, 16u, 16u, 17u, 17u, 18u};
+  static uint32_t shiftTable9u_0[16] = {0u, 2u, 4u, 6u, 8u, 10u, 12u, 14u,
+                                        0u, 2u, 4u, 6u, 8u, 10u, 12u, 14u};
+  static uint32_t shiftTable9u_1[16] = {7u, 5u, 3u, 1u, 15u, 13u, 11u, 9u,
+                                        7u, 5u, 3u, 1u, 15u, 13u, 11u, 9u};
+
+  static uint8_t shuffleIdxTable9u_0[64] = {
+      1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u, 7u, 6u, 8u, 7u, 1u, 0u, 2u, 1u, 3u, 2u,
+      4u, 3u, 5u, 4u, 6u, 5u, 7u, 6u, 8u, 7u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u,
+      7u, 6u, 8u, 7u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u, 7u, 6u, 8u, 7u};
+  static uint16_t shiftTable9u_2[32] = {7u, 6u, 5u, 4u, 3u, 2u, 1u, 0u, 7u, 6u, 5u,
+                                        4u, 3u, 2u, 1u, 0u, 7u, 6u, 5u, 4u, 3u, 2u,
+                                        1u, 0u, 7u, 6u, 5u, 4u, 3u, 2u, 1u, 0u};
+  static uint64_t gatherIdxTable9u[8] = {0u, 8u, 9u, 17u, 18u, 26u, 27u, 35u};
+
+  // ------------------------------------ 10u -----------------------------------------
+  static uint8_t shuffleIdxTable10u_0[64] = {
+      1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u,
+      4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u,
+      8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u};
+  static uint16_t shiftTable10u[32] = {6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u,
+                                       0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u,
+                                       2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u};
+  static uint16_t permutexIdxTable10u[32] = {0u,  1u,  2u,  3u,  4u,  0x0, 0x0, 0x0, 5u,  6u,  7u,
+                                             8u,  9u,  0x0, 0x0, 0x0, 10u, 11u, 12u, 13u, 14u, 0x0,
+                                             0x0, 0x0, 15u, 16u, 17u, 18u, 19u, 0x0, 0x0, 0x0};
+
+  // ------------------------------------ 11u -----------------------------------------
+  static uint16_t permutexIdxTable11u_0[32] = {
+      0u,  1u,  1u,  2u,  2u,  3u,  4u,  5u,  5u,  6u,  6u,  7u,  8u,  9u,  9u,  10u,
+      11u, 12u, 12u, 13u, 13u, 14u, 15u, 16u, 16u, 17u, 17u, 18u, 19u, 20u, 20u, 21u};
+  static uint16_t permutexIdxTable11u_1[32] = {
+      0u,  1u,  2u,  3u,  3u,  4u,  4u,  5u,  6u,  7u,  7u,  8u,  8u,  9u,  10u, 11u,
+      11u, 12u, 13u, 14u, 14u, 15u, 15u, 16u, 17u, 18u, 18u, 19u, 19u, 20u, 21u, 22u};
+  static uint32_t shiftTable11u_0[16] = {0u, 6u, 12u, 2u, 8u, 14u, 4u, 10u,
+                                         0u, 6u, 12u, 2u, 8u, 14u, 4u, 10u};
+  static uint32_t shiftTable11u_1[16] = {5u, 15u, 9u, 3u, 13u, 7u, 1u, 11u,
+                                         5u, 15u, 9u, 3u, 13u, 7u, 1u, 11u};
+
+  static uint8_t shuffleIdxTable11u_0[64] = {
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u};
+  static uint8_t shuffleIdxTable11u_1[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 8u, 7u, 6u, 0u, 11u, 10u, 9u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 8u, 7u, 6u, 0u, 11u, 10u, 9u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 8u, 7u, 6u, 0u, 11u, 10u, 9u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 8u, 7u, 6u, 0u, 11u, 10u, 9u, 0u};
+  static uint32_t shiftTable11u_2[16] = {21u, 15u, 17u, 19u, 21u, 15u, 17u, 19u,
+                                         21u, 15u, 17u, 19u, 21u, 15u, 17u, 19u};
+  static uint32_t shiftTable11u_3[16] = {6u, 4u, 10u, 8u, 6u, 4u, 10u, 8u,
+                                         6u, 4u, 10u, 8u, 6u, 4u, 10u, 8u};
+  static uint64_t gatherIdxTable11u[8] = {0u, 8u, 11u, 19u, 22u, 30u, 33u, 41u};
+
+  // ------------------------------------ 12u -----------------------------------------
+  static uint8_t shuffleIdxTable12u_0[64] = {
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u};
+  static uint16_t shiftTable12u[32] = {4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u,
+                                       0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u,
+                                       4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u};
+  static uint32_t permutexIdxTable12u[16] = {0u, 1u, 2u, 0x0, 3u, 4u,  5u,  0x0,
+                                             6u, 7u, 8u, 0x0, 9u, 10u, 11u, 0x0};
+
+  // ------------------------------------ 13u -----------------------------------------
+  static uint16_t permutexIdxTable13u_0[32] = {
+      0u,  1u,  1u,  2u,  3u,  4u,  4u,  5u,  6u,  7u,  8u,  9u,  9u,  10u, 11u, 12u,
+      13u, 14u, 14u, 15u, 16u, 17u, 17u, 18u, 19u, 20u, 21u, 22u, 22u, 23u, 24u, 25u};
+  static uint16_t permutexIdxTable13u_1[32] = {
+      0u,  1u,  2u,  3u,  4u,  5u,  5u,  6u,  7u,  8u,  8u,  9u,  10u, 11u, 12u, 13u,
+      13u, 14u, 15u, 16u, 17u, 18u, 18u, 19u, 20u, 21u, 21u, 22u, 23u, 24u, 25u, 26u};
+  static uint32_t shiftTable13u_0[16] = {0u, 10u, 4u, 14u, 8u, 2u, 12u, 6u,
+                                         0u, 10u, 4u, 14u, 8u, 2u, 12u, 6u};
+  static uint32_t shiftTable13u_1[16] = {3u, 9u, 15u, 5u, 11u, 1u, 7u, 13u,
+                                         3u, 9u, 15u, 5u, 11u, 1u, 7u, 13u};
+
+  static uint8_t shuffleIdxTable13u_0[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u};
+  static uint8_t shuffleIdxTable13u_1[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 10u, 9u, 8u, 0u, 13u, 12u, 11u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 10u, 9u, 8u, 0u, 13u, 12u, 11u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 10u, 9u, 8u, 0u, 13u, 12u, 11u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 10u, 9u, 8u, 0u, 13u, 12u, 11u, 0u};
+  static uint32_t shiftTable13u_2[16] = {19u, 17u, 15u, 13u, 19u, 17u, 15u, 13u,
+                                         19u, 17u, 15u, 13u, 19u, 17u, 15u, 13u};
+  static uint32_t shiftTable13u_3[16] = {10u, 12u, 6u, 8u, 10u, 12u, 6u, 8u,
+                                         10u, 12u, 6u, 8u, 10u, 12u, 6u, 8u};
+  static uint64_t gatherIdxTable13u[8] = {0u, 8u, 13u, 21u, 26u, 34u, 39u, 47u};
+
+  // ------------------------------------ 14u -----------------------------------------
+  static uint8_t shuffleIdxTable14u_0[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u};
+  static uint8_t shuffleIdxTable14u_1[64] = {
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 10u, 9u, 8u, 0u, 14u, 13u, 12u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 10u, 9u, 8u, 0u, 14u, 13u, 12u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 10u, 9u, 8u, 0u, 14u, 13u, 12u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 10u, 9u, 8u, 0u, 14u, 13u, 12u, 0u};
+  static uint32_t shiftTable14u_0[16] = {18u, 14u, 18u, 14u, 18u, 14u, 18u, 14u,
+                                         18u, 14u, 18u, 14u, 18u, 14u, 18u, 14u};
+  static uint32_t shiftTable14u_1[16] = {12u, 8u, 12u, 8u, 12u, 8u, 12u, 8u,
+                                         12u, 8u, 12u, 8u, 12u, 8u, 12u, 8u};
+  static uint16_t permutexIdxTable14u[32] = {0u,  1u,  2u,  3u,  4u,  5u,  6u,  0x0, 7u,  8u,  9u,
+                                             10u, 11u, 12u, 13u, 0x0, 14u, 15u, 16u, 17u, 18u, 19u,
+                                             20u, 0x0, 21u, 22u, 23u, 24u, 25u, 26u, 27u, 0x0};
+
+  // ------------------------------------ 15u -----------------------------------------
+  static uint16_t permutexIdxTable15u_0[32] = {
+      0u,  1u,  1u,  2u,  3u,  4u,  5u,  6u,  7u,  8u,  9u,  10u, 11u, 12u, 13u, 14u,
+      15u, 16u, 16u, 17u, 18u, 19u, 20u, 21u, 22u, 23u, 24u, 25u, 26u, 27u, 28u, 29u};
+  static uint16_t permutexIdxTable15u_1[32] = {
+      0u,  1u,  2u,  3u,  4u,  5u,  6u,  7u,  8u,  9u,  10u, 11u, 12u, 13u, 14u, 15u,
+      15u, 16u, 17u, 18u, 19u, 20u, 21u, 22u, 23u, 24u, 25u, 26u, 27u, 28u, 29u, 30u};
+  static uint32_t shiftTable15u_0[16] = {0u, 14u, 12u, 10u, 8u, 6u, 4u, 2u,
+                                         0u, 14u, 12u, 10u, 8u, 6u, 4u, 2u};
+  static uint32_t shiftTable15u_1[16] = {1u, 3u, 5u, 7u, 9u, 11u, 13u, 15u,
+                                         1u, 3u, 5u, 7u, 9u, 11u, 13u, 15u};
+
+  static uint8_t shuffleIdxTable15u_0[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 14u, 13u, 12u, 11u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 14u, 13u, 12u, 11u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 14u, 13u, 12u, 11u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 14u, 13u, 12u, 11u};
+  static uint8_t shuffleIdxTable15u_1[64] = {
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 11u, 10u, 9u, 0u, 15u, 14u, 13u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 11u, 10u, 9u, 0u, 15u, 14u, 13u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 11u, 10u, 9u, 0u, 15u, 14u, 13u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 11u, 10u, 9u, 0u, 15u, 14u, 13u, 0u};
+  static uint32_t shiftTable15u_2[16] = {17u, 11u, 13u, 15u, 17u, 11u, 13u, 15u,
+                                         17u, 11u, 13u, 15u, 17u, 11u, 13u, 15u};
+  static uint32_t shiftTable15u_3[16] = {14u, 12u, 10u, 8u, 14u, 12u, 10u, 8u,
+                                         14u, 12u, 10u, 8u, 14u, 12u, 10u, 8u};
+  static uint64_t gatherIdxTable15u[8] = {0u, 8u, 15u, 23u, 30u, 38u, 45u, 53u};
+
+  // ------------------------------------ 17u -----------------------------------------
+  static uint32_t permutexIdxTable17u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               4u, 5u, 5u, 6u, 6u, 7u, 7u, 8u};
+  static uint32_t permutexIdxTable17u_1[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               4u, 5u, 5u, 6u, 6u, 7u, 7u, 8u};
+  static uint64_t shiftTable17u_0[8] = {0u, 2u, 4u, 6u, 8u, 10u, 12u, 14u};
+  static uint64_t shiftTable17u_1[8] = {15u, 13u, 11u, 9u, 7u, 5u, 3u, 1u};
+
+  static uint8_t shuffleIdxTable17u_0[64] = {
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u,
+      3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u,
+      9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u};
+  static uint32_t shiftTable17u_2[16] = {15u, 14u, 13u, 12u, 11u, 10u, 9u, 8u,
+                                         15u, 14u, 13u, 12u, 11u, 10u, 9u, 8u};
+  static uint64_t gatherIdxTable17u[8] = {0u, 8u, 8u, 16u, 17u, 25u, 25u, 33u};
+
+  // ------------------------------------ 18u -----------------------------------------
+  static uint32_t permutexIdxTable18u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               4u, 5u, 5u, 6u, 6u, 7u, 7u, 8u};
+  static uint32_t permutexIdxTable18u_1[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               5u, 6u, 6u, 7u, 7u, 8u, 8u, 9u};
+  static uint64_t shiftTable18u_0[8] = {0u, 4u, 8u, 12u, 16u, 20u, 24u, 28u};
+  static uint64_t shiftTable18u_1[8] = {14u, 10u, 6u, 2u, 30u, 26u, 22u, 18u};
+
+  static uint8_t shuffleIdxTable18u_0[64] = {
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u,
+      3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u,
+      9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u};
+  static uint32_t shiftTable18u_2[16] = {14u, 12u, 10u, 8u, 14u, 12u, 10u, 8u,
+                                         14u, 12u, 10u, 8u, 14u, 12u, 10u, 8u};
+  static uint64_t gatherIdxTable18u[8] = {0u, 8u, 9u, 17u, 18u, 26u, 27u, 35u};
+
+  // ------------------------------------ 19u -----------------------------------------
+  static uint32_t permutexIdxTable19u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               4u, 5u, 5u, 6u, 7u, 8u, 8u, 9u};
+  static uint32_t permutexIdxTable19u_1[16] = {0u, 1u, 1u, 2u, 2u, 3u, 4u, 5u,
+                                               5u, 6u, 6u, 7u, 7u, 8u, 8u, 9u};
+  static uint64_t shiftTable19u_0[8] = {0u, 6u, 12u, 18u, 24u, 30u, 4u, 10u};
+  static uint64_t shiftTable19u_1[8] = {13u, 7u, 1u, 27u, 21u, 15u, 9u, 3u};
+
+  static uint8_t shuffleIdxTable19u_0[64] = {
+      3u,  2u, 1u, 0u, 5u, 4u, 3u,  2u, 7u, 6u, 5u, 4u, 10u, 9u, 8u, 7u, 3u,  2u, 1u, 0u, 5u, 4u,
+      3u,  2u, 8u, 7u, 6u, 5u, 10u, 9u, 8u, 7u, 3u, 2u, 1u,  0u, 5u, 4u, 3u,  2u, 7u, 6u, 5u, 4u,
+      10u, 9u, 8u, 7u, 3u, 2u, 1u,  0u, 5u, 4u, 3u, 2u, 8u,  7u, 6u, 5u, 10u, 9u, 8u, 7u};
+  static uint32_t shiftTable19u_2[16] = {13u, 10u, 7u, 12u, 9u, 6u, 11u, 8u,
+                                         13u, 10u, 7u, 12u, 9u, 6u, 11u, 8u};
+  static uint64_t gatherIdxTable19u[8] = {0u, 8u, 9u, 17u, 19u, 27u, 28u, 36u};
+
+  // ------------------------------------ 20u -----------------------------------------
+  static uint8_t shuffleIdxTable20u_0[64] = {
+      3u,  2u, 1u, 0u, 5u, 4u, 3u,  2u, 8u, 7u, 6u, 5u, 10u, 9u, 8u, 7u, 3u,  2u, 1u, 0u, 5u, 4u,
+      3u,  2u, 8u, 7u, 6u, 5u, 10u, 9u, 8u, 7u, 3u, 2u, 1u,  0u, 5u, 4u, 3u,  2u, 8u, 7u, 6u, 5u,
+      10u, 9u, 8u, 7u, 3u, 2u, 1u,  0u, 5u, 4u, 3u, 2u, 8u,  7u, 6u, 5u, 10u, 9u, 8u, 7u};
+  static uint32_t shiftTable20u[16] = {12u, 8u, 12u, 8u, 12u, 8u, 12u, 8u,
+                                       12u, 8u, 12u, 8u, 12u, 8u, 12u, 8u};
+  static uint16_t permutexIdxTable20u[32] = {0u,  1u,  2u,  3u,  4u,  0x0, 0x0, 0x0, 5u,  6u,  7u,
+                                             8u,  9u,  0x0, 0x0, 0x0, 10u, 11u, 12u, 13u, 14u, 0x0,
+                                             0x0, 0x0, 15u, 16u, 17u, 18u, 19u, 0x0, 0x0, 0x0};
+
+  // ------------------------------------ 21u -----------------------------------------
+  static uint32_t permutexIdxTable21u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               5u, 6u, 6u, 7u, 7u, 8u, 9u, 10u};
+  static uint32_t permutexIdxTable21u_1[16] = {0u, 1u, 1u, 2u, 3u, 4u, 4u, 5u,
+                                               5u, 6u, 7u, 8u, 8u, 9u, 9u, 10u};
+  static uint64_t shiftTable21u_0[8] = {0u, 10u, 20u, 30u, 8u, 18u, 28u, 6u};
+  static uint64_t shiftTable21u_1[8] = {11u, 1u, 23u, 13u, 3u, 25u, 15u, 5u};
+
+  static uint8_t shuffleIdxTable21u_0[64] = {
+      3u,  2u, 1u, 0u, 5u, 4u, 3u,  2u,  8u, 7u, 6u, 5u, 10u, 9u, 8u, 7u, 3u,  2u,  1u, 0u, 6u, 5u,
+      4u,  3u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u, 3u, 2u, 1u,  0u, 5u, 4u, 3u,  2u,  8u, 7u, 6u, 5u,
+      10u, 9u, 8u, 7u, 3u, 2u, 1u,  0u,  6u, 5u, 4u, 3u, 8u,  7u, 6u, 5u, 11u, 10u, 9u, 8u};
+  static uint32_t shiftTable21u_2[16] = {11u, 6u, 9u, 4u, 7u, 10u, 5u, 8u,
+                                         11u, 6u, 9u, 4u, 7u, 10u, 5u, 8u};
+  static uint64_t gatherIdxTable21u[8] = {0u, 8u, 10u, 18u, 21u, 29u, 31u, 39u};
+
+  // ------------------------------------ 22u -----------------------------------------
+  static uint32_t permutexIdxTable22u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 4u, 5u,
+                                               5u, 6u, 6u, 7u, 8u, 9u, 9u, 10u};
+  static uint32_t permutexIdxTable22u_1[16] = {0u, 1u, 2u, 3u, 3u, 4u, 4u,  5u,
+                                               6u, 7u, 7u, 8u, 8u, 9u, 10u, 11u};
+  static uint64_t shiftTable22u_0[8] = {0u, 12u, 24u, 4u, 16u, 28u, 8u, 20u};
+  static uint64_t shiftTable22u_1[8] = {10u, 30u, 18u, 6u, 26u, 14u, 2u, 22u};
+
+  static uint8_t shuffleIdxTable22u_0[64] = {
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u};
+  static uint32_t shiftTable22u_2[16] = {10u, 4u, 6u, 8u, 10u, 4u, 6u, 8u,
+                                         10u, 4u, 6u, 8u, 10u, 4u, 6u, 8u};
+  static uint64_t gatherIdxTable22u[8] = {0u, 8u, 11u, 19u, 22u, 30u, 33u, 41u};
+
+  // ------------------------------------ 23u -----------------------------------------
+  static uint32_t permutexIdxTable23u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 4u,  5u,
+                                               5u, 6u, 7u, 8u, 8u, 9u, 10u, 11u};
+  static uint32_t permutexIdxTable23u_1[16] = {0u, 1u, 2u, 3u, 3u, 4u,  5u,  6u,
+                                               6u, 7u, 7u, 8u, 9u, 10u, 10u, 11u};
+  static uint64_t shiftTable23u_0[8] = {0u, 14u, 28u, 10u, 24u, 6u, 20u, 2u};
+  static uint64_t shiftTable23u_1[8] = {9u, 27u, 13u, 31u, 17u, 3u, 21u, 7u};
+
+  static uint8_t shuffleIdxTable23u_0[64] = {
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u,  8u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u,  8u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u};
+  static uint32_t shiftTable23u_2[16] = {9u, 2u, 3u, 4u, 5u, 6u, 7u, 8u,
+                                         9u, 2u, 3u, 4u, 5u, 6u, 7u, 8u};
+  static uint64_t gatherIdxTable23u[8] = {0u, 8u, 11u, 19u, 23u, 31u, 34u, 42u};
+
+  // ------------------------------------ 24u -----------------------------------------
+  static uint8_t shuffleIdxTable24u_0[64] = {
+      2u, 1u, 0u, 0xFF, 5u, 4u, 3u, 0xFF, 8u, 7u, 6u, 0xFF, 11u, 10u, 9u, 0xFF,
+      2u, 1u, 0u, 0xFF, 5u, 4u, 3u, 0xFF, 8u, 7u, 6u, 0xFF, 11u, 10u, 9u, 0xFF,
+      2u, 1u, 0u, 0xFF, 5u, 4u, 3u, 0xFF, 8u, 7u, 6u, 0xFF, 11u, 10u, 9u, 0xFF,
+      2u, 1u, 0u, 0xFF, 5u, 4u, 3u, 0xFF, 8u, 7u, 6u, 0xFF, 11u, 10u, 9u, 0xFF};
+  static uint32_t permutexIdxTable24u[16] = {0u, 1u, 2u, 0x0, 3u, 4u,  5u,  0x0,
+                                             6u, 7u, 8u, 0x0, 9u, 10u, 11u, 0x0};
+
+  // ------------------------------------ 26u -----------------------------------------
+  static uint32_t permutexIdxTable26u_0[16] = {0u, 1u, 1u, 2u, 3u, 4u,  4u,  5u,
+                                               6u, 7u, 8u, 9u, 9u, 10u, 11u, 12u};
+  static uint32_t permutexIdxTable26u_1[16] = {0u, 1u, 2u, 3u, 4u,  5u,  5u,  6u,
+                                               7u, 8u, 8u, 9u, 10u, 11u, 12u, 13u};
+  static uint64_t shiftTable26u_0[8] = {0u, 20u, 8u, 28u, 16u, 4u, 24u, 12u};
+  static uint64_t shiftTable26u_1[8] = {6u, 18u, 30u, 10u, 22u, 2u, 14u, 26u};
+
+  static uint8_t shuffleIdxTable26u_0[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u};
+  static uint32_t shiftTable26u_2[16] = {6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u,
+                                         6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u};
+  static uint64_t gatherIdxTable26u[8] = {0u, 8u, 13u, 21u, 26u, 34u, 39u, 47u};
+
+  // ------------------------------------ 28u -----------------------------------------
+  static uint8_t shuffleIdxTable28u_0[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u};
+  static uint32_t shiftTable28u[16] = {4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u,
+                                       4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u};
+  static uint16_t permutexIdxTable28u[32] = {0u,  1u,  2u,  3u,  4u,  5u,  6u,  0x0, 7u,  8u,  9u,
+                                             10u, 11u, 12u, 13u, 0x0, 14u, 15u, 16u, 17u, 18u, 19u,
+                                             20u, 0x0, 21u, 22u, 23u, 24u, 25u, 26u, 27u, 0x0};
+
+  // ------------------------------------ 30u -----------------------------------------
+  static uint32_t permutexIdxTable30u_0[16] = {0u, 1u, 1u, 2u,  3u,  4u,  5u,  6u,
+                                               7u, 8u, 9u, 10u, 11u, 12u, 13u, 14u};
+  static uint32_t permutexIdxTable30u_1[16] = {0u, 1u, 2u,  3u,  4u,  5u,  6u,  7u,
+                                               8u, 9u, 10u, 11u, 12u, 13u, 14u, 15u};
+  static uint64_t shiftTable30u_0[8] = {0u, 28u, 24u, 20u, 16u, 12u, 8u, 4u};
+  static uint64_t shiftTable30u_1[8] = {2u, 6u, 10u, 14u, 18u, 22u, 26u, 30u};
+
+  static uint8_t shuffleIdxTable30u_0[64] = {
+      0u, 0u, 0u, 4u, 3u, 2u, 1u, 0u, 0u, 0u, 0u, 11u, 10u, 9u, 8u, 7u,
+      0u, 0u, 0u, 4u, 3u, 2u, 1u, 0u, 0u, 0u, 0u, 11u, 10u, 9u, 8u, 7u,
+      0u, 0u, 0u, 4u, 3u, 2u, 1u, 0u, 0u, 0u, 0u, 11u, 10u, 9u, 8u, 7u,
+      0u, 0u, 0u, 4u, 3u, 2u, 1u, 0u, 0u, 0u, 0u, 11u, 10u, 9u, 8u, 7u};
+  static uint8_t shuffleIdxTable30u_1[64] = {
+      7u, 6u, 5u, 4u, 3u, 0u, 0u, 0u, 15u, 14u, 13u, 12u, 11u, 0u, 0u, 0u,
+      7u, 6u, 5u, 4u, 3u, 0u, 0u, 0u, 15u, 14u, 13u, 12u, 11u, 0u, 0u, 0u,
+      7u, 6u, 5u, 4u, 3u, 0u, 0u, 0u, 15u, 14u, 13u, 12u, 11u, 0u, 0u, 0u,
+      7u, 6u, 5u, 4u, 3u, 0u, 0u, 0u, 15u, 14u, 13u, 12u, 11u, 0u, 0u, 0u};
+  static uint64_t shiftTable30u_2[8] = {34u, 30u, 34u, 30u, 34u, 30u, 34u, 30u};
+  static uint64_t shiftTable30u_3[8] = {28u, 24u, 28u, 24u, 28u, 24u, 28u, 24u};
+  static uint64_t gatherIdxTable30u[8] = {0u, 8u, 15u, 23u, 30u, 38u, 45u, 53u};
+
+  static uint64_t nibbleReverseTable[8] = {
+      0x0E060A020C040800, 0x0F070B030D050901, 0x0E060A020C040800, 0x0F070B030D050901,
+      0x0E060A020C040800, 0x0F070B030D050901, 0x0E060A020C040800, 0x0F070B030D050901};
+
+  static uint64_t reverseMaskTable1u[8] = {
+      0x0001020304050607, 0x08090A0B0C0D0E0F, 0x1011121314151617, 0x18191A1B1C1D1E1F,
+      0x2021222324252627, 0x28292A2B2C2D2E2F, 0x3031323334353637, 0x38393A3B3C3D3E3F};
+
+  static uint64_t reverseMaskTable16u[8] = {
+      0x0607040502030001, 0x0E0F0C0D0A0B0809, 0x1617141512131011, 0x1E1F1C1D1A1B1819,
+      0x2627242522232021, 0x2E2F2C2D2A2B2829, 0x3637343532333031, 0x3E3F3C3D3A3B3839};
+
+  static uint64_t reverseMaskTable32u[8] = {
+      0x0405060700010203, 0x0C0D0E0F08090A0B, 0x1415161710111213, 0x1C1D1E1F18191A1B,
+      0x2425262720212223, 0x2C2D2E2F28292A2B, 0x3435363730313233, 0x3C3D3E3F38393A3B};
+
+  inline uint32_t getAlign(uint32_t startBit, uint32_t base, uint32_t bitSize) {
+    uint32_t remnant = bitSize - startBit;
+    uint32_t retValue = 0xFFFFFFFF;
+    for (uint32_t i = 0u; i < bitSize; ++i) {
+      uint32_t testValue = (i * base) % bitSize;
+      if (testValue == remnant) {
+        retValue = i;
+        break;
+      }
+    }
+    return retValue;
+  }
+
+  inline uint64_t moveLen(uint64_t x, uint64_t y) {

Review Comment:
   It's hard to understand what this method does at a glance. Can we rename the parameters or add some comments? E.g. rename `x` to `numBits` and `moveLen` to `moveByteLen`?
   
   `y` always seems to be `ORC_VECTOR_BYTE_WIDTH`, so maybe we can drop this parameter?
   
   The code can also be simplified:
   ```
     inline uint64_t moveLen(uint64_t x, uint64_t y) {
       uint64_t result = x / y;
       if (x % y != 0) ++result;
       return result;
     }
   ```
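
Combining the suggestions above (rename, drop the second parameter, simplify) could look roughly like the sketch below. It assumes C++14 or later; `kBitsPerByte` is a hypothetical stand-in for `ORC_VECTOR_BYTE_WIDTH`, and `moveByteLen` is only the name proposed in this comment, not an existing function in the PR.

```cpp
#include <cstdint>

// Hypothetical constant standing in for ORC_VECTOR_BYTE_WIDTH (8 bits per byte).
constexpr uint64_t kBitsPerByte = 8;

// Number of whole bytes needed to hold numBits bits,
// i.e. numBits divided by 8 and rounded up.
constexpr uint64_t moveByteLen(uint64_t numBits) {
  uint64_t result = numBits / kBitsPerByte;
  if (numBits % kBitsPerByte != 0) ++result;
  return result;
}

static_assert(moveByteLen(0) == 0, "no bits need no bytes");
static_assert(moveByteLen(1) == 1, "a partial byte still occupies one byte");
static_assert(moveByteLen(8) == 1, "exactly one byte");
static_assert(moveByteLen(9) == 2, "one bit past a byte boundary needs a second byte");
```

For inputs where `x + 7` does not overflow, this gives the same result as the `ORC_VECTOR_BITS_2_BYTE(x)` macro already defined in this header, so reusing that macro may be another option.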




[GitHub] [orc] stiga-huang commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "stiga-huang (via GitHub)" <gi...@apache.org>.

stiga-huang commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1147242733


##########
c++/src/CpuInfoUtil.cc:
##########
@@ -0,0 +1,545 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * @file CpuInfoUtil.cc is from Apache Arrow as of 2023-03-21
+ */
+
+#include "CpuInfoUtil.hh"
+
+#ifdef __APPLE__
+#include <sys/sysctl.h>
+#endif
+
+#ifndef _MSC_VER
+#include <unistd.h>
+#endif
+
+#ifdef _WIN32
+#define NOMINMAX
+#include <Windows.h>
+#include <intrin.h>
+#endif
+
+#include <algorithm>
+#include <array>
+#include <bitset>
+#include <cstdint>
+#include <fstream>
+#include <optional>
+#include <sstream>
+#include <string>
+#include <thread>
+#include <vector>
+
+#include "orc/Exceptions.hh"
+
+#undef CPUINFO_ARCH_X86
+
+#if defined(__i386) || defined(_M_IX86) || defined(__x86_64__) || defined(_M_X64)
+#define CPUINFO_ARCH_X86
+#endif
+
+#ifndef ORC_HAVE_RUNTIME_AVX512
+#define UNUSED(x) (void)(x)
+#endif
+
+namespace orc {
+
+  namespace {
+
+    constexpr int kCacheLevels = static_cast<int>(CpuInfo::CacheLevel::Last) + 1;
+
+    //============================== OS Dependent ==============================//
+
+#if defined(_WIN32)
+    //------------------------------ WINDOWS ------------------------------//
+    void OsRetrieveCacheSize(std::array<int64_t, kCacheLevels>* cache_sizes) {
+      PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer = nullptr;
+      PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer_position = nullptr;
+      DWORD buffer_size = 0;
+      size_t offset = 0;
+      typedef BOOL(WINAPI * GetLogicalProcessorInformationFuncPointer)(void*, void*);
+      GetLogicalProcessorInformationFuncPointer func_pointer =
+          (GetLogicalProcessorInformationFuncPointer)GetProcAddress(
+              GetModuleHandle("kernel32"), "GetLogicalProcessorInformation");
+
+      if (!func_pointer) {
+        throw ParseError("Failed to find procedure GetLogicalProcessorInformation");
+      }
+
+      // Get buffer size
+      if (func_pointer(buffer, &buffer_size) && GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
+        throw ParseError("Failed to get size of processor information buffer");
+      }
+
+      buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)malloc(buffer_size);
+      if (!buffer) {
+        return;
+      }
+
+      if (!func_pointer(buffer, &buffer_size)) {
+        free(buffer);
+        throw ParseError("Failed to get processor information");
+      }
+
+      buffer_position = buffer;
+      while (offset + sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION) <= buffer_size) {
+        if (RelationCache == buffer_position->Relationship) {
+          PCACHE_DESCRIPTOR cache = &buffer_position->Cache;
+          if (cache->Level >= 1 && cache->Level <= kCacheLevels) {
+            const int64_t current = (*cache_sizes)[cache->Level - 1];
+            (*cache_sizes)[cache->Level - 1] = std::max<int64_t>(current, cache->Size);
+          }
+        }
+        offset += sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
+        buffer_position++;
+      }
+
+      free(buffer);
+    }
+
+#if defined(CPUINFO_ARCH_X86)
+    // On x86, get CPU features by cpuid, https://en.wikipedia.org/wiki/CPUID
+
+#if defined(__MINGW64_VERSION_MAJOR) && __MINGW64_VERSION_MAJOR < 5
+    void __cpuidex(int CPUInfo[4], int function_id, int subfunction_id) {
+      __asm__ __volatile__("cpuid"
+                           : "=a"(CPUInfo[0]), "=b"(CPUInfo[1]), "=c"(CPUInfo[2]), "=d"(CPUInfo[3])
+                           : "a"(function_id), "c"(subfunction_id));
+    }
+
+    int64_t _xgetbv(int xcr) {
+      int out = 0;
+      __asm__ __volatile__("xgetbv" : "=a"(out) : "c"(xcr) : "%edx");
+      return out;
+    }
+#endif  // MINGW
+
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      int register_EAX_id = 1;
+      int highest_valid_id = 0;
+      int highest_extended_valid_id = 0;
+      std::bitset<32> features_ECX;
+      std::array<int, 4> cpu_info;
+
+      // Get highest valid id
+      __cpuid(cpu_info.data(), 0);
+      highest_valid_id = cpu_info[0];
+      // HEX of "GenuineIntel": 47656E75 696E6549 6E74656C
+      // HEX of "AuthenticAMD": 41757468 656E7469 63414D44
+      if (cpu_info[1] == 0x756e6547 && cpu_info[3] == 0x49656e69 && cpu_info[2] == 0x6c65746e) {
+        *vendor = CpuInfo::Vendor::Intel;
+      } else if (cpu_info[1] == 0x68747541 && cpu_info[3] == 0x69746e65 &&
+                 cpu_info[2] == 0x444d4163) {
+        *vendor = CpuInfo::Vendor::AMD;
+      }
+
+      if (highest_valid_id <= register_EAX_id) {
+        return;
+      }
+
+      // EAX=1: Processor Info and Feature Bits
+      __cpuidex(cpu_info.data(), register_EAX_id, 0);
+      features_ECX = cpu_info[2];
+
+      // Get highest extended id
+      __cpuid(cpu_info.data(), 0x80000000);
+      highest_extended_valid_id = cpu_info[0];
+
+      // Retrieve CPU model name
+      if (highest_extended_valid_id >= static_cast<int>(0x80000004)) {
+        model_name->clear();
+        for (int i = 0x80000002; i <= static_cast<int>(0x80000004); ++i) {
+          __cpuidex(cpu_info.data(), i, 0);
+          *model_name += std::string(reinterpret_cast<char*>(cpu_info.data()), sizeof(cpu_info));
+        }
+      }
+
+      bool zmm_enabled = false;
+      if (features_ECX[27]) {  // OSXSAVE
+        // Query if the OS supports saving ZMM registers when switching contexts
+        int64_t xcr0 = _xgetbv(0);
+        zmm_enabled = (xcr0 & 0xE0) == 0xE0;
+      }
+
+      if (features_ECX[9]) *hardware_flags |= CpuInfo::SSSE3;
+      if (features_ECX[19]) *hardware_flags |= CpuInfo::SSE4_1;
+      if (features_ECX[20]) *hardware_flags |= CpuInfo::SSE4_2;
+      if (features_ECX[23]) *hardware_flags |= CpuInfo::POPCNT;
+      if (features_ECX[28]) *hardware_flags |= CpuInfo::AVX;
+
+      // cpuid with EAX=7, ECX=0: Extended Features
+      register_EAX_id = 7;
+      if (highest_valid_id > register_EAX_id) {
+        __cpuidex(cpu_info.data(), register_EAX_id, 0);
+        std::bitset<32> features_EBX = cpu_info[1];
+
+        if (features_EBX[3]) *hardware_flags |= CpuInfo::BMI1;
+        if (features_EBX[5]) *hardware_flags |= CpuInfo::AVX2;
+        if (features_EBX[8]) *hardware_flags |= CpuInfo::BMI2;
+        if (zmm_enabled) {
+          if (features_EBX[16]) *hardware_flags |= CpuInfo::AVX512F;
+          if (features_EBX[17]) *hardware_flags |= CpuInfo::AVX512DQ;
+          if (features_EBX[28]) *hardware_flags |= CpuInfo::AVX512CD;
+          if (features_EBX[30]) *hardware_flags |= CpuInfo::AVX512BW;
+          if (features_EBX[31]) *hardware_flags |= CpuInfo::AVX512VL;
+        }
+      }
+    }
+#endif
+
+#elif defined(__APPLE__)
+    //------------------------------ MACOS ------------------------------//
+    std::optional<int64_t> IntegerSysCtlByName(const char* name) {
+      size_t len = sizeof(int64_t);
+      int64_t data = 0;
+      if (sysctlbyname(name, &data, &len, nullptr, 0) == 0) {
+        return data;
+      }
+      // ENOENT is the official errno value for non-existing sysctl's,
+      // but EINVAL and ENOTSUP have been seen in the wild.
+      if (errno != ENOENT && errno != EINVAL && errno != ENOTSUP) {
+        std::ostringstream ss;
+        ss << "sysctlbyname failed for '" << name << "'";
+        throw ParseError(ss.str());
+      }
+      return std::nullopt;
+    }
+
+    void OsRetrieveCacheSize(std::array<int64_t, kCacheLevels>* cache_sizes) {
+      static_assert(kCacheLevels >= 3, "");
+      auto c = IntegerSysCtlByName("hw.l1dcachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[0] = *c;
+      }
+      c = IntegerSysCtlByName("hw.l2cachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[1] = *c;
+      }
+      c = IntegerSysCtlByName("hw.l3cachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[2] = *c;
+      }
+    }
+
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      // hardware_flags
+      struct SysCtlCpuFeature {
+        const char* name;
+        int64_t flag;
+      };
+      std::vector<SysCtlCpuFeature> features = {
+#if defined(CPUINFO_ARCH_X86)
+        {"hw.optional.sse4_2",
+         CpuInfo::SSSE3 | CpuInfo::SSE4_1 | CpuInfo::SSE4_2 | CpuInfo::POPCNT},
+        {"hw.optional.avx1_0", CpuInfo::AVX},
+        {"hw.optional.avx2_0", CpuInfo::AVX2},
+        {"hw.optional.bmi1", CpuInfo::BMI1},
+        {"hw.optional.bmi2", CpuInfo::BMI2},
+        {"hw.optional.avx512f", CpuInfo::AVX512F},
+        {"hw.optional.avx512cd", CpuInfo::AVX512CD},
+        {"hw.optional.avx512dq", CpuInfo::AVX512DQ},
+        {"hw.optional.avx512bw", CpuInfo::AVX512BW},
+        {"hw.optional.avx512vl", CpuInfo::AVX512VL},
+#endif
+      };
+      for (const auto& feature : features) {
+        auto v = IntegerSysCtlByName(feature.name);
+        if (v.value_or(0)) {
+          *hardware_flags |= feature.flag;
+        }
+      }
+
+      // TODO: vendor, model_name
+      *vendor = CpuInfo::Vendor::Unknown;
+      *model_name = "Unknown";
+    }
+
+#else
+    //------------------------------ LINUX ------------------------------//
+    // Get cache size, return 0 on error
+    int64_t LinuxGetCacheSize(int level) {
+      // get cache size by sysconf()
+#ifdef _SC_LEVEL1_DCACHE_SIZE
+      const int kCacheSizeConf[] = {
+          _SC_LEVEL1_DCACHE_SIZE,
+          _SC_LEVEL2_CACHE_SIZE,
+          _SC_LEVEL3_CACHE_SIZE,
+      };
+      static_assert(sizeof(kCacheSizeConf) / sizeof(kCacheSizeConf[0]) == kCacheLevels, "");
+
+      errno = 0;
+      const int64_t cache_size = sysconf(kCacheSizeConf[level]);
+      if (errno == 0 && cache_size > 0) {
+        return cache_size;
+      }
+#endif
+
+      // get cache size from sysfs if sysconf() fails or not supported
+      const char* kCacheSizeSysfs[] = {
+          "/sys/devices/system/cpu/cpu0/cache/index0/size",  // l1d (index1 is l1i)
+          "/sys/devices/system/cpu/cpu0/cache/index2/size",  // l2
+          "/sys/devices/system/cpu/cpu0/cache/index3/size",  // l3
+      };
+      static_assert(sizeof(kCacheSizeSysfs) / sizeof(kCacheSizeSysfs[0]) == kCacheLevels, "");
+
+      std::ifstream cacheinfo(kCacheSizeSysfs[level], std::ios::in);
+      if (!cacheinfo) {
+        return 0;
+      }
+      // cacheinfo is one line like: 65536, 64K, 1M, etc.
+      uint64_t size = 0;
+      char unit = '\0';
+      cacheinfo >> size >> unit;
+      if (unit == 'K') {
+        size <<= 10;
+      } else if (unit == 'M') {
+        size <<= 20;
+      } else if (unit == 'G') {
+        size <<= 30;
+      } else if (unit != '\0') {
+        return 0;
+      }
+      return static_cast<int64_t>(size);
+    }
+
+    // Helper function to parse for hardware flags from /proc/cpuinfo
+    // values contains a list of space-separated flags.  check to see if the flags we
+    // care about are present.
+    // Returns a bitmap of flags.
+    int64_t LinuxParseCpuFlags(const std::string& values) {
+      const struct {
+        std::string name;
+        int64_t flag;
+      } flag_mappings[] = {
+#if defined(CPUINFO_ARCH_X86)
+        {"ssse3", CpuInfo::SSSE3},
+        {"sse4_1", CpuInfo::SSE4_1},
+        {"sse4_2", CpuInfo::SSE4_2},
+        {"popcnt", CpuInfo::POPCNT},
+        {"avx", CpuInfo::AVX},
+        {"avx2", CpuInfo::AVX2},
+        {"avx512f", CpuInfo::AVX512F},
+        {"avx512cd", CpuInfo::AVX512CD},
+        {"avx512vl", CpuInfo::AVX512VL},
+        {"avx512dq", CpuInfo::AVX512DQ},
+        {"avx512bw", CpuInfo::AVX512BW},
+        {"bmi1", CpuInfo::BMI1},
+        {"bmi2", CpuInfo::BMI2},
+#endif
+      };
+      const int64_t num_flags = sizeof(flag_mappings) / sizeof(flag_mappings[0]);
+
+      int64_t flags = 0;
+      for (int i = 0; i < num_flags; ++i) {
+        if (values.find(flag_mappings[i].name) != std::string::npos) {
+          flags |= flag_mappings[i].flag;
+        }
+      }
+      return flags;
+    }
+
+    void OsRetrieveCacheSize(std::array<int64_t, kCacheLevels>* cache_sizes) {
+      for (int i = 0; i < kCacheLevels; ++i) {
+        const int64_t cache_size = LinuxGetCacheSize(i);
+        if (cache_size > 0) {
+          (*cache_sizes)[i] = cache_size;
+        }
+      }
+    }
+
+    static constexpr bool IsWhitespace(char c) {
+      return c == ' ' || c == '\t';
+    }
+
+    std::string TrimString(std::string value) {
+      size_t ltrim_chars = 0;
+      while (ltrim_chars < value.size() && IsWhitespace(value[ltrim_chars])) {
+        ++ltrim_chars;
+      }
+      value.erase(0, ltrim_chars);
+      size_t rtrim_chars = 0;
+      while (rtrim_chars < value.size() && IsWhitespace(value[value.size() - 1 - rtrim_chars])) {
+        ++rtrim_chars;
+      }
+      value.erase(value.size() - rtrim_chars, rtrim_chars);
+      return value;
+    }
+
+    // Read from /proc/cpuinfo
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      std::ifstream cpuinfo("/proc/cpuinfo", std::ios::in);
+      while (cpuinfo) {
+        std::string line;
+        std::getline(cpuinfo, line);
+        const size_t colon = line.find(':');
+        if (colon != std::string::npos) {
+          const std::string name = TrimString(line.substr(0, colon - 1));
+          const std::string value = TrimString(line.substr(colon + 1, std::string::npos));
+          if (name.compare("flags") == 0 || name.compare("Features") == 0) {
+            *hardware_flags |= LinuxParseCpuFlags(value);
+          } else if (name.compare("model name") == 0) {
+            *model_name = value;
+          } else if (name.compare("vendor_id") == 0) {
+            if (value.compare("GenuineIntel") == 0) {
+              *vendor = CpuInfo::Vendor::Intel;
+            } else if (value.compare("AuthenticAMD") == 0) {
+              *vendor = CpuInfo::Vendor::AMD;
+            }
+          }
+        }
+      }
+    }
+#endif  // WINDOWS, MACOS, LINUX
+
+    //============================== Arch Dependent ==============================//
+
+#if defined(CPUINFO_ARCH_X86)
+    //------------------------------ X86_64 ------------------------------//
+    bool ArchParseUserSimdLevel(const std::string& simd_level, int64_t* hardware_flags) {
+      enum {
+        USER_SIMD_NONE,
+        USER_SIMD_AVX512,
+        USER_SIMD_MAX,
+      };
+
+      int level = USER_SIMD_MAX;
+      // Parse the level
+      if (simd_level == "AVX512") {
+        level = USER_SIMD_AVX512;
+      } else if (simd_level == "NONE") {
+        level = USER_SIMD_NONE;
+      } else {
+        return false;
+      }
+
+      // Disable feature as the level
+      if (level < USER_SIMD_AVX512) {
+        *hardware_flags &= ~CpuInfo::AVX512;
+      }
+      return true;
+    }
+
+    void ArchVerifyCpuRequirements(const CpuInfo* ci) {
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+      if (!ci->isDetected(CpuInfo::AVX512)) {
+        throw ParseError("CPU does not support the Supplemental AVX512 instruction set");
+      }
+#else
+      UNUSED(ci);
+#endif
+    }
+
+#endif  // X86
+
+  }  // namespace
+
+  struct CpuInfo::Impl {
+    int64_t hardware_flags = 0;
+    int numCores = 0;
+    int64_t original_hardware_flags = 0;
+    Vendor vendor = Vendor::Unknown;
+    std::string model_name = "Unknown";
+    std::array<int64_t, kCacheLevels> cache_sizes{};
+
+    Impl() {
+      OsRetrieveCacheSize(&cache_sizes);
+      OsRetrieveCpuInfo(&hardware_flags, &vendor, &model_name);
+      original_hardware_flags = hardware_flags;
+      numCores = std::max(static_cast<int>(std::thread::hardware_concurrency()), 1);
+
+      // parse user simd level
+      const auto maybe_env_var = std::getenv("ORC_USER_SIMD_LEVEL");

Review Comment:
   Could you update the PR description to explain how to use the env var `ORC_USER_SIMD_LEVEL`? It would be quite useful for troubleshooting.
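   
   Going by the quoted `ArchParseUserSimdLevel`, the variable appears to accept `AVX512` and `NONE`: `NONE` clears the AVX512 flags and any other value is rejected. A minimal sketch of that behaviour, based only on the quoted code. `kAvx512Flag`, `applyUserSimdLevel` and `applyUserSimdLevelFromEnv` are hypothetical names, the flag bit value is made up for illustration, and the handling of an unset variable is an assumption:
   
   ```cpp
   #include <cstdint>
   #include <cstdlib>
   #include <optional>
   #include <string>

   // Hypothetical bit standing in for CpuInfo::AVX512; the real value may differ.
   constexpr int64_t kAvx512Flag = int64_t{1} << 9;

   // Returns the adjusted flags, or std::nullopt for an unrecognised level.
   inline std::optional<int64_t> applyUserSimdLevel(const std::string& level, int64_t flags) {
     if (level == "AVX512") return flags;               // keep the detected features
     if (level == "NONE") return flags & ~kAvx512Flag;  // mask out AVX512
     return std::nullopt;                               // unknown value
   }

   // Reads ORC_USER_SIMD_LEVEL from the environment.
   // Assumption: an unset variable leaves the flags unchanged.
   inline std::optional<int64_t> applyUserSimdLevelFromEnv(int64_t flags) {
     const char* value = std::getenv("ORC_USER_SIMD_LEVEL");
     if (value == nullptr) return flags;
     return applyUserSimdLevel(value, flags);
   }
   ```
   
   For troubleshooting, setting `ORC_USER_SIMD_LEVEL=NONE` in the environment would then presumably make the reader fall back to the non-AVX512 decode path, which allows comparing results with and without the vectorized code.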




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1147249880


##########
c++/src/CpuInfoUtil.cc:
##########
@@ -0,0 +1,545 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * @file CpuInfoUtil.cc is from Apache Arrow as of 2023-03-21
+ */
+
+#include "CpuInfoUtil.hh"
+
+#ifdef __APPLE__
+#include <sys/sysctl.h>
+#endif
+
+#ifndef _MSC_VER
+#include <unistd.h>
+#endif
+
+#ifdef _WIN32
+#define NOMINMAX
+#include <Windows.h>
+#include <intrin.h>
+#endif
+
+#include <algorithm>
+#include <array>
+#include <bitset>
+#include <cstdint>
+#include <fstream>
+#include <optional>
+#include <sstream>
+#include <string>
+#include <thread>
+#include <vector>
+
+#include "orc/Exceptions.hh"
+
+#undef CPUINFO_ARCH_X86
+
+#if defined(__i386) || defined(_M_IX86) || defined(__x86_64__) || defined(_M_X64)
+#define CPUINFO_ARCH_X86
+#endif
+
+#ifndef ORC_HAVE_RUNTIME_AVX512
+#define UNUSED(x) (void)(x)
+#endif
+
+namespace orc {
+
+  namespace {
+
+    constexpr int kCacheLevels = static_cast<int>(CpuInfo::CacheLevel::Last) + 1;
+
+    //============================== OS Dependent ==============================//
+
+#if defined(_WIN32)
+    //------------------------------ WINDOWS ------------------------------//
+    void OsRetrieveCacheSize(std::array<int64_t, kCacheLevels>* cache_sizes) {
+      PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer = nullptr;
+      PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer_position = nullptr;
+      DWORD buffer_size = 0;
+      size_t offset = 0;
+      typedef BOOL(WINAPI * GetLogicalProcessorInformationFuncPointer)(void*, void*);
+      GetLogicalProcessorInformationFuncPointer func_pointer =
+          (GetLogicalProcessorInformationFuncPointer)GetProcAddress(
+              GetModuleHandle("kernel32"), "GetLogicalProcessorInformation");
+
+      if (!func_pointer) {
+        throw ParseError("Failed to find procedure GetLogicalProcessorInformation");
+      }
+
+      // Get buffer size
+      if (func_pointer(buffer, &buffer_size) && GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
+        throw ParseError("Failed to get size of processor information buffer");
+      }
+
+      buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)malloc(buffer_size);
+      if (!buffer) {
+        return;
+      }
+
+      if (!func_pointer(buffer, &buffer_size)) {
+        free(buffer);
+        throw ParseError("Failed to get processor information");
+      }
+
+      buffer_position = buffer;
+      while (offset + sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION) <= buffer_size) {
+        if (RelationCache == buffer_position->Relationship) {
+          PCACHE_DESCRIPTOR cache = &buffer_position->Cache;
+          if (cache->Level >= 1 && cache->Level <= kCacheLevels) {
+            const int64_t current = (*cache_sizes)[cache->Level - 1];
+            (*cache_sizes)[cache->Level - 1] = std::max<int64_t>(current, cache->Size);
+          }
+        }
+        offset += sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
+        buffer_position++;
+      }
+
+      free(buffer);
+    }
+
+#if defined(CPUINFO_ARCH_X86)
+    // On x86, get CPU features by cpuid, https://en.wikipedia.org/wiki/CPUID
+
+#if defined(__MINGW64_VERSION_MAJOR) && __MINGW64_VERSION_MAJOR < 5
+    void __cpuidex(int CPUInfo[4], int function_id, int subfunction_id) {
+      __asm__ __volatile__("cpuid"
+                           : "=a"(CPUInfo[0]), "=b"(CPUInfo[1]), "=c"(CPUInfo[2]), "=d"(CPUInfo[3])
+                           : "a"(function_id), "c"(subfunction_id));
+    }
+
+    int64_t _xgetbv(int xcr) {
+      int out = 0;
+      __asm__ __volatile__("xgetbv" : "=a"(out) : "c"(xcr) : "%edx");
+      return out;
+    }
+#endif  // MINGW
+
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      int register_EAX_id = 1;
+      int highest_valid_id = 0;
+      int highest_extended_valid_id = 0;
+      std::bitset<32> features_ECX;
+      std::array<int, 4> cpu_info;
+
+      // Get highest valid id
+      __cpuid(cpu_info.data(), 0);
+      highest_valid_id = cpu_info[0];
+      // HEX of "GenuineIntel": 47656E75 696E6549 6E74656C
+      // HEX of "AuthenticAMD": 41757468 656E7469 63414D44
+      if (cpu_info[1] == 0x756e6547 && cpu_info[3] == 0x49656e69 && cpu_info[2] == 0x6c65746e) {
+        *vendor = CpuInfo::Vendor::Intel;
+      } else if (cpu_info[1] == 0x68747541 && cpu_info[3] == 0x69746e65 &&
+                 cpu_info[2] == 0x444d4163) {
+        *vendor = CpuInfo::Vendor::AMD;
+      }
+
+      if (highest_valid_id <= register_EAX_id) {
+        return;
+      }
+
+      // EAX=1: Processor Info and Feature Bits
+      __cpuidex(cpu_info.data(), register_EAX_id, 0);
+      features_ECX = cpu_info[2];
+
+      // Get highest extended id
+      __cpuid(cpu_info.data(), 0x80000000);
+      highest_extended_valid_id = cpu_info[0];
+
+      // Retrieve CPU model name
+      if (highest_extended_valid_id >= static_cast<int>(0x80000004)) {
+        model_name->clear();
+        for (int i = 0x80000002; i <= static_cast<int>(0x80000004); ++i) {
+          __cpuidex(cpu_info.data(), i, 0);
+          *model_name += std::string(reinterpret_cast<char*>(cpu_info.data()), sizeof(cpu_info));
+        }
+      }
+
+      bool zmm_enabled = false;
+      if (features_ECX[27]) {  // OSXSAVE
+        // Query if the OS supports saving ZMM registers when switching contexts
+        int64_t xcr0 = _xgetbv(0);
+        zmm_enabled = (xcr0 & 0xE0) == 0xE0;
+      }
+
+      if (features_ECX[9]) *hardware_flags |= CpuInfo::SSSE3;
+      if (features_ECX[19]) *hardware_flags |= CpuInfo::SSE4_1;
+      if (features_ECX[20]) *hardware_flags |= CpuInfo::SSE4_2;
+      if (features_ECX[23]) *hardware_flags |= CpuInfo::POPCNT;
+      if (features_ECX[28]) *hardware_flags |= CpuInfo::AVX;
+
+      // cpuid with EAX=7, ECX=0: Extended Features
+      register_EAX_id = 7;
+      if (highest_valid_id > register_EAX_id) {
+        __cpuidex(cpu_info.data(), register_EAX_id, 0);
+        std::bitset<32> features_EBX = cpu_info[1];
+
+        if (features_EBX[3]) *hardware_flags |= CpuInfo::BMI1;
+        if (features_EBX[5]) *hardware_flags |= CpuInfo::AVX2;
+        if (features_EBX[8]) *hardware_flags |= CpuInfo::BMI2;
+        if (zmm_enabled) {
+          if (features_EBX[16]) *hardware_flags |= CpuInfo::AVX512F;
+          if (features_EBX[17]) *hardware_flags |= CpuInfo::AVX512DQ;
+          if (features_EBX[28]) *hardware_flags |= CpuInfo::AVX512CD;
+          if (features_EBX[30]) *hardware_flags |= CpuInfo::AVX512BW;
+          if (features_EBX[31]) *hardware_flags |= CpuInfo::AVX512VL;
+        }
+      }
+    }
+#endif
+
+#elif defined(__APPLE__)
+    //------------------------------ MACOS ------------------------------//
+    std::optional<int64_t> IntegerSysCtlByName(const char* name) {
+      size_t len = sizeof(int64_t);
+      int64_t data = 0;
+      if (sysctlbyname(name, &data, &len, nullptr, 0) == 0) {
+        return data;
+      }
+      // ENOENT is the official errno value for non-existing sysctl's,
+      // but EINVAL and ENOTSUP have been seen in the wild.
+      if (errno != ENOENT && errno != EINVAL && errno != ENOTSUP) {
+        std::ostringstream ss;
+        ss << "sysctlbyname failed for '" << name << "'";
+        throw ParseError(ss.str());
+      }
+      return std::nullopt;
+    }
+
+    void OsRetrieveCacheSize(std::array<int64_t, kCacheLevels>* cache_sizes) {
+      static_assert(kCacheLevels >= 3, "");
+      auto c = IntegerSysCtlByName("hw.l1dcachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[0] = *c;
+      }
+      c = IntegerSysCtlByName("hw.l2cachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[1] = *c;
+      }
+      c = IntegerSysCtlByName("hw.l3cachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[2] = *c;
+      }
+    }
+
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      // hardware_flags
+      struct SysCtlCpuFeature {
+        const char* name;
+        int64_t flag;
+      };
+      std::vector<SysCtlCpuFeature> features = {
+#if defined(CPUINFO_ARCH_X86)
+        {"hw.optional.sse4_2",
+         CpuInfo::SSSE3 | CpuInfo::SSE4_1 | CpuInfo::SSE4_2 | CpuInfo::POPCNT},
+        {"hw.optional.avx1_0", CpuInfo::AVX},
+        {"hw.optional.avx2_0", CpuInfo::AVX2},
+        {"hw.optional.bmi1", CpuInfo::BMI1},
+        {"hw.optional.bmi2", CpuInfo::BMI2},
+        {"hw.optional.avx512f", CpuInfo::AVX512F},
+        {"hw.optional.avx512cd", CpuInfo::AVX512CD},
+        {"hw.optional.avx512dq", CpuInfo::AVX512DQ},
+        {"hw.optional.avx512bw", CpuInfo::AVX512BW},
+        {"hw.optional.avx512vl", CpuInfo::AVX512VL},
+#endif
+      };
+      for (const auto& feature : features) {
+        auto v = IntegerSysCtlByName(feature.name);
+        if (v.value_or(0)) {
+          *hardware_flags |= feature.flag;
+        }
+      }
+
+      // TODO: vendor, model_name
+      *vendor = CpuInfo::Vendor::Unknown;
+      *model_name = "Unknown";
+    }
+
+#else
+    //------------------------------ LINUX ------------------------------//
+    // Get cache size, return 0 on error
+    int64_t LinuxGetCacheSize(int level) {
+      // get cache size by sysconf()
+#ifdef _SC_LEVEL1_DCACHE_SIZE
+      const int kCacheSizeConf[] = {
+          _SC_LEVEL1_DCACHE_SIZE,
+          _SC_LEVEL2_CACHE_SIZE,
+          _SC_LEVEL3_CACHE_SIZE,
+      };
+      static_assert(sizeof(kCacheSizeConf) / sizeof(kCacheSizeConf[0]) == kCacheLevels, "");
+
+      errno = 0;
+      const int64_t cache_size = sysconf(kCacheSizeConf[level]);
+      if (errno == 0 && cache_size > 0) {
+        return cache_size;
+      }
+#endif
+
+      // get cache size from sysfs if sysconf() fails or not supported
+      const char* kCacheSizeSysfs[] = {
+          "/sys/devices/system/cpu/cpu0/cache/index0/size",  // l1d (index1 is l1i)
+          "/sys/devices/system/cpu/cpu0/cache/index2/size",  // l2
+          "/sys/devices/system/cpu/cpu0/cache/index3/size",  // l3
+      };
+      static_assert(sizeof(kCacheSizeSysfs) / sizeof(kCacheSizeSysfs[0]) == kCacheLevels, "");
+
+      std::ifstream cacheinfo(kCacheSizeSysfs[level], std::ios::in);
+      if (!cacheinfo) {
+        return 0;
+      }
+      // cacheinfo is one line like: 65536, 64K, 1M, etc.
+      uint64_t size = 0;
+      char unit = '\0';
+      cacheinfo >> size >> unit;
+      if (unit == 'K') {
+        size <<= 10;
+      } else if (unit == 'M') {
+        size <<= 20;
+      } else if (unit == 'G') {
+        size <<= 30;
+      } else if (unit != '\0') {
+        return 0;
+      }
+      return static_cast<int64_t>(size);
+    }
+
+    // Helper function to parse for hardware flags from /proc/cpuinfo
+    // values contains a list of space-separated flags.  check to see if the flags we
+    // care about are present.
+    // Returns a bitmap of flags.
+    int64_t LinuxParseCpuFlags(const std::string& values) {
+      const struct {
+        std::string name;
+        int64_t flag;
+      } flag_mappings[] = {
+#if defined(CPUINFO_ARCH_X86)
+        {"ssse3", CpuInfo::SSSE3},
+        {"sse4_1", CpuInfo::SSE4_1},
+        {"sse4_2", CpuInfo::SSE4_2},
+        {"popcnt", CpuInfo::POPCNT},
+        {"avx", CpuInfo::AVX},
+        {"avx2", CpuInfo::AVX2},
+        {"avx512f", CpuInfo::AVX512F},
+        {"avx512cd", CpuInfo::AVX512CD},
+        {"avx512vl", CpuInfo::AVX512VL},
+        {"avx512dq", CpuInfo::AVX512DQ},
+        {"avx512bw", CpuInfo::AVX512BW},
+        {"bmi1", CpuInfo::BMI1},
+        {"bmi2", CpuInfo::BMI2},
+#endif
+      };
+      const int64_t num_flags = sizeof(flag_mappings) / sizeof(flag_mappings[0]);
+
+      int64_t flags = 0;
+      for (int i = 0; i < num_flags; ++i) {
+        if (values.find(flag_mappings[i].name) != std::string::npos) {
+          flags |= flag_mappings[i].flag;
+        }
+      }
+      return flags;
+    }
+
+    void OsRetrieveCacheSize(std::array<int64_t, kCacheLevels>* cache_sizes) {
+      for (int i = 0; i < kCacheLevels; ++i) {
+        const int64_t cache_size = LinuxGetCacheSize(i);
+        if (cache_size > 0) {
+          (*cache_sizes)[i] = cache_size;
+        }
+      }
+    }
+
+    static constexpr bool IsWhitespace(char c) {
+      return c == ' ' || c == '\t';
+    }
+
+    std::string TrimString(std::string value) {
+      size_t ltrim_chars = 0;
+      while (ltrim_chars < value.size() && IsWhitespace(value[ltrim_chars])) {
+        ++ltrim_chars;
+      }
+      value.erase(0, ltrim_chars);
+      size_t rtrim_chars = 0;
+      while (rtrim_chars < value.size() && IsWhitespace(value[value.size() - 1 - rtrim_chars])) {
+        ++rtrim_chars;
+      }
+      value.erase(value.size() - rtrim_chars, rtrim_chars);
+      return value;
+    }
+
+    // Read from /proc/cpuinfo
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      std::ifstream cpuinfo("/proc/cpuinfo", std::ios::in);
+      while (cpuinfo) {
+        std::string line;
+        std::getline(cpuinfo, line);
+        const size_t colon = line.find(':');
+        if (colon != std::string::npos) {
+          const std::string name = TrimString(line.substr(0, colon - 1));
+          const std::string value = TrimString(line.substr(colon + 1, std::string::npos));
+          if (name.compare("flags") == 0 || name.compare("Features") == 0) {
+            *hardware_flags |= LinuxParseCpuFlags(value);
+          } else if (name.compare("model name") == 0) {
+            *model_name = value;
+          } else if (name.compare("vendor_id") == 0) {
+            if (value.compare("GenuineIntel") == 0) {
+              *vendor = CpuInfo::Vendor::Intel;
+            } else if (value.compare("AuthenticAMD") == 0) {
+              *vendor = CpuInfo::Vendor::AMD;
+            }
+          }
+        }
+      }
+    }
+#endif  // WINDOWS, MACOS, LINUX
+
+    //============================== Arch Dependent ==============================//
+
+#if defined(CPUINFO_ARCH_X86)
+    //------------------------------ X86_64 ------------------------------//
+    bool ArchParseUserSimdLevel(const std::string& simd_level, int64_t* hardware_flags) {
+      enum {
+        USER_SIMD_NONE,
+        USER_SIMD_AVX512,
+        USER_SIMD_MAX,
+      };
+
+      int level = USER_SIMD_MAX;
+      // Parse the level
+      if (simd_level == "AVX512") {
+        level = USER_SIMD_AVX512;
+      } else if (simd_level == "NONE") {
+        level = USER_SIMD_NONE;
+      } else {
+        return false;
+      }
+
+      // Disable features above the requested level
+      if (level < USER_SIMD_AVX512) {
+        *hardware_flags &= ~CpuInfo::AVX512;
+      }
+      return true;
+    }
+
+    void ArchVerifyCpuRequirements(const CpuInfo* ci) {
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+      if (!ci->isDetected(CpuInfo::AVX512)) {
+        throw ParseError("CPU does not support the Supplemental AVX512 instruction set");
+      }
+#else
+      UNUSED(ci);
+#endif
+    }
+
+#endif  // X86
+
+  }  // namespace
+
+  struct CpuInfo::Impl {
+    int64_t hardware_flags = 0;
+    int numCores = 0;
+    int64_t original_hardware_flags = 0;
+    Vendor vendor = Vendor::Unknown;
+    std::string model_name = "Unknown";
+    std::array<int64_t, kCacheLevels> cache_sizes{};
+
+    Impl() {
+      OsRetrieveCacheSize(&cache_sizes);
+      OsRetrieveCpuInfo(&hardware_flags, &vendor, &model_name);
+      original_hardware_flags = hardware_flags;
+      numCores = std::max(static_cast<int>(std::thread::hardware_concurrency()), 1);
+
+      // parse user simd level
+      const auto maybe_env_var = std::getenv("ORC_USER_SIMD_LEVEL");

Review Comment:
   OK. I will update it. Thank you very much for reminding me.
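
For readers following the hunk above: `ORC_USER_SIMD_LEVEL` is read in the `CpuInfo::Impl` constructor, and `ArchParseUserSimdLevel` accepts `AVX512` or `NONE`, clearing the AVX512 flag when the level is `NONE`. Below is a minimal standalone sketch of that masking step. The names `kAvx512`, `kBmi2` and `applyUserSimdLevel` are hypothetical stand-ins for illustration, not the PR's API, and the sketch assumes the level string is compared exactly, since the hunk does not show any normalization of the environment variable.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical flag bit standing in for CpuInfo::AVX512.
constexpr int64_t kAvx512 = 1LL << 0;
// Hypothetical second bit, used to show that unrelated flags are left untouched.
constexpr int64_t kBmi2 = 1LL << 1;

// Applies a user-requested SIMD level to the detected hardware flags.
// Returns false (and leaves the flags unchanged) for an unrecognized level.
inline bool applyUserSimdLevel(const std::string& level, int64_t* hardwareFlags) {
  if (level == "AVX512") {
    return true;  // highest supported level: nothing to disable
  }
  if (level == "NONE") {
    *hardwareFlags &= ~kAvx512;  // turn the AVX512 code paths off
    return true;
  }
  return false;
}
```

With this shape, a user could force the scalar path on an AVX512-capable machine by exporting `ORC_USER_SIMD_LEVEL=NONE`, while an unknown value is reported to the caller instead of silently changing the flags.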



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1484641476

   > > The CI test failed because the machine doesn't support AVX512. Maybe we should run these CI SIMD tests on AVX512 machines. https://github.com/apache/orc/actions/runs/4528477658/jobs/7975338899?pr=1375#step:3:41
   > 
   > Could we make it robust? This is likely to happen again in the future, which may disrupt code review.
   
   Hi @wgtmac, may I ask a question about this situation?
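
One possible way to keep SIMD tests from failing on machines without AVX512 is to decide at run time which decode path to exercise. The sketch below is only an illustration under stated assumptions: `cpuHasAvx512`, `DecodePath` and `chooseDecodePath` are hypothetical names, not part of the PR, and the detection relies on the GCC/Clang builtin `__builtin_cpu_supports` on x86, falling back to "not supported" on other compilers and architectures.

```cpp
#include <cassert>

// Hypothetical helper: reports whether the running CPU advertises AVX-512F.
// Uses the GCC/Clang x86 builtin; other toolchains conservatively report false.
inline bool cpuHasAvx512() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
  return __builtin_cpu_supports("avx512f") != 0;
#else
  return false;
#endif
}

enum class DecodePath { Scalar, Avx512 };

// Picks the vector path only when it was compiled in AND the CPU supports it,
// so a test run on a non-AVX512 machine falls back to the scalar path.
inline DecodePath chooseDecodePath(bool builtWithAvx512, bool cpuSupportsAvx512) {
  return (builtWithAvx512 && cpuSupportsAvx512) ? DecodePath::Avx512 : DecodePath::Scalar;
}
```

A test could call `chooseDecodePath(true, cpuHasAvx512())` and skip or redirect the vectorized cases when the result is `DecodePath::Scalar`, rather than aborting on an unsupported instruction.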



[GitHub] [orc] wgtmac commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1148912625


##########
README.md:
##########
@@ -93,3 +93,16 @@ To build only the C++ library:
 % make test-out
 
 ```
+
+To build the C++ library with AVX512 enabled:

Review Comment:
   Please check if this looks good to you. @stiga-huang 




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1169459975


##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,2694 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"
+#include "CpuInfoUtil.hh"
+#include "RLEv2.hh"
+
+namespace orc {
+  UnpackAvx512::UnpackAvx512(RleDecoderV2* dec) : decoder(dec), unpackDefault(UnpackDefault(dec)) {
+    // PASS
+  }
+
+  UnpackAvx512::~UnpackAvx512() {
+    // PASS
+  }
+
+  inline void UnpackAvx512::alignHeaderBoundary(const uint32_t bitWidth, const uint32_t bitMaxSize,
+                                                uint64_t& startBit, uint64_t& bufMoveByteLen,
+                                                uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint64_t& tailBitLen, uint32_t& backupByteLen,
+                                                uint64_t& numElements, bool& resetBuf,
+                                                const uint8_t*& srcPtr, int64_t*& dstPtr) {
+    if (startBit != 0) {
+      bufMoveByteLen +=
+          moveByteLen(remainingNumElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+    } else {
+      bufMoveByteLen += moveByteLen(remainingNumElements * bitWidth);
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      numElements = remainingNumElements;
+      resetBuf = false;
+      remainingNumElements = 0;
+    } else {
+      uint64_t leadingBits = 0;
+      if (startBit != 0) leadingBits = ORC_VECTOR_BYTE_WIDTH - startBit;
+      uint64_t bufRestBitLen = bufRestByteLen * ORC_VECTOR_BYTE_WIDTH + leadingBits;
+      numElements = bufRestBitLen / bitWidth;
+      remainingNumElements -= numElements;
+      tailBitLen = fmod(bufRestBitLen, bitWidth);
+      resetBuf = true;
+    }
+
+    if (tailBitLen != 0) {
+      backupByteLen = tailBitLen / ORC_VECTOR_BYTE_WIDTH;
+      tailBitLen = 0;
+    }
+
+    if (startBit > 0) {
+      uint32_t align = getAlign(startBit, bitWidth, bitMaxSize);
+      if (align > numElements) {
+        align = numElements;
+      }
+      if (align != 0) {
+        bufMoveByteLen -= moveByteLen(align * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+        plainUnpackLongs(dstPtr, 0, align, bitWidth, startBit);
+        srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+        bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+        dstPtr += align;
+        numElements -= align;
+      }
+    }
+  }
+
+  inline void UnpackAvx512::alignTailerBoundary(const uint32_t bitWidth, uint64_t& startBit,
+                                                uint64_t& bufMoveByteLen, uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint32_t& backupByteLen, uint64_t& numElements,
+                                                bool& resetBuf, const uint8_t*& srcPtr,
+                                                int64_t*& dstPtr) {
+    if (numElements > 0) {
+      if (startBit != 0) {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+      } else {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth);
+      }
+      plainUnpackLongs(dstPtr, 0, numElements, bitWidth, startBit);
+      srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+      dstPtr += numElements;
+      bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufMoveByteLen,
+                                resetBuf, backupByteLen);
+      return;
+    }
+
+    if (backupByteLen != 0) {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                resetBuf, backupByteLen);
+      plainUnpackLongs(dstPtr, 0, 1, bitWidth, startBit);
+      dstPtr++;
+      backupByteLen = 0;
+      remainingNumElements--;
+    } else {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                resetBuf, backupByteLen);

Review Comment:
   Fixed
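
As a reading aid for the `alignHeaderBoundary` hunk above: when the requested elements do not fit in the bytes left in the buffer, the code converts the remaining bytes (plus any unread bits of the current byte) to bits, then divides by `bitWidth` to get the number of whole elements and keeps the remainder as tail bits. The sketch below restates only that arithmetic. `kBitsPerByte`, `BufferSplit` and `splitRemaining` are hypothetical names, `kBitsPerByte = 8` is an assumption about the value of `ORC_VECTOR_BYTE_WIDTH`, and the sketch uses integer `%` where the hunk calls `fmod`, which gives the same remainder for these unsigned operands without a floating-point round trip.

```cpp
#include <cassert>
#include <cstdint>

// Assumed to match ORC_VECTOR_BYTE_WIDTH in the PR (bits per byte).
constexpr uint64_t kBitsPerByte = 8;

struct BufferSplit {
  uint64_t wholeElements;  // elements fully contained in the remaining buffer
  uint64_t tailBits;       // leftover bits of an element that straddles the buffer end
};

// startBit is the bit offset already consumed in the current byte (0 = byte-aligned).
inline BufferSplit splitRemaining(uint64_t restBytes, uint64_t startBit, uint32_t bitWidth) {
  const uint64_t leadingBits = (startBit != 0) ? kBitsPerByte - startBit : 0;
  const uint64_t restBits = restBytes * kBitsPerByte + leadingBits;
  return {restBits / bitWidth, restBits % bitWidth};
}
```

For example, 10 remaining bytes at `bitWidth = 3` with no partial byte give 80 bits, that is 26 whole elements and 2 tail bits; the tail is what the surrounding code has to carry over to the next buffer.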



##########
c++/src/RLEv2.hh:
##########
@@ -220,17 +221,36 @@ namespace orc {
 
     const std::unique_ptr<SeekableInputStream> inputStream;
     const bool isSigned;
-
     unsigned char firstByte;
-    uint64_t runLength;  // Length of the current run
-    uint64_t runRead;    // Number of returned values of the current run
-    const char* bufferStart;
-    const char* bufferEnd;
-    uint32_t bitsLeft;                  // Used by readLongs when bitSize < 8
-    uint32_t curByte;                   // Used by anything that uses readLongs
+    uint64_t runLength;                 // Length of the current run
+    uint64_t runRead;                   // Number of returned values of the current run
     DataBuffer<int64_t> unpackedPatch;  // Used by PATCHED_BASE
     DataBuffer<int64_t> literals;       // Values of the current run
   };
+
+  inline void RleDecoderV2::resetBufferStart(char** bufStart, char** bufEnd, uint64_t len,
+                                             bool resetBuf, uint32_t backupByteLen) {
+    uint64_t remainingLen = *bufEnd - *bufStart;
+    int bufferLength = 0;
+    const void* bufferPointer = nullptr;
+
+    if (backupByteLen != 0) {
+      inputStream->BackUp(backupByteLen);
+    }
+
+    if (len >= remainingLen && resetBuf == true) {

Review Comment:
   Fixed




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1169464300


##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,2694 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"
+#include "CpuInfoUtil.hh"
+#include "RLEv2.hh"
+
+namespace orc {
+  UnpackAvx512::UnpackAvx512(RleDecoderV2* dec) : decoder(dec), unpackDefault(UnpackDefault(dec)) {
+    // PASS
+  }
+
+  UnpackAvx512::~UnpackAvx512() {
+    // PASS
+  }
+
+  inline void UnpackAvx512::alignHeaderBoundary(const uint32_t bitWidth, const uint32_t bitMaxSize,
+                                                uint64_t& startBit, uint64_t& bufMoveByteLen,
+                                                uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint64_t& tailBitLen, uint32_t& backupByteLen,
+                                                uint64_t& numElements, bool& resetBuf,
+                                                const uint8_t*& srcPtr, int64_t*& dstPtr) {
+    if (startBit != 0) {
+      bufMoveByteLen +=
+          moveByteLen(remainingNumElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+    } else {
+      bufMoveByteLen += moveByteLen(remainingNumElements * bitWidth);
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      numElements = remainingNumElements;
+      resetBuf = false;
+      remainingNumElements = 0;
+    } else {
+      uint64_t leadingBits = 0;
+      if (startBit != 0) leadingBits = ORC_VECTOR_BYTE_WIDTH - startBit;
+      uint64_t bufRestBitLen = bufRestByteLen * ORC_VECTOR_BYTE_WIDTH + leadingBits;
+      numElements = bufRestBitLen / bitWidth;
+      remainingNumElements -= numElements;
+      tailBitLen = fmod(bufRestBitLen, bitWidth);
+      resetBuf = true;
+    }
+
+    if (tailBitLen != 0) {
+      backupByteLen = tailBitLen / ORC_VECTOR_BYTE_WIDTH;
+      tailBitLen = 0;
+    }
+
+    if (startBit > 0) {
+      uint32_t align = getAlign(startBit, bitWidth, bitMaxSize);
+      if (align > numElements) {
+        align = numElements;
+      }
+      if (align != 0) {
+        bufMoveByteLen -= moveByteLen(align * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+        plainUnpackLongs(dstPtr, 0, align, bitWidth, startBit);
+        srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+        bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+        dstPtr += align;
+        numElements -= align;
+      }
+    }
+  }
+
+  inline void UnpackAvx512::alignTailerBoundary(const uint32_t bitWidth, uint64_t& startBit,
+                                                uint64_t& bufMoveByteLen, uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint32_t& backupByteLen, uint64_t& numElements,
+                                                bool& resetBuf, const uint8_t*& srcPtr,
+                                                int64_t*& dstPtr) {
+    if (numElements > 0) {
+      if (startBit != 0) {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+      } else {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth);
+      }
+      plainUnpackLongs(dstPtr, 0, numElements, bitWidth, startBit);
+      srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+      dstPtr += numElements;
+      bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufMoveByteLen,
+                                resetBuf, backupByteLen);
+      return;
+    }
+
+    if (backupByteLen != 0) {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                resetBuf, backupByteLen);
+      plainUnpackLongs(dstPtr, 0, 1, bitWidth, startBit);
+      dstPtr++;
+      backupByteLen = 0;
+      remainingNumElements--;
+    } else {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                resetBuf, backupByteLen);
+    }
+
+    bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bufMoveByteLen = 0;
+    srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+  }
+
+  void UnpackAvx512::vectorUnpack1(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 1;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __m512i reverseMask1u = _mm512_loadu_si512(reverseMaskTable1u);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          uint64_t src_64 = *reinterpret_cast<uint64_t*>(const_cast<uint8_t*>(srcPtr));
+          // convert mask to 512-bit register. 0 --> 0x00, 1 --> 0xFF
+          __m512i srcmm = _mm512_movm_epi8(src_64);
+          // make 0x00 --> 0x00, 0xFF --> 0x01
+          srcmm = _mm512_abs_epi8(srcmm);
+          srcmm = _mm512_shuffle_epi8(srcmm, reverseMask1u);
+          _mm512_storeu_si512(simdPtr, srcmm);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack2(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 2;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_MAX_16U;         // first 16 bytes (64 elements)
+        __m512i parse_mask = _mm512_set1_epi16(0x0303);  // 2 times 1 then (8 - 2) times 0
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm3 = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          __m512i srcmm0, srcmm1, srcmm2, tmpmm;
+
+          srcmm2 = _mm512_srli_epi16(srcmm3, 2);
+          srcmm1 = _mm512_srli_epi16(srcmm3, 4);
+          srcmm0 = _mm512_srli_epi16(srcmm3, 6);
+
+          // turn 2 bitWidth into 8 by zeroing 3 of each 4 elements.
+          // move them into their places
+          // srcmm0: a e i m 0 0 0 0 0 0 0 0 0 0 0 0
+          // srcmm1: b f j n 0 0 0 0 0 0 0 0 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm0, srcmm1);        // ab ef 00 00 00 00 00 00
+          srcmm0 = _mm512_unpackhi_epi8(srcmm0, srcmm1);       // ij mn 00 00 00 00 00 00
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x00);  // ab ef ab ef ij mn ij mn
+
+          // srcmm2: c g k o 0 0 0 0 0 0 0 0 0 0 0 0
+          // srcmm3: d h l p 0 0 0 0 0 0 0 0 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm2, srcmm3);        // cd gh 00 00 00 00 00 00
+          srcmm1 = _mm512_unpackhi_epi8(srcmm2, srcmm3);       // kl op 00 00 00 00 00 00
+          srcmm1 = _mm512_shuffle_i64x2(tmpmm, srcmm1, 0x00);  // cd gh cd gh kl op kl op
+
+          tmpmm = _mm512_unpacklo_epi16(srcmm0, srcmm1);        // abcd abcd ijkl ijkl
+          srcmm0 = _mm512_unpackhi_epi16(srcmm0, srcmm1);       // efgh efgh mnop mnop
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x88);   // abcd ijkl efgh mnop
+          srcmm0 = _mm512_shuffle_i64x2(srcmm0, srcmm0, 0xD8);  // abcd efgh ijkl mnop
+
+          srcmm0 = _mm512_and_si512(srcmm0, parse_mask);
+
+          _mm512_storeu_si512(simdPtr, srcmm0);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack3(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 3;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable3u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable3u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable3u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable3u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable3u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffle so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack4(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 4;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_MAX_32U;        // first 32 bytes (64 elements)
+        __m512i parseMask = _mm512_set1_epi16(0x0F0F);  // 4 times 1 then (8 - 4) times 0
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm0, srcmm1, tmpmm;
+
+          srcmm1 = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm0 = _mm512_srli_epi16(srcmm1, 4);
+
+          // move elements into their places
+          // srcmm0: a c e g 0 0 0 0
+          // srcmm1: b d f h 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm0, srcmm1);         // ab ef 00 00
+          srcmm0 = _mm512_unpackhi_epi8(srcmm0, srcmm1);        // cd gh 00 00
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x44);   // ab ef cd gh
+          srcmm0 = _mm512_shuffle_i64x2(srcmm0, srcmm0, 0xD8);  // ab cd ef gh
+
+          // turn 4 bitWidth into 8 by zeroing 4 of each 8 bits.
+          srcmm0 = _mm512_and_si512(srcmm0, parseMask);
+
+          _mm512_storeu_si512(simdPtr, srcmm0);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack5(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 5;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable5u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable5u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable5u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable5u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable5u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffle so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack6(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 6;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable6u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable6u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable6u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable6u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable6u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi32(permutexIdx, srcmm);
+
+          // shuffling so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting each element so that it starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack7(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 7;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable7u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable7u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable7u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable7u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable7u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffling so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting each element so that it starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack9(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 9;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable9u_0);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable9u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable9u_1);
+
+        __m512i shiftMaskPtr[3];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable9u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable9u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable9u_2);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable9u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr);
+
+          // shifting each element so that it starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[2]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          // reversing the bit order inside every byte via a per-nibble table lookup
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting each element so that it starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
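+          // left-aligning each value in its 16-bit lane (16 - bitWidth = 7) before the bits
+          // are reversed again
+          zmm[0] = _mm512_slli_epi16(zmm[0], 7);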
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack10(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 10;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable10u_0);
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable10u);
+        __m512i shiftMask = _mm512_loadu_si512(shiftTable10u);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm;
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          zmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+          zmm = _mm512_shuffle_epi8(zmm, shuffleIdxPtr);
+
+          // shifting each element so that it starts at the beginning of its word
+          zmm = _mm512_srlv_epi16(zmm, shiftMask);
+          zmm = _mm512_and_si512(zmm, parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack11(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 11;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverse_mask_16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable11u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable11u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable11u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable11u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable11u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable11u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable11u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable11u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable11u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffling so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting each element so that it starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          // reversing the bit order inside every byte via a per-nibble table lookup
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting each element so that it starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
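+          // left-aligning each value in its 16-bit lane (16 - bitWidth = 5) before the bits
+          // are reversed again
+          zmm[0] = _mm512_slli_epi16(zmm[0], 5);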
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverse_mask_16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack12(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 12;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable12u_0);
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable12u);
+        __m512i shiftMask = _mm512_loadu_si512(shiftTable12u);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm;
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          zmm = _mm512_permutexvar_epi32(permutexIdx, srcmm);
+          zmm = _mm512_shuffle_epi8(zmm, shuffleIdxPtr);
+
+          // shifting each element so that it starts at the beginning of its word
+          zmm = _mm512_srlv_epi16(zmm, shiftMask);
+          zmm = _mm512_and_si512(zmm, parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack13(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 13;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverse_mask_16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable13u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable13u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable13u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable13u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable13u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable13u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable13u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable13u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable13u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffling so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting each element so that it starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          // reversing the bit order inside every byte via a per-nibble table lookup
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting each element so that it starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
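+          // left-aligning each value in its 16-bit lane (16 - bitWidth = 3) before the bits
+          // are reversed again
+          zmm[0] = _mm512_slli_epi16(zmm[0], 3);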
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverse_mask_16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack14(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 14;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable14u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable14u_1);
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable14u);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable14u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable14u_1);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffling so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting each element so that it starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack15(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 15;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable15u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable15u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable15u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable15u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable15u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable15u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable15u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable15u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable15u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // Shuffle so that zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // Shift the elements so that each one starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // Merge the even- and odd-indexed elements into a single vector
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          // Reverse the bit order within each byte using the nibble lookup table
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // Permute so that zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // Shift the elements so that each one starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // Merge the even- and odd-indexed elements into a single vector
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
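+          // Restore the original bit order of each value: left-align it in its 16-bit lane,
+          // reverse the bits within each byte, then swap the two bytes of the lane
+          zmm[0] = _mm512_slli_epi16(zmm[0], 1);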
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack16(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 16;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = len;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    int64_t* dstPtr = data + offset;
+    bool resetBuf = false;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      bufMoveByteLen += moveByteLen(len * bitWidth);
+
+      if (bufMoveByteLen <= bufRestByteLen) {
+        numElements = len;
+        resetBuf = false;
+      } else {
+        numElements = bufRestByteLen * ORC_VECTOR_BYTE_WIDTH / bitWidth;
+        len -= numElements;
+        tailBitLen = (bufRestByteLen * ORC_VECTOR_BYTE_WIDTH) % bitWidth;
+        resetBuf = true;
+      }
+
+      if (tailBitLen != 0) {
+        backupByteLen = tailBitLen / ORC_VECTOR_BYTE_WIDTH;
+        tailBitLen = 0;
+      }
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __m512i reverseMask16u = _mm512_loadu_si512(reverseMaskTable16u);
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm = _mm512_loadu_si512(srcPtr);
+          // Swap the two bytes of each 16-bit value (big-endian input to native order)
+          srcmm = _mm512_shuffle_epi8(srcmm, reverseMask16u);
+          _mm512_storeu_si512(simdPtr, srcmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      if (numElements > 0) {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth);
+        unpackDefault.unrolledUnpack16(dstPtr, 0, numElements);
+        srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+        dstPtr += numElements;
+        bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+      }
+
+      if (bufMoveByteLen <= bufRestByteLen) {
+        decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufMoveByteLen,
+                                  resetBuf, backupByteLen);
+        return;
+      }
+
+      if (backupByteLen != 0) {
+        decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                  resetBuf, backupByteLen);
+        unpackDefault.unrolledUnpack16(dstPtr, 0, 1);
+        dstPtr++;
+        backupByteLen = 0;
+        len--;
+      } else {
+        decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                  resetBuf, backupByteLen);
+      }
+
+      bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+      bufMoveByteLen = 0;
+      srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack17(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 17;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_32Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(bitWidth);
+        __m512i parseMask0 = _mm512_set1_epi32(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask32u = _mm512_loadu_si512(reverseMaskTable32u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable17u_0);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable17u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable17u_1);
+
+        __m512i shiftMaskPtr[3];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable17u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable17u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable17u_2);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable17u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr);
+
+          // Shift the elements so that each one starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+
+        if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          // Reverse the bit order within each byte using the nibble lookup table
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // Permute so that zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_permutexvar_epi32(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi32(permutexIdxPtr[1], srcmm);
+
+          // Shift the elements so that each one starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi64(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi64(zmm[1], shiftMaskPtr[1]);
+
+          // Merge the even- and odd-indexed elements into a single vector
+          zmm[0] = _mm512_mask_mov_epi32(zmm[0], 0xAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
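+          // Restore the original bit order of each value: left-align it in its 32-bit lane,
+          // reverse the bits within each byte, then reverse the byte order of the lane
+          zmm[0] = _mm512_slli_epi32(zmm[0], 15);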
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask32u);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack18(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 18;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_32Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+        __mmask16 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_DWORD(bitWidth * 16));
+        __m512i parseMask0 = _mm512_set1_epi32(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask32u = _mm512_loadu_si512(reverseMaskTable32u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable18u_0);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable18u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable18u_1);
+
+        __m512i shiftMaskPtr[3];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable18u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable18u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable18u_2);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable18u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr);
+
+          // Shift the elements so that each one starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+
+        if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi32(readMask, srcPtr);
+
+          // Reverse the bit order within each byte using the nibble lookup table
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // Permute so that zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_permutexvar_epi32(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi32(permutexIdxPtr[1], srcmm);
+
+          // Shift the elements so that each one starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi64(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi64(zmm[1], shiftMaskPtr[1]);
+
+          // Merge the even- and odd-indexed elements into a single vector
+          zmm[0] = _mm512_mask_mov_epi32(zmm[0], 0xAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
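+          // Restore the original bit order of each value: left-align it in its 32-bit lane,
+          // reverse the bits within each byte, then reverse the byte order of the lane
+          zmm[0] = _mm512_slli_epi32(zmm[0], 14);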
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask32u);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack19(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 19;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_32Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(bitWidth);
+        __m512i parseMask0 = _mm512_set1_epi32(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask32u = _mm512_loadu_si512(reverseMaskTable32u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable19u_0);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable19u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable19u_1);
+
+        __m512i shiftMaskPtr[3];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable19u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable19u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable19u_2);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable19u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr);
+
+          // Shift the elements so that each one starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+
+        if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          // Reverse the bit order within each byte using the nibble lookup table
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // Permute so that zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_permutexvar_epi32(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi32(permutexIdxPtr[1], srcmm);
+
+          // Shift the elements so that each one starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi64(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi64(zmm[1], shiftMaskPtr[1]);
+
+          // Merge the even- and odd-indexed elements into a single vector
+          zmm[0] = _mm512_mask_mov_epi32(zmm[0], 0xAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
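+          // Restore the original bit order of each value: left-align it in its 32-bit lane,
+          // reverse the bits within each byte, then reverse the byte order of the lane
+          zmm[0] = _mm512_slli_epi32(zmm[0], 13);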
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask32u);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack20(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 20;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_32Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+        __mmask16 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_DWORD(bitWidth * 16));
+        __m512i parseMask0 = _mm512_set1_epi32(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable20u_0);
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable20u);
+        __m512i shiftMask = _mm512_loadu_si512(shiftTable20u);
+
+        while (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm;
+
+          srcmm = _mm512_maskz_loadu_epi32(readMask, srcPtr);
+
+          zmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+          zmm = _mm512_shuffle_epi8(zmm, shuffleIdxPtr);
+
+          // Shift the elements so that each one starts at the beginning of its word
+          zmm = _mm512_srlv_epi32(zmm, shiftMask);
+          zmm = _mm512_and_si512(zmm, parseMask0);
+
+          _mm512_storeu_si512(vectorBuf, zmm);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack21(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 21;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_32Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(bitWidth);
+        __m512i parseMask0 = _mm512_set1_epi32(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask32u = _mm512_loadu_si512(reverseMaskTable32u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable21u_0);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable21u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable21u_1);
+
+        __m512i shiftMaskPtr[3];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable21u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable21u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable21u_2);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable21u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr);
+
+          // Shift the elements so that each one starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+
+        if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          // Reverse the bit order within each byte using the nibble lookup table
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // Permute so that zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_permutexvar_epi32(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi32(permutexIdxPtr[1], srcmm);
+
+          // Shift the elements so that each one starts at the beginning of its word
+          zmm[0] = _mm512_srlv_epi64(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi64(zmm[1], shiftMaskPtr[1]);
+
+          // Merge the even- and odd-indexed elements into a single vector
+          zmm[0] = _mm512_mask_mov_epi32(zmm[0], 0xAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
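+          // Restore the original bit order of each value: left-align it in its 32-bit lane,
+          // reverse the bits within each byte, then reverse the byte order of the lane
+          zmm[0] = _mm512_slli_epi32(zmm[0], 11);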
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask32u);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack22(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 22;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_32Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+        __mmask16 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_DWORD(bitWidth * 16));
+        __m512i parseMask0 = _mm512_set1_epi32(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask32u = _mm512_loadu_si512(reverseMaskTable32u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable22u_0);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable22u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable22u_1);
+
+        __m512i shiftMaskPtr[3];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable22u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable22u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable22u_2);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable22u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+
+        if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi32(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi32(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi32(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi64(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi64(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi32(zmm[0], 0xAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi32(zmm[0], 10);
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask32u);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack23(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 23;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_32Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(bitWidth);
+        __m512i parseMask0 = _mm512_set1_epi32(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask32u = _mm512_loadu_si512(reverseMaskTable32u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable23u_0);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable23u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable23u_1);
+
+        __m512i shiftMaskPtr[3];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable23u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable23u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable23u_2);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable23u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+
+        if (numElements >= VECTOR_UNPACK_32BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permuting so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_permutexvar_epi32(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi32(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi64(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi64(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi32(zmm[0], 0xAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi32(zmm[0], 9);
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask32u);
+
+          _mm512_storeu_si512(vectorBuf, zmm[0]);
+
+          srcPtr += 2 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 2 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 2 * bitWidth;
+          numElements -= VECTOR_UNPACK_32BIT_MAX_NUM;
+          std::copy(vectorBuf, vectorBuf + VECTOR_UNPACK_32BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_32BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack24(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 24;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      bufMoveByteLen += moveByteLen(len * bitWidth);
+
+      if (bufMoveByteLen <= bufRestByteLen) {
+        numElements = len;
+        resetBuf = false;

Review Comment:
   Fixed. Thank you very much for the reminder.
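
For readers following the `vectorUnpack22`/`vectorUnpack23` hunks above, a scalar reference can make the intended result easier to check. The sketch below is not code from the PR: `scalarUnpack` is a hypothetical helper, and it assumes the packed values are stored most-significant bit first, as in ORC's bit-packed runs. If `VECTOR_UNPACK_32BIT_MAX_NUM` is 16 (an assumption based on sixteen 32-bit lanes per 512-bit register), one vector step consumes 16 * bitWidth bits, i.e. `2 * bitWidth` bytes, which matches the `srcPtr += 2 * bitWidth` advance in the diff.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the PR: unpack `count` values of
// `bitWidth` bits each (1..32) from an MSB-first (big-endian) bit stream.
std::vector<int64_t> scalarUnpack(const uint8_t* src, size_t count, uint32_t bitWidth) {
  std::vector<int64_t> out;
  out.reserve(count);
  uint64_t bitPos = 0;  // absolute bit offset into src
  for (size_t i = 0; i < count; ++i) {
    uint64_t value = 0;
    for (uint32_t b = 0; b < bitWidth; ++b, ++bitPos) {
      // Bit 7 of each byte is the first bit of that byte in the stream.
      uint32_t bit = (src[bitPos / 8] >> (7 - bitPos % 8)) & 1u;
      value = (value << 1) | bit;
    }
    out.push_back(static_cast<int64_t>(value));
  }
  return out;
}
```

A unit test could compare the output of a vector path against this kind of scalar loop for every supported bit width.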




[GitHub] [orc] wgtmac commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1092780103


##########
CMakeLists.txt:
##########
@@ -67,6 +67,10 @@ option(BUILD_CPP_ENABLE_METRICS
     "Enable the metrics collection at compile phase"
     OFF)
 
+option(BUILD_ENABLE_AVX512

Review Comment:
   ```suggestion
   option(BUILD_CPP_AVX512
   ```



##########
CMakeLists.txt:
##########
@@ -67,6 +67,10 @@ option(BUILD_CPP_ENABLE_METRICS
     "Enable the metrics collection at compile phase"
     OFF)
 
+option(BUILD_ENABLE_AVX512
+    "Enable AVX512 vector decode of bit-packing"

Review Comment:
   ```suggestion
       "Enable build with AVX512 at compile time"
   ```



##########
CMakeLists.txt:
##########
@@ -157,6 +172,139 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "powerpc|ppc")
+    set(ORC_CPU_FLAG "ppc")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "s390x")
+    set(ORC_CPU_FLAG "s390x")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "riscv64")
+    set(ORC_CPU_FLAG "riscv64")
+  else()
+    message(FATAL_ERROR "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_SSE4_2_FLAG "")
+    set(ORC_AVX2_FLAG "/arch:AVX2")
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+    set(CXX_SUPPORTS_SSE4_2 TRUE)
+  else()
+    set(ORC_SSE4_2_FLAG "-msse4.2")
+    set(ORC_AVX2_FLAG "-march=haswell")
+    # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
+    set(ORC_AVX512_FLAG "-march=native -mbmi2")
+    # Append the avx2/avx512 subset option also, fix issue ORC-9877 for homebrew-cpp

Review Comment:
   What does `fix issue ORC-9877 for homebrew-cpp` mean?



##########
CMakeLists.txt:
##########
@@ -157,6 +172,139 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)

Review Comment:
   Are we supposed to support ppc, s390x and riscv64? The CI checks do not cover these architectures so we are unable to verify and maintain them. 
   
   cc @dongjoon-hyun 



##########
CMakeLists.txt:
##########
@@ -157,6 +172,139 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)

Review Comment:
   The architecture detection logic below is worth moving into a separate file under the `cmake_modules` directory and including from here.



##########
CMakeLists.txt:
##########
@@ -157,6 +172,139 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "powerpc|ppc")
+    set(ORC_CPU_FLAG "ppc")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "s390x")
+    set(ORC_CPU_FLAG "s390x")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "riscv64")
+    set(ORC_CPU_FLAG "riscv64")
+  else()
+    message(FATAL_ERROR "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_SSE4_2_FLAG "")

Review Comment:
   If this patch targets AVX512 only, we can drop SSE4 and AVX2 for now, so flags like `ORC_AVX2_FLAG`, `CXX_SUPPORTS_SSE4_2`, and `CXX_SUPPORTS_AVX2` can be removed.



##########
c++/src/DetectPlatform.hh:
##########
@@ -0,0 +1,88 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_DETECTPLATFORM_HH
+#define ORC_DETECTPLATFORM_HH
+
+#if defined(__GNUC__) || defined(__clang__)
+DIAGNOSTIC_IGNORE("-Wold-style-cast")
+#endif
+
+namespace orc {
+#ifdef _WIN32
+
+#include "intrin.h"
+//  Windows CPUID
+#define cpuid(info, x) __cpuidex(info, x, 0)
+#else
+//  GCC Intrinsics
+#include <cpuid.h>
+#include <dlfcn.h>
+
+  void cpuid(int info[4], int InfoType) {
+    __cpuid_count(InfoType, 0, info[0], info[1], info[2], info[3]);
+  }
+
+  unsigned long long xgetbv(unsigned int index) {
+    unsigned int eax, edx;
+    __asm__ __volatile__("xgetbv;" : "=a"(eax), "=d"(edx) : "c"(index));
+    return ((unsigned long long)edx << 32) | eax;
+  }
+
+#endif
+
+#define CPUID_AVX512F 0x00100000
+#define CPUID_AVX512CD 0x00200000
+#define CPUID_AVX512VL 0x04000000
+#define CPUID_AVX512BW 0x01000000
+#define CPUID_AVX512DQ 0x02000000
+#define EXC_OSXSAVE 0x08000000  // 27th  bit
+
+#define CPUID_AVX512_MASK \
+  (CPUID_AVX512F | CPUID_AVX512CD | CPUID_AVX512VL | CPUID_AVX512BW | CPUID_AVX512DQ)
+
+  enum class Arch { PX_ARCH = 0, AVX2_ARCH = 1, AVX512_ARCH = 2 };
+
+  Arch detectPlatform() {
+    Arch detectedPlatform = Arch::PX_ARCH;
+    int cpuInfo[4];
+    cpuid(cpuInfo, 1);
+
+    bool avx512SupportCpu = cpuInfo[1] & CPUID_AVX512_MASK;

Review Comment:
   ```suggestion
       bool avx512Supported = cpuInfo[1] & CPUID_AVX512_MASK;
   ```
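
A side note on the quoted line, independent of the rename: converting `cpuInfo[1] & CPUID_AVX512_MASK` to `bool` yields true as soon as any single bit of the mask is set. A minimal sketch of an all-bits check follows. The names are hypothetical and not from the PR, and the bit positions are an assumption taken from Intel's documentation of CPUID leaf 7, sub-leaf 0, register EBX. They differ from the constants in the diff, which also queries leaf 1.

```cpp
#include <cstdint>

// Assumed bit positions in CPUID.(EAX=7,ECX=0):EBX, per Intel's documentation.
constexpr uint32_t kAvx512F = 1u << 16;
constexpr uint32_t kAvx512DQ = 1u << 17;
constexpr uint32_t kAvx512CD = 1u << 28;
constexpr uint32_t kAvx512BW = 1u << 30;
constexpr uint32_t kAvx512VL = 1u << 31;

constexpr uint32_t kAvx512Mask = kAvx512F | kAvx512DQ | kAvx512CD | kAvx512BW | kAvx512VL;

// True only when every feature bit in `mask` is reported in `reg`.
constexpr bool hasAllFeatures(uint32_t reg, uint32_t mask) {
  return (reg & mask) == mask;
}
```

With this shape, a CPU that reports AVX512F but not AVX512BW would not be treated as supporting the whole set.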



##########
c++/src/DetectPlatform.hh:
##########
@@ -0,0 +1,88 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_DETECTPLATFORM_HH
+#define ORC_DETECTPLATFORM_HH
+
+#if defined(__GNUC__) || defined(__clang__)
+DIAGNOSTIC_IGNORE("-Wold-style-cast")
+#endif
+
+namespace orc {
+#ifdef _WIN32
+
+#include "intrin.h"
+//  Windows CPUID
+#define cpuid(info, x) __cpuidex(info, x, 0)
+#else
+//  GCC Intrinsics
+#include <cpuid.h>
+#include <dlfcn.h>
+
+  void cpuid(int info[4], int InfoType) {
+    __cpuid_count(InfoType, 0, info[0], info[1], info[2], info[3]);
+  }
+
+  unsigned long long xgetbv(unsigned int index) {
+    unsigned int eax, edx;
+    __asm__ __volatile__("xgetbv;" : "=a"(eax), "=d"(edx) : "c"(index));
+    return ((unsigned long long)edx << 32) | eax;
+  }
+
+#endif
+
+#define CPUID_AVX512F 0x00100000
+#define CPUID_AVX512CD 0x00200000
+#define CPUID_AVX512VL 0x04000000
+#define CPUID_AVX512BW 0x01000000
+#define CPUID_AVX512DQ 0x02000000
+#define EXC_OSXSAVE 0x08000000  // 27th  bit
+
+#define CPUID_AVX512_MASK \
+  (CPUID_AVX512F | CPUID_AVX512CD | CPUID_AVX512VL | CPUID_AVX512BW | CPUID_AVX512DQ)
+
+  enum class Arch { PX_ARCH = 0, AVX2_ARCH = 1, AVX512_ARCH = 2 };
+
+  Arch detectPlatform() {
+    Arch detectedPlatform = Arch::PX_ARCH;
+    int cpuInfo[4];
+    cpuid(cpuInfo, 1);
+
+    bool avx512SupportCpu = cpuInfo[1] & CPUID_AVX512_MASK;
+    bool osUsesXSaveXStore = cpuInfo[2] & EXC_OSXSAVE;

Review Comment:
   ```suggestion
       bool xsaveSupported = cpuInfo[2] & EXC_OSXSAVE;
   ```



##########
c++/src/DetectPlatform.hh:
##########
@@ -0,0 +1,88 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_DETECTPLATFORM_HH
+#define ORC_DETECTPLATFORM_HH
+
+#if defined(__GNUC__) || defined(__clang__)
+DIAGNOSTIC_IGNORE("-Wold-style-cast")
+#endif
+
+namespace orc {
+#ifdef _WIN32
+
+#include "intrin.h"
+//  Windows CPUID
+#define cpuid(info, x) __cpuidex(info, x, 0)
+#else
+//  GCC Intrinsics
+#include <cpuid.h>
+#include <dlfcn.h>
+
+  void cpuid(int info[4], int InfoType) {
+    __cpuid_count(InfoType, 0, info[0], info[1], info[2], info[3]);
+  }
+
+  unsigned long long xgetbv(unsigned int index) {
+    unsigned int eax, edx;
+    __asm__ __volatile__("xgetbv;" : "=a"(eax), "=d"(edx) : "c"(index));
+    return ((unsigned long long)edx << 32) | eax;
+  }
+
+#endif
+
+#define CPUID_AVX512F 0x00100000
+#define CPUID_AVX512CD 0x00200000
+#define CPUID_AVX512VL 0x04000000
+#define CPUID_AVX512BW 0x01000000
+#define CPUID_AVX512DQ 0x02000000
+#define EXC_OSXSAVE 0x08000000  // 27th  bit
+
+#define CPUID_AVX512_MASK \
+  (CPUID_AVX512F | CPUID_AVX512CD | CPUID_AVX512VL | CPUID_AVX512BW | CPUID_AVX512DQ)
+
+  enum class Arch { PX_ARCH = 0, AVX2_ARCH = 1, AVX512_ARCH = 2 };
+
+  Arch detectPlatform() {
+    Arch detectedPlatform = Arch::PX_ARCH;
+    int cpuInfo[4];
+    cpuid(cpuInfo, 1);
+
+    bool avx512SupportCpu = cpuInfo[1] & CPUID_AVX512_MASK;
+    bool osUsesXSaveXStore = cpuInfo[2] & EXC_OSXSAVE;
+
+    if (avx512SupportCpu && osUsesXSaveXStore) {
+      // Check if XMM state and YMM state are saved
+#ifdef _WIN32
+      unsigned long long xcrFeatureMask = _xgetbv(0); /* min VS2010 SP1 compiler is required */

Review Comment:
   What does `xcr` mean?
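
For context on the question: XCR most likely stands for "extended control register". `XGETBV` with index 0 reads XCR0, whose bits record which register state the OS saves and restores on a context switch. The sketch below shows how such a feature mask could be interpreted. The constant and function names are hypothetical, and the bit positions are those Intel documents for XCR0, not values taken from the PR.

```cpp
#include <cstdint>

// XCR0 state-component bits (Intel SDM); a set bit means the OS manages that state.
constexpr uint64_t kXcr0Sse = 1ull << 1;       // XMM registers
constexpr uint64_t kXcr0Avx = 1ull << 2;       // upper halves of YMM registers
constexpr uint64_t kXcr0Opmask = 1ull << 5;    // mask registers k0-k7
constexpr uint64_t kXcr0ZmmHi256 = 1ull << 6;  // upper halves of ZMM0-ZMM15
constexpr uint64_t kXcr0Hi16Zmm = 1ull << 7;   // ZMM16-ZMM31

constexpr uint64_t kXcr0Avx512State =
    kXcr0Sse | kXcr0Avx | kXcr0Opmask | kXcr0ZmmHi256 | kXcr0Hi16Zmm;

// True when the OS saves all register state that AVX-512 code relies on.
constexpr bool osSavesAvx512State(uint64_t xcr0) {
  return (xcr0 & kXcr0Avx512State) == kXcr0Avx512State;
}
```

The value passed in would come from `_xgetbv(0)` on Windows or the inline-assembly `xgetbv(0)` wrapper shown in the diff.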



##########
c++/src/DetectPlatform.hh:
##########
@@ -0,0 +1,88 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_DETECTPLATFORM_HH
+#define ORC_DETECTPLATFORM_HH
+
+#if defined(__GNUC__) || defined(__clang__)
+DIAGNOSTIC_IGNORE("-Wold-style-cast")
+#endif
+
+namespace orc {
+#ifdef _WIN32
+
+#include "intrin.h"
+//  Windows CPUID
+#define cpuid(info, x) __cpuidex(info, x, 0)
+#else
+//  GCC Intrinsics
+#include <cpuid.h>
+#include <dlfcn.h>
+
+  void cpuid(int info[4], int InfoType) {
+    __cpuid_count(InfoType, 0, info[0], info[1], info[2], info[3]);
+  }
+
+  unsigned long long xgetbv(unsigned int index) {
+    unsigned int eax, edx;
+    __asm__ __volatile__("xgetbv;" : "=a"(eax), "=d"(edx) : "c"(index));
+    return ((unsigned long long)edx << 32) | eax;
+  }
+
+#endif
+
+#define CPUID_AVX512F 0x00100000
+#define CPUID_AVX512CD 0x00200000
+#define CPUID_AVX512VL 0x04000000
+#define CPUID_AVX512BW 0x01000000
+#define CPUID_AVX512DQ 0x02000000
+#define EXC_OSXSAVE 0x08000000  // 27th  bit
+
+#define CPUID_AVX512_MASK \
+  (CPUID_AVX512F | CPUID_AVX512CD | CPUID_AVX512VL | CPUID_AVX512BW | CPUID_AVX512DQ)
+
+  enum class Arch { PX_ARCH = 0, AVX2_ARCH = 1, AVX512_ARCH = 2 };
+
+  Arch detectPlatform() {

Review Comment:
   Should we rename the function and the file to `detect architecture`?



##########
c++/src/DetectPlatform.hh:
##########
@@ -0,0 +1,88 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_DETECTPLATFORM_HH
+#define ORC_DETECTPLATFORM_HH
+
+#if defined(__GNUC__) || defined(__clang__)
+DIAGNOSTIC_IGNORE("-Wold-style-cast")
+#endif
+
+namespace orc {
+#ifdef _WIN32
+
+#include "intrin.h"
+//  Windows CPUID
+#define cpuid(info, x) __cpuidex(info, x, 0)
+#else
+//  GCC Intrinsics
+#include <cpuid.h>
+#include <dlfcn.h>
+
+  void cpuid(int info[4], int InfoType) {
+    __cpuid_count(InfoType, 0, info[0], info[1], info[2], info[3]);
+  }
+
+  unsigned long long xgetbv(unsigned int index) {
+    unsigned int eax, edx;
+    __asm__ __volatile__("xgetbv;" : "=a"(eax), "=d"(edx) : "c"(index));
+    return ((unsigned long long)edx << 32) | eax;
+  }
+
+#endif
+
+#define CPUID_AVX512F 0x00100000
+#define CPUID_AVX512CD 0x00200000
+#define CPUID_AVX512VL 0x04000000
+#define CPUID_AVX512BW 0x01000000
+#define CPUID_AVX512DQ 0x02000000
+#define EXC_OSXSAVE 0x08000000  // 27th  bit

Review Comment:
   Could you give this a more meaningful name or add a one-line comment?
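
One possible shape for this, as a sketch only: the names below are hypothetical, and the comment reflects Intel's description of the OSXSAVE flag (CPUID leaf 1, register ECX, bit 27), which is the same value as `EXC_OSXSAVE` in the diff.

```cpp
#include <cstdint>

// CPUID.(EAX=1):ECX bit 27 (OSXSAVE): the OS has enabled XSAVE/XRSTOR,
// so XGETBV may be executed to read XCR0.
constexpr uint32_t kCpuid1EcxOsxsave = 1u << 27;

// True when the ECX value returned by CPUID leaf 1 reports OSXSAVE.
constexpr bool osEnabledXsave(uint32_t cpuid1Ecx) {
  return (cpuid1Ecx & kCpuid1EcxOsxsave) != 0;
}
```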



##########
c++/src/DetectPlatform.hh:
##########
@@ -0,0 +1,88 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_DETECTPLATFORM_HH
+#define ORC_DETECTPLATFORM_HH
+
+#if defined(__GNUC__) || defined(__clang__)
+DIAGNOSTIC_IGNORE("-Wold-style-cast")
+#endif
+
+namespace orc {
+#ifdef _WIN32
+
+#include "intrin.h"
+//  Windows CPUID
+#define cpuid(info, x) __cpuidex(info, x, 0)
+#else
+//  GCC Intrinsics
+#include <cpuid.h>
+#include <dlfcn.h>
+
+  void cpuid(int info[4], int InfoType) {
+    __cpuid_count(InfoType, 0, info[0], info[1], info[2], info[3]);
+  }
+
+  unsigned long long xgetbv(unsigned int index) {
+    unsigned int eax, edx;
+    __asm__ __volatile__("xgetbv;" : "=a"(eax), "=d"(edx) : "c"(index));
+    return ((unsigned long long)edx << 32) | eax;
+  }
+
+#endif
+
+#define CPUID_AVX512F 0x00100000
+#define CPUID_AVX512CD 0x00200000
+#define CPUID_AVX512VL 0x04000000
+#define CPUID_AVX512BW 0x01000000
+#define CPUID_AVX512DQ 0x02000000
+#define EXC_OSXSAVE 0x08000000  // 27th  bit
+
+#define CPUID_AVX512_MASK \
+  (CPUID_AVX512F | CPUID_AVX512CD | CPUID_AVX512VL | CPUID_AVX512BW | CPUID_AVX512DQ)
+
+  enum class Arch { PX_ARCH = 0, AVX2_ARCH = 1, AVX512_ARCH = 2 };
+
+  Arch detectPlatform() {
+    Arch detectedPlatform = Arch::PX_ARCH;

Review Comment:
   ```suggestion
       Arch arch = Arch::PX_ARCH;
   ```



##########
c++/src/RLEv2.hh:
##########
@@ -230,6 +265,14 @@ namespace orc {
     uint32_t curByte;                   // Used by anything that uses readLongs
     DataBuffer<int64_t> unpackedPatch;  // Used by PATCHED_BASE
     DataBuffer<int64_t> literals;       // Values of the current run
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+    uint8_t

Review Comment:
   How about moving the comments above each variable definition? That would be more readable.



##########
c++/src/DetectPlatform.hh:
##########
@@ -0,0 +1,88 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_DETECTPLATFORM_HH
+#define ORC_DETECTPLATFORM_HH
+
+#if defined(__GNUC__) || defined(__clang__)
+DIAGNOSTIC_IGNORE("-Wold-style-cast")
+#endif
+
+namespace orc {
+#ifdef _WIN32

Review Comment:
   Platform-dependent functions like `cpuid` can be defined in the file `Adaptor.hh.in`.



##########
c++/src/RLEv2.hh:
##########
@@ -189,13 +192,45 @@ namespace orc {
       resetReadLongs();
     }
 
+    void resetBufferStart(uint64_t len, bool resetBuf, uint32_t backupLen);
     unsigned char readByte();
 
     int64_t readLongBE(uint64_t bsz);
     int64_t readVslong();
     uint64_t readVulong();
     void readLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs);
-    void plainUnpackLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs);
+    void plainUnpackLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs,
+                          uint64_t& startBit);
+
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+    void unrolledUnpackVector1(int64_t* data, uint64_t offset, uint64_t len);

Review Comment:
   Rename to `vectorUnpackX`?



##########
c++/src/RleDecoderV2.cc:
##########
@@ -18,11 +18,35 @@
 
 #include "Adaptor.hh"
 #include "Compression.hh"
+#include "DetectPlatform.hh"
 #include "RLEV2Util.hh"
 #include "RLEv2.hh"
 #include "Utils.hh"
+#include "VectorDecoder.hh"
 
 namespace orc {
+  void RleDecoderV2::resetBufferStart(uint64_t len, bool resetBuf, uint32_t backupByteLen) {
+    uint64_t restLen = bufferEnd - bufferStart;

Review Comment:
   ```suggestion
       uint64_t remainingLen = bufferEnd - bufferStart;
   ```



##########
CMakeLists.txt:
##########
@@ -157,6 +172,139 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "powerpc|ppc")
+    set(ORC_CPU_FLAG "ppc")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "s390x")
+    set(ORC_CPU_FLAG "s390x")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "riscv64")
+    set(ORC_CPU_FLAG "riscv64")
+  else()
+    message(FATAL_ERROR "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_SSE4_2_FLAG "")
+    set(ORC_AVX2_FLAG "/arch:AVX2")
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+    set(CXX_SUPPORTS_SSE4_2 TRUE)
+  else()
+    set(ORC_SSE4_2_FLAG "-msse4.2")
+    set(ORC_AVX2_FLAG "-march=haswell")
+    # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
+    set(ORC_AVX512_FLAG "-march=native -mbmi2")
+    # Append the avx2/avx512 subset option also, fix issue ORC-9877 for homebrew-cpp
+    set(ORC_AVX2_FLAG "${ORC_AVX2_FLAG} -mavx2")
+    set(ORC_AVX512_FLAG
+        "${ORC_AVX512_FLAG} -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mavx512vbmi")
+  endif()
+  check_cxx_compiler_flag(${ORC_AVX512_FLAG} CXX_SUPPORTS_AVX512)
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")
+    check_cxx_source_compiles("
+      #ifdef _MSC_VER
+      #include <intrin.h>
+      #else
+      #include <immintrin.h>
+      #endif
+
+      int main() {
+        __m512i mask = _mm512_set1_epi32(0x1);
+        char out[32];
+        _mm512_storeu_si512(out, mask);
+        return 0;
+      }"
+      CXX_SUPPORTS_AVX512)
+    set(CMAKE_REQUIRED_FLAGS ${OLD_CMAKE_REQURED_FLAGS})
+  endif()
+  # Runtime SIMD level it can get from compiler and ORC_RUNTIME_SIMD_LEVEL
+  if(CXX_SUPPORTS_SSE4_2 AND ORC_RUNTIME_SIMD_LEVEL MATCHES
+                             "^(SSE4_2|AVX2|AVX512|MAX)$")
+    set(ORC_HAVE_RUNTIME_SSE4_2 ON)
+    set(ORC_SIMD_LEVEL "SSE4_2")
+    add_definitions(-DORC_HAVE_RUNTIME_SSE4_2)
+  endif()
+  if(CXX_SUPPORTS_AVX2 AND ORC_RUNTIME_SIMD_LEVEL MATCHES "^(AVX2|AVX512|MAX)$")
+    set(ORC_HAVE_RUNTIME_AVX2 ON)
+    set(ORC_SIMD_LEVEL "AVX2")
+    add_definitions(-DORC_HAVE_RUNTIME_AVX2 -DORC_HAVE_RUNTIME_BMI2)
+  endif()
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512 AND ORC_RUNTIME_SIMD_LEVEL MATCHES "^(AVX512|MAX)$")
+    message(STATUS "Enable the AVX512 vector decode of bit-packing")
+    set(ORC_HAVE_RUNTIME_AVX512 ON)
+    set(ORC_SIMD_LEVEL "AVX512")
+    add_definitions(-DORC_HAVE_RUNTIME_AVX512 -DORC_HAVE_RUNTIME_BMI2)
+  else ()
+    message(STATUS "Disable the AVX512 vector decode of bit-packing")
+  endif()
+  if(ORC_SIMD_LEVEL STREQUAL "DEFAULT")
+    set(ORC_SIMD_LEVEL "NONE")
+  endif()
+elseif(ORC_CPU_FLAG STREQUAL "ppc")
+  # power compiler flags, gcc/clang only
+  set(ORC_ALTIVEC_FLAG "-maltivec")
+  check_cxx_compiler_flag(${ORC_ALTIVEC_FLAG} CXX_SUPPORTS_ALTIVEC)
+  if(ORC_SIMD_LEVEL STREQUAL "DEFAULT")
+    set(ORC_SIMD_LEVEL "NONE")
+  endif()
+elseif(ORC_CPU_FLAG STREQUAL "aarch64")
+  # Arm64 compiler flags, gcc/clang only
+  set(ORC_ARMV8_MARCH "armv8-a")
+  check_cxx_compiler_flag("-march=${ORC_ARMV8_MARCH}+sve" CXX_SUPPORTS_SVE)
+  if(ORC_SIMD_LEVEL STREQUAL "DEFAULT")
+    set(ORC_SIMD_LEVEL "NEON")
+  endif()
+endif()
+
+# Only enable additional instruction sets if they are supported
+if(ORC_CPU_FLAG STREQUAL "x86")
+  if(MINGW)
+    # Enable _xgetbv() intrinsic to query OS support for ZMM register saves
+    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -mxsave")
+  endif()
+  if(ORC_SIMD_LEVEL STREQUAL "AVX512")
+    if(NOT CXX_SUPPORTS_AVX512)
+      message(FATAL_ERROR "AVX512 required but compiler doesn't support it.")
+    endif()
+    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${ORC_AVX512_FLAG}")
+    add_definitions(-DORC_HAVE_AVX512 -DORC_HAVE_AVX2 -DORC_HAVE_BMI2
+                    -DORC_HAVE_SSE4_2)
+  elseif(ORC_SIMD_LEVEL STREQUAL "AVX2")

Review Comment:
   We can remove levels other than `AVX512` for now to make it simpler.



##########
c++/src/RleDecoderV2.cc:
##########
@@ -67,6 +91,149 @@ namespace orc {
   }
 
   void RleDecoderV2::readLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs) {
+    uint64_t startBit = 0;
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+    const auto runtimeEnable = getenv("ENABLE_RUNTIME_AVX512");
+    std::string avxRuntimeEnable = runtimeEnable == nullptr ? "OFF" : std::string(runtimeEnable);
+    if (detectPlatform() == Arch::AVX512_ARCH && strcasecmp(avxRuntimeEnable.c_str(), "on") == 0) {
+      switch (fbs) {

Review Comment:
   We can move lines 99 through 234 into a separate function named something like `readLongsAvx512`.



##########
c++/src/VectorDecoder.hh:
##########
@@ -0,0 +1,488 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef VECTOR_DECODER_HH
+#define VECTOR_DECODER_HH
+
+#include <string.h>

Review Comment:
   Move it below line 24?



##########
c++/src/RleDecoderV2.cc:
##########
@@ -67,6 +91,149 @@ namespace orc {
   }
 
   void RleDecoderV2::readLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs) {
+    uint64_t startBit = 0;
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+    const auto runtimeEnable = getenv("ENABLE_RUNTIME_AVX512");

Review Comment:
   Can we add a flag or enum value to class `RleDecoderV2` so it can decide how to dispatch functions at runtime? In this way, we can simply make the decision at the creation time of `RleDecoderV2`. Otherwise the decision is made on every call to `readLongs`.
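
A minimal sketch of what deciding at construction time could look like. Only the `ENABLE_RUNTIME_AVX512` variable comes from the diff above; `UnpackDispatch`, `isOn`, `chooseDispatch` and `DecoderSketch` are hypothetical names for illustration, and the `cpuHasAvx512` argument stands in for whatever runtime CPU check the decoder uses.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical names throughout; this is not the ORC API.
enum class UnpackDispatch { Default, Avx512 };

// Case-insensitive comparison against "on" without the POSIX-only strcasecmp.
inline bool isOn(const char* value) {
  if (value == nullptr) return false;
  std::string s(value);
  for (char& c : s) {
    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  }
  return s == "on";
}

// Pure decision function, so it can be tested without touching the environment.
inline UnpackDispatch chooseDispatch(bool cpuHasAvx512, const char* envValue) {
  return (cpuHasAvx512 && isOn(envValue)) ? UnpackDispatch::Avx512
                                          : UnpackDispatch::Default;
}

class DecoderSketch {
 public:
  // The environment is read once here instead of on every readLongs call.
  explicit DecoderSketch(bool cpuHasAvx512)
      : dispatch_(chooseDispatch(cpuHasAvx512, std::getenv("ENABLE_RUNTIME_AVX512"))) {}

  // Hot-path query: only a comparison on the cached value remains.
  UnpackDispatch dispatch() const { return dispatch_; }

 private:
  const UnpackDispatch dispatch_;
};
```

With this shape, `readLongs` would branch on `dispatch()` and never call `getenv` or compare strings per call.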



##########
c++/test/TestRleVectorDecoder.cc:
##########
@@ -0,0 +1,608 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <inttypes.h>

Review Comment:
   I assume this include file will fail on some platforms (MSVC?)



##########
c++/src/RleDecoderV2.cc:
##########
@@ -97,10 +264,4151 @@ namespace orc {
         return;
       default:
         // Fallback to the default implementation for deprecated bit size.
-        plainUnpackLongs(data, offset, len, fbs);
+        plainUnpackLongs(data, offset, len, fbs, startBit);
+        return;
+    }
+#endif
+  }
+
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+  void RleDecoderV2::unrolledUnpackVector1(int64_t* data, uint64_t offset, uint64_t len) {

Review Comment:
   Do you have the script that generates the vectorized code? It would be great if it is committed alongside for future maintenance.



##########
c++/src/RleDecoderV2.cc:
##########
@@ -67,6 +91,149 @@ namespace orc {
   }
 
   void RleDecoderV2::readLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs) {
+    uint64_t startBit = 0;
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+    const auto runtimeEnable = getenv("ENABLE_RUNTIME_AVX512");
+    std::string avxRuntimeEnable = runtimeEnable == nullptr ? "OFF" : std::string(runtimeEnable);
+    if (detectPlatform() == Arch::AVX512_ARCH && strcasecmp(avxRuntimeEnable.c_str(), "on") == 0) {
+      switch (fbs) {
+        case 1:
+          unrolledUnpackVector1(data, offset, len);
+          return;
+        case 2:
+          unrolledUnpackVector2(data, offset, len);
+          return;
+        case 3:
+          unrolledUnpackVector3(data, offset, len);
+          return;
+        case 4:
+          unrolledUnpackVector4(data, offset, len);
+          return;
+        case 5:
+          unrolledUnpackVector5(data, offset, len);
+          return;
+        case 6:
+          unrolledUnpackVector6(data, offset, len);
+          return;
+        case 7:
+          unrolledUnpackVector7(data, offset, len);
+          return;
+        case 8:
+          unrolledUnpack8(data, offset, len);
+          return;
+        case 9:
+          unrolledUnpackVector9(data, offset, len);
+          return;
+        case 10:
+          unrolledUnpackVector10(data, offset, len);
+          return;
+        case 11:
+          unrolledUnpackVector11(data, offset, len);
+          return;
+        case 12:
+          unrolledUnpackVector12(data, offset, len);
+          return;
+        case 13:
+          unrolledUnpackVector13(data, offset, len);
+          return;
+        case 14:
+          unrolledUnpackVector14(data, offset, len);
+          return;
+        case 15:
+          unrolledUnpackVector15(data, offset, len);
+          return;
+        case 16:
+          unrolledUnpackVector16(data, offset, len);
+          return;
+        case 17:
+          unrolledUnpackVector17(data, offset, len);
+          return;
+        case 18:
+          unrolledUnpackVector18(data, offset, len);
+          return;
+        case 19:
+          unrolledUnpackVector19(data, offset, len);
+          return;
+        case 20:
+          unrolledUnpackVector20(data, offset, len);
+          return;
+        case 21:
+          unrolledUnpackVector21(data, offset, len);
+          return;
+        case 22:
+          unrolledUnpackVector22(data, offset, len);
+          return;
+        case 23:
+          unrolledUnpackVector23(data, offset, len);
+          return;
+        case 24:
+          unrolledUnpackVector24(data, offset, len);
+          return;
+        case 26:
+          unrolledUnpackVector26(data, offset, len);
+          return;
+        case 28:
+          unrolledUnpackVector28(data, offset, len);
+          return;
+        case 30:
+          unrolledUnpackVector30(data, offset, len);
+          return;
+        case 32:
+          unrolledUnpackVector32(data, offset, len);
+          return;
+        case 40:

Review Comment:
   It seems cases here and below are handled by `plainUnpackLongs` already.



##########
CMakeLists.txt:
##########
@@ -67,6 +67,10 @@ option(BUILD_CPP_ENABLE_METRICS
     "Enable the metrics collection at compile phase"
     OFF)
 
+option(BUILD_ENABLE_AVX512
+    "Enable AVX512 vector decode of bit-packing"

Review Comment:
   IMO, here we do not need to say what AVX512 is used for.



##########
c++/test/TestRleVectorDecoder.cc:
##########
@@ -0,0 +1,608 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <inttypes.h>
+
+#include <cstdlib>
+
+#include "MemoryOutputStream.hh"
+#include "RLEv2.hh"
+#include "wrap/gtest-wrapper.h"
+#include "wrap/orc-proto-wrapper.hh"
+
+#ifdef __clang__
+DIAGNOSTIC_IGNORE("-Wmissing-variable-declarations")
+#endif
+
+namespace orc {
+
+  using ::testing::TestWithParam;
+  using ::testing::Values;
+
+  const int DEFAULT_MEM_STREAM_SIZE = 1024 * 1024;  // 1M
+
+  class RleVectorTest : public TestWithParam<bool> {

Review Comment:
   ```suggestion
     class RleV2BitUnpackAvx512Test : public TestWithParam<bool> {
   ```



##########
c++/src/VectorDecoder.hh:
##########
@@ -0,0 +1,488 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef VECTOR_DECODER_HH

Review Comment:
   Is this file specific to AVX512? If so, we had better rename the header file and variable names to mention AVX512 explicitly. For example, rename `VectorDecoder.hh` to `BitUnpackerAvx512.hh`.



##########
c++/src/RLEv2.hh:
##########
@@ -230,6 +265,14 @@ namespace orc {
     uint32_t curByte;                   // Used by anything that uses readLongs
     DataBuffer<int64_t> unpackedPatch;  // Used by PATCHED_BASE
     DataBuffer<int64_t> literals;       // Values of the current run
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+    uint8_t

Review Comment:
   In addition, my concern is that supporting more instruction sets will add more buffers here.



##########
c++/src/RLEv2.hh:
##########
@@ -189,13 +192,45 @@ namespace orc {
       resetReadLongs();
     }
 
+    void resetBufferStart(uint64_t len, bool resetBuf, uint32_t backupLen);
     unsigned char readByte();
 
     int64_t readLongBE(uint64_t bsz);
     int64_t readVslong();
     uint64_t readVulong();
     void readLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs);
-    void plainUnpackLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs);
+    void plainUnpackLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs,
+                          uint64_t& startBit);
+
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+    void unrolledUnpackVector1(int64_t* data, uint64_t offset, uint64_t len);

Review Comment:
   Why not define them in `RleDecoderV2.cc` and delete the declarations here? They are not supposed to be used elsewhere.



##########
c++/src/RleDecoderV2.cc:
##########
@@ -67,6 +91,149 @@ namespace orc {
   }
 
   void RleDecoderV2::readLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs) {
+    uint64_t startBit = 0;
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+    const auto runtimeEnable = getenv("ENABLE_RUNTIME_AVX512");
+    std::string avxRuntimeEnable = runtimeEnable == nullptr ? "OFF" : std::string(runtimeEnable);
+    if (detectPlatform() == Arch::AVX512_ARCH && strcasecmp(avxRuntimeEnable.c_str(), "on") == 0) {
+      switch (fbs) {

Review Comment:
   It would also be better to put these AVX512 functions in a separate header like `VectorDecoder.hh` (I have suggested renaming it in another comment). That way, future implementations can be added much more easily. The functions should probably not rely on internal class variables but take input parameters instead.
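
To illustrate the last point, here is a sketch of a bit unpacker written as a free function whose state is passed in entirely through parameters. It is a scalar stand-in, not the AVX512 code from this PR; the name `unpackBitsScalar` and its signature are hypothetical, and it assumes values are packed most significant bit first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical free function: unpacks `len` values of `bitWidth` bits each from
// `src` (most significant bit first) into `dst`. Nothing is read from a decoder
// object. Returns the number of source bytes consumed.
inline std::size_t unpackBitsScalar(const uint8_t* src, std::size_t srcLen,
                                    uint32_t bitWidth, int64_t* dst, std::size_t len) {
  assert(bitWidth >= 1 && bitWidth <= 64);
  assert((len * bitWidth + 7) / 8 <= srcLen);
  std::size_t bytePos = 0;
  uint32_t bitsLeft = 0;  // unread bits remaining in `cur`
  uint8_t cur = 0;
  for (std::size_t i = 0; i < len; ++i) {
    uint64_t value = 0;
    uint32_t need = bitWidth;
    while (need > 0) {
      if (bitsLeft == 0) {
        cur = src[bytePos++];
        bitsLeft = 8;
      }
      // Take the top `take` unread bits of the current byte (take is at most 8).
      const uint32_t take = need < bitsLeft ? need : bitsLeft;
      const uint32_t shift = bitsLeft - take;
      const uint64_t chunk = (static_cast<uint32_t>(cur) >> shift) & ((1u << take) - 1u);
      value = (value << take) | chunk;
      bitsLeft -= take;
      need -= take;
    }
    dst[i] = static_cast<int64_t>(value);
  }
  return bytePos;
}
```

Because the function touches only its arguments, a vectorized variant with the same signature could be added in its own header and selected by the caller.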



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1092942734


##########
c++/src/RleDecoderV2.cc:
##########
@@ -97,10 +264,4151 @@ namespace orc {
         return;
       default:
         // Fallback to the default implementation for deprecated bit size.
-        plainUnpackLongs(data, offset, len, fbs);
+        plainUnpackLongs(data, offset, len, fbs, startBit);
+        return;
+    }
+#endif
+  }
+
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+  void RleDecoderV2::unrolledUnpackVector1(int64_t* data, uint64_t offset, uint64_t len) {

Review Comment:
   No, we don't have a script that generates this code.




[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1436285298

   > @wpleonardo It seems that the CI check does not provide sufficient error message which makes the debugging painful. Please check out the docker files provided here: https://github.com/apache/orc/tree/main/docker. Hope it helps.
   
   Thank you very much! Let me check it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1133187451


##########
.github/workflows/build_and_test.yml:
##########
@@ -91,6 +91,50 @@ jobs:
         cmake --build . --config Debug
         ctest -C Debug --output-on-failure
 
+  simdUbuntu:
+    name: "SIMD programming using C++ intrinsic functions on ${{ matrix.os }}"
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os:
+          - ubuntu-20.04

Review Comment:
   OK, I will delete ubuntu-20.04




[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1448186916

   Hi @wgtmac, I found that macOS doesn't fully support AVX512.
   For example, converting a uint64_t to double in a build that uses the AVX512 compiler options fails at runtime with an "Illegal instruction" error.
   
   macOS also handles AVX512 differently from Windows and Linux, which causes a second problem.
   The behavior is described in
   https://github.com/apple/darwin-xnu/blob/0a798f6738bc1db01281fc08ae024145e84df927/osfmk/i386/fpu.c#L176
   
   By default, AVX512 is off in a newly created thread, so the CPUID flags indicate that AVX512 is available but the OS support check (XCR0) does not succeed.
   AVX512 can be enabled either by calling thread_set_state() or by executing any AVX512 instruction, which raises a #UD exception that the OS handles.
   
   So I chose to skip AVX512 decoding on macOS.
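
The CPUID-versus-XCR0 distinction described above can be sketched as a pure bit test. The bit positions are the XCR0 state-component bits as documented in the Intel SDM; the helper name `avx512Usable` and its parameters are illustrative rather than ORC code, and how XCR0 is read (for example via `_xgetbv(0)`) is left to the caller.

```cpp
#include <cassert>
#include <cstdint>

// XCR0 state-component bits relevant to AVX-512.
constexpr uint64_t kXcr0Sse = 1ull << 1;       // XMM registers
constexpr uint64_t kXcr0Avx = 1ull << 2;       // upper halves of YMM registers
constexpr uint64_t kXcr0Opmask = 1ull << 5;    // k0-k7 mask registers
constexpr uint64_t kXcr0ZmmHi256 = 1ull << 6;  // upper halves of ZMM0-ZMM15
constexpr uint64_t kXcr0Hi16Zmm = 1ull << 7;   // ZMM16-ZMM31

constexpr uint64_t kAvx512StateMask =
    kXcr0Sse | kXcr0Avx | kXcr0Opmask | kXcr0ZmmHi256 | kXcr0Hi16Zmm;
static_assert(kAvx512StateMask == 0xE6, "unexpected XCR0 mask");

// Hypothetical helper: AVX-512 is usable only when CPUID reports it and the OS
// has enabled every register state it needs in XCR0.
inline bool avx512Usable(bool cpuidReportsAvx512f, uint64_t xcr0) {
  return cpuidReportsAvx512f && (xcr0 & kAvx512StateMask) == kAvx512StateMask;
}
```

An XCR0 value of 0x07 (x87, SSE and AVX state only) corresponds to the case above where CPUID advertises AVX512 but the OS check fails.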



[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1475173785

   The reason is that the Windows test runs on a machine that doesn't support AVX512, and the CMake check check_cxx_source_compiles for "CXX_SUPPORTS_AVX512" doesn't detect that, because it only tests whether the compiler can build AVX512 code.
   As a result, the ORC binary builds successfully, but orc_test fails at runtime because the CPU has no AVX512 flags.
   Below are the CPU flags printed by the Windows CI machine:
   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht rep_good nopl xtopology cpuid pni pclmuldq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt md_clear
   https://github.com/apache/orc/actions/runs/4444354324/jobs/7802486868?pr=1375#step:4:49
   We can see that there are no AVX512 flags. I will add another compiler check, "check_cxx_compiler_flag("-mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw" COMPILER_SUPPORT_AVX512)", to fix this issue.
   I will update this code and address the other comments before Monday night. Thank you very much.
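
As a small illustration of reading a flag list like the one above, the following sketch checks for a flag as a whole token, so that `avx` or `avx2` is not mistaken for `avx512f`. The helper `hasCpuFlag` is hypothetical and is not part of ORC's CPU detection.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: true if `flag` appears as a whole whitespace-separated
// token in `cpuFlags`. Substring matches do not count.
inline bool hasCpuFlag(const std::string& cpuFlags, const std::string& flag) {
  assert(!flag.empty());
  std::istringstream in(cpuFlags);
  std::string token;
  while (in >> token) {
    if (token == flag) return true;
  }
  return false;
}
```

A test fixture could use such a check to skip the AVX512 cases on machines that lack the flag instead of crashing with an illegal-instruction exception.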
   
   > Windows SIMD test is failing: https://github.com/apache/orc/actions/runs/4444354324/jobs/7802486868?pr=1375 @wpleonardo
   > 
   > ```
   > [----------] 54 tests from OrcTest/RleV2BitUnpackAvx512Test
   > [ RUN      ] OrcTest/RleV2BitUnpackAvx512Test.RleV2_basic_vector_decode_1bit/0
   > unknown file: error: SEH exception with code 0xc000001d thrown in the test body.
   > [  FAILED  ] OrcTest/RleV2BitUnpackAvx512Test.RleV2_basic_vector_decode_1bit/0, where GetParam() = true (2 ms)
   > [ RUN      ] OrcTest/RleV2BitUnpackAvx512Test.RleV2_basic_vector_decode_1bit/1
   > unknown file: error: SEH exception with code 0xc000001d thrown in the test body.
   > [  FAILED  ] OrcTest/RleV2BitUnpackAvx512Test.RleV2_basic_vector_decode_1bit/1, where GetParam() = false (1 ms)
   > [ RUN      ] OrcTest/RleV2BitUnpackAvx512Test.RleV2_basic_vector_decode_2bit/0
   > unknown file: error: SEH exception with code 0xc000001d thrown in the test body.
   > [  FAILED  ] OrcTest/RleV2BitUnpackAvx512Test.RleV2_basic_vector_decode_2bit/0, where GetParam() = true (1 ms)
   > [ RUN      ] OrcTest/RleV2BitUnpackAvx512Test.RleV2_basic_vector_decode_2bit/1
   > unknown file: error: SEH exception with code 0xc000001d thrown in the test body.
   > [  FAILED  ] OrcTest/RleV2BitUnpackAvx512Test.RleV2_basic_vector_decode_2bit/1, where GetParam() = false (1 ms)
   > ```
   



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1144371802


##########
c++/src/BpackingAvx512.hh:
##########
@@ -0,0 +1,89 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKINGAVX512_HH
+#define ORC_BPACKINGAVX512_HH
+
+#include <stdlib.h>
+#include <cstdint>
+
+#include "BpackingDefault.hh"
+
+namespace orc {
+
+#define MAX_VECTOR_BUF_8BIT_LENGTH 64
+#define MAX_VECTOR_BUF_16BIT_LENGTH 32
+#define MAX_VECTOR_BUF_32BIT_LENGTH 16
+
+  class UnpackAvx512 {
+   public:
+    UnpackAvx512(RleDecoderV2* dec);

Review Comment:
   Done.
   https://github.com/wpleonardo/orc/blob/f053f9c73bf13fe29aff95cfe4cb71857c57da07/c%2B%2B/src/BpackingAvx512.hh#L33




[GitHub] [orc] wgtmac commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1139736693


##########
c++/src/CpuInfoUtil.hh:
##########
@@ -0,0 +1,109 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_CPUINFOUTIL_HH
+#define ORC_CPUINFOUTIL_HH
+
+#include <cstdint>
+#include <memory>
+#include <string>
+
+namespace orc {
+
+  /**
+   * CpuInfo is an interface to query for cpu information at runtime.  The caller can
+   * ask for the sizes of the caches and what hardware features are supported.
+   * On Linux, this information is pulled from a couple of sys files (/proc/cpuinfo and
+   * /sys/devices)
+   */
+  class CpuInfo {

Review Comment:
   Both use the Apache-2.0 license, so I think it is OK.




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1138650713


##########
c++/src/CMakeLists.txt:
##########
@@ -184,7 +184,11 @@ set(SOURCE_FILES
   Timezone.cc
   TypeImpl.cc
   Vector.cc
-  Writer.cc)
+  Writer.cc
+  CpuInfoUtil.cc
+  BpackingDefault.cc
+  BpackingAvx512.cc
+  Bpacking.cc)

Review Comment:
   OK, already changed.




Re: [PR] ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode [orc]

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1757658298

   > @wpleonardo I tried, but still find no improvement
   > 
   > ```
   > orc file(snappy + unaligned) + avx512
   > 0 rows in set. Elapsed: 3.478 sec. Processed 1.20 million rows, 539.37 MB (345.98 thousand rows/s., 155.08 MB/s.)
   > 0 rows in set. Elapsed: 3.424 sec. Processed 1.20 million rows, 539.37 MB (351.44 thousand rows/s., 157.53 MB/s.)
   > 0 rows in set. Elapsed: 3.444 sec. Processed 1.20 million rows, 539.37 MB (349.44 thousand rows/s., 156.63 MB/s.)
   > 
   > 
   > orc file (snappy + unaligned) +  none
   > 0 rows in set. Elapsed: 3.362 sec. Processed 1.20 million rows, 539.37 MB (357.89 thousand rows/s., 160.42 MB/s.)
   > 0 rows in set. Elapsed: 3.535 sec. Processed 1.20 million rows, 539.37 MB (340.43 thousand rows/s., 152.59 MB/s.)
   > 0 rows in set. Elapsed: 3.370 sec. Processed 1.20 million rows, 539.37 MB (357.08 thousand rows/s., 160.06 MB/s.)
   >  
   > 
   > orc file (lz4 + unaligned) + avx512
   > 0 rows in set. Elapsed: 3.075 sec. Processed 1.20 million rows, 1.90 GB (391.26 thousand rows/s., 618.31 MB/s.)
   > 0 rows in set. Elapsed: 3.082 sec. Processed 1.20 million rows, 1.90 GB (390.46 thousand rows/s., 617.05 MB/s.)
   > 0 rows in set. Elapsed: 3.014 sec. Processed 1.20 million rows, 1.90 GB (399.18 thousand rows/s., 630.82 MB/s.)
   > 
   > 
   > orc file (lz4 + unaligned) + none 
   > rows in set. Elapsed: 2.973 sec. Processed 1.20 million rows, 1.90 GB (404.76 thousand rows/s., 639.64 MB/s.)
   > 0 rows in set. Elapsed: 3.070 sec. Processed 1.20 million rows, 1.90 GB (391.90 thousand rows/s., 619.32 MB/s.)
   > 0 rows in set. Elapsed: 2.903 sec. Processed 1.20 million rows, 1.90 GB (414.51 thousand rows/s., 655.05 MB/s.)
   > ```
   
   Could you do a simple test first, for example, just select the int64 column instead of all columns?



[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1413007448

   > Do you know why this PR causes `ILLEGAL` failures?
   > 
   > ```
   > 75% tests passed, 2 tests failed out of 8
   > 
   > Total Test time (real) = 545.23 sec
   > 
   > The following tests FAILED:
   > 	  1 - orc-test (ILLEGAL)
   > 	  8 - tool-test (ILLEGAL)
   > ```
   
   All checks have already passed. Thanks.



[GitHub] [orc] wgtmac commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1432441361

   > Hi @wgtmac @coderex2522 , I just modified the code following your suggestions, please check it. Thank you very much for your help!
   > 1. Modified the CMakeLists: deleted the aarch64 part and ORC_RUNTIME_SIMD_LEVEL, and changed the printed message content.
   > 2. Modified the print content and style for BUILD_ENABLE_AVX512, CXX_SUPPORTS_AVX512, ORC_HAVE_RUNTIME_AVX512 and ORC_SIMD_LEVEL, and deleted the print of CXX_SUPPORTS_AVX512. Below is the print information in the cmake process:
   > 
   > -- System processor: x86_64
   > -- Performing Test CXX_SUPPORTS_AVX512
   > -- Performing Test CXX_SUPPORTS_AVX512 - Success
   > -- BUILD_ENABLE_AVX512: ON
   > -- Enable the AVX512 vector decode of bit-packing, compiler support AVX512
   > -- ORC_HAVE_RUNTIME_AVX512: ON, ORC_SIMD_LEVEL: AVX512
   > 
   > -- System processor: x86_64
   > -- Performing Test CXX_SUPPORTS_AVX512
   > -- Performing Test CXX_SUPPORTS_AVX512 - Success
   > -- BUILD_ENABLE_AVX512: OFF
   > -- Disable the AVX512 vector decode of bit-packing
   > -- ORC_HAVE_RUNTIME_AVX512: OFF, ORC_SIMD_LEVEL: NONE
   > 
   > 3. Separated the AVX512 configuration from CMakeLists and created a new cmake module file, "cmake_modules/ConfigSimdLevel.cmake".
   > 4. The default value of BUILD_ENABLE_AVX512 is still ON. Do we need to change it back to OFF?
   > 5. Modified the style of the code comments.
   > 6. Deleted message(FATAL_ERROR "Unknown system processor") to avoid breaking the build process.
   
   Let's fix the CI check first. Thanks!



[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1441153998

   > @wpleonardo It seems that the CI check does not provide sufficient error message which makes the debugging painful. Please check out the docker files provided here: https://github.com/apache/orc/tree/main/docker. Hope it helps.
   
   Hi @wgtmac , following your suggestion, I ran the CI tests in Docker containers for the different platforms, but I could not reproduce the failing CI test cases.
   The orc-test and tool-test, which failed in your CI run, passed on all of the platforms I tried (ubuntu22, ubuntu20, ubuntu18, debian10_jdk-11 and CentOS 7). I could not find a macOS Docker image at https://hub.docker.com/r/apache/orc-dev/tags?page=1&ordering=-name
   In my own runs I disabled the Java build (cmake .. -DBUILD_JAVA=OFF && make package test-out), because the Java parts passed in your CI run and, in my opinion, they should be unrelated to the C++ code.
   Do you have any idea about it? Thank you very much!



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1107121064


##########
c++/src/CpuInfoUtil.hh:
##########
@@ -0,0 +1,110 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_CPUINFOUTIL_HH
+#define ORC_CPUINFOUTIL_HH
+
+#include <cstdint>
+#include <memory>
+#include <string>
+
+namespace orc {
+
+  /// CpuInfo is an interface to query for cpu information at runtime.  The caller can
+  /// ask for the sizes of the caches and what hardware features are supported.
+  /// On Linux, this information is pulled from a couple of sys files (/proc/cpuinfo and
+  /// /sys/devices)

Review Comment:
   Fixed



##########
c++/src/CpuInfoUtil.hh:
##########
@@ -0,0 +1,110 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_CPUINFOUTIL_HH
+#define ORC_CPUINFOUTIL_HH
+
+#include <cstdint>
+#include <memory>
+#include <string>
+
+namespace orc {
+
+  /// CpuInfo is an interface to query for cpu information at runtime.  The caller can
+  /// ask for the sizes of the caches and what hardware features are supported.
+  /// On Linux, this information is pulled from a couple of sys files (/proc/cpuinfo and
+  /// /sys/devices)
+  class CpuInfo {
+   public:
+    ~CpuInfo();
+
+    /// x86 features
+    static constexpr int64_t SSSE3 = (1LL << 0);
+    static constexpr int64_t SSE4_1 = (1LL << 1);
+    static constexpr int64_t SSE4_2 = (1LL << 2);
+    static constexpr int64_t POPCNT = (1LL << 3);
+    static constexpr int64_t AVX = (1LL << 4);
+    static constexpr int64_t AVX2 = (1LL << 5);
+    static constexpr int64_t AVX512F = (1LL << 6);
+    static constexpr int64_t AVX512CD = (1LL << 7);
+    static constexpr int64_t AVX512VL = (1LL << 8);
+    static constexpr int64_t AVX512DQ = (1LL << 9);
+    static constexpr int64_t AVX512BW = (1LL << 10);
+    static constexpr int64_t AVX512 = AVX512F | AVX512CD | AVX512VL | AVX512DQ | AVX512BW;
+    static constexpr int64_t BMI1 = (1LL << 11);
+    static constexpr int64_t BMI2 = (1LL << 12);
+
+    /// Arm features
+    static constexpr int64_t ASIMD = (1LL << 32);
+
+    /// Cache enums for L1 (data), L2 and L3
+    enum class CacheLevel { L1 = 0, L2, L3, Last = L3 };
+
+    /// CPU vendors
+    enum class Vendor { Unknown, Intel, AMD };
+
+    static const CpuInfo* GetInstance();
+
+    /// Returns all the flags for this cpu
+    int64_t hardwareFlags() const;
+
+    /// Returns the number of cores (including hyper-threaded) on this machine.
+    int numCores() const;
+
+    /// Returns the vendor of the cpu.
+    Vendor vendor() const;
+
+    /// Returns the model name of the cpu (e.g. Intel i7-2600)
+    const std::string& modelName() const;
+
+    /// Returns the size of the cache in KB at this cache level
+    int64_t CacheSize(CacheLevel level) const;
+
+    /// \brief Returns whether or not the given feature is enabled.
+    ///
+    /// IsSupported() is true if IsDetected() is also true and the feature
+    /// wasn't disabled by the user (for example by setting the ORC_USER_SIMD_LEVEL
+    /// environment variable).
+    bool IsSupported(int64_t flags) const;
+
+    /// Returns whether or not the given feature is available on the CPU.
+    bool IsDetected(int64_t flags) const;
+
+    /// Determine if the CPU meets the minimum CPU requirements and if not, issue an error
+    /// and terminate.
+    void VerifyCpuRequirements() const;
+
+    /// Toggle a hardware feature on and off.  It is not valid to turn on a feature
+    /// that the underlying hardware cannot support. This is useful for testing.
+    // void EnableFeature(int64_t flag, bool enable);

Review Comment:
   Fixed. Sorry for that.
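
The doc comments in the hunk above distinguish IsDetected() (the CPU has the feature) from IsSupported() (detected and not disabled by the user). A minimal sketch of that flag arithmetic follows. It is not the code from the PR: the `sketch` namespace, the `FeatureSet` struct, and its `detected`/`disabled` fields are hypothetical names, and it assumes that a composite mask such as AVX512 requires every one of its bits to be present.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for orc::CpuInfo; only the flag arithmetic is shown.
namespace sketch {
  // Same bit positions as the x86 feature constants in the hunk above.
  constexpr int64_t AVX512F = (1LL << 6);
  constexpr int64_t AVX512CD = (1LL << 7);
  constexpr int64_t AVX512VL = (1LL << 8);
  constexpr int64_t AVX512DQ = (1LL << 9);
  constexpr int64_t AVX512BW = (1LL << 10);
  constexpr int64_t AVX512 = AVX512F | AVX512CD | AVX512VL | AVX512DQ | AVX512BW;

  struct FeatureSet {
    int64_t detected;  // bits reported by the hardware
    int64_t disabled;  // bits switched off by the user (assumed mechanism)

    // True only if every bit in `flags` was detected on the CPU.
    bool isDetected(int64_t flags) const {
      return (detected & flags) == flags;
    }

    // True only if every bit in `flags` was detected and none was disabled.
    bool isSupported(int64_t flags) const {
      return isDetected(flags) && (disabled & flags) == 0;
    }
  };
}  // namespace sketch
```

With this layout, a CPU that reports only AVX512F and AVX512CD is detected for AVX512F but not for the composite AVX512 mask, and disabling any single bit of the mask makes isSupported(AVX512) false while isDetected(AVX512) stays true.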




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1107120791


##########
c++/src/Dispatch.hh:
##########
@@ -0,0 +1,109 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_DISPATCH_HH
+#define ORC_DISPATCH_HH
+
+#include <utility>
+#include <vector>
+
+#include "CpuInfoUtil.hh"
+
+namespace orc {
+  enum class DispatchLevel : int {
+    // These dispatch levels, corresponding to instruction set features,
+    // are sorted in increasing order of preference.
+    NONE = 0,
+    AVX512,
+    MAX
+  };
+
+  /*
+    A facility for dynamic dispatch according to available DispatchLevel.
+
+    Typical use:
+
+      static void my_function_default(...);
+      static void my_function_avx512(...);
+
+      struct MyDynamicFunction {
+        using FunctionType = decltype(&my_function_default);
+
+        static std::vector<std::pair<DispatchLevel, FunctionType>> implementations() {
+          return {
+            { DispatchLevel::NONE, my_function_default }
+      #if defined(ARROW_HAVE_RUNTIME_AVX512)
+            , { DispatchLevel::AVX512, my_function_avx512 }
+      #endif
+          };
+        }
+      };
+
+      void my_function(...) {
+        static DynamicDispatch<MyDynamicFunction> dispatch;
+        return dispatch.func(...);
+      }
+  */
+  template <typename DynamicFunction>
+  class DynamicDispatch {
+   protected:
+    using FunctionType = typename DynamicFunction::FunctionType;
+    using Implementation = std::pair<DispatchLevel, FunctionType>;
+
+   public:
+    DynamicDispatch() {
+      Resolve(DynamicFunction::implementations());
+    }
+
+    FunctionType func = {};
+
+   protected:
+    // Use the Implementation with the highest DispatchLevel
+    void Resolve(const std::vector<Implementation>& implementations) {
+      Implementation cur{DispatchLevel::NONE, {}};
+
+      for (const auto& impl : implementations) {
+        if (impl.first >= cur.first && IsSupported(impl.first)) {
+          // Higher (or same) level than current
+          cur = impl;
+        }
+      }
+
+      if (!cur.second) {
+        throw InvalidArgument("No appropriate implementation found");
+      }
+      func = cur.second;
+    }
+
+   private:
+    bool IsSupported(DispatchLevel level) const {
+      static const auto cpu_info = orc::CpuInfo::GetInstance();

Review Comment:
   Fixed



##########
c++/src/Dispatch.hh:
##########
@@ -0,0 +1,109 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_DISPATCH_HH
+#define ORC_DISPATCH_HH
+
+#include <utility>
+#include <vector>
+
+#include "CpuInfoUtil.hh"
+
+namespace orc {
+  enum class DispatchLevel : int {
+    // These dispatch levels, corresponding to instruction set features,
+    // are sorted in increasing order of preference.
+    NONE = 0,
+    AVX512,
+    MAX
+  };
+
+  /*
+    A facility for dynamic dispatch according to available DispatchLevel.
+
+    Typical use:
+
+      static void my_function_default(...);
+      static void my_function_avx512(...);
+
+      struct MyDynamicFunction {
+        using FunctionType = decltype(&my_function_default);
+
+        static std::vector<std::pair<DispatchLevel, FunctionType>> implementations() {
+          return {
+            { DispatchLevel::NONE, my_function_default }
+      #if defined(ARROW_HAVE_RUNTIME_AVX512)

Review Comment:
   Fixed




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1107120436


##########
c++/src/BitUnpackerAvx512.hh:
##########
@@ -0,0 +1,488 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef VECTOR_DECODER_HH
+#define VECTOR_DECODER_HH
+
+// Mingw-w64 defines strcasecmp in string.h
+#if defined(_WIN32) && !defined(strcasecmp)
+#include <string.h>
+#define strcasecmp stricmp
+#else
+#include <strings.h>
+#endif
+
+#if defined(ORC_HAVE_RUNTIME_AVX512)

Review Comment:
   Fixed



##########
c++/src/Bpacking.hh:
##########
@@ -0,0 +1,40 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKING_HH
+#define ORC_BPACKING_HH
+
+#include <stdint.h>
+
+#include "BpackingDefault.hh"
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+#include "BpackingAvx512.hh"

Review Comment:
   Fixed




[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1479086120

   > @stiga-huang @coderex2522 Could you please take a look again? It generally looks good to me now.
   
   Thank you very much for your kind and helpful comments @wgtmac 



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1133909981


##########
cmake_modules/ConfigSimdLevel.cmake:
##########
@@ -0,0 +1,105 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+INCLUDE(CheckCXXCompilerFlag)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_SIMD_LEVEL)
+  set(ORC_SIMD_LEVEL
+      "DEFAULT"
+      CACHE STRING "Compile time SIMD optimization level")
+endif()
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(STATUS "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+  else()
+    # "arch=native" selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine.
+    # Using -march=native enables all instruction subsets supported by the local machine.
+    # Using -mtune=native produces code optimized for the local machine under the constraints of the selected instruction set.
+    set(ORC_AVX512_FLAG "-march=native -mtune=native")
+  endif()
+
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")
+    CHECK_CXX_SOURCE_COMPILES("
+      #ifdef _MSC_VER
+      #include <intrin.h>
+      #else
+      #include <immintrin.h>
+      #endif
+
+      int main() {
+        __m512i mask = _mm512_set1_epi32(0x1);
+        char out[32];
+        _mm512_storeu_si512(out, mask);
+        return 0;
+      }"
+      CXX_SUPPORTS_AVX512)
+    set(CMAKE_REQUIRED_FLAGS ${OLD_CMAKE_REQURED_FLAGS})
+  endif()
+
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512 AND NOT MSVC)
+    execute_process(COMMAND grep flags /proc/cpuinfo
+                    COMMAND head -1
+                    OUTPUT_VARIABLE flags_ver)
+    message(STATUS "CPU ${flags_ver}")
+  endif()
+
+  message(STATUS "BUILD_ENABLE_AVX512: ${BUILD_ENABLE_AVX512}")
+  # Runtime SIMD level it can get from compiler
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512)
+    message(STATUS "Enable the AVX512 vector decode of bit-packing, compiler support AVX512")
+    set(ORC_HAVE_RUNTIME_AVX512 ON)
+    set(ORC_SIMD_LEVEL "AVX512")
+    add_definitions(-DORC_HAVE_RUNTIME_AVX512)
+  elseif(BUILD_ENABLE_AVX512 AND NOT CXX_SUPPORTS_AVX512)
+    message(FATAL_ERROR "AVX512 required but compiler doesn't support it, failed to enable AVX512.")
+    set(ORC_HAVE_RUNTIME_AVX512 OFF)
+  elseif(NOT BUILD_ENABLE_AVX512)

Review Comment:
   Yes, we can remove the BUILD_ENABLE_AVX512 check in ConfigSimdLevel.cmake. Thank you very much for the reminder.




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1141576118


##########
c++/test/TestRleVectorDecoder.cc:
##########
@@ -0,0 +1,561 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <cstdlib>
+
+#include "MemoryOutputStream.hh"
+#include "RLEv2.hh"
+#include "wrap/gtest-wrapper.h"
+#include "wrap/orc-proto-wrapper.hh"
+
+#ifdef __clang__
+DIAGNOSTIC_IGNORE("-Wmissing-variable-declarations")
+#endif
+
+namespace orc {
+  using ::testing::TestWithParam;
+  using ::testing::Values;
+
+  const int DEFAULT_MEM_STREAM_SIZE = 1024 * 1024;  // 1M
+  const char finish = '#';
+  std::string flags = "-\\|/";
+
+  class RleV2BitUnpackAvx512Test : public TestWithParam<bool> {
+    virtual void SetUp();
+
+   protected:
+    bool alignBitpacking;
+    std::unique_ptr<RleEncoder> getEncoder(RleVersion version, MemoryOutputStream& memStream,
+                                           bool isSigned);
+
+    void runExampleTest(int64_t* inputData, uint64_t inputLength, unsigned char* expectedOutput,
+                        uint64_t outputLength);
+
+    void runTest(RleVersion version, uint64_t numValues, int64_t start, int64_t delta, bool random,
+                 bool isSigned, uint8_t bitWidth, uint64_t blockSize = 0, uint64_t numNulls = 0);
+  };
+
+  void vectorDecodeAndVerify(RleVersion version, const MemoryOutputStream& memStream, int64_t* data,
+                             uint64_t numValues, const char* notNull, uint64_t blockSize,
+                             bool isSinged) {
+    std::unique_ptr<RleDecoder> decoder =
+        createRleDecoder(std::unique_ptr<SeekableArrayInputStream>(new SeekableArrayInputStream(
+                             memStream.getData(), memStream.getLength(), blockSize)),
+                         isSinged, version, *getDefaultPool(), getDefaultReaderMetrics());
+
+    int64_t* decodedData = new int64_t[numValues];
+    decoder->next(decodedData, numValues, notNull);
+
+    for (uint64_t i = 0; i < numValues; ++i) {
+      if (!notNull || notNull[i]) {
+        EXPECT_EQ(data[i], decodedData[i]);
+      }
+    }
+
+    delete[] decodedData;
+  }
+
+  void RleV2BitUnpackAvx512Test::SetUp() {
+    alignBitpacking = GetParam();
+  }
+
+  void generateDataFolBits(uint64_t numValues, int64_t start, int64_t delta, bool random,

Review Comment:
   Done.



##########
c++/test/CMakeLists.txt:
##########
@@ -18,6 +18,10 @@ include_directories(
 
 set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX17_FLAGS} ${WARN_FLAGS}")
 
+if(BUILD_ENABLE_AVX512)
+  set(SIMD_TEST TestRleVectorDecoder.cc)

Review Comment:
   Done.




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1141583305


##########
c++/src/CMakeLists.txt:
##########
@@ -184,13 +184,21 @@ set(SOURCE_FILES
   Timezone.cc
   TypeImpl.cc
   Vector.cc
-  Writer.cc)
+  Writer.cc
+  CpuInfoUtil.cc
+  BpackingDefault.cc)

Review Comment:
   They are now sorted alphabetically.
   
   CpuInfoUtil.cc is built by default because of this call chain:
   [RleDecoderV2.cc] readLongs => [Dispatch.hh] DynamicDispatch => Resolve
   Resolve uses CpuInfo to decide whether the CPU supports the selected bit-unpacking function.
   So I think it is better to keep CpuInfoUtil.cc in the default list of source files.
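
The call chain above can be illustrated with a small, self-contained sketch of runtime dispatch. This is only an approximation of the idea behind DynamicDispatch::Resolve, not the code in the PR: the `sketch` namespace, the two kernel functions, and the `cpuHasAvx512` parameter are hypothetical, and the boolean stands in for the CpuInfo query so that the example does not depend on the host CPU.

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>
#include <vector>

namespace sketch {
  // Levels are ordered by increasing preference.
  enum class DispatchLevel : int { NONE = 0, AVX512 };

  using KernelFn = DispatchLevel (*)();
  using Implementation = std::pair<DispatchLevel, KernelFn>;

  // Hypothetical kernels: each one only reports which code path it represents.
  inline DispatchLevel kernelDefault() { return DispatchLevel::NONE; }
  inline DispatchLevel kernelAvx512() { return DispatchLevel::AVX512; }

  // Returns the implementation with the highest level that the CPU accepts.
  // `cpuHasAvx512` replaces the CpuInfo lookup (assumption for this sketch).
  inline KernelFn resolve(const std::vector<Implementation>& impls, bool cpuHasAvx512) {
    Implementation cur{DispatchLevel::NONE, nullptr};
    for (const auto& impl : impls) {
      const bool supported = impl.first == DispatchLevel::NONE || cpuHasAvx512;
      if (impl.first >= cur.first && supported) {
        cur = impl;  // higher (or same) level than the current choice
      }
    }
    if (cur.second == nullptr) {
      throw std::runtime_error("No appropriate implementation found");
    }
    return cur.second;
  }
}  // namespace sketch
```

Because the CPU query happens inside the resolve step even when only the default kernel is registered, the file that implements the query has to be compiled in every configuration, which matches the reasoning given above.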




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1141577945


##########
c++/src/Bpacking.hh:
##########
@@ -0,0 +1,34 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKING_HH
+#define ORC_BPACKING_HH
+
+#include <cstdint>
+
+#include "RLEv2.hh"

Review Comment:
   Yes, I removed #include "RLEv2.hh" from Bpacking.hh.



##########
c++/src/BitUnpackerAvx512.hh:
##########
@@ -0,0 +1,488 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BIT_UNPACKER_AVX512_HH
+#define ORC_BIT_UNPACKER_AVX512_HH
+
+// Mingw-w64 defines strcasecmp in string.h
+#if defined(_WIN32) && !defined(strcasecmp)
+#include <string.h>
+#define strcasecmp stricmp
+#else
+#include <strings.h>
+#endif
+
+#include <immintrin.h>
+#include <cstdint>
+#include <vector>
+
+namespace orc {
+#define ORC_VECTOR_BITS_2_BYTE(x) \
+  (((x) + 7u) >> 3u) /**< Convert a number of bits to a number of bytes */
+#define ORC_VECTOR_ONE_64U (1ULL)
+#define ORC_VECTOR_MAX_16U 0xFFFF     /**< Max value for uint16_t */
+#define ORC_VECTOR_MAX_32U 0xFFFFFFFF /**< Max value for uint32_t */
+#define ORC_VECTOR_BYTE_WIDTH 8u      /**< Byte width in bits */
+#define ORC_VECTOR_WORD_WIDTH 16u     /**< Word width in bits */
+#define ORC_VECTOR_DWORD_WIDTH 32u    /**< Dword width in bits */
+#define ORC_VECTOR_QWORD_WIDTH 64u    /**< Qword width in bits */
+#define ORC_VECTOR_BIT_MASK(x) \
+  ((ORC_VECTOR_ONE_64U << (x)) - 1u) /**< Bit mask below bit position */
+
+#define ORC_VECTOR_BITS_2_WORD(x) \
+  (((x) + 15u) >> 4u) /**< Convert a number of bits to a number of words */
+#define ORC_VECTOR_BITS_2_DWORD(x) \
+  (((x) + 31u) >> 5u) /**< Convert a number of bits to a number of double words */
+
+  // ------------------------------------ 3u -----------------------------------------
+  static uint8_t shuffleIdxTable3u_0[64] = {
+      1u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 4u, 3u, 5u, 4u, 6u, 5u, 1u, 0u, 1u, 0u, 2u, 1u,
+      3u, 2u, 4u, 3u, 4u, 3u, 5u, 4u, 6u, 5u, 1u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 4u, 3u,
+      5u, 4u, 6u, 5u, 1u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 4u, 3u, 5u, 4u, 6u, 5u};
+  static uint8_t shuffleIdxTable3u_1[64] = {
+      0u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u, 0u, 0u, 1u, 0u, 2u, 1u,
+      3u, 2u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u, 0u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 3u, 2u, 4u, 3u,
+      5u, 4u, 6u, 5u, 0u, 0u, 1u, 0u, 2u, 1u, 3u, 2u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u};
+  static uint16_t shiftTable3u_0[32] = {13u, 7u,  9u,  11u, 13u, 7u,  9u,  11u, 13u, 7u,  9u,
+                                        11u, 13u, 7u,  9u,  11u, 13u, 7u,  9u,  11u, 13u, 7u,
+                                        9u,  11u, 13u, 7u,  9u,  11u, 13u, 7u,  9u,  11u};
+  static uint16_t shiftTable3u_1[32] = {6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u,
+                                        0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u,
+                                        2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u};
+  static uint16_t permutexIdxTable3u[32] = {0u,  1u,  2u,  0x0, 0x0, 0x0, 0x0, 0x0, 3u,  4u,  5u,
+                                            0x0, 0x0, 0x0, 0x0, 0x0, 6u,  7u,  8u,  0x0, 0x0, 0x0,
+                                            0x0, 0x0, 9u,  10u, 11u, 0x0, 0x0, 0x0, 0x0, 0x0};
+
+  // ------------------------------------ 5u -----------------------------------------
+  static uint8_t shuffleIdxTable5u_0[64] = {
+      1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u,
+      4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u,
+      8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u};
+  static uint8_t shuffleIdxTable5u_1[64] = {
+      1u, 0u, 2u,  1u, 3u, 2u, 5u, 4u, 6u,  5u, 7u, 6u, 8u, 7u, 10u, 9u, 1u, 0u, 2u,  1u, 3u, 2u,
+      5u, 4u, 6u,  5u, 7u, 6u, 8u, 7u, 10u, 9u, 1u, 0u, 2u, 1u, 3u,  2u, 5u, 4u, 6u,  5u, 7u, 6u,
+      8u, 7u, 10u, 9u, 1u, 0u, 2u, 1u, 3u,  2u, 5u, 4u, 6u, 5u, 7u,  6u, 8u, 7u, 10u, 9u};
+  static uint16_t shiftTable5u_0[32] = {11u, 9u,  7u,  5u, 11u, 9u,  7u,  5u, 11u, 9u,  7u,
+                                        5u,  11u, 9u,  7u, 5u,  11u, 9u,  7u, 5u,  11u, 9u,
+                                        7u,  5u,  11u, 9u, 7u,  5u,  11u, 9u, 7u,  5u};
+  static uint16_t shiftTable5u_1[32] = {2u, 4u, 6u, 0u, 2u, 4u, 6u, 0u, 2u, 4u, 6u,
+                                        0u, 2u, 4u, 6u, 0u, 2u, 4u, 6u, 0u, 2u, 4u,
+                                        6u, 0u, 2u, 4u, 6u, 0u, 2u, 4u, 6u, 0u};
+  static uint16_t permutexIdxTable5u[32] = {0u,  1u,  2u,  3u,  4u,  0x0, 0x0, 0x0, 5u,  6u,  7u,
+                                            8u,  9u,  0x0, 0x0, 0x0, 10u, 11u, 12u, 13u, 14u, 0x0,
+                                            0x0, 0x0, 15u, 16u, 17u, 18u, 19u, 0x0, 0x0, 0x0};
+
+  // ------------------------------------ 6u -----------------------------------------
+  static uint8_t shuffleIdxTable6u_0[64] = {
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u};
+  static uint8_t shuffleIdxTable6u_1[64] = {
+      1u, 0u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 9u, 8u, 10u, 9u, 12u, 11u,
+      1u, 0u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 9u, 8u, 10u, 9u, 12u, 11u,
+      1u, 0u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 9u, 8u, 10u, 9u, 12u, 11u,
+      1u, 0u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 9u, 8u, 10u, 9u, 12u, 11u};
+  static uint16_t shiftTable6u_0[32] = {10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u,  10u,
+                                        6u,  10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u,
+                                        10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u,  10u, 6u};
+  static uint16_t shiftTable6u_1[32] = {4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u,
+                                        0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u,
+                                        4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u};
+  static uint32_t permutexIdxTable6u[16] = {0u, 1u, 2u, 0x0, 3u, 4u,  5u,  0x0,
+                                            6u, 7u, 8u, 0x0, 9u, 10u, 11u, 0x0};
+
+  // ------------------------------------ 7u -----------------------------------------
+  static uint8_t shuffleIdxTable7u_0[64] = {
+      1u, 0u, 2u, 1u, 4u, 3u, 6u, 5u, 8u, 7u, 9u, 8u, 11u, 10u, 13u, 12u,
+      1u, 0u, 2u, 1u, 4u, 3u, 6u, 5u, 8u, 7u, 9u, 8u, 11u, 10u, 13u, 12u,
+      1u, 0u, 2u, 1u, 4u, 3u, 6u, 5u, 8u, 7u, 9u, 8u, 11u, 10u, 13u, 12u,
+      1u, 0u, 2u, 1u, 4u, 3u, 6u, 5u, 8u, 7u, 9u, 8u, 11u, 10u, 13u, 12u};
+  static uint8_t shuffleIdxTable7u_1[64] = {
+      1u, 0u, 3u, 2u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 12u, 11u, 14u, 13u,
+      1u, 0u, 3u, 2u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 12u, 11u, 14u, 13u,
+      1u, 0u, 3u, 2u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 12u, 11u, 14u, 13u,
+      1u, 0u, 3u, 2u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 12u, 11u, 14u, 13u};
+  static uint16_t shiftTable7u_0[32] = {9u, 3u, 5u, 7u, 9u, 3u, 5u, 7u, 9u, 3u, 5u,
+                                        7u, 9u, 3u, 5u, 7u, 9u, 3u, 5u, 7u, 9u, 3u,
+                                        5u, 7u, 9u, 3u, 5u, 7u, 9u, 3u, 5u, 7u};
+  static uint16_t shiftTable7u_1[32] = {6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u,
+                                        0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u,
+                                        2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u};
+  static uint16_t permutexIdxTable7u[32] = {0u,  1u,  2u,  3u,  4u,  5u,  6u,  0x0, 7u,  8u,  9u,
+                                            10u, 11u, 12u, 13u, 0x0, 14u, 15u, 16u, 17u, 18u, 19u,
+                                            20u, 0x0, 21u, 22u, 23u, 24u, 25u, 26u, 27u, 0x0};
+
+  // ------------------------------------ 9u -----------------------------------------
+  static uint16_t permutexIdxTable9u_0[32] = {0u,  1u,  1u,  2u,  2u,  3u,  3u,  4u,  4u,  5u,  5u,
+                                              6u,  6u,  7u,  7u,  8u,  9u,  10u, 10u, 11u, 11u, 12u,
+                                              12u, 13u, 13u, 14u, 14u, 15u, 15u, 16u, 16u, 17u};
+  static uint16_t permutexIdxTable9u_1[32] = {0u,  1u,  1u,  2u,  2u,  3u,  3u,  4u,  5u,  6u,  6u,
+                                              7u,  7u,  8u,  8u,  9u,  9u,  10u, 10u, 11u, 11u, 12u,
+                                              12u, 13u, 14u, 15u, 15u, 16u, 16u, 17u, 17u, 18u};
+  static uint32_t shiftTable9u_0[16] = {0u, 2u, 4u, 6u, 8u, 10u, 12u, 14u,
+                                        0u, 2u, 4u, 6u, 8u, 10u, 12u, 14u};
+  static uint32_t shiftTable9u_1[16] = {7u, 5u, 3u, 1u, 15u, 13u, 11u, 9u,
+                                        7u, 5u, 3u, 1u, 15u, 13u, 11u, 9u};
+
+  static uint8_t shuffleIdxTable9u_0[64] = {
+      1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u, 7u, 6u, 8u, 7u, 1u, 0u, 2u, 1u, 3u, 2u,
+      4u, 3u, 5u, 4u, 6u, 5u, 7u, 6u, 8u, 7u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u,
+      7u, 6u, 8u, 7u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 5u, 4u, 6u, 5u, 7u, 6u, 8u, 7u};
+  static uint16_t shiftTable9u_2[32] = {7u, 6u, 5u, 4u, 3u, 2u, 1u, 0u, 7u, 6u, 5u,
+                                        4u, 3u, 2u, 1u, 0u, 7u, 6u, 5u, 4u, 3u, 2u,
+                                        1u, 0u, 7u, 6u, 5u, 4u, 3u, 2u, 1u, 0u};
+  static uint64_t gatherIdxTable9u[8] = {0u, 8u, 9u, 17u, 18u, 26u, 27u, 35u};
+
+  // ------------------------------------ 10u -----------------------------------------
+  static uint8_t shuffleIdxTable10u_0[64] = {
+      1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u,
+      4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u,
+      8u, 7u, 9u, 8u, 1u, 0u, 2u, 1u, 3u, 2u, 4u, 3u, 6u, 5u, 7u, 6u, 8u, 7u, 9u, 8u};
+  static uint16_t shiftTable10u[32] = {6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u,
+                                       0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u,
+                                       2u, 0u, 6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u};
+  static uint16_t permutexIdxTable10u[32] = {0u,  1u,  2u,  3u,  4u,  0x0, 0x0, 0x0, 5u,  6u,  7u,
+                                             8u,  9u,  0x0, 0x0, 0x0, 10u, 11u, 12u, 13u, 14u, 0x0,
+                                             0x0, 0x0, 15u, 16u, 17u, 18u, 19u, 0x0, 0x0, 0x0};
+
+  // ------------------------------------ 11u -----------------------------------------
+  static uint16_t permutexIdxTable11u_0[32] = {
+      0u,  1u,  1u,  2u,  2u,  3u,  4u,  5u,  5u,  6u,  6u,  7u,  8u,  9u,  9u,  10u,
+      11u, 12u, 12u, 13u, 13u, 14u, 15u, 16u, 16u, 17u, 17u, 18u, 19u, 20u, 20u, 21u};
+  static uint16_t permutexIdxTable11u_1[32] = {
+      0u,  1u,  2u,  3u,  3u,  4u,  4u,  5u,  6u,  7u,  7u,  8u,  8u,  9u,  10u, 11u,
+      11u, 12u, 13u, 14u, 14u, 15u, 15u, 16u, 17u, 18u, 18u, 19u, 19u, 20u, 21u, 22u};
+  static uint32_t shiftTable11u_0[16] = {0u, 6u, 12u, 2u, 8u, 14u, 4u, 10u,
+                                         0u, 6u, 12u, 2u, 8u, 14u, 4u, 10u};
+  static uint32_t shiftTable11u_1[16] = {5u, 15u, 9u, 3u, 13u, 7u, 1u, 11u,
+                                         5u, 15u, 9u, 3u, 13u, 7u, 1u, 11u};
+
+  static uint8_t shuffleIdxTable11u_0[64] = {
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u};
+  static uint8_t shuffleIdxTable11u_1[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 8u, 7u, 6u, 0u, 11u, 10u, 9u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 8u, 7u, 6u, 0u, 11u, 10u, 9u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 8u, 7u, 6u, 0u, 11u, 10u, 9u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 8u, 7u, 6u, 0u, 11u, 10u, 9u, 0u};
+  static uint32_t shiftTable11u_2[16] = {21u, 15u, 17u, 19u, 21u, 15u, 17u, 19u,
+                                         21u, 15u, 17u, 19u, 21u, 15u, 17u, 19u};
+  static uint32_t shiftTable11u_3[16] = {6u, 4u, 10u, 8u, 6u, 4u, 10u, 8u,
+                                         6u, 4u, 10u, 8u, 6u, 4u, 10u, 8u};
+  static uint64_t gatherIdxTable11u[8] = {0u, 8u, 11u, 19u, 22u, 30u, 33u, 41u};
+
+  // ------------------------------------ 12u -----------------------------------------
+  static uint8_t shuffleIdxTable12u_0[64] = {
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u,
+      1u, 0u, 2u, 1u, 4u, 3u, 5u, 4u, 7u, 6u, 8u, 7u, 10u, 9u, 11u, 10u};
+  static uint16_t shiftTable12u[32] = {4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u,
+                                       0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u,
+                                       4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u};
+  static uint32_t permutexIdxTable12u[16] = {0u, 1u, 2u, 0x0, 3u, 4u,  5u,  0x0,
+                                             6u, 7u, 8u, 0x0, 9u, 10u, 11u, 0x0};
+
+  // ------------------------------------ 13u -----------------------------------------
+  static uint16_t permutexIdxTable13u_0[32] = {
+      0u,  1u,  1u,  2u,  3u,  4u,  4u,  5u,  6u,  7u,  8u,  9u,  9u,  10u, 11u, 12u,
+      13u, 14u, 14u, 15u, 16u, 17u, 17u, 18u, 19u, 20u, 21u, 22u, 22u, 23u, 24u, 25u};
+  static uint16_t permutexIdxTable13u_1[32] = {
+      0u,  1u,  2u,  3u,  4u,  5u,  5u,  6u,  7u,  8u,  8u,  9u,  10u, 11u, 12u, 13u,
+      13u, 14u, 15u, 16u, 17u, 18u, 18u, 19u, 20u, 21u, 21u, 22u, 23u, 24u, 25u, 26u};
+  static uint32_t shiftTable13u_0[16] = {0u, 10u, 4u, 14u, 8u, 2u, 12u, 6u,
+                                         0u, 10u, 4u, 14u, 8u, 2u, 12u, 6u};
+  static uint32_t shiftTable13u_1[16] = {3u, 9u, 15u, 5u, 11u, 1u, 7u, 13u,
+                                         3u, 9u, 15u, 5u, 11u, 1u, 7u, 13u};
+
+  static uint8_t shuffleIdxTable13u_0[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u};
+  static uint8_t shuffleIdxTable13u_1[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 10u, 9u, 8u, 0u, 13u, 12u, 11u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 10u, 9u, 8u, 0u, 13u, 12u, 11u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 10u, 9u, 8u, 0u, 13u, 12u, 11u, 0u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 0u, 10u, 9u, 8u, 0u, 13u, 12u, 11u, 0u};
+  static uint32_t shiftTable13u_2[16] = {19u, 17u, 15u, 13u, 19u, 17u, 15u, 13u,
+                                         19u, 17u, 15u, 13u, 19u, 17u, 15u, 13u};
+  static uint32_t shiftTable13u_3[16] = {10u, 12u, 6u, 8u, 10u, 12u, 6u, 8u,
+                                         10u, 12u, 6u, 8u, 10u, 12u, 6u, 8u};
+  static uint64_t gatherIdxTable13u[8] = {0u, 8u, 13u, 21u, 26u, 34u, 39u, 47u};
+
+  // ------------------------------------ 14u -----------------------------------------
+  static uint8_t shuffleIdxTable14u_0[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u};
+  static uint8_t shuffleIdxTable14u_1[64] = {
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 10u, 9u, 8u, 0u, 14u, 13u, 12u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 10u, 9u, 8u, 0u, 14u, 13u, 12u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 10u, 9u, 8u, 0u, 14u, 13u, 12u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 10u, 9u, 8u, 0u, 14u, 13u, 12u, 0u};
+  static uint32_t shiftTable14u_0[16] = {18u, 14u, 18u, 14u, 18u, 14u, 18u, 14u,
+                                         18u, 14u, 18u, 14u, 18u, 14u, 18u, 14u};
+  static uint32_t shiftTable14u_1[16] = {12u, 8u, 12u, 8u, 12u, 8u, 12u, 8u,
+                                         12u, 8u, 12u, 8u, 12u, 8u, 12u, 8u};
+  static uint16_t permutexIdxTable14u[32] = {0u,  1u,  2u,  3u,  4u,  5u,  6u,  0x0, 7u,  8u,  9u,
+                                             10u, 11u, 12u, 13u, 0x0, 14u, 15u, 16u, 17u, 18u, 19u,
+                                             20u, 0x0, 21u, 22u, 23u, 24u, 25u, 26u, 27u, 0x0};
+
+  // ------------------------------------ 15u -----------------------------------------
+  static uint16_t permutexIdxTable15u_0[32] = {
+      0u,  1u,  1u,  2u,  3u,  4u,  5u,  6u,  7u,  8u,  9u,  10u, 11u, 12u, 13u, 14u,
+      15u, 16u, 16u, 17u, 18u, 19u, 20u, 21u, 22u, 23u, 24u, 25u, 26u, 27u, 28u, 29u};
+  static uint16_t permutexIdxTable15u_1[32] = {
+      0u,  1u,  2u,  3u,  4u,  5u,  6u,  7u,  8u,  9u,  10u, 11u, 12u, 13u, 14u, 15u,
+      15u, 16u, 17u, 18u, 19u, 20u, 21u, 22u, 23u, 24u, 25u, 26u, 27u, 28u, 29u, 30u};
+  static uint32_t shiftTable15u_0[16] = {0u, 14u, 12u, 10u, 8u, 6u, 4u, 2u,
+                                         0u, 14u, 12u, 10u, 8u, 6u, 4u, 2u};
+  static uint32_t shiftTable15u_1[16] = {1u, 3u, 5u, 7u, 9u, 11u, 13u, 15u,
+                                         1u, 3u, 5u, 7u, 9u, 11u, 13u, 15u};
+
+  static uint8_t shuffleIdxTable15u_0[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 14u, 13u, 12u, 11u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 14u, 13u, 12u, 11u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 14u, 13u, 12u, 11u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 14u, 13u, 12u, 11u};
+  static uint8_t shuffleIdxTable15u_1[64] = {
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 11u, 10u, 9u, 0u, 15u, 14u, 13u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 11u, 10u, 9u, 0u, 15u, 14u, 13u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 11u, 10u, 9u, 0u, 15u, 14u, 13u, 0u,
+      3u, 2u, 1u, 0u, 7u, 6u, 5u, 0u, 11u, 10u, 9u, 0u, 15u, 14u, 13u, 0u};
+  static uint32_t shiftTable15u_2[16] = {17u, 11u, 13u, 15u, 17u, 11u, 13u, 15u,
+                                         17u, 11u, 13u, 15u, 17u, 11u, 13u, 15u};
+  static uint32_t shiftTable15u_3[16] = {14u, 12u, 10u, 8u, 14u, 12u, 10u, 8u,
+                                         14u, 12u, 10u, 8u, 14u, 12u, 10u, 8u};
+  static uint64_t gatherIdxTable15u[8] = {0u, 8u, 15u, 23u, 30u, 38u, 45u, 53u};
+
+  // ------------------------------------ 17u -----------------------------------------
+  static uint32_t permutexIdxTable17u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               4u, 5u, 5u, 6u, 6u, 7u, 7u, 8u};
+  static uint32_t permutexIdxTable17u_1[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               4u, 5u, 5u, 6u, 6u, 7u, 7u, 8u};
+  static uint64_t shiftTable17u_0[8] = {0u, 2u, 4u, 6u, 8u, 10u, 12u, 14u};
+  static uint64_t shiftTable17u_1[8] = {15u, 13u, 11u, 9u, 7u, 5u, 3u, 1u};
+
+  static uint8_t shuffleIdxTable17u_0[64] = {
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u,
+      3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u,
+      9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u};
+  static uint32_t shiftTable17u_2[16] = {15u, 14u, 13u, 12u, 11u, 10u, 9u, 8u,
+                                         15u, 14u, 13u, 12u, 11u, 10u, 9u, 8u};
+  static uint64_t gatherIdxTable17u[8] = {0u, 8u, 8u, 16u, 17u, 25u, 25u, 33u};
+
+  // ------------------------------------ 18u -----------------------------------------
+  static uint32_t permutexIdxTable18u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               4u, 5u, 5u, 6u, 6u, 7u, 7u, 8u};
+  static uint32_t permutexIdxTable18u_1[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               5u, 6u, 6u, 7u, 7u, 8u, 8u, 9u};
+  static uint64_t shiftTable18u_0[8] = {0u, 4u, 8u, 12u, 16u, 20u, 24u, 28u};
+  static uint64_t shiftTable18u_1[8] = {14u, 10u, 6u, 2u, 30u, 26u, 22u, 18u};
+
+  static uint8_t shuffleIdxTable18u_0[64] = {
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u,
+      3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u,
+      9u, 8u, 7u, 6u, 3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 7u, 6u, 5u, 4u, 9u, 8u, 7u, 6u};
+  static uint32_t shiftTable18u_2[16] = {14u, 12u, 10u, 8u, 14u, 12u, 10u, 8u,
+                                         14u, 12u, 10u, 8u, 14u, 12u, 10u, 8u};
+  static uint64_t gatherIdxTable18u[8] = {0u, 8u, 9u, 17u, 18u, 26u, 27u, 35u};
+
+  // ------------------------------------ 19u -----------------------------------------
+  static uint32_t permutexIdxTable19u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               4u, 5u, 5u, 6u, 7u, 8u, 8u, 9u};
+  static uint32_t permutexIdxTable19u_1[16] = {0u, 1u, 1u, 2u, 2u, 3u, 4u, 5u,
+                                               5u, 6u, 6u, 7u, 7u, 8u, 8u, 9u};
+  static uint64_t shiftTable19u_0[8] = {0u, 6u, 12u, 18u, 24u, 30u, 4u, 10u};
+  static uint64_t shiftTable19u_1[8] = {13u, 7u, 1u, 27u, 21u, 15u, 9u, 3u};
+
+  static uint8_t shuffleIdxTable19u_0[64] = {
+      3u,  2u, 1u, 0u, 5u, 4u, 3u,  2u, 7u, 6u, 5u, 4u, 10u, 9u, 8u, 7u, 3u,  2u, 1u, 0u, 5u, 4u,
+      3u,  2u, 8u, 7u, 6u, 5u, 10u, 9u, 8u, 7u, 3u, 2u, 1u,  0u, 5u, 4u, 3u,  2u, 7u, 6u, 5u, 4u,
+      10u, 9u, 8u, 7u, 3u, 2u, 1u,  0u, 5u, 4u, 3u, 2u, 8u,  7u, 6u, 5u, 10u, 9u, 8u, 7u};
+  static uint32_t shiftTable19u_2[16] = {13u, 10u, 7u, 12u, 9u, 6u, 11u, 8u,
+                                         13u, 10u, 7u, 12u, 9u, 6u, 11u, 8u};
+  static uint64_t gatherIdxTable19u[8] = {0u, 8u, 9u, 17u, 19u, 27u, 28u, 36u};
+
+  // ------------------------------------ 20u -----------------------------------------
+  static uint8_t shuffleIdxTable20u_0[64] = {
+      3u,  2u, 1u, 0u, 5u, 4u, 3u,  2u, 8u, 7u, 6u, 5u, 10u, 9u, 8u, 7u, 3u,  2u, 1u, 0u, 5u, 4u,
+      3u,  2u, 8u, 7u, 6u, 5u, 10u, 9u, 8u, 7u, 3u, 2u, 1u,  0u, 5u, 4u, 3u,  2u, 8u, 7u, 6u, 5u,
+      10u, 9u, 8u, 7u, 3u, 2u, 1u,  0u, 5u, 4u, 3u, 2u, 8u,  7u, 6u, 5u, 10u, 9u, 8u, 7u};
+  static uint32_t shiftTable20u[16] = {12u, 8u, 12u, 8u, 12u, 8u, 12u, 8u,
+                                       12u, 8u, 12u, 8u, 12u, 8u, 12u, 8u};
+  static uint16_t permutexIdxTable20u[32] = {0u,  1u,  2u,  3u,  4u,  0x0, 0x0, 0x0, 5u,  6u,  7u,
+                                             8u,  9u,  0x0, 0x0, 0x0, 10u, 11u, 12u, 13u, 14u, 0x0,
+                                             0x0, 0x0, 15u, 16u, 17u, 18u, 19u, 0x0, 0x0, 0x0};
+
+  // ------------------------------------ 21u -----------------------------------------
+  static uint32_t permutexIdxTable21u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 3u, 4u,
+                                               5u, 6u, 6u, 7u, 7u, 8u, 9u, 10u};
+  static uint32_t permutexIdxTable21u_1[16] = {0u, 1u, 1u, 2u, 3u, 4u, 4u, 5u,
+                                               5u, 6u, 7u, 8u, 8u, 9u, 9u, 10u};
+  static uint64_t shiftTable21u_0[8] = {0u, 10u, 20u, 30u, 8u, 18u, 28u, 6u};
+  static uint64_t shiftTable21u_1[8] = {11u, 1u, 23u, 13u, 3u, 25u, 15u, 5u};
+
+  static uint8_t shuffleIdxTable21u_0[64] = {
+      3u,  2u, 1u, 0u, 5u, 4u, 3u,  2u,  8u, 7u, 6u, 5u, 10u, 9u, 8u, 7u, 3u,  2u,  1u, 0u, 6u, 5u,
+      4u,  3u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u, 3u, 2u, 1u,  0u, 5u, 4u, 3u,  2u,  8u, 7u, 6u, 5u,
+      10u, 9u, 8u, 7u, 3u, 2u, 1u,  0u,  6u, 5u, 4u, 3u, 8u,  7u, 6u, 5u, 11u, 10u, 9u, 8u};
+  static uint32_t shiftTable21u_2[16] = {11u, 6u, 9u, 4u, 7u, 10u, 5u, 8u,
+                                         11u, 6u, 9u, 4u, 7u, 10u, 5u, 8u};
+  static uint64_t gatherIdxTable21u[8] = {0u, 8u, 10u, 18u, 21u, 29u, 31u, 39u};
+
+  // ------------------------------------ 22u -----------------------------------------
+  static uint32_t permutexIdxTable22u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 4u, 5u,
+                                               5u, 6u, 6u, 7u, 8u, 9u, 9u, 10u};
+  static uint32_t permutexIdxTable22u_1[16] = {0u, 1u, 2u, 3u, 3u, 4u, 4u,  5u,
+                                               6u, 7u, 7u, 8u, 8u, 9u, 10u, 11u};
+  static uint64_t shiftTable22u_0[8] = {0u, 12u, 24u, 4u, 16u, 28u, 8u, 20u};
+  static uint64_t shiftTable22u_1[8] = {10u, 30u, 18u, 6u, 26u, 14u, 2u, 22u};
+
+  static uint8_t shuffleIdxTable22u_0[64] = {
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u, 8u};
+  static uint32_t shiftTable22u_2[16] = {10u, 4u, 6u, 8u, 10u, 4u, 6u, 8u,
+                                         10u, 4u, 6u, 8u, 10u, 4u, 6u, 8u};
+  static uint64_t gatherIdxTable22u[8] = {0u, 8u, 11u, 19u, 22u, 30u, 33u, 41u};
+
+  // ------------------------------------ 23u -----------------------------------------
+  static uint32_t permutexIdxTable23u_0[16] = {0u, 1u, 1u, 2u, 2u, 3u, 4u,  5u,
+                                               5u, 6u, 7u, 8u, 8u, 9u, 10u, 11u};
+  static uint32_t permutexIdxTable23u_1[16] = {0u, 1u, 2u, 3u, 3u, 4u,  5u,  6u,
+                                               6u, 7u, 7u, 8u, 9u, 10u, 10u, 11u};
+  static uint64_t shiftTable23u_0[8] = {0u, 14u, 28u, 10u, 24u, 6u, 20u, 2u};
+  static uint64_t shiftTable23u_1[8] = {9u, 27u, 13u, 31u, 17u, 3u, 21u, 7u};
+
+  static uint8_t shuffleIdxTable23u_0[64] = {
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u,  8u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 5u, 4u, 3u, 2u, 8u, 7u, 6u, 5u, 11u, 10u, 9u,  8u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u};
+  static uint32_t shiftTable23u_2[16] = {9u, 2u, 3u, 4u, 5u, 6u, 7u, 8u,
+                                         9u, 2u, 3u, 4u, 5u, 6u, 7u, 8u};
+  static uint64_t gatherIdxTable23u[8] = {0u, 8u, 11u, 19u, 23u, 31u, 34u, 42u};
+
+  // ------------------------------------ 24u -----------------------------------------
+  static uint8_t shuffleIdxTable24u_0[64] = {
+      2u, 1u, 0u, 0xFF, 5u, 4u, 3u, 0xFF, 8u, 7u, 6u, 0xFF, 11u, 10u, 9u, 0xFF,
+      2u, 1u, 0u, 0xFF, 5u, 4u, 3u, 0xFF, 8u, 7u, 6u, 0xFF, 11u, 10u, 9u, 0xFF,
+      2u, 1u, 0u, 0xFF, 5u, 4u, 3u, 0xFF, 8u, 7u, 6u, 0xFF, 11u, 10u, 9u, 0xFF,
+      2u, 1u, 0u, 0xFF, 5u, 4u, 3u, 0xFF, 8u, 7u, 6u, 0xFF, 11u, 10u, 9u, 0xFF};
+  static uint32_t permutexIdxTable24u[16] = {0u, 1u, 2u, 0x0, 3u, 4u,  5u,  0x0,
+                                             6u, 7u, 8u, 0x0, 9u, 10u, 11u, 0x0};
+
+  // ------------------------------------ 26u -----------------------------------------
+  static uint32_t permutexIdxTable26u_0[16] = {0u, 1u, 1u, 2u, 3u, 4u,  4u,  5u,
+                                               6u, 7u, 8u, 9u, 9u, 10u, 11u, 12u};
+  static uint32_t permutexIdxTable26u_1[16] = {0u, 1u, 2u, 3u, 4u,  5u,  5u,  6u,
+                                               7u, 8u, 8u, 9u, 10u, 11u, 12u, 13u};
+  static uint64_t shiftTable26u_0[8] = {0u, 20u, 8u, 28u, 16u, 4u, 24u, 12u};
+  static uint64_t shiftTable26u_1[8] = {6u, 18u, 30u, 10u, 22u, 2u, 14u, 26u};
+
+  static uint8_t shuffleIdxTable26u_0[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 9u, 8u, 7u, 6u, 12u, 11u, 10u, 9u};
+  static uint32_t shiftTable26u_2[16] = {6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u,
+                                         6u, 4u, 2u, 0u, 6u, 4u, 2u, 0u};
+  static uint64_t gatherIdxTable26u[8] = {0u, 8u, 13u, 21u, 26u, 34u, 39u, 47u};
+
+  // ------------------------------------ 28u -----------------------------------------
+  static uint8_t shuffleIdxTable28u_0[64] = {
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u,
+      3u, 2u, 1u, 0u, 6u, 5u, 4u, 3u, 10u, 9u, 8u, 7u, 13u, 12u, 11u, 10u};
+  static uint32_t shiftTable28u[16] = {4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u,
+                                       4u, 0u, 4u, 0u, 4u, 0u, 4u, 0u};
+  static uint16_t permutexIdxTable28u[32] = {0u,  1u,  2u,  3u,  4u,  5u,  6u,  0x0, 7u,  8u,  9u,
+                                             10u, 11u, 12u, 13u, 0x0, 14u, 15u, 16u, 17u, 18u, 19u,
+                                             20u, 0x0, 21u, 22u, 23u, 24u, 25u, 26u, 27u, 0x0};
+
+  // ------------------------------------ 30u -----------------------------------------
+  static uint32_t permutexIdxTable30u_0[16] = {0u, 1u, 1u, 2u,  3u,  4u,  5u,  6u,
+                                               7u, 8u, 9u, 10u, 11u, 12u, 13u, 14u};
+  static uint32_t permutexIdxTable30u_1[16] = {0u, 1u, 2u,  3u,  4u,  5u,  6u,  7u,
+                                               8u, 9u, 10u, 11u, 12u, 13u, 14u, 15u};
+  static uint64_t shiftTable30u_0[8] = {0u, 28u, 24u, 20u, 16u, 12u, 8u, 4u};
+  static uint64_t shiftTable30u_1[8] = {2u, 6u, 10u, 14u, 18u, 22u, 26u, 30u};
+
+  static uint8_t shuffleIdxTable30u_0[64] = {
+      0u, 0u, 0u, 4u, 3u, 2u, 1u, 0u, 0u, 0u, 0u, 11u, 10u, 9u, 8u, 7u,
+      0u, 0u, 0u, 4u, 3u, 2u, 1u, 0u, 0u, 0u, 0u, 11u, 10u, 9u, 8u, 7u,
+      0u, 0u, 0u, 4u, 3u, 2u, 1u, 0u, 0u, 0u, 0u, 11u, 10u, 9u, 8u, 7u,
+      0u, 0u, 0u, 4u, 3u, 2u, 1u, 0u, 0u, 0u, 0u, 11u, 10u, 9u, 8u, 7u};
+  static uint8_t shuffleIdxTable30u_1[64] = {
+      7u, 6u, 5u, 4u, 3u, 0u, 0u, 0u, 15u, 14u, 13u, 12u, 11u, 0u, 0u, 0u,
+      7u, 6u, 5u, 4u, 3u, 0u, 0u, 0u, 15u, 14u, 13u, 12u, 11u, 0u, 0u, 0u,
+      7u, 6u, 5u, 4u, 3u, 0u, 0u, 0u, 15u, 14u, 13u, 12u, 11u, 0u, 0u, 0u,
+      7u, 6u, 5u, 4u, 3u, 0u, 0u, 0u, 15u, 14u, 13u, 12u, 11u, 0u, 0u, 0u};
+  static uint64_t shiftTable30u_2[8] = {34u, 30u, 34u, 30u, 34u, 30u, 34u, 30u};
+  static uint64_t shiftTable30u_3[8] = {28u, 24u, 28u, 24u, 28u, 24u, 28u, 24u};
+  static uint64_t gatherIdxTable30u[8] = {0u, 8u, 15u, 23u, 30u, 38u, 45u, 53u};
+
+  static uint64_t nibbleReverseTable[8] = {
+      0x0E060A020C040800, 0x0F070B030D050901, 0x0E060A020C040800, 0x0F070B030D050901,
+      0x0E060A020C040800, 0x0F070B030D050901, 0x0E060A020C040800, 0x0F070B030D050901};
+
+  static uint64_t reverseMaskTable1u[8] = {
+      0x0001020304050607, 0x08090A0B0C0D0E0F, 0x1011121314151617, 0x18191A1B1C1D1E1F,
+      0x2021222324252627, 0x28292A2B2C2D2E2F, 0x3031323334353637, 0x38393A3B3C3D3E3F};
+
+  static uint64_t reverseMaskTable16u[8] = {
+      0x0607040502030001, 0x0E0F0C0D0A0B0809, 0x1617141512131011, 0x1E1F1C1D1A1B1819,
+      0x2627242522232021, 0x2E2F2C2D2A2B2829, 0x3637343532333031, 0x3E3F3C3D3A3B3839};
+
+  static uint64_t reverseMaskTable32u[8] = {
+      0x0405060700010203, 0x0C0D0E0F08090A0B, 0x1415161710111213, 0x1C1D1E1F18191A1B,
+      0x2425262720212223, 0x2C2D2E2F28292A2B, 0x3435363730313233, 0x3C3D3E3F38393A3B};
+
+  uint32_t getAlign(uint32_t start_bit, uint32_t base, uint32_t bitsize) {

Review Comment:
   Done
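
For readers following the header quoted above, a scalar reference can make the purpose of the helper macros and the per-width lookup tables easier to see. The sketch below is not code from the PR: `bitsToBytes`, `lowBitMask`, and `unpackBigEndian` are hypothetical names, the first two mirror `ORC_VECTOR_BITS_2_BYTE` and `ORC_VECTOR_BIT_MASK`, and the unpacker assumes values are packed most significant bit first. The vectorized shuffle and shift tables are presumably there to produce the same output for many values at once.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirrors ORC_VECTOR_BITS_2_BYTE: round a bit count up to whole bytes.
constexpr uint32_t bitsToBytes(uint32_t bits) {
  return (bits + 7u) >> 3u;
}

// Mirrors ORC_VECTOR_BIT_MASK: the low `bits` bits set (valid for bits < 64).
constexpr uint64_t lowBitMask(uint32_t bits) {
  return (1ULL << bits) - 1u;
}

// Hypothetical scalar reference (an assumption, not the PR's implementation):
// read `count` values of `bitWidth` bits each, most significant bit first.
// The caller must supply at least bitsToBytes(count * bitWidth) bytes.
std::vector<uint64_t> unpackBigEndian(const uint8_t* src, size_t count, uint32_t bitWidth) {
  std::vector<uint64_t> out;
  out.reserve(count);
  uint64_t bitPos = 0;  // absolute bit offset into src
  for (size_t i = 0; i < count; ++i) {
    uint64_t value = 0;
    for (uint32_t b = 0; b < bitWidth; ++b, ++bitPos) {
      const uint8_t byte = src[bitPos >> 3];
      // Bit 7 of each byte is consumed first.
      const uint64_t bit = (byte >> (7u - (bitPos & 7u))) & 1u;
      value = (value << 1) | bit;
    }
    out.push_back(value);
  }
  return out;
}
```

As a usage example, the two bytes `0x29, 0xC0` hold the bit string `001 010 011 100` followed by padding, so unpacking four 3-bit values yields 1, 2, 3, 4. A bit-at-a-time loop like this is only useful as a correctness oracle when testing a vectorized path, not as a performance baseline.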



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1484662528

   > > The CI test failed because the machine doesn't support AVX512. Maybe we should run these CI SIMD tests on AVX512 machines. https://github.com/apache/orc/actions/runs/4528477658/jobs/7975338899?pr=1375#step:3:41
   > 
   > Could we make it robust? This is likely to happen again in the future, which may disrupt code review.
   
   I have changed the message status from fatal_error to warning for the case where AVX512 is required but the compiler doesn't support it at compile time. ORC is then built without AVX512, and the CI SIMD tests run without AVX512.
   May I have your opinion about this change?
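
The fallback behaviour described in this comment can be sketched in C++ as follows. This is an illustration under stated assumptions, not the PR's dispatch code: `UnpackPath`, `kBuiltWithAvx512`, and `choosePath` are hypothetical names. `ORC_HAVE_RUNTIME_AVX512` does appear in the PR's `CpuInfoUtil.cc`, but treating it as the "built with AVX512" switch is an assumption made here.

```cpp
// Hypothetical names for illustration only; not taken from the ORC sources.
enum class UnpackPath { Scalar, Avx512 };

// Assumption: the build defines ORC_HAVE_RUNTIME_AVX512 only when the
// compiler can emit AVX512 code. When the build merely warns and continues,
// the macro stays undefined and this flag is false.
#if defined(ORC_HAVE_RUNTIME_AVX512)
constexpr bool kBuiltWithAvx512 = true;
#else
constexpr bool kBuiltWithAvx512 = false;
#endif

// The vector path is taken only if it was compiled in AND the CPU running
// the binary reports AVX512 support; otherwise the scalar path is used.
constexpr UnpackPath choosePath(bool builtWithAvx512, bool cpuHasAvx512) {
  return (builtWithAvx512 && cpuHasAvx512) ? UnpackPath::Avx512 : UnpackPath::Scalar;
}
```

With a scheme like this, a CI machine without AVX512 still builds and runs the same tests, and they exercise the scalar path, for example via `choosePath(kBuiltWithAvx512, false)`. The trade-off is that such a run no longer covers the vector code, so a warning in the build log is the only sign that AVX512 coverage was skipped.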



[GitHub] [orc] wgtmac commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1148715458


##########
c++/src/CpuInfoUtil.cc:
##########
@@ -0,0 +1,545 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * @file CpuInfoUtil.cc is from Apache Arrow as of 2023-03-21
+ */
+
+#include "CpuInfoUtil.hh"
+
+#ifdef __APPLE__
+#include <sys/sysctl.h>
+#endif
+
+#ifndef _MSC_VER
+#include <unistd.h>
+#endif
+
+#ifdef _WIN32
+#define NOMINMAX
+#include <Windows.h>
+#include <intrin.h>
+#endif
+
+#include <algorithm>
+#include <array>
+#include <bitset>
+#include <cstdint>
+#include <fstream>
+#include <optional>
+#include <sstream>
+#include <string>
+#include <thread>
+#include <vector>
+
+#include "orc/Exceptions.hh"
+
+#undef CPUINFO_ARCH_X86
+
+#if defined(__i386) || defined(_M_IX86) || defined(__x86_64__) || defined(_M_X64)
+#define CPUINFO_ARCH_X86
+#endif
+
+#ifndef ORC_HAVE_RUNTIME_AVX512
+#define UNUSED(x) (void)(x)
+#endif
+
+namespace orc {
+
+  namespace {
+
+    constexpr int kCacheLevels = static_cast<int>(CpuInfo::CacheLevel::Last) + 1;
+
+    //============================== OS Dependent ==============================//
+
+#if defined(_WIN32)
+    //------------------------------ WINDOWS ------------------------------//
+    void OsRetrieveCacheSize(std::array<int64_t, kCacheLevels>* cache_sizes) {
+      PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer = nullptr;
+      PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer_position = nullptr;
+      DWORD buffer_size = 0;
+      size_t offset = 0;
+      typedef BOOL(WINAPI * GetLogicalProcessorInformationFuncPointer)(void*, void*);
+      GetLogicalProcessorInformationFuncPointer func_pointer =
+          (GetLogicalProcessorInformationFuncPointer)GetProcAddress(
+              GetModuleHandle("kernel32"), "GetLogicalProcessorInformation");
+
+      if (!func_pointer) {
+        throw ParseError("Failed to find procedure GetLogicalProcessorInformation");
+      }
+
+      // Get buffer size
+      if (func_pointer(buffer, &buffer_size) && GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
+        throw ParseError("Failed to get size of processor information buffer");
+      }
+
+      buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)malloc(buffer_size);
+      if (!buffer) {
+        return;
+      }
+
+      if (!func_pointer(buffer, &buffer_size)) {
+        free(buffer);
+        throw ParseError("Failed to get processor information");
+      }
+
+      buffer_position = buffer;
+      while (offset + sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION) <= buffer_size) {
+        if (RelationCache == buffer_position->Relationship) {
+          PCACHE_DESCRIPTOR cache = &buffer_position->Cache;
+          if (cache->Level >= 1 && cache->Level <= kCacheLevels) {
+            const int64_t current = (*cache_sizes)[cache->Level - 1];
+            (*cache_sizes)[cache->Level - 1] = std::max<int64_t>(current, cache->Size);
+          }
+        }
+        offset += sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
+        buffer_position++;
+      }
+
+      free(buffer);
+    }
+
+#if defined(CPUINFO_ARCH_X86)
+    // On x86, get CPU features by cpuid, https://en.wikipedia.org/wiki/CPUID
+
+#if defined(__MINGW64_VERSION_MAJOR) && __MINGW64_VERSION_MAJOR < 5
+    void __cpuidex(int CPUInfo[4], int function_id, int subfunction_id) {
+      __asm__ __volatile__("cpuid"
+                           : "=a"(CPUInfo[0]), "=b"(CPUInfo[1]), "=c"(CPUInfo[2]), "=d"(CPUInfo[3])
+                           : "a"(function_id), "c"(subfunction_id));
+    }
+
+    int64_t _xgetbv(int xcr) {
+      int out = 0;
+      __asm__ __volatile__("xgetbv" : "=a"(out) : "c"(xcr) : "%edx");
+      return out;
+    }
+#endif  // MINGW
+
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      int register_EAX_id = 1;
+      int highest_valid_id = 0;
+      int highest_extended_valid_id = 0;
+      std::bitset<32> features_ECX;
+      std::array<int, 4> cpu_info;
+
+      // Get highest valid id
+      __cpuid(cpu_info.data(), 0);
+      highest_valid_id = cpu_info[0];
+      // HEX of "GenuineIntel": 47656E75 696E6549 6E74656C
+      // HEX of "AuthenticAMD": 41757468 656E7469 63414D44
+      if (cpu_info[1] == 0x756e6547 && cpu_info[3] == 0x49656e69 && cpu_info[2] == 0x6c65746e) {
+        *vendor = CpuInfo::Vendor::Intel;
+      } else if (cpu_info[1] == 0x68747541 && cpu_info[3] == 0x69746e65 &&
+                 cpu_info[2] == 0x444d4163) {
+        *vendor = CpuInfo::Vendor::AMD;
+      }
+
+      if (highest_valid_id <= register_EAX_id) {
+        return;
+      }
+
+      // EAX=1: Processor Info and Feature Bits
+      __cpuidex(cpu_info.data(), register_EAX_id, 0);
+      features_ECX = cpu_info[2];
+
+      // Get highest extended id
+      __cpuid(cpu_info.data(), 0x80000000);
+      highest_extended_valid_id = cpu_info[0];
+
+      // Retrieve CPU model name
+      if (highest_extended_valid_id >= static_cast<int>(0x80000004)) {
+        model_name->clear();
+        for (int i = 0x80000002; i <= static_cast<int>(0x80000004); ++i) {
+          __cpuidex(cpu_info.data(), i, 0);
+          *model_name += std::string(reinterpret_cast<char*>(cpu_info.data()), sizeof(cpu_info));
+        }
+      }
+
+      bool zmm_enabled = false;
+      if (features_ECX[27]) {  // OSXSAVE
+        // Query if the OS supports saving ZMM registers when switching contexts
+        int64_t xcr0 = _xgetbv(0);
+        zmm_enabled = (xcr0 & 0xE0) == 0xE0;
+      }
+
+      if (features_ECX[9]) *hardware_flags |= CpuInfo::SSSE3;
+      if (features_ECX[19]) *hardware_flags |= CpuInfo::SSE4_1;
+      if (features_ECX[20]) *hardware_flags |= CpuInfo::SSE4_2;
+      if (features_ECX[23]) *hardware_flags |= CpuInfo::POPCNT;
+      if (features_ECX[28]) *hardware_flags |= CpuInfo::AVX;
+
+      // cpuid with EAX=7, ECX=0: Extended Features
+      register_EAX_id = 7;
+      if (highest_valid_id > register_EAX_id) {
+        __cpuidex(cpu_info.data(), register_EAX_id, 0);
+        std::bitset<32> features_EBX = cpu_info[1];
+
+        if (features_EBX[3]) *hardware_flags |= CpuInfo::BMI1;
+        if (features_EBX[5]) *hardware_flags |= CpuInfo::AVX2;
+        if (features_EBX[8]) *hardware_flags |= CpuInfo::BMI2;
+        if (zmm_enabled) {
+          if (features_EBX[16]) *hardware_flags |= CpuInfo::AVX512F;
+          if (features_EBX[17]) *hardware_flags |= CpuInfo::AVX512DQ;
+          if (features_EBX[28]) *hardware_flags |= CpuInfo::AVX512CD;
+          if (features_EBX[30]) *hardware_flags |= CpuInfo::AVX512BW;
+          if (features_EBX[31]) *hardware_flags |= CpuInfo::AVX512VL;
+        }
+      }
+    }
+#endif
+
+#elif defined(__APPLE__)
+    //------------------------------ MACOS ------------------------------//
+    std::optional<int64_t> IntegerSysCtlByName(const char* name) {
+      size_t len = sizeof(int64_t);
+      int64_t data = 0;
+      if (sysctlbyname(name, &data, &len, nullptr, 0) == 0) {
+        return data;
+      }
+      // ENOENT is the official errno value for non-existing sysctl's,
+      // but EINVAL and ENOTSUP have been seen in the wild.
+      if (errno != ENOENT && errno != EINVAL && errno != ENOTSUP) {
+        std::ostringstream ss;
+        ss << "sysctlbyname failed for '" << name << "'";
+        throw ParseError(ss.str());
+      }
+      return std::nullopt;
+    }
+
+    void OsRetrieveCacheSize(std::array<int64_t, kCacheLevels>* cache_sizes) {
+      static_assert(kCacheLevels >= 3, "");
+      auto c = IntegerSysCtlByName("hw.l1dcachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[0] = *c;
+      }
+      c = IntegerSysCtlByName("hw.l2cachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[1] = *c;
+      }
+      c = IntegerSysCtlByName("hw.l3cachesize");
+      if (c.has_value()) {
+        (*cache_sizes)[2] = *c;
+      }
+    }
+
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      // hardware_flags
+      struct SysCtlCpuFeature {
+        const char* name;
+        int64_t flag;
+      };
+      std::vector<SysCtlCpuFeature> features = {
+#if defined(CPUINFO_ARCH_X86)
+        {"hw.optional.sse4_2",
+         CpuInfo::SSSE3 | CpuInfo::SSE4_1 | CpuInfo::SSE4_2 | CpuInfo::POPCNT},
+        {"hw.optional.avx1_0", CpuInfo::AVX},
+        {"hw.optional.avx2_0", CpuInfo::AVX2},
+        {"hw.optional.bmi1", CpuInfo::BMI1},
+        {"hw.optional.bmi2", CpuInfo::BMI2},
+        {"hw.optional.avx512f", CpuInfo::AVX512F},
+        {"hw.optional.avx512cd", CpuInfo::AVX512CD},
+        {"hw.optional.avx512dq", CpuInfo::AVX512DQ},
+        {"hw.optional.avx512bw", CpuInfo::AVX512BW},
+        {"hw.optional.avx512vl", CpuInfo::AVX512VL},
+#endif
+      };
+      for (const auto& feature : features) {
+        auto v = IntegerSysCtlByName(feature.name);
+        if (v.value_or(0)) {
+          *hardware_flags |= feature.flag;
+        }
+      }
+
+      // TODO: vendor, model_name
+      *vendor = CpuInfo::Vendor::Unknown;
+      *model_name = "Unknown";
+    }
+
+#else
+    //------------------------------ LINUX ------------------------------//
+    // Get cache size, return 0 on error
+    int64_t LinuxGetCacheSize(int level) {
+      // get cache size by sysconf()
+#ifdef _SC_LEVEL1_DCACHE_SIZE
+      const int kCacheSizeConf[] = {
+          _SC_LEVEL1_DCACHE_SIZE,
+          _SC_LEVEL2_CACHE_SIZE,
+          _SC_LEVEL3_CACHE_SIZE,
+      };
+      static_assert(sizeof(kCacheSizeConf) / sizeof(kCacheSizeConf[0]) == kCacheLevels, "");
+
+      errno = 0;
+      const int64_t cache_size = sysconf(kCacheSizeConf[level]);
+      if (errno == 0 && cache_size > 0) {
+        return cache_size;
+      }
+#endif
+
+      // get cache size from sysfs if sysconf() fails or not supported
+      const char* kCacheSizeSysfs[] = {
+          "/sys/devices/system/cpu/cpu0/cache/index0/size",  // l1d (index1 is l1i)
+          "/sys/devices/system/cpu/cpu0/cache/index2/size",  // l2
+          "/sys/devices/system/cpu/cpu0/cache/index3/size",  // l3
+      };
+      static_assert(sizeof(kCacheSizeSysfs) / sizeof(kCacheSizeSysfs[0]) == kCacheLevels, "");
+
+      std::ifstream cacheinfo(kCacheSizeSysfs[level], std::ios::in);
+      if (!cacheinfo) {
+        return 0;
+      }
+      // cacheinfo is one line like: 65536, 64K, 1M, etc.
+      uint64_t size = 0;
+      char unit = '\0';
+      cacheinfo >> size >> unit;
+      if (unit == 'K') {
+        size <<= 10;
+      } else if (unit == 'M') {
+        size <<= 20;
+      } else if (unit == 'G') {
+        size <<= 30;
+      } else if (unit != '\0') {
+        return 0;
+      }
+      return static_cast<int64_t>(size);
+    }
+
+    // Helper function to parse for hardware flags from /proc/cpuinfo
+    // values contains a list of space-separated flags.  check to see if the flags we
+    // care about are present.
+    // Returns a bitmap of flags.
+    int64_t LinuxParseCpuFlags(const std::string& values) {
+      const struct {
+        std::string name;
+        int64_t flag;
+      } flag_mappings[] = {
+#if defined(CPUINFO_ARCH_X86)
+        {"ssse3", CpuInfo::SSSE3},
+        {"sse4_1", CpuInfo::SSE4_1},
+        {"sse4_2", CpuInfo::SSE4_2},
+        {"popcnt", CpuInfo::POPCNT},
+        {"avx", CpuInfo::AVX},
+        {"avx2", CpuInfo::AVX2},
+        {"avx512f", CpuInfo::AVX512F},
+        {"avx512cd", CpuInfo::AVX512CD},
+        {"avx512vl", CpuInfo::AVX512VL},
+        {"avx512dq", CpuInfo::AVX512DQ},
+        {"avx512bw", CpuInfo::AVX512BW},
+        {"bmi1", CpuInfo::BMI1},
+        {"bmi2", CpuInfo::BMI2},
+#endif
+      };
+      const int64_t num_flags = sizeof(flag_mappings) / sizeof(flag_mappings[0]);
+
+      int64_t flags = 0;
+      for (int i = 0; i < num_flags; ++i) {
+        if (values.find(flag_mappings[i].name) != std::string::npos) {
+          flags |= flag_mappings[i].flag;
+        }
+      }
+      return flags;
+    }
+
+    void OsRetrieveCacheSize(std::array<int64_t, kCacheLevels>* cache_sizes) {
+      for (int i = 0; i < kCacheLevels; ++i) {
+        const int64_t cache_size = LinuxGetCacheSize(i);
+        if (cache_size > 0) {
+          (*cache_sizes)[i] = cache_size;
+        }
+      }
+    }
+
+    static constexpr bool IsWhitespace(char c) {
+      return c == ' ' || c == '\t';
+    }
+
+    std::string TrimString(std::string value) {
+      size_t ltrim_chars = 0;
+      while (ltrim_chars < value.size() && IsWhitespace(value[ltrim_chars])) {
+        ++ltrim_chars;
+      }
+      value.erase(0, ltrim_chars);
+      size_t rtrim_chars = 0;
+      while (rtrim_chars < value.size() && IsWhitespace(value[value.size() - 1 - rtrim_chars])) {
+        ++rtrim_chars;
+      }
+      value.erase(value.size() - rtrim_chars, rtrim_chars);
+      return value;
+    }
+
+    // Read from /proc/cpuinfo
+    void OsRetrieveCpuInfo(int64_t* hardware_flags, CpuInfo::Vendor* vendor,
+                           std::string* model_name) {
+      std::ifstream cpuinfo("/proc/cpuinfo", std::ios::in);
+      while (cpuinfo) {
+        std::string line;
+        std::getline(cpuinfo, line);
+        const size_t colon = line.find(':');
+        if (colon != std::string::npos) {
+          const std::string name = TrimString(line.substr(0, colon - 1));
+          const std::string value = TrimString(line.substr(colon + 1, std::string::npos));
+          if (name.compare("flags") == 0 || name.compare("Features") == 0) {
+            *hardware_flags |= LinuxParseCpuFlags(value);
+          } else if (name.compare("model name") == 0) {
+            *model_name = value;
+          } else if (name.compare("vendor_id") == 0) {
+            if (value.compare("GenuineIntel") == 0) {
+              *vendor = CpuInfo::Vendor::Intel;
+            } else if (value.compare("AuthenticAMD") == 0) {
+              *vendor = CpuInfo::Vendor::AMD;
+            }
+          }
+        }
+      }
+    }
+#endif  // WINDOWS, MACOS, LINUX
+
+    //============================== Arch Dependent ==============================//
+
+#if defined(CPUINFO_ARCH_X86)
+    //------------------------------ X86_64 ------------------------------//
+    bool ArchParseUserSimdLevel(const std::string& simd_level, int64_t* hardware_flags) {
+      enum {
+        USER_SIMD_NONE,
+        USER_SIMD_AVX512,
+        USER_SIMD_MAX,
+      };
+
+      int level = USER_SIMD_MAX;
+      // Parse the level
+      if (simd_level == "AVX512") {
+        level = USER_SIMD_AVX512;
+      } else if (simd_level == "NONE") {
+        level = USER_SIMD_NONE;
+      } else {
+        return false;
+      }
+
+      // Disable feature as the level
+      if (level < USER_SIMD_AVX512) {
+        *hardware_flags &= ~CpuInfo::AVX512;
+      }
+      return true;
+    }
+
+    void ArchVerifyCpuRequirements(const CpuInfo* ci) {
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+      if (!ci->isDetected(CpuInfo::AVX512)) {
+        throw ParseError("CPU does not support the Supplemental AVX512 instruction set");
+      }
+#else
+      UNUSED(ci);
+#endif
+    }
+
+#endif  // X86
+
+  }  // namespace
+
+  struct CpuInfo::Impl {
+    int64_t hardware_flags = 0;
+    int numCores = 0;
+    int64_t original_hardware_flags = 0;
+    Vendor vendor = Vendor::Unknown;
+    std::string model_name = "Unknown";
+    std::array<int64_t, kCacheLevels> cache_sizes{};
+
+    Impl() {
+      OsRetrieveCacheSize(&cache_sizes);
+      OsRetrieveCpuInfo(&hardware_flags, &vendor, &model_name);
+      original_hardware_flags = hardware_flags;
+      numCores = std::max(static_cast<int>(std::thread::hardware_concurrency()), 1);
+
+      // parse user simd level
+      const auto maybe_env_var = std::getenv("ORC_USER_SIMD_LEVEL");

Review Comment:
   I would rather add a new dev/BUILD_WITH_SIMD.md, or directly modify the root README.md, to document the steps for enabling the AVX512 build.
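
   For readers following the diff, the line this comment is attached to reads the ORC_USER_SIMD_LEVEL environment variable. A minimal sketch of how such an override could be applied to the detected flags is shown here, modelled on ArchParseUserSimdLevel in the diff above. The flag constants and the helper names applyUserSimdLevel and effectiveFlags are hypothetical and chosen for illustration only; the real bit values are defined in CpuInfoUtil.hh.

   ```cpp
   #include <cstdint>
   #include <cstdlib>
   #include <string>

   // Hypothetical flag bits for illustration; not the values used by ORC.
   constexpr int64_t kAvx512F = 1 << 0;
   constexpr int64_t kAvx512BW = 1 << 1;
   constexpr int64_t kAvx512Mask = kAvx512F | kAvx512BW;

   // Applies a user-requested SIMD level to the detected hardware flags.
   // Returns false and leaves the flags untouched for an unrecognized level.
   bool applyUserSimdLevel(const std::string& level, int64_t* flags) {
     if (level == "AVX512") {
       return true;  // keep whatever was detected
     }
     if (level == "NONE") {
       *flags &= ~kAvx512Mask;  // clear the AVX-512 bits, forcing the scalar path
       return true;
     }
     return false;
   }

   // Reads ORC_USER_SIMD_LEVEL and applies it; an unset variable changes nothing.
   int64_t effectiveFlags(int64_t detected) {
     const char* env = std::getenv("ORC_USER_SIMD_LEVEL");
     if (env != nullptr) {
       applyUserSimdLevel(env, &detected);
     }
     return detected;
   }
   ```

   Under this sketch, running a binary with ORC_USER_SIMD_LEVEL=NONE would clear the AVX-512 bits even on a machine that supports them, which is the kind of step a BUILD_WITH_SIMD.md could document.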




[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1484460330

   CI failed because the machine does not support AVX512. It would be better to run these CI SIMD tests on machines with AVX512.
   https://github.com/apache/orc/actions/runs/4528477658/jobs/7975338899?pr=1375#step:3:41
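
   One way to keep such a failure from crashing the tests is to choose the code path at runtime rather than assuming the build machine matches the machine that runs the binary. The sketch below is an assumption-laden illustration and not the code in this PR: it uses the GCC/Clang builtin __builtin_cpu_supports, and the names cpuHasAvx512f, sumScalar, sumVector and selectSum are hypothetical. The "vector" kernel is a placeholder that delegates to the scalar one so that the example stays portable.

   ```cpp
   #include <cstddef>
   #include <cstdint>

   // True when the running CPU reports AVX-512F. Always false on non-x86
   // targets or on compilers without the GCC/Clang builtin.
   bool cpuHasAvx512f() {
   #if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
     return __builtin_cpu_supports("avx512f") != 0;
   #else
     return false;
   #endif
   }

   using SumFn = int64_t (*)(const int32_t*, size_t);

   // Portable fallback that works on every machine.
   int64_t sumScalar(const int32_t* values, size_t count) {
     int64_t total = 0;
     for (size_t i = 0; i < count; ++i) {
       total += values[i];
     }
     return total;
   }

   // Placeholder for a vectorized kernel; a real one would use AVX-512
   // intrinsics and live in a translation unit built with the AVX-512 flags.
   int64_t sumVector(const int32_t* values, size_t count) {
     return sumScalar(values, count);
   }

   // Picks the kernel once, based on what the running CPU supports.
   SumFn selectSum() {
     return cpuHasAvx512f() ? sumVector : sumScalar;
   }
   ```

   With this pattern a test binary built with AVX512 enabled would fall back to the scalar path on a CI machine without AVX512 instead of hitting an illegal instruction.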



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1090335348


##########
c++/src/RleDecoderV2.cc:
##########
@@ -67,6 +91,147 @@ namespace orc {
   }
 
   void RleDecoderV2::readLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs) {
+    uint64_t startBit = 0;
+#if ENABLE_AVX512

Review Comment:
   Do you mean adding a new option to an ORC config file so that the feature can be switched "on" or "off" at runtime? If so, where is the ORC config file?




[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1522616787

   Hi @dongjoon-hyun, welcome back from vacation! Do you have any other comments? Thank you very much!



Re: [PR] ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode [orc]

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1755416934

   > @wpleonardo Do we have any performance benchmark about this PR? @alexey-milovidov Maybe you are interested in it.
   > 
   > I try to use this feature in clickhouse(https://github.com/clickHouse/ClickHouse), but can't see any performance improvement.
   > 
   > Q: `select * from file('/data1/clickhouse_official/data/user_files/test.orc') format Null;`
   > 
   > With AVX512:
   > 
   > ```
   > 0 rows in set. Elapsed: 3.659 sec. Processed 1.13 million rows, 486.19 MB (308.68 thousand rows/s., 132.88 MB/s.)
   > 0 rows in set. Elapsed: 3.653 sec. Processed 1.20 million rows, 517.87 MB (329.40 thousand rows/s., 141.76 MB/s.)
   > 0 rows in set. Elapsed: 3.719 sec. Processed 1.13 million rows, 486.19 MB (303.70 thousand rows/s., 130.74 MB/s.)
   > ```
   > 
   > Without AVX512
   > 
   > ```
   > 0 rows in set. Elapsed: 3.565 sec. Processed 1.13 million rows, 486.19 MB (316.81 thousand rows/s., 136.38 MB/s.)
   > 0 rows in set. Elapsed: 3.540 sec. Processed 1.20 million rows, 517.87 MB (339.91 thousand rows/s., 146.28 MB/s.)
   > 0 rows in set. Elapsed: 3.681 sec. Processed 1.20 million rows, 517.87 MB (326.90 thousand rows/s., 140.69 MB/s.)
   > ```
   > 
   > About the test orc file:
   > 
   > ```
   > $ du -sh test.orc                                                     
   > 505M	test.orc
   > 
   > 
   > $ orc-metadata ./test.orc                           
   > { "name": "./test.orc",
   >   "type": "struct<reporttime:bigint,appid:bigint,uid:bigint,platform:int,nettype:int,clientversioncode:bigint,sdkversioncode:bigint,statid:string,statversion:int,countrycode:string,language:string,model:string,osversion:string,channel:string,heartcount:int,msgcount:int,giftcount:int,barragecount:int,gid:string,entrytype:int,prefetchedms:int,linkdstate:int,networkavailable:int,starttimestamp:bigint,sessionlogints:int,medialogints:int,sdkboundts:int,msconnectedts:int,vsconnectedts:int,firstiframets:int,ownerstatus:int,stopreason:int,totaltime:int,cpuusageavg:int,memusageavg:int,backgroundtotal:bigint,foregroundtotal:bigint,firstvideopackts:int,firstvoicerecvts:int,firstvoiceplayts:int,firstiframeassemblets:int,uiinitts:int,uiloadedts:int,uiappearedts:int,setvideoviewts:int,blurviewdimissts:int,preparesdkinqueuets:int,preparesdkexects:int,startsdkinqueuets:int,startsdkexects:int,sdkjoinchannelinqueuets:int,sdkjoinchannelexects:int,lastsdkleavechannelinqueuets:int,lastsdkleavechannelexects:int,unused_1:int,unused_2:int,setvideoviewinqueuets:int,setvideoviewexects:int,livetype:int,audiostatus:int,firstiframesize:bigint,firstiframedecodetime:bigint,extras:bigint,entrancetype:int,entrancemode:int,mclientip:bigint,mnc:bigint,mcc:bigint,vsipsuccess:bigint,msipsuccess:bigint,vsipfail:bigint,msipfail:bigint,mediaflag:bigint,dispatchid:string,proxyflag:int,redirectcount:int,directorrescode:int,subentrancetab:string,logininfolist:array<struct<strategy:bigint,ip:bigint,loginStat:bigint,reserve1:bigint,reserve2:bigint>>,playcentertype:int,videomutetype:bigint,owneruid:bigint,extra:string>",
   >   "rows": 1203317,
   >   "stripe count": 12,
   >   "format": "0.12", "writer version": "future - 9",
   >   "compression": "snappy", "compression block": 65536,
   >   "file length": 529207118,
   >   "content": 529182229, "stripe stats": 21150, "footer": 3712, "postscript": 26,
   >   "row index stride": 10000,
   >   "user metadata": {
   >     "org.apache.spark.version": "3.3.2"
   >   },
   >   "stripes": [
   >     { "stripe": 0, "rows": 117760,
   >       "offset": 3, "length": 50876922,
   >       "index": 23728, "data": 50851823, "footer": 1371
   >     },
   >     { "stripe": 1, "rows": 117760,
   >       "offset": 50876925, "length": 50948680,
   >       "index": 23679, "data": 50923619, "footer": 1382
   >     },
   >     { "stripe": 2, "rows": 62050,
   >       "offset": 101825605, "length": 26902880,
   >       "index": 15322, "data": 26886211, "footer": 1347
   >     },
   >     { "stripe": 3, "rows": 117760,
   >       "offset": 128728485, "length": 50474083,
   >       "index": 24110, "data": 50448601, "footer": 1372
   >     },
   >     { "stripe": 4, "rows": 117760,
   >       "offset": 179202568, "length": 50413042,
   >       "index": 23858, "data": 50387825, "footer": 1359
   >     },
   >     { "stripe": 5, "rows": 63570,
   >       "offset": 229615610, "length": 27504277,
   >       "index": 14890, "data": 27488029, "footer": 1358
   >     },
   >     { "stripe": 6, "rows": 117760,
   >       "offset": 268435456, "length": 50981984,
   >       "index": 24191, "data": 50956424, "footer": 1369
   >     },
   >     { "stripe": 7, "rows": 117760,
   >       "offset": 319417440, "length": 51017894,
   >       "index": 23792, "data": 50992731, "footer": 1371
   >     },
   >     { "stripe": 8, "rows": 61720,
   >       "offset": 370435334, "length": 26840720,
   >       "index": 15246, "data": 26824109, "footer": 1365
   >     },
   >     { "stripe": 9, "rows": 117760,
   >       "offset": 397276054, "length": 49971095,
   >       "index": 23487, "data": 49946233, "footer": 1375
   >     },
   >     { "stripe": 10, "rows": 117760,
   >       "offset": 447247149, "length": 50259825,
   >       "index": 24090, "data": 50234369, "footer": 1366
   >     },
   >     { "stripe": 11, "rows": 73897,
   >       "offset": 497506974, "length": 31675255,
   >       "index": 16948, "data": 31656952, "footer": 1355
   >     }
   >   ]
   > }
   > ```
   
   Yes, we have a performance micro-benchmark for this PR. If the file uses the ORC default of aligned fixed bit widths, AVX512 bit-unpacking performs almost the same as the non-AVX512 path. If it uses non-aligned bit widths, AVX512 bit-unpacking is almost 6X faster than the non-AVX512 path, which brings it close to the performance of non-AVX512 decoding with aligned fixed bit widths.
   So it may be worth checking whether the ORC setting used in ClickHouse aligns the bit width or not.
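
   To illustrate the difference between aligned and non-aligned bit widths, here is a minimal scalar sketch of bit-unpacking. It is not the ORC implementation; the function name unpackBits is hypothetical, and it assumes values are packed most-significant-bit first with a width of 1 to 32 bits. With a width of 8 every value is exactly one input byte, so the loop degenerates to a byte copy. With a width such as 3, values straddle byte boundaries and every value costs a shift and a mask, which is the kind of per-value work a vectorized decoder can batch.

   ```cpp
   #include <cstddef>
   #include <cstdint>
   #include <vector>

   // Unpacks up to `count` values of `bitWidth` bits each (1..32) from bytes
   // packed most-significant-bit first. Stops early if the input runs out.
   std::vector<uint32_t> unpackBits(const std::vector<uint8_t>& in, uint32_t bitWidth,
                                    size_t count) {
     std::vector<uint32_t> out;
     out.reserve(count);
     uint64_t buffer = 0;        // accumulates input bytes; only the low bits are live
     uint32_t bitsInBuffer = 0;  // number of not-yet-consumed bits in `buffer`
     size_t pos = 0;
     const uint64_t mask = (1ULL << bitWidth) - 1;
     for (size_t i = 0; i < count; ++i) {
       // Refill one byte at a time until a whole value is available.
       while (bitsInBuffer < bitWidth && pos < in.size()) {
         buffer = (buffer << 8) | in[pos++];
         bitsInBuffer += 8;
       }
       if (bitsInBuffer < bitWidth) {
         break;  // not enough input left for another value
       }
       bitsInBuffer -= bitWidth;
       out.push_back(static_cast<uint32_t>((buffer >> bitsInBuffer) & mask));
     }
     return out;
   }
   ```

   For example, the three bytes 0x29 0xCB 0xB8 hold the eight 3-bit values 1, 2, 3, 4, 5, 6, 7, 0, while the same call with a width of 8 simply returns the input bytes as values.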
   
   



[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1518942257

   In cmake_modules/ConfigSimdLevel.cmake, I changed check_cxx_source_compiles to check_cxx_source_runs to make sure that an AVX512 program can actually run on the build machine.
   https://github.com/wpleonardo/orc/blob/d6fd57d1c81709d6412fd506301aeffde39a3db6/cmake_modules/ConfigSimdLevel.cmake#L57
   Please help me rerun the CI tests. Sorry for the repeated CI reruns.
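
   The difference matters because a compile-only check succeeds whenever the compiler accepts the intrinsics, while a run check also executes them on the build machine. A sketch of the kind of probe such a check could execute is shown below. It is an illustration under assumptions and not the program used in ConfigSimdLevel.cmake; the function name avx512Probe is hypothetical. The output buffer is 64 bytes because a 512-bit store writes 64 bytes.

   ```cpp
   #include <cstdint>
   #include <cstring>
   #if defined(__AVX512F__)
   #include <immintrin.h>
   #endif

   // Returns 1 if the AVX-512 store produced the expected lanes, 0 if it did
   // not, and -1 when this file was compiled without AVX-512F enabled.
   // On a CPU without AVX-512 the intrinsic path raises an illegal-instruction
   // fault, which is exactly what a run check is meant to detect.
   int avx512Probe() {
   #if defined(__AVX512F__)
     int32_t out[16];  // 16 lanes x 4 bytes = 64 bytes
     std::memset(out, 0, sizeof(out));
     __m512i ones = _mm512_set1_epi32(0x1);
     _mm512_storeu_si512(out, ones);
     for (int32_t lane : out) {
       if (lane != 1) {
         return 0;
       }
     }
     return 1;
   #else
     return -1;
   #endif
   }
   ```

   A run check would wrap a probe like this in main and treat a zero exit status as "AVX512 usable on this machine".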



[GitHub] [orc] wgtmac merged pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac merged PR #1375:
URL: https://github.com/apache/orc/pull/1375



[GitHub] [orc] wgtmac commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1413181094

   > > Do you know why this PR causes `ILLEGAL` failures?
   > > ```
   > > 75% tests passed, 2 tests failed out of 8
   > > 
   > > Total Test time (real) = 545.23 sec
   > > 
   > > The following tests FAILED:
   > > 	  1 - orc-test (ILLEGAL)
   > > 	  8 - tool-test (ILLEGAL)
   > > ```
   > 
   > All checks have already passed. Thanks.
   
   Please let me know when it is ready for review again. Thanks!



[GitHub] [orc] dongjoon-hyun commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1411446664

   Do you know why this PR causes `ILLEGAL` failures?
   ```
   75% tests passed, 2 tests failed out of 8
   
   Total Test time (real) = 545.23 sec
   
   The following tests FAILED:
   	  1 - orc-test (ILLEGAL)
   	  8 - tool-test (ILLEGAL)
   ```



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1107118931


##########
CMakeLists.txt:
##########
@@ -157,6 +172,102 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(FATAL_ERROR "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+    set(CXX_SUPPORTS_SSE4_2 TRUE)
+  else()
+    # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
+    set(ORC_AVX512_FLAG "-march=native -mbmi2")
+    set(ORC_AVX512_FLAG
+        "${ORC_AVX512_FLAG} -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mavx512vbmi")
+  endif()
+  check_cxx_compiler_flag(${ORC_AVX512_FLAG} CXX_SUPPORTS_AVX512)
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")
+    check_cxx_source_compiles("
+      #ifdef _MSC_VER
+      #include <intrin.h>
+      #else
+      #include <immintrin.h>
+      #endif
+
+      int main() {
+        __m512i mask = _mm512_set1_epi32(0x1);
+        char out[32];
+        _mm512_storeu_si512(out, mask);
+        return 0;
+      }"
+      CXX_SUPPORTS_AVX512)
+    set(CMAKE_REQUIRED_FLAGS ${OLD_CMAKE_REQURED_FLAGS})
+  endif()
+
+  message(STATUS "BUILD_ENABLE_AVX512=${BUILD_ENABLE_AVX512}")

Review Comment:
   Fixed.



##########
CMakeLists.txt:
##########
@@ -157,6 +172,102 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(FATAL_ERROR "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+    set(CXX_SUPPORTS_SSE4_2 TRUE)
+  else()
+    # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
+    set(ORC_AVX512_FLAG "-march=native -mbmi2")
+    set(ORC_AVX512_FLAG
+        "${ORC_AVX512_FLAG} -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mavx512vbmi")
+  endif()
+  check_cxx_compiler_flag(${ORC_AVX512_FLAG} CXX_SUPPORTS_AVX512)
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")
+    check_cxx_source_compiles("
+      #ifdef _MSC_VER
+      #include <intrin.h>
+      #else
+      #include <immintrin.h>
+      #endif
+
+      int main() {
+        __m512i mask = _mm512_set1_epi32(0x1);
+        char out[32];
+        _mm512_storeu_si512(out, mask);
+        return 0;
+      }"
+      CXX_SUPPORTS_AVX512)
+    set(CMAKE_REQUIRED_FLAGS ${OLD_CMAKE_REQURED_FLAGS})
+  endif()
+
+  message(STATUS "BUILD_ENABLE_AVX512=${BUILD_ENABLE_AVX512}")
+  message(STATUS "CXX_SUPPORTS_AVX512=${CXX_SUPPORTS_AVX512}")
+  message(STATUS "ORC_RUNTIME_SIMD_LEVEL=${ORC_RUNTIME_SIMD_LEVEL}")
+  # Runtime SIMD level it can get from compiler and ORC_RUNTIME_SIMD_LEVEL
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512 AND ORC_RUNTIME_SIMD_LEVEL MATCHES "^(AVX512|MAX)$")
+    message(STATUS "Enable the AVX512 vector decode of bit-packing")
+    set(ORC_HAVE_RUNTIME_AVX512 ON)
+    set(ORC_SIMD_LEVEL "AVX512")
+    add_definitions(-DORC_HAVE_RUNTIME_AVX512)
+  else ()
+    set(ORC_HAVE_RUNTIME_AVX512 OFF)
+    message(STATUS "Disable the AVX512 vector decode of bit-packing")
+  endif()
+  if(ORC_SIMD_LEVEL STREQUAL "DEFAULT")
+    set(ORC_SIMD_LEVEL "NONE")
+  endif()
+elseif(ORC_CPU_FLAG STREQUAL "aarch64")

Review Comment:
   Fixed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1107118708


##########
CMakeLists.txt:
##########
@@ -157,6 +172,102 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(FATAL_ERROR "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+    set(CXX_SUPPORTS_SSE4_2 TRUE)
+  else()
+    # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
+    set(ORC_AVX512_FLAG "-march=native -mbmi2")
+    set(ORC_AVX512_FLAG
+        "${ORC_AVX512_FLAG} -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mavx512vbmi")
+  endif()
+  check_cxx_compiler_flag(${ORC_AVX512_FLAG} CXX_SUPPORTS_AVX512)
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")
+    check_cxx_source_compiles("
+      #ifdef _MSC_VER
+      #include <intrin.h>
+      #else
+      #include <immintrin.h>
+      #endif
+
+      int main() {
+        __m512i mask = _mm512_set1_epi32(0x1);
+        char out[32];
+        _mm512_storeu_si512(out, mask);
+        return 0;
+      }"
+      CXX_SUPPORTS_AVX512)
+    set(CMAKE_REQUIRED_FLAGS ${OLD_CMAKE_REQURED_FLAGS})
+  endif()
+
+  message(STATUS "BUILD_ENABLE_AVX512=${BUILD_ENABLE_AVX512}")
+  message(STATUS "CXX_SUPPORTS_AVX512=${CXX_SUPPORTS_AVX512}")
+  message(STATUS "ORC_RUNTIME_SIMD_LEVEL=${ORC_RUNTIME_SIMD_LEVEL}")

Review Comment:
   I have already changed the printed information for these values. Please check the output:
   -- System processor: x86_64
   -- Performing Test CXX_SUPPORTS_AVX512
   -- Performing Test CXX_SUPPORTS_AVX512 - Success
   -- BUILD_ENABLE_AVX512: ON
   -- Enable the AVX512 vector decode of bit-packing, compiler support AVX512
   -- ORC_HAVE_RUNTIME_AVX512: ON, ORC_SIMD_LEVEL: AVX512
   
   -- System processor: x86_64
   -- Performing Test CXX_SUPPORTS_AVX512
   -- Performing Test CXX_SUPPORTS_AVX512 - Success
   -- BUILD_ENABLE_AVX512: OFF
   -- Disable the AVX512 vector decode of bit-packing
   -- ORC_HAVE_RUNTIME_AVX512: OFF, ORC_SIMD_LEVEL: NONE
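
For context, the two configurations above differ only in whether CMake passes -DORC_HAVE_RUNTIME_AVX512 to the compiler (via add_definitions in the quoted hunk). A minimal sketch of how a C++ translation unit could observe that definition is shown here; the function name compiledSimdLevel is hypothetical and not part of this PR, and the returned strings merely mirror the ORC_SIMD_LEVEL values printed in the log.

```cpp
#include <string>

// Hypothetical helper (an assumption, not code from the PR): reports which
// decode path this translation unit was compiled for. It relies only on the
// ORC_HAVE_RUNTIME_AVX512 definition that the CMake logic adds when
// BUILD_ENABLE_AVX512 is ON and the compiler check succeeds.
inline std::string compiledSimdLevel() {
#if defined(ORC_HAVE_RUNTIME_AVX512)
  return "AVX512";  // AVX512 bit-unpacking code is compiled in
#else
  return "NONE";    // only the scalar decode path is available
#endif
}
```

A build configured with -DBUILD_ENABLE_AVX512=ON would be expected to take the first branch, and the default build the second. Whether the AVX512 path is actually used at run time would still depend on a separate CPU capability check.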




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1169459344


##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,2694 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"
+#include "CpuInfoUtil.hh"
+#include "RLEv2.hh"
+
+namespace orc {
+  UnpackAvx512::UnpackAvx512(RleDecoderV2* dec) : decoder(dec), unpackDefault(UnpackDefault(dec)) {
+    // PASS
+  }
+
+  UnpackAvx512::~UnpackAvx512() {
+    // PASS
+  }
+
+  inline void UnpackAvx512::alignHeaderBoundary(const uint32_t bitWidth, const uint32_t bitMaxSize,
+                                                uint64_t& startBit, uint64_t& bufMoveByteLen,
+                                                uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint64_t& tailBitLen, uint32_t& backupByteLen,
+                                                uint64_t& numElements, bool& resetBuf,
+                                                const uint8_t*& srcPtr, int64_t*& dstPtr) {
+    if (startBit != 0) {
+      bufMoveByteLen +=
+          moveByteLen(remainingNumElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+    } else {
+      bufMoveByteLen += moveByteLen(remainingNumElements * bitWidth);
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      numElements = remainingNumElements;
+      resetBuf = false;
+      remainingNumElements = 0;
+    } else {
+      uint64_t leadingBits = 0;
+      if (startBit != 0) leadingBits = ORC_VECTOR_BYTE_WIDTH - startBit;
+      uint64_t bufRestBitLen = bufRestByteLen * ORC_VECTOR_BYTE_WIDTH + leadingBits;
+      numElements = bufRestBitLen / bitWidth;
+      remainingNumElements -= numElements;
+      tailBitLen = fmod(bufRestBitLen, bitWidth);
+      resetBuf = true;
+    }
+
+    if (tailBitLen != 0) {
+      backupByteLen = tailBitLen / ORC_VECTOR_BYTE_WIDTH;
+      tailBitLen = 0;
+    }
+
+    if (startBit > 0) {
+      uint32_t align = getAlign(startBit, bitWidth, bitMaxSize);
+      if (align > numElements) {
+        align = numElements;
+      }
+      if (align != 0) {
+        bufMoveByteLen -= moveByteLen(align * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+        plainUnpackLongs(dstPtr, 0, align, bitWidth, startBit);
+        srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+        bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+        dstPtr += align;
+        numElements -= align;
+      }
+    }
+  }
+
+  inline void UnpackAvx512::alignTailerBoundary(const uint32_t bitWidth, uint64_t& startBit,
+                                                uint64_t& bufMoveByteLen, uint64_t& bufRestByteLen,
+                                                uint64_t& remainingNumElements,
+                                                uint32_t& backupByteLen, uint64_t& numElements,
+                                                bool& resetBuf, const uint8_t*& srcPtr,
+                                                int64_t*& dstPtr) {
+    if (numElements > 0) {
+      if (startBit != 0) {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH);
+      } else {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth);
+      }
+      plainUnpackLongs(dstPtr, 0, numElements, bitWidth, startBit);
+      srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+      dstPtr += numElements;
+      bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    }
+
+    if (bufMoveByteLen <= bufRestByteLen) {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufMoveByteLen,
+                                resetBuf, backupByteLen);
+      return;
+    }
+
+    if (backupByteLen != 0) {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                resetBuf, backupByteLen);
+      plainUnpackLongs(dstPtr, 0, 1, bitWidth, startBit);
+      dstPtr++;
+      backupByteLen = 0;
+      remainingNumElements--;
+    } else {
+      decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                resetBuf, backupByteLen);
+    }
+
+    bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bufMoveByteLen = 0;
+    srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+  }
+
+  void UnpackAvx512::vectorUnpack1(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 1;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __m512i reverseMask1u = _mm512_loadu_si512(reverseMaskTable1u);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          uint64_t src_64 = *reinterpret_cast<uint64_t*>(const_cast<uint8_t*>(srcPtr));
+          // convert mask to 512-bit register. 0 --> 0x00, 1 --> 0xFF
+          __m512i srcmm = _mm512_movm_epi8(src_64);
+          // make 0x00 --> 0x00, 0xFF --> 0x01
+          srcmm = _mm512_abs_epi8(srcmm);
+          srcmm = _mm512_shuffle_epi8(srcmm, reverseMask1u);
+          _mm512_storeu_si512(simdPtr, srcmm);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack2(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 2;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_MAX_16U;         // first 16 bytes (64 elements)
+        __m512i parse_mask = _mm512_set1_epi16(0x0303);  // 2 times 1 then (8 - 2) times 0
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm3 = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          __m512i srcmm0, srcmm1, srcmm2, tmpmm;
+
+          srcmm2 = _mm512_srli_epi16(srcmm3, 2);
+          srcmm1 = _mm512_srli_epi16(srcmm3, 4);
+          srcmm0 = _mm512_srli_epi16(srcmm3, 6);
+
+          // turn 2 bitWidth into 8 by zeroing 3 of each 4 elements.
+          // move them into their places
+          // srcmm0: a e i m 0 0 0 0 0 0 0 0 0 0 0 0
+          // srcmm1: b f j n 0 0 0 0 0 0 0 0 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm0, srcmm1);        // ab ef 00 00 00 00 00 00
+          srcmm0 = _mm512_unpackhi_epi8(srcmm0, srcmm1);       // ij mn 00 00 00 00 00 00
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x00);  // ab ef ab ef ij mn ij mn
+
+          // srcmm2: c g k o 0 0 0 0 0 0 0 0 0 0 0 0
+          // srcmm3: d h l p 0 0 0 0 0 0 0 0 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm2, srcmm3);        // cd gh 00 00 00 00 00 00
+          srcmm1 = _mm512_unpackhi_epi8(srcmm2, srcmm3);       // kl op 00 00 00 00 00 00
+          srcmm1 = _mm512_shuffle_i64x2(tmpmm, srcmm1, 0x00);  // cd gh cd gh kl op kl op
+
+          tmpmm = _mm512_unpacklo_epi16(srcmm0, srcmm1);        // abcd abcd ijkl ijkl
+          srcmm0 = _mm512_unpackhi_epi16(srcmm0, srcmm1);       // efgh efgh mnop mnop
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x88);   // abcd ijkl efgh mnop
+          srcmm0 = _mm512_shuffle_i64x2(srcmm0, srcmm0, 0xD8);  // abcd efgh ijkl mnop
+
+          srcmm0 = _mm512_and_si512(srcmm0, parse_mask);
+
+          _mm512_storeu_si512(simdPtr, srcmm0);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack3(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 3;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable3u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable3u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable3u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable3u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable3u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack4(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 4;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_MAX_32U;        // first 32 bytes (64 elements)
+        __m512i parseMask = _mm512_set1_epi16(0x0F0F);  // 4 times 1 then (8 - 4) times 0
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm0, srcmm1, tmpmm;
+
+          srcmm1 = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm0 = _mm512_srli_epi16(srcmm1, 4);
+
+          // move elements into their places
+          // srcmm0: a c e g 0 0 0 0
+          // srcmm1: b d f h 0 0 0 0
+          tmpmm = _mm512_unpacklo_epi8(srcmm0, srcmm1);         // ab ef 00 00
+          srcmm0 = _mm512_unpackhi_epi8(srcmm0, srcmm1);        // cd gh 00 00
+          srcmm0 = _mm512_shuffle_i64x2(tmpmm, srcmm0, 0x44);   // ab ef cd gh
+          srcmm0 = _mm512_shuffle_i64x2(srcmm0, srcmm0, 0xD8);  // ab cd ef gh
+
+          // turn 4 bitWidth into 8 by zeroing 4 of each 8 bits.
+          srcmm0 = _mm512_and_si512(srcmm0, parseMask);
+
+          _mm512_storeu_si512(simdPtr, srcmm0);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack5(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 5;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable5u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable5u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable5u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable5u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable5u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack6(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 6;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable6u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable6u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable6u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable6u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable6u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi32(permutexIdx, srcmm);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack7(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 7;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_8Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+        uint8_t* simdPtr = reinterpret_cast<uint8_t*>(vectorBuf);
+        __mmask64 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_BYTE(bitWidth * 64));
+        __m512i parseMask = _mm512_set1_epi8(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable7u);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable7u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable7u_1);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable7u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable7u_1);
+
+        while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi8(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffling so in zmm[0] will be elements with even indexes and in zmm[1] - with odd ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi16(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi8(zmm[0], 0xAAAAAAAAAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 8 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 8 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 8 * bitWidth;
+          numElements -= VECTOR_UNPACK_8BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_8BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_8BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack9(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 9;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable9u_0);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable9u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable9u_1);
+
+        __m512i shiftMaskPtr[3];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable9u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable9u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable9u_2);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable9u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi16(zmm[0], shiftMaskPtr[2]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permute so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 7);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack10(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 10;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable10u_0);
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable10u);
+        __m512i shiftMask = _mm512_loadu_si512(shiftTable10u);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm;
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          zmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+          zmm = _mm512_shuffle_epi8(zmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm = _mm512_srlv_epi16(zmm, shiftMask);
+          zmm = _mm512_and_si512(zmm, parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack11(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 11;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverse_mask_16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable11u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable11u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable11u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable11u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable11u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable11u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable11u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable11u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable11u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffle so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4u);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permute so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 5);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverse_mask_16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack12(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 12;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr = _mm512_loadu_si512(shuffleIdxTable12u_0);
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable12u);
+        __m512i shiftMask = _mm512_loadu_si512(shiftTable12u);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm;
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          zmm = _mm512_permutexvar_epi32(permutexIdx, srcmm);
+          zmm = _mm512_shuffle_epi8(zmm, shuffleIdxPtr);
+
+          // shifting elements so they start from the start of the word
+          zmm = _mm512_srlv_epi16(zmm, shiftMask);
+          zmm = _mm512_and_si512(zmm, parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack13(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 13;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverse_mask_16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable13u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable13u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable13u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable13u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable13u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable13u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable13u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable13u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable13u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffle so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permute so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 3);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverse_mask_16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack14(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 14;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable14u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable14u_1);
+
+        __m512i permutexIdx = _mm512_loadu_si512(permutexIdxTable14u);
+
+        __m512i shiftMaskPtr[2];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable14u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable14u_1);
+
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+          srcmm = _mm512_permutexvar_epi16(permutexIdx, srcmm);
+
+          // shuffle so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack15(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 15;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      alignHeaderBoundary(bitWidth, UNPACK_16Bit_MAX_SIZE, startBit, bufMoveByteLen, bufRestByteLen,
+                          len, tailBitLen, backupByteLen, numElements, resetBuf, srcPtr, dstPtr);
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __mmask32 readMask = ORC_VECTOR_BIT_MASK(ORC_VECTOR_BITS_2_WORD(bitWidth * 32));
+        __m512i parseMask0 = _mm512_set1_epi16(ORC_VECTOR_BIT_MASK(bitWidth));
+        __m512i nibbleReversemm = _mm512_loadu_si512(nibbleReverseTable);
+        __m512i reverseMask16u = _mm512_loadu_si512(reverseMaskTable16u);
+        __m512i maskmm = _mm512_set1_epi8(0x0F);
+
+        __m512i shuffleIdxPtr[2];
+        shuffleIdxPtr[0] = _mm512_loadu_si512(shuffleIdxTable15u_0);
+        shuffleIdxPtr[1] = _mm512_loadu_si512(shuffleIdxTable15u_1);
+
+        __m512i permutexIdxPtr[2];
+        permutexIdxPtr[0] = _mm512_loadu_si512(permutexIdxTable15u_0);
+        permutexIdxPtr[1] = _mm512_loadu_si512(permutexIdxTable15u_1);
+
+        __m512i shiftMaskPtr[4];
+        shiftMaskPtr[0] = _mm512_loadu_si512(shiftTable15u_0);
+        shiftMaskPtr[1] = _mm512_loadu_si512(shiftTable15u_1);
+        shiftMaskPtr[2] = _mm512_loadu_si512(shiftTable15u_2);
+        shiftMaskPtr[3] = _mm512_loadu_si512(shiftTable15u_3);
+
+        __m512i gatherIdxmm = _mm512_loadu_si512(gatherIdxTable15u);
+
+        while (numElements >= 2 * VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_i64gather_epi64(gatherIdxmm, srcPtr, 1);
+
+          // shuffle so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[0]);
+          zmm[1] = _mm512_shuffle_epi8(srcmm, shuffleIdxPtr[1]);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[2]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[3]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+        if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm, zmm[2];
+
+          srcmm = _mm512_maskz_loadu_epi16(readMask, srcPtr);
+
+          __m512i lowNibblemm = _mm512_and_si512(srcmm, maskmm);
+          __m512i highNibblemm = _mm512_srli_epi16(srcmm, 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          srcmm = _mm512_or_si512(lowNibblemm, highNibblemm);
+
+          // permute so zmm[0] holds the even-indexed elements and zmm[1] the odd-indexed ones
+          zmm[0] = _mm512_permutexvar_epi16(permutexIdxPtr[0], srcmm);
+          zmm[1] = _mm512_permutexvar_epi16(permutexIdxPtr[1], srcmm);
+
+          // shifting elements so they start from the start of the word
+          zmm[0] = _mm512_srlv_epi32(zmm[0], shiftMaskPtr[0]);
+          zmm[1] = _mm512_sllv_epi32(zmm[1], shiftMaskPtr[1]);
+
+          // gathering even and odd elements together
+          zmm[0] = _mm512_mask_mov_epi16(zmm[0], 0xAAAAAAAA, zmm[1]);
+          zmm[0] = _mm512_and_si512(zmm[0], parseMask0);
+
+          zmm[0] = _mm512_slli_epi16(zmm[0], 1);
+
+          lowNibblemm = _mm512_and_si512(zmm[0], maskmm);
+          highNibblemm = _mm512_srli_epi16(zmm[0], 4);
+          highNibblemm = _mm512_and_si512(highNibblemm, maskmm);
+
+          lowNibblemm = _mm512_shuffle_epi8(nibbleReversemm, lowNibblemm);
+          highNibblemm = _mm512_shuffle_epi8(nibbleReversemm, highNibblemm);
+          lowNibblemm = _mm512_slli_epi16(lowNibblemm, 4);
+
+          zmm[0] = _mm512_or_si512(lowNibblemm, highNibblemm);
+          zmm[0] = _mm512_shuffle_epi8(zmm[0], reverseMask16u);
+
+          _mm512_storeu_si512(simdPtr, zmm[0]);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      alignTailerBoundary(bitWidth, startBit, bufMoveByteLen, bufRestByteLen, len, backupByteLen,
+                          numElements, resetBuf, srcPtr, dstPtr);
+    }
+  }
+
+  void UnpackAvx512::vectorUnpack16(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 16;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint64_t numElements = len;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    int64_t* dstPtr = data + offset;
+    bool resetBuf = false;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      bufMoveByteLen += moveByteLen(len * bitWidth);
+
+      if (bufMoveByteLen <= bufRestByteLen) {
+        numElements = len;
+        resetBuf = false;
+      } else {
+        numElements = bufRestByteLen * ORC_VECTOR_BYTE_WIDTH / bitWidth;
+        len -= numElements;
+        tailBitLen = (bufRestByteLen * ORC_VECTOR_BYTE_WIDTH) % bitWidth;
+        resetBuf = true;
+      }
+
+      if (tailBitLen != 0) {
+        backupByteLen = tailBitLen / ORC_VECTOR_BYTE_WIDTH;
+        tailBitLen = 0;
+      }
+
+      if (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+        uint16_t* simdPtr = reinterpret_cast<uint16_t*>(vectorBuf);
+        __m512i reverse_mask_16u = _mm512_loadu_si512(reverseMaskTable16u);
+        while (numElements >= VECTOR_UNPACK_16BIT_MAX_NUM) {
+          __m512i srcmm = _mm512_loadu_si512(srcPtr);
+          srcmm = _mm512_shuffle_epi8(srcmm, reverse_mask_16u);
+          _mm512_storeu_si512(simdPtr, srcmm);
+
+          srcPtr += 4 * bitWidth;
+          decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, 4 * bitWidth, false,
+                                    0);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          bufMoveByteLen -= 4 * bitWidth;
+          numElements -= VECTOR_UNPACK_16BIT_MAX_NUM;
+          std::copy(simdPtr, simdPtr + VECTOR_UNPACK_16BIT_MAX_NUM, dstPtr);
+          dstPtr += VECTOR_UNPACK_16BIT_MAX_NUM;
+        }
+      }
+
+      if (numElements > 0) {
+        bufMoveByteLen -= moveByteLen(numElements * bitWidth);
+        unpackDefault.unrolledUnpack16(dstPtr, 0, numElements);
+        srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+        dstPtr += numElements;
+        bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+      }
+
+      if (bufMoveByteLen <= bufRestByteLen) {
+        decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufMoveByteLen,
+                                  resetBuf, backupByteLen);
+        return;
+      }
+
+      if (backupByteLen != 0) {
+        decoder->resetBufferStart(&decoder->bufferStart, &decoder->bufferEnd, bufRestByteLen,
+                                  resetBuf, backupByteLen);
+        ;

Review Comment:
   Sorry, fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1169459884


##########
c++/src/RLEv2.hh:
##########
@@ -220,17 +221,36 @@ namespace orc {
 
     const std::unique_ptr<SeekableInputStream> inputStream;
     const bool isSigned;
-
     unsigned char firstByte;
-    uint64_t runLength;  // Length of the current run
-    uint64_t runRead;    // Number of returned values of the current run
-    const char* bufferStart;
-    const char* bufferEnd;
-    uint32_t bitsLeft;                  // Used by readLongs when bitSize < 8
-    uint32_t curByte;                   // Used by anything that uses readLongs
+    uint64_t runLength;                 // Length of the current run
+    uint64_t runRead;                   // Number of returned values of the current run
     DataBuffer<int64_t> unpackedPatch;  // Used by PATCHED_BASE
     DataBuffer<int64_t> literals;       // Values of the current run
   };
+
+  inline void RleDecoderV2::resetBufferStart(char** bufStart, char** bufEnd, uint64_t len,

Review Comment:
   Fixed.



##########
c++/src/BpackingAvx512.hh:
##########
@@ -0,0 +1,103 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKINGAVX512_HH
+#define ORC_BPACKINGAVX512_HH
+
+#include <cstdint>
+#include <cstdlib>
+
+#include "BpackingDefault.hh"
+
+namespace orc {
+
+#define VECTOR_UNPACK_8BIT_MAX_NUM 64
+#define VECTOR_UNPACK_16BIT_MAX_NUM 32
+#define VECTOR_UNPACK_32BIT_MAX_NUM 16
+#define UNPACK_8Bit_MAX_SIZE 8
+#define UNPACK_16Bit_MAX_SIZE 16
+#define UNPACK_32Bit_MAX_SIZE 32
+
+  class RleDecoderV2;
+
+  class UnpackAvx512 {
+   public:
+    UnpackAvx512(RleDecoderV2* dec);
+    ~UnpackAvx512();
+
+    void vectorUnpack1(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack2(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack3(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack4(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack5(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack6(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack7(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack9(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack10(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack11(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack12(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack13(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack14(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack15(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack16(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack17(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack18(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack19(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack20(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack21(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack22(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack23(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack24(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack26(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack28(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack30(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack32(int64_t* data, uint64_t offset, uint64_t len);
+
+    void plainUnpackLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs,
+                          uint64_t& startBit);
+
+    inline void alignHeaderBoundary(const uint32_t bitWidth, const uint32_t bitMaxSize,
+                                    uint64_t& startBit, uint64_t& bufMoveByteLen,
+                                    uint64_t& bufRestByteLen, uint64_t& remainingNumElements,
+                                    uint64_t& tailBitLen, uint32_t& backupByteLen,
+                                    uint64_t& numElements, bool& resetBuf, const uint8_t*& srcPtr,
+                                    int64_t*& dstPtr);
+
+    inline void alignTailerBoundary(const uint32_t bitWidth, uint64_t& startBit,
+                                    uint64_t& bufMoveByteLen, uint64_t& bufRestByteLen,
+                                    uint64_t& remainingNumElements, uint32_t& backupByteLen,
+                                    uint64_t& numElements, bool& resetBuf, const uint8_t*& srcPtr,
+                                    int64_t*& dstPtr);
+
+   private:
+    RleDecoderV2* decoder;
+    UnpackDefault unpackDefault;
+
+    // Used by vectorially bit-unpacking data

Review Comment:
   Fixed




[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1480561489

   > LGTM. Thanks @wpleonardo
   > 
   > Let me know if you need to add co-authors in the commit message.
   
   Thank you very much, @wgtmac. I developed this feature myself, so there are no co-authors to add.



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1139540587


##########
c++/src/CpuInfoUtil.cc:
##########
@@ -0,0 +1,581 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "CpuInfoUtil.hh"
+
+#ifdef __APPLE__
+#include <sys/sysctl.h>
+#endif
+
+#ifndef _MSC_VER
+#include <unistd.h>
+#endif
+
+#ifdef _WIN32
+#define NOMINMAX
+#include <Windows.h>
+#include <intrin.h>
+#endif
+
+#include <algorithm>
+#include <array>
+#include <bitset>
+#include <cctype>
+#include <cerrno>
+#include <cstdint>
+#include <fstream>
+#include <memory>
+#include <optional>
+#include <sstream>
+#include <string>
+#include <thread>
+#include <vector>

Review Comment:
   I also deleted the redundant code for the Arm and PowerPC platforms in [c++/src/CpuInfoUtil.cc].




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1141576795


##########
c++/src/BpackingAvx512.hh:
##########
@@ -0,0 +1,93 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKINGAVX512_HH
+#define ORC_BPACKINGAVX512_HH
+
+#include <stdlib.h>
+#include <cstdint>
+
+#include "BpackingDefault.hh"
+#include "Dispatch.hh"
+#include "RLEv2.hh"

Review Comment:
   Removed the following includes from c++/src/BpackingAvx512.hh:
   #include "Dispatch.hh"
   #include "RLEv2.hh"
   #include "io/InputStream.hh"
   #include "io/OutputStream.hh"
   
   Removed the following include from c++/src/BpackingAvx512.cc and re-sorted the remaining includes:
   #include "Utils.hh"
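
A note on why the `RLEv2.hh` include can be dropped from the header: a class that only stores a pointer to another class needs just a forward declaration, and the later revision of `BpackingAvx512.hh` quoted in this thread does declare `class RleDecoderV2;`. The sketch below shows the pattern in a single file. `Decoder` and `Unpacker` are hypothetical stand-ins chosen for illustration, not the ORC classes.

```cpp
#include <cstdint>

// "Header" part: a forward declaration is enough to hold a pointer.
class Decoder;  // hypothetical stand-in for the real decoder class

class Unpacker {
 public:
  explicit Unpacker(Decoder* dec) : decoder(dec) {}
  // Declared here, defined below where Decoder is a complete type.
  uint64_t remainingBytes() const;

 private:
  Decoder* decoder;
};

// "Source" part: the full definition is only needed where members are used.
class Decoder {
 public:
  Decoder(const char* start, const char* end) : bufferStart(start), bufferEnd(end) {}
  const char* bufferStart;
  const char* bufferEnd;
};

uint64_t Unpacker::remainingBytes() const {
  return static_cast<uint64_t>(decoder->bufferEnd - decoder->bufferStart);
}
```

In a real project the two parts would live in the `.hh` and `.cc` files respectively, so that only the `.cc` file pulls in the decoder's header.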




[GitHub] [orc] wgtmac commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1477299432

   @stiga-huang @coderex2522 Could you please take a look again? It generally looks good to me now.



[GitHub] [orc] wgtmac commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1148758780


##########
README.md:
##########
@@ -93,3 +93,15 @@ To build only the C++ library:
 % make test-out
 
 ```
+
+To build the C++ library with AVX512 enabling:

Review Comment:
   ```suggestion
   To build the C++ library with AVX512 enabled:
   ```



##########
README.md:
##########
@@ -93,3 +93,15 @@ To build only the C++ library:
 % make test-out
 
 ```
+
+To build the C++ library with AVX512 enabling:
+```shell
+ENV parameter ORC_USER_SIMD_LEVEL is to switch "AVX512" and "NONE" at the running time.
+export ORC_USER_SIMD_LEVEL=AVX512
+% mkdir build
+% cd build
+% cmake .. -DBUILD_JAVA=OFF -DBUILD_ENABLE_AVX512=ON
+% make package
+% make test-out
+

Review Comment:
   Remove blank line



##########
README.md:
##########
@@ -93,3 +93,15 @@ To build only the C++ library:
 % make test-out
 
 ```
+
+To build the C++ library with AVX512 enabling:
+```shell
+ENV parameter ORC_USER_SIMD_LEVEL is to switch "AVX512" and "NONE" at the running time.

Review Comment:
   ```suggestion
   Environment variable ORC_USER_SIMD_LEVEL can be set to "AVX512" or "NONE" at compile and/or run time. At compile time, it defines the SIMD level to be compiled into the binaries. At run time, it defines the SIMD level to dispatch the code which can apply SIMD optimization. Note that if ORC_USER_SIMD_LEVEL is set to "NONE" at compile time, ORC_USER_SIMD_LEVEL will not take effect at run time even if it is set to "AVX512".
   ```




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1151358663


##########
README.md:
##########
@@ -93,3 +93,16 @@ To build only the C++ library:
 % make test-out
 
 ```
+
+To build the C++ library with AVX512 enabled:
+```shell
+Cmake option BUILD_ENABLE_AVX512 can be set to "ON" or (default value)"OFF" at the compile time. At compile time, it defines the SIMD level(AVX512) to be compiled into the binaries.
+Environment variable ORC_USER_SIMD_LEVEL can be set to "AVX512" or (default value)"NONE" at the run time. At run time, it defines the SIMD level to dispatch the code which can apply SIMD optimization. 
+Note that if ORC_USER_SIMD_LEVEL is set to "NONE" at run time, AVX512 will not take effect at run time even if BUILD_ENABLE_AVX512 is set to "ON" at compile time.

Review Comment:
   Done.




[GitHub] [orc] wpleonardo closed pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo closed pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode
URL: https://github.com/apache/orc/pull/1375



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1148902985


##########
README.md:
##########
@@ -93,3 +93,15 @@ To build only the C++ library:
 % make test-out
 
 ```
+
+To build the C++ library with AVX512 enabling:

Review Comment:
   Done



##########
README.md:
##########
@@ -93,3 +93,15 @@ To build only the C++ library:
 % make test-out
 
 ```
+
+To build the C++ library with AVX512 enabling:
+```shell
+ENV parameter ORC_USER_SIMD_LEVEL is to switch "AVX512" and "NONE" at the running time.
+export ORC_USER_SIMD_LEVEL=AVX512
+% mkdir build
+% cd build
+% cmake .. -DBUILD_JAVA=OFF -DBUILD_ENABLE_AVX512=ON
+% make package
+% make test-out
+

Review Comment:
   Done




[GitHub] [orc] stiga-huang commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "stiga-huang (via GitHub)" <gi...@apache.org>.

stiga-huang commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1150600749


##########
README.md:
##########
@@ -93,3 +93,16 @@ To build only the C++ library:
 % make test-out
 
 ```
+
+To build the C++ library with AVX512 enabled:
+```shell
+Cmake option BUILD_ENABLE_AVX512 can be set to "ON" or (default value)"OFF" at the compile time. At compile time, it defines the SIMD level(AVX512) to be compiled into the binaries.
+Environment variable ORC_USER_SIMD_LEVEL can be set to "AVX512" or (default value)"NONE" at the run time. At run time, it defines the SIMD level to dispatch the code which can apply SIMD optimization. 
+Note that if ORC_USER_SIMD_LEVEL is set to "NONE" at run time, AVX512 will not take effect at run time even if BUILD_ENABLE_AVX512 is set to "ON" at compile time.

Review Comment:
   These three lines are too long and would read better outside the shell block. For example:
   
   To build the C++ library with AVX512 enabled:
   ```shell
   export ORC_USER_SIMD_LEVEL=AVX512
   % mkdir build
   % cd build
   % cmake .. -DBUILD_JAVA=OFF -DBUILD_ENABLE_AVX512=ON
   % make package
   % make test-out
   ```
   Cmake option BUILD_ENABLE_AVX512 can be set to "ON" or (default value)"OFF" at the compile time. At compile time, it defines the SIMD level(AVX512) to be compiled into the binaries.
   
   Environment variable ORC_USER_SIMD_LEVEL can be set to "AVX512" or (default value)"NONE" at the run time. At run time, it defines the SIMD level to dispatch the code which can apply SIMD optimization. 
   Note that if ORC_USER_SIMD_LEVEL is set to "NONE" at run time, AVX512 will not take effect at run time even if BUILD_ENABLE_AVX512 is set to "ON" at compile time.
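
The rule described above, where AVX512 is used only if it was both compiled in and requested at run time, can be sketched roughly as follows. This is an illustration under stated assumptions, not ORC's actual dispatch code: `SimdLevel`, `parseSimdLevel`, `effectiveSimdLevel` and `simdLevelFromEnvironment` are hypothetical names, and treating an unset or unrecognized value as "NONE" is an assumption based on "NONE" being described as the default.

```cpp
#include <cstdlib>
#include <string>

enum class SimdLevel { NONE, AVX512 };

// Parse the text of ORC_USER_SIMD_LEVEL.
// Assumption: unset or unrecognized values fall back to NONE.
SimdLevel parseSimdLevel(const char* value) {
  if (value != nullptr && std::string(value) == "AVX512") {
    return SimdLevel::AVX512;
  }
  return SimdLevel::NONE;
}

// AVX512 is effective only when all three conditions hold:
// it was compiled in, the CPU supports it, and the user requested it.
SimdLevel effectiveSimdLevel(bool builtWithAvx512, bool cpuHasAvx512, SimdLevel requested) {
  if (builtWithAvx512 && cpuHasAvx512 && requested == SimdLevel::AVX512) {
    return SimdLevel::AVX512;
  }
  return SimdLevel::NONE;
}

// Convenience wrapper that reads the environment variable named in the thread.
SimdLevel simdLevelFromEnvironment(bool builtWithAvx512, bool cpuHasAvx512) {
  return effectiveSimdLevel(builtWithAvx512, cpuHasAvx512,
                            parseSimdLevel(std::getenv("ORC_USER_SIMD_LEVEL")));
}
```

With this shape, a binary built with `-DBUILD_ENABLE_AVX512=OFF` always resolves to `NONE`, and a binary built with `ON` still resolves to `NONE` unless the variable is set to `AVX512`.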



##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,4476 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"
+#include "CpuInfoUtil.hh"
+#include "RLEv2.hh"
+
+namespace orc {
+  UnpackAvx512::UnpackAvx512(RleDecoderV2* dec) : decoder(dec), unpackDefault(UnpackDefault(dec)) {
+    // PASS
+  }
+
+  UnpackAvx512::~UnpackAvx512() {
+    // PASS
+  }
+
+  void UnpackAvx512::vectorUnpack1(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 1;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint32_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      if (startBit != 0) {
+        bufMoveByteLen +=
+            moveLen(len * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH, ORC_VECTOR_BYTE_WIDTH);
+      } else {
+        bufMoveByteLen += moveLen(len * bitWidth, ORC_VECTOR_BYTE_WIDTH);
+      }
+
+      if (bufMoveByteLen <= bufRestByteLen) {
+        numElements = len;
+        resetBuf = false;
+        len -= numElements;
+      } else {
+        if (startBit != 0) {
+          numElements =
+              (bufRestByteLen * ORC_VECTOR_BYTE_WIDTH + ORC_VECTOR_BYTE_WIDTH - startBit) /
+              bitWidth;
+          len -= numElements;
+          tailBitLen = fmod(
+              bufRestByteLen * ORC_VECTOR_BYTE_WIDTH + ORC_VECTOR_BYTE_WIDTH - startBit, bitWidth);
+          resetBuf = true;
+        } else {
+          numElements = (bufRestByteLen * ORC_VECTOR_BYTE_WIDTH) / bitWidth;
+          len -= numElements;
+          tailBitLen = fmod(bufRestByteLen * ORC_VECTOR_BYTE_WIDTH, bitWidth);
+          resetBuf = true;
+        }
+      }
+
+      if (tailBitLen != 0) {
+        backupByteLen = tailBitLen / ORC_VECTOR_BYTE_WIDTH;
+        tailBitLen = 0;
+      }
+
+      if (startBit > 0) {
+        uint32_t align = getAlign(startBit, bitWidth, 8);
+        if (align > numElements) {
+          align = numElements;
+        }
+        if (align != 0) {
+          bufMoveByteLen -=
+              moveLen(align * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH, ORC_VECTOR_BYTE_WIDTH);
+          plainUnpackLongs(dstPtr, 0, align, bitWidth, startBit);
+          srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          dstPtr += align;
+          numElements -= align;
+        }
+      }

Review Comment:
   The code on lines 46 to 93 is similar (or identical?) across these methods; for example, the same code appears in vectorUnpack2 and vectorUnpack3. Can we extract it to reduce the method size?
   
   I think the core of each method is the `while (numElements >= VECTOR_UNPACK_8BIT_MAX_NUM)` loop. It would be nice to factor out the boundary-handling code and leave these methods focused on that loop.
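
One possible shape for that refactoring is sketched below for the 1-bit case: a shared helper aligns the head, a per-width core loop handles full blocks, and the scalar routine finishes the tail. All names here (`scalarUnpack1`, `alignHead`, `blockBody1`, `unpack1`) are hypothetical, the core loop is plain C++ standing in for the vectorized loop, and buffer refills are left out. The later revision of the header quoted in this thread declares `alignHeaderBoundary` and `alignTailerBoundary`, which play a comparable role.

```cpp
#include <cstdint>

// Scalar fallback: read `count` 1-bit values, most significant bit first.
static void scalarUnpack1(const uint8_t* src, uint64_t& bitPos, int64_t*& dst, uint64_t count) {
  for (uint64_t i = 0; i < count; ++i, ++bitPos) {
    *dst++ = (src[bitPos / 8] >> (7 - bitPos % 8)) & 1;
  }
}

// Shared boundary helper: consume values until bitPos is byte aligned.
static void alignHead(const uint8_t* src, uint64_t& bitPos, int64_t*& dst, uint64_t& remaining) {
  uint64_t toAlign = (8 - bitPos % 8) % 8;
  if (toAlign > remaining) {
    toAlign = remaining;
  }
  scalarUnpack1(src, bitPos, dst, toAlign);
  remaining -= toAlign;
}

// Core loop: one full byte (8 values) per step; the SIMD body would go here.
static void blockBody1(const uint8_t* src, uint64_t& bitPos, int64_t*& dst, uint64_t& remaining) {
  while (remaining >= 8) {
    const uint8_t byte = src[bitPos / 8];
    for (int b = 0; b < 8; ++b) {
      dst[b] = (byte >> (7 - b)) & 1;
    }
    dst += 8;
    bitPos += 8;
    remaining -= 8;
  }
}

// The per-width method reduces to head alignment, core loop, tail.
void unpack1(const uint8_t* src, uint64_t startBit, int64_t* dst, uint64_t len) {
  uint64_t bitPos = startBit;
  uint64_t remaining = len;
  alignHead(src, bitPos, dst, remaining);
  blockBody1(src, bitPos, dst, remaining);
  scalarUnpack1(src, bitPos, dst, remaining);  // tail: fewer than 8 values left
}
```

Only `blockBody1` would differ between bit widths in this layout, which is the part the comment above identifies as the core of each method.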




[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1498355658

   Gentle ping, @stiga-huang 



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1173239598


##########
c++/src/RLEv2.hh:
##########
@@ -166,6 +166,50 @@ namespace orc {
 
     void next(int16_t* data, uint64_t numValues, const char* notNull) override;
 
+    unsigned char readByte();
+
+    void setBufStart(char* start) {

Review Comment:
   Fixed.




[GitHub] [orc] wgtmac commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1537452801

   I have submitted this. Thanks all!



[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1427878821

1. Refactored the default and AVX512 unpacking code and added dynamic dispatch between the two implementations. @wgtmac
https://github.com/apache/orc/pull/1375#discussion_r1067745787
1.1 Moved the default unpacking functions unrolledUnpackX out of RLEv2.hh and RleDecoderV2.cc into new files "BpackingDefault.hh & .cc".
1.2 Created new files "BpackingAvx512.hh & .cc" to define the AVX512 unpacking functions.
1.3 Created a new file "Dispatch.hh" for dynamic dispatch.
1.4 Created new files "Bpacking.hh & .cc" to hold the different unpacking implementations.
2. Deleted the file c++/src/DetectPlatform.hh and adopted the same approach as Apache Arrow (CpuInfoUtil.cc) to check whether the current CPU supports AVX512. @wgtmac
https://github.com/apache/orc/pull/1375#discussion_r1067729278
https://github.com/apache/orc/pull/1375#discussion_r1067729882
3. Deleted the environment variable "ENABLE_RUNTIME_AVX512" and added the environment variable "ORC_USER_SIMD_LEVEL" (value: NONE | AVX512) to switch the SIMD level at run time. @wgtmac
https://github.com/apache/orc/pull/1375#discussion_r1067743477
4. In CMakeLists.txt and CpuInfoUtil.cc, deleted the definitions for avx2, sse_4_2, neon, ppc, s390x and riscv64, because CI does not currently support them and we do not plan to support them in the future. @wgtmac
https://github.com/apache/orc/pull/1375#discussion_r1092785072
https://github.com/apache/orc/pull/1375#discussion_r1092788112
https://github.com/apache/orc/pull/1375#discussion_r1092788886
5. Renamed the main AVX512 functions from unrolledUnpackVectorX to vectorUnpackX. @wgtmac
https://github.com/apache/orc/pull/1375#discussion_r1092812660
6. Improved the code comments to make them more readable. @wgtmac
https://github.com/apache/orc/pull/1375#discussion_r1092798187
7. Renamed the file VectorDecoder.hh to BitUnpackerAvx512.hh. @wgtmac
https://github.com/apache/orc/pull/1375#discussion_r1092835623
8. Removed #include <inttypes.h> from the test file c++/test/TestRleVectorDecoder.cc. @wgtmac
https://github.com/apache/orc/pull/1375#discussion_r1092839029
9. Renamed the class RleVectorTest to RleV2BitUnpackAvx512Test in the test file c++/test/TestRleVectorDecoder.cc. @wgtmac
https://github.com/apache/orc/pull/1375#discussion_r1092839780
10. Changed CMakeLists.txt to print the values of BUILD_ENABLE_AVX512, CXX_SUPPORTS_AVX512, ORC_RUNTIME_SIMD_LEVEL, ORC_HAVE_RUNTIME_AVX512 and ORC_SIMD_LEVEL during the CMake configure step. @stiga-huang
https://github.com/apache/orc/pull/1375#discussion_r1097305535
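
The dynamic dispatch mentioned in item 1 can be pictured with a small sketch like the one below: each implementation is registered with the SIMD level it needs, and the resolver picks the highest level that does not exceed what is allowed at run time. This is only an illustration of the general technique. `DispatchLevel`, `Candidate`, `resolveUnpack8` and both unpack functions are hypothetical names, not the contents of Dispatch.hh, and the "wide" function is plain C++ standing in for an AVX512 kernel so that the sketch builds on any CPU.

```cpp
#include <cstdint>
#include <vector>

enum class DispatchLevel : int { NONE = 0, AVX512 = 1 };

using Unpack8Fn = void (*)(const uint8_t* src, int64_t* dst, uint64_t len);

// Default path: widen one 8-bit value at a time.
void unpack8Default(const uint8_t* src, int64_t* dst, uint64_t len) {
  for (uint64_t i = 0; i < len; ++i) {
    dst[i] = src[i];
  }
}

// Stand-in for the vector path: 64 values per step, then a scalar tail.
void unpack8Wide(const uint8_t* src, int64_t* dst, uint64_t len) {
  uint64_t i = 0;
  for (; i + 64 <= len; i += 64) {
    for (uint64_t j = 0; j < 64; ++j) {
      dst[i + j] = src[i + j];
    }
  }
  for (; i < len; ++i) {
    dst[i] = src[i];
  }
}

struct Candidate {
  DispatchLevel level;
  Unpack8Fn fn;
};

// Pick the candidate with the highest level that is still <= maxLevel.
// Returns nullptr when no candidate qualifies.
Unpack8Fn resolveUnpack8(const std::vector<Candidate>& candidates, DispatchLevel maxLevel) {
  Unpack8Fn best = nullptr;
  int bestLevel = -1;
  for (const Candidate& c : candidates) {
    const int level = static_cast<int>(c.level);
    if (level <= static_cast<int>(maxLevel) && level > bestLevel) {
      best = c.fn;
      bestLevel = level;
    }
  }
  return best;
}
```

Resolving once and caching the function pointer keeps the per-call cost to a single indirect call, which is the usual reason for choosing this layout over checking the CPU on every call.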


[GitHub] [orc] wgtmac commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1105297117


##########
CMakeLists.txt:
##########
@@ -157,6 +172,102 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(FATAL_ERROR "Unknown system processor")

Review Comment:
   This fails the build on unrecognized processors, although that rarely happens. At the very least, we should not break builds that succeeded in the past.



##########
CMakeLists.txt:
##########
@@ -157,6 +172,102 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(FATAL_ERROR "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+    set(CXX_SUPPORTS_SSE4_2 TRUE)
+  else()
+    # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
+    set(ORC_AVX512_FLAG "-march=native -mbmi2")
+    set(ORC_AVX512_FLAG
+        "${ORC_AVX512_FLAG} -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mavx512vbmi")
+  endif()
+  check_cxx_compiler_flag(${ORC_AVX512_FLAG} CXX_SUPPORTS_AVX512)
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")
+    check_cxx_source_compiles("
+      #ifdef _MSC_VER
+      #include <intrin.h>
+      #else
+      #include <immintrin.h>
+      #endif
+
+      int main() {
+        __m512i mask = _mm512_set1_epi32(0x1);
+        char out[32];
+        _mm512_storeu_si512(out, mask);
+        return 0;
+      }"
+      CXX_SUPPORTS_AVX512)
+    set(CMAKE_REQUIRED_FLAGS ${OLD_CMAKE_REQURED_FLAGS})
+  endif()
+
+  message(STATUS "BUILD_ENABLE_AVX512=${BUILD_ENABLE_AVX512}")
+  message(STATUS "CXX_SUPPORTS_AVX512=${CXX_SUPPORTS_AVX512}")
+  message(STATUS "ORC_RUNTIME_SIMD_LEVEL=${ORC_RUNTIME_SIMD_LEVEL}")

Review Comment:
   There is no need to print `ORC_RUNTIME_SIMD_LEVEL` at the compile stage.



##########
CMakeLists.txt:
##########
@@ -157,6 +172,102 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(FATAL_ERROR "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+    set(CXX_SUPPORTS_SSE4_2 TRUE)
+  else()
+    # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
+    set(ORC_AVX512_FLAG "-march=native -mbmi2")
+    set(ORC_AVX512_FLAG
+        "${ORC_AVX512_FLAG} -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mavx512vbmi")
+  endif()
+  check_cxx_compiler_flag(${ORC_AVX512_FLAG} CXX_SUPPORTS_AVX512)
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")
+    check_cxx_source_compiles("
+      #ifdef _MSC_VER
+      #include <intrin.h>
+      #else
+      #include <immintrin.h>
+      #endif
+
+      int main() {
+        __m512i mask = _mm512_set1_epi32(0x1);
+        char out[32];
+        _mm512_storeu_si512(out, mask);
+        return 0;
+      }"
+      CXX_SUPPORTS_AVX512)
+    set(CMAKE_REQUIRED_FLAGS ${OLD_CMAKE_REQURED_FLAGS})
+  endif()
+
+  message(STATUS "BUILD_ENABLE_AVX512=${BUILD_ENABLE_AVX512}")

Review Comment:
   Could you merge `CXX_SUPPORTS_AVX512` and `BUILD_ENABLE_AVX512` into a single message?



##########
CMakeLists.txt:
##########
@@ -87,6 +91,17 @@ if (BUILD_POSITION_INDEPENDENT_LIB)
   set(CMAKE_POSITION_INDEPENDENT_CODE ON)
 endif ()
 
+if(NOT DEFINED ORC_SIMD_LEVEL)

Review Comment:
   These options and variables look confusing to me. `BUILD_ENABLE_AVX512` and `ORC_SIMD_LEVEL` serve the same purpose. At least one of them should be removed.
   
   If `ORC_SIMD_LEVEL` and `ORC_RUNTIME_SIMD_LEVEL` only have default values, then they should be removed because they cannot be changed. Otherwise, they should be configurable and support at least `NONE` and `AVX512`.



##########
c++/src/Bpacking.hh:
##########
@@ -0,0 +1,40 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKING_HH
+#define ORC_BPACKING_HH
+
+#include <stdint.h>
+
+#include "BpackingDefault.hh"
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+#include "BpackingAvx512.hh"

Review Comment:
   This header file should only be included in Bpacking.cc.



##########
c++/src/Dispatch.hh:
##########
@@ -0,0 +1,109 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_DISPATCH_HH
+#define ORC_DISPATCH_HH
+
+#include <utility>
+#include <vector>
+
+#include "CpuInfoUtil.hh"
+
+namespace orc {
+  enum class DispatchLevel : int {
+    // These dispatch levels, corresponding to instruction set features,
+    // are sorted in increasing order of preference.
+    NONE = 0,
+    AVX512,
+    MAX
+  };
+
+  /*
+    A facility for dynamic dispatch according to available DispatchLevel.
+
+    Typical use:
+
+      static void my_function_default(...);
+      static void my_function_avx512(...);
+
+      struct MyDynamicFunction {
+        using FunctionType = decltype(&my_function_default);
+
+        static std::vector<std::pair<DispatchLevel, FunctionType>> implementations() {
+          return {
+            { DispatchLevel::NONE, my_function_default }
+      #if defined(ARROW_HAVE_RUNTIME_AVX512)

Review Comment:
   ```suggestion
         #if defined(ORC_HAVE_RUNTIME_AVX512)
   ```



##########
CMakeLists.txt:
##########
@@ -157,6 +172,139 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)

Review Comment:
   Please move lines 175 to 270 into a separate CMake module.



##########
c++/src/BitUnpackerAvx512.hh:
##########
@@ -0,0 +1,488 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef VECTOR_DECODER_HH

Review Comment:
   Please fix the macro to match the file name.



##########
CMakeLists.txt:
##########
@@ -157,6 +172,102 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(FATAL_ERROR "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+    set(CXX_SUPPORTS_SSE4_2 TRUE)

Review Comment:
   `CXX_SUPPORTS_SSE4_2` is not used and can be removed.



##########
CMakeLists.txt:
##########
@@ -157,6 +172,102 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(FATAL_ERROR "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+    set(CXX_SUPPORTS_SSE4_2 TRUE)
+  else()
+    # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
+    set(ORC_AVX512_FLAG "-march=native -mbmi2")
+    set(ORC_AVX512_FLAG
+        "${ORC_AVX512_FLAG} -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mavx512vbmi")
+  endif()
+  check_cxx_compiler_flag(${ORC_AVX512_FLAG} CXX_SUPPORTS_AVX512)
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")

Review Comment:
   It seems that `CMAKE_REQUIRED_FLAGS` is not officially documented. Do we have any better alternatives?



##########
CMakeLists.txt:
##########
@@ -157,6 +172,102 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(FATAL_ERROR "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+    set(CXX_SUPPORTS_SSE4_2 TRUE)
+  else()
+    # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
+    set(ORC_AVX512_FLAG "-march=native -mbmi2")

Review Comment:
   Why not use a single `set` for `ORC_AVX512_FLAG`?



##########
CMakeLists.txt:
##########
@@ -157,6 +172,102 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  else()
+    message(FATAL_ERROR "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+    set(CXX_SUPPORTS_SSE4_2 TRUE)
+  else()
+    # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
+    set(ORC_AVX512_FLAG "-march=native -mbmi2")
+    set(ORC_AVX512_FLAG
+        "${ORC_AVX512_FLAG} -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mavx512vbmi")
+  endif()
+  check_cxx_compiler_flag(${ORC_AVX512_FLAG} CXX_SUPPORTS_AVX512)
+  if(MINGW)
+    # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
+    message(STATUS "Disable AVX512 support on MINGW for now")
+  else()
+    # Check for AVX512 support in the compiler.
+    set(OLD_CMAKE_REQURED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} ${ORC_AVX512_FLAG}")
+    check_cxx_source_compiles("
+      #ifdef _MSC_VER
+      #include <intrin.h>
+      #else
+      #include <immintrin.h>
+      #endif
+
+      int main() {
+        __m512i mask = _mm512_set1_epi32(0x1);
+        char out[32];
+        _mm512_storeu_si512(out, mask);
+        return 0;
+      }"
+      CXX_SUPPORTS_AVX512)
+    set(CMAKE_REQUIRED_FLAGS ${OLD_CMAKE_REQURED_FLAGS})
+  endif()
+
+  message(STATUS "BUILD_ENABLE_AVX512=${BUILD_ENABLE_AVX512}")
+  message(STATUS "CXX_SUPPORTS_AVX512=${CXX_SUPPORTS_AVX512}")
+  message(STATUS "ORC_RUNTIME_SIMD_LEVEL=${ORC_RUNTIME_SIMD_LEVEL}")
+  # Runtime SIMD level it can get from compiler and ORC_RUNTIME_SIMD_LEVEL
+  if(BUILD_ENABLE_AVX512 AND CXX_SUPPORTS_AVX512 AND ORC_RUNTIME_SIMD_LEVEL MATCHES "^(AVX512|MAX)$")
+    message(STATUS "Enable the AVX512 vector decode of bit-packing")
+    set(ORC_HAVE_RUNTIME_AVX512 ON)
+    set(ORC_SIMD_LEVEL "AVX512")
+    add_definitions(-DORC_HAVE_RUNTIME_AVX512)
+  else ()
+    set(ORC_HAVE_RUNTIME_AVX512 OFF)
+    message(STATUS "Disable the AVX512 vector decode of bit-packing")
+  endif()
+  if(ORC_SIMD_LEVEL STREQUAL "DEFAULT")
+    set(ORC_SIMD_LEVEL "NONE")
+  endif()
+elseif(ORC_CPU_FLAG STREQUAL "aarch64")

Review Comment:
   Please remove the logic related to `aarch64`.



##########
c++/src/BitUnpackerAvx512.hh:
##########
@@ -0,0 +1,488 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef VECTOR_DECODER_HH
+#define VECTOR_DECODER_HH
+
+// Mingw-w64 defines strcasecmp in string.h
+#if defined(_WIN32) && !defined(strcasecmp)
+#include <string.h>
+#define strcasecmp stricmp
+#else
+#include <strings.h>
+#endif
+
+#if defined(ORC_HAVE_RUNTIME_AVX512)

Review Comment:
   Why not move it above line 22?



##########
c++/src/Dispatch.hh:
##########
@@ -0,0 +1,109 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_DISPATCH_HH
+#define ORC_DISPATCH_HH
+
+#include <utility>
+#include <vector>
+
+#include "CpuInfoUtil.hh"
+
+namespace orc {
+  enum class DispatchLevel : int {
+    // These dispatch levels, corresponding to instruction set features,
+    // are sorted in increasing order of preference.
+    NONE = 0,
+    AVX512,
+    MAX
+  };
+
+  /*
+    A facility for dynamic dispatch according to available DispatchLevel.
+
+    Typical use:
+
+      static void my_function_default(...);
+      static void my_function_avx512(...);
+
+      struct MyDynamicFunction {
+        using FunctionType = decltype(&my_function_default);
+
+        static std::vector<std::pair<DispatchLevel, FunctionType>> implementations() {
+          return {
+            { DispatchLevel::NONE, my_function_default }
+      #if defined(ARROW_HAVE_RUNTIME_AVX512)
+            , { DispatchLevel::AVX512, my_function_avx512 }
+      #endif
+          };
+        }
+      };
+
+      void my_function(...) {
+        static DynamicDispatch<MyDynamicFunction> dispatch;
+        return dispatch.func(...);
+      }
+  */
+  template <typename DynamicFunction>
+  class DynamicDispatch {
+   protected:
+    using FunctionType = typename DynamicFunction::FunctionType;
+    using Implementation = std::pair<DispatchLevel, FunctionType>;
+
+   public:
+    DynamicDispatch() {
+      Resolve(DynamicFunction::implementations());
+    }
+
+    FunctionType func = {};
+
+   protected:
+    // Use the Implementation with the highest DispatchLevel
+    void Resolve(const std::vector<Implementation>& implementations) {
+      Implementation cur{DispatchLevel::NONE, {}};
+
+      for (const auto& impl : implementations) {
+        if (impl.first >= cur.first && IsSupported(impl.first)) {
+          // Higher (or same) level than current
+          cur = impl;
+        }
+      }
+
+      if (!cur.second) {
+        throw InvalidArgument("No appropriate implementation found");
+      }
+      func = cur.second;
+    }
+
+   private:
+    bool IsSupported(DispatchLevel level) const {
+      static const auto cpu_info = orc::CpuInfo::GetInstance();

Review Comment:
   ```suggestion
         static const auto cpuInfo = CpuInfo::GetInstance();
   ```
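
As a note on the `Resolve` logic quoted above, the selection rule can be illustrated with a minimal, self-contained sketch. It assumes a plain `cpuHasAvx512` flag in place of the PR's `CpuInfo` query, and the names `sketch`, `isSupported`, `decodeDefault`, `decodeAvx512`, `implementations`, and `resolve` are hypothetical, not part of ORC.

```cpp
#include <stdexcept>
#include <utility>
#include <vector>

namespace sketch {
  // Dispatch levels sorted in increasing order of preference, as in the quoted header.
  enum class DispatchLevel : int { NONE = 0, AVX512, MAX };

  using DecodeFn = const char* (*)();
  using Implementation = std::pair<DispatchLevel, DecodeFn>;

  // Placeholder implementations that only report which variant was chosen.
  inline const char* decodeDefault() { return "default"; }
  inline const char* decodeAvx512() { return "avx512"; }

  // Assumption: the CPU capability is passed in as a flag instead of being
  // detected at runtime, so the example does not depend on the host CPU.
  inline bool isSupported(DispatchLevel level, bool cpuHasAvx512) {
    if (level == DispatchLevel::NONE) return true;
    return level == DispatchLevel::AVX512 && cpuHasAvx512;
  }

  inline std::vector<Implementation> implementations() {
    return {{DispatchLevel::NONE, decodeDefault}, {DispatchLevel::AVX512, decodeAvx512}};
  }

  // Returns the supported implementation with the highest DispatchLevel.
  inline DecodeFn resolve(const std::vector<Implementation>& impls, bool cpuHasAvx512) {
    Implementation cur{DispatchLevel::NONE, nullptr};
    for (const auto& impl : impls) {
      if (impl.first >= cur.first && isSupported(impl.first, cpuHasAvx512)) {
        cur = impl;  // higher (or same) level than the current choice
      }
    }
    if (cur.second == nullptr) {
      throw std::runtime_error("No appropriate implementation found");
    }
    return cur.second;
  }
}  // namespace sketch
```

Calling `sketch::resolve(sketch::implementations(), cpuHasAvx512)` once and caching the returned pointer (for example in a function-local `static`) keeps the capability check off the hot decode path.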




[GitHub] [orc] coderex2522 commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "coderex2522 (via GitHub)" <gi...@apache.org>.

coderex2522 commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1105356989


##########
c++/src/CpuInfoUtil.hh:
##########
@@ -0,0 +1,110 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_CPUINFOUTIL_HH
+#define ORC_CPUINFOUTIL_HH
+
+#include <cstdint>
+#include <memory>
+#include <string>
+
+namespace orc {
+
+  /// CpuInfo is an interface to query for cpu information at runtime.  The caller can
+  /// ask for the sizes of the caches and what hardware features are supported.
+  /// On Linux, this information is pulled from a couple of sys files (/proc/cpuinfo and
+  /// /sys/devices)

Review Comment:
   Please use the standard multi-line comment style.




[GitHub] [orc] coderex2522 commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "coderex2522 (via GitHub)" <gi...@apache.org>.

coderex2522 commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1105379569


##########
c++/src/CpuInfoUtil.hh:
##########
@@ -0,0 +1,110 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_CPUINFOUTIL_HH
+#define ORC_CPUINFOUTIL_HH
+
+#include <cstdint>
+#include <memory>
+#include <string>
+
+namespace orc {
+
+  /// CpuInfo is an interface to query for cpu information at runtime.  The caller can
+  /// ask for the sizes of the caches and what hardware features are supported.
+  /// On Linux, this information is pulled from a couple of sys files (/proc/cpuinfo and
+  /// /sys/devices)
+  class CpuInfo {
+   public:
+    ~CpuInfo();
+
+    /// x86 features
+    static constexpr int64_t SSSE3 = (1LL << 0);
+    static constexpr int64_t SSE4_1 = (1LL << 1);
+    static constexpr int64_t SSE4_2 = (1LL << 2);
+    static constexpr int64_t POPCNT = (1LL << 3);
+    static constexpr int64_t AVX = (1LL << 4);
+    static constexpr int64_t AVX2 = (1LL << 5);
+    static constexpr int64_t AVX512F = (1LL << 6);
+    static constexpr int64_t AVX512CD = (1LL << 7);
+    static constexpr int64_t AVX512VL = (1LL << 8);
+    static constexpr int64_t AVX512DQ = (1LL << 9);
+    static constexpr int64_t AVX512BW = (1LL << 10);
+    static constexpr int64_t AVX512 = AVX512F | AVX512CD | AVX512VL | AVX512DQ | AVX512BW;
+    static constexpr int64_t BMI1 = (1LL << 11);
+    static constexpr int64_t BMI2 = (1LL << 12);
+
+    /// Arm features
+    static constexpr int64_t ASIMD = (1LL << 32);
+
+    /// Cache enums for L1 (data), L2 and L3
+    enum class CacheLevel { L1 = 0, L2, L3, Last = L3 };
+
+    /// CPU vendors
+    enum class Vendor { Unknown, Intel, AMD };
+
+    static const CpuInfo* GetInstance();
+
+    /// Returns all the flags for this cpu
+    int64_t hardwareFlags() const;
+
+    /// Returns the number of cores (including hyper-threaded) on this machine.
+    int numCores() const;
+
+    /// Returns the vendor of the cpu.
+    Vendor vendor() const;
+
+    /// Returns the model name of the cpu (e.g. Intel i7-2600)
+    const std::string& modelName() const;
+
+    /// Returns the size of the cache in KB at this cache level
+    int64_t CacheSize(CacheLevel level) const;
+
+    /// \brief Returns whether or not the given feature is enabled.
+    ///
+    /// IsSupported() is true if IsDetected() is also true and the feature
+    /// wasn't disabled by the user (for example by setting the ORC_USER_SIMD_LEVEL
+    /// environment variable).
+    bool IsSupported(int64_t flags) const;
+
+    /// Returns whether or not the given feature is available on the CPU.
+    bool IsDetected(int64_t flags) const;
+
+    /// Determine if the CPU meets the minimum CPU requirements and if not, issue an error
+    /// and terminate.
+    void VerifyCpuRequirements() const;
+
+    /// Toggle a hardware feature on and off.  It is not valid to turn on a feature
+    /// that the underlying hardware cannot support. This is useful for testing.
+    // void EnableFeature(int64_t flag, bool enable);
+
+    bool HasEfficientBmi2() const {
+      // BMI2 (pext, pdep) is only efficient on Intel X86 processors.
+      return vendor() == Vendor::Intel && IsSupported(BMI2);
+    }
+
+   private:
+    CpuInfo();
+
+    struct Impl;
+    std::unique_ptr<Impl> impl_;

Review Comment:
   If class `CpuInfo` is only an interface, I suggest removing the `impl_` variable from it.
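
One possible reading of this suggestion, sketched under assumptions: `CpuInfo` becomes a pure interface with no data members, and the state lives in a concrete subclass. The names `sketch`, `FixedCpuInfo`, and `makeCpuInfo` are hypothetical, and the flag values are copied from the quoted header only for illustration.

```cpp
#include <cstdint>
#include <memory>

namespace sketch {
  // Pure interface: it holds no state, so it needs no impl_ pointer.
  class CpuInfo {
   public:
    static constexpr int64_t AVX512F = (1LL << 6);
    static constexpr int64_t BMI2 = (1LL << 12);

    virtual ~CpuInfo() = default;

    // Returns all the flags for this CPU.
    virtual int64_t hardwareFlags() const = 0;

    // True only if every requested flag is present.
    bool isSupported(int64_t flags) const {
      return (hardwareFlags() & flags) == flags;
    }
  };

  // Hypothetical concrete implementation with fixed flags. A real one would
  // query CPUID or /proc/cpuinfo and would live in the .cc file.
  class FixedCpuInfo : public CpuInfo {
   public:
    explicit FixedCpuInfo(int64_t flags) : flags_(flags) {}
    int64_t hardwareFlags() const override { return flags_; }

   private:
    int64_t flags_;
  };

  inline std::unique_ptr<CpuInfo> makeCpuInfo(int64_t flags) {
    return std::make_unique<FixedCpuInfo>(flags);
  }
}  // namespace sketch
```

The trade-off is a virtual call per query instead of a pointer indirection. Keeping `impl_` (the pimpl idiom) is the usual alternative when `CpuInfo` is meant to be a concrete class rather than an interface.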




[GitHub] [orc] wgtmac commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1109449067


##########
c++/test/TestRleVectorDecoder.cc:
##########
@@ -106,7 +106,12 @@ namespace orc {
     int32_t lpad = offset * BARWIDTH / total;
     int32_t rpad = BARWIDTH - lpad;
 
-    printf("\r%s:%3d%% [%.*s%*s] [%ld /%ld]", testName, val, lpad, BARSTR, rpad, "", offset, total);
+#ifdef __APPLE__
+    printf("\r%s:%3d%% [%.*s%*s] [%lld/%lld]", testName, val, lpad, BARSTR, rpad, "", offset,

Review Comment:
   Please do not use `printf`, as different platforms may have different semantics (here, the length modifier needed for a 64-bit integer). Use modern C++ as much as possible; we support C++17 now.
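
A minimal sketch of what a stream-based replacement might look like, assuming the progress line keeps the same layout as the `printf` call. The helper names `formatProgress` and `printProgress`, the `#` fill character, and the `barWidth` parameter (standing in for `BARWIDTH`) are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

// Builds one progress line such as "\rname: 50% [#####     ] [5/10]".
inline std::string formatProgress(const std::string& testName, int64_t offset, int64_t total,
                                  int barWidth = 10) {
  // Avoid division by zero and keep offset within [0, total].
  if (total <= 0) total = 1;
  offset = std::clamp<int64_t>(offset, 0, total);

  const int percent = static_cast<int>(offset * 100 / total);
  const auto lpad = static_cast<std::size_t>(offset * barWidth / total);
  const auto rpad = static_cast<std::size_t>(barWidth) - lpad;

  std::ostringstream oss;
  // operator<< is selected by argument type, so int64_t needs no
  // platform-specific length modifier such as %ld or %lld.
  oss << '\r' << testName << ':' << std::setw(3) << percent << "% [" << std::string(lpad, '#')
      << std::string(rpad, ' ') << "] [" << offset << '/' << total << ']';
  return oss.str();
}

// Writes the progress line to stdout without a trailing newline.
inline void printProgress(const std::string& testName, int64_t offset, int64_t total) {
  std::cout << formatProgress(testName, offset, total) << std::flush;
}
```

Separating formatting from output keeps the `#ifdef __APPLE__` branch unnecessary and makes the formatted line easy to check in a unit test.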




[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1090083421


##########
CMakeLists.txt:
##########
@@ -67,6 +67,10 @@ option(BUILD_CPP_ENABLE_METRICS
     "Enable the metrics collection at compile phase"
     OFF)
 
+option(BUILD_ENABLE_AVX512
+    "Enable AVX512 vector decode of bit-packing"
+    OFF)

Review Comment:
   Well, does this mean we are going to skip GitHub Actions testing on this PR, @wpleonardo?




[GitHub] [orc] wgtmac commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1407864470

   @wpleonardo I'd suggest applying `clang-format -i source_file` to all files that you have changed or added to make the format check happy. You can also set up your IDE to do it automatically. AFAIK, both VSCode and CLion support it.
   
   As for the failure on a specific platform, we can probably disable the feature there in the CMake config for now.



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1092935860


##########
c++/src/RleDecoderV2.cc:
##########
@@ -18,11 +18,35 @@
 
 #include "Adaptor.hh"
 #include "Compression.hh"
+#include "DetectPlatform.hh"
 #include "RLEV2Util.hh"
 #include "RLEv2.hh"
 #include "Utils.hh"
+#include "VectorDecoder.hh"
 
 namespace orc {
+  void RleDecoderV2::resetBufferStart(uint64_t len, bool resetBuf, uint32_t backupByteLen) {
+    uint64_t restLen = bufferEnd - bufferStart;

Review Comment:
   OK, done.




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1092926507


##########
c++/src/RLEv2.hh:
##########
@@ -189,13 +192,45 @@ namespace orc {
       resetReadLongs();
     }
 
+    void resetBufferStart(uint64_t len, bool resetBuf, uint32_t backupLen);
     unsigned char readByte();
 
     int64_t readLongBE(uint64_t bsz);
     int64_t readVslong();
     uint64_t readVulong();
     void readLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs);
-    void plainUnpackLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs);
+    void plainUnpackLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs,
+                          uint64_t& startBit);
+
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+    void unrolledUnpackVector1(int64_t* data, uint64_t offset, uint64_t len);

Review Comment:
   OK, thank you very much. I am modifying this part to follow this suggestion and the earlier one:
   https://github.com/apache/orc/pull/1375#discussion_r1067745787




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1092915841


##########
CMakeLists.txt:
##########
@@ -157,6 +172,139 @@ elseif (MSVC)
   set (WARN_FLAGS "${WARN_FLAGS} -wd4146") # unary minus operator applied to unsigned type, result still unsigned
 endif ()
 
+include(CheckCXXCompilerFlag)
+include(CheckCXXSourceCompiles)
+message(STATUS "System processor: ${CMAKE_SYSTEM_PROCESSOR}")
+
+if(NOT DEFINED ORC_CPU_FLAG)
+  if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|X86|x86|i[3456]86|x64")
+    set(ORC_CPU_FLAG "x86")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|ARM64|arm64")
+    set(ORC_CPU_FLAG "aarch64")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^arm$|armv[4-7]")
+    set(ORC_CPU_FLAG "aarch32")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "powerpc|ppc")
+    set(ORC_CPU_FLAG "ppc")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "s390x")
+    set(ORC_CPU_FLAG "s390x")
+  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "riscv64")
+    set(ORC_CPU_FLAG "riscv64")
+  else()
+    message(FATAL_ERROR "Unknown system processor")
+  endif()
+endif()
+
+# Check architecture specific compiler flags
+if(ORC_CPU_FLAG STREQUAL "x86")
+  # x86/amd64 compiler flags, msvc/gcc/clang
+  if(MSVC)
+    set(ORC_SSE4_2_FLAG "")
+    set(ORC_AVX2_FLAG "/arch:AVX2")
+    set(ORC_AVX512_FLAG "/arch:AVX512")
+    set(CXX_SUPPORTS_SSE4_2 TRUE)
+  else()
+    set(ORC_SSE4_2_FLAG "-msse4.2")
+    set(ORC_AVX2_FLAG "-march=haswell")
+    # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ
+    set(ORC_AVX512_FLAG "-march=native -mbmi2")
+    # Append the avx2/avx512 subset option also, fix issue ORC-9877 for homebrew-cpp

Review Comment:
   Sorry for the bad reference; it has already been deleted.




[GitHub] [orc] wgtmac commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by GitBox <gi...@apache.org>.

wgtmac commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1067721200


##########
CMakeLists.txt:
##########
@@ -67,6 +67,10 @@ option(BUILD_CPP_ENABLE_METRICS
     "Enable the metrics collection at compile phase"
     OFF)
 
+option(ENABLE_AVX512_BIT_PACKING

Review Comment:
   We need at least two levels of control of this setting:
   - An option like this to enable compiling the library with AVX512 enabled.
   - A setting to disable runtime dispatch to AVX512.
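
The two levels of control described above could be sketched roughly as follows. This is only an illustration, not the PR's code: the macro `EXAMPLE_HAVE_AVX512`, the environment variable `EXAMPLE_SIMD_LEVEL`, and both helper functions are hypothetical names chosen for the example.

```cpp
#include <cstdlib>
#include <string>

// Level 1, compile time: EXAMPLE_HAVE_AVX512 is a hypothetical macro that the
// build system would define only when the AVX-512 build option is ON.
#if defined(EXAMPLE_HAVE_AVX512)
constexpr bool kCompiledWithAvx512 = true;
#else
constexpr bool kCompiledWithAvx512 = false;
#endif

// Level 2, runtime: EXAMPLE_SIMD_LEVEL is a hypothetical environment variable.
// The value "NONE" turns off dispatch to the AVX-512 code path.
inline bool runtimeAllowsAvx512(const char* envValue) {
  if (envValue == nullptr) {
    return true;  // variable not set: no runtime restriction
  }
  return std::string(envValue) != "NONE";
}

// AVX-512 is selected only if it was compiled in, the CPU supports it,
// and the user has not disabled it at runtime.
inline bool shouldUseAvx512(bool cpuSupportsAvx512) {
  return kCompiledWithAvx512 && cpuSupportsAvx512 &&
         runtimeAllowsAvx512(std::getenv("EXAMPLE_SIMD_LEVEL"));
}
```

With this shape, a binary built with the option ON can still be forced onto the scalar path on a particular machine without rebuilding.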



##########
CMakeLists.txt:
##########
@@ -67,6 +67,10 @@ option(BUILD_CPP_ENABLE_METRICS
     "Enable the metrics collection at compile phase"
     OFF)
 
+option(ENABLE_AVX512_BIT_PACKING

Review Comment:
   I think `BUILD_ENABLE_AVX512` is enough.



##########
c++/src/DetectPlatform.hh:
##########
@@ -0,0 +1,92 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_DETECTPLATFORM_HH
+#define ORC_DETECTPLATFORM_HH
+
+#if defined(__GNUC__) || defined(__clang__)
+  DIAGNOSTIC_IGNORE("-Wold-style-cast")
+#endif
+
+namespace orc
+{
+#ifdef _WIN32
+
+#include "intrin.h"
+//  Windows CPUID
+#define cpuid(info, x)    __cpuidex(info, x, 0)
+#else
+//  GCC Intrinsics 
+#include <cpuid.h>
+#include <dlfcn.h>
+
+  void cpuid(int info[4], int InfoType) {
+    __cpuid_count(InfoType, 0, info[0], info[1], info[2], info[3]);
+  }
+
+  unsigned long long xgetbv(unsigned int index) {
+    unsigned int eax, edx;
+    __asm__ __volatile__(
+    "xgetbv;"
+    : "=a" (eax), "=d"(edx)
+    : "c" (index)
+    );
+    return ((unsigned long long) edx << 32) | eax;
+  }
+
+#endif
+
+  #define CPUID_AVX512F       0x00100000
+  #define CPUID_AVX512CD      0x00200000
+  #define CPUID_AVX512VL      0x04000000
+  #define CPUID_AVX512BW      0x01000000
+  #define CPUID_AVX512DQ      0x02000000
+  #define EXC_OSXSAVE         0x08000000 // 27th  bit
+
+  #define CPUID_AVX512_MASK (CPUID_AVX512F | CPUID_AVX512CD | CPUID_AVX512VL | CPUID_AVX512BW | CPUID_AVX512DQ)
+
+  enum arch_t {
+    px_arch     = 0,
+    avx2_arch   = 1,
+    avx512_arch = 2
+  };
+
+  arch_t detect_platform() {

Review Comment:
   ```suggestion
     arch_t detectPlatform() {
   ```



##########
c++/src/RLEv2.hh:
##########
@@ -189,13 +192,45 @@ namespace orc {
       resetReadLongs();
     }
 
+    void resetBufferStart(uint64_t len, bool resetBuf, uint32_t backupLen);
     unsigned char readByte();
 
     int64_t readLongBE(uint64_t bsz);
     int64_t readVslong();
     uint64_t readVulong();
     void readLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs);
-    void plainUnpackLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs);
+    void plainUnpackLongs(int64_t *data, uint64_t offset, uint64_t len, uint64_t fbs,
+                        uint64_t& startBit);
+
+#if ENABLE_AVX512
+    void unrolledUnpackVector1(int64_t *data, uint64_t offset, uint64_t len);

Review Comment:
   Could you refactor this a little so that all implementations share the same function signatures and the call is dispatched to the appropriate one? That would make it easy to add support for other SIMD implementations. You may want to check this for reference: https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bpacking.h
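
A minimal sketch of that kind of dispatch, assuming a single function-pointer type shared by all implementations. The names `UnpackFn`, `unpack8Default`, `unpack8Simd`, and `selectUnpack8` are hypothetical and are not taken from ORC or Arrow; the "SIMD" variant is only a stand-in that forwards to the scalar code.

```cpp
#include <cstdint>

// One shared signature for every bit-unpacking implementation.
using UnpackFn = void (*)(const uint8_t* in, int64_t* out, uint64_t len);

// Portable fallback: widens each 8-bit input value to a 64-bit output value.
inline void unpack8Default(const uint8_t* in, int64_t* out, uint64_t len) {
  for (uint64_t i = 0; i < len; ++i) {
    out[i] = in[i];
  }
}

// Stand-in for a SIMD implementation. A real one would use vector
// intrinsics; here it forwards to the fallback so the sketch stays portable.
inline void unpack8Simd(const uint8_t* in, int64_t* out, uint64_t len) {
  unpack8Default(in, out, len);
}

// Chooses the implementation once; callers only ever see UnpackFn.
inline UnpackFn selectUnpack8(bool simdAvailable) {
  return simdAvailable ? &unpack8Simd : &unpack8Default;
}
```

Because callers hold only an `UnpackFn`, supporting another instruction set would mean adding one more function with the same signature and one more branch in the selector.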



##########
c++/src/DetectPlatform.hh:
##########
@@ -0,0 +1,92 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_DETECTPLATFORM_HH
+#define ORC_DETECTPLATFORM_HH
+
+#if defined(__GNUC__) || defined(__clang__)
+  DIAGNOSTIC_IGNORE("-Wold-style-cast")
+#endif
+
+namespace orc
+{
+#ifdef _WIN32
+
+#include "intrin.h"
+//  Windows CPUID
+#define cpuid(info, x)    __cpuidex(info, x, 0)
+#else
+//  GCC Intrinsics 
+#include <cpuid.h>
+#include <dlfcn.h>
+
+  void cpuid(int info[4], int InfoType) {
+    __cpuid_count(InfoType, 0, info[0], info[1], info[2], info[3]);
+  }
+
+  unsigned long long xgetbv(unsigned int index) {
+    unsigned int eax, edx;
+    __asm__ __volatile__(
+    "xgetbv;"
+    : "=a" (eax), "=d"(edx)
+    : "c" (index)
+    );
+    return ((unsigned long long) edx << 32) | eax;
+  }
+
+#endif
+
+  #define CPUID_AVX512F       0x00100000
+  #define CPUID_AVX512CD      0x00200000
+  #define CPUID_AVX512VL      0x04000000
+  #define CPUID_AVX512BW      0x01000000
+  #define CPUID_AVX512DQ      0x02000000
+  #define EXC_OSXSAVE         0x08000000 // 27th  bit
+
+  #define CPUID_AVX512_MASK (CPUID_AVX512F | CPUID_AVX512CD | CPUID_AVX512VL | CPUID_AVX512BW | CPUID_AVX512DQ)
+
+  enum arch_t {
+    px_arch     = 0,
+    avx2_arch   = 1,
+    avx512_arch = 2
+  };
+
+  arch_t detect_platform() {
+    arch_t detected_platform = arch_t::px_arch;
+    int    cpu_info[4];

Review Comment:
   Please fix the other variables with similar naming as well.



##########
c++/src/DetectPlatform.hh:
##########
@@ -0,0 +1,92 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_DETECTPLATFORM_HH
+#define ORC_DETECTPLATFORM_HH
+
+#if defined(__GNUC__) || defined(__clang__)
+  DIAGNOSTIC_IGNORE("-Wold-style-cast")
+#endif
+
+namespace orc
+{
+#ifdef _WIN32
+
+#include "intrin.h"
+//  Windows CPUID
+#define cpuid(info, x)    __cpuidex(info, x, 0)
+#else
+//  GCC Intrinsics 
+#include <cpuid.h>
+#include <dlfcn.h>
+
+  void cpuid(int info[4], int InfoType) {
+    __cpuid_count(InfoType, 0, info[0], info[1], info[2], info[3]);
+  }
+
+  unsigned long long xgetbv(unsigned int index) {
+    unsigned int eax, edx;
+    __asm__ __volatile__(
+    "xgetbv;"
+    : "=a" (eax), "=d"(edx)
+    : "c" (index)
+    );
+    return ((unsigned long long) edx << 32) | eax;
+  }
+
+#endif
+
+  #define CPUID_AVX512F       0x00100000
+  #define CPUID_AVX512CD      0x00200000
+  #define CPUID_AVX512VL      0x04000000
+  #define CPUID_AVX512BW      0x01000000
+  #define CPUID_AVX512DQ      0x02000000
+  #define EXC_OSXSAVE         0x08000000 // 27th  bit
+
+  #define CPUID_AVX512_MASK (CPUID_AVX512F | CPUID_AVX512CD | CPUID_AVX512VL | CPUID_AVX512BW | CPUID_AVX512DQ)
+
+  enum arch_t {

Review Comment:
   Please use `enum class` and match the naming style.
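
As a rough illustration of this request, the unscoped `arch_t` could become a scoped enumeration along these lines. The enumerator spellings and the two boolean parameters are assumptions made for the example; a real detector would derive them from CPUID rather than take them as arguments.

```cpp
// Scoped enumeration: enumerators must be qualified (Arch::AVX512_ARCH)
// and do not convert implicitly to int.
enum class Arch { PX_ARCH = 0, AVX2_ARCH = 1, AVX512_ARCH = 2 };

// Picks the widest supported level. The feature flags are passed in here
// only to keep the sketch self-contained and testable.
inline Arch detectPlatform(bool hasAvx2, bool hasAvx512) {
  Arch detectedPlatform = Arch::PX_ARCH;
  if (hasAvx512) {
    detectedPlatform = Arch::AVX512_ARCH;
  } else if (hasAvx2) {
    detectedPlatform = Arch::AVX2_ARCH;
  }
  return detectedPlatform;
}
```

The camelCase function and variable names follow the renames suggested elsewhere in this review (`detectPlatform`, `cpuInfo`).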



##########
c++/src/DetectPlatform.hh:
##########
@@ -0,0 +1,92 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_DETECTPLATFORM_HH
+#define ORC_DETECTPLATFORM_HH
+
+#if defined(__GNUC__) || defined(__clang__)
+  DIAGNOSTIC_IGNORE("-Wold-style-cast")
+#endif
+
+namespace orc
+{
+#ifdef _WIN32
+
+#include "intrin.h"
+//  Windows CPUID
+#define cpuid(info, x)    __cpuidex(info, x, 0)
+#else
+//  GCC Intrinsics 
+#include <cpuid.h>
+#include <dlfcn.h>
+
+  void cpuid(int info[4], int InfoType) {
+    __cpuid_count(InfoType, 0, info[0], info[1], info[2], info[3]);
+  }
+
+  unsigned long long xgetbv(unsigned int index) {
+    unsigned int eax, edx;
+    __asm__ __volatile__(
+    "xgetbv;"
+    : "=a" (eax), "=d"(edx)
+    : "c" (index)
+    );
+    return ((unsigned long long) edx << 32) | eax;
+  }
+
+#endif
+
+  #define CPUID_AVX512F       0x00100000
+  #define CPUID_AVX512CD      0x00200000
+  #define CPUID_AVX512VL      0x04000000
+  #define CPUID_AVX512BW      0x01000000
+  #define CPUID_AVX512DQ      0x02000000
+  #define EXC_OSXSAVE         0x08000000 // 27th  bit
+
+  #define CPUID_AVX512_MASK (CPUID_AVX512F | CPUID_AVX512CD | CPUID_AVX512VL | CPUID_AVX512BW | CPUID_AVX512DQ)
+
+  enum arch_t {
+    px_arch     = 0,
+    avx2_arch   = 1,
+    avx512_arch = 2
+  };
+
+  arch_t detect_platform() {
+    arch_t detected_platform = arch_t::px_arch;
+    int    cpu_info[4];

Review Comment:
   ```suggestion
       int    cpuInfo[4];
   ```



##########
c++/src/VectorDecoder.hh:
##########
@@ -0,0 +1,506 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef VECTOR_DECODER_HH
+#define VECTOR_DECODER_HH
+
+#include <immintrin.h>

Review Comment:
   This include should also be guarded by a macro, as the header is not always available.
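
One way such a guard might look, as a sketch only. `EXAMPLE_HAVE_AVX512` and `EXAMPLE_USE_IMMINTRIN` are hypothetical macros; the architecture macros are the usual predefined ones for GCC/Clang and MSVC.

```cpp
// Include <immintrin.h> only when the (hypothetical) build macro is set and
// the target is x86 or x86-64, where the header is expected to exist.
#if defined(EXAMPLE_HAVE_AVX512) &&                                    \
    (defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || \
     defined(_M_IX86))
#include <immintrin.h>
#define EXAMPLE_USE_IMMINTRIN 1
#else
#define EXAMPLE_USE_IMMINTRIN 0
#endif

// Lets other code ask, at compile time, whether the intrinsics header
// was actually pulled in.
constexpr bool hasImmintrin() {
  return EXAMPLE_USE_IMMINTRIN == 1;
}
```

Code that uses the intrinsics would then sit behind the same `EXAMPLE_USE_IMMINTRIN` check, so non-x86 builds never see the header or the vector types.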



##########
c++/src/RleDecoderV2.cc:
##########
@@ -67,6 +91,147 @@ namespace orc {
   }
 
   void RleDecoderV2::readLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs) {
+    uint64_t startBit = 0;
+#if ENABLE_AVX512

Review Comment:
   It would be better to be able to disable it at runtime.



##########
c++/src/DetectPlatform.hh:
##########
@@ -0,0 +1,92 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_DETECTPLATFORM_HH
+#define ORC_DETECTPLATFORM_HH
+
+#if defined(__GNUC__) || defined(__clang__)
+  DIAGNOSTIC_IGNORE("-Wold-style-cast")
+#endif
+
+namespace orc
+{
+#ifdef _WIN32
+
+#include "intrin.h"
+//  Windows CPUID
+#define cpuid(info, x)    __cpuidex(info, x, 0)
+#else
+//  GCC Intrinsics 
+#include <cpuid.h>
+#include <dlfcn.h>
+
+  void cpuid(int info[4], int InfoType) {
+    __cpuid_count(InfoType, 0, info[0], info[1], info[2], info[3]);
+  }
+
+  unsigned long long xgetbv(unsigned int index) {
+    unsigned int eax, edx;
+    __asm__ __volatile__(
+    "xgetbv;"
+    : "=a" (eax), "=d"(edx)
+    : "c" (index)
+    );
+    return ((unsigned long long) edx << 32) | eax;
+  }
+
+#endif
+
+  #define CPUID_AVX512F       0x00100000
+  #define CPUID_AVX512CD      0x00200000
+  #define CPUID_AVX512VL      0x04000000
+  #define CPUID_AVX512BW      0x01000000
+  #define CPUID_AVX512DQ      0x02000000
+  #define EXC_OSXSAVE         0x08000000 // 27th  bit
+
+  #define CPUID_AVX512_MASK (CPUID_AVX512F | CPUID_AVX512CD | CPUID_AVX512VL | CPUID_AVX512BW | CPUID_AVX512DQ)
+
+  enum arch_t {
+    px_arch     = 0,
+    avx2_arch   = 1,
+    avx512_arch = 2
+  };
+
+  arch_t detect_platform() {

Review Comment:
   Can you add a test?



##########
CMakeLists.txt:
##########
@@ -67,6 +67,10 @@ option(BUILD_CPP_ENABLE_METRICS
     "Enable the metrics collection at compile phase"
     OFF)
 
+option(ENABLE_AVX512_BIT_PACKING
+    "Enable AVX512 vector decode of bit-packing"
+     OFF)

Review Comment:
   Can we do something like the following to check at compile time whether the CPU supports AVX512?
   
   https://github.com/apache/arrow/blob/master/cpp/cmake_modules/SetupCxxFlags.cmake#L45
   
   Then we can enable it by default and let users disable it through an option.




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1138674070


##########
c++/src/CpuInfoUtil.hh:
##########
@@ -0,0 +1,109 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_CPUINFOUTIL_HH
+#define ORC_CPUINFOUTIL_HH
+
+#include <cstdint>
+#include <memory>
+#include <string>
+
+namespace orc {
+
+  /**
+   * CpuInfo is an interface to query for cpu information at runtime.  The caller can
+   * ask for the sizes of the caches and what hardware features are supported.
+   * On Linux, this information is pulled from a couple of sys files (/proc/cpuinfo and
+   * /sys/devices)
+   */
+  class CpuInfo {

Review Comment:
   I have already added a comment noting that this code is borrowed from Apache Arrow:
   https://github.com/wpleonardo/orc/blob/fe5b6c7a29721bb5d8c4699a0b072d64555d600d/c%2B%2B/src/CpuInfoUtil.hh#L19
   Are there any license issues with that?




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1138653547


##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,4318 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#if defined(ORC_HAVE_RUNTIME_AVX512)

Review Comment:
   Removed.



##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,4318 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#if defined(ORC_HAVE_RUNTIME_AVX512)
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"

Review Comment:
   Done.




[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1138653893


##########
c++/test/CMakeLists.txt:
##########
@@ -42,6 +42,7 @@ add_executable (orc-test
   TestReader.cc
   TestRleDecoder.cc
   TestRleEncoder.cc
+  TestRleVectorDecoder.cc

Review Comment:
   Done.




[GitHub] [orc] wgtmac commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1473147503

   The Windows SIMD test is failing: https://github.com/apache/orc/actions/runs/4444354324/jobs/7802486868?pr=1375 @wpleonardo
   ```
   [----------] 54 tests from OrcTest/RleV2BitUnpackAvx512Test
   [ RUN      ] OrcTest/RleV2BitUnpackAvx512Test.RleV2_basic_vector_decode_1bit/0
   unknown file: error: SEH exception with code 0xc000001d thrown in the test body.
   [  FAILED  ] OrcTest/RleV2BitUnpackAvx512Test.RleV2_basic_vector_decode_1bit/0, where GetParam() = true (2 ms)
   [ RUN      ] OrcTest/RleV2BitUnpackAvx512Test.RleV2_basic_vector_decode_1bit/1
   unknown file: error: SEH exception with code 0xc000001d thrown in the test body.
   [  FAILED  ] OrcTest/RleV2BitUnpackAvx512Test.RleV2_basic_vector_decode_1bit/1, where GetParam() = false (1 ms)
   [ RUN      ] OrcTest/RleV2BitUnpackAvx512Test.RleV2_basic_vector_decode_2bit/0
   unknown file: error: SEH exception with code 0xc000001d thrown in the test body.
   [  FAILED  ] OrcTest/RleV2BitUnpackAvx512Test.RleV2_basic_vector_decode_2bit/0, where GetParam() = true (1 ms)
   [ RUN      ] OrcTest/RleV2BitUnpackAvx512Test.RleV2_basic_vector_decode_2bit/1
   unknown file: error: SEH exception with code 0xc000001d thrown in the test body.
   [  FAILED  ] OrcTest/RleV2BitUnpackAvx512Test.RleV2_basic_vector_decode_2bit/1, where GetParam() = false (1 ms)
   ```



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1141576795


##########
c++/src/BpackingAvx512.hh:
##########
@@ -0,0 +1,93 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKINGAVX512_HH
+#define ORC_BPACKINGAVX512_HH
+
+#include <stdlib.h>
+#include <cstdint>
+
+#include "BpackingDefault.hh"
+#include "Dispatch.hh"
+#include "RLEv2.hh"

Review Comment:
   Removed the following includes:
   #include "Dispatch.hh"
   #include "RLEv2.hh"
   #include "io/InputStream.hh"
   #include "io/OutputStream.hh"




[GitHub] [orc] wgtmac commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1484462106

   > The CI test failed because the machine doesn't support AVX512. Maybe we should run these CI SIMD tests on AVX512 machines. https://github.com/apache/orc/actions/runs/4528477658/jobs/7975338899?pr=1375#step:3:41
   
   Could we make it robust? This is likely to happen again in the future, which would get in the way of code review.
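
One possible way to make such tests robust is to probe the CPU at runtime and skip the AVX-512 cases when the instructions are unavailable, instead of letting them fault with an illegal-instruction error. The sketch below assumes GCC or Clang on x86, where `__builtin_cpu_supports` exists; `cpuHasAvx512f`, `TestOutcome`, and `runIfSupported` are hypothetical helpers and not part of ORC or GoogleTest. It checks only the AVX-512 Foundation bit, which is a simplification of a full feature check.

```cpp
// Returns true only when the running CPU reports AVX-512 Foundation support.
// On other compilers or architectures it conservatively returns false.
inline bool cpuHasAvx512f() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  return __builtin_cpu_supports("avx512f") != 0;
#else
  return false;
#endif
}

enum class TestOutcome { RAN, SKIPPED };

// Runs the test body only when the required feature is present.
template <typename Body>
TestOutcome runIfSupported(bool supported, Body body) {
  if (!supported) {
    return TestOutcome::SKIPPED;  // never executes AVX-512 instructions
  }
  body();
  return TestOutcome::RAN;
}
```

A test would call `runIfSupported(cpuHasAvx512f(), ...)` and report the skipped outcome, so a CI runner without AVX-512 produces a skip rather than a failure.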



[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1144372473


##########
c++/src/BpackingAvx512.hh:
##########
@@ -0,0 +1,89 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_BPACKINGAVX512_HH
+#define ORC_BPACKINGAVX512_HH
+
+#include <stdlib.h>
+#include <cstdint>
+
+#include "BpackingDefault.hh"
+
+namespace orc {
+
+#define MAX_VECTOR_BUF_8BIT_LENGTH 64
+#define MAX_VECTOR_BUF_16BIT_LENGTH 32
+#define MAX_VECTOR_BUF_32BIT_LENGTH 16
+
+  class UnpackAvx512 {
+   public:
+    UnpackAvx512(RleDecoderV2* dec);
+    ~UnpackAvx512();
+
+    void vectorUnpack1(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack2(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack3(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack4(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack5(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack6(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack7(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack9(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack10(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack11(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack12(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack13(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack14(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack15(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack16(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack17(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack18(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack19(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack20(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack21(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack22(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack23(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack24(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack26(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack28(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack30(int64_t* data, uint64_t offset, uint64_t len);
+    void vectorUnpack32(int64_t* data, uint64_t offset, uint64_t len);
+
+    void plainUnpackLongs(int64_t* data, uint64_t offset, uint64_t len, uint64_t fbs,
+                          uint64_t& startBit);
+
+   private:
+    RleDecoderV2* decoder;
+    UnpackDefault unpackDefault;
+
+    // Used by vectorially 1~8 bit-unpacking data
+    uint8_t vectorBuf8[MAX_VECTOR_BUF_8BIT_LENGTH + 1];
+    // Used by vectorially 9~16 bit-unpacking data
+    uint16_t vectorBuf16[MAX_VECTOR_BUF_16BIT_LENGTH + 1];
+    // Used by vectorially 17~32 bit-unpacking data
+    uint32_t vectorBuf32[MAX_VECTOR_BUF_32BIT_LENGTH + 1];

Review Comment:
   Done. I have removed the redundant array buffers.
   https://github.com/wpleonardo/orc/blob/f053f9c73bf13fe29aff95cfe4cb71857c57da07/c%2B%2B/src/BpackingAvx512.hh#L76
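
For context, the three buffer-length macros in the quoted header (64, 32 and 16) appear to match the number of 8-, 16- and 32-bit lanes in one 512-bit register. A minimal sketch of that arithmetic follows; `kVectorBits` and `lanesPerVector` are hypothetical names used only for illustration and are not part of the PR.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: one AVX-512 register is 512 bits wide.
constexpr std::size_t kVectorBits = 512;

// Number of lanes of type T that fit in one 512-bit register.
template <typename T>
constexpr std::size_t lanesPerVector() {
  return kVectorBits / (sizeof(T) * 8);
}

// These values coincide with the MAX_VECTOR_BUF_*_LENGTH macros in the hunk.
static_assert(lanesPerVector<uint8_t>() == 64, "8-bit lanes");
static_assert(lanesPerVector<uint16_t>() == 32, "16-bit lanes");
static_assert(lanesPerVector<uint32_t>() == 16, "32-bit lanes");
```

The checks run at compile time, so a mismatch between a buffer length and the lane count would fail the build rather than surface at run time.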



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] wpleonardo commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1151359551


##########
c++/src/BpackingAvx512.cc:
##########
@@ -0,0 +1,4476 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "BpackingAvx512.hh"
+#include "BitUnpackerAvx512.hh"
+#include "CpuInfoUtil.hh"
+#include "RLEv2.hh"
+
+namespace orc {
+  UnpackAvx512::UnpackAvx512(RleDecoderV2* dec) : decoder(dec), unpackDefault(UnpackDefault(dec)) {
+    // PASS
+  }
+
+  UnpackAvx512::~UnpackAvx512() {
+    // PASS
+  }
+
+  void UnpackAvx512::vectorUnpack1(int64_t* data, uint64_t offset, uint64_t len) {
+    uint32_t bitWidth = 1;
+    const uint8_t* srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+    uint32_t numElements = 0;
+    int64_t* dstPtr = data + offset;
+    uint64_t bufMoveByteLen = 0;
+    uint64_t bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+    bool resetBuf = false;
+    uint64_t startBit = 0;
+    uint64_t tailBitLen = 0;
+    uint32_t backupByteLen = 0;
+
+    while (len > 0) {
+      if (startBit != 0) {
+        bufMoveByteLen +=
+            moveLen(len * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH, ORC_VECTOR_BYTE_WIDTH);
+      } else {
+        bufMoveByteLen += moveLen(len * bitWidth, ORC_VECTOR_BYTE_WIDTH);
+      }
+
+      if (bufMoveByteLen <= bufRestByteLen) {
+        numElements = len;
+        resetBuf = false;
+        len -= numElements;
+      } else {
+        if (startBit != 0) {
+          numElements =
+              (bufRestByteLen * ORC_VECTOR_BYTE_WIDTH + ORC_VECTOR_BYTE_WIDTH - startBit) /
+              bitWidth;
+          len -= numElements;
+          tailBitLen = fmod(
+              bufRestByteLen * ORC_VECTOR_BYTE_WIDTH + ORC_VECTOR_BYTE_WIDTH - startBit, bitWidth);
+          resetBuf = true;
+        } else {
+          numElements = (bufRestByteLen * ORC_VECTOR_BYTE_WIDTH) / bitWidth;
+          len -= numElements;
+          tailBitLen = fmod(bufRestByteLen * ORC_VECTOR_BYTE_WIDTH, bitWidth);
+          resetBuf = true;
+        }
+      }
+
+      if (tailBitLen != 0) {
+        backupByteLen = tailBitLen / ORC_VECTOR_BYTE_WIDTH;
+        tailBitLen = 0;
+      }
+
+      if (startBit > 0) {
+        uint32_t align = getAlign(startBit, bitWidth, 8);
+        if (align > numElements) {
+          align = numElements;
+        }
+        if (align != 0) {
+          bufMoveByteLen -=
+              moveLen(align * bitWidth + startBit - ORC_VECTOR_BYTE_WIDTH, ORC_VECTOR_BYTE_WIDTH);
+          plainUnpackLongs(dstPtr, 0, align, bitWidth, startBit);
+          srcPtr = reinterpret_cast<const uint8_t*>(decoder->bufferStart);
+          bufRestByteLen = decoder->bufferEnd - decoder->bufferStart;
+          dstPtr += align;
+          numElements -= align;
+        }
+      }

Review Comment:
   Thank you for the reminder.
   I have added two inline functions to optimize the bit-unpacking code path. Please take a look.
   https://github.com/wpleonardo/orc/blob/0d59f902680ae279fbcd49e44696ad2f84e1264e/c%2B%2B/src/BpackingAvx512.hh#L74
   https://github.com/wpleonardo/orc/blob/0d59f902680ae279fbcd49e44696ad2f84e1264e/c%2B%2B/src/BpackingAvx512.hh#L80
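
The element-count arithmetic in the quoted hunk can be sketched as a small helper. This is only an illustration under stated assumptions: `ORC_VECTOR_BYTE_WIDTH` is taken to be the number of bits per byte (8), `startBit` is taken to be less than 8, and `Capacity`, `kBitsPerByte` and `elementsAvailable` are hypothetical names, not the two inline functions added in the PR. It uses integer `%` where the hunk calls `fmod`; for non-negative integer operands of this size the two give the same remainder.

```cpp
#include <cstdint>

// Assumption: ORC_VECTOR_BYTE_WIDTH in the PR means bits per byte (8).
constexpr uint64_t kBitsPerByte = 8;

struct Capacity {
  uint64_t numElements;  // whole values that can be decoded from the buffer
  uint64_t tailBits;     // leftover bits that do not form a whole value
};

// restBytes: unread bytes left in the buffer.
// startBit:  nonzero when part of an already-read byte is still pending.
// bitWidth:  width of each packed value in bits (must be nonzero).
inline Capacity elementsAvailable(uint64_t restBytes, uint64_t startBit, uint32_t bitWidth) {
  uint64_t bits = restBytes * kBitsPerByte;
  if (startBit != 0) {
    bits += kBitsPerByte - startBit;  // same adjustment as in the quoted hunk
  }
  return Capacity{bits / bitWidth, bits % bitWidth};
}
```

For example, with 10 unread bytes, `startBit == 0` and 3-bit values, 80 bits are available, which gives 26 whole values and 2 tail bits.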




[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1491145274

   @stiga-huang Are there any other code review comments? Thank you very much!



[GitHub] [orc] stiga-huang commented on a diff in pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "stiga-huang (via GitHub)" <gi...@apache.org>.

stiga-huang commented on code in PR #1375:
URL: https://github.com/apache/orc/pull/1375#discussion_r1172453769


##########
c++/src/RLEv2.hh:
##########
@@ -166,6 +166,50 @@ namespace orc {
 
     void next(int16_t* data, uint64_t numValues, const char* notNull) override;
 
+    unsigned char readByte();
+
+    void setBufStart(char* start) {

Review Comment:
   The parameter type can be `const char*`. The same applies to the parameter of `setBufEnd()`.



##########
c++/src/RLEv2.hh:
##########
@@ -220,17 +251,40 @@ namespace orc {
 
     const std::unique_ptr<SeekableInputStream> inputStream;
     const bool isSigned;
-
     unsigned char firstByte;
-    uint64_t runLength;  // Length of the current run
-    uint64_t runRead;    // Number of returned values of the current run
-    const char* bufferStart;
-    const char* bufferEnd;
-    uint32_t bitsLeft;                  // Used by readLongs when bitSize < 8
-    uint32_t curByte;                   // Used by anything that uses readLongs
+    char* bufferStart;
+    char* bufferEnd;
+    uint64_t runLength;                 // Length of the current run
+    uint64_t runRead;                   // Number of returned values of the current run
+    uint32_t bitsLeft;                  // Used by readLongs when bitSize < 8
+    uint32_t curByte;                   // Used by anything that uses readLongs
     DataBuffer<int64_t> unpackedPatch;  // Used by PATCHED_BASE
     DataBuffer<int64_t> literals;       // Values of the current run
   };
+
+  inline void RleDecoderV2::resetBufferStart(uint64_t len, bool resetBuf, uint32_t backupByteLen) {
+    char* bufStart = getBufStart();
+    uint64_t remainingLen = bufLength();
+    int bufferLength = 0;
+    const void* bufferPointer = nullptr;
+
+    if (backupByteLen != 0) {
+      inputStream->BackUp(backupByteLen);
+    }
+
+    if (len >= remainingLen && resetBuf) {
+      if (!inputStream->Next(&bufferPointer, &bufferLength)) {
+        throw ParseError("bad read in RleDecoderV2::resetBufferStart");
+      }
+    }
+
+    if (bufferPointer == nullptr) {
+      setBufStart(bufStart + len);
+    } else {
+      setBufStart(const_cast<char*>(static_cast<const char*>(bufferPointer)));

Review Comment:
   If the parameter type of `setBufStart()` is `const char*`, we don't need these casts. Also, a class's own methods can access its fields directly and don't need these get/set wrappers; only external callers need them.
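
A minimal sketch of this suggestion, using a hypothetical `BufferCursor` class as a stand-in for the decoder (it is not the actual `RleDecoderV2`, and `resetTo` is an invented name). With `const char*` fields and setter parameters, a pointer received as `const void*` needs only a `static_cast`, and member code assigns the fields directly.

```cpp
#include <cstdint>

// Hypothetical stand-in for the decoder; illustration only.
class BufferCursor {
 public:
  // Accessors intended for external callers.
  void setBufStart(const char* start) { bufferStart = start; }
  void setBufEnd(const char* end) { bufferEnd = end; }
  const char* getBufStart() const { return bufferStart; }
  uint64_t bufLength() const { return static_cast<uint64_t>(bufferEnd - bufferStart); }

  // Member code touches the fields directly. `chunk` and `length` stand for
  // the const void* and int that a stream read would hand back.
  void resetTo(const void* chunk, int length) {
    bufferStart = static_cast<const char*>(chunk);  // no const_cast required
    bufferEnd = bufferStart + length;
  }

 private:
  const char* bufferStart = nullptr;
  const char* bufferEnd = nullptr;
};
```

Keeping the fields `const char*` also matches the declarations that the hunk removes (`const char* bufferStart; const char* bufferEnd;`).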




[GitHub] [orc] wpleonardo commented on pull request #1375: ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

Posted by "wpleonardo (via GitHub)" <gi...@apache.org>.

wpleonardo commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1493509224

   Hi @stiga-huang, CI has passed. Could you please review my PR? Thank you very much!



Re: [PR] ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode [orc]

Posted by "taiyang-li (via GitHub)" <gi...@apache.org>.

taiyang-li commented on PR #1375:
URL: https://github.com/apache/orc/pull/1375#issuecomment-1756715851

   @wpleonardo I tried it, but I still see no improvement.
   
   ```
   orc file(snappy + unaligned) + avx512
   0 rows in set. Elapsed: 3.478 sec. Processed 1.20 million rows, 539.37 MB (345.98 thousand rows/s., 155.08 MB/s.)
   0 rows in set. Elapsed: 3.424 sec. Processed 1.20 million rows, 539.37 MB (351.44 thousand rows/s., 157.53 MB/s.)
   0 rows in set. Elapsed: 3.444 sec. Processed 1.20 million rows, 539.37 MB (349.44 thousand rows/s., 156.63 MB/s.)
   
   
   orc file (snappy + unaligned) +  none
   0 rows in set. Elapsed: 3.362 sec. Processed 1.20 million rows, 539.37 MB (357.89 thousand rows/s., 160.42 MB/s.)
   0 rows in set. Elapsed: 3.535 sec. Processed 1.20 million rows, 539.37 MB (340.43 thousand rows/s., 152.59 MB/s.)
   0 rows in set. Elapsed: 3.370 sec. Processed 1.20 million rows, 539.37 MB (357.08 thousand rows/s., 160.06 MB/s.)
    
   
   orc file (lz4 + unaligned) + avx512
   0 rows in set. Elapsed: 3.075 sec. Processed 1.20 million rows, 1.90 GB (391.26 thousand rows/s., 618.31 MB/s.)
   0 rows in set. Elapsed: 3.082 sec. Processed 1.20 million rows, 1.90 GB (390.46 thousand rows/s., 617.05 MB/s.)
   0 rows in set. Elapsed: 3.014 sec. Processed 1.20 million rows, 1.90 GB (399.18 thousand rows/s., 630.82 MB/s.)
   
   
   orc file (lz4 + unaligned) + none 
   0 rows in set. Elapsed: 2.973 sec. Processed 1.20 million rows, 1.90 GB (404.76 thousand rows/s., 639.64 MB/s.)
   0 rows in set. Elapsed: 3.070 sec. Processed 1.20 million rows, 1.90 GB (391.90 thousand rows/s., 619.32 MB/s.)
   0 rows in set. Elapsed: 2.903 sec. Processed 1.20 million rows, 1.90 GB (414.51 thousand rows/s., 655.05 MB/s.)
   ```
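
To compare the four configurations, the elapsed times above can be averaged per configuration. The sketch below copies the twelve elapsed values from the output; `mean` and `percentChange` are hypothetical helper names. With three runs each, the differences between the AVX-512 and baseline means are within a few percent, which is consistent with the "no improvement" observation but is not a statistically rigorous comparison.

```cpp
#include <array>
#include <numeric>

// Elapsed seconds copied from the benchmark output above (three runs each).
constexpr std::array<double, 3> kSnappyAvx512 = {3.478, 3.424, 3.444};
constexpr std::array<double, 3> kSnappyNone = {3.362, 3.535, 3.370};
constexpr std::array<double, 3> kLz4Avx512 = {3.075, 3.082, 3.014};
constexpr std::array<double, 3> kLz4None = {2.973, 3.070, 2.903};

// Arithmetic mean of the three runs.
inline double mean(const std::array<double, 3>& runs) {
  return std::accumulate(runs.begin(), runs.end(), 0.0) / runs.size();
}

// Change of `candidate` relative to `baseline`, in percent.
// A positive result means the candidate took longer on average.
inline double percentChange(const std::array<double, 3>& baseline,
                            const std::array<double, 3>& candidate) {
  return (mean(candidate) - mean(baseline)) / mean(baseline) * 100.0;
}
```

Calling `percentChange(kSnappyNone, kSnappyAvx512)` and `percentChange(kLz4None, kLz4Avx512)` gives the relative difference for each codec.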

