You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/30 02:28:17 UTC

[GitHub] [arrow] shanhuuang opened a new pull request #10627: ARROW-13206: [C++][Parquet] Basic support for encoding DELTA_BINARY_P…

shanhuuang opened a new pull request #10627:
URL: https://github.com/apache/arrow/pull/10627


   …ACKED. Only works for int32/int64 when delta_bit_width_<=32
   
   1. Update the implement of DeltaBitPackDecoder
   2. Revise the code of putting/getting ZigZag int and the unit test
   3. Add the implement of putting/getting ZigZag int64 and unit test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-899971312


   Could we create a separate PR for the test files directly to the test file repo?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-899971808


   I might be missing it but is there a test demonstrating current behavior for bit_widths >= 32?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
shanhuuang commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-883013956


   > Sorry will try to get to it next week.
   
   Sorry for the delay in getting back to you. This week would be fine :)
   I'm also working on PARQUET-491 and PARQUET-492 which depend on code in this PR. Maybe I should create a new PR after this one is merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
shanhuuang commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-877139663


   > > > Is it possible for delta_bit_width_<=32? if so it seems like we should probably cover that case while updating?
   > > 
   > > 
   > > In current code, internal::unpack32 is called in BitReader::GetBatch if delta_bit_width_<=32 whether value type is int64 or int32.
   > > delta_binary_packed.parquet contains int64 columns with delta_bit_width_ from 0 to 32
   > 
   > Sorry I think i mean >= 32?
   
   Yes, it is possible for delta_bit_width_>32. For example, a column contains records like 0, -4294967296, -860510501
   The deltas would be -4294967296 and 3434456795
   The minimum delta is -4294967296
   so the final deltas are 0 and 7729424091
   delta_bit_width_ of bit packed int(7729424091) is 33
   
   delta_binary_packed.parquet also contains int64 columns with delta_bit_width_ from 33 to 64.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r690153087



##########
File path: cpp/src/arrow/util/bit_stream_utils.h
##########
@@ -418,14 +450,59 @@ inline bool BitReader::GetVlqInt(uint32_t* v) {
 }
 
 inline bool BitWriter::PutZigZagVlqInt(int32_t v) {
-  auto u_v = ::arrow::util::SafeCopy<uint32_t>(v);
-  return PutVlqInt((u_v << 1) ^ (u_v >> 31));
+  uint32_t u_v = ::arrow::util::SafeCopy<uint32_t>(v);
+  u_v = (u_v << 1) ^ static_cast<uint32_t>(v >> 31);
+  return PutVlqInt(u_v);

Review comment:
       Can you explain why these implementations were changed?

##########
File path: cpp/src/parquet/types.h
##########
@@ -678,7 +678,7 @@ struct type_traits<Type::INT64> {
   using value_type = int64_t;
 
   static constexpr int value_byte_size = 8;
-  static constexpr const char* printf_code = "ld";
+  static constexpr const char* printf_code = "lld";

Review comment:
       This should really be a conditional, e.g.:
   ```c++
     static constexpr const char* printf_code = (sizeof(long) == 64) ? "ld" : "lld";
   ```

##########
File path: cpp/src/arrow/util/bpacking64_codegen.py
##########
@@ -0,0 +1,131 @@
+#!/bin/python
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# This script is modified from its original version in GitHub. Original source:
+# https://github.com/lemire/FrameOfReference/blob/146948b6058a976bc7767262ad3a2ce201486b93/scripts/turbopacking64.py
+
+# Usage:
+#   python bpacking64_codegen.py > bpacking64_default.h
+
+def howmany(bit):
+    """ how many values are we going to pack? """
+    return 32
+
+
+def howmanywords(bit):
+    return (howmany(bit) * bit + 63)//64
+
+
+def howmanybytes(bit):
+    return (howmany(bit) * bit + 7)//8
+
+
+print('''// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// This file was generated by script which is modified from its original version in GitHub.
+// Original source:
+// https://github.com/lemire/FrameOfReference/blob/master/scripts/turbopacking64.py
+// The original copyright notice follows.
+
+// This code is released under the
+// Apache License Version 2.0 http://www.apache.org/licenses/.
+// (c) Daniel Lemire 2013
+
+#pragma once
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+
+namespace arrow {
+namespace internal {
+''')
+
+
+print("inline const uint8_t* unpack0_64(const uint8_t* in, uint64_t* out) {")
+print("  for(int k = 0; k < {0} ; k += 1) {{".format(howmany(0)))
+print("    out[k] = 0;")
+print("  }")
+print("  return in;")
+print("}")
+
+for bit in range(1, 65):
+    print("")
+    print(
+        "inline const uint8_t* unpack{0}_64(const uint8_t* in, uint64_t* out) {{".format(bit))
+
+    if(bit < 64):
+        print("  const uint64_t mask = {0}ULL;".format((1 << bit)-1))
+    maskstr = " & mask"
+    if (bit == 64):
+        maskstr = ""  # no need
+
+    for k in range(howmanywords(bit)-1):
+        print("  uint64_t w{0} = util::SafeLoadAs<uint64_t>(in);".format(k))
+        print("  w{0} = arrow::BitUtil::FromLittleEndian(w{0});".format(k))
+        print("  in += 8;".format(k))
+    k = howmanywords(bit) - 1
+    if (bit % 2 == 0):
+        print("  uint64_t w{0} = util::SafeLoadAs<uint64_t>(in);".format(k))
+        print("  w{0} = arrow::BitUtil::FromLittleEndian(w{0});".format(k))
+        print("  in += 8;".format(k))
+    else:
+        print("  uint64_t w{0} = util::SafeLoadAs<uint32_t>(in);".format(k))

Review comment:
       Ok, is this because DELTA_BINARY_PACKED outputs values in multiples of 32? Slightly annoying :-/

##########
File path: cpp/src/arrow/util/bit_stream_utils.h
##########
@@ -418,14 +450,59 @@ inline bool BitReader::GetVlqInt(uint32_t* v) {
 }
 
 inline bool BitWriter::PutZigZagVlqInt(int32_t v) {
-  auto u_v = ::arrow::util::SafeCopy<uint32_t>(v);
-  return PutVlqInt((u_v << 1) ^ (u_v >> 31));
+  uint32_t u_v = ::arrow::util::SafeCopy<uint32_t>(v);
+  u_v = (u_v << 1) ^ static_cast<uint32_t>(v >> 31);
+  return PutVlqInt(u_v);
 }
 
 inline bool BitReader::GetZigZagVlqInt(int32_t* v) {
   uint32_t u;
   if (!GetVlqInt(&u)) return false;
-  *v = ::arrow::util::SafeCopy<int32_t>((u >> 1) ^ (u << 31));
+  u = (u >> 1) ^ (~(u & 1) + 1);
+  *v = ::arrow::util::SafeCopy<int32_t>(u);
+  return true;
+}
+
+inline bool BitWriter::PutVlqInt(uint64_t v) {
+  bool result = true;
+  while ((v & 0xFFFFFFFFFFFFFF80ULL) != 0ULL) {
+    result &= PutAligned<uint8_t>(static_cast<uint8_t>((v & 0x7F) | 0x80), 1);

Review comment:
       Let's keep this code as-is.

##########
File path: cpp/src/arrow/util/bit_util_test.cc
##########
@@ -1939,24 +1939,77 @@ TEST(BitUtil, RoundUpToPowerOf2) {
 #undef U64
 #undef S64
 
-static void TestZigZag(int32_t v) {
+static void TestZigZag(int32_t v, uint8_t* buffer_expect) {

Review comment:
       Can pass a `std::array<uint8_t, 5>` to avoid having to declare the buffers out of line.

##########
File path: cpp/src/arrow/util/bit_util_test.cc
##########
@@ -1939,24 +1939,77 @@ TEST(BitUtil, RoundUpToPowerOf2) {
 #undef U64
 #undef S64
 
-static void TestZigZag(int32_t v) {
+static void TestZigZag(int32_t v, uint8_t* buffer_expect) {
   uint8_t buffer[BitUtil::BitReader::kMaxVlqByteLength] = {};
   BitUtil::BitWriter writer(buffer, sizeof(buffer));
   BitUtil::BitReader reader(buffer, sizeof(buffer));
   writer.PutZigZagVlqInt(v);
+  EXPECT_EQ(buffer_expect[0], buffer[0]);
+  EXPECT_EQ(buffer_expect[1], buffer[1]);
+  EXPECT_EQ(buffer_expect[2], buffer[2]);
+  EXPECT_EQ(buffer_expect[3], buffer[3]);
+  EXPECT_EQ(buffer_expect[4], buffer[4]);
   int32_t result;
   EXPECT_TRUE(reader.GetZigZagVlqInt(&result));
   EXPECT_EQ(v, result);
 }
 
 TEST(BitStreamUtil, ZigZag) {
-  TestZigZag(0);
-  TestZigZag(1);
-  TestZigZag(1234);
-  TestZigZag(-1);
-  TestZigZag(-1234);
-  TestZigZag(std::numeric_limits<int32_t>::max());
-  TestZigZag(-std::numeric_limits<int32_t>::max());
+  uint8_t buffer_expect0[5] = {0, 0, 0, 0, 0};
+  uint8_t buffer_expect1[5] = {2, 0, 0, 0, 0};
+  uint8_t buffer_expect2[5] = {164, 19, 0, 0, 0};
+  uint8_t buffer_expect3[5] = {1, 0, 0, 0, 0};
+  uint8_t buffer_expect4[5] = {163, 19, 0, 0, 0};
+  uint8_t buffer_expect5[5] = {254, 255, 255, 255, 15};
+  uint8_t buffer_expect6[5] = {253, 255, 255, 255, 15};
+  uint8_t buffer_expect7[5] = {255, 255, 255, 255, 15};
+  TestZigZag(0, buffer_expect0);

Review comment:
       This could be `TestZigZag(0, {0, 0, 0, 0, 0})` if `TestZigZag` took a `std::array` or `std::vector`.

##########
File path: cpp/src/parquet/encoding.cc
##########
@@ -2107,73 +2105,99 @@ class DeltaBitPackDecoder : public DecoderImpl, virtual public TypedDecoder<DTyp
   }
 
  private:
-  void InitBlock() {
-    // The number of values per block.
-    uint32_t block_size;
-    if (!decoder_.GetVlqInt(&block_size)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&num_mini_blocks_)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&values_current_block_)) {
+  void InitHeader() {
+    if (!decoder_.GetVlqInt(&values_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&mini_blocks_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&total_value_count_)) {
       ParquetException::EofException();
     }
     if (!decoder_.GetZigZagVlqInt(&last_value_)) ParquetException::EofException();
 
-    delta_bit_widths_ = AllocateBuffer(pool_, num_mini_blocks_);
-    uint8_t* bit_width_data = delta_bit_widths_->mutable_data();
+    delta_bit_widths_ = AllocateBuffer(pool_, mini_blocks_per_block_);
+    values_per_mini_block_ = values_per_block_ / mini_blocks_per_block_;
+    if (values_per_mini_block_ % 8 != 0) {
+      throw ParquetException("miniBlockSize must be multiple of 8, but it's " +

Review comment:
       Why 8? According to the [spec](https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5), it should be a multiple of 32.

##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -48,10 +48,9 @@
 #include "arrow/util/future.h"
 #include "arrow/util/logging.h"
 #include "arrow/util/range.h"
-
+#include "gtest/gtest.h"

Review comment:
       Nit: can you keep the inclusion order as is?
   

##########
File path: cpp/src/arrow/util/bpacking64_codegen.py
##########
@@ -0,0 +1,131 @@
+#!/bin/python
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# This script is modified from its original version in GitHub. Original source:
+# https://github.com/lemire/FrameOfReference/blob/146948b6058a976bc7767262ad3a2ce201486b93/scripts/turbopacking64.py
+
+# Usage:
+#   python bpacking64_codegen.py > bpacking64_default.h
+
+def howmany(bit):
+    """ how many values are we going to pack? """
+    return 32
+
+
+def howmanywords(bit):
+    return (howmany(bit) * bit + 63)//64
+
+
+def howmanybytes(bit):
+    return (howmany(bit) * bit + 7)//8
+
+
+print('''// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// This file was generated by script which is modified from its original version in GitHub.
+// Original source:
+// https://github.com/lemire/FrameOfReference/blob/master/scripts/turbopacking64.py
+// The original copyright notice follows.
+
+// This code is released under the
+// Apache License Version 2.0 http://www.apache.org/licenses/.
+// (c) Daniel Lemire 2013
+
+#pragma once
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+
+namespace arrow {
+namespace internal {
+''')
+
+
+print("inline const uint8_t* unpack0_64(const uint8_t* in, uint64_t* out) {")
+print("  for(int k = 0; k < {0} ; k += 1) {{".format(howmany(0)))
+print("    out[k] = 0;")
+print("  }")
+print("  return in;")
+print("}")
+
+for bit in range(1, 65):
+    print("")
+    print(
+        "inline const uint8_t* unpack{0}_64(const uint8_t* in, uint64_t* out) {{".format(bit))
+
+    if(bit < 64):
+        print("  const uint64_t mask = {0}ULL;".format((1 << bit)-1))
+    maskstr = " & mask"
+    if (bit == 64):
+        maskstr = ""  # no need
+
+    for k in range(howmanywords(bit)-1):
+        print("  uint64_t w{0} = util::SafeLoadAs<uint64_t>(in);".format(k))
+        print("  w{0} = arrow::BitUtil::FromLittleEndian(w{0});".format(k))
+        print("  in += 8;".format(k))
+    k = howmanywords(bit) - 1
+    if (bit % 2 == 0):
+        print("  uint64_t w{0} = util::SafeLoadAs<uint64_t>(in);".format(k))
+        print("  w{0} = arrow::BitUtil::FromLittleEndian(w{0});".format(k))
+        print("  in += 8;".format(k))
+    else:
+        print("  uint64_t w{0} = util::SafeLoadAs<uint32_t>(in);".format(k))

Review comment:
       I find this implementation weird. Why does it not unpack 64 entries at a time? I'm afraid this may make writing the SIMD implementations more difficult.

##########
File path: cpp/src/parquet/encoding.cc
##########
@@ -2107,73 +2105,99 @@ class DeltaBitPackDecoder : public DecoderImpl, virtual public TypedDecoder<DTyp
   }
 
  private:
-  void InitBlock() {
-    // The number of values per block.
-    uint32_t block_size;
-    if (!decoder_.GetVlqInt(&block_size)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&num_mini_blocks_)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&values_current_block_)) {
+  void InitHeader() {
+    if (!decoder_.GetVlqInt(&values_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&mini_blocks_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&total_value_count_)) {
       ParquetException::EofException();
     }
     if (!decoder_.GetZigZagVlqInt(&last_value_)) ParquetException::EofException();
 
-    delta_bit_widths_ = AllocateBuffer(pool_, num_mini_blocks_);
-    uint8_t* bit_width_data = delta_bit_widths_->mutable_data();
+    delta_bit_widths_ = AllocateBuffer(pool_, mini_blocks_per_block_);
+    values_per_mini_block_ = values_per_block_ / mini_blocks_per_block_;
+    if (values_per_mini_block_ % 8 != 0) {
+      throw ParquetException("miniBlockSize must be multiple of 8, but it's " +
+                             std::to_string(values_per_mini_block_));
+    }
+
+    mini_block_idx_ = mini_blocks_per_block_;
+    values_current_mini_block_ = 0;
+  }
 
+  void InitBlock() {
     if (!decoder_.GetZigZagVlqInt(&min_delta_)) ParquetException::EofException();
-    for (uint32_t i = 0; i < num_mini_blocks_; ++i) {
+
+    // readBitWidthsForMiniBlocks

Review comment:
       What does this comment mean? Would you like to write it in plain English?

##########
File path: cpp/src/parquet/encoding.cc
##########
@@ -2107,73 +2105,99 @@ class DeltaBitPackDecoder : public DecoderImpl, virtual public TypedDecoder<DTyp
   }
 
  private:
-  void InitBlock() {
-    // The number of values per block.
-    uint32_t block_size;
-    if (!decoder_.GetVlqInt(&block_size)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&num_mini_blocks_)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&values_current_block_)) {
+  void InitHeader() {
+    if (!decoder_.GetVlqInt(&values_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&mini_blocks_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&total_value_count_)) {
       ParquetException::EofException();
     }
     if (!decoder_.GetZigZagVlqInt(&last_value_)) ParquetException::EofException();
 
-    delta_bit_widths_ = AllocateBuffer(pool_, num_mini_blocks_);
-    uint8_t* bit_width_data = delta_bit_widths_->mutable_data();
+    delta_bit_widths_ = AllocateBuffer(pool_, mini_blocks_per_block_);
+    values_per_mini_block_ = values_per_block_ / mini_blocks_per_block_;
+    if (values_per_mini_block_ % 8 != 0) {
+      throw ParquetException("miniBlockSize must be multiple of 8, but it's " +
+                             std::to_string(values_per_mini_block_));
+    }
+
+    mini_block_idx_ = mini_blocks_per_block_;
+    values_current_mini_block_ = 0;
+  }
 
+  void InitBlock() {
     if (!decoder_.GetZigZagVlqInt(&min_delta_)) ParquetException::EofException();
-    for (uint32_t i = 0; i < num_mini_blocks_; ++i) {
+
+    // readBitWidthsForMiniBlocks
+    uint8_t* bit_width_data = delta_bit_widths_->mutable_data();
+    for (uint32_t i = 0; i < mini_blocks_per_block_; ++i) {
       if (!decoder_.GetAligned<uint8_t>(1, bit_width_data + i)) {
         ParquetException::EofException();
       }
     }
-    values_per_mini_block_ = block_size / num_mini_blocks_;
     mini_block_idx_ = 0;
     delta_bit_width_ = bit_width_data[0];
     values_current_mini_block_ = values_per_mini_block_;
   }
 
-  template <typename T>
   int GetInternal(T* buffer, int max_values) {
     max_values = std::min(max_values, this->num_values_);
-    const uint8_t* bit_width_data = delta_bit_widths_->data();
-    for (int i = 0; i < max_values; ++i) {
+    DCHECK_LE(static_cast<uint32_t>(max_values), total_value_count_);
+    int i = 0;
+    while (i < max_values) {
       if (ARROW_PREDICT_FALSE(values_current_mini_block_ == 0)) {
         ++mini_block_idx_;
-        if (mini_block_idx_ < static_cast<size_t>(delta_bit_widths_->size())) {
-          delta_bit_width_ = bit_width_data[mini_block_idx_];
+        if (mini_block_idx_ < mini_blocks_per_block_) {
+          delta_bit_width_ = delta_bit_widths_->data()[mini_block_idx_];
           values_current_mini_block_ = values_per_mini_block_;
         } else {
+          // mini_block_idx_ > mini_blocks_per_block_ only if last_value_ is stored in
+          // header

Review comment:
       I don't understand this comment. `last_value_` is always stored in header, right?
   

##########
File path: cpp/src/parquet/encoding.cc
##########
@@ -2107,73 +2105,99 @@ class DeltaBitPackDecoder : public DecoderImpl, virtual public TypedDecoder<DTyp
   }
 
  private:
-  void InitBlock() {
-    // The number of values per block.
-    uint32_t block_size;
-    if (!decoder_.GetVlqInt(&block_size)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&num_mini_blocks_)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&values_current_block_)) {
+  void InitHeader() {
+    if (!decoder_.GetVlqInt(&values_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&mini_blocks_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&total_value_count_)) {
       ParquetException::EofException();
     }
     if (!decoder_.GetZigZagVlqInt(&last_value_)) ParquetException::EofException();
 
-    delta_bit_widths_ = AllocateBuffer(pool_, num_mini_blocks_);
-    uint8_t* bit_width_data = delta_bit_widths_->mutable_data();
+    delta_bit_widths_ = AllocateBuffer(pool_, mini_blocks_per_block_);
+    values_per_mini_block_ = values_per_block_ / mini_blocks_per_block_;
+    if (values_per_mini_block_ % 8 != 0) {
+      throw ParquetException("miniBlockSize must be multiple of 8, but it's " +
+                             std::to_string(values_per_mini_block_));
+    }
+
+    mini_block_idx_ = mini_blocks_per_block_;
+    values_current_mini_block_ = 0;
+  }
 
+  void InitBlock() {
     if (!decoder_.GetZigZagVlqInt(&min_delta_)) ParquetException::EofException();
-    for (uint32_t i = 0; i < num_mini_blocks_; ++i) {
+
+    // readBitWidthsForMiniBlocks
+    uint8_t* bit_width_data = delta_bit_widths_->mutable_data();
+    for (uint32_t i = 0; i < mini_blocks_per_block_; ++i) {
       if (!decoder_.GetAligned<uint8_t>(1, bit_width_data + i)) {
         ParquetException::EofException();
       }
     }
-    values_per_mini_block_ = block_size / num_mini_blocks_;
     mini_block_idx_ = 0;
     delta_bit_width_ = bit_width_data[0];
     values_current_mini_block_ = values_per_mini_block_;
   }
 
-  template <typename T>
   int GetInternal(T* buffer, int max_values) {
     max_values = std::min(max_values, this->num_values_);
-    const uint8_t* bit_width_data = delta_bit_widths_->data();
-    for (int i = 0; i < max_values; ++i) {
+    DCHECK_LE(static_cast<uint32_t>(max_values), total_value_count_);
+    int i = 0;
+    while (i < max_values) {
       if (ARROW_PREDICT_FALSE(values_current_mini_block_ == 0)) {
         ++mini_block_idx_;
-        if (mini_block_idx_ < static_cast<size_t>(delta_bit_widths_->size())) {
-          delta_bit_width_ = bit_width_data[mini_block_idx_];
+        if (mini_block_idx_ < mini_blocks_per_block_) {
+          delta_bit_width_ = delta_bit_widths_->data()[mini_block_idx_];
           values_current_mini_block_ = values_per_mini_block_;
         } else {
+          // mini_block_idx_ > mini_blocks_per_block_ only if last_value_ is stored in
+          // header
+          if (ARROW_PREDICT_FALSE(mini_block_idx_ > mini_blocks_per_block_)) {

Review comment:
       Instead of this cryptic condition, can you add a boolean member indicating that decoding hasn't started yet?

##########
File path: cpp/src/parquet/reader_test.cc
##########
@@ -18,8 +18,10 @@
 #include <fcntl.h>
 #include <gmock/gmock.h>
 #include <gtest/gtest.h>
+
 #include <cstdint>
 #include <cstdlib>
+#include <fstream>

Review comment:
       Hmm... nothing else changed in this file. Is this necessary?

##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -4056,5 +4055,32 @@ TEST(TestArrowWriteDictionaries, NestedSubfield) {
   ::arrow::AssertTablesEqual(*table, *actual);
 }
 
+TEST(TestArrowReadDeltaEncoding, DeltaBinaryPacked) {
+  auto file = test::get_data_file("delta_binary_packed.parquet");
+  auto expect_file = test::get_data_file("delta_binary_packed_expect.csv");
+  auto pool = ::arrow::default_memory_pool();
+  std::unique_ptr<FileReader> parquet_reader;
+  std::shared_ptr<::arrow::Table> table;
+  ASSERT_OK(
+      FileReader::Make(pool, ParquetFileReader::OpenFile(file, false), &parquet_reader));
+  ASSERT_OK(parquet_reader->ReadTable(&table));
+
+  ASSERT_OK_AND_ASSIGN(auto input_file, ::arrow::io::ReadableFile::Open(expect_file));
+  auto convert_options = ::arrow::csv::ConvertOptions::Defaults();

Review comment:
       The CSV reader is only available if Arrow is compiled with `-DARROW_CSV=ON`. You'll have to make this test conditional on this somehow.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10627: ARROW-13206: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r664765915



##########
File path: cpp/src/arrow/util/bit_stream_utils.h
##########
@@ -263,8 +284,10 @@ inline void GetValue_(int num_bits, T* v, int max_bytes, const uint8_t* buffer,
 #pragma warning(disable : 4800 4805)
 #endif
     // Read bits of v that crossed into new buffered_values_
-    *v = *v | static_cast<T>(BitUtil::TrailingBits(*buffered_values, *bit_offset)
-                             << (num_bits - *bit_offset));
+    if (ARROW_PREDICT_TRUE(num_bits - *bit_offset < 64)) {

Review comment:
       please comment why this check is needed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang commented on a change in pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
shanhuuang commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r665015564



##########
File path: cpp/src/arrow/util/bit_stream_utils.h
##########
@@ -418,14 +450,65 @@ inline bool BitReader::GetVlqInt(uint32_t* v) {
 }
 
 inline bool BitWriter::PutZigZagVlqInt(int32_t v) {
-  auto u_v = ::arrow::util::SafeCopy<uint32_t>(v);
-  return PutVlqInt((u_v << 1) ^ (u_v >> 31));
+  uint32_t u_v = ::arrow::util::SafeCopy<uint32_t>(v);
+  v = (u_v << 1) ^ (v >> 31);
+  u_v = ::arrow::util::SafeCopy<uint32_t>(v);

Review comment:
       Actually, I'm not particularly clear about the difference between ::arrow::util::SafeCopy<uint32_t> and static_cast<uint32_t>. Could you please tell me when should we use SafeCopy rather than static_cast?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-889976082


   sorry I have had less time then I would have liked recently for Arrow reviews, will try to get to this soon.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-874982472


   Is it possible for delta_bit_width_<=32?  if so it seems like we should probably cover that case while updating?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10627: ARROW-13206: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r664738151



##########
File path: cpp/src/arrow/util/bit_stream_utils.h
##########
@@ -418,14 +450,65 @@ inline bool BitReader::GetVlqInt(uint32_t* v) {
 }
 
 inline bool BitWriter::PutZigZagVlqInt(int32_t v) {
-  auto u_v = ::arrow::util::SafeCopy<uint32_t>(v);
-  return PutVlqInt((u_v << 1) ^ (u_v >> 31));
+  uint32_t u_v = ::arrow::util::SafeCopy<uint32_t>(v);
+  v = (u_v << 1) ^ (v >> 31);
+  u_v = ::arrow::util::SafeCopy<uint32_t>(v);

Review comment:
       and elimnate this line.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-874981528


   https://issues.apache.org/jira/browse/PARQUET-490


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-874983984


   CC @zeroshade note bugs fixed in ZigZag/VLQ int, not sure if these got propagated to Go.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10627: PARQUET-490: [C++][Parquet] Basic support for reading DELTA_BINARY_PACKED data

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r691990798



##########
File path: cpp/src/parquet/encoding.cc
##########
@@ -2107,73 +2105,99 @@ class DeltaBitPackDecoder : public DecoderImpl, virtual public TypedDecoder<DTyp
   }
 
  private:
-  void InitBlock() {
-    // The number of values per block.
-    uint32_t block_size;
-    if (!decoder_.GetVlqInt(&block_size)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&num_mini_blocks_)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&values_current_block_)) {
+  void InitHeader() {
+    if (!decoder_.GetVlqInt(&values_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&mini_blocks_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&total_value_count_)) {
       ParquetException::EofException();
     }
     if (!decoder_.GetZigZagVlqInt(&last_value_)) ParquetException::EofException();
 
-    delta_bit_widths_ = AllocateBuffer(pool_, num_mini_blocks_);
-    uint8_t* bit_width_data = delta_bit_widths_->mutable_data();
+    delta_bit_widths_ = AllocateBuffer(pool_, mini_blocks_per_block_);
+    values_per_mini_block_ = values_per_block_ / mini_blocks_per_block_;
+    if (values_per_mini_block_ % 8 != 0) {
+      throw ParquetException("miniBlockSize must be multiple of 8, but it's " +

Review comment:
       Well, we should follow the spec here, especially as `unpack64` assumes the input length is a multiple of 32.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10627: ARROW-13206: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r664776263



##########
File path: cpp/src/parquet/reader_test.cc
##########
@@ -322,6 +323,27 @@ TEST(TestFileReaderAdHoc, NationDictTruncatedDataPage) {
   ASSERT_EQ(ss2.str(), ss.str());
 }
 
+TEST(TestFileReaderEncoding, DeltaBinaryPacked) {
+  // Parquet file(delta_binary_packed.parquet) is generated by parquet-mr version 1.10.0.
+  // There are 66 columns in total and their encoding type is DELTA_BINARY_PACKED. The
+  // data type of the first 65 columns is bigint and their bit width ranges from 0 to 64.
+  // The data type of the last column is int.
+  std::list<int> columns;
+
+  std::stringstream ss_values;
+  const char* file = "delta_binary_packed.parquet";
+  auto reader_props = default_reader_properties();
+  auto reader = ParquetFileReader::OpenFile(data_file(file), false, reader_props);
+  ParquetFilePrinter printer(reader.get());
+  printer.DebugPrint(ss_values, columns, true, false, false, file);
+
+  std::ifstream in(data_file("delta_binary_packed.parquet.expect"));

Review comment:
       this test seems very brittle.  using CSV or JSON or some other format and comparing the column data seems directly instead of the string representation seems like a better idea?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeroshade commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
zeroshade commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-875020990


   @emkornfield I added the tests that were added here to the Go side, including the comparison of the actual bytes generated and the Golang implementation passes all the tests as currently written without any modifications needed. it appears that the Golang version doesn't have this bug (I remember coming across a bug in the ZigZag/VLQ int stuff a while back where I had to manage the signs and such to get the right output so I likely fixed it without realizing). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
shanhuuang commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-883013956


   > Sorry will try to get to it next week.
   
   Sorry for the delay in getting back to you. This week would be fine :)
   I'm also working on PARQUET-491 and PARQUET-492 which depend on code in this PR. Maybe I should create a new PR after this one is merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10627: ARROW-13206: [C++][Parquet] Basic support for encoding DELTA_BINARY_P…

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-871049358


   https://issues.apache.org/jira/browse/ARROW-13206


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang commented on a change in pull request #10627: PARQUET-490: [C++][Parquet] Basic support for reading DELTA_BINARY_PACKED data

Posted by GitBox <gi...@apache.org>.
shanhuuang commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r692095886



##########
File path: cpp/src/arrow/util/bit_stream_utils.h
##########
@@ -313,7 +337,18 @@ inline int BitReader::GetBatch(int num_bits, T* v, int batch_size) {
                            reinterpret_cast<uint32_t*>(v + i), batch_size - i, num_bits);
     i += num_unpacked;
     byte_offset += num_unpacked * num_bits / 8;
+  } else if (sizeof(T) == 8 && num_bits > 32) {
+    // Use unpack64 only if num_bits is larger then 32

Review comment:
       Yes, I think it is just a performance issue. If num_bits is no larger than 32, I guess that using unpack32 will achieve better performance with function such as unpack32_avx2/unpack32_avx512. 
   The result of using unpack32 or unpack64 are the same if num_bits <= 32

##########
File path: cpp/src/parquet/encoding.cc
##########
@@ -2107,73 +2105,99 @@ class DeltaBitPackDecoder : public DecoderImpl, virtual public TypedDecoder<DTyp
   }
 
  private:
-  void InitBlock() {
-    // The number of values per block.
-    uint32_t block_size;
-    if (!decoder_.GetVlqInt(&block_size)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&num_mini_blocks_)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&values_current_block_)) {
+  void InitHeader() {
+    if (!decoder_.GetVlqInt(&values_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&mini_blocks_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&total_value_count_)) {
       ParquetException::EofException();
     }
     if (!decoder_.GetZigZagVlqInt(&last_value_)) ParquetException::EofException();
 
-    delta_bit_widths_ = AllocateBuffer(pool_, num_mini_blocks_);
-    uint8_t* bit_width_data = delta_bit_widths_->mutable_data();
+    delta_bit_widths_ = AllocateBuffer(pool_, mini_blocks_per_block_);
+    values_per_mini_block_ = values_per_block_ / mini_blocks_per_block_;
+    if (values_per_mini_block_ % 8 != 0) {
+      throw ParquetException("miniBlockSize must be multiple of 8, but it's " +

Review comment:
       OK. I've opened a JIRA [here](https://issues.apache.org/jira/browse/PARQUET-2077).

##########
File path: cpp/src/arrow/util/bit_util_test.cc
##########
@@ -1939,24 +1939,61 @@ TEST(BitUtil, RoundUpToPowerOf2) {
 #undef U64
 #undef S64
 
-static void TestZigZag(int32_t v) {
+static void TestZigZag(int32_t v, std::array<uint8_t, 5> buffer_expect) {
   uint8_t buffer[BitUtil::BitReader::kMaxVlqByteLength] = {};
   BitUtil::BitWriter writer(buffer, sizeof(buffer));
   BitUtil::BitReader reader(buffer, sizeof(buffer));
   writer.PutZigZagVlqInt(v);
+  EXPECT_EQ(buffer_expect[0], buffer[0]);
+  EXPECT_EQ(buffer_expect[1], buffer[1]);
+  EXPECT_EQ(buffer_expect[2], buffer[2]);
+  EXPECT_EQ(buffer_expect[3], buffer[3]);
+  EXPECT_EQ(buffer_expect[4], buffer[4]);

Review comment:
       Got it! Thanks. :)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r690006801



##########
File path: cpp/src/arrow/util/bit_stream_utils.h
##########
@@ -263,8 +284,10 @@ inline void GetValue_(int num_bits, T* v, int max_bytes, const uint8_t* buffer,
 #pragma warning(disable : 4800 4805)
 #endif
     // Read bits of v that crossed into new buffered_values_
-    *v = *v | static_cast<T>(BitUtil::TrailingBits(*buffered_values, *bit_offset)
-                             << (num_bits - *bit_offset));
+    if (ARROW_PREDICT_TRUE(num_bits - *bit_offset < 64)) {

Review comment:
       thanks, makes esnse, could add a comment to the code.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10627: PARQUET-490: [C++][Parquet] Basic support for reading DELTA_BINARY_PACKED data

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r692035158



##########
File path: cpp/src/arrow/util/bit_util_test.cc
##########
@@ -1939,24 +1939,61 @@ TEST(BitUtil, RoundUpToPowerOf2) {
 #undef U64
 #undef S64
 
-static void TestZigZag(int32_t v) {
+static void TestZigZag(int32_t v, std::array<uint8_t, 5> buffer_expect) {
   uint8_t buffer[BitUtil::BitReader::kMaxVlqByteLength] = {};
   BitUtil::BitWriter writer(buffer, sizeof(buffer));
   BitUtil::BitReader reader(buffer, sizeof(buffer));
   writer.PutZigZagVlqInt(v);
+  EXPECT_EQ(buffer_expect[0], buffer[0]);
+  EXPECT_EQ(buffer_expect[1], buffer[1]);
+  EXPECT_EQ(buffer_expect[2], buffer[2]);
+  EXPECT_EQ(buffer_expect[3], buffer[3]);
+  EXPECT_EQ(buffer_expect[4], buffer[4]);

Review comment:
       Nit: can probably write `EXPECT_THAT(buffer, testing::ElementsAreArray(buffer_expect))`

##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -4056,5 +4061,34 @@ TEST(TestArrowWriteDictionaries, NestedSubfield) {
   ::arrow::AssertTablesEqual(*table, *actual);
 }
 
+#ifdef ARROW_CSV
+TEST(TestArrowReadDeltaEncoding, DeltaBinaryPacked) {
+  auto file = test::get_data_file("delta_binary_packed.parquet");
+  auto expect_file = test::get_data_file("delta_binary_packed_expect.csv");
+  auto pool = ::arrow::default_memory_pool();
+  std::unique_ptr<FileReader> parquet_reader;
+  std::shared_ptr<::arrow::Table> table;
+  ASSERT_OK(
+      FileReader::Make(pool, ParquetFileReader::OpenFile(file, false), &parquet_reader));
+  ASSERT_OK(parquet_reader->ReadTable(&table));
+
+  ASSERT_OK_AND_ASSIGN(auto input_file, ::arrow::io::ReadableFile::Open(expect_file));
+  auto convert_options = ::arrow::csv::ConvertOptions::Defaults();
+  for (int i = 0; i <= 64; ++i) {
+    std::string column_name = "bitwidth" + std::to_string(i);
+    convert_options.column_types[column_name] = ::arrow::int64();
+  }
+  convert_options.column_types["int_value"] = ::arrow::int32();
+  ASSERT_OK_AND_ASSIGN(auto csv_reader,
+                       ::arrow::csv::TableReader::Make(
+                           ::arrow::io::default_io_context(), input_file,
+                           ::arrow::csv::ReadOptions::Defaults(),
+                           ::arrow::csv::ParseOptions::Defaults(), convert_options));
+  ASSERT_OK_AND_ASSIGN(auto expect_table, csv_reader->Read());
+
+  ::arrow::AssertTablesEqual(*table, *expect_table);
+}
+#endif
+

Review comment:
       When ARROW_CSV is not defined, can probably skip the test instead of not defining it, e.g.:
   ```c++
   TEST(TestArrowReadDeltaEncoding, DeltaBinaryPacked) {
     GTEST_SKIP() << "Test needs CSV reader";
   }
   ```

##########
File path: cpp/src/parquet/CMakeLists.txt
##########
@@ -407,3 +407,7 @@ endif()
 if(ARROW_WITH_ZSTD)
   add_definitions(-DARROW_WITH_ZSTD)
 endif()
+
+if(ARROW_CSV)

Review comment:
       This seems ok to me currently.

##########
File path: cpp/src/arrow/util/bit_stream_utils.h
##########
@@ -418,14 +450,59 @@ inline bool BitReader::GetVlqInt(uint32_t* v) {
 }
 
 inline bool BitWriter::PutZigZagVlqInt(int32_t v) {
-  auto u_v = ::arrow::util::SafeCopy<uint32_t>(v);
-  return PutVlqInt((u_v << 1) ^ (u_v >> 31));
+  uint32_t u_v = ::arrow::util::SafeCopy<uint32_t>(v);
+  u_v = (u_v << 1) ^ static_cast<uint32_t>(v >> 31);
+  return PutVlqInt(u_v);

Review comment:
       Ah, thanks. You're right indeed.

##########
File path: cpp/src/arrow/util/bit_stream_utils.h
##########
@@ -313,7 +337,18 @@ inline int BitReader::GetBatch(int num_bits, T* v, int batch_size) {
                            reinterpret_cast<uint32_t*>(v + i), batch_size - i, num_bits);
     i += num_unpacked;
     byte_offset += num_unpacked * num_bits / 8;
+  } else if (sizeof(T) == 8 && num_bits > 32) {
+    // Use unpack64 only if num_bits is larger then 32

Review comment:
       I hadn't noticed previously, but why? Is it just a performance issue?

##########
File path: cpp/src/arrow/util/bit_util_test.cc
##########
@@ -1939,24 +1939,61 @@ TEST(BitUtil, RoundUpToPowerOf2) {
 #undef U64
 #undef S64
 
-static void TestZigZag(int32_t v) {
+static void TestZigZag(int32_t v, std::array<uint8_t, 5> buffer_expect) {
   uint8_t buffer[BitUtil::BitReader::kMaxVlqByteLength] = {};
   BitUtil::BitWriter writer(buffer, sizeof(buffer));
   BitUtil::BitReader reader(buffer, sizeof(buffer));
   writer.PutZigZagVlqInt(v);
+  EXPECT_EQ(buffer_expect[0], buffer[0]);
+  EXPECT_EQ(buffer_expect[1], buffer[1]);
+  EXPECT_EQ(buffer_expect[2], buffer[2]);
+  EXPECT_EQ(buffer_expect[3], buffer[3]);
+  EXPECT_EQ(buffer_expect[4], buffer[4]);

Review comment:
       You'll need to include `<gmock/gmock-matchers.h>` for that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang edited a comment on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
shanhuuang edited a comment on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-883013956


   > Sorry will try to get to it next week.
   
   Sorry for the delay in getting back to you. This week would be fine :)
   I'm also working on [PARQUET-491](https://issues.apache.org/jira/browse/PARQUET-491) and [PARQUET-492](https://issues.apache.org/jira/browse/PARQUET-492) which depend on code in this PR. Maybe I should create a new PR after this one is merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang commented on a change in pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
shanhuuang commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r665004266



##########
File path: cpp/src/arrow/util/bit_stream_utils.h
##########
@@ -263,8 +284,10 @@ inline void GetValue_(int num_bits, T* v, int max_bytes, const uint8_t* buffer,
 #pragma warning(disable : 4800 4805)
 #endif
     // Read bits of v that crossed into new buffered_values_
-    *v = *v | static_cast<T>(BitUtil::TrailingBits(*buffered_values, *bit_offset)
-                             << (num_bits - *bit_offset));
+    if (ARROW_PREDICT_TRUE(num_bits - *bit_offset < 64)) {

Review comment:
       To solve the runtime error in test job "AMD64 Ubuntu 20.04 C++ ASAN UBSAN":
   /arrow/cpp/src/arrow/util/bit_stream_utils.h:289:30: runtime error: shift exponent 64 is too large for 64-bit type 'uint64_t' (aka 'unsigned long')
   https://github.com/shanhuuang/arrow/runs/2986826720?check_suite_focus=true#logs
   I met it when I try to read a file with only 3 records.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang commented on a change in pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
shanhuuang commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r690178768



##########
File path: cpp/src/parquet/encoding.cc
##########
@@ -2107,73 +2105,99 @@ class DeltaBitPackDecoder : public DecoderImpl, virtual public TypedDecoder<DTyp
   }
 
  private:
-  void InitBlock() {
-    // The number of values per block.
-    uint32_t block_size;
-    if (!decoder_.GetVlqInt(&block_size)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&num_mini_blocks_)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&values_current_block_)) {
+  void InitHeader() {
+    if (!decoder_.GetVlqInt(&values_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&mini_blocks_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&total_value_count_)) {
       ParquetException::EofException();
     }
     if (!decoder_.GetZigZagVlqInt(&last_value_)) ParquetException::EofException();
 
-    delta_bit_widths_ = AllocateBuffer(pool_, num_mini_blocks_);
-    uint8_t* bit_width_data = delta_bit_widths_->mutable_data();
+    delta_bit_widths_ = AllocateBuffer(pool_, mini_blocks_per_block_);
+    values_per_mini_block_ = values_per_block_ / mini_blocks_per_block_;
+    if (values_per_mini_block_ % 8 != 0) {
+      throw ParquetException("miniBlockSize must be multiple of 8, but it's " +
+                             std::to_string(values_per_mini_block_));
+    }
+
+    mini_block_idx_ = mini_blocks_per_block_;
+    values_current_mini_block_ = 0;
+  }
 
+  void InitBlock() {
     if (!decoder_.GetZigZagVlqInt(&min_delta_)) ParquetException::EofException();
-    for (uint32_t i = 0; i < num_mini_blocks_; ++i) {
+
+    // readBitWidthsForMiniBlocks
+    uint8_t* bit_width_data = delta_bit_widths_->mutable_data();
+    for (uint32_t i = 0; i < mini_blocks_per_block_; ++i) {
       if (!decoder_.GetAligned<uint8_t>(1, bit_width_data + i)) {
         ParquetException::EofException();
       }
     }
-    values_per_mini_block_ = block_size / num_mini_blocks_;
     mini_block_idx_ = 0;
     delta_bit_width_ = bit_width_data[0];
     values_current_mini_block_ = values_per_mini_block_;
   }
 
-  template <typename T>
   int GetInternal(T* buffer, int max_values) {
     max_values = std::min(max_values, this->num_values_);
-    const uint8_t* bit_width_data = delta_bit_widths_->data();
-    for (int i = 0; i < max_values; ++i) {
+    DCHECK_LE(static_cast<uint32_t>(max_values), total_value_count_);
+    int i = 0;
+    while (i < max_values) {
       if (ARROW_PREDICT_FALSE(values_current_mini_block_ == 0)) {
         ++mini_block_idx_;
-        if (mini_block_idx_ < static_cast<size_t>(delta_bit_widths_->size())) {
-          delta_bit_width_ = bit_width_data[mini_block_idx_];
+        if (mini_block_idx_ < mini_blocks_per_block_) {
+          delta_bit_width_ = delta_bit_widths_->data()[mini_block_idx_];
           values_current_mini_block_ = values_per_mini_block_;
         } else {
+          // mini_block_idx_ > mini_blocks_per_block_ only if last_value_ is stored in
+          // header
+          if (ARROW_PREDICT_FALSE(mini_block_idx_ > mini_blocks_per_block_)) {
+            buffer[i++] = last_value_;
+            --total_value_count_;
+            if (ARROW_PREDICT_FALSE(i == max_values)) break;
+          }
           InitBlock();
-          buffer[i] = last_value_;
           continue;
         }
       }
 
-      // TODO: the key to this algorithm is to decode the entire miniblock at once.
-      int64_t delta;
-      if (!decoder_.GetValue(delta_bit_width_, &delta)) ParquetException::EofException();
-      delta += min_delta_;
-      last_value_ += static_cast<int32_t>(delta);
-      buffer[i] = last_value_;
-      --values_current_mini_block_;
+      int values_decode =
+          std::min(values_current_mini_block_, (uint32_t)(max_values - i));

Review comment:
       because the type of values_current_mini_block_ is uint32_t. I'll use static_cast, thanks~




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for reading DELTA_BINARY_PACKED data

Posted by GitBox <gi...@apache.org>.
shanhuuang commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-901558457


   > Thank you for this PR. Are you planning to support encoding in another PR?
   
   Yes. I'm also working on [PARQUET-491](https://issues.apache.org/jira/browse/PARQUET-491) and [PARQUET-492](https://issues.apache.org/jira/browse/PARQUET-492) which depend on code in this PR. And I would like to create a new PR after this one is merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
shanhuuang commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-883013956


   > Sorry will try to get to it next week.
   
   Sorry for the delay in getting back to you. This week would be fine :)
   I'm also working on PARQUET-491 and PARQUET-492 which depend on code in this PR. Maybe I should create a new PR after this one is merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-899971994






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang commented on a change in pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
shanhuuang commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r690163362



##########
File path: cpp/src/arrow/util/bit_stream_utils.h
##########
@@ -418,14 +450,59 @@ inline bool BitReader::GetVlqInt(uint32_t* v) {
 }
 
 inline bool BitWriter::PutZigZagVlqInt(int32_t v) {
-  auto u_v = ::arrow::util::SafeCopy<uint32_t>(v);
-  return PutVlqInt((u_v << 1) ^ (u_v >> 31));
+  uint32_t u_v = ::arrow::util::SafeCopy<uint32_t>(v);
+  u_v = (u_v << 1) ^ static_cast<uint32_t>(v >> 31);
+  return PutVlqInt(u_v);
 }
 
 inline bool BitReader::GetZigZagVlqInt(int32_t* v) {
   uint32_t u;
   if (!GetVlqInt(&u)) return false;
-  *v = ::arrow::util::SafeCopy<int32_t>((u >> 1) ^ (u << 31));
+  u = (u >> 1) ^ (~(u & 1) + 1);
+  *v = ::arrow::util::SafeCopy<int32_t>(u);
+  return true;
+}
+
+inline bool BitWriter::PutVlqInt(uint64_t v) {
+  bool result = true;
+  while ((v & 0xFFFFFFFFFFFFFF80ULL) != 0ULL) {
+    result &= PutAligned<uint8_t>(static_cast<uint8_t>((v & 0x7F) | 0x80), 1);

Review comment:
       I used "&=" here with reference to the code in function "bool BitWriter::PutVlqInt(uint32_t v)". Using "&&=" would be a better choice.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang commented on a change in pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
shanhuuang commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r665004266



##########
File path: cpp/src/arrow/util/bit_stream_utils.h
##########
@@ -263,8 +284,10 @@ inline void GetValue_(int num_bits, T* v, int max_bytes, const uint8_t* buffer,
 #pragma warning(disable : 4800 4805)
 #endif
     // Read bits of v that crossed into new buffered_values_
-    *v = *v | static_cast<T>(BitUtil::TrailingBits(*buffered_values, *bit_offset)
-                             << (num_bits - *bit_offset));
+    if (ARROW_PREDICT_TRUE(num_bits - *bit_offset < 64)) {

Review comment:
       To solve the runtime error in test job "AMD64 Ubuntu 20.04 C++ ASAN UBSAN":
   /arrow/cpp/src/arrow/util/bit_stream_utils.h:289:30: runtime error: shift exponent 64 is too large for 64-bit type 'uint64_t' (aka 'unsigned long')
   https://github.com/shanhuuang/arrow/runs/2986826720?check_suite_focus=true#logs
   I met it when I tried to read a file with only 3 records.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for reading DELTA_BINARY_PACKED data

Posted by GitBox <gi...@apache.org>.
shanhuuang commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-902369767


   > @shanhuuang Do you have a user id on the [Apache JIRA tracker](https://issues.apache.org/jira/), so that I can assign the issue to you?
   
   Yes, my username on JIRA is "shanhuang". Thanks! :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10627: PARQUET-490: [C++][Parquet] Basic support for reading DELTA_BINARY_PACKED data

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r691810354



##########
File path: cpp/src/parquet/CMakeLists.txt
##########
@@ -407,3 +407,7 @@ endif()
 if(ARROW_WITH_ZSTD)
   add_definitions(-DARROW_WITH_ZSTD)
 endif()
+
+if(ARROW_CSV)

Review comment:
       this seems out of place?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r690011304



##########
File path: cpp/src/parquet/encoding.cc
##########
@@ -2107,73 +2105,99 @@ class DeltaBitPackDecoder : public DecoderImpl, virtual public TypedDecoder<DTyp
   }
 
  private:
-  void InitBlock() {
-    // The number of values per block.
-    uint32_t block_size;
-    if (!decoder_.GetVlqInt(&block_size)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&num_mini_blocks_)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&values_current_block_)) {
+  void InitHeader() {
+    if (!decoder_.GetVlqInt(&values_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&mini_blocks_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&total_value_count_)) {
       ParquetException::EofException();
     }
     if (!decoder_.GetZigZagVlqInt(&last_value_)) ParquetException::EofException();
 
-    delta_bit_widths_ = AllocateBuffer(pool_, num_mini_blocks_);
-    uint8_t* bit_width_data = delta_bit_widths_->mutable_data();
+    delta_bit_widths_ = AllocateBuffer(pool_, mini_blocks_per_block_);
+    values_per_mini_block_ = values_per_block_ / mini_blocks_per_block_;
+    if (values_per_mini_block_ % 8 != 0) {
+      throw ParquetException("miniBlockSize must be multiple of 8, but it's " +
+                             std::to_string(values_per_mini_block_));
+    }
+
+    mini_block_idx_ = mini_blocks_per_block_;
+    values_current_mini_block_ = 0;
+  }
 
+  void InitBlock() {
     if (!decoder_.GetZigZagVlqInt(&min_delta_)) ParquetException::EofException();
-    for (uint32_t i = 0; i < num_mini_blocks_; ++i) {
+
+    // readBitWidthsForMiniBlocks
+    uint8_t* bit_width_data = delta_bit_widths_->mutable_data();
+    for (uint32_t i = 0; i < mini_blocks_per_block_; ++i) {
       if (!decoder_.GetAligned<uint8_t>(1, bit_width_data + i)) {
         ParquetException::EofException();
       }
     }
-    values_per_mini_block_ = block_size / num_mini_blocks_;
     mini_block_idx_ = 0;
     delta_bit_width_ = bit_width_data[0];
     values_current_mini_block_ = values_per_mini_block_;
   }
 
-  template <typename T>
   int GetInternal(T* buffer, int max_values) {
     max_values = std::min(max_values, this->num_values_);
-    const uint8_t* bit_width_data = delta_bit_widths_->data();
-    for (int i = 0; i < max_values; ++i) {
+    DCHECK_LE(static_cast<uint32_t>(max_values), total_value_count_);
+    int i = 0;
+    while (i < max_values) {
       if (ARROW_PREDICT_FALSE(values_current_mini_block_ == 0)) {
         ++mini_block_idx_;
-        if (mini_block_idx_ < static_cast<size_t>(delta_bit_widths_->size())) {
-          delta_bit_width_ = bit_width_data[mini_block_idx_];
+        if (mini_block_idx_ < mini_blocks_per_block_) {
+          delta_bit_width_ = delta_bit_widths_->data()[mini_block_idx_];
           values_current_mini_block_ = values_per_mini_block_;
         } else {
+          // mini_block_idx_ > mini_blocks_per_block_ only if last_value_ is stored in
+          // header
+          if (ARROW_PREDICT_FALSE(mini_block_idx_ > mini_blocks_per_block_)) {
+            buffer[i++] = last_value_;
+            --total_value_count_;
+            if (ARROW_PREDICT_FALSE(i == max_values)) break;
+          }
           InitBlock();
-          buffer[i] = last_value_;
           continue;
         }
       }
 
-      // TODO: the key to this algorithm is to decode the entire miniblock at once.
-      int64_t delta;
-      if (!decoder_.GetValue(delta_bit_width_, &delta)) ParquetException::EofException();
-      delta += min_delta_;
-      last_value_ += static_cast<int32_t>(delta);
-      buffer[i] = last_value_;
-      --values_current_mini_block_;
+      int values_decode =
+          std::min(values_current_mini_block_, (uint32_t)(max_values - i));

Review comment:
       please use static_cast.  why the cast to uint?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-875706968


   > > Is it possible for delta_bit_width_<=32? if so it seems like we should probably cover that case while updating?
   > 
   > In current code, internal::unpack32 is called in BitReader::GetBatch if delta_bit_width_<=32 whether value type is int64 or int32.
   > delta_binary_packed.parquet contains int64 columns with delta_bit_width_ from 0 to 32
   
   Sorry I think i mean >= 32?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10627: ARROW-13206: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r664736988



##########
File path: cpp/src/arrow/util/bit_stream_utils.h
##########
@@ -418,14 +450,65 @@ inline bool BitReader::GetVlqInt(uint32_t* v) {
 }
 
 inline bool BitWriter::PutZigZagVlqInt(int32_t v) {
-  auto u_v = ::arrow::util::SafeCopy<uint32_t>(v);
-  return PutVlqInt((u_v << 1) ^ (u_v >> 31));
+  uint32_t u_v = ::arrow::util::SafeCopy<uint32_t>(v);
+  v = (u_v << 1) ^ (v >> 31);
+  u_v = ::arrow::util::SafeCopy<uint32_t>(v);

Review comment:
       nit: just static_cast v >> 31 above.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
shanhuuang commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-875351178


   > Is it possible for delta_bit_width_<=32? if so it seems like we should probably cover that case while updating?
   
   In current code, internal::unpack32 is called in BitReader::GetBatch if delta_bit_width_<=32 whether value type is int64 or int32.
   delta_binary_packed.parquet contains int64 columns with delta_bit_width_ from 0 to 32


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang commented on a change in pull request #10627: PARQUET-490: [C++][Parquet] Basic support for reading DELTA_BINARY_PACKED data

Posted by GitBox <gi...@apache.org>.
shanhuuang commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r690890712



##########
File path: cpp/src/arrow/util/bit_util_test.cc
##########
@@ -1939,24 +1939,77 @@ TEST(BitUtil, RoundUpToPowerOf2) {
 #undef U64
 #undef S64
 
-static void TestZigZag(int32_t v) {
+static void TestZigZag(int32_t v, uint8_t* buffer_expect) {

Review comment:
       OK, thanks for your advice.

##########
File path: cpp/src/arrow/util/bpacking64_codegen.py
##########
@@ -0,0 +1,131 @@
+#!/bin/python
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# This script is modified from its original version in GitHub. Original source:
+# https://github.com/lemire/FrameOfReference/blob/146948b6058a976bc7767262ad3a2ce201486b93/scripts/turbopacking64.py
+
+# Usage:
+#   python bpacking64_codegen.py > bpacking64_default.h
+
+def howmany(bit):
+    """ how many values are we going to pack? """
+    return 32
+
+
+def howmanywords(bit):
+    return (howmany(bit) * bit + 63)//64
+
+
+def howmanybytes(bit):
+    return (howmany(bit) * bit + 7)//8
+
+
+print('''// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// This file was generated by script which is modified from its original version in GitHub.
+// Original source:
+// https://github.com/lemire/FrameOfReference/blob/master/scripts/turbopacking64.py
+// The original copyright notice follows.
+
+// This code is released under the
+// Apache License Version 2.0 http://www.apache.org/licenses/.
+// (c) Daniel Lemire 2013
+
+#pragma once
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+
+namespace arrow {
+namespace internal {
+''')
+
+
+print("inline const uint8_t* unpack0_64(const uint8_t* in, uint64_t* out) {")
+print("  for(int k = 0; k < {0} ; k += 1) {{".format(howmany(0)))
+print("    out[k] = 0;")
+print("  }")
+print("  return in;")
+print("}")
+
+for bit in range(1, 65):
+    print("")
+    print(
+        "inline const uint8_t* unpack{0}_64(const uint8_t* in, uint64_t* out) {{".format(bit))
+
+    if(bit < 64):
+        print("  const uint64_t mask = {0}ULL;".format((1 << bit)-1))
+    maskstr = " & mask"
+    if (bit == 64):
+        maskstr = ""  # no need
+
+    for k in range(howmanywords(bit)-1):
+        print("  uint64_t w{0} = util::SafeLoadAs<uint64_t>(in);".format(k))
+        print("  w{0} = arrow::BitUtil::FromLittleEndian(w{0});".format(k))
+        print("  in += 8;".format(k))
+    k = howmanywords(bit) - 1
+    if (bit % 2 == 0):
+        print("  uint64_t w{0} = util::SafeLoadAs<uint64_t>(in);".format(k))
+        print("  w{0} = arrow::BitUtil::FromLittleEndian(w{0});".format(k))
+        print("  in += 8;".format(k))
+    else:
+        print("  uint64_t w{0} = util::SafeLoadAs<uint32_t>(in);".format(k))

Review comment:
       Yes. The number of values(including padding) in a miniblock is always a multiple of 32.

##########
File path: cpp/src/arrow/util/bit_stream_utils.h
##########
@@ -418,14 +450,59 @@ inline bool BitReader::GetVlqInt(uint32_t* v) {
 }
 
 inline bool BitWriter::PutZigZagVlqInt(int32_t v) {
-  auto u_v = ::arrow::util::SafeCopy<uint32_t>(v);
-  return PutVlqInt((u_v << 1) ^ (u_v >> 31));
+  uint32_t u_v = ::arrow::util::SafeCopy<uint32_t>(v);
+  u_v = (u_v << 1) ^ static_cast<uint32_t>(v >> 31);
+  return PutVlqInt(u_v);

Review comment:
       Original implementation of PutZigZagVlqInt was incorrect. The second shift – the (v >> 31) part – is an arithmetic shift according to: https://developers.google.com/protocol-buffers/docs/encoding#signed_integers

##########
File path: cpp/src/parquet/encoding.cc
##########
@@ -2107,73 +2105,99 @@ class DeltaBitPackDecoder : public DecoderImpl, virtual public TypedDecoder<DTyp
   }
 
  private:
-  void InitBlock() {
-    // The number of values per block.
-    uint32_t block_size;
-    if (!decoder_.GetVlqInt(&block_size)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&num_mini_blocks_)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&values_current_block_)) {
+  void InitHeader() {
+    if (!decoder_.GetVlqInt(&values_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&mini_blocks_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&total_value_count_)) {
       ParquetException::EofException();
     }
     if (!decoder_.GetZigZagVlqInt(&last_value_)) ParquetException::EofException();
 
-    delta_bit_widths_ = AllocateBuffer(pool_, num_mini_blocks_);
-    uint8_t* bit_width_data = delta_bit_widths_->mutable_data();
+    delta_bit_widths_ = AllocateBuffer(pool_, mini_blocks_per_block_);
+    values_per_mini_block_ = values_per_block_ / mini_blocks_per_block_;
+    if (values_per_mini_block_ % 8 != 0) {
+      throw ParquetException("miniBlockSize must be multiple of 8, but it's " +

Review comment:
       I've read the code of [DeltaBinaryPackingValuesWriter](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/delta/DeltaBinaryPackingValuesWriter.java#L82) in parquet-mr. The parameters are always DEFAULT_NUM_BLOCK_VALUES(which is 128) and DEFAULT_NUM_MINIBLOCKS(which is 4). So if the file is written by parquet-mr, the values_per_mini_block_ here is always 32.
   But I'm not quiet sure if it should be 32 because the code [here](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/delta/DeltaBinaryPackingConfig.java#L41) is 8, which is dependent on [DeltaBinaryPackingValuesReader](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/delta/DeltaBinaryPackingValuesReader.java#L63) 

##########
File path: cpp/src/parquet/encoding.cc
##########
@@ -2107,73 +2105,99 @@ class DeltaBitPackDecoder : public DecoderImpl, virtual public TypedDecoder<DTyp
   }
 
  private:
-  void InitBlock() {
-    // The number of values per block.
-    uint32_t block_size;
-    if (!decoder_.GetVlqInt(&block_size)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&num_mini_blocks_)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&values_current_block_)) {
+  void InitHeader() {
+    if (!decoder_.GetVlqInt(&values_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&mini_blocks_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&total_value_count_)) {
       ParquetException::EofException();
     }
     if (!decoder_.GetZigZagVlqInt(&last_value_)) ParquetException::EofException();
 
-    delta_bit_widths_ = AllocateBuffer(pool_, num_mini_blocks_);
-    uint8_t* bit_width_data = delta_bit_widths_->mutable_data();
+    delta_bit_widths_ = AllocateBuffer(pool_, mini_blocks_per_block_);
+    values_per_mini_block_ = values_per_block_ / mini_blocks_per_block_;
+    if (values_per_mini_block_ % 8 != 0) {
+      throw ParquetException("miniBlockSize must be multiple of 8, but it's " +
+                             std::to_string(values_per_mini_block_));
+    }
+
+    mini_block_idx_ = mini_blocks_per_block_;
+    values_current_mini_block_ = 0;
+  }
 
+  void InitBlock() {
     if (!decoder_.GetZigZagVlqInt(&min_delta_)) ParquetException::EofException();
-    for (uint32_t i = 0; i < num_mini_blocks_; ++i) {
+
+    // readBitWidthsForMiniBlocks
+    uint8_t* bit_width_data = delta_bit_widths_->mutable_data();
+    for (uint32_t i = 0; i < mini_blocks_per_block_; ++i) {
       if (!decoder_.GetAligned<uint8_t>(1, bit_width_data + i)) {
         ParquetException::EofException();
       }
     }
-    values_per_mini_block_ = block_size / num_mini_blocks_;
     mini_block_idx_ = 0;
     delta_bit_width_ = bit_width_data[0];
     values_current_mini_block_ = values_per_mini_block_;
   }
 
-  template <typename T>
   int GetInternal(T* buffer, int max_values) {
     max_values = std::min(max_values, this->num_values_);
-    const uint8_t* bit_width_data = delta_bit_widths_->data();
-    for (int i = 0; i < max_values; ++i) {
+    DCHECK_LE(static_cast<uint32_t>(max_values), total_value_count_);
+    int i = 0;
+    while (i < max_values) {
       if (ARROW_PREDICT_FALSE(values_current_mini_block_ == 0)) {
         ++mini_block_idx_;
-        if (mini_block_idx_ < static_cast<size_t>(delta_bit_widths_->size())) {
-          delta_bit_width_ = bit_width_data[mini_block_idx_];
+        if (mini_block_idx_ < mini_blocks_per_block_) {
+          delta_bit_width_ = delta_bit_widths_->data()[mini_block_idx_];
           values_current_mini_block_ = values_per_mini_block_;
         } else {
+          // mini_block_idx_ > mini_blocks_per_block_ only if last_value_ is stored in
+          // header
+          if (ARROW_PREDICT_FALSE(mini_block_idx_ > mini_blocks_per_block_)) {

Review comment:
       Good advice! I added a boolean member named "block_initialized_" to indicate whether function "InitBlock" has been called.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for reading DELTA_BINARY_PACKED data

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-900136999


   > Only works for int32/int64 when delta_bit_width_<=32.
   
   What do you mean? Are delta bit widths > 32 not supported? I don't see check in the code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10627: ARROW-13206: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r664773191



##########
File path: cpp/src/parquet/encoding.cc
##########
@@ -2062,17 +2061,12 @@ class DeltaBitPackDecoder : public DecoderImpl, virtual public TypedDecoder<DTyp
 
   explicit DeltaBitPackDecoder(const ColumnDescriptor* descr,
                                MemoryPool* pool = ::arrow::default_memory_pool())
-      : DecoderImpl(descr, Encoding::DELTA_BINARY_PACKED), pool_(pool) {
-    if (DType::type_num != Type::INT32 && DType::type_num != Type::INT64) {

Review comment:
       why remove this check?  




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for reading DELTA_BINARY_PACKED data

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-902041792


   @shanhuuang Do you have a user id on the [Apache JIRA tracker](https://issues.apache.org/jira/), so that I can assign the issue to you?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r690008700



##########
File path: cpp/src/arrow/util/bit_stream_utils.h
##########
@@ -418,14 +450,59 @@ inline bool BitReader::GetVlqInt(uint32_t* v) {
 }
 
 inline bool BitWriter::PutZigZagVlqInt(int32_t v) {
-  auto u_v = ::arrow::util::SafeCopy<uint32_t>(v);
-  return PutVlqInt((u_v << 1) ^ (u_v >> 31));
+  uint32_t u_v = ::arrow::util::SafeCopy<uint32_t>(v);
+  u_v = (u_v << 1) ^ static_cast<uint32_t>(v >> 31);
+  return PutVlqInt(u_v);
 }
 
 inline bool BitReader::GetZigZagVlqInt(int32_t* v) {
   uint32_t u;
   if (!GetVlqInt(&u)) return false;
-  *v = ::arrow::util::SafeCopy<int32_t>((u >> 1) ^ (u << 31));
+  u = (u >> 1) ^ (~(u & 1) + 1);
+  *v = ::arrow::util::SafeCopy<int32_t>(u);
+  return true;
+}
+
+inline bool BitWriter::PutVlqInt(uint64_t v) {
+  bool result = true;
+  while ((v & 0xFFFFFFFFFFFFFF80ULL) != 0ULL) {
+    result &= PutAligned<uint8_t>(static_cast<uint8_t>((v & 0x7F) | 0x80), 1);

Review comment:
       nit: should this be logical and?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #10627: PARQUET-490: [C++][Parquet] Basic support for reading DELTA_BINARY_PACKED data

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #10627:
URL: https://github.com/apache/arrow/pull/10627


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-881816210


   Sorry will try to get to it next week.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
shanhuuang commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-891507186


   > sorry I have had less time then I would have liked recently for Arrow reviews, will try to get to this soon.
   
   OK, Thanks a lot :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang commented on a change in pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
shanhuuang commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r665023770



##########
File path: cpp/src/parquet/encoding.cc
##########
@@ -2062,17 +2061,12 @@ class DeltaBitPackDecoder : public DecoderImpl, virtual public TypedDecoder<DTyp
 
   explicit DeltaBitPackDecoder(const ColumnDescriptor* descr,
                                MemoryPool* pool = ::arrow::default_memory_pool())
-      : DecoderImpl(descr, Encoding::DELTA_BINARY_PACKED), pool_(pool) {
-    if (DType::type_num != Type::INT32 && DType::type_num != Type::INT64) {

Review comment:
       I thought that DeltaBitPackDecoder object would only be created in function parquet::MakeDecoder, so no more check is needed here. It's ok to restore this code.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang edited a comment on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
shanhuuang edited a comment on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-883013956


   > Sorry will try to get to it next week.
   
   Sorry for the delay in getting back to you. This week would be fine :)
   I'm also working on [PARQUET-491](https://issues.apache.org/jira/browse/PARQUET-491) and [PARQUET-492](https://issues.apache.org/jira/browse/PARQUET-492) which depend on code in this PR. Maybe I should create a new PR after this one is merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10627: ARROW-13206: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r664746283



##########
File path: cpp/src/arrow/util/bit_stream_utils.h
##########
@@ -418,14 +450,65 @@ inline bool BitReader::GetVlqInt(uint32_t* v) {
 }
 
 inline bool BitWriter::PutZigZagVlqInt(int32_t v) {
-  auto u_v = ::arrow::util::SafeCopy<uint32_t>(v);
-  return PutVlqInt((u_v << 1) ^ (u_v >> 31));
+  uint32_t u_v = ::arrow::util::SafeCopy<uint32_t>(v);
+  v = (u_v << 1) ^ (v >> 31);
+  u_v = ::arrow::util::SafeCopy<uint32_t>(v);
+  return PutVlqInt(u_v);
 }
 
 inline bool BitReader::GetZigZagVlqInt(int32_t* v) {
   uint32_t u;
   if (!GetVlqInt(&u)) return false;
-  *v = ::arrow::util::SafeCopy<int32_t>((u >> 1) ^ (u << 31));
+  *v = ::arrow::util::SafeCopy<int32_t>(u);

Review comment:
       https://github.com/protocolbuffers/protobuf/blob/0b05994474f3f53bbda8fef1c24acb06f778a3bf/src/google/protobuf/wire_format_lite.h#L867 seems to be more concise and has a comptable license, could we use that and update license.txt to note the use of the code.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r665479865



##########
File path: LICENSE.txt
##########
@@ -2240,3 +2241,46 @@ be copied, modified, or distributed except according to those terms.
 
 Distributed on an "AS IS" BASIS, WITHOUT WARRANTY OF ANY KIND, either
 express or implied.  See your chosen license for details.
+
+--------------------------------------------------------------------------------
+
+The file cpp/src/arrow/util/bit_stream_utils.h contains code from

Review comment:
       there is already a reference to protocol buggers above, could you update that instead?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r665490308



##########
File path: cpp/src/arrow/util/bit_stream_utils.h
##########
@@ -418,14 +450,65 @@ inline bool BitReader::GetVlqInt(uint32_t* v) {
 }
 
 inline bool BitWriter::PutZigZagVlqInt(int32_t v) {
-  auto u_v = ::arrow::util::SafeCopy<uint32_t>(v);
-  return PutVlqInt((u_v << 1) ^ (u_v >> 31));
+  uint32_t u_v = ::arrow::util::SafeCopy<uint32_t>(v);
+  v = (u_v << 1) ^ (v >> 31);
+  u_v = ::arrow::util::SafeCopy<uint32_t>(v);

Review comment:
       Using safe-copy instead could also work.  SafeCopy (and other methods in ubsan.h) are  mostly to avoid undefined behavior.  The most common case of this is when trying to read possibly unaligned values.  
   
   SafeCopy is equivelant to the new [std::bit_cast](https://en.cppreference.com/w/cpp/numeric/bit_cast)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10627: PARQUET-490: [C++][Parquet] Basic support for reading DELTA_BINARY_PACKED data

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r691991030



##########
File path: cpp/src/parquet/encoding.cc
##########
@@ -2107,73 +2105,99 @@ class DeltaBitPackDecoder : public DecoderImpl, virtual public TypedDecoder<DTyp
   }
 
  private:
-  void InitBlock() {
-    // The number of values per block.
-    uint32_t block_size;
-    if (!decoder_.GetVlqInt(&block_size)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&num_mini_blocks_)) ParquetException::EofException();
-    if (!decoder_.GetVlqInt(&values_current_block_)) {
+  void InitHeader() {
+    if (!decoder_.GetVlqInt(&values_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&mini_blocks_per_block_)) ParquetException::EofException();
+    if (!decoder_.GetVlqInt(&total_value_count_)) {
       ParquetException::EofException();
     }
     if (!decoder_.GetZigZagVlqInt(&last_value_)) ParquetException::EofException();
 
-    delta_bit_widths_ = AllocateBuffer(pool_, num_mini_blocks_);
-    uint8_t* bit_width_data = delta_bit_widths_->mutable_data();
+    delta_bit_widths_ = AllocateBuffer(pool_, mini_blocks_per_block_);
+    values_per_mini_block_ = values_per_block_ / mini_blocks_per_block_;
+    if (values_per_mini_block_ % 8 != 0) {
+      throw ParquetException("miniBlockSize must be multiple of 8, but it's " +

Review comment:
       Also, perhaps open a JIRA about the parquet-mr `DeltaBinaryPackingConfig` check?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang commented on a change in pull request #10627: PARQUET-490: [C++][Parquet] Basic support for reading DELTA_BINARY_PACKED data

Posted by GitBox <gi...@apache.org>.
shanhuuang commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r691841357



##########
File path: cpp/src/parquet/CMakeLists.txt
##########
@@ -407,3 +407,7 @@ endif()
 if(ARROW_WITH_ZSTD)
   add_definitions(-DARROW_WITH_ZSTD)
 endif()
+
+if(ARROW_CSV)

Review comment:
       Do you means that I should add the code in function [ADD_PARQUET_TEST](https://github.com/apache/arrow/blob/master/cpp/src/parquet/CMakeLists.txt#L34)? Something like:
   ```
     if(REL_TEST_NAME STREQUAL "arrow-test" AND ARROW_CSV)
       add_definitions(-DARROW_CSV)
     endif()
   ```
   This one also works :)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
shanhuuang commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-881315333


   @emkornfield Could you please review it? Thank you very much!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10627: ARROW-13206: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r664769446



##########
File path: cpp/src/arrow/util/bpacking64_default.h
##########
@@ -0,0 +1,5642 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// This file was generated by script which is modified from its original version in
+// GitHub. Original source:
+// https://github.com/lemire/FrameOfReference/blob/master/scripts/turbopacking64.py
+// The original copyright notice follows.
+
+// This code is released under the
+// Apache License Version 2.0 http://www.apache.org/licenses/.
+// (c) Daniel Lemire 2013
+
+#pragma once
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+
+namespace arrow {
+namespace internal {
+
+inline const uint8_t* unpack0_64(const uint8_t* in, uint64_t* out) {

Review comment:
       same comment as above, if there is a script used to generate this please add it to the PR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] shanhuuang edited a comment on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
shanhuuang edited a comment on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-883013956


   > Sorry will try to get to it next week.
   
   Sorry for the delay in getting back to you. This week would be fine :)
   I'm also working on [PARQUET-491](https://issues.apache.org/jira/browse/PARQUET-491) and [PARQUET-492](https://issues.apache.org/jira/browse/PARQUET-492) which depend on code in this PR. Maybe I should create a new PR after this one is merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10627: ARROW-13206: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#discussion_r664768744



##########
File path: cpp/src/arrow/util/bpacking.cc
##########
@@ -174,5 +176,221 @@ int unpack32(const uint32_t* in, uint32_t* out, int batch_size, int num_bits) {
 #endif
 }
 
+namespace {
+
+int unpack64_default(const uint8_t* in, uint64_t* out, int batch_size, int num_bits) {

Review comment:
       please comment on how this code was generated?  If there is a script could you add that as well.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org