Posted to github@arrow.apache.org by "mapleFU (via GitHub)" <gi...@apache.org> on 2023/05/04 07:21:38 UTC

[GitHub] [arrow] mapleFU opened a new pull request, #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

mapleFU opened a new pull request, #35428:
URL: https://github.com/apache/arrow/pull/35428

   <!--
   Thanks for opening a pull request!
   If this is your first pull request you can find detailed information on how 
   to contribute here:
     * [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
     * [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
   
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes), could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   -->
   
   ### Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   See the issue. 
   
   1. SerializedPageReader reuses the same buffer, so `buffer->size()` might not equal `page.uncompressed_size()`.
   2. As a result, `data_size` in the decoder can end up equal to a previous, larger page size.
   3. When this reaches BYTE_STREAM_SPLIT, the decoder checks that the size is not too large, so it throws an exception. Other decoders are unaffected, because the uninitialized trailing memory is never accessed.
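   The chain above can be sketched with a small standalone model. Everything here (`ReusedBuffer`, the page sizes, the helper names) is hypothetical, not the Arrow API; it only illustrates how a grow-only buffer yields a stale `data_size`:
   
   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <cstdint>
   #include <vector>
   
   // Hypothetical grow-only buffer, mimicking the reuse pattern in
   // SerializedPageReader: resizing never shrinks the buffer.
   struct ReusedBuffer {
     std::vector<uint8_t> bytes;
     void ResizeNoShrink(size_t uncompressed_len) {
       if (uncompressed_len > bytes.size()) bytes.resize(uncompressed_len);
     }
     size_t size() const { return bytes.size(); }
   };
   
   // Buggy derivation: data size taken from the reused buffer.
   size_t buggy_data_size(size_t page1_len, size_t page2_len) {
     ReusedBuffer buf;
     buf.ResizeNoShrink(page1_len);  // page 1 grows the buffer
     buf.ResizeNoShrink(page2_len);  // smaller page 2: buffer keeps old size
     return buf.size();
   }
   
   // Fixed derivation: data size taken from the page's uncompressed size.
   size_t fixed_data_size(size_t /*page1_len*/, size_t page2_len) {
     return page2_len;
   }
   
   int main() {
     // A 100-byte page followed by a 40-byte page: the buggy size is
     // still 100, which an encoding's length check may then reject.
     assert(buggy_data_size(100, 40) == 100);
     assert(fixed_data_size(100, 40) == 40);
     return 0;
   }
   ```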
   
   ### What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   Change the reader and add a test.
   
   ### Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   Yes
   
   ### Are there any user-facing changes?
   
   No
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] adamreeve commented on pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "adamreeve (via GitHub)" <gi...@apache.org>.
adamreeve commented on PR #35428:
URL: https://github.com/apache/arrow/pull/35428#issuecomment-1534389918

   Thanks for the quick response @mapleFU! I can confirm this does fix the issue for me.




[GitHub] [arrow] github-actions[bot] commented on pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35428:
URL: https://github.com/apache/arrow/pull/35428#issuecomment-1534212822

   * Closes: #35423




[GitHub] [arrow] westonpace commented on a diff in pull request #35428: GH-35423: [C++][Parquet] Parquet Encoding Fixing mismatched buffer size

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #35428:
URL: https://github.com/apache/arrow/pull/35428#discussion_r1190183274


##########
cpp/src/arrow/dataset/file_parquet_test.cc:
##########
@@ -222,6 +222,46 @@ TEST_F(TestParquetFileFormat, WriteRecordBatchReaderCustomOptions) {
                     *actual_schema);
 }
 
+#ifdef ARROW_WITH_SNAPPY
+
+TEST_F(TestParquetFileFormat, WriteRecordBatchReaderByteStreamSplit) {

Review Comment:
   If the feature is well tested in `file_deserialize_test.cc` then I think we should remove this. Even though it was found using datasets, I don't think the root cause was in datasets, and it doesn't appear that there are any changes to `file_parquet.cc`.





[GitHub] [arrow] github-actions[bot] commented on pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35428:
URL: https://github.com/apache/arrow/pull/35428#issuecomment-1534212859

   :warning: GitHub issue #35423 **has been automatically assigned in GitHub** to PR creator.




[GitHub] [arrow] mapleFU commented on pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35428:
URL: https://github.com/apache/arrow/pull/35428#issuecomment-1534221863

   The real reason is that:
   
   ```c++
   std::shared_ptr<Buffer> SerializedPageReader::DecompressIfNeeded(
       std::shared_ptr<Buffer> page_buffer, int compressed_len, int uncompressed_len,
       int levels_byte_len) {
     if (decompressor_ == nullptr) {
       return page_buffer;
     }
     if (compressed_len < levels_byte_len || uncompressed_len < levels_byte_len) {
       throw ParquetException("Invalid page header");
     }
   
     // Grow the uncompressed buffer if we need to.
     if (uncompressed_len > static_cast<int>(decompression_buffer_->size())) {
       PARQUET_THROW_NOT_OK(
           decompression_buffer_->Resize(uncompressed_len, /*shrink_to_fit=*/false));
     }
   
     if (levels_byte_len > 0) {
       // First copy the levels as-is
       uint8_t* decompressed = decompression_buffer_->mutable_data();
       memcpy(decompressed, page_buffer->data(), levels_byte_len);
     }
   
     // Decompress the values
     PARQUET_THROW_NOT_OK(decompressor_->Decompress(
         compressed_len - levels_byte_len, page_buffer->data() + levels_byte_len,
         uncompressed_len - levels_byte_len,
         decompression_buffer_->mutable_data() + levels_byte_len));
   
     return decompression_buffer_;
   }
   ```
   
   (1) Here the buffer size only grows and never shrinks.
   (2) Page.buffer uses this buffer's size.
   (3) BYTE_STREAM_SPLIT checks this size.
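   The fix direction reflected in the final PR title (forcing the decompression buffer to resize smaller) can be sketched in the same simplified style. `DecompressionBuffer` and `reported_size_after` are hypothetical stand-ins, not the Arrow types; the idea is that the allocation may stay cached while the *reported* size always matches the current page:
   
   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <cstdint>
   #include <vector>
   
   // Hypothetical stand-in for the reused decompression buffer. Instead
   // of only growing, it tracks the exact uncompressed length of the
   // current page, so size() never reports a stale, larger value.
   struct DecompressionBuffer {
     std::vector<uint8_t> storage;  // the allocation itself may stay cached
     size_t logical_size = 0;       // exact size of the current page
   
     void ResizeExact(size_t uncompressed_len) {
       if (uncompressed_len > storage.size()) storage.resize(uncompressed_len);
       logical_size = uncompressed_len;  // shrink the reported size too
     }
     size_t size() const { return logical_size; }
   };
   
   // Simulate reading a sequence of pages and return the size reported
   // after the last one.
   size_t reported_size_after(const std::vector<size_t>& page_lens) {
     DecompressionBuffer buf;
     for (size_t len : page_lens) buf.ResizeExact(len);
     return buf.size();
   }
   
   int main() {
     // A 100-byte page followed by a 40-byte page now reports 40, so a
     // downstream size check sees the right length for the small page.
     assert(reported_size_after({100, 40}) == 40);
     assert(reported_size_after({40, 100}) == 100);
     return 0;
   }
   ```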




[GitHub] [arrow] mapleFU commented on a diff in pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35428:
URL: https://github.com/apache/arrow/pull/35428#discussion_r1188646953


##########
cpp/src/arrow/dataset/file_parquet_test.cc:
##########
@@ -222,6 +222,46 @@ TEST_F(TestParquetFileFormat, WriteRecordBatchReaderCustomOptions) {
                     *actual_schema);
 }
 
+#ifdef ARROW_WITH_SNAPPY
+
+TEST_F(TestParquetFileFormat, WriteRecordBatchReaderByteStreamSplit) {

Review Comment:
   Not strictly necessary here, but the user reported the problem using datasets with byte stream split, so I added a test here to reproduce the problem and make sure it is fixed. If you want, I can remove this test.





[GitHub] [arrow] mapleFU commented on a diff in pull request #35428: GH-35423: [C++][Parquet] Parquet Encoding Fixing mismatched buffer size

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35428:
URL: https://github.com/apache/arrow/pull/35428#discussion_r1191473065


##########
cpp/src/arrow/dataset/file_parquet_test.cc:
##########
@@ -222,6 +222,46 @@ TEST_F(TestParquetFileFormat, WriteRecordBatchReaderCustomOptions) {
                     *actual_schema);
 }
 
+#ifdef ARROW_WITH_SNAPPY
+
+TEST_F(TestParquetFileFormat, WriteRecordBatchReaderByteStreamSplit) {

Review Comment:
   Okay, changes are removed.





[GitHub] [arrow] pitrou commented on a diff in pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35428:
URL: https://github.com/apache/arrow/pull/35428#discussion_r1188632607


##########
cpp/src/parquet/file_deserialize_test.cc:
##########
@@ -753,7 +755,19 @@ TEST_F(TestPageSerde, Compression) {
 
     ResetStream();
   }
-}  // namespace parquet
+}
+
+TEST_F(TestPageSerde, Compression) {
+  // The pages keep getting larger

Review Comment:
   Can you add a reference to the GH bug id so as to track what motivates these tests?





[GitHub] [arrow] wgtmac commented on a diff in pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35428:
URL: https://github.com/apache/arrow/pull/35428#discussion_r1184663311


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -857,7 +857,10 @@ class ColumnReaderImplBase {
   // first page with this encoding.
   void InitializeDataDecoder(const DataPage& page, int64_t levels_byte_size) {
     const uint8_t* buffer = page.data() + levels_byte_size;
-    const int64_t data_size = page.size() - levels_byte_size;
+    // PageReader may reuse the underlying buffer, so data_size
+    // should use `page.uncompressed_size() - levels_byte_size` rather
+    // than `page.size() - level_byte_size`.
+    const int64_t data_size = page.uncompressed_size() - levels_byte_size;

Review Comment:
   nit: it would be helpful to comment this is now a precise uncompressed data size, compared to an upper bound in the past.





[GitHub] [arrow] mapleFU commented on a diff in pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35428:
URL: https://github.com/apache/arrow/pull/35428#discussion_r1185656187


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -857,7 +857,10 @@ class ColumnReaderImplBase {
   // first page with this encoding.
   void InitializeDataDecoder(const DataPage& page, int64_t levels_byte_size) {
     const uint8_t* buffer = page.data() + levels_byte_size;
-    const int64_t data_size = page.size() - levels_byte_size;
+    // PageReader may reuse the underlying buffer, so data_size
+    // should use `page.uncompressed_size() - levels_byte_size` rather
+    // than `page.size() - level_byte_size`.
+    const int64_t data_size = page.uncompressed_size() - levels_byte_size;

Review Comment:
   So here we should make `uncompressed_size` and `size` equal (at least for DATA_PAGE V1?). I can fix it later this week.





[GitHub] [arrow] pitrou commented on a diff in pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35428:
URL: https://github.com/apache/arrow/pull/35428#discussion_r1188637986


##########
cpp/src/arrow/dataset/file_parquet_test.cc:
##########
@@ -222,6 +222,46 @@ TEST_F(TestParquetFileFormat, WriteRecordBatchReaderCustomOptions) {
                     *actual_schema);
 }
 
+#ifdef ARROW_WITH_SNAPPY
+
+TEST_F(TestParquetFileFormat, WriteRecordBatchReaderByteStreamSplit) {

Review Comment:
   Is it necessary to use datasets for this test, or can the test be implemented in the regular Parquet C++ test suite?





[GitHub] [arrow] mapleFU commented on pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35428:
URL: https://github.com/apache/arrow/pull/35428#issuecomment-1540197004

   Resolved all comments except https://github.com/apache/arrow/pull/35428#discussion_r1188637986 . If it's not necessary I can just remove this test (personally, I think the dataset layer didn't test ByteStreamSplit with compression before, so it's fine to add one).




[GitHub] [arrow] mapleFU commented on a diff in pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35428:
URL: https://github.com/apache/arrow/pull/35428#discussion_r1185161864


##########
cpp/src/arrow/dataset/file_parquet_test.cc:
##########
@@ -222,6 +222,42 @@ TEST_F(TestParquetFileFormat, WriteRecordBatchReaderCustomOptions) {
                     *actual_schema);
 }
 
+TEST_F(TestParquetFileFormat, WriteRecordBatchReaderEncoding) {
+  auto float_schema = schema({field("td", float32())});
+  auto options =
+      checked_pointer_cast<ParquetFileWriteOptions>(format_->DefaultWriteOptions());
+  options->writer_properties = parquet::WriterProperties::Builder()
+                                   .created_by("TestParquetFileFormat")
+                                   ->disable_statistics()
+                                   ->encoding(parquet::Encoding::BYTE_STREAM_SPLIT)
+                                   ->disable_dictionary()
+                                   ->compression(Compression::SNAPPY)

Review Comment:
   Updated





[GitHub] [arrow] pitrou commented on a diff in pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35428:
URL: https://github.com/apache/arrow/pull/35428#discussion_r1188631698


##########
cpp/src/parquet/file_deserialize_test.cc:
##########
@@ -196,6 +197,9 @@ class TestPageSerde : public ::testing::Test {
                         bool verification_checksum, bool has_dictionary = false,
                         bool write_data_page_v2 = false);
 
+  void TestPageCompressionRoundTrip(int num_pages,
+                                    const std::function<int(int)>& getPageSize);

Review Comment:
   Why not allow simply passing a vector of page sizes?
   ```suggestion
     void TestPageCompressionRoundTrip(const std::vector<int32_t>& page_sizes);
   ```





[GitHub] [arrow] wgtmac commented on a diff in pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35428:
URL: https://github.com/apache/arrow/pull/35428#discussion_r1185640256


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -857,7 +857,10 @@ class ColumnReaderImplBase {
   // first page with this encoding.
   void InitializeDataDecoder(const DataPage& page, int64_t levels_byte_size) {
     const uint8_t* buffer = page.data() + levels_byte_size;
-    const int64_t data_size = page.size() - levels_byte_size;
+    // PageReader may reuse the underlying buffer, so data_size
+    // should use `page.uncompressed_size() - levels_byte_size` rather
+    // than `page.size() - level_byte_size`.
+    const int64_t data_size = page.uncompressed_size() - levels_byte_size;

Review Comment:
   Did you mean here: https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_page.h#L53-L54 ?
   
   It currently says `the total size in bytes of the page's data buffer` and it is kind of misleading. I agree we need to improve it. Something like `its size may exceed the actual size in use`? Please suggest a better statement @wjones127 
   
   Or should we fix the `size()` function to always return the actual size in use to get rid of the confusion?





[GitHub] [arrow] mapleFU commented on pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35428:
URL: https://github.com/apache/arrow/pull/35428#issuecomment-1536395989

   @wgtmac @wjones127 I've changed the implementation inside `SerializedPageReader`, would you mind taking a look?




[GitHub] [arrow] pitrou commented on a diff in pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35428:
URL: https://github.com/apache/arrow/pull/35428#discussion_r1188629713


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -581,11 +581,8 @@ std::shared_ptr<Buffer> SerializedPageReader::DecompressIfNeeded(
     throw ParquetException("Invalid page header");
   }
 
-  // Grow the uncompressed buffer if we need to.

Review Comment:
   We can keep the comment here as it's still valid.





[GitHub] [arrow] mapleFU commented on pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35428:
URL: https://github.com/apache/arrow/pull/35428#issuecomment-1534234481

   @adamreeve Would you mind trying this?




[GitHub] [arrow] wjones127 commented on a diff in pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #35428:
URL: https://github.com/apache/arrow/pull/35428#discussion_r1186198011


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -857,7 +857,10 @@ class ColumnReaderImplBase {
   // first page with this encoding.
   void InitializeDataDecoder(const DataPage& page, int64_t levels_byte_size) {
     const uint8_t* buffer = page.data() + levels_byte_size;
-    const int64_t data_size = page.size() - levels_byte_size;
+    // PageReader may reuse the underlying buffer, so data_size
+    // should use `page.uncompressed_size() - levels_byte_size` rather
+    // than `page.size() - level_byte_size`.
+    const int64_t data_size = page.uncompressed_size() - levels_byte_size;

Review Comment:
   Yes, that's where I meant. If there's some way to fix the `size()` function I would prefer that.





[GitHub] [arrow] wjones127 commented on a diff in pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #35428:
URL: https://github.com/apache/arrow/pull/35428#discussion_r1186207129


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -857,7 +857,10 @@ class ColumnReaderImplBase {
   // first page with this encoding.
   void InitializeDataDecoder(const DataPage& page, int64_t levels_byte_size) {
     const uint8_t* buffer = page.data() + levels_byte_size;
-    const int64_t data_size = page.size() - levels_byte_size;
+    // PageReader may reuse the underlying buffer, so data_size
+    // should use `page.uncompressed_size() - levels_byte_size` rather
+    // than `page.size() - level_byte_size`.
+    const int64_t data_size = page.uncompressed_size() - levels_byte_size;

Review Comment:
   Ah, never mind. That's what the latest code does. I'm all caught up then :)





[GitHub] [arrow] mapleFU commented on pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35428:
URL: https://github.com/apache/arrow/pull/35428#issuecomment-1534256412

   Yes, another way to fix this is to use the concrete buffer size in `SerializedPageReader`. https://github.com/apache/arrow/pull/35368 is also related; it seems PageReader is a bit hacky here.




[GitHub] [arrow] pitrou merged pull request #35428: GH-35423: [C++][Parquet] Parquet PageReader Force decompression buffer resize smaller

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou merged PR #35428:
URL: https://github.com/apache/arrow/pull/35428




[GitHub] [arrow] wjones127 commented on a diff in pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #35428:
URL: https://github.com/apache/arrow/pull/35428#discussion_r1186205624


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -857,7 +857,10 @@ class ColumnReaderImplBase {
   // first page with this encoding.
   void InitializeDataDecoder(const DataPage& page, int64_t levels_byte_size) {
     const uint8_t* buffer = page.data() + levels_byte_size;
-    const int64_t data_size = page.size() - levels_byte_size;
+    // PageReader may reuse the underlying buffer, so data_size
+    // should use `page.uncompressed_size() - levels_byte_size` rather
+    // than `page.size() - level_byte_size`.
+    const int64_t data_size = page.uncompressed_size() - levels_byte_size;

Review Comment:
   Yeah I'm thinking could we instead fix change these lines:
   
   https://github.com/apache/arrow/blob/99ac74e6890767e1a3cf96d7aff0b96b9503153c/cpp/src/parquet/column_reader.cc#L584-L588
   
   To something like (change the inequality to `!=`):
   
   ```cpp
    // Grow the uncompressed buffer if we need to. Regardless, make sure the buffers
    // size is reported correctly.
    if (uncompressed_len != static_cast<int>(decompression_buffer_->size())) { 
      PARQUET_THROW_NOT_OK( 
          decompression_buffer_->Resize(uncompressed_len, /*shrink_to_fit=*/false)); 
    } 
   ```
   
   Would that fix the same bug?





[GitHub] [arrow] mapleFU commented on pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35428:
URL: https://github.com/apache/arrow/pull/35428#issuecomment-1536554572

   ```
   In file included from /arrow/cpp/src/arrow/builder_benchmark.cc:32:
   /arrow/cpp/src/arrow/util/benchmark_util.h:150:8: error: 'void arrow::MemoryPoolMemoryManager::Stop(benchmark::MemoryManager::Result*)' marked 'override', but does not override
     150 |   void Stop(benchmark::MemoryManager::Result* result) override {
         |        ^~~~
   /arrow/cpp/src/arrow/util/benchmark_util.h:201:36: error: cannot declare field 'arrow::BenchmarkMemoryTracker::manager_' to be of abstract type 'arrow::MemoryPoolMemoryManager'
     201 |   ::arrow::MemoryPoolMemoryManager manager_;
         |                                    ^~~~~~~~
   /arrow/cpp/src/arrow/util/benchmark_util.h:142:7: note:   because the following virtual functions are pure within 'arrow::MemoryPoolMemoryManager':
     142 | class MemoryPoolMemoryManager : public benchmark::MemoryManager {
         |       ^~~~~~~~~~~~~~~~~~~~~~~
   In file included from /arrow/cpp/src/arrow/builder_benchmark.cc:27:
   /opt/conda/envs/arrow/include/benchmark/benchmark.h:408:16: note:     'virtual void benchmark::MemoryManager::Stop(benchmark::MemoryManager::Result&)'
     408 |   virtual void Stop(Result& result) = 0;
         |                ^~~~
   ```
   
   Hmm, it seems this isn't caused by my change? I'm not sure how it happened...




[GitHub] [arrow] pitrou commented on a diff in pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35428:
URL: https://github.com/apache/arrow/pull/35428#discussion_r1188704369


##########
cpp/src/arrow/dataset/file_parquet_test.cc:
##########
@@ -222,6 +222,46 @@ TEST_F(TestParquetFileFormat, WriteRecordBatchReaderCustomOptions) {
                     *actual_schema);
 }
 
+#ifdef ARROW_WITH_SNAPPY
+
+TEST_F(TestParquetFileFormat, WriteRecordBatchReaderByteStreamSplit) {

Review Comment:
   I'll let @westonpace comment on whether this test is relevant for the datasets layer.





[GitHub] [arrow] mapleFU commented on pull request #35428: GH-35423: [C++][Parquet] Parquet PageReader Force decompression buffer resize smaller

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35428:
URL: https://github.com/apache/arrow/pull/35428#issuecomment-1546036891

   Changed the description of this patch.




[GitHub] [arrow] ursabot commented on pull request #35428: GH-35423: [C++][Parquet] Parquet PageReader Force decompression buffer resize smaller

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35428:
URL: https://github.com/apache/arrow/pull/35428#issuecomment-1548947512

   Benchmark runs are scheduled for baseline = 660cb6efe1ab7e483f125c019839c57e5b2aeab4 and contender = e2e3a9df28db03b09b5a83a60fa8293dfef6d0d8. e2e3a9df28db03b09b5a83a60fa8293dfef6d0d8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/950760036b014a8f9036ca2ecda49e9f...6ed12a58585a470f869849b391ebfcc8/)
   [Failed :arrow_down:0.17% :arrow_up:0.03%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/4ca5cc02911546139944cc20ee7fa87b...047ee94172d440ad86e5b435d1d5f493/)
   [Finished :arrow_down:0.25% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/88c1e3aedee348d8966bb263ef015994...25dc661ddcc7437f836227a1724ba395/)
   [Finished :arrow_down:0.54% :arrow_up:0.27%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/9a223188de8242aa94f5688cf6cb8f01...893e7cf799da4cf7b6d570f782cd6566/)
   Buildkite builds:
   [Finished] [`e2e3a9df` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2874)
   [Finished] [`e2e3a9df` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2910)
   [Finished] [`e2e3a9df` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2875)
   [Finished] [`e2e3a9df` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2900)
   [Finished] [`660cb6ef` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2873)
   [Failed] [`660cb6ef` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2909)
   [Finished] [`660cb6ef` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2874)
   [Finished] [`660cb6ef` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2899)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow] mapleFU commented on pull request #35428: GH-35423: [C++][Parquet] Parquet Encoding Fixing mismatched buffer size

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35428:
URL: https://github.com/apache/arrow/pull/35428#issuecomment-1544376362

   The changes to the dataset tests have been removed. Would you mind taking another look?




[GitHub] [arrow] mapleFU commented on a diff in pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35428:
URL: https://github.com/apache/arrow/pull/35428#discussion_r1188647613


##########
cpp/src/parquet/file_deserialize_test.cc:
##########
@@ -753,7 +755,19 @@ TEST_F(TestPageSerde, Compression) {
 
     ResetStream();
   }
-}  // namespace parquet
+}
+
+TEST_F(TestPageSerde, Compression) {
+  // The pages keep getting larger

Review Comment:
   Sure. Previously this was for testing CRC and compression. I'll add the motivation for my new tests here.





[GitHub] [arrow] wjones127 commented on a diff in pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #35428:
URL: https://github.com/apache/arrow/pull/35428#discussion_r1185132474


##########
cpp/src/arrow/dataset/file_parquet_test.cc:
##########
@@ -222,6 +222,42 @@ TEST_F(TestParquetFileFormat, WriteRecordBatchReaderCustomOptions) {
                     *actual_schema);
 }
 
+TEST_F(TestParquetFileFormat, WriteRecordBatchReaderEncoding) {
+  auto float_schema = schema({field("td", float32())});
+  auto options =
+      checked_pointer_cast<ParquetFileWriteOptions>(format_->DefaultWriteOptions());
+  options->writer_properties = parquet::WriterProperties::Builder()
+                                   .created_by("TestParquetFileFormat")
+                                   ->disable_statistics()
+                                   ->encoding(parquet::Encoding::BYTE_STREAM_SPLIT)
+                                   ->disable_dictionary()
+                                   ->compression(Compression::SNAPPY)

Review Comment:
   In order to use snappy, you should wrap the test in:
   
   ```cpp
   #ifdef ARROW_WITH_SNAPPY
   ...
   #endif
   ```





[GitHub] [arrow] wjones127 commented on a diff in pull request #35428: GH-35423: [C++][Parquet][BugFix] Parquet Encoding Fixing mismatched buffer size

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #35428:
URL: https://github.com/apache/arrow/pull/35428#discussion_r1185258936


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -857,7 +857,10 @@ class ColumnReaderImplBase {
   // first page with this encoding.
   void InitializeDataDecoder(const DataPage& page, int64_t levels_byte_size) {
     const uint8_t* buffer = page.data() + levels_byte_size;
-    const int64_t data_size = page.size() - levels_byte_size;
+    // PageReader may reuse the underlying buffer, so data_size
+    // should use `page.uncompressed_size() - levels_byte_size` rather
+    // than `page.size() - level_byte_size`.
+    const int64_t data_size = page.uncompressed_size() - levels_byte_size;

Review Comment:
   Should we update the `Page` docs too? I would not have guessed that `Page.size()` would return an upper bound and not the exact size. Or are we using the class in a weird way that causes this?
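   To illustrate the buffer-reuse behavior this change works around, here is a minimal standalone sketch (hypothetical names, not Arrow's actual `Page` or `SerializedPageReader`): the reader keeps one decompression buffer and only grows it across pages, so the buffer's size is an upper bound while `uncompressed_size` is the exact payload length of the current page.
   
   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <vector>
   
   struct PageBuffer {
     std::vector<uint8_t> data;      // reused across pages; never shrunk here
     int64_t uncompressed_size = 0;  // exact size of the current page
   
     void LoadPage(int64_t size) {
       // Grow-only reuse: a smaller page leaves the buffer at its old size.
       if (size > static_cast<int64_t>(data.size())) data.resize(size);
       uncompressed_size = size;
     }
   };
   
   int main() {
     PageBuffer buf;
     buf.LoadPage(100);  // buffer grows to 100 bytes
     buf.LoadPage(40);   // smaller page: buffer stays at 100 bytes
     assert(buf.data.size() == 100);      // upper bound, not the page size
     assert(buf.uncompressed_size == 40); // the exact size to decode
     return 0;
   }
   ```
   
   Under this reuse scheme, decoding with the buffer size instead of `uncompressed_size` would hand the decoder up to 60 stale trailing bytes in the example above.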


