You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "wgtmac (via GitHub)" <gi...@apache.org> on 2023/02/09 11:44:32 UTC

[GitHub] [arrow] wgtmac opened a new pull request, #34096: GH-34086: [C++][Parquet] Fix writing num_rows to data page v2

wgtmac opened a new pull request, #34096:
URL: https://github.com/apache/arrow/pull/34096

   ### Rationale for this change
   
   The C++ parquet writer does not correctly fill num_rows field to DataPageV2 header.
   
   ### What changes are included in this PR?
   
   ColumnWriter keeps track of number of rows buffered in the current data page and then fills it into header of data page v2.
   
   ### Are these changes tested?
   
   A test case has been added to make sure the data page header has been set correctly for required, optional and repeated columns.
   
   ### Are there any user-facing changes?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #34096: GH-34086: [C++][Parquet] Fix writing num_rows to data page v2

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #34096:
URL: https://github.com/apache/arrow/pull/34096#issuecomment-1426164685

   Great, can you help me to understand this patch? So, previouslu DATA_PAGE_V2 has bad num_rows in header, if it has:
   
   ```
   Page1: [50 rows] Page2: [50 rows] 
   ```
   
   The header would be:
   
   ```
   Page1: [50 rows] Page2: [100 rows] 
   ```
   
   And `num_buffered_rows_` is within every page. `rows_written_` is equal to sum of `num_buffered_rows_`, though currently they are added together?
   
   Am I right?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] etseidl commented on pull request #34096: GH-34086: [C++][Parquet] Fix writing num_rows to data page v2

Posted by "etseidl (via GitHub)" <gi...@apache.org>.
etseidl commented on PR #34096:
URL: https://github.com/apache/arrow/pull/34096#issuecomment-1426613187

   > @etseidl I got it, seems once a page has def-level or def-level and rep-level, it values and num_rows would be not equal. And Page Header just using num_value, would could be more or less than real storing rows.
   > 
   > 
   > 
   > Because the `ColumnWriter` only has `rows_written_`, so it needs a `num_buffered_rows_` to get rows it actually written.
   > 
   > 
   > 
   > Am I right?
   
   Yes!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #34096: GH-34086: [C++][Parquet] Fix writing num_rows to data page v2

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34096:
URL: https://github.com/apache/arrow/pull/34096#discussion_r1102002469


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -854,8 +858,21 @@ void ColumnWriterImpl::AddDataPage() {
   InitSinks();
   num_buffered_values_ = 0;
   num_buffered_encoded_values_ = 0;
+  num_buffered_rows_ = 0;
 }
 
+namespace {
+
+int32_t safe_cast_64_to_32(int64_t value) {
+  if (value > std::numeric_limits<int32_t>::max() ||
+      value < std::numeric_limits<int32_t>::min()) {
+    throw ParquetException("Cannot cast ", value, " to int32_t");
+  }
+  return static_cast<int32_t>(value);
+}

Review Comment:
   I have removed the check. Please take a look again. Thanks @wjones127 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34096: GH-34086: [C++][Parquet] Fix writing num_rows to data page v2

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34096:
URL: https://github.com/apache/arrow/pull/34096#issuecomment-1424056438

   :warning: GitHub issue #34086 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 merged pull request #34096: GH-34086: [C++][Parquet] Fix writing num_rows to data page v2

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 merged PR #34096:
URL: https://github.com/apache/arrow/pull/34096


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #34096: GH-34086: [C++][Parquet] Fix writing num_rows to data page v2

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #34096:
URL: https://github.com/apache/arrow/pull/34096#issuecomment-1426612602

   @etseidl I got it, seems once a page has def-level or def-level and rep-level, it values and num_rows would be not equal. And Page Header just using num_value, would could be more or less than real storing rows.
   
   Because the `ColumnWriter` only has `rows_written_`, so it needs a `num_buffered_rows_` to get rows it actually written.
   
   Am I right?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #34096: GH-34086: [C++][Parquet] Fix writing num_rows to data page v2

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34096:
URL: https://github.com/apache/arrow/pull/34096#issuecomment-1424057581

   @pitrou @wjones127 Please take a look when you have time. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on a diff in pull request #34096: GH-34086: [C++][Parquet] Fix writing num_rows to data page v2

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #34096:
URL: https://github.com/apache/arrow/pull/34096#discussion_r1101910229


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -854,8 +858,21 @@ void ColumnWriterImpl::AddDataPage() {
   InitSinks();
   num_buffered_values_ = 0;
   num_buffered_encoded_values_ = 0;
+  num_buffered_rows_ = 0;
 }
 
+namespace {
+
+int32_t safe_cast_64_to_32(int64_t value) {
+  if (value > std::numeric_limits<int32_t>::max() ||
+      value < std::numeric_limits<int32_t>::min()) {
+    throw ParquetException("Cannot cast ", value, " to int32_t");
+  }
+  return static_cast<int32_t>(value);
+}

Review Comment:
   Is this necessary? Is it even possible to write more than `std::numeric_limits<int32_t>::max()` values to a row group in parquet?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34096: GH-34086: [C++][Parquet] Fix writing num_rows to data page v2

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34096:
URL: https://github.com/apache/arrow/pull/34096#issuecomment-1424056391

   * Closes: #34086


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] etseidl commented on pull request #34096: GH-34086: [C++][Parquet] Fix writing num_rows to data page v2

Posted by "etseidl (via GitHub)" <gi...@apache.org>.
etseidl commented on PR #34096:
URL: https://github.com/apache/arrow/pull/34096#issuecomment-1426224839

   @mapleFU the issue is with nested columns (a list column for instance).  The number of values in a given page for that column will be different from the number of rows.  The previous writer was using `num_values` for both the `num_values` and `num_rows` fields in the page header.  This PR fixes this to use `num_rows` for the `num_rows` field.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org