You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "wgtmac (via GitHub)" <gi...@apache.org> on 2023/05/06 02:47:20 UTC

[GitHub] [arrow] wgtmac opened a new pull request, #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

wgtmac opened a new pull request, #35455:
URL: https://github.com/apache/arrow/pull/35455

   ### Rationale for this change
   
   parquet-mr does not write page stats to the header once page index is enabled: https://issues.apache.org/jira/browse/PARQUET-1365. We should follow the same thing.
   
   ### What changes are included in this PR?
   
   Once page index is enabled for one column, it does not write page stats to the header any more.
   
   ### Are these changes tested?
   
   Added a test to check page stats have been skipped.
   
   ### Are there any user-facing changes?
   
   NO.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #35455:
URL: https://github.com/apache/arrow/pull/35455#issuecomment-1578338001

   > Nice, I see. What about the column index, are you using it?
   
   Yes, we did a naive implementation of predicate push down similar to what the parquet-mr does.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #35455:
URL: https://github.com/apache/arrow/pull/35455#issuecomment-1544291922

   > I guess maybe a user previously uses page filter. Now when he/she enables page index, the filter would not work
   
   I think the documentation is also missing for stats filtering on the page header. Since page index is an ongoing effort and it is not enabled by default, users will not get confused by this for now.
   
   > About statistics, though nan will break the page index, and page statistics and make full use of nan-counts. I think this on-going patch helps https://github.com/apache/parquet-format/pull/196
   
   Due to the inactivity in that PR, I assume it will take a long time to move forward as we need somehow to reach a consensus from the community.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35455:
URL: https://github.com/apache/arrow/pull/35455#discussion_r1219288708


##########
cpp/src/parquet/properties.h:
##########
@@ -541,6 +544,8 @@ class PARQUET_EXPORT WriterProperties {
     }
 
     /// Enable writing page index for column specified by `path`. Default disabled.
+    /// Note that it does not write statistics to the page header once page index is
+    /// enabled.

Review Comment:
   Update this comment as well? Or just removed it as things are explained in more detail above.



##########
cpp/src/parquet/properties.h:
##########
@@ -541,6 +544,8 @@ class PARQUET_EXPORT WriterProperties {
     }
 
     /// Enable writing page index for column specified by `path`. Default disabled.
+    /// Note that it does not write statistics to the page header once page index is
+    /// enabled.

Review Comment:
   Update this comment as well? Or just remove it as things are explained in more detail above.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #35455:
URL: https://github.com/apache/arrow/pull/35455#issuecomment-1540366484

   Please take a look when you have time, thanks! @wjones127 @emkornfield 
   
   cc @fatemehp since this change may affect page filtering based on header stats.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #35455:
URL: https://github.com/apache/arrow/pull/35455#issuecomment-1578175296

   Could you please take a look again? @pitrou 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] emkornfield commented on pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "emkornfield (via GitHub)" <gi...@apache.org>.
emkornfield commented on PR #35455:
URL: https://github.com/apache/arrow/pull/35455#issuecomment-1540956271

   Seems OK to me as writing column indices is opt-in?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35455:
URL: https://github.com/apache/arrow/pull/35455#discussion_r1204392111


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -460,7 +460,11 @@ class SerializedPageWriter : public PageWriter {
         ToThrift(page.definition_level_encoding()));
     data_page_header.__set_repetition_level_encoding(
         ToThrift(page.repetition_level_encoding()));
-    data_page_header.__set_statistics(ToThrift(page.statistics()));
+
+    // Write page statistics only when page index is not enabled.
+    if (column_index_builder_ == nullptr) {
+      data_page_header.__set_statistics(ToThrift(page.statistics()));
+    }

Review Comment:
   The stats computation is still required to build page index (a.k.a. the page stats of ColumnIndex).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35455:
URL: https://github.com/apache/arrow/pull/35455#discussion_r1219232053


##########
cpp/src/parquet/properties.h:
##########
@@ -525,7 +525,8 @@ class PARQUET_EXPORT WriterProperties {
     /// Enable writing page index in general for all columns. Default disabled.
     ///
     /// Page index contains statistics for data pages and can be used to skip pages
-    /// when scanning data in ordered and unordered columns.
+    /// when scanning data in ordered and unordered columns. Note that it does not

Review Comment:
   Here "it" denotes to the parquet writer or column writer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #35455:
URL: https://github.com/apache/arrow/pull/35455#issuecomment-1578277465

   > Did you try to benchmark writing the page index vs. writing statistics in data page headers? Perhaps in the future we can enable the page index by default?
   
   I didn't benchmark the writer time because it should spend more time when page index is enabled because:
   - It only skips writing stats to the thrift-encoded header but the stats comparison (which is heavy) in the column writer still does the job.
   - Page index builder also does more work than just serializing stats. It also collects sorting order from page stats, collects page offsets and so on.
   
   The main goal of skipping writing stats to page header mainly is to reduce the file size as they are duplicated and easier to get from the column index. We have internally enabled page index by default. The benefit brought by page index is promising.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #35455:
URL: https://github.com/apache/arrow/pull/35455#issuecomment-1590313850

   > @wgtmac could you elaborate on how you are achieving 2X~4X acceleration of reading the column chunk from the cloud object store you mentioned above? Are you reading the column chunk page by page?
   
   In short, collect all offset/length ranges of required pages, then coalesce them into reasonable I/O chunks and issue async reads before reading any page.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35455:
URL: https://github.com/apache/arrow/pull/35455#discussion_r1204394958


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -460,7 +460,11 @@ class SerializedPageWriter : public PageWriter {
         ToThrift(page.definition_level_encoding()));
     data_page_header.__set_repetition_level_encoding(
         ToThrift(page.repetition_level_encoding()));
-    data_page_header.__set_statistics(ToThrift(page.statistics()));
+
+    // Write page statistics only when page index is not enabled.
+    if (column_index_builder_ == nullptr) {
+      data_page_header.__set_statistics(ToThrift(page.statistics()));
+    }

Review Comment:
   Ok, I see. Also, why do you use the wording "page index"? It's called column index in the spec.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #35455:
URL: https://github.com/apache/arrow/pull/35455#issuecomment-1561249215

   @pitrou Could you take a look as well? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35455:
URL: https://github.com/apache/arrow/pull/35455#discussion_r1219233736


##########
cpp/src/parquet/properties.h:
##########
@@ -525,7 +525,8 @@ class PARQUET_EXPORT WriterProperties {
     /// Enable writing page index in general for all columns. Default disabled.
     ///
     /// Page index contains statistics for data pages and can be used to skip pages
-    /// when scanning data in ordered and unordered columns.
+    /// when scanning data in ordered and unordered columns. Note that it does not

Review Comment:
   Perhaps:
   > Writing statistics to the page index disables the old method of writing statistics to each data page header.
   > The page index makes filtering more efficient than the page header, as it gathers all the statistics for a Parquet file in a single place, avoiding scattered I/O.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #35455:
URL: https://github.com/apache/arrow/pull/35455#issuecomment-1544265866

   Sure. So I suppose modify my proposal to: if we fully support predicate pushdown for page index in version N, let's set this to default in version N + 1 (assuming there are no unresolved issues we find).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou merged pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou merged PR #35455:
URL: https://github.com/apache/arrow/pull/35455


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35455:
URL: https://github.com/apache/arrow/pull/35455#discussion_r1219224647


##########
cpp/src/parquet/properties.h:
##########
@@ -525,7 +525,8 @@ class PARQUET_EXPORT WriterProperties {
     /// Enable writing page index in general for all columns. Default disabled.
     ///
     /// Page index contains statistics for data pages and can be used to skip pages
-    /// when scanning data in ordered and unordered columns.
+    /// when scanning data in ordered and unordered columns. Note that it does not

Review Comment:
   What is "it" in this context? The sentence isn't very clear.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #35455:
URL: https://github.com/apache/arrow/pull/35455#issuecomment-1543146674

   > I think we should add support for leveraging it in the readers in 13.0.0
   
   Yes, but I don't think the time frame is sufficient for a complete page skipping implementation. We can support it progressively.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35455:
URL: https://github.com/apache/arrow/pull/35455#discussion_r1186610366


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -460,7 +460,11 @@ class SerializedPageWriter : public PageWriter {
         ToThrift(page.definition_level_encoding()));
     data_page_header.__set_repetition_level_encoding(
         ToThrift(page.repetition_level_encoding()));
-    data_page_header.__set_statistics(ToThrift(page.statistics()));
+
+    // Write page statistics only when page index is not enabled.

Review Comment:
   Yes, in this case it is more efficient to use the column index instead.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35455:
URL: https://github.com/apache/arrow/pull/35455#discussion_r1204426865


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -460,7 +460,11 @@ class SerializedPageWriter : public PageWriter {
         ToThrift(page.definition_level_encoding()));
     data_page_header.__set_repetition_level_encoding(
         ToThrift(page.repetition_level_encoding()));
-    data_page_header.__set_statistics(ToThrift(page.statistics()));
+
+    // Write page statistics only when page index is not enabled.
+    if (column_index_builder_ == nullptr) {
+      data_page_header.__set_statistics(ToThrift(page.statistics()));
+    }

Review Comment:
   Hmm, that's true, my bad.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35455:
URL: https://github.com/apache/arrow/pull/35455#issuecomment-1537009500

   * Closes: #34375


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35455:
URL: https://github.com/apache/arrow/pull/35455#discussion_r1186610501


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -479,7 +483,11 @@ class SerializedPageWriter : public PageWriter {
         page.repetition_levels_byte_length());
 
     data_page_header.__set_is_compressed(page.is_compressed());
-    data_page_header.__set_statistics(ToThrift(page.statistics()));
+
+    // Write page statistics only when page index is not enabled.

Review Comment:
   In this case, the page stats in the header is also not very useful. Actually in the parquet-mr, it does not write page stats into the header any more.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] fatemehp commented on pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "fatemehp (via GitHub)" <gi...@apache.org>.
fatemehp commented on PR #35455:
URL: https://github.com/apache/arrow/pull/35455#issuecomment-1590264083

   @wgtmac could you elaborate on how you are achieving 2X~4X acceleration of reading the column chunk from the cloud object store you mentioned above? Are you reading the column chunk page by page?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35455:
URL: https://github.com/apache/arrow/pull/35455#discussion_r1219291183


##########
cpp/src/parquet/properties.h:
##########
@@ -541,6 +544,8 @@ class PARQUET_EXPORT WriterProperties {
     }
 
     /// Enable writing page index for column specified by `path`. Default disabled.
+    /// Note that it does not write statistics to the page header once page index is
+    /// enabled.

Review Comment:
   Sorry for missing that! Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35455:
URL: https://github.com/apache/arrow/pull/35455#discussion_r1186610202


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -460,7 +460,11 @@ class SerializedPageWriter : public PageWriter {
         ToThrift(page.definition_level_encoding()));
     data_page_header.__set_repetition_level_encoding(
         ToThrift(page.repetition_level_encoding()));
-    data_page_header.__set_statistics(ToThrift(page.statistics()));
+
+    // Write page statistics only when page index is not enabled.

Review Comment:
   Seems that this will invalidate page filter?



##########
cpp/src/parquet/column_writer.cc:
##########
@@ -479,7 +483,11 @@ class SerializedPageWriter : public PageWriter {
         page.repetition_levels_byte_length());
 
     data_page_header.__set_is_compressed(page.is_compressed());
-    data_page_header.__set_statistics(ToThrift(page.statistics()));
+
+    // Write page statistics only when page index is not enabled.

Review Comment:
   what if that Page Index is canceled, like all values is nan?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35455:
URL: https://github.com/apache/arrow/pull/35455#discussion_r1204397905


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -460,7 +460,11 @@ class SerializedPageWriter : public PageWriter {
         ToThrift(page.definition_level_encoding()));
     data_page_header.__set_repetition_level_encoding(
         ToThrift(page.repetition_level_encoding()));
-    data_page_header.__set_statistics(ToThrift(page.statistics()));
+
+    // Write page statistics only when page index is not enabled.
+    if (column_index_builder_ == nullptr) {
+      data_page_header.__set_statistics(ToThrift(page.statistics()));
+    }

Review Comment:
   Because `PageIndex` consists of ColumnIndex and OffsetIndex.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #35455:
URL: https://github.com/apache/arrow/pull/35455#issuecomment-1542799456

   > Yes, by default page index is disabled. I am not sure if we should turn on it by default someday to align with what parquet-mr does.
   
   I think we eventually want to turn it on by default. We just release support for it in 12.0.0, but it's not yet used for any data skipping in the Parquet reader or dataset scanner (right?). I think we should add support for leveraging it in the readers in 13.0.0, and then if it seems to be working for users we can turn in on by default in 14.0.0. Does that seem like a reasonable plan?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35455:
URL: https://github.com/apache/arrow/pull/35455#issuecomment-1579096021

   Benchmark runs are scheduled for baseline = 68cbc6fe79203be597d4b274a62de6966bf9181c and contender = 7f8ccb5ff17980a7ca9c1668f33689738c634144. 7f8ccb5ff17980a7ca9c1668f33689738c634144 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Failed] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/aa90139be6bc458eae7fb86b17a48feb...f3725c6d88d04f028290ef18cc26db4d/)
   [Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/d46b8964796e4429b39faf0dc15301ea...65899fda0d5c43afa2dabfd4cf5f648a/)
   [Failed] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/81e42fcb9f6a4186ada122eb7de2c35f...24a96de6afe9445aa4b23083387a7f6d/)
   [Failed] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/007f087537e44a2d9043fe57ec01bb92...879d6b73a9164786bd4360af3baa6d52/)
   Buildkite builds:
   [Failed] [`7f8ccb5f` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2984)
   [Failed] [`7f8ccb5f` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/3020)
   [Failed] [`7f8ccb5f` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2985)
   [Failed] [`7f8ccb5f` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/3010)
   [Failed] [`68cbc6fe` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2983)
   [Failed] [`68cbc6fe` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/3019)
   [Failed] [`68cbc6fe` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2984)
   [Failed] [`68cbc6fe` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/3009)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35455:
URL: https://github.com/apache/arrow/pull/35455#issuecomment-1578249646

   Did you try to benchmark writing the page index vs. writing statistics in data page headers?
   Perhaps in the future we can enable the page index by default?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #35455:
URL: https://github.com/apache/arrow/pull/35455#issuecomment-1542806146

   Added some TODOs in this issue: https://github.com/apache/arrow/issues/26168


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #35455:
URL: https://github.com/apache/arrow/pull/35455#issuecomment-1542304391

   > Seems OK to me as writing column indices is opt-in?
   
   Yes, by default page index is disabled. I am not sure if we should turn on it by default someday to align with what parquet-mr does.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35455:
URL: https://github.com/apache/arrow/pull/35455#discussion_r1204382141


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -460,7 +460,11 @@ class SerializedPageWriter : public PageWriter {
         ToThrift(page.definition_level_encoding()));
     data_page_header.__set_repetition_level_encoding(
         ToThrift(page.repetition_level_encoding()));
-    data_page_header.__set_statistics(ToThrift(page.statistics()));
+
+    // Write page statistics only when page index is not enabled.
+    if (column_index_builder_ == nullptr) {
+      data_page_header.__set_statistics(ToThrift(page.statistics()));
+    }

Review Comment:
   This only disables writing the statistics, but still computes them, which is suboptimal.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35455:
URL: https://github.com/apache/arrow/pull/35455#discussion_r1219236987


##########
docs/source/cpp/parquet.rst:
##########
@@ -573,20 +575,14 @@ Miscellaneous
 +--------------------------+----------+----------+---------+
 | Feature                  | Reading  | Writing  | Notes   |
 +==========================+==========+==========+=========+
-| Column Index             | ✓        |          | \(1)    |
+| Column Index             | ✓        | ✓        |         |
 +--------------------------+----------+----------+---------+
-| Offset Index             | ✓        |          | \(1)    |
+| Offset Index             | ✓        | ✓        |         |
 +--------------------------+----------+----------+---------+
-| Bloom Filter             | ✓        | ✓        | \(2)    |
+| Bloom Filter             | ✓        | ✓        | \(1)    |
 +--------------------------+----------+----------+---------+
-| CRC checksums            | ✓        | ✓        | \(3)    |
+| CRC checksums            | ✓        | ✓        |         |
 +--------------------------+----------+----------+---------+
 
-* \(1) Access to the Column and Offset Index structures is provided, but
-  data read APIs do not currently make any use of them.

Review Comment:
   Why remove this note? AFAIU it is still valid (we do not expose any high-level filtering feature).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35455:
URL: https://github.com/apache/arrow/pull/35455#issuecomment-1578302665

   > We have internally enabled page index by default. The benefit brought by page index is promising.
   
   Did you measure the benefits (other than file size)?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #35455:
URL: https://github.com/apache/arrow/pull/35455#issuecomment-1578328313

   Yes, it could achieve 2X~4X acceleration of reading column chunk from cloud object store using page offset index (with I/O coalescing) compared to issue single I/O of a full column chunk (which is currently by default).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35455: GH-34375: [C++][Parquet] Ignore page header stats when page index enabled

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35455:
URL: https://github.com/apache/arrow/pull/35455#issuecomment-1578330288

   Nice, I see. What about the column index, are you using it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org