Posted to github@arrow.apache.org by "wgtmac (via GitHub)" <gi...@apache.org> on 2023/04/19 06:46:09 UTC

[GitHub] [arrow] wgtmac opened a new pull request, #35230: GH-34949: [C++][Parquet] Enable page index by columns

wgtmac opened a new pull request, #35230:
URL: https://github.com/apache/arrow/pull/35230

   ### Rationale for this change
   
   Currently the Parquet writer only supports enabling the page index for all columns at once. It would be good to enable/disable it at the column level, because the index may not be useful for some columns yet there is still a cost to create it.
   
   ### What changes are included in this PR?
   
   Similar to `WriterProperties::Builder::enable_dictionary/disable_dictionary`, this patch adds per-column overloads of `WriterProperties::Builder::enable_write_page_index/disable_write_page_index` and keeps them backward compatible, so the existing calls still enable/disable the page index for all columns.
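
To make the intended semantics concrete, here is a minimal, self-contained sketch of a "default plus per-column override" setting. It is a simplified model, not Arrow's implementation: the class name `PageIndexSettings` and the lookup method `page_index_enabled` are hypothetical, and the sketch assumes that a per-column setting takes precedence over the all-columns default.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the builder described above; NOT Arrow's code.
class PageIndexSettings {
 public:
  // No-argument forms change the default that applies to every column.
  PageIndexSettings& enable_write_page_index() {
    default_enabled_ = true;
    return *this;
  }
  PageIndexSettings& disable_write_page_index() {
    default_enabled_ = false;
    return *this;
  }

  // Per-column forms record an override for one column path.
  PageIndexSettings& enable_write_page_index(const std::string& path) {
    overrides_[path] = true;
    return *this;
  }
  PageIndexSettings& disable_write_page_index(const std::string& path) {
    overrides_[path] = false;
    return *this;
  }

  // Assumption: a per-column override wins; otherwise the default applies.
  bool page_index_enabled(const std::string& path) const {
    auto it = overrides_.find(path);
    return it != overrides_.end() ? it->second : default_enabled_;
  }

 private:
  bool default_enabled_ = false;
  std::unordered_map<std::string, bool> overrides_;
};
```

With this model, `enable_write_page_index().disable_write_page_index("payload")` leaves the index on for every column except the hypothetical `payload` column, which is the kind of selective setting the patch aims to allow.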
   
   ### Are these changes tested?
   
   Added `ParquetPageIndexRoundTripTest::EnablePerColumn` to cover the new settings.
   
   ### Are there any user-facing changes?
   
   Yes, users now have more flexibility: the page index can be enabled/disabled per column.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #35230: GH-34949: [C++][Parquet] Enable page index by columns

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35230:
URL: https://github.com/apache/arrow/pull/35230#issuecomment-1518373481

   Benchmark runs are scheduled for baseline = f01853d775d484d644f9cb06f5bf06d32b6cc7b3 and contender = d2c4c212185037adc8ae5eebef7703d1b41f2383. d2c4c212185037adc8ae5eebef7703d1b41f2383 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/541e7deaa93a41f9b9d30ebd5ee97f59...4a62009b392746ccbe4a1777754ecfc2/)
   [Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/88a36bd03b2248c9bddf053134a467bc...d1c68b3cdae442f48c32466eede1be45/)
   [Finished :arrow_down:5.36% :arrow_up:0.26%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/7bfba98e22c34f08a25ff5251be2f274...a0986a1e24ec4f3eb00faf36c57f3e7e/)
   [Finished :arrow_down:0.45% :arrow_up:0.03%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/6e9d04e80ef14231bc710c62621e1dd2...598b44a1a3c94c63a4b17330c82c899e/)
   Buildkite builds:
   [Finished] [`d2c4c212` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2756)
   [Failed] [`d2c4c212` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2790)
   [Finished] [`d2c4c212` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2754)
   [Finished] [`d2c4c212` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2781)
   [Finished] [`f01853d7` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2755)
   [Failed] [`f01853d7` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2789)
   [Finished] [`f01853d7` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2753)
   [Finished] [`f01853d7` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2780)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow] wgtmac commented on pull request #35230: GH-34949: [C++][Parquet] Enable page index by columns

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #35230:
URL: https://github.com/apache/arrow/pull/35230#issuecomment-1517441922

   Actually we can achieve it by disabling statistics on all columns. Without column statistics, ColumnIndexes are dropped automatically.
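
The workaround in the comment above can be modeled with a small decision function. This is only a sketch of the behavior as described in this thread, not code from Arrow: the struct and function names are hypothetical, and it assumes the OffsetIndex is still written whenever the page index is enabled.

```cpp
// Hypothetical per-column options; names are illustrative only.
struct ColumnWriteOptions {
  bool page_index_enabled;
  bool statistics_enabled;
};

// Which page-index structures end up being written for the column.
struct IndexesWritten {
  bool column_index;
  bool offset_index;
};

// Assumption from the comment above: without statistics the ColumnIndex
// is dropped, while the OffsetIndex only depends on the page index flag.
inline IndexesWritten indexes_written(const ColumnWriteOptions& opts) {
  return IndexesWritten{
      opts.page_index_enabled && opts.statistics_enabled,  // column_index
      opts.page_index_enabled,                             // offset_index
  };
}
```

Under these assumptions, a column with the page index enabled and statistics disabled gets an OffsetIndex only.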




[GitHub] [arrow] github-actions[bot] commented on pull request #35230: GH-34949: [C++][Parquet] Enable page index by columns

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35230:
URL: https://github.com/apache/arrow/pull/35230#issuecomment-1514211874

   * Closes: #34949




[GitHub] [arrow] wgtmac commented on pull request #35230: GH-34949: [C++][Parquet] Enable page index by columns

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #35230:
URL: https://github.com/apache/arrow/pull/35230#issuecomment-1517320248

   > By the way, can we just disable the column index?
   > 
   > I have a case where I don't want to collect the Column Index, because statistics are not important to me (I'm sure I'll read the whole file); however, the offset index can still be used.
   
   I have thought about this. However, it would be complex if we want to control ColumnIndex and OffsetIndex separately for individual columns. What about splitting `WriterProperties::Builder::enable_write_page_index/disable_write_page_index` into `WriterProperties::Builder::enable_write_column_index/disable_write_column_index` and `WriterProperties::Builder::enable_write_offset_index/disable_write_offset_index`?




[GitHub] [arrow] mapleFU commented on pull request #35230: GH-34949: [C++][Parquet] Enable page index by columns

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35230:
URL: https://github.com/apache/arrow/pull/35230#issuecomment-1517463497

   Yes, but it seems a bit tricky here :)
   Maybe we can document it?




[GitHub] [arrow] mapleFU commented on pull request #35230: GH-34949: [C++][Parquet] Enable page index by columns

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35230:
URL: https://github.com/apache/arrow/pull/35230#issuecomment-1517275391

   By the way, can we just disable the column index?
   
   I have a case where I don't want to collect the Column Index, because statistics are not important to me (I'm sure I'll read the whole file); however, the offset index can still be used.




[GitHub] [arrow] ursabot commented on pull request #35230: GH-34949: [C++][Parquet] Enable page index by columns

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35230:
URL: https://github.com/apache/arrow/pull/35230#issuecomment-1518373938

   ['Python', 'R'] benchmarks have high level of regressions.
   [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/7bfba98e22c34f08a25ff5251be2f274...a0986a1e24ec4f3eb00faf36c57f3e7e/)
   




[GitHub] [arrow] wjones127 merged pull request #35230: GH-34949: [C++][Parquet] Enable page index by columns

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 merged PR #35230:
URL: https://github.com/apache/arrow/pull/35230




[GitHub] [arrow] mapleFU commented on pull request #35230: GH-34949: [C++][Parquet] Enable page index by columns

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35230:
URL: https://github.com/apache/arrow/pull/35230#issuecomment-1517330044

   Well, I think having just the Offset Index can optimize IO, but I don't know what we could do when we only have the Column Index.

