You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/27 07:52:50 UTC

[GitHub] [arrow] wgtmac opened a new pull request, #14742: ARROW-10158: [C++][Parquet] Expose page index info from ColumnChunkMetaData

wgtmac opened a new pull request, #14742:
URL: https://github.com/apache/arrow/pull/14742

   This is the first step to support page index of parquet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #14742: ARROW-18413: [C++][Parquet] Expose page index info from ColumnChunkMetaData

Posted by GitBox <gi...@apache.org>.
wgtmac commented on code in PR #14742:
URL: https://github.com/apache/arrow/pull/14742#discussion_r1035455360


##########
cpp/src/parquet/metadata.h:
##########
@@ -171,6 +171,13 @@ class PARQUET_EXPORT ColumnChunkMetaData {
   int64_t total_uncompressed_size() const;
   std::unique_ptr<ColumnCryptoMetaData> crypto_metadata() const;
 
+  bool has_column_index() const;
+  int64_t column_index_offset() const;

Review Comment:
   SGTM, I will change it shortly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #14742: ARROW-18413: [C++][Parquet] Expose page index info from ColumnChunkMetaData

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #14742:
URL: https://github.com/apache/arrow/pull/14742#issuecomment-1333776743

   Benchmark runs are scheduled for baseline = 6f4a539323da0d6aef528ccce2404e36f0a08585 and contender = 958fbfa5fe567a908f5cbcc09dfc54a00e480be9. 958fbfa5fe567a908f5cbcc09dfc54a00e480be9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/a41df08b6c464e7bacefc2427ad38c24...297f7bdc7f944cd08bdcf27ed475a965/)
   [Failed :arrow_down:0.03% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/4804e7d0e31f45f38ae585c31e7e5537...b69a7568a710452996484b1f1e083829/)
   [Finished :arrow_down:0.82% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/e2093864c4f3447cb1bd8a1f4e32a095...496abd4c1acc427da8e3fa043457dc51/)
   [Finished :arrow_down:0.35% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/536ca9a101b542808e1f11f36c0d952c...7d41c6746b8840fea3028f4ecc002b65/)
   Buildkite builds:
   [Finished] [`958fbfa5` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1954)
   [Failed] [`958fbfa5` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1976)
   [Finished] [`958fbfa5` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1946)
   [Finished] [`958fbfa5` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1968)
   [Finished] [`6f4a5393` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1953)
   [Finished] [`6f4a5393` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1975)
   [Finished] [`6f4a5393` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1945)
   [Finished] [`6f4a5393` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1967)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #14742: ARROW-10158: [C++][Parquet] Expose page index info from ColumnChunkMetaData

Posted by GitBox <gi...@apache.org>.
wgtmac commented on PR #14742:
URL: https://github.com/apache/arrow/pull/14742#issuecomment-1328204285

   @pitrou @emkornfield Can you please take a look? As I have mentioned in other thread, I will work on the page index and break it down into small commits to make it review-friendly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #14742: ARROW-18413: [C++][Parquet] Expose page index info from ColumnChunkMetaData

Posted by GitBox <gi...@apache.org>.
wgtmac commented on code in PR #14742:
URL: https://github.com/apache/arrow/pull/14742#discussion_r1036099527


##########
cpp/src/parquet/metadata.cc:
##########
@@ -312,6 +312,20 @@ class ColumnChunkMetaData::ColumnChunkMetaDataImpl {
     }
   }
 
+  inline std::optional<IndexLocation> GetColumIndexLocation() const {

Review Comment:
   Fixed



##########
cpp/src/parquet/metadata.h:
##########
@@ -118,6 +119,12 @@ struct PageEncodingStats {
   int32_t count;
 };
 
+/// \brief Public struct for location to page index in ColumnChunkMetaData.
+struct IndexLocation {
+  int64_t index_file_offset_bytes;
+  int32_t offset_index_length;

Review Comment:
   Fixed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #14742: ARROW-18413: [C++][Parquet] Expose page index info from ColumnChunkMetaData

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #14742:
URL: https://github.com/apache/arrow/pull/14742#issuecomment-1333580310

   Test failures are unrelated, will merge.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] emkornfield commented on a diff in pull request #14742: ARROW-18413: [C++][Parquet] Expose page index info from ColumnChunkMetaData

Posted by GitBox <gi...@apache.org>.
emkornfield commented on code in PR #14742:
URL: https://github.com/apache/arrow/pull/14742#discussion_r1035188213


##########
cpp/src/parquet/metadata.h:
##########
@@ -171,6 +171,13 @@ class PARQUET_EXPORT ColumnChunkMetaData {
   int64_t total_uncompressed_size() const;
   std::unique_ptr<ColumnCryptoMetaData> crypto_metadata() const;
 
+  bool has_column_index() const;
+  int64_t column_index_offset() const;

Review Comment:
   I think it might be a clearer contract to have something like:
   
   ```
   struct IndexLocation {
      int64_t index_file_offset_bytes;
      int32_t offset_index_length
   }
   
   optional<IndexLocation> GetColumIndexLocation();
   optional<IndexLocation> GetOffsetIndexLocation();
   ```
   
   Thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #14742: ARROW-10158: [C++][Parquet] Expose page index info from ColumnChunkMetaData

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #14742:
URL: https://github.com/apache/arrow/pull/14742#issuecomment-1328190383

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #14742: ARROW-18413: [C++][Parquet] Expose page index info from ColumnChunkMetaData

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #14742:
URL: https://github.com/apache/arrow/pull/14742#discussion_r1035769461


##########
cpp/src/parquet/metadata.h:
##########
@@ -118,6 +119,12 @@ struct PageEncodingStats {
   int32_t count;
 };
 
+/// \brief Public struct for location to page index in ColumnChunkMetaData.
+struct IndexLocation {
+  int64_t index_file_offset_bytes;
+  int32_t offset_index_length;

Review Comment:
   I'm not sure why the complex naming, we could just have
   ```suggestion
     /// File offset of the given index, in bytes
     int64_t offset;
     /// Length of the given index, in bytes
     int32_t length;
   ```



##########
cpp/src/parquet/metadata.cc:
##########
@@ -312,6 +312,20 @@ class ColumnChunkMetaData::ColumnChunkMetaDataImpl {
     }
   }
 
+  inline std::optional<IndexLocation> GetColumIndexLocation() const {

Review Comment:
   `inline` is implied if the method is defined inside the `class` body, so it's superfluous here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #14742: ARROW-18413: [C++][Parquet] Expose page index info from ColumnChunkMetaData

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #14742:
URL: https://github.com/apache/arrow/pull/14742#issuecomment-1329964208

   https://issues.apache.org/jira/browse/ARROW-18413


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #14742: ARROW-18413: [C++][Parquet] Expose page index info from ColumnChunkMetaData

Posted by GitBox <gi...@apache.org>.
wgtmac commented on PR #14742:
URL: https://github.com/apache/arrow/pull/14742#issuecomment-1332328296

   > Thank you. Is it possible to add basic tests for this?
   
   Thank you for the review. @pitrou  I have added a simple test and addressed your comment. Please take a look again when you get the chance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou merged pull request #14742: ARROW-18413: [C++][Parquet] Expose page index info from ColumnChunkMetaData

Posted by GitBox <gi...@apache.org>.
pitrou merged PR #14742:
URL: https://github.com/apache/arrow/pull/14742


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #14742: ARROW-18413: [C++][Parquet] Expose page index info from ColumnChunkMetaData

Posted by GitBox <gi...@apache.org>.
wgtmac commented on PR #14742:
URL: https://github.com/apache/arrow/pull/14742#issuecomment-1331634459

   I have addressed your comment, and the unsuccessful CI checks are unrelated to my change. Can you please take a look again? @emkornfield  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #14742: ARROW-10158: [C++][Parquet] Expose page index info from ColumnChunkMetaData

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #14742:
URL: https://github.com/apache/arrow/pull/14742#issuecomment-1328190381

   https://issues.apache.org/jira/browse/ARROW-10158


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org