You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/09 01:45:18 UTC

[GitHub] [arrow] wgtmac commented on a diff in pull request #14603: PARQUET-2210: [C++][Parquet] Skip pages based on header metadata using a callback

wgtmac commented on code in PR #14603:
URL: https://github.com/apache/arrow/pull/14603#discussion_r1017306358


##########
cpp/src/parquet/column_reader.h:
##########
@@ -28,6 +28,7 @@
 #include "parquet/properties.h"
 #include "parquet/schema.h"
 #include "parquet/types.h"
+#include "parquet/thrift_internal.h"  // IWYU pragma: keep

Review Comment:
   This might fall the CI check due to exposing thrift symbols.



##########
cpp/src/parquet/column_reader.cc:
##########
@@ -253,6 +253,13 @@ class SerializedPageReader : public PageReader {
 
   void set_max_page_header_size(uint32_t size) override { max_page_header_size_ = size; }
 
+  bool set_skip_page_callback(
+      std::function<bool(const format::PageHeader&)> skip_page_callback) override {
+    skip_page_callback_ = skip_page_callback;
+    has_skip_page_callback_ = true;

Review Comment:
   We can avoid the has_skip_page_callback_ flag by simply checking `if (skip_page_callback_)`
   
   https://en.cppreference.com/w/cpp/utility/functional/function/operator_bool



##########
cpp/src/parquet/column_reader.cc:
##########
@@ -386,6 +397,16 @@ std::shared_ptr<Page> SerializedPageReader::NextPage() {
       throw ParquetException("Invalid page header");
     }
 
+   // Once we have the header, we will call the skip_page_call_back_ to
+   // determine if we should be skipping this page. If yes, we will advance the
+   // stream to the next page.
+   if(has_skip_page_callback_) {

Review Comment:
   What we have done is to let the PageReader be aware of **Offset Index** belong to the pages of the RowGroup. In this way, it can be smart enough to move among pages and handle the skip logic. The prerequisite is https://issues.apache.org/jira/browse/ARROW-10158. If that sounds good, I can pick up ARROW-10158 to contribute our implementation. WDYT? @fatemehp @emkornfield @pitrou 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org