You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/28 18:41:30 UTC

[GitHub] [arrow] bkietz commented on a change in pull request #8037: ARROW-9827: [C++][Dataset] Skip parsing RowGroup metadata statistics when there is no filter

bkietz commented on a change in pull request #8037:
URL: https://github.com/apache/arrow/pull/8037#discussion_r479471053



##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -357,12 +357,14 @@ Result<ScanTaskIterator> ParquetFileFormat::ScanFile(std::shared_ptr<ScanOptions
                         GetReader(fragment->source(), options.get(), context.get()));
 
   if (!parquet_fragment->HasCompleteMetadata()) {
-    // row groups were not already filtered; do this now
-    RETURN_NOT_OK(parquet_fragment->EnsureCompleteMetadata(reader.get()));
-    ARROW_ASSIGN_OR_RAISE(row_groups,
-                          parquet_fragment->FilterRowGroups(*options->filter));
-    if (row_groups.empty()) {
-      return MakeEmptyIterator<std::shared_ptr<ScanTask>>();
+    // row groups were not already filtered; do this now (if there is a filter)
+    if (!options->filter->Equals(true)) {

Review comment:
       Simplifying the filter by the partition expression *would* work, but it's not safe without loading the physical schema since we need to validate the filter against that. Otherwise a filter without implicit casts (for example) would result in an assertion failure. Let's leave that optimization for a follow up; the right way to do this is probably to rewrite `Expression::Assume` to return a Result (allowing it to fail non-catastrophically even without a schema against which to validate), but that's a much larger project

##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -357,12 +357,23 @@ Result<ScanTaskIterator> ParquetFileFormat::ScanFile(std::shared_ptr<ScanOptions
                         GetReader(fragment->source(), options.get(), context.get()));
 
   if (!parquet_fragment->HasCompleteMetadata()) {
-    // row groups were not already filtered; do this now
-    RETURN_NOT_OK(parquet_fragment->EnsureCompleteMetadata(reader.get()));
-    ARROW_ASSIGN_OR_RAISE(row_groups,
-                          parquet_fragment->FilterRowGroups(*options->filter));
-    if (row_groups.empty()) {
-      return MakeEmptyIterator<std::shared_ptr<ScanTask>>();
+    // row groups were not already filtered; do this now (if there is a filter)
+    if (!options->filter->Equals(true)) {
+      RETURN_NOT_OK(parquet_fragment->EnsureCompleteMetadata(reader.get()));
+      ARROW_ASSIGN_OR_RAISE(row_groups,
+                            parquet_fragment->FilterRowGroups(*options->filter));
+      if (row_groups.empty()) {
+        return MakeEmptyIterator<std::shared_ptr<ScanTask>>();
+      }
+    } else {

Review comment:
       ```suggestion
       } else {
         // since we are not scanning this fragment with a filter, don't bother loading statistics
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org