Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/24 13:20:42 UTC

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8037: ARROW-9827: [C++][Dataset] Skip parsing RowGroup metadata statistics when there is no filter

jorisvandenbossche commented on a change in pull request #8037:
URL: https://github.com/apache/arrow/pull/8037#discussion_r475594117



##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -357,12 +357,14 @@ Result<ScanTaskIterator> ParquetFileFormat::ScanFile(std::shared_ptr<ScanOptions
                         GetReader(fragment->source(), options.get(), context.get()));
 
   if (!parquet_fragment->HasCompleteMetadata()) {
-    // row groups were not already filtered; do this now
-    RETURN_NOT_OK(parquet_fragment->EnsureCompleteMetadata(reader.get()));
-    ARROW_ASSIGN_OR_RAISE(row_groups,
-                          parquet_fragment->FilterRowGroups(*options->filter));
-    if (row_groups.empty()) {
-      return MakeEmptyIterator<std::shared_ptr<ScanTask>>();
+    // row groups were not already filtered; do this now (if there is a filter)
+    if (!options->filter->Equals(true)) {

Review comment:
       @bkietz ideally we would also skip this if the `filter` only involves fields of the `partition_expression` and no actual columns of the file. What is the best way to check this? (Simplify the filter with the `partition_expression` first?)
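
To make the idea concrete, here is a minimal, self-contained sketch of "simplify first, then check whether anything is left". It is a toy model and does not use Arrow's `Expression` API: `EqualsPredicate`, `Filter`, `PartitionValues`, `SimplifyWithPartition` and `NeedsRowGroupStatistics` are all hypothetical names, a filter is assumed to be a plain conjunction of `field == value` predicates, and it needs C++17 for `std::optional`.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Toy model, NOT Arrow's Expression API: a filter is a conjunction of
// `field == value` predicates (hypothetical types, for illustration only).
struct EqualsPredicate {
  std::string field;
  int value;
};

using Filter = std::vector<EqualsPredicate>;
// Known values of the partition fields for one fragment, e.g. {year: 2020}.
using PartitionValues = std::map<std::string, int>;

// Simplify `filter` using the partition values of a fragment.
// Returns std::nullopt if the filter can never match this fragment,
// otherwise the predicates that still refer to columns stored in the file.
std::optional<Filter> SimplifyWithPartition(const Filter& filter,
                                            const PartitionValues& partition) {
  Filter remaining;
  for (const auto& pred : filter) {
    auto it = partition.find(pred.field);
    if (it == partition.end()) {
      remaining.push_back(pred);  // refers to a file column: keep it
    } else if (it->second != pred.value) {
      return std::nullopt;  // contradicts the partition: nothing can match
    }
    // Otherwise the partition already satisfies the predicate, so it drops out.
  }
  return remaining;
}

// Row group statistics are only needed if a predicate on a file column
// survives the simplification.
bool NeedsRowGroupStatistics(const Filter& filter,
                             const PartitionValues& partition) {
  const auto simplified = SimplifyWithPartition(filter, partition);
  return simplified.has_value() && !simplified->empty();
}
```

In this model a filter on `year` alone simplifies to an empty conjunction for a fragment partitioned by `year`, so the statistics parsing could be skipped just as in the no-filter case.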



