Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/25 16:06:31 UTC

[GitHub] [arrow] bkietz commented on a change in pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

bkietz commented on a change in pull request #7547:
URL: https://github.com/apache/arrow/pull/7547#discussion_r445663612



##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -571,6 +571,8 @@ static inline Result<std::string> FileFromRowGroup(
       }
     }
 
+    // TODO Is it possible to infer the file size and return a populated FileInfo?
+    // This could avoid some spurious HEAD requests on S3 (ARROW-8950)

Review comment:
      If it's not easily accessible from FileMetaData (or similar), then this probably warrants a custom metadata field when writing `_metadata`.
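
      The point of the TODO can be sketched in isolation: if the file size is already known (e.g. recorded when `_metadata` was written), a filesystem can hand back a pre-populated FileInfo and skip the HEAD round trip entirely. The struct and helper below are illustrative stand-ins, not the actual arrow::fs API; `GetSize` and the `head_requests` counter are hypothetical names for this sketch.

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <optional>
      #include <string>

      // Stand-in for a FileInfo whose size may have been populated ahead of
      // time, e.g. from a custom field stored alongside _metadata.
      struct FileInfo {
        std::string path;
        std::optional<int64_t> size;  // known in advance, or std::nullopt
      };

      // Hypothetical helper: returns the file size, issuing a (counted)
      // HEAD-equivalent lookup only when the size was not pre-populated.
      int64_t GetSize(const FileInfo& info, int* head_requests) {
        if (info.size.has_value()) {
          return *info.size;  // no network round trip needed
        }
        ++*head_requests;  // stands in for an S3 HEAD request (ARROW-8950)
        return 0;          // a real implementation would query the store here
      }
      ```

      With a populated size the counter never increments, which is exactly the saving the TODO is after.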

##########
File path: cpp/src/arrow/dataset/discovery.cc
##########
@@ -156,27 +177,26 @@ Result<std::shared_ptr<DatasetFactory>> FileSystemDatasetFactory::Make(
 
   ARROW_ASSIGN_OR_RAISE(auto files, filesystem->GetFileInfo(selector));
 
-  std::vector<std::string> paths;
+  // XXX We may avoid copying around vectors of FileInfo by filtering in place.
+  std::vector<fs::FileInfo> filtered_files;

Review comment:
      ```suggestion
    auto files_end =
        std::remove_if(files.begin(), files.end(), [&](const fs::FileInfo& info) {
          if (!info.IsFile()) {
            return true;
          }
          return StartsWithAnyOf(info.path(), options.selector_ignore_prefixes);
        });
    files.erase(files_end, files.end());
      ```
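
      For reference, the erase-remove idiom in the suggestion can be shown end to end with simplified stand-ins. `FileInfo`, `StartsWithAnyOf`, and `FilterInPlace` below are illustrative names for this sketch only, not the actual arrow::fs or dataset API; the filtering logic mirrors the suggestion above.

      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <string>
      #include <vector>

      // Simplified stand-in for arrow::fs::FileInfo.
      struct FileInfo {
        std::string path;
        bool is_file;
        bool IsFile() const { return is_file; }
      };

      // Stand-in helper: does the path's basename start with any prefix?
      bool StartsWithAnyOf(const std::string& path,
                           const std::vector<std::string>& prefixes) {
        // find_last_of returns npos (wrapping to 0 after +1) when there is
        // no '/', so a bare filename is its own basename.
        auto basename = path.substr(path.find_last_of('/') + 1);
        for (const auto& prefix : prefixes) {
          if (basename.compare(0, prefix.size(), prefix) == 0) return true;
        }
        return false;
      }

      // In-place filtering via the erase-remove idiom: entries are compacted
      // to the front and the tail erased, with no second vector allocated.
      void FilterInPlace(std::vector<FileInfo>& files,
                         const std::vector<std::string>& ignore_prefixes) {
        auto files_end =
            std::remove_if(files.begin(), files.end(), [&](const FileInfo& info) {
              return !info.IsFile() ||
                     StartsWithAnyOf(info.path, ignore_prefixes);
            });
        files.erase(files_end, files.end());
      }
      ```

      This is the copy the XXX comment in the diff wants to avoid: filtering into `filtered_files` allocates and moves a second vector, while `remove_if` + `erase` reuses the storage returned by GetFileInfo.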




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org