You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/15 01:49:34 UTC

[GitHub] [arrow] westonpace commented on a change in pull request #12625: ARROW-15587: [C++] Add support for all options specified by substrait::ReadRel::LocalFiles::FileOrFiles

westonpace commented on a change in pull request #12625:
URL: https://github.com/apache/arrow/pull/12625#discussion_r826506842



##########
File path: cpp/src/arrow/dataset/discovery.cc
##########
@@ -134,8 +135,30 @@ Result<std::shared_ptr<DatasetFactory>> FileSystemDatasetFactory::Make(
 Result<std::shared_ptr<DatasetFactory>> FileSystemDatasetFactory::Make(
     std::shared_ptr<fs::FileSystem> filesystem, const std::vector<fs::FileInfo>& files,
     std::shared_ptr<FileFormat> format, FileSystemFactoryOptions options) {
+  // Discover files in directories and globs
+  std::vector<fs::FileInfo> discovered_files;
+  for (const auto& file : files) {
+    if (file.IsDirectory()) {
+      fs::FileSelector file_selector;
+      file_selector.base_dir = file.dir_name();
+      file_selector.recursive = true;
+      ARROW_ASSIGN_OR_RAISE(auto folder_files,
+                            filesystem->GetFileInfo(file_selector));
+      std::move(folder_files.begin(), folder_files.end(), std::back_inserter(discovered_files));
+    } else if (file.IsGlob()) {
+      glob_t glob_result;
+      memset(&glob_result, 0, sizeof(glob_result)); 
+      glob(file.path().c_str(), GLOB_TILDE, NULL, &glob_result);

Review comment:
       What happens if the files are on a remote filesystem like S3?

##########
File path: cpp/src/arrow/dataset/discovery.cc
##########
@@ -134,8 +135,30 @@ Result<std::shared_ptr<DatasetFactory>> FileSystemDatasetFactory::Make(
 Result<std::shared_ptr<DatasetFactory>> FileSystemDatasetFactory::Make(
     std::shared_ptr<fs::FileSystem> filesystem, const std::vector<fs::FileInfo>& files,
     std::shared_ptr<FileFormat> format, FileSystemFactoryOptions options) {
+  // Discover files in directories and globs
+  std::vector<fs::FileInfo> discovered_files;
+  for (const auto& file : files) {
+    if (file.IsDirectory()) {
+      fs::FileSelector file_selector;
+      file_selector.base_dir = file.dir_name();
+      file_selector.recursive = true;
+      ARROW_ASSIGN_OR_RAISE(auto folder_files,
+                            filesystem->GetFileInfo(file_selector));
+      std::move(folder_files.begin(), folder_files.end(), std::back_inserter(discovered_files));
+    } else if (file.IsGlob()) {
+      glob_t glob_result;
+      memset(&glob_result, 0, sizeof(glob_result)); 

Review comment:
       ```suggestion
         glob_t glob_result{};
   ```
   Minor nit: As of C++11 you can use `{}` which forces value initialization (e.g. zero-filled) instead of default initialization.

##########
File path: cpp/src/arrow/engine/substrait/relation_internal.cc
##########
@@ -103,13 +104,18 @@ Result<compute::Declaration> FromProto(const substrait::Rel& rel,
               "substrait::ReadRel::LocalFiles::FileOrFiles::format "
               "other than FILE_FORMAT_PARQUET");
         }
-
-        if (!util::string_view{item.uri_file()}.starts_with("file:///")) {
+        if (!util::string_view{item.uri_file()}.starts_with("file:///") &&
+            !util::string_view{item.uri_folder()}.starts_with("file:///") &&
+            !util::string_view{item.uri_path_glob()}.starts_with("file:///")) {
           return Status::NotImplemented(
               "substrait::ReadRel::LocalFiles::FileOrFiles::uri_file "
               "with other than local filesystem (file:///)");
         }
-        auto path = item.uri_file().substr(7);
+
+        std::string path;
+        if(item.has_uri_file()) path = item.uri_file().substr(7);

Review comment:
       Instead of doing two if/else chains on `file/folder/glob` could you combine this one with the one at line 135?

##########
File path: cpp/src/arrow/dataset/discovery.cc
##########
@@ -23,6 +23,7 @@
 #include <unordered_set>
 #include <utility>
 #include <vector>
+#include <glob.h>

Review comment:
       I'm not sure this include will work on Windows.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org