You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/02 01:17:02 UTC

[GitHub] [arrow] westonpace commented on a change in pull request #12625: ARROW-15587: [C++] Add support for all options specified by substrait::ReadRel::LocalFiles::FileOrFiles

westonpace commented on a change in pull request #12625:
URL: https://github.com/apache/arrow/pull/12625#discussion_r840935751



##########
File path: cpp/src/arrow/dataset/discovery.cc
##########
@@ -18,6 +18,7 @@
 #include "arrow/dataset/discovery.h"
 
 #include <algorithm>
+#include <iostream>

Review comment:
       ```suggestion
   ```
   

##########
File path: cpp/src/arrow/filesystem/filesystem.h
##########
@@ -192,6 +193,8 @@ class ARROW_EXPORT FileSystem : public std::enable_shared_from_this<FileSystem>
   /// it exists.
   /// If it doesn't exist, see `FileSelector::allow_not_found`.
   virtual Result<FileInfoVector> GetFileInfo(const FileSelector& select) = 0;
+  // Same, for glob paths.

Review comment:
       The comments here are a little too brief but I created https://issues.apache.org/jira/browse/ARROW-16101 to address this more fully so let's keep it how it is for now.

##########
File path: cpp/src/arrow/dataset/discovery.cc
##########
@@ -18,6 +18,7 @@
 #include "arrow/dataset/discovery.h"
 
 #include <algorithm>
+#include <iostream>

Review comment:
       ```suggestion
   ```
   

##########
File path: cpp/src/arrow/util/io_util.cc
##########
@@ -663,19 +664,26 @@ std::wstring PathWithoutTrailingSlash(const PlatformFilename& fn) {
   return path;
 }
 
-Result<std::vector<WIN32_FIND_DATAW>> ListDirInternal(const PlatformFilename& dir_path) {
+Result<std::vector<WIN32_FIND_DATAW>> ListDirInternal(const PlatformFilename& dir_path,
+                                                      bool is_glob = false) {
   WIN32_FIND_DATAW find_data;
-  std::wstring pattern = PathWithoutTrailingSlash(dir_path) + L"\\*.*";
+  std::wstring pattern;
+  if (is_glob)
+    pattern = dir_path.ToNative();
+  else
+    pattern = PathWithoutTrailingSlash(dir_path) + L"\\*.*";
+
+  std::vector<WIN32_FIND_DATAW> results;
   HANDLE handle = FindFirstFileW(pattern.c_str(), &find_data);
   if (handle == INVALID_HANDLE_VALUE) {
+    if (is_glob) return results;

Review comment:
       Why is this ok?  Can you add a comment explaining what is going on here?

##########
File path: cpp/src/arrow/util/io_util.cc
##########
@@ -728,6 +736,34 @@ Result<std::vector<PlatformFilename>> ListDir(const PlatformFilename& dir_path)
   return results;
 }
 
+std::vector<NativePathString> splitStr(const NativePathString& str, wchar_t delimiter) {
+  std::vector<NativePathString> result_vec;

Review comment:
       Minor nit: We try to avoid return variables named "result" and also try to avoid appending type-suffixes like `_vec`.  Maybe just `std::vector<NativePathString> segments;`?

##########
File path: cpp/src/arrow/util/io_util.cc
##########
@@ -728,6 +736,34 @@ Result<std::vector<PlatformFilename>> ListDir(const PlatformFilename& dir_path)
   return results;
 }
 
+std::vector<NativePathString> splitStr(const NativePathString& str, wchar_t delimiter) {

Review comment:
       ```suggestion
   std::vector<NativePathString> SplitStr(const NativePathString& str, wchar_t delimiter) {
   ```
   

##########
File path: cpp/src/arrow/filesystem/filesystem.cc
##########
@@ -348,6 +350,10 @@ Result<std::vector<FileInfo>> SubTreeFileSystem::GetFileInfo(const FileSelector&
   return infos;
 }
 
+Result<std::vector<FileInfo>> SubTreeFileSystem::GetFileInfoGlob(const FileInfo& file) {
+  return Status::NotImplemented("Glob file with SubTreeFileSystem");

Review comment:
       Can we just prepend the glob pattern with `base_dir`?

##########
File path: cpp/src/arrow/filesystem/filesystem.cc
##########
@@ -348,6 +350,10 @@ Result<std::vector<FileInfo>> SubTreeFileSystem::GetFileInfo(const FileSelector&
   return infos;
 }
 
+Result<std::vector<FileInfo>> SubTreeFileSystem::GetFileInfoGlob(const FileInfo& file) {
+  return Status::NotImplemented("Glob file with SubTreeFileSystem");

Review comment:
       Can we just prepend the glob pattern with `base_dir`?

##########
File path: cpp/src/arrow/filesystem/filesystem.h
##########
@@ -192,6 +193,8 @@ class ARROW_EXPORT FileSystem : public std::enable_shared_from_this<FileSystem>
   /// it exists.
   /// If it doesn't exist, see `FileSelector::allow_not_found`.
   virtual Result<FileInfoVector> GetFileInfo(const FileSelector& select) = 0;
+  // Same, for glob paths.

Review comment:
       The comments here are a little too brief but I created https://issues.apache.org/jira/browse/ARROW-16101 to address this more fully so let's keep it how it is for now.

##########
File path: cpp/src/arrow/util/io_util.cc
##########
@@ -728,6 +736,34 @@ Result<std::vector<PlatformFilename>> ListDir(const PlatformFilename& dir_path)
   return results;
 }
 
+std::vector<NativePathString> splitStr(const NativePathString& str, wchar_t delimiter) {
+  std::vector<NativePathString> result_vec;

Review comment:
       Minor nit: We try to avoid return variables named "result" and also try to avoid appending type-suffixes like `_vec`.  Maybe just `std::vector<NativePathString> segments;`?

##########
File path: cpp/src/arrow/util/io_util.cc
##########
@@ -759,6 +795,24 @@ Result<std::vector<PlatformFilename>> ListDir(const PlatformFilename& dir_path)
   return results;
 }
 
+Result<std::vector<PlatformFilename>> ListGlob(const PlatformFilename& glob_path) {
+  glob_t glob_result{};
+  errno = glob(glob_path.ToNative().c_str(), GLOB_TILDE, NULL, &glob_result);
+
+  if (errno != 0 && errno != GLOB_NOMATCH) {
+    return IOErrorFromErrno(errno, "Cannot list glob '", glob_path.ToString(), "'");
+  }
+
+  std::vector<PlatformFilename> results;
+  for (size_t i = 0; i < glob_result.gl_pathc; ++i) {
+    results.emplace_back(std::string(glob_result.gl_pathv[i]));

Review comment:
       ```suggestion
       results.emplace_back(glob_result.gl_pathv[i]);
   ```
   

##########
File path: cpp/src/arrow/util/io_util.cc
##########
@@ -759,6 +795,24 @@ Result<std::vector<PlatformFilename>> ListDir(const PlatformFilename& dir_path)
   return results;
 }
 
+Result<std::vector<PlatformFilename>> ListGlob(const PlatformFilename& glob_path) {
+  glob_t glob_result{};
+  errno = glob(glob_path.ToNative().c_str(), GLOB_TILDE, NULL, &glob_result);
+
+  if (errno != 0 && errno != GLOB_NOMATCH) {
+    return IOErrorFromErrno(errno, "Cannot list glob '", glob_path.ToString(), "'");
+  }
+
+  std::vector<PlatformFilename> results;
+  for (size_t i = 0; i < glob_result.gl_pathc; ++i) {
+    results.emplace_back(std::string(glob_result.gl_pathv[i]));

Review comment:
       ```suggestion
       results.emplace_back(glob_result.gl_pathv[i]);
   ```
   

##########
File path: cpp/src/arrow/dataset/discovery.cc
##########
@@ -134,8 +135,26 @@ Result<std::shared_ptr<DatasetFactory>> FileSystemDatasetFactory::Make(
 Result<std::shared_ptr<DatasetFactory>> FileSystemDatasetFactory::Make(
     std::shared_ptr<fs::FileSystem> filesystem, const std::vector<fs::FileInfo>& files,
     std::shared_ptr<FileFormat> format, FileSystemFactoryOptions options) {
+  // Discover files in directories and globs
+  std::vector<fs::FileInfo> discovered_files;
+  for (const auto& file : files) {
+    if (file.IsDirectory()) {
+      fs::FileSelector file_selector;
+      file_selector.base_dir = file.dir_name();
+      file_selector.recursive = true;
+      ARROW_ASSIGN_OR_RAISE(auto folder_files, filesystem->GetFileInfo(file_selector));
+      std::move(folder_files.begin(), folder_files.end(),
+                std::back_inserter(discovered_files));
+    } else if (file.IsGlob()) {
+      ARROW_ASSIGN_OR_RAISE(auto files, filesystem->GetFileInfoGlob(file));
+      std::move(files.begin(), files.end(), std::back_inserter(discovered_files));
+    } else if (file.IsFile()) {
+      discovered_files.emplace_back(file);
+    }
+  }
+

Review comment:
       So this is something of an abstract concern but I think we have a bit of ambiguity between input & output types.
   
   I see `FileInfo` as an "output type".  It is something you get back from a filesystem that describes something that it went out and discovered or inspected.
   
   In the meantime, PlatformFilename and FileSelector are "input" types.  They are provided by the user for the purpose of obtaining FileInfo objects.
   
   As a result I'm not sure it makes sense to have `FileType::Glob` because I would interpret that as the filesystem discovering a glob inside the filesystem.  For example `GetFileInfo` would never return `FileType::Glob`.
   
   Instead, can we do something slightly different:
   
   Instead of modifying the `Make` overload that accepts `const std::vector<fs::FileInfo>&` can we add a new overload that takes `const std::vector<fs::FileSelector>&`?
   
   Then, instead of adding `FileType::Glob` can we add an `is_glob` property to `FileSelector` (if this is true then `base_dir` is expected to be a glob pattern and `recursive` will be ignored).
   
   We could also then change the rules a little on `FileSelector` so that `base_dir` is allowed to represent a single file.  Then we can treat the "path" case (user doesn't specify it as a file or a directory) as a selector also.  That would allow us to move all the `GetFileInfo` calls out of the Substrait consumer.
   
   CC @pitrou for a second opinion as he has a lot of experience here.

##########
File path: cpp/src/arrow/util/io_util.cc
##########
@@ -728,6 +736,34 @@ Result<std::vector<PlatformFilename>> ListDir(const PlatformFilename& dir_path)
   return results;
 }
 
+std::vector<NativePathString> splitStr(const NativePathString& str, wchar_t delimiter) {

Review comment:
       ```suggestion
   std::vector<NativePathString> SplitStr(const NativePathString& str, wchar_t delimiter) {
   ```
   

##########
File path: cpp/src/arrow/util/io_util.cc
##########
@@ -663,19 +664,26 @@ std::wstring PathWithoutTrailingSlash(const PlatformFilename& fn) {
   return path;
 }
 
-Result<std::vector<WIN32_FIND_DATAW>> ListDirInternal(const PlatformFilename& dir_path) {
+Result<std::vector<WIN32_FIND_DATAW>> ListDirInternal(const PlatformFilename& dir_path,
+                                                      bool is_glob = false) {
   WIN32_FIND_DATAW find_data;
-  std::wstring pattern = PathWithoutTrailingSlash(dir_path) + L"\\*.*";
+  std::wstring pattern;
+  if (is_glob)
+    pattern = dir_path.ToNative();
+  else
+    pattern = PathWithoutTrailingSlash(dir_path) + L"\\*.*";
+
+  std::vector<WIN32_FIND_DATAW> results;
   HANDLE handle = FindFirstFileW(pattern.c_str(), &find_data);
   if (handle == INVALID_HANDLE_VALUE) {
+    if (is_glob) return results;

Review comment:
       Why is this ok?  Can you add a comment explaining what is going on here?

##########
File path: cpp/src/arrow/dataset/discovery.cc
##########
@@ -134,8 +135,26 @@ Result<std::shared_ptr<DatasetFactory>> FileSystemDatasetFactory::Make(
 Result<std::shared_ptr<DatasetFactory>> FileSystemDatasetFactory::Make(
     std::shared_ptr<fs::FileSystem> filesystem, const std::vector<fs::FileInfo>& files,
     std::shared_ptr<FileFormat> format, FileSystemFactoryOptions options) {
+  // Discover files in directories and globs
+  std::vector<fs::FileInfo> discovered_files;
+  for (const auto& file : files) {
+    if (file.IsDirectory()) {
+      fs::FileSelector file_selector;
+      file_selector.base_dir = file.dir_name();
+      file_selector.recursive = true;
+      ARROW_ASSIGN_OR_RAISE(auto folder_files, filesystem->GetFileInfo(file_selector));
+      std::move(folder_files.begin(), folder_files.end(),
+                std::back_inserter(discovered_files));
+    } else if (file.IsGlob()) {
+      ARROW_ASSIGN_OR_RAISE(auto files, filesystem->GetFileInfoGlob(file));
+      std::move(files.begin(), files.end(), std::back_inserter(discovered_files));
+    } else if (file.IsFile()) {
+      discovered_files.emplace_back(file);
+    }
+  }
+

Review comment:
       So this is something of an abstract concern but I think we have a bit of ambiguity between input & output types.
   
   I see `FileInfo` as an "output type".  It is something you get back from a filesystem that describes something that it went out and discovered or inspected.
   
   In the meantime, PlatformFilename and FileSelector are "input" types.  They are provided by the user for the purpose of obtaining FileInfo objects.
   
   As a result I'm not sure it makes sense to have `FileType::Glob` because I would interpret that as the filesystem discovering a glob inside the filesystem.  For example `GetFileInfo` would never return `FileType::Glob`.
   
   Instead, can we do something slightly different:
   
   Instead of modifying the `Make` overload that accepts `const std::vector<fs::FileInfo>&` can we add a new overload that takes `const std::vector<fs::FileSelector>&`?
   
   Then, instead of adding `FileType::Glob` can we add an `is_glob` property to `FileSelector` (if this is true then `base_dir` is expected to be a glob pattern and `recursive` will be ignored).
   
   We could also then change the rules a little on `FileSelector` so that `base_dir` is allowed to represent a single file.  Then we can treat the "path" case (user doesn't specify it as a file or a directory) as a selector also.  That would allow us to move all the `GetFileInfo` calls out of the Substrait consumer.
   
   CC @pitrou for a second opinion as he has a lot of experience here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org