Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/27 20:56:45 UTC

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8069: ARROW-9867: [C++][Dataset] Add FileSystemDataset::filesystem property

jorisvandenbossche commented on a change in pull request #8069:
URL: https://github.com/apache/arrow/pull/8069#discussion_r478689723



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -539,7 +549,7 @@ cdef class FileSystemDataset(Dataset):
         ]:
             if not isinstance(arg, class_):
                 raise TypeError(
-                    "Argument '{0}' has incorrect type (expected {1}, "
+                    "Argument '{0}' wtf has incorrect type (expected {1}, "

Review comment:
       I assume this change was not intended? ;-)

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -467,12 +467,13 @@ cdef class FileSystemDataset(Dataset):
     cdef:
         CFileSystemDataset* filesystem_dataset
 
-    def __init__(self, fragments, Schema schema, FileFormat format,
+    def __init__(self, filesystem, fragments, Schema schema, FileFormat format,

Review comment:
       Can you add this to the docstring?
   I would also move this keyword after `format`; I think it is most logical for `fragments` to come first.
   
   If we want to keep this backwards compatible, we could also make this keyword optional and, if it is not specified, take the filesystem from the first fragment.
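
A minimal sketch of the optional-keyword idea, in plain Python rather than Cython. The `Fragment` and `FileSystemDatasetSketch` classes and the `filesystem` attribute on a fragment are hypothetical stand-ins for illustration, not the actual pyarrow API:

```python
class Fragment:
    """Hypothetical stand-in for a file fragment that carries its filesystem."""

    def __init__(self, path, filesystem):
        self.path = path
        self.filesystem = filesystem


class FileSystemDatasetSketch:
    """Illustrative only; not the real pyarrow class."""

    def __init__(self, fragments, schema, format, filesystem=None):
        fragments = list(fragments)
        if filesystem is None:
            # Backwards-compatible path: infer the filesystem from the
            # first fragment when the caller does not pass one.
            if not fragments:
                raise ValueError(
                    "filesystem must be passed when there are no fragments")
            filesystem = fragments[0].filesystem
        self.fragments = fragments
        self.schema = schema
        self.format = format
        self.filesystem = filesystem
```

With this ordering, existing callers that pass `fragments, schema, format` positionally keep working, and only the empty-fragments case has to supply `filesystem` explicitly.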

##########
File path: cpp/src/arrow/dataset/file_base.cc
##########
@@ -82,25 +82,34 @@ Result<ScanTaskIterator> FileFragment::Scan(std::shared_ptr<ScanOptions> options
 FileSystemDataset::FileSystemDataset(std::shared_ptr<Schema> schema,
                                      std::shared_ptr<Expression> root_partition,
                                      std::shared_ptr<FileFormat> format,
+                                     std::shared_ptr<fs::FileSystem> filesystem,
                                      std::vector<std::shared_ptr<FileFragment>> fragments)
     : Dataset(std::move(schema), std::move(root_partition)),
       format_(std::move(format)),
+      filesystem_(std::move(filesystem)),
       fragments_(std::move(fragments)) {}
 
 Result<std::shared_ptr<FileSystemDataset>> FileSystemDataset::Make(
     std::shared_ptr<Schema> schema, std::shared_ptr<Expression> root_partition,
-    std::shared_ptr<FileFormat> format,
+    std::shared_ptr<FileFormat> format, std::shared_ptr<fs::FileSystem> filesystem,
     std::vector<std::shared_ptr<FileFragment>> fragments) {
-  return std::shared_ptr<FileSystemDataset>(
-      new FileSystemDataset(std::move(schema), std::move(root_partition),
-                            std::move(format), std::move(fragments)));
+  for (const auto& fragment : fragments) {
+    if ((filesystem == nullptr && fragment->source().filesystem() != nullptr) ||
+        (filesystem != nullptr &&
+         !fragment->source().filesystem()->Equals(*filesystem))) {
+      return Status::Invalid("FileSystemDataset's filesystem differed from a fragment's");
+    }
+  }

Review comment:
       This validation should not be needed in many cases (e.g. when the fragments come from filesystem discovery or from ParquetFactory, we already know that all fragments share the same filesystem), so I think we should skip it when possible.
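
One way to make the check avoidable, sketched in Python for brevity. The `make_dataset` function, its `validate` flag, and the `Fragment` tuple are assumptions made for illustration; they are not part of the Arrow API, and the C++ code in this PR may be structured differently:

```python
from collections import namedtuple

# Hypothetical fragment record: a path plus the filesystem it lives on.
Fragment = namedtuple("Fragment", ["path", "filesystem"])


def make_dataset(fragments, filesystem, validate=True):
    """Build a dataset-like dict, checking fragment filesystems only on request."""
    fragments = list(fragments)
    if validate:
        # Same rule as the loop in the diff: every fragment must use the
        # dataset's filesystem (including the case where it is None).
        for fragment in fragments:
            if fragment.filesystem != filesystem:
                raise ValueError(
                    "FileSystemDataset's filesystem differed from a fragment's")
    return {"filesystem": filesystem, "fragments": fragments}
```

Callers that build fragments from a single filesystem themselves, such as a discovery step, could pass `validate=False`, while user-supplied fragment lists keep the default check.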



