Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/27 18:57:20 UTC

[GitHub] [arrow] bkietz opened a new pull request #8069: ARROW-9867: [C++][Dataset] Add FileSystemDataset::filesystem property

bkietz opened a new pull request #8069:
URL: https://github.com/apache/arrow/pull/8069


   In addition, the constructor now ensures that all fragments in a file system dataset belong to the provided filesystem.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz closed pull request #8069: ARROW-9867: [C++][Dataset] Add FileSystemDataset::filesystem property

Posted by GitBox <gi...@apache.org>.
bkietz closed pull request #8069:
URL: https://github.com/apache/arrow/pull/8069


   





[GitHub] [arrow] nealrichardson commented on a change in pull request #8069: ARROW-9867: [C++][Dataset] Add FileSystemDataset::filesystem property

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #8069:
URL: https://github.com/apache/arrow/pull/8069#discussion_r478630774



##########
File path: cpp/src/arrow/dataset/file_base.h
##########
@@ -195,15 +195,15 @@ class ARROW_DS_EXPORT FileSystemDataset : public Dataset {
   /// \param[in] schema the schema of the dataset
   /// \param[in] root_partition the partition expression of the dataset
   /// \param[in] format the format of each FileFragment.
-  /// \param[in] fragments list of fragments to create the dataset from
+  /// \param[in] fragments list of fragments to create the dataset from.
   ///
-  /// Note that all fragment must be of `FileFragment` type. The type are
-  /// erased to simplify callers.
+  /// Note that fragments wrapping files resident in a differing filesystems is not

Review comment:
       ```suggestion
     /// Note that fragments wrapping files resident in a differing filesystems are not
   ```







[GitHub] [arrow] github-actions[bot] commented on pull request #8069: ARROW-9867: [C++][Dataset] Add FileSystemDataset::filesystem property

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8069:
URL: https://github.com/apache/arrow/pull/8069#issuecomment-682136288


   https://issues.apache.org/jira/browse/ARROW-9867





[GitHub] [arrow] bkietz commented on pull request #8069: ARROW-9867: [C++][Dataset] Add FileSystemDataset::filesystem property

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #8069:
URL: https://github.com/apache/arrow/pull/8069#issuecomment-683876573


   The CI failure is an S3 flake: https://github.com/apache/arrow/pull/8069/checks?check_run_id=1042786261#step:8:3006





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8069: ARROW-9867: [C++][Dataset] Add FileSystemDataset::filesystem property

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8069:
URL: https://github.com/apache/arrow/pull/8069#discussion_r478689723



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -539,7 +549,7 @@ cdef class FileSystemDataset(Dataset):
         ]:
             if not isinstance(arg, class_):
                 raise TypeError(
-                    "Argument '{0}' has incorrect type (expected {1}, "
+                    "Argument '{0}' wtf has incorrect type (expected {1}, "

Review comment:
       This was not an intended change .. ? ;-)

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -467,12 +467,13 @@ cdef class FileSystemDataset(Dataset):
     cdef:
         CFileSystemDataset* filesystem_dataset
 
-    def __init__(self, fragments, Schema schema, FileFormat format,
+    def __init__(self, filesystem, fragments, Schema schema, FileFormat format,

Review comment:
       Can you add this to the docstring?
   I would also move this keyword after `format`; I think it is most logical for `fragments` to come first.
   
   If we want to keep this backwards compatible, we could also make this keyword optional and, if it is not specified, take it from the first fragment.
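
   The optional-keyword idea in the comment above could look roughly like the sketch below. It is only an illustration under stated assumptions: `ToyFileSystem`, `ToyFragment`, and `resolve_filesystem` are hypothetical stand-ins, not pyarrow classes or functions, and the real Cython constructor may handle this differently.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ToyFileSystem:
    """Hypothetical stand-in for a filesystem handle (not a pyarrow class)."""
    name: str


@dataclass(frozen=True)
class ToyFragment:
    """Hypothetical stand-in for a file fragment that remembers its filesystem."""
    path: str
    filesystem: Optional[ToyFileSystem]


def resolve_filesystem(
    fragments: Sequence[ToyFragment],
    filesystem: Optional[ToyFileSystem] = None,
) -> Optional[ToyFileSystem]:
    """Return the explicit filesystem, else fall back to the first fragment's."""
    if filesystem is not None:
        # The caller passed the keyword explicitly: use it as given.
        return filesystem
    if not fragments:
        # Nothing to infer from; leave the filesystem unset.
        return None
    # Backwards-compatible default: take it from the first fragment.
    return fragments[0].filesystem
```

   With this shape, existing callers that pass only fragments keep working, while new callers can name the filesystem explicitly.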

##########
File path: cpp/src/arrow/dataset/file_base.cc
##########
@@ -82,25 +82,34 @@ Result<ScanTaskIterator> FileFragment::Scan(std::shared_ptr<ScanOptions> options
 FileSystemDataset::FileSystemDataset(std::shared_ptr<Schema> schema,
                                      std::shared_ptr<Expression> root_partition,
                                      std::shared_ptr<FileFormat> format,
+                                     std::shared_ptr<fs::FileSystem> filesystem,
                                      std::vector<std::shared_ptr<FileFragment>> fragments)
     : Dataset(std::move(schema), std::move(root_partition)),
       format_(std::move(format)),
+      filesystem_(std::move(filesystem)),
       fragments_(std::move(fragments)) {}
 
 Result<std::shared_ptr<FileSystemDataset>> FileSystemDataset::Make(
     std::shared_ptr<Schema> schema, std::shared_ptr<Expression> root_partition,
-    std::shared_ptr<FileFormat> format,
+    std::shared_ptr<FileFormat> format, std::shared_ptr<fs::FileSystem> filesystem,
     std::vector<std::shared_ptr<FileFragment>> fragments) {
-  return std::shared_ptr<FileSystemDataset>(
-      new FileSystemDataset(std::move(schema), std::move(root_partition),
-                            std::move(format), std::move(fragments)));
+  for (const auto& fragment : fragments) {
+    if ((filesystem == nullptr && fragment->source().filesystem() != nullptr) ||
+        (filesystem != nullptr &&
+         !fragment->source().filesystem()->Equals(*filesystem))) {
+      return Status::Invalid("FileSystemDataset's filesystem differed from a fragment's");
+    }
+  }

Review comment:
       This validation should not be needed in many cases (e.g. with filesystem discovery or with ParquetFactory, we already know for sure that all fragments come from the same filesystem), so I think we should avoid it when possible.
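
   One way to make the check avoidable is an opt-out flag that trusted callers set to skip the loop. The sketch below is a minimal Python illustration of that idea, not the code in this pull request: `ToyFragment`, `make_dataset`, and the `validate` parameter are hypothetical names, and filesystems are modeled as plain strings (or `None`) compared by equality.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class ToyFragment:
    """Hypothetical stand-in for a file fragment (not an Arrow class)."""
    path: str
    filesystem: Optional[str]  # filesystem modeled as a plain string here


def make_dataset(
    filesystem: Optional[str],
    fragments: Sequence[ToyFragment],
    validate: bool = True,
) -> Tuple[Optional[str], Tuple[ToyFragment, ...]]:
    """Build a toy dataset, optionally checking each fragment's filesystem."""
    if validate:
        for fragment in fragments:
            # Plain equality also covers the case where one side is None.
            if fragment.filesystem != filesystem:
                raise ValueError(
                    "dataset's filesystem differed from a fragment's: "
                    + fragment.path
                )
    # Callers that built the fragments themselves (e.g. discovery) can pass
    # validate=False and skip the per-fragment loop entirely.
    return filesystem, tuple(fragments)
```

   The default keeps the safety check for hand-built fragment lists, while a factory that already guarantees a single filesystem would pass `validate=False`.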



