Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/02 15:24:18 UTC

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7156: ARROW-8074: [C++][Dataset][Python] FileFragments from buffers and NativeFiles

jorisvandenbossche commented on a change in pull request #7156:
URL: https://github.com/apache/arrow/pull/7156#discussion_r433912711



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -42,6 +43,51 @@ def _forbid_instantiation(klass, subclasses_instead=True):
     raise TypeError(msg)
 
 
+ctypedef CResult[shared_ptr[CRandomAccessFile]] CCustomOpen()
+
+cdef class FileSource:
+
+    cdef:
+        # XXX why is shared_ptr necessary here? CFileSource shouldn't need it
+        CFileSource wrapped
+
+    def __cinit__(self, file, FileSystem filesystem=None):
+        cdef:
+            shared_ptr[CFileSystem] c_filesystem
+            c_string c_path
+            function[CCustomOpen] c_open
+            shared_ptr[CBuffer] c_buffer
+
+        if isinstance(file, FileSource):
+            self.wrapped = (<FileSource> file).wrapped
+
+        elif isinstance(file, Buffer):
+            c_buffer = pyarrow_unwrap_buffer(file)
+            self.wrapped = CFileSource(move(c_buffer))
+
+        elif _is_path_like(file):
+            if filesystem is None:
+                raise ValueError("cannot construct a FileSource from "
+                                 "a path without a FileSystem")
+            c_filesystem = filesystem.unwrap()
+            c_path = tobytes(_stringify_path(file))
+            self.wrapped = CFileSource(move(c_path), move(c_filesystem))
+
+        else:
+            c_open = BindMethod[CCustomOpen](
+                wrap_python_file(file, mode='r'),
+                &NativeFile.get_random_access_file)
+            self.wrapped = CFileSource(move(c_open))
+
+    @staticmethod
+    def from_uri(uri):

Review comment:
       I don't think we need to expose `FileSource` publicly, so it shouldn't matter too much
   (we still need to make a choice for internal usage, of course).

##########
File path: cpp/src/arrow/dataset/discovery.cc
##########
@@ -102,11 +102,10 @@ Result<std::shared_ptr<Dataset>> UnionDatasetFactory::Finish(FinishOptions optio
   return std::shared_ptr<Dataset>(new UnionDataset(options.schema, std::move(children)));
 }
 
-FileSystemDatasetFactory::FileSystemDatasetFactory(
-    std::vector<std::string> paths, std::shared_ptr<fs::FileSystem> filesystem,
-    std::shared_ptr<FileFormat> format, FileSystemFactoryOptions options)
-    : paths_(std::move(paths)),
-      fs_(std::move(filesystem)),
+FileSystemDatasetFactory::FileSystemDatasetFactory(std::vector<FileSource> sources,
+                                                   std::shared_ptr<FileFormat> format,
+                                                   FileSystemFactoryOptions options)

Review comment:
       Would it be useful to keep a version with the original signature accepting paths/filesystem, and do the conversion to sources here (using `SourcesFromPaths`), as a convenience for downstream users (R, Python)?
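
       A rough sketch of the shape I have in mind, written in Python only for brevity (the real change would be a C++ constructor overload). `SourceSketch`, `sources_from_paths` and `PathFactorySketch` are hypothetical stand-ins, not the actual `FileSource`, `SourcesFromPaths` or `FileSystemDatasetFactory`, and the argument order is an assumption:

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class SourceSketch:
    """Hypothetical stand-in for a source: a path plus the filesystem that opens it."""
    path: str
    filesystem: Any


def sources_from_paths(filesystem: Any, paths: List[str]) -> List[SourceSketch]:
    # Pair every path with the single shared filesystem.
    return [SourceSketch(path, filesystem) for path in paths]


class PathFactorySketch:
    """Hypothetical stand-in for a factory whose main constructor takes sources."""

    def __init__(self, sources: List[SourceSketch]):
        self.sources = list(sources)

    @classmethod
    def from_paths(cls, paths: List[str], filesystem: Any) -> "PathFactorySketch":
        # Convenience entry point keeping the old paths/filesystem signature:
        # convert to sources first, then delegate to the main constructor.
        return cls(sources_from_paths(filesystem, paths))
```

       Downstream callers could then keep passing paths and a filesystem, while the sources-based constructor stays the single place where the factory is actually built.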

##########
File path: python/pyarrow/dataset.py
##########
@@ -411,7 +421,14 @@ def _filesystem_dataset(source, schema=None, filesystem=None,
     partitioning = _ensure_partitioning(partitioning)
 
     if isinstance(source, (list, tuple)):
-        fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
+        if all(_is_path_like(elem) for elem in source):
+            fs, paths_or_selector = _ensure_multiple_sources(source,
+                                                             filesystem)
+        else:
+            fs, paths_or_selector = _MockFileSystem(), source

Review comment:
       We could also make the `filesystem` keyword in the `FileSystemDatasetFactory` init optional and manually raise an error when it is required (e.g. when passing a selector), so that we can use `None` here instead of the `_MockFileSystem` hack?
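
       Something along these lines, as a minimal pure-Python sketch rather than the actual Cython code. `SelectorSketch`, `is_path_like` and `OptionalFsFactorySketch` are hypothetical names used for illustration, and the exact set of cases that require a filesystem is an assumption:

```python
import os


class SelectorSketch:
    """Hypothetical stand-in for a file selector; expanding it needs a filesystem."""

    def __init__(self, base_dir):
        self.base_dir = base_dir


def is_path_like(obj):
    # Hypothetical helper standing in for the module's own path check.
    return isinstance(obj, (str, os.PathLike))


class OptionalFsFactorySketch:
    """Hypothetical stand-in for the factory's Python-level init."""

    def __init__(self, paths_or_selector, filesystem=None):
        if filesystem is None:
            # A selector can only be expanded by listing a real filesystem.
            if isinstance(paths_or_selector, SelectorSketch):
                raise ValueError(
                    "a filesystem is required when passing a selector")
            # Plain paths also need a filesystem to be opened.
            if any(is_path_like(src) for src in paths_or_selector):
                raise ValueError(
                    "a filesystem is required when passing paths")
        # Buffers and file-like objects are fine without a filesystem.
        self.filesystem = filesystem
        self.paths_or_selector = paths_or_selector
```

       With a check like this, the non-path branch in `_filesystem_dataset` could pass `None` as the filesystem, and the error would surface only in the cases that really need one.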

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -42,6 +43,51 @@ def _forbid_instantiation(klass, subclasses_instead=True):
     raise TypeError(msg)
 
 
+ctypedef CResult[shared_ptr[CRandomAccessFile]] CCustomOpen()
+
+cdef class FileSource:
+
+    cdef:
+        # XXX why is shared_ptr necessary here? CFileSource shouldn't need it
+        CFileSource wrapped
+
+    def __cinit__(self, file, FileSystem filesystem=None):
+        cdef:
+            shared_ptr[CFileSystem] c_filesystem
+            c_string c_path
+            function[CCustomOpen] c_open
+            shared_ptr[CBuffer] c_buffer
+
+        if isinstance(file, FileSource):
+            self.wrapped = (<FileSource> file).wrapped
+
+        elif isinstance(file, Buffer):
+            c_buffer = pyarrow_unwrap_buffer(file)
+            self.wrapped = CFileSource(move(c_buffer))
+
+        elif _is_path_like(file):
+            if filesystem is None:
+                raise ValueError("cannot construct a FileSource from "
+                                 "a path without a FileSystem")
+            c_filesystem = filesystem.unwrap()
+            c_path = tobytes(_stringify_path(file))
+            self.wrapped = CFileSource(move(c_path), move(c_filesystem))
+
+        else:
+            c_open = BindMethod[CCustomOpen](
+                wrap_python_file(file, mode='r'),
+                &NativeFile.get_random_access_file)
+            self.wrapped = CFileSource(move(c_open))
+
+    @staticmethod
+    def from_uri(uri):

Review comment:
       But for our own usage, it is fine with me to move this into the main constructor (since that already handles a lot).

##########
File path: python/pyarrow/dataset.py
##########
@@ -29,6 +29,7 @@
     DirectoryPartitioning,
     FileFormat,
     FileFragment,
+    FileSource,

Review comment:
       Ah, I see it is used below.

##########
File path: python/pyarrow/dataset.py
##########
@@ -29,6 +29,7 @@
     DirectoryPartitioning,
     FileFormat,
     FileFragment,
+    FileSource,

Review comment:
       I would remove this here; I don't think there is a need right now for the user to create this manually.



