You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/17 12:39:46 UTC

[GitHub] [arrow] jorisvandenbossche opened a new pull request #7468: ARROW-8283: [Python] Limit FileSystemDataset constructor from fragments/paths, no filesystem interaction

jorisvandenbossche opened a new pull request #7468:
URL: https://github.com/apache/arrow/pull/7468


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7468: ARROW-8283: [Python] Limit FileSystemDataset constructor from fragments/paths, no filesystem interaction

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #7468:
URL: https://github.com/apache/arrow/pull/7468#discussion_r442117037



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -407,42 +407,82 @@ cdef class UnionDataset(Dataset):
 
 
 cdef class FileSystemDataset(Dataset):
-    """A Dataset created from a set of files on a particular filesystem.
+    """A Dataset of file fragments.
+
+    A FileSystemDataset is composed of one or more FileFragment.
 
     Parameters
     ----------
-    paths_or_selector : Union[FileSelector, List[FileInfo]]
-        List of files/directories to consume.
+    fragments : list[Fragments]
+        List of fragments to consume.
     schema : Schema
-        The top-level schema of the DataDataset.
+        The top-level schema of the Dataset.
     format : FileFormat
-        File format to create fragments from, currently only
-        ParquetFileFormat, IpcFileFormat, and CsvFileFormat are supported.
-    filesystem : FileSystem
-        The filesystem which files are from.
-    partitions : List[Expression], optional
-        Attach additional partition information for the file paths.
+        File format of the fragments, currently only ParquetFileFormat,
+        IpcFileFormat, and CsvFileFormat are supported.
     root_partition : Expression, optional
         The top-level partition of the DataDataset.
     """
 
     cdef:
         CFileSystemDataset* filesystem_dataset
 
-    def __init__(self, paths_or_selector, schema=None, format=None,
-                 filesystem=None, partitions=None, root_partition=None):
+    def __init__(self, fragments, Schema schema, FileFormat format,
+                 root_partition=None):
         cdef:
-            FileInfo info
-            Expression expr
             FileFragment fragment
-            vector[CFileInfo] c_file_infos
-            vector[shared_ptr[CExpression]] c_partitions
-            shared_ptr[CFileFragment] c_fragment
             vector[shared_ptr[CFileFragment]] c_fragments
             CResult[shared_ptr[CDataset]] result
 
         root_partition = root_partition or _true
+        if not isinstance(root_partition, Expression):
+            raise TypeError(
+                "Argument 'root_partition' has incorrect type (expected "
+                "Epression, got {0})".format(type(root_partition))
+            )
+
+        for fragment in fragments:
+            c_fragments.push_back(
+                static_pointer_cast[CFileFragment, CFragment](
+                    fragment.unwrap()))
+
+        result = CFileSystemDataset.Make(
+            pyarrow_unwrap_schema(schema),
+            (<Expression> root_partition).unwrap(),
+            (<FileFormat> format).unwrap(),
+            c_fragments
+        )
+        self.init(GetResultValue(result))
+
+    cdef void init(self, const shared_ptr[CDataset]& sp):
+        Dataset.init(self, sp)
+        self.filesystem_dataset = <CFileSystemDataset*> sp.get()
+
+    @classmethod
+    def from_paths(cls, paths, schema=None, format=None,

Review comment:
       Yes, it's not optional in practice: a bit below I check that the value is a Schema (and not None). 
   I did this here to give proper error messages when omitting one of those (there is some discussion about this in the original PR that added this: https://github.com/apache/arrow/pull/6913#discussion_r407510524), because cython is otherwise generating very confusing error messages. I still need to create a minimal example to report to cython ..
   
   But added a test for `from_paths` that covers this to ensure it is indeed checked.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] fsaintjacques commented on a change in pull request #7468: ARROW-8283: [Python] Limit FileSystemDataset constructor from fragments/paths, no filesystem interaction

Posted by GitBox <gi...@apache.org>.
fsaintjacques commented on a change in pull request #7468:
URL: https://github.com/apache/arrow/pull/7468#discussion_r441638604



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -407,42 +407,82 @@ cdef class UnionDataset(Dataset):
 
 
 cdef class FileSystemDataset(Dataset):
-    """A Dataset created from a set of files on a particular filesystem.
+    """A Dataset of file fragments.
+
+    A FileSystemDataset is composed of one or more FileFragment.
 
     Parameters
     ----------
-    paths_or_selector : Union[FileSelector, List[FileInfo]]
-        List of files/directories to consume.
+    fragments : list[Fragments]
+        List of fragments to consume.
     schema : Schema
-        The top-level schema of the DataDataset.
+        The top-level schema of the Dataset.
     format : FileFormat
-        File format to create fragments from, currently only
-        ParquetFileFormat, IpcFileFormat, and CsvFileFormat are supported.
-    filesystem : FileSystem
-        The filesystem which files are from.
-    partitions : List[Expression], optional
-        Attach additional partition information for the file paths.
+        File format of the fragments, currently only ParquetFileFormat,
+        IpcFileFormat, and CsvFileFormat are supported.
     root_partition : Expression, optional
         The top-level partition of the DataDataset.
     """
 
     cdef:
         CFileSystemDataset* filesystem_dataset
 
-    def __init__(self, paths_or_selector, schema=None, format=None,
-                 filesystem=None, partitions=None, root_partition=None):
+    def __init__(self, fragments, Schema schema, FileFormat format,
+                 root_partition=None):
         cdef:
-            FileInfo info
-            Expression expr
             FileFragment fragment
-            vector[CFileInfo] c_file_infos
-            vector[shared_ptr[CExpression]] c_partitions
-            shared_ptr[CFileFragment] c_fragment
             vector[shared_ptr[CFileFragment]] c_fragments
             CResult[shared_ptr[CDataset]] result
 
         root_partition = root_partition or _true
+        if not isinstance(root_partition, Expression):
+            raise TypeError(
+                "Argument 'root_partition' has incorrect type (expected "
+                "Epression, got {0})".format(type(root_partition))
+            )
+
+        for fragment in fragments:
+            c_fragments.push_back(
+                static_pointer_cast[CFileFragment, CFragment](
+                    fragment.unwrap()))
+
+        result = CFileSystemDataset.Make(
+            pyarrow_unwrap_schema(schema),
+            (<Expression> root_partition).unwrap(),
+            (<FileFormat> format).unwrap(),
+            c_fragments
+        )
+        self.init(GetResultValue(result))
+
+    cdef void init(self, const shared_ptr[CDataset]& sp):
+        Dataset.init(self, sp)
+        self.filesystem_dataset = <CFileSystemDataset*> sp.get()
+
+    @classmethod
+    def from_paths(cls, paths, schema=None, format=None,

Review comment:
       Schema shouldn't be optional, add a test if the user doesn't pass a schema to see the result.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] fsaintjacques closed pull request #7468: ARROW-8283: [Python] Limit FileSystemDataset constructor from fragments/paths, no filesystem interaction

Posted by GitBox <gi...@apache.org>.
fsaintjacques closed pull request #7468:
URL: https://github.com/apache/arrow/pull/7468


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7468: ARROW-8283: [Python] Limit FileSystemDataset constructor from fragments/paths, no filesystem interaction

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7468:
URL: https://github.com/apache/arrow/pull/7468#issuecomment-645351919


   https://issues.apache.org/jira/browse/ARROW-8283


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org