Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/04/13 09:04:00 UTC

[jira] [Assigned] (ARROW-8290) [Python][Dataset] Improve ergonomy of the FileSystemDataset constructor

     [ https://issues.apache.org/jira/browse/ARROW-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche reassigned ARROW-8290:
--------------------------------------------

    Assignee: Joris Van den Bossche

> [Python][Dataset] Improve ergonomy of the FileSystemDataset constructor
> -----------------------------------------------------------------------
>
>                 Key: ARROW-8290
>                 URL: https://issues.apache.org/jira/browse/ARROW-8290
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset
>
> Currently, to manually create a FileSystemDataset, you can do something like:
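> {code}
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.fs  # makes pa.fs available
>
> # `schema` is assumed to be a pyarrow.Schema defined elsewhere
> dataset = ds.FileSystemDataset(
>         schema, None, ds.ParquetFileFormat(), pa.fs.LocalFileSystem(),
>         ["data_file1.parquet", "data_file2.parquet"],
>         [ds.field('file') == 1, ds.field('file') == 2])
> {code}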
> There are some usability improvements we could make, though:
> - Allow passing the arguments by name to improve readability of the calling code (currently they all need to be passed positionally, because of the way they are declared in Cython as {{not None}}).
> - I would maybe change the order of the arguments (e.g. start with the paths; we don't need to match the order of the C++ constructor).
> - Potentially allow {{partitions}} to be optional, in which case they would need to be set internally to a list of ScalarExpression(True) values.
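
The three proposals above can be sketched in plain Python. This is only an illustration of the suggested calling convention, not the pyarrow API: the names normalize_dataset_args, DatasetArgs and ALWAYS_TRUE are hypothetical, and ALWAYS_TRUE is a simple placeholder standing in for ScalarExpression(True). The sketch also assumes one partition expression per path.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Sequence

# Hypothetical placeholder for an "always true" partition expression
# (the issue refers to ScalarExpression(True)); not a pyarrow object.
ALWAYS_TRUE = "ScalarExpression(True)"


@dataclass
class DatasetArgs:
    """Hypothetical container for the normalized constructor arguments."""
    paths: List[str]
    schema: Any
    format: Any
    filesystem: Any
    partitions: List[Any]
    root_partition: Any = None


def normalize_dataset_args(
    paths: Sequence[str],
    *,  # everything after the paths must be passed by name
    schema: Any,
    format: Any,
    filesystem: Any,
    partitions: Optional[Sequence[Any]] = None,
    root_partition: Any = None,
) -> DatasetArgs:
    """Paths come first; the other arguments are keyword-only."""
    paths = list(paths)
    if partitions is None:
        # Optional partitions: fall back to one always-true expression per path.
        partitions = [ALWAYS_TRUE] * len(paths)
    else:
        partitions = list(partitions)
        if len(partitions) != len(paths):
            raise ValueError(
                f"got {len(partitions)} partitions for {len(paths)} paths"
            )
    return DatasetArgs(paths, schema, format, filesystem,
                       partitions, root_partition)


if __name__ == "__main__":
    args = normalize_dataset_args(
        ["data_file1.parquet", "data_file2.parquet"],
        schema="<schema>", format="<parquet format>", filesystem="<local fs>",
    )
    print(args.partitions)
```

With keyword-only arguments the call site documents itself, and leaving out partitions yields the same number of placeholder expressions as there are paths.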


