Posted to jira@arrow.apache.org by "Hendrik Makait (Jira)" <ji...@apache.org> on 2020/12/13 09:40:00 UTC

[jira] [Commented] (ARROW-9938) [Python] Add filesystem capabilities to other IO formats (feather, csv, json, ..)?

    [ https://issues.apache.org/jira/browse/ARROW-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248540#comment-17248540 ] 

Hendrik Makait commented on ARROW-9938:
---------------------------------------

Unless someone is already working on this, I'd love to start putting together a PR for it. Since this will be my first contribution, I may ask for guidance along the way. As a first question: should I split the work into one PR per format (i.e. separate PRs for csv, feather, and json), or combine everything into one larger PR?

> [Python] Add filesystem capabilities to other IO formats (feather, csv, json, ..)?
> ----------------------------------------------------------------------------------
>
>                 Key: ARROW-9938
>                 URL: https://issues.apache.org/jira/browse/ARROW-9938
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: filesystem
>
> In the parquet IO functions, we support reading/writing files from non-local filesystems directly (in addition to passing a buffer) by:
> - passing a URI (e.g. {{pq.read_table("s3://bucket/data.parquet")}})
> - specifying the filesystem keyword (e.g. {{pq.read_table("bucket/data.parquet", filesystem=S3FileSystem(...))}})
> On the other hand, for other file formats such as feather, we only support local files or buffers. So for those, you need the more manual approach (I _suppose_ this works?):
> {code:python}
> from pyarrow import fs, feather
> s3 = fs.S3FileSystem()
> with s3.open_input_file("bucket/data.arrow") as file:
>     table = feather.read_table(file)
> {code}
> So I think the question comes up: do we want to extend this filesystem support to other file formats (feather, csv, json) and make this more uniform across pyarrow, or do we prefer to keep the plain readers more low-level (and people can use the datasets API for more convenience)?
> cc [~apitrou] [~kszucs]
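
To make the question in the description concrete, here is a minimal sketch of what a uniform entry point could look like. The helper name `read_with_filesystem` and its signature are hypothetical and not part of pyarrow; the sketch only assumes that the filesystem object exposes `open_input_file` (as `pyarrow.fs` filesystems do) and that `pyarrow.fs.FileSystem.from_uri` resolves a URI into a filesystem and a path.

```python
def read_with_filesystem(reader, source, filesystem=None):
    """Hypothetical helper: call ``reader`` on ``source`` via an optional filesystem.

    ``reader`` is any function that accepts a path or a file-like object,
    for example ``pyarrow.feather.read_table`` or ``pyarrow.csv.read_csv``.
    """
    if filesystem is None and isinstance(source, str) and "://" in source:
        # URI case: let pyarrow infer the filesystem and the path within it.
        # Imported lazily so the other branches do not need this import.
        from pyarrow import fs

        filesystem, source = fs.FileSystem.from_uri(source)

    if filesystem is None:
        # Local path or already-open buffer: hand it to the reader unchanged.
        return reader(source)

    # Explicit (or inferred) filesystem: open the path there, then read.
    with filesystem.open_input_file(source) as f:
        return reader(f)
```

Under these assumptions, `read_with_filesystem(feather.read_table, "bucket/data.arrow", filesystem=fs.S3FileSystem())` would behave like the manual snippet in the description, and the same wrapper could be reused for the csv and json readers. Whether such logic belongs inside each reader or stays in the datasets API is exactly the open design question.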


