Posted to dev@arrow.apache.org by Fabian Höring <f....@criteo.com> on 2020/01/24 10:46:23 UTC

Improve the ergonomics of new PyArrow FileSystem API in Python ARROW-7584

Hello,

I created this ticket to discuss possible improvements of the new PyArrow FileSystem API
https://issues.apache.org/jira/browse/ARROW-7584
 
As of today, there seem to be only two popular projects that offer a backend-agnostic FileSystem API able to handle S3 & HDFS from Python:
- PyArrow via https://arrow.apache.org/docs/python/filesystems.html
- TensorFlow via https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile
 
For my part, I would like to reuse a clean FileSystem API in my project and turned to Arrow for this purpose (I think TensorFlow already handles too many use cases and should not provide yet another feature).
 
For me, a "clean FileSystem API" also has to cover the interactive use case, where one uses the API like file system shell commands. We actually used https://github.com/dask/hdfs3 before and it worked really well.
 
The new FileSystem API is currently a work in progress (see https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185), and I would like to take the opportunity to improve it and fix some issues with the existing API.
 
Could you have a look at the comments on https://issues.apache.org/jira/browse/ARROW-7584 and give feedback?
 
I can implement the changes I suggest myself, but would like to make sure they will be accepted.

Best regards,
Fabian Höring

Re: Improve the ergonomics of new PyArrow FileSystem API in Python ARROW-7584

Posted by Wes McKinney <we...@gmail.com>.

hi Fabian

I responded on the JIRA. I'm generally supportive of ergonomic
improvements to the FS API in Python. It might make sense to break the
work into multiple patches to ease the review burden.

Thanks for offering to work on this.

- Wes
