Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2018/07/06 15:27:14 UTC

Using a shared filesystem abstract API in Arrow Python libraries [was Re: file-system specification]

hi Martin and Antoine,

I apologize that I haven't been able to look at this in detail yet. I
think this is a valuable initiative; I created a wiki page so we can
begin to develop a plan to do the work:

https://cwiki.apache.org/confluence/display/ARROW/Python+Filesystems+and+Filesystem+API

I added a JIRA filter and tagged a couple of filesystem-related
issues; there are more that should be added to the list. There's a lot
of other work related to filesystem implementations that we can help
organize and plan here.

As far as refining the details of the API, we should see what's the
best place to collect feedback and discuss. Martin, can you set up a
pull request with the entire patch so that many people can comment and
discuss?

NB: TensorFlow defines a filesystem abstraction, albeit in C++ with
SWIG bindings. We might also look there as a check on some of our
assumptions.

Thank you,
Wes

On Tue, May 15, 2018 at 7:47 AM, Antoine Pitrou <so...@pitrou.net> wrote:
>
> Hi Martin,
>
> On Wed, 9 May 2018 11:28:15 -0400
> Martin Durant <ma...@utoronto.ca> wrote:
>> I have sketched out a possible start of a python-wide file-system specification
>> https://github.com/martindurant/filesystem_spec
>>
>> This came about from my work in some other (remote) file-systems implementations for python, particularly in the context of Dask. Since arrow also cares about both local files and, for example, hdfs, I thought that people on this list may have comments and opinions about a possible standard that we ought to converge on. I do not think that my suggestions so far are necessarily right or even good in many cases, but I want to get the conversation going.
>
> Here are some comments:
>
> - API naming: you seem to favour re-using Unix command-line monikers in
>   some places, while using more regular verbs or names in other
>   places.  I think it should be consistent.  Since the Unix
>   command-line doesn't exactly cover the exposed functionality, and
>   since Unix tends to favour short cryptic names, I think it's better
>   to use Python-like naming (which is also more familiar to non-Unix
>   users). For example "move" or "rename" or "replace" instead of "mv",
>   etc.
>
> - **kwargs parameters: a couple of APIs (`mkdir`, `put`...) allow passing
>   arbitrary parameters, which I assume are intended to be
>   backend-specific.  It makes it difficult to add other optional
>   parameters to those APIs in the future.  So I'd make the
>   backend-specific directives a single (optional) dict parameter rather
>   than a **kwargs.
>
> - `invalidate_cache` doesn't state whether it invalidates recursively
>   or not (recursively sounds better intuitively?).  Also, I think it
>   would be more flexible to take a list of paths rather than a single
>   path.
>
> - `du`: the effect of the `deep` parameter isn't obvious to me. I don't
>   know what it would mean *not* to recurse here: what is the size of a
>   directory if you don't recurse into it?
>
> - `glob` may need a formal definition (are trailing slashes
>   significant for directory or symlink resolution? this kind of thing),
>   though you may want to keep edge cases backend-specific.
>
> - are `head` and `tail` at all useful? They can be easily recreated
>   using a generic `open` facility.
>
> - `read_block` tries to do too much in a single API IMHO, and
>   using `open` directly is more flexible anyway.
>
> - if `touch` is intended to emulate the Unix API of the same name, the
>   docstring should state "Create empty file or update last modification
>   timestamp".
>
> - the information dicts returned by several APIs (`ls`, `info`...)
>   need standardizing, at least for non backend-specific fields.
>
> - if the backend is a networked filesystem with non-trivial latency,
>   perhaps the operations would deserve being batched (operate on
>   several paths at once), though I will happily defer to your expertise
>   on the topic.
>
> Regards
>
> Antoine.
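
Two of the review points above (`head`/`tail` being recreatable from a
generic `open`, and a single optional dict instead of `**kwargs`) can be
illustrated with a minimal sketch. Everything in it is an assumption made
for illustration: the function signatures, the `opener` and `options`
parameter names, and the "mode" key are hypothetical and are not taken
from filesystem_spec or from Arrow. The local filesystem stands in for
any backend.

```python
import os


def head(path, size=1024, opener=open):
    """Return the first `size` bytes, using only a generic open()."""
    with opener(path, "rb") as f:
        return f.read(size)


def tail(path, size=1024, opener=open):
    """Return the last `size` bytes, using only a generic open()."""
    with opener(path, "rb") as f:
        f.seek(0, os.SEEK_END)          # jump to the end to learn the length
        length = f.tell()
        f.seek(max(length - size, 0))   # clamp for files shorter than `size`
        return f.read(size)


def mkdir(path, create_parents=True, options=None):
    """Hypothetical signature: backend-specific directives travel in one
    optional dict, so new keyword parameters can be added later without
    colliding with backend option names."""
    options = options or {}
    mode = options.get("mode", 0o777)   # "mode" is an illustrative key only
    if create_parents:
        os.makedirs(path, mode=mode, exist_ok=True)
    else:
        os.mkdir(path, mode)
```

With a signature like this, a call such as
`mkdir("/tmp/a/b", options={"mode": 0o755})` keeps backend directives
separate from the generic parameters, which is the property the
`**kwargs` comment asks for.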