Posted to user@arrow.apache.org by Jae Lee <wo...@gmail.com> on 2022/02/10 04:06:23 UTC

[Python] Implementing own Filesystem Subclass in PyArrow v3.0.0

Hi Team,

I would like to implement a custom subclass of
pyarrow.filesystem.FileSystem (or perhaps pyarrow.fs.FileSystem) and was
hoping to leverage the full potential of what pyarrow provides for parquet
files: partitioning, filtering, etc. The underlying storage is cloud-based
and not S3-compatible. Our API only supports:
- CRUD bucket
- CRUD objects
Currently, there is no support for streaming or working with any type of
file handle. I've already looked into how s3fs.cc was implemented but was
not sure I could apply it in my situation.

Questions:
1. What Filesystem class do I need to implement to take full advantage of
what arrow provides in terms of dealing with parquet files?
(pyarrow.filesystem.FileSystem vs pyarrow.fs.FileSystem)
2. Is there any example of implementation of cloud-based non-s3 compatible
filesystem?
3. Given our limited API sets, what would you recommend?

Initially, I was thinking of downloading the entire parquet file/directory
to a local file system and providing a handle, but was curious whether
there would be any better way to handle this.

Thank you in advance!
Jae

Re: [Python] Implementing own Filesystem Subclass in PyArrow v3.0.0

Posted by Weston Pace <we...@gmail.com>.
> 3. Given our limited API sets, what would you recommend?

The filesystem interface is already rather minimal.  We generally
don't put a function in there if we aren't using it somewhere.  That
being said, you can often get away with a mock implementation.  Here is a
quick rundown:

GetFileInfo/OpenInputStream/OpenOutputStream/OpenInputFile - These are
used almost everywhere
CreateDir/DeleteDir/DeleteDirContents - These are used when writing
datasets (so you will need them if you want to write partitioned
parquet)
DeleteFile/Move/CopyFile - I think these may only be used in our unit
tests; you could maybe get by without them

> - CRUD bucket
> - CRUD objects
> Currently, there is no support for streaming or working with any type of file handle. I've already looked into how s3fs.cc was implemented but was not sure I could apply it in my situation.

Some thoughts:

 * Do you support empty directories?

This is a tricky one.  We do rely on empty directories in some of our
datasets APIs.  For example, we CreateDir and then put files in it.
There is some discussion on [1] about how we might emulate this in GCS
but I don't know what exactly got implemented.

 * No support for streaming?

Does this mean you need to download an entire file at a time (e.g. you
can't stream the file or do a partial read of the file)?  In this case
you can mock it by downloading the file and then wrapping it with
arrow::io::BufferReader.  That provides the input stream and readable
file interfaces on top of an in-memory buffer.  You can also probably
use arrow::io::BufferedOutputStream to collect all writes in memory
and then override the Close method to actually persist the write.
That being said, you will of course use considerably more memory than
necessary, so you'll need to make sure your files are small enough to
fit in memory.

[1] https://issues.apache.org/jira/browse/ARROW-1231


Re: [Python] Implementing own Filesystem Subclass in PyArrow v3.0.0

Posted by Joris Van den Bossche <jo...@gmail.com>.
Hi Jae,

Mainly providing an answer on your first question:

On Thu, 10 Feb 2022 at 05:06, Jae Lee <wo...@gmail.com> wrote:

> Hi Team,
>
> I would like to implement a custom subclass of
> pyarrow.filesystem.FileSystem (or perhaps pyarrow.fs.FileSystem) and was
> hoping to leverage the full potential of what pyarrow provides for parquet
> files: partitioning, filtering, etc. The underlying storage is cloud-based
> and not S3-compatible. Our API only supports:
> - CRUD bucket
> - CRUD objects
> Currently, there is no support for streaming or working with any type of
> file handle. I've already looked into how s3fs.cc was implemented but was
> not sure I could apply it in my situation.
>
> Questions:
> 1. What Filesystem class do I need to implement to take full advantage of
> what arrow provides in terms of dealing with parquet files?
> (pyarrow.filesystem.FileSystem vs pyarrow.fs.FileSystem)
>

The pyarrow.filesystem module is deprecated, so you should look at
pyarrow.fs FileSystems. Those filesystems are mostly implemented in C++ and
can't be directly subclassed in Python (only in C++), but there is a
dedicated mechanism to implement a FileSystem in Python, using the
PyFileSystem class and the FileSystemHandler class (see
https://arrow.apache.org/docs/python/api/filesystems.html#filesystem-implementations).

You would need to implement your own FileSystemHandler, and then you can
create a filesystem object that will be recognized by pyarrow functions
with `fs = PyFileSystem(my_handler)`.

We don't really have documentation about this (apart from the API docs
for FileSystemHandler), so it is probably best to look at an example.
We have an actual use of this in our own code base, wrapping
fsspec-compatible Python filesystems, which can serve as an example: see
https://github.com/apache/arrow/blob/c0bae8daea2ace51c64f6db38cfb3d04c5bed657/python/pyarrow/fs.py#L254-L406


> 2. Is there any example of implementation of cloud-based non-s3 compatible
> filesystem?
>

I am not aware of one in Python (in C++, we now also have a Google Cloud
Storage filesystem, but I suppose that has an extensive API). The Python
fsspec package (which can be used in pyarrow through the above-mentioned
handler) implements some filesystems for "cloud" storage (e.g. for HTTP,
FTP), but I am not familiar with the implementation details.


> 3. Given our limited API sets, what would you recommend?
>
> Initially, I was thinking of downloading the entire parquet file/directory
> to a local file system and providing a handle, but was curious whether
> there would be any better way to handle this.
>
> Thank you in advance!
> Jae
>