Posted to dev@arrow.apache.org by Weston Pace <we...@gmail.com> on 2020/08/26 22:57:44 UTC

Writing parquet to new filesystem API

Forgive me if I am missing something obvious but I am unable to write
parquet files using the new filesystem API.

Here is what I am trying:

https://gist.github.com/westonpace/0c5ef01e21a40de5d16608b7f12de80d

I receive an error:

OSError: Unrecognized filesystem: <class 'pyarrow._fs.SubTreeFileSystem'>

Re: Writing parquet to new filesystem API

Posted by Joris Van den Bossche <jo...@gmail.com>.
Small correction, I of course meant to use "write_table" and not
"write_to_dataset" in the code snippet (as the latter won't work that way).
Corrected example below:

> But, if you already want to use the new filesystems for writing as well,
> there is one workaround to create an output stream manually and pass that
> instead of the path.
> So in your example, you could replace
>
> pq.write_table(table, out_path, filesystem=subtree_filesystem)
>
> with
>
> with subtree_filesystem.open_output_stream(out_path) as f:
>     pq.write_table(table, f)
>
> However, this only works with single files (and not yet with
> write_to_dataset for partitioned datasets).

Re: Writing parquet to new filesystem API

Posted by Joris Van den Bossche <jo...@gmail.com>.
Hi Weston,

You are not missing anything obvious. This is an unfortunate
"transitional phase": we have new filesystems, but they are not yet
fully supported. On the reading side they are supported in pyarrow 1.0,
but the writing side is still being actively worked on and will only
arrive in the next release. (I actually have an open PR to add support
for the new filesystems to pq.write_table:
https://github.com/apache/arrow/pull/7991)

But, if you already want to use the new filesystems for writing as well,
there is one workaround to create an output stream manually and pass that
instead of the path.
So in your example, you could replace

pq.write_to_dataset(table, out_path, filesystem=subtree_filesystem)

with

with subtree_filesystem.open_output_stream(out_path) as f:
    pq.write_table(table, f)

However, this only works with single files (and not yet with
write_to_dataset for partitioned datasets).

Best,
Joris

On Thu, 27 Aug 2020 at 00:58, Weston Pace <we...@gmail.com> wrote:
>
> Forgive me if I am missing something obvious but I am unable to write
> parquet files using the new filesystem API.
>
> Here is what I am trying:
>
> https://gist.github.com/westonpace/0c5ef01e21a40de5d16608b7f12de80d
>
> I receive an error:
>
> OSError: Unrecognized filesystem: <class 'pyarrow._fs.SubTreeFileSystem'>