Posted to dev@arrow.apache.org by Weston Pace <we...@gmail.com> on 2020/08/25 21:38:35 UTC

Creating filesystems that read local files

I created a RelativeFileSystem that extended FileSystem and proxied
calls to a LocalFileSystem instance.  This filesystem allowed me to
specify a base directory and then all paths were resolved relative to
that base directory (so fs.open("foo.parquet") became
self.target.open("C:\Datadir\foo.parquet")).
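
A minimal sketch of the proxy pattern described above, using only the
standard library. The class body, the `opener` parameter and the
`resolve` helper are illustrative assumptions, not the code from the
original experiment and not part of pyarrow.

```python
import os
import tempfile


class RelativeFileSystem:
    """Hypothetical proxy that resolves every path against a base directory."""

    def __init__(self, base_dir, opener=open):
        self.base_dir = base_dir
        # `opener` stands in for the wrapped filesystem's open method.
        self.opener = opener

    def resolve(self, path):
        # "foo.parquet" -> "<base_dir>/foo.parquet"
        return os.path.join(self.base_dir, path)

    def open(self, path, mode="rb"):
        # Delegate to the wrapped opener with the resolved path.
        return self.opener(self.resolve(path), mode)


# Usage: create a file in a temporary base directory and read it back
# through the proxy using only its relative name.
with tempfile.TemporaryDirectory() as base:
    with open(os.path.join(base, "foo.parquet"), "wb") as f:
        f.write(b"example bytes")
    fs = RelativeFileSystem(base)
    resolved = fs.resolve("foo.parquet")
    expected = os.path.join(base, "foo.parquet")
    with fs.open("foo.parquet") as f:
        content = f.read()
```

Because the proxy only forwards calls, any caller that branches on the
concrete filesystem type sees it as a non-local filesystem.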

However, because it was not a LocalFileSystem instance it was treated
differently by arrow at:

https://github.com/apache/arrow/blob/de8bfddae8704a998d910f2a84bd1e2f7bd934d1/python/pyarrow/parquet.py#L1043

Instead of using a native file reader, the open method was called and
the data was read from a Python file object.  Besides the performance
impact, I also received a "ResourceWarning: unclosed file" when running
`read` on a dataset piece.

To avoid these warnings I changed RelativeFileSystem to subclass
LocalFileSystem instead of proxying to it.

Is this the recommended approach for reading local files?  If so I can
probably add something to the filesystems docs.  Part of the problem
is that the undesired behavior can be difficult to detect.  Had I not
been running with warnings enabled, I would not have noticed the
ResourceWarning.  And if that ResourceWarning were patched away, I
probably would not have noticed the problem until I realized my
performance had dropped for some reason.
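
One way to make this kind of leak visible in a test is to record
ResourceWarning explicitly. The sketch below uses only the standard
library; the helper names (`count_resource_warnings`, `leaky_read`,
`clean_read`) are made up for illustration, and the behavior relies on
CPython finalizing the dropped file object promptly.

```python
import gc
import os
import tempfile
import warnings


def count_resource_warnings(action):
    """Run `action` and return how many ResourceWarnings it triggered."""
    with warnings.catch_warnings(record=True) as caught:
        # ResourceWarning is ignored by default, so enable it explicitly.
        warnings.simplefilter("always", ResourceWarning)
        action()
        gc.collect()  # make sure dropped file objects are finalized
    return sum(issubclass(w.category, ResourceWarning) for w in caught)


def leaky_read(path):
    f = open(path, "rb")
    f.read()  # the file object is dropped without being closed


def clean_read(path):
    with open(path, "rb") as f:
        f.read()


# Usage: compare a leaking read with a properly closed one.
fd, tmp_path = tempfile.mkstemp()
os.write(fd, b"x")
os.close(fd)
leaky = count_resource_warnings(lambda: leaky_read(tmp_path))
clean = count_resource_warnings(lambda: clean_read(tmp_path))
os.remove(tmp_path)
```

A check like this only catches the leak, not the slower read path, so
it covers just one of the two symptoms mentioned above.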

Re: Creating filesystems that read local files

Posted by Weston Pace <we...@gmail.com>.
Actually my workaround (extending LocalFileSystem) does not work since
`open` is never called in this case and the path is not normalized to
the base directory.
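
A toy model of why subclassing does not help, under the assumption that
the reader branches on the filesystem type and hands the path string
directly to the native reader for local filesystems. The classes and
`read_piece` below are stand-ins written for illustration; they are not
pyarrow's classes or its actual code.

```python
class LocalFileSystem:
    """Stand-in for the legacy local filesystem class (not pyarrow's)."""

    def open(self, path):
        return ("python-file", path)


class RelativeFileSystem(LocalFileSystem):
    """Subclass that tries to rewrite paths inside open()."""

    def __init__(self, base_dir):
        self.base_dir = base_dir

    def open(self, path):
        return super().open(self.base_dir + "/" + path)


def read_piece(path, filesystem):
    # Toy version of the type-based branch described in the thread.
    if isinstance(filesystem, LocalFileSystem):
        # Fast path: the path string goes straight to the native reader,
        # so filesystem.open() and its path rewriting are skipped.
        return ("native-reader", path)
    return filesystem.open(path)


result = read_piece("foo.parquet", RelativeFileSystem("/data"))
```

In this model `result` carries the unmodified relative path, which
matches the symptom: the native reader is used, but the base directory
is never applied.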


Re: Creating filesystems that read local files

Posted by Weston Pace <we...@gmail.com>.
Ok.  I think I have it figured out as:

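import pyarrow as pa
import pyarrow.dataset  # makes pa.dataset available

# short_files: paths relative to the subtree root (assumed)
# subtree_filesystem: a pyarrow.fs.SubTreeFileSystem instance (assumed)
num_rows = 0
dataset = pa.dataset.dataset(short_files, filesystem=subtree_filesystem)
for fragment in dataset.get_fragments():
    # Read the Parquet footer metadata only, not the column data.
    fragment.ensure_complete_metadata()
    if fragment.row_groups:
        for row_group in fragment.row_groups:
            num_rows += row_group.num_rows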


Re: Creating filesystems that read local files

Posted by Weston Pace <we...@gmail.com>.
Thanks Joris / Antoine,

It appears I will have to learn the new datasets API.  I can confirm
that SubTreeFileSystem is working for me.  In case there is still
interest here is the code I had from before reproducing the issue:
https://gist.github.com/westonpace/4107c1c492cdd78d611595d43e72964d

It looks like the new ParquetDataset (_ParquetDatasetV2) is protected
and also that `pieces` is deprecated.  I was previously using that for
filtering pieces based on metadata statistics (it looks like the new
"filters" feature takes care of this for me) as well as accessing
piece metadata to count the number of rows in the dataset without
loading anything other than the metadata.  Do you know off the top of
your head what would be a good approach to count the rows in that way?


Re: Creating filesystems that read local files

Posted by Joris Van den Bossche <jo...@gmail.com>.
Hi Weston,

Currently there are two filesystem interfaces in pyarrow, a legacy one in
`pyarrow.filesystem` and a new one in `pyarrow.fs` (see
https://issues.apache.org/jira/browse/ARROW-9645 and
https://arrow.apache.org/docs/python/filesystems_deprecated.html, docs are
still a bit scarce).

Based on your description, I assume you are using the "legacy"
LocalFileSystem.
In the new filesystems, however, I think there is already the feature you
are looking for, called "SubTreeFileSystem", created from a base directory
and another filesystem instance.

Best,
Joris



Re: Creating filesystems that read local files

Posted by Antoine Pitrou <an...@python.org>.
Hi Weston,

Can you show the code for your experiment?
(or post equivalent code)

Regards

Antoine.

