Posted to user@arrow.apache.org by Daniel Nugent <nu...@gmail.com> on 2020/04/29 23:58:28 UTC

'Plain' Dataset Python API doesn't memory map?

Hi,

I'm trying to use the 0.17 dataset API to memory-map an Arrow table stored
in the uncompressed Feather format (ultimately hoping to work with data
larger than memory). It seems like it reads all the constituent files into
memory before creating the Arrow Table object, though.

When I use the FeatherDataset API, it does appear to memory-map the files,
and the Table is created from the mapped data.

Any hints at what I'm doing wrong? I didn't see any options relating to
memory mapping for the general dataset API.

Here's the code for the plain dataset API call:

    from pyarrow.dataset import dataset as ds
    t = ds('demo', format='feather').to_table()

Here's the code for reading using the FeatherDataset API:

    from pyarrow.feather import FeatherDataset as ds
    from pathlib import Path
    t = ds(list(Path('demo').iterdir())).read_table()
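
In case it helps to reproduce, a directory of uncompressed Feather files
like my 'demo' one can be written with something like the following minimal
sketch (the column name, sizes, and file names are made-up placeholders,
not my real data):

    import pyarrow as pa
    import pyarrow.feather as feather
    from pathlib import Path

    Path('demo').mkdir(exist_ok=True)
    for i in range(4):
        # Each part is an uncompressed Feather V2 file so it can be mapped
        table = pa.table({'x': list(range(i * 1_000_000, (i + 1) * 1_000_000))})
        feather.write_feather(table, f'demo/part-{i}.feather',
                              compression='uncompressed')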

Thanks!

-Dan Nugent

Re: 'Plain' Dataset Python API doesn't memory map?

Posted by Daniel Nugent <nu...@gmail.com>.
Thanks Joris. That did the trick.

-Dan Nugent
On Apr 30, 2020, 10:01 -0400, Wes McKinney <we...@gmail.com> wrote:
> For the record, as I believe I've stated elsewhere, I don't agree
> with toggling memory mapping at the filesystem level. If a
> filesystem supports memory mapping, then a consumer of the filesystem
> should IMHO be able to request a memory map.
>
> On Thu, Apr 30, 2020 at 2:27 AM Joris Van den Bossche
> <jo...@gmail.com> wrote:
> >
> > Hi Dan,
> >
> > Currently, the memory mapping in the Datasets API is controlled by the filesystem. So to enable memory mapping for feather, you can do:
> >
> > import pyarrow.dataset as ds
> > from pyarrow.fs import LocalFileSystem
> >
> > fs = LocalFileSystem(use_mmap=True)
> > t = ds.dataset('demo', format='feather', filesystem=fs).to_table()
> >
> > Can you check whether that works for you?
> > We should document this better (there is also some discussion about the best API for this; see https://issues.apache.org/jira/browse/ARROW-8156 and https://issues.apache.org/jira/browse/ARROW-8307).
> >
> > Joris
> >
> > On Thu, 30 Apr 2020 at 01:58, Daniel Nugent <nu...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I'm trying to use the 0.17 dataset API to memory-map an Arrow table stored in the uncompressed Feather format (ultimately hoping to work with data larger than memory). It seems like it reads all the constituent files into memory before creating the Arrow Table object, though.
> > >
> > > When I use the FeatherDataset API, it does appear to memory-map the files, and the Table is created from the mapped data.
> > >
> > > Any hints at what I'm doing wrong? I didn't see any options relating to memory mapping for the general dataset API.
> > >
> > > Here's the code for the plain dataset API call:
> > >
> > > from pyarrow.dataset import dataset as ds
> > > t = ds('demo', format='feather').to_table()
> > >
> > > Here's the code for reading using the FeatherDataset API:
> > >
> > > from pyarrow.feather import FeatherDataset as ds
> > > from pathlib import Path
> > > t = ds(list(Path('demo').iterdir())).read_table()
> > >
> > > Thanks!
> > >
> > > -Dan Nugent

Re: 'Plain' Dataset Python API doesn't memory map?

Posted by Wes McKinney <we...@gmail.com>.
For the record, as I believe I've stated elsewhere, I don't agree
with toggling memory mapping at the filesystem level. If a
filesystem supports memory mapping, then a consumer of the filesystem
should IMHO be able to request a memory map.
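
To make the distinction concrete, here is a short sketch contrasting the
two designs; the per-call memory_map= parameter below is hypothetical, not
an existing pyarrow API:

from pyarrow.fs import LocalFileSystem

# Status quo: mapping is a property of the filesystem instance, so every
# file opened through it is memory-mapped.
fs = LocalFileSystem(use_mmap=True)

# The alternative argued for here: the consumer requests it per call, e.g.
#   f = fs.open_input_file(path, memory_map=True)   # hypothetical parameter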

On Thu, Apr 30, 2020 at 2:27 AM Joris Van den Bossche
<jo...@gmail.com> wrote:
>
> Hi Dan,
>
> Currently, the memory mapping in the Datasets API is controlled by the filesystem. So to enable memory mapping for feather, you can do:
>
> import pyarrow.dataset as ds
> from pyarrow.fs import LocalFileSystem
>
> fs = LocalFileSystem(use_mmap=True)
> t = ds.dataset('demo', format='feather', filesystem=fs).to_table()
>
> Can you check whether that works for you?
> We should document this better (there is also some discussion about the best API for this; see https://issues.apache.org/jira/browse/ARROW-8156 and https://issues.apache.org/jira/browse/ARROW-8307).
>
> Joris
>
> On Thu, 30 Apr 2020 at 01:58, Daniel Nugent <nu...@gmail.com> wrote:
>>
>> Hi,
>>
>> I'm trying to use the 0.17 dataset API to memory-map an Arrow table stored in the uncompressed Feather format (ultimately hoping to work with data larger than memory). It seems like it reads all the constituent files into memory before creating the Arrow Table object, though.
>>
>> When I use the FeatherDataset API, it does appear to memory-map the files, and the Table is created from the mapped data.
>>
>> Any hints at what I'm doing wrong? I didn't see any options relating to memory mapping for the general dataset API.
>>
>> Here's the code for the plain dataset API call:
>>
>>     from pyarrow.dataset import dataset as ds
>>     t = ds('demo', format='feather').to_table()
>>
>> Here's the code for reading using the FeatherDataset API:
>>
>>     from pyarrow.feather import FeatherDataset as ds
>>     from pathlib import Path
>>     t = ds(list(Path('demo').iterdir())).read_table()
>>
>> Thanks!
>>
>> -Dan Nugent

Re: 'Plain' Dataset Python API doesn't memory map?

Posted by Joris Van den Bossche <jo...@gmail.com>.
Hi Dan,

Currently, the memory mapping in the Datasets API is controlled by the
filesystem. So to enable memory mapping for feather, you can do:

import pyarrow.dataset as ds
from pyarrow.fs import LocalFileSystem

fs = LocalFileSystem(use_mmap=True)
t = ds.dataset('demo', format='feather', filesystem=fs).to_table()
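
As a rough sanity check that the mapping is effective, you can compare the
process's peak RSS before and after materializing the table; a minimal
Unix-only sketch (note ru_maxrss is reported in KiB on Linux but in bytes
on macOS):

import resource

import pyarrow.dataset as ds
from pyarrow.fs import LocalFileSystem

def peak_rss():
    # Peak resident set size of this process so far
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss()
t = ds.dataset('demo', format='feather',
               filesystem=LocalFileSystem(use_mmap=True)).to_table()
# With use_mmap=True the delta should stay well below the on-disk size;
# with the default filesystem it grows by roughly the full dataset size.
print(t.num_rows, peak_rss() - before)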

Can you check whether that works for you?
We should document this better (there is also some discussion about the
best API for this; see https://issues.apache.org/jira/browse/ARROW-8156
and https://issues.apache.org/jira/browse/ARROW-8307).

Joris

On Thu, 30 Apr 2020 at 01:58, Daniel Nugent <nu...@gmail.com> wrote:

> Hi,
>
> I'm trying to use the 0.17 dataset API to memory-map an Arrow table stored
> in the uncompressed Feather format (ultimately hoping to work with data
> larger than memory). It seems like it reads all the constituent files into
> memory before creating the Arrow Table object, though.
>
> When I use the FeatherDataset API, it does appear to memory-map the files,
> and the Table is created from the mapped data.
>
> Any hints at what I'm doing wrong? I didn't see any options relating to
> memory mapping for the general dataset API.
>
> Here's the code for the plain dataset API call:
>
>     from pyarrow.dataset import dataset as ds
>     t = ds('demo', format='feather').to_table()
>
> Here's the code for reading using the FeatherDataset API:
>
>     from pyarrow.feather import FeatherDataset as ds
>     from pathlib import Path
>     t = ds(list(Path('demo').iterdir())).read_table()
>
> Thanks!
>
> -Dan Nugent
>