
Posted to dev@arrow.apache.org by Alvin Chunga Mamani <al...@voltrondata.com> on 2022/05/06 06:03:17 UTC

[DISCUSS][C++][Python]Switch default mmap behaviour to off

Hi all,
I am starting this discussion about the change to disable the use of mmap
by default. Using mmap is risky on non-local/pseudo file systems, where it
can hurt performance.
Part of the solution would be a compile-time flag that enables or disables
the use of mmap in Arrow C++/pyarrow.
An analysis of the use of mmap in database management systems is presented
in [1].


Thanks.

[1] https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf

Re: [DISCUSS][C++][Python]Switch default mmap behaviour to off

Posted by Antoine Pitrou <an...@python.org>.


On 11/05/2022 at 10:19, Alessandro Molina wrote:
> As far as I understood, the idea is not to remove memory mapping entirely,
> just to change the current mmap=True default arguments to mmap=False.
> 
> The goal is mostly to provide consistent behaviour for end users. At the
> moment users might see very different performance when they read locally
> or on a network filesystem like NFS, because we try to use memory
> mapping on both. But there were user reports that trying to memory-map on
> NFS led to terrible performance. By disabling memory mapping by default
> we can offer a more consistent experience to users.

Ideally we should be able to detect local vs. remote filesystems, but 
there doesn't seem to be an easy way to do that.
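
To illustrate why detection is awkward, here is a minimal Linux-only sketch that guesses the filesystem type of a path from /proc/mounts-style text. It is a heuristic, not something Arrow does: the helper names and the list of network filesystem types are assumptions made for this example, and the list is certainly incomplete. Other platforms would need entirely different code, which is part of the problem.

```python
import os

# Assumption: an illustrative, incomplete set of network filesystem type names.
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "smb3", "afs", "ceph",
                    "glusterfs", "lustre", "9p", "fuse.sshfs"}


def parse_mounts(mounts_text):
    """Return (mount_point, fs_type) pairs from /proc/mounts-style text."""
    entries = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            # /proc/mounts escapes spaces in mount points as \040.
            mount_point = fields[1].replace("\\040", " ")
            entries.append((mount_point, fields[2]))
    return entries


def fs_type_for(path, mounts):
    """Return the fs type of the longest mount point that contains `path`."""
    best_point, best_type = "", None
    for mount_point, fs_type in mounts:
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # ">=" lets a later mount shadow an earlier one at the same point.
            if len(mount_point) >= len(best_point):
                best_point, best_type = mount_point, fs_type
    return best_type


def looks_remote(path, mounts_text=None):
    """Heuristic: True if `path` appears to live on a network filesystem."""
    if mounts_text is None:
        with open("/proc/mounts") as f:  # Linux only
            mounts_text = f.read()
    mounts = parse_mounts(mounts_text)
    return fs_type_for(os.path.realpath(path), mounts) in NETWORK_FS_TYPES
```

Even on Linux this misses FUSE-backed remote stores with unlisted type names, bind mounts, and container overlays, so a result of False does not prove that a path is local.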

Re: [DISCUSS][C++][Python]Switch default mmap behaviour to off

Posted by Alessandro Molina <al...@ursacomputing.com>.

As far as I understood, the idea is not to remove memory mapping entirely,
just to change the current mmap=True default arguments to mmap=False.

The goal is mostly to provide consistent behaviour for end users. At the
moment users might see very different performance when they read locally
or on a network filesystem like NFS, because we try to use memory
mapping on both. But there were user reports that trying to memory-map on
NFS led to terrible performance. By disabling memory mapping by default
we can offer a more consistent experience to users.
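
As a rough sketch of what an opt-in default looks like from the caller's side, the helper below reads a file into memory unless memory mapping is requested explicitly. It uses only the Python standard library; `open_buffer` and `use_mmap` are hypothetical names chosen for this example and are not part of the Arrow API.

```python
import mmap
from contextlib import contextmanager


@contextmanager
def open_buffer(path, use_mmap=False):
    """Yield a read-only, bytes-like view of the file at `path`.

    use_mmap=False (the default proposed in this thread) reads the whole
    file into memory; use_mmap=True maps it and lets the OS page it in.
    """
    with open(path, "rb") as f:
        if not use_mmap:
            yield f.read()
            return
        # Hypothetical opt-in path: the mapping is closed when the block exits.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            yield mapped
```

A caller who knows the file is local would write `with open_buffer(path, use_mmap=True) as buf:`; everyone else gets the plain read, which behaves the same on local and network filesystems.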

Switching memory mapping off by default when reading/writing formats
shouldn't affect local performance much, as most formats need to go
through a decode phase and thus won't benefit much from memory mapping. The
only format where mmap can really be effective is IPC. And in that
case, users who know what they are doing can still pass mmap=True.
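
The difference between the two kinds of format can be sketched with the standard library alone. In this example a file of raw native int64 values stands in for a layout that can be used in place (as with IPC), and a zlib-compressed copy stands in for any format that needs a decode phase. Both stand-ins and the function names are assumptions made for illustration; this is not how Arrow's readers are implemented.

```python
import mmap
import os
import tempfile
import zlib
from array import array


def head_from_mapped(path, count):
    """Map a file of raw native int64 values and read the first `count` in place."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # cast() reinterprets the mapped bytes as int64 without copying the
            # file; only the pages actually touched are loaded by the OS.
            with memoryview(mapped) as raw, raw.cast("q") as ints:
                return [ints[i] for i in range(count)]


def head_from_compressed(path, count):
    """Read a zlib-compressed file: every byte is read and decoded first."""
    with open(path, "rb") as f:
        decoded = array("q")
        decoded.frombytes(zlib.decompress(f.read()))
    return list(decoded[:count])


values = array("q", range(1000))
with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "raw.bin")
    packed_path = os.path.join(tmp, "packed.bin")
    with open(raw_path, "wb") as f:
        f.write(values.tobytes())
    with open(packed_path, "wb") as f:
        f.write(zlib.compress(values.tobytes()))

    mapped_head = head_from_mapped(raw_path, 3)
    decoded_head = head_from_compressed(packed_path, 3)
```

Both calls return the same values, but the compressed path has to materialise the whole decoded array in fresh memory regardless of how the input bytes were obtained, so mapping the input would save little there.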

We would still keep memory mapping enabled for some features. For example,
in the future we might implement spillover of datasets, in which case the
spillover would probably rely on memory mapping.

On Fri, May 6, 2022 at 10:09 AM Sasha Krassovsky <kr...@gmail.com>
wrote:

> Hi,
> Which use of mmap are you referring to in the code base? Mmap in general
> could have a lot of different uses. The point of the paper you linked is
> that database management systems should explicitly manage their paging to
> and from disk to maintain transactional consistency or to avoid performance
> penalties if the working set doesn’t fit in memory. Arrow doesn’t care
> about the former. As for the latter, something like IPC might make good use
> of mmap. It might not even be writing to a real file on disk but to a
> stream or even to another process’s address space. In that scenario mmap
> definitely does make sense.
>
> That’s not to say this isn’t something worth discussing, but I feel the
> paper’s results are much more nuanced than “we should remove mmap because
> mmap is bad”. It would help to have some specific instances to look at to
> see if it makes sense to switch to something else.
>
> Sasha Krassovsky
>
> > On May 5, 2022, at 23:03, Alvin Chunga Mamani <al...@voltrondata.com> wrote:
> >
> > Hi all,
> > I am starting this discussion about the change to disable the use of mmap
> > by default. Using mmap is risky on non-local/pseudo file systems, where it
> > can hurt performance.
> > Part of the solution would be a compile-time flag that enables or disables
> > the use of mmap in Arrow C++/pyarrow.
> > An analysis of the use of mmap in database management systems is presented
> > in [1].
> >
> >
> > Thanks.
> >
> > [1] https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf
>

Re: [DISCUSS][C++][Python]Switch default mmap behaviour to off

Posted by Sasha Krassovsky <kr...@gmail.com>.

Hi,
Which use of mmap in the code base are you referring to? Mmap in general can have many different uses. The point of the paper you linked is that database management systems should explicitly manage their paging to and from disk, either to maintain transactional consistency or to avoid performance penalties if the working set doesn’t fit in memory. Arrow doesn’t care about the former. As for the latter, something like IPC might make good use of mmap. It might not even be writing to a real file on disk but to a stream or even to another process’s address space. In that scenario mmap definitely does make sense.

That’s not to say this isn’t something worth discussing, but I feel the paper’s results are much more nuanced than “we should remove mmap because mmap is bad”. It would help to have some specific instances to look at to see if it makes sense to switch to something else.

Sasha Krassovsky

> On May 5, 2022, at 23:03, Alvin Chunga Mamani <al...@voltrondata.com> wrote:
> 
> Hi all,
> I am starting this discussion about the change to disable the use of mmap
> by default. Using mmap is risky on non-local/pseudo file systems, where it
> can hurt performance.
> Part of the solution would be a compile-time flag that enables or disables
> the use of mmap in Arrow C++/pyarrow.
> An analysis of the use of mmap in database management systems is presented
> in [1].
> 
> 
> Thanks.
> 
> [1] https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf