Posted to dev@arrow.apache.org by Andrew Piskorski <at...@piskorski.com> on 2022/05/09 16:38:20 UTC

mmap only, read data later?

Hello, I'm using R package arrow_7.0.0.tar.gz, in R 4.1.1, on Linux
(Ubuntu 18.04.4 LTS).

In R, I am mmap-ing many small Arrow files by calling arrow::read_feather()
with as_data_frame=FALSE on each one.  Compressed with lz4, each file
is quite small, often only 25 kB or so, but I'll often be mmap-ing
many thousands of them.  From the time this takes, I suspect that
Arrow is reading the full contents of each file rather than just
setting up the mmap, but I don't know how to properly check that.

I would like to make sure that at this stage, I JUST mmap each file,
and defer reading their data until later when I actually need it.  Are
there any settings or arguments I can use to make sure that happens?
Or ways to verify precisely what is happening?

I think I found the relevant C++ code in "r/src/io.cpp" and
"cpp/src/arrow/io/file.cc", but I definitely don't understand its
performance implications, nor how to control this sort of thing.

Thanks for your help and advice!

-- 
Andrew Piskorski <at...@piskorski.com>

Re: mmap only, read data later?

Posted by Weston Pace <we...@gmail.com>.
> Or ways to verify precisely what is happening?

Regrettably, mmap is quite difficult to monitor.  With strace you can
verify the mapping is being setup:

    strace -y R --no-save < /tmp/script.R 2>&1 | grep -i foo.arrow
    ...
    mmap(NULL, 490, PROT_READ, MAP_PRIVATE, 3</tmp/foo.arrow>...

Once the mapping is set up, future reads happen via page faults, which
are hard to observe directly.  Perhaps the easiest thing to do is:

 1. Ensure the file(s) are completely evicted from the OS's page cache
 2. Run your test(s)
 3. Use a tool like pcstat[1] to determine which parts of your file are
    now in the page cache

As Antoine said, you may need to account for a certain amount of OS
level readahead.

[1] https://github.com/tobert/pcstat
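To make the fault-on-access behavior concrete, here is a minimal Python
sketch of the same idea (a Linux-only illustration, since it parses
/proc/self/smaps; pcstat inspects the page cache via mincore instead).
It maps a file read-only, checks how much of the mapping is resident,
touches every page, and checks again:

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

def mapping_rss_kb(path):
    """Sum the Rss (resident pages, in kB) of this process's mappings
    of `path`, as reported by /proc/self/smaps (Linux only)."""
    total = 0
    in_mapping = False
    with open("/proc/self/smaps") as f:
        for line in f:
            if line[0].isdigit() or line[0] in "abcdef":
                # Mapping header line: "7f..-7f.. r--s ... /tmp/xyz"
                in_mapping = line.rstrip().endswith(path)
            elif in_mapping and line.startswith("Rss:"):
                total += int(line.split()[1])
    return total

# A 64-page file standing in for an Arrow file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * (64 * PAGE))
    path = f.name

fd = os.open(path, os.O_RDONLY)
m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)

before = mapping_rss_kb(path)  # mapping set up, nothing faulted in yet
m[:]                           # read every byte: faults every page in
after = mapping_rss_kb(path)

print(before, after)           # e.g. 0 vs. roughly the whole file
m.close(); os.close(fd); os.unlink(path)
```

The same residency check can be done from outside the process with
pcstat (against the page cache rather than one process's mapping).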


Re: mmap only, read data later?

Posted by Sasha Krassovsky <kr...@gmail.com>.
Hi Andrew,
Unfortunately, mmap implements “transparent paging”: the OS decides when pages of the file are read from (and written back to) disk. That means Arrow has no way of controlling when the file is actually read, and with files this small it's possible that the OS prefetches the whole file. That said, I've seen the act of doing thousands of mmaps itself become a significant overhead, as mmap is a fairly expensive system call.

As for solutions, is there some reason you need mmap? Could you instead just open an InputStream for each file, and only call read_feather later, when you actually need the data?
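As a sketch of that deferred-read idea (plain Python standing in for
the R calls; `LazyTable` and its `reads` counter are invented for
illustration): record the path cheaply up front, and do the actual
read, once, on first access.

```python
import os
import tempfile

class LazyTable:
    """Hold only a path; read the file on first access, then cache it."""
    def __init__(self, path):
        self.path = path
        self._data = None
        self.reads = 0          # how many times we actually hit the disk

    def data(self):
        if self._data is None:
            self.reads += 1
            with open(self.path, "rb") as f:   # in R: read_feather(path)
                self._data = f.read()
        return self._data

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"feather bytes")
    path = f.name

t = LazyTable(path)      # cheap: no I/O, no mmap, no metadata parse
payload = t.data()       # the file is read here...
payload = t.data()       # ...and served from cache here
print(t.reads)           # -> 1
os.unlink(path)
```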

Sasha Krassovsky 


Re: mmap only, read data later?

Posted by Weston Pace <we...@gmail.com>.
If you are reading this as a dataset, and you are not partitioning on
disk, then it is going to read the entire content of every file:
there is currently no statistics-based file skipping enabled for IPC
files.

If you have some kind of filter, and you can partition your data on
the same columns you are using in your filter, then you should be able
to reduce the total amount of I/O by allowing the dataset reader to
skip entire files based on the pathname.
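A sketch of what that pathname-based skipping amounts to (plain
Python; the hive-style `year=...` layout mirrors what a partitioned
write_dataset() produces, and `files_matching` is an invented helper):
candidate files are kept or dropped purely from their directory names,
without opening any of them.

```python
import os
import tempfile

# Hive-style partition layout, as written by a partitioned write_dataset().
root = tempfile.mkdtemp()
for year in (2020, 2021, 2022):
    part = os.path.join(root, f"year={year}")
    os.makedirs(part)
    open(os.path.join(part, "part-0.arrow"), "wb").close()

def files_matching(root, **filters):
    """Select files by matching `key=value` path segments against the
    filter; files in non-matching partitions are never even opened."""
    kept = []
    for dirpath, _dirs, files in os.walk(root):
        segments = dirpath[len(root):].split(os.sep)
        parts = dict(s.split("=", 1) for s in segments if "=" in s)
        if all(parts.get(k, str(v)) == str(v) for k, v in filters.items()):
            kept.extend(os.path.join(dirpath, name) for name in files)
    return kept

selected = files_matching(root, year=2022)
print(selected)   # only the file under year=2022/
```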


Re: mmap only, read data later?

Posted by Antoine Pitrou <an...@python.org>.
On 10/05/2022 at 04:36, Andrew Piskorski wrote:
> On Mon, May 09, 2022 at 07:00:47PM +0200, Antoine Pitrou wrote:
> 
>> Generally, the Arrow IPC file/stream formats are designed for large
>> data. If you have many very small files you might try to rethink how you
>> store your data on disk.
> 
> Ah.  Is this because of the overhead of mmap itself, or the metadata
> that must be read separately for each file, (or both)?

Because no particular effort was spent to optimize per-file overhead 
(and, yes, metadata must be read independently for each file). By the 
way, the same thing can be said of Parquet files.

> Would creating my files with write_dataset() instead of write_feather()
> help?

I don't think that would change anything, assuming you end up with the 
same set of files at the end. What could improve things is *reading* the 
data as a dataset, as the datasets layer is able to parallelize reads to 
cover latencies.
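The gain from the datasets layer comes largely from overlapping those
per-file latencies. A rough stdlib-Python analogue of that scan
pattern (file names, contents, and sizes here are arbitrary
placeholders):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Many small files, like the thousands of small Arrow files in question.
root = tempfile.mkdtemp()
paths = []
for i in range(100):
    p = os.path.join(root, f"part-{i}.arrow")
    with open(p, "wb") as f:
        f.write(os.urandom(1024))
    paths.append(p)

def load(path):
    with open(path, "rb") as f:
        return f.read()

# Reading with a pool lets one file's latency hide behind another's,
# which is roughly what the dataset scanner does internally.
with ThreadPoolExecutor(max_workers=8) as pool:
    tables = list(pool.map(load, paths))

print(len(tables))   # -> 100
```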

> Btw, I have no problem if Linux decides to pre-fetch my mmap-ed data;
> that's what mmap is for after all.  What I DON'T want, is for Arrow to
> WAIT for that data to actually be fetched.  Or at least I want it to
> wait as little as possible, as presumably it must read some metadata.
> Are there ways I should minimize the amount of (possibly redundant)
> metadata Arrow needs to read?

If possible, I would suggest writing files incrementally using the IPC 
stream format, which could allow you to consolidate the data in a 
smaller number of files. Whether that's possible depends on how the data 
is produced, of course (do these files correspond to distinct 
observations in time?).


Re: mmap only, read data later?

Posted by Andrew Piskorski <at...@piskorski.com>.
On Mon, May 09, 2022 at 07:00:47PM +0200, Antoine Pitrou wrote:

> Generally, the Arrow IPC file/stream formats are designed for large 
> data. If you have many very small files you might try to rethink how you 
> store your data on disk.

Ah.  Is this because of the overhead of mmap itself, or the metadata
that must be read separately for each file (or both)?

Would creating my files with write_dataset() instead of write_feather()
help?  That is, with write_dataset() and open_dataset() I'd have fewer
calls to each, but the partitioning of my dataset would give me a
layout of files on disk similar to what I have now with individual
Arrow/Feather files.

Btw, I have no problem if Linux decides to pre-fetch my mmap-ed data;
that's what mmap is for after all.  What I DON'T want, is for Arrow to
WAIT for that data to actually be fetched.  Or at least I want it to
wait as little as possible, as presumably it must read some metadata.
Are there ways I should minimize the amount of (possibly redundant)
metadata Arrow needs to read?

-- 
Andrew Piskorski <at...@piskorski.com>

Re: mmap only, read data later?

Posted by Antoine Pitrou <an...@python.org>.
Hi Andrew,

If the Arrow files are small, chances are the metadata (which is always 
read) is as large on disk as the actual data (which is "only" 
mmap'ed). Also, mmap'ing works at page granularity (a page is typically 
4 kB on x86, sometimes larger on other architectures), and the kernel 
will typically read ahead a bit, so when the metadata is read, the 
kernel will probably also pull in some of the data laid out just after 
it.
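A small Python sketch of that effect (a Linux-only illustration, since
it reads /proc/self/smaps): map a file, read only its last page, the
way a footer parse touches only the metadata, and observe that most of
the mapping stays non-resident apart from some readahead.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE
NPAGES = 256

def mapping_rss_kb(path):
    """Resident kB of this process's mappings of `path` (Linux smaps)."""
    total = 0
    in_mapping = False
    with open("/proc/self/smaps") as f:
        for line in f:
            if line[0].isdigit() or line[0] in "abcdef":
                in_mapping = line.rstrip().endswith(path)
            elif in_mapping and line.startswith("Rss:"):
                total += int(line.split()[1])
    return total

# 256 pages standing in for an IPC file: data first, "footer" last.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * (NPAGES * PAGE))
    path = f.name

fd = os.open(path, os.O_RDONLY)
m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)

m[-PAGE:]                       # read only the "footer" page
resident = mapping_rss_kb(path)
total = NPAGES * PAGE // 1024
print(resident, "of", total, "kB resident")

m.close(); os.close(fd); os.unlink(path)
```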

Generally, the Arrow IPC file/stream formats are designed for large 
data. If you have many very small files you might try to rethink how you 
store your data on disk.

Regards

Antoine.

