Posted to dev@arrow.apache.org by John Muehlhausen <jg...@jgm.org> on 2021/10/20 21:25:59 UTC

Create large IPC format record batch(es) in-place without copy or prior data analysis

Motivation:

We have memory-mappable Arrow IPC files with N batches where column(s) are
sorted to support binary search.  Because log2(n) < log2(n/2) + log2(n/2)
for any n > 4 (e.g. one search over 2^20 rows costs 20 bisections versus
19 + 19 over two half-sized batches), and binary search must be repeated on
each batch, we prefer the batches to be as large as possible to reduce
total search time... perhaps larger than available RAM.  On the read side,
only the pages needed for the search bisections and subsequent slice
traversal are mapped in, of course.

The question then becomes one of creating large IPC-format files where
individual batches do not exist first in RAM because of their size.

Conceptually, this would seem to entail:
* allocating a fixed mmap'd area for writing to (a rough sketch of this
step follows the list)
* using builders to create buffers at the locations they would occupy in
the IPC format, and freezing these as arrays (if I understand the
terminology correctly)
* plopping in various other things such as metadata, schema, etc
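
As a minimal sketch of the first step, assuming plain POSIX mmap rather
than any particular Arrow API (the path and reserved size are placeholders,
and error handling is reduced to asserts):

#include <cassert>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
  const char* path = "/tmp/big_batch.arrow";           // placeholder path
  const int64_t reserved = 64LL * 1024 * 1024 * 1024;  // guessed capacity (64 GiB)

  // Create a sparse file of the reserved size and map it writable.
  int fd = ::open(path, O_RDWR | O_CREAT, 0644);
  assert(fd != -1);
  assert(::ftruncate(fd, reserved) == 0);

  void* base = ::mmap(nullptr, reserved, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  assert(base != MAP_FAILED);

  // ... magic bytes, schema message, record batch buffers, and footer would
  // be placed inside [base, base + reserved) here ...

  ::munmap(base, reserved);
  ::close(fd);
  return 0;
}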

One difficulty is that we want to size this area without having first
analyzed the data to be written to it, since such an analysis consumes
compute resources.  Therefore the area set aside for (e.g.) a
variable-length string column would be a guess based on statistics, and we
would simply write into the column buffers until the first one fills up,
which may leave the others (or the full one itself) partially unpopulated.
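
To make that guess concrete, here is a back-of-the-envelope sizing sketch;
the row capacity and average string length are hypothetical statistics, not
anything Arrow computes for you:

#include <cstdint>
#include <cstdio>

int64_t Align8(int64_t v) { return (v + 7) & ~int64_t{7}; }

int main() {
  const int64_t row_capacity = 10'000'000;  // guess from historical statistics
  const int64_t avg_str_len  = 40;          // guess from historical statistics

  int64_t validity = Align8((row_capacity + 7) / 8);      // 1 validity bit per row
  int64_t offsets  = Align8((row_capacity + 1) * 4);      // int32 offsets
  int64_t data     = Align8(row_capacity * avg_str_len);  // character data

  std::printf("reserve %lld bytes for this string column\n",
              static_cast<long long>(validity + offsets + data));
  return 0;
}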

This could result in some "wasted space" in the file, a tradeoff we can
live with for the above reasons.  That brings me back to
https://issues.apache.org/jira/browse/ARROW-5916 where this was discussed
before (and another discussion is linked there).  The idea was to allow
record batch lengths to be smaller than the associated buffer lengths,
which seemed like an easy change at the time... although I'll grant that we
only use trivial Arrow types, and in more complex cases there may be
side-effects I can't envision.

One of the ideas was to go ahead and fill in the buffers to create a valid
record batch but then store the sliced-down length in (e.g.) the
user-defined metadata, but this forces anyone using the IPC file to use a
non-standard mechanism to reject the "data" that fills the unpopulated
buffer sections.

Even with the ability for a batch to be smaller than its buffers (to help
readers reject the residual of the buffers without referring to custom
metadata), I think I'm left with needing to write low-level code outside of
the Arrow library to create such a file, since I cannot first create the
batch in RAM and then copy it out: the batch may not fit in RAM, and in any
case I want to avoid the copy.

Any thoughts on creating large IPC format record batch(es) in-place in a
single pre-allocated buffer, that could be used with mmap?

Here is someone with a similar concern:
https://www.mail-archive.com/user@arrow.apache.org/msg01187.html

It seems like the
https://arrow.apache.org/docs/cpp/examples/row_columnar_conversion.html
example could be tweaked to use "pools" that define exactly where to put
each buffer, but then the final `arrow::Table::Make` (or the equivalent for
batches/IPC) would also need instructions about where exactly to write the
user metadata, schema, footer, etc.
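
For the "pools" part, one possibility (a sketch only, assuming the
MemoryPool virtual interface in the current C++ releases, i.e.
Allocate/Reallocate/Free without an explicit alignment parameter) would be
a bump allocator over the mapped region, so that builder-owned buffers land
inside the eventual IPC body:

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>

#include <arrow/memory_pool.h>
#include <arrow/status.h>

// Sketch: hand out memory sequentially from a pre-mapped region.
class RegionPool : public arrow::MemoryPool {
 public:
  RegionPool(uint8_t* base, int64_t capacity) : base_(base), capacity_(capacity) {}

  arrow::Status Allocate(int64_t size, uint8_t** out) override {
    int64_t aligned = (size + 63) & ~int64_t{63};
    if (used_ + aligned > capacity_) return arrow::Status::OutOfMemory("region full");
    *out = base_ + used_;
    used_ += aligned;
    return arrow::Status::OK();
  }

  arrow::Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) override {
    // Naive: allocate a fresh block and copy.  To keep buffers at their final
    // IPC offsets, one would reserve builders up front so this never runs.
    uint8_t* fresh = nullptr;
    ARROW_RETURN_NOT_OK(Allocate(new_size, &fresh));
    std::memcpy(fresh, *ptr, static_cast<size_t>(std::min(old_size, new_size)));
    *ptr = fresh;
    return arrow::Status::OK();
  }

  void Free(uint8_t*, int64_t) override {}  // the region is reclaimed wholesale

  int64_t bytes_allocated() const override { return used_; }
  std::string backend_name() const override { return "mmap-region"; }

 private:
  uint8_t* base_;
  int64_t capacity_;
  int64_t used_ = 0;
};

The obvious caveat is that builders grow by reallocating, so buffers only
land at predictable offsets if each builder is Reserve()'d up front to its
guessed capacity, and the pool still doesn't know which buffer belongs to
which field; placing buffers by hand may end up simpler.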

Thanks for any ideas,
John

Re: Create large IPC format record batch(es) in-place without copy or prior data analysis

Posted by Micah Kornfield <em...@gmail.com>.
Hi John,
>
> Any thoughts on creating large IPC format record batch(es) in-place in a
> single pre-allocated buffer, that could be used with mmap?


This seems doable "by hand" today; it seems like it would be valuable to
potentially contribute.

The idea was to allow
> record batch lengths to be smaller than the associated buffer lengths,
> which seemed like an easy change at the time... although I'll grant that we
> only use trivial arrow types and in more complex cases there may be
> side-effects I can't envision.


To my recollection, this was about "Array Lengths", not "Buffer Lengths".
I still think having Array Lengths disagree with the RecordBatch Length is
not a good idea.  I think "Buffer Length" and "Buffer Offset" can be
arbitrary as long as they fall within the Body Length on the Message.  It
really depends on how one reads the specification's description of the
body:

   - The body, a flat sequence of memory buffers written end-to-end with
     appropriate padding to ensure a minimum of 8-byte alignment
So one could imagine the following algorithm:
1. pointer "a" = reserve space for the RecordBatch message metadata.
2. pointer "b" = reserve space for data buffers directly after the reserved
space of "a".
3. Populate data (allocate buffers starting at "b" into standard Arrow data
structures, honoring the 8-byte alignment requirement).
4. Write out metadata to "a" (body length = "maximum end address of data
buffer" - "b"; buffer offsets are "start of buffer memory address" - "b").
5. Add an entry to the file index.
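
A minimal sketch of steps 1-4 for a single nullable int64 column; the
reserved sizes are placeholder guesses and the flatbuffer encoding itself
is elided:

#include <cstdint>
#include <cstdio>
#include <vector>

int64_t Align8(int64_t v) { return (v + 7) & ~int64_t{7}; }

struct BufferSlot { int64_t offset; int64_t length; };  // relative to pointer "b"

int main() {
  const int64_t metadata_reserved = 4096;  // step 1: pointer "a" spans this much
  const int64_t row_capacity = 1'000'000;  // capacity guess from statistics

  // Step 2: pointer "b" begins immediately after the reserved metadata space.
  int64_t cursor = 0;
  std::vector<BufferSlot> buffers;

  // Step 3: lay out validity bitmap and values buffer end-to-end, 8-byte aligned.
  int64_t validity_len = Align8((row_capacity + 7) / 8);
  buffers.push_back({cursor, validity_len});
  cursor += validity_len;

  int64_t values_len = Align8(row_capacity * 8);  // int64 values
  buffers.push_back({cursor, values_len});
  cursor += values_len;

  // Step 4: body length is the end of the last buffer relative to "b"; each
  // Buffer entry in the flatbuffer metadata records {offset, length}.
  int64_t body_length = cursor;
  std::printf("metadata at \"a\" spans %lld bytes, body length %lld\n",
              static_cast<long long>(metadata_reserved),
              static_cast<long long>(body_length));
  for (const auto& buf : buffers) {
    std::printf("buffer at offset %lld, length %lld\n",
                static_cast<long long>(buf.offset),
                static_cast<long long>(buf.length));
  }
  return 0;
}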

This might not play nicely with other aspects of arrow IO (e.g.
prefetching) but I think it still should be a valid file.  I'd guess others
would have opinions on this as well.

Thanks,
Micah
