Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2021/11/06 00:13:10 UTC

Re: Create large IPC format record batch(es) in-place without copy or prior data analysis

Hi John,
>
> Any thoughts on creating large IPC format record batch(es) in-place in a
> single pre-allocated buffer, that could be used with mmap?


This seems doable "by hand" today; it seems like this would be
valuable to potentially contribute.

The idea was to allow
> record batch lengths to be smaller than the associated buffer lengths,
> which seemed like an easy change at the time... although I'll grant that we
> only use trivial arrow types and in more complex cases there may be
> side-effects I can't envision.


To my recollection, this was about "Array Lengths", not "Buffer Lengths".  I
still think having Array Lengths disagree with RecordBatch Lengths is not a
good idea.  I think "Buffer Length" and "Buffer Offset" can be arbitrary
as long as they fall within the Body Length on the Message.  It really
depends on how one reads this part of the specification:

   - The body, a flat sequence of memory buffers written end-to-end with
     appropriate padding to ensure a minimum of 8-byte alignment
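
For what it's worth, that padding rule just means each buffer starts at the
previous buffer's end rounded up to the next multiple of 8 bytes, relative to
the start of the body.  A tiny C++ illustration of the arithmetic (the helper
name is mine, not an Arrow API):

#include <cstdint>

// Round a length or offset up to the next multiple of 8, as the body
// layout requires (illustrative helper, not part of the Arrow library).
inline int64_t PaddedTo8(int64_t n) { return (n + 7) & ~static_cast<int64_t>(7); }

// e.g. two buffers of 13 and 20 bytes laid out end-to-end:
//   offset0 = 0
//   offset1 = PaddedTo8(0 + 13)   = 16
//   body    = PaddedTo8(16 + 20)  = 40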

So one could imagine the following algorithm (a rough sketch follows the
list):
1.  pointer a = reserve space for the RecordBatch message metadata.
2.  pointer b = reserve space for the data buffers directly after the
reserved space of "a".
3.  Populate the data (allocate buffers from "b" into standard Arrow data
structures, ensuring the 8-byte alignment requirement).
4.  Write out the metadata to "a" (body length = "maximum end address of a
data buffer" - "b"; buffer offsets are "start of buffer memory address" -
"b").
5.  Add an entry to the file index.
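
To make that concrete, here is a rough C++ sketch of steps 1-4 for a single
sorted int64 column.  The region sizes, the names (PaddedTo8,
BuildBatchInPlace), and the use of a raw POSIX mmap are purely my own
illustration, not an existing Arrow recipe; encoding the flatbuffer metadata
into the reserved header area and writing the file-footer entry (step 5) are
left as comments, since those are exactly the pieces not exposed for in-place
use today.

#include <arrow/api.h>

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#include <cstdint>
#include <memory>

namespace {
int64_t PaddedTo8(int64_t n) { return (n + 7) & ~static_cast<int64_t>(7); }
}  // namespace

arrow::Status BuildBatchInPlace(const char* path, int64_t num_rows) {
  const int64_t kMetaReservation = 64 * 1024;                // "a": a guess
  const int64_t kBodyReservation = PaddedTo8(num_rows * 8);  // "b": one int64 buffer
  const int64_t total = kMetaReservation + kBodyReservation;

  int fd = ::open(path, O_RDWR | O_CREAT, 0644);
  if (fd < 0) return arrow::Status::IOError("open failed");
  if (::ftruncate(fd, total) != 0) {
    ::close(fd);
    return arrow::Status::IOError("ftruncate failed");
  }
  auto* base = static_cast<uint8_t*>(
      ::mmap(nullptr, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
  if (base == MAP_FAILED) {
    ::close(fd);
    return arrow::Status::IOError("mmap failed");
  }

  uint8_t* a = base;                     // 1. reserved metadata region
  uint8_t* b = base + kMetaReservation;  // 2. data buffers start here

  // 3. Wrap the in-place memory as an Arrow buffer and populate it through
  //    normal Arrow structures (no validity buffer, so null_count = 0).
  auto values = std::make_shared<arrow::MutableBuffer>(b, num_rows * 8);
  auto* out = reinterpret_cast<int64_t*>(values->mutable_data());
  for (int64_t i = 0; i < num_rows; ++i) out[i] = i;  // e.g. a sorted key column

  auto data = arrow::ArrayData::Make(arrow::int64(), num_rows,
                                     {nullptr, values}, /*null_count=*/0);
  auto batch = arrow::RecordBatch::Make(
      arrow::schema({arrow::field("key", arrow::int64())}), num_rows,
      {arrow::MakeArray(data)});
  (void)batch;  // would be handed to search / validation code

  // 4. The numbers the RecordBatch message needs, computed relative to "b":
  const int64_t buffer_offset = values->data() - b;  // 0 for this lone buffer
  const int64_t body_length = PaddedTo8(buffer_offset + values->size());
  // Encoding the flatbuffer Message into [a, a + kMetaReservation) using
  // buffer_offset/body_length, and step 5 (the file footer entry), are the
  // parts that would have to be written by hand or added to the library.
  (void)a; (void)body_length;

  ::munmap(base, total);
  ::close(fd);
  return arrow::Status::OK();
}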

This might not play nicely with other aspects of Arrow I/O (e.g.
prefetching), but I think it should still be a valid file.  I'd guess others
would have opinions on this as well.

Thanks,
Micah

On Wed, Oct 20, 2021 at 5:26 PM John Muehlhausen <jg...@jgm.org> wrote:

> Motivation:
>
> We have memory-mappable Arrow IPC files with N batches where column(s) are
> sorted to support binary search.  Because log2(n) < log2(n/2) + log2(n/2)
> and binary search is required on each batch, we prefer the batches to be
> as large as possible to reduce total search time... perhaps larger than
> available RAM.  On the read side, only the pages needed for the search
> bisections and subsequent slice traversal are mapped in, of course.
>
> The question then becomes one of creating large IPC-format files where
> individual batches do not exist first in RAM because of their size.
>
> Conceptually, this would seem to entail:
> * allocating a fixed mmap'd area for writing to
> * using builders to create buffers at the locations they would end up at
> for an IPC format, and freezing these as arrays (if I understand the
> terminology correctly)
> * plopping in various other things such as metadata, schema, etc
>
> One difficulty is that we want to size this area without having first
> analyzed the data to be written to it, since such an analysis consumes
> compute resources.  Therefore the area set aside for (e.g.) a variable
> length string column would be a guess based on statistics and we would want
> to just write the column buffers until the first one is full, which may
> leave others (or itself) partially unpopulated.
>
> This could result in some "wasted space" in the file which is a tradeoff we
> can live with for the above reasons, which brings me back to
> https://issues.apache.org/jira/browse/ARROW-5916 where this was discussed
> before (and another discussion is linked there).  The idea was to allow
> record batch lengths to be smaller than the associated buffer lengths,
> which seemed like an easy change at the time... although I'll grant that we
> only use trivial arrow types and in more complex cases there may be
> side-effects I can't envision.
>
> One of the ideas was to go ahead and fill in the buffers to create a valid
> recordbatch but then store the sliced-down size in (e.g.) the user-defined
> metadata, but this forces anyone using the IPC file to use a non-standard
> mechanism to reject the "data" that fills the unpopulated buffer sections.
>
> Even with the ability for a batch to be smaller than its buffers (to help
> readers reject the residual of the buffers without referring to custom
> metadata), I think I'm left with needing to create low-level code outside
> of the Arrow library to create such a file since I cannot first create the
> batch in RAM and then copy it out, due to the size and also due to wanting
> to avoid the copy operation.
>
> Any thoughts on creating large IPC format record batch(es) in-place in a
> single pre-allocated buffer, that could be used with mmap?
>
> Here is someone with a similar concern:
> https://www.mail-archive.com/user@arrow.apache.org/msg01187.html
>
> It seems like the
> https://arrow.apache.org/docs/cpp/examples/row_columnar_conversion.html
> example could be tweaked to use "pools" that define exactly where to put
> each buffer, but then the final `arrow::Table::Make` (or equivalent for
> batches/IPC) must also receive instruction about where exactly to write
> user metadata, schema, footer, etc.
>
> Thanks for any ideas,
> John
>