Posted to dev@arrow.apache.org by Yue Ni <ni...@gmail.com> on 2022/04/05 08:54:56 UTC

storing per record batch metadata in arrow IPC file

Hi there,

I am investigating using Apache Arrow to analyze time series data. I would
like to store some record-batch-specific metadata, for example some
statistics/tags about the data in a particular record batch. More
specifically, I may use a single record batch to store metric samples for a
certain time range, and would like to store the min/max time and some
dimensional data like `host` and `aws_region` as metadata for that
particular record batch. When loading multiple record batches from an IPC
file, the metadata may then vary from batch to batch, and I can filter
these batches quickly using the metadata alone, without looking at the data
in the arrays. I would like to know whether it is possible to store such
per-record-batch metadata in an Arrow IPC file.

There is a similar effort I found on the web [1], but it stores the
metadata for all record batches in the schema in the IPC file footer. I
believe the footer is fully loaded on every access, which introduces
unnecessary IO if only a few of the record batches are read each time.

I read some docs/source code [2] [3], and if my understanding is correct,
it is technically possible to store different metadata in different record
batches, since in the streaming format each message has a `custom_metadata`
field associated with it. But I cannot find any API (at least in pyarrow)
that lets me do this. `pyarrow.record_batch` does allow users to specify
metadata when constructing a record batch, but that metadata does not seem
to be used once `RecordBatchFileWriter` has a schema provided (which of
course doesn't contain such record-batch-specific metadata).

I haven't looked into the lower-level C++ API yet. It seems the assumption
is that all the batches in an IPC file share the same schema, but do we
allow them to have different metadata as long as the schema (field names
and types) is the same? If we don't currently allow such usage, do you
think it is a valid use case worth supporting? Thanks.

[1]
https://github.com/heterodb/pg-strom/wiki/806%3A-Apache-Arrow-Min-Max-Statistics-Hint
[2]
https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc
[3]
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_table.py

Re: storing per record batch metadata in arrow IPC file

Posted by Yue Ni <ni...@gmail.com>.
Hi Weston,

> The C++ implementation does not expose this today that I can tell. So if
> you want to use this then some C++ changes will be needed. There is
> already a JIRA ticket for this at [2].
Thanks for pointing this out. It seems the ticket ARROW-16131 I logged
above duplicates ARROW-6940, which you mentioned. After checking the source
code for this part, I gave it a try and submitted PR
https://github.com/apache/arrow/pull/12812

> On the other hand, if you already know what subset of batches you
> are interested in, then I could maybe see some advantage in storing
> the metadatas separately but only if the metadata is quite large.
This is indeed the case I am investigating. I plan to use an external
index to figure out a subset of batches to query against. By memory-mapping
the IPC file, I can randomly access those selected record batches and then
use the metadata in each batch for further filtering/extra info.

> If the metadata is relatively small (KBs) then I still think you'd be
> better off storing it all in the footer in most cases (or there wouldn't
> be much difference)
I am still not sure whether it is worth putting the info in each batch's
metadata; I am experimenting to see if it helps and ran into this issue. In
my case, I might create an IPC file with several thousand record batches,
each with up to ~100 bytes of metadata. If I put the info for all batches
into the footer's metadata, then depending on the shape of the data it
could be 100KB or more of metadata in the footer in an unhappy path, which
I think could be wasteful. But you are correct, this may not matter much in
many cases. I am still experimenting, and thanks so much for the detailed
guidance.



Re: storing per record batch metadata in arrow IPC file

Posted by Weston Pace <we...@gmail.com>.
Actually, if you are doing streaming processing, you would have to
store it with the record batch since there is no footer :)


Re: storing per record batch metadata in arrow IPC file

Posted by Weston Pace <we...@gmail.com>.
Correct, the "ground truth" so to speak for these things is probably
the flatbuffers files [1] (Message.fbs, Schema.fbs, and File.fbs in
this case). There is a per-message custom metadata field that could be
used as you describe.  The C++ implementation does not expose this
today, as far as I can tell.  So if you want to use this then some C++
changes will be needed.  There is already a JIRA ticket for this at
[2].

> the metadata may
> vary from batch to batch in an IPC file, and I can filter these batches
> quickly simply using metadata without looking into data in the arrays.

> There is a similar effort I can find on the web [1], but it stores all the
> record batches metadata in the IPC file footer's schema. I think the footer
> will be fully loaded for every access, which will introduce some
> unnecessary IO if only a few of the record batches are read each time.

I'm not sure the two above statements work together well.  If you want
to use the metadata to determine which batches to read then you will
need to read the metadata for every single batch.  So it doesn't make
sense to spread this information throughout the file.

On the other hand, if you already know what subset of batches you are
interested in, then I could maybe see some advantage in storing the
metadatas separately but only if the metadata is quite large.  If the
metadata is relatively small (KBs) then I still think you'd be better
off storing it all in the footer in most cases (or there wouldn't be
much difference).

If you're doing streaming processing of the entire file then it
probably doesn't matter much either way.

So there might be some potential here but I wouldn't say it is a sure thing.

[1] https://github.com/apache/arrow/tree/master/format
[2] https://issues.apache.org/jira/browse/ARROW-6940


Re: storing per record batch metadata in arrow IPC file

Posted by Yue Ni <ni...@gmail.com>.
Hi Aldrin,

Thanks for the pointers. I checked out the C++ source code for this part,
and I think record batch specific metadata is currently not written into
the IPC file, probably due to a bug in the code. I logged a bug to track
this issue (https://issues.apache.org/jira/browse/ARROW-16131); thanks so
much for the help.


Re: storing per record batch metadata in arrow IPC file

Posted by Aldrin <ak...@ucsc.edu.INVALID>.
Hm, I didn't think it was possible, but it looks like there may be some
things you can try?

My understanding was that you create a writer for an IPC stream or file and
pass a schema on construction, which is used as "the schema" for the IPC
stream/file. So, RecordBatches written using that writer need to match the
given schema. I don't think this check covers the metadata, but the writer
only writes an "IPC payload" if the equality check passes.

That being said, I did some checking, and it seems things are more flexible
now (though I could be wrong). I'm not sure what dictionary deltas are
(maybe they are for dictionary arrays rather than metadata), but the
"emit_dictionary_deltas" IpcWriteOption may be relevant [1]. Otherwise, the
`WriteRecordBatch` function appears to take a metadata length [2], and the
`WriteRecordBatchStream` function [3] seems to only check that a vector of
RecordBatches have matching schemas. Also, the `WritePayload` function
(on a RecordBatchWriter created via MakeFileWriter) seems relevant to how
metadata is written in a way that can be leveraged by a seek-based
interface [4].

But ultimately, I am not sure these things are exposed at a higher level
(e.g. pyarrow), even though they're available for use. They're also not
exposed via the Feather interface, as far as I know.

[1]:
https://arrow.apache.org/docs/cpp/api/ipc.html#_CPPv4N5arrow3ipc15IpcWriteOptions22emit_dictionary_deltasE
[2]:
https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/arrow/ipc/writer.cc#L644
[3]:
https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/arrow/ipc/writer.cc#L665
[4]:
https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/arrow/ipc/writer.cc#L1253

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz

