Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2023/04/02 00:49:38 UTC
Re: row counts in footer of IPC file format

IIRC, FlatBuffers structs are immutable once defined; if you want to be able
to evolve the schema later, then tables are necessary.
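
For illustration, a rough and untested sketch of what a table-based version of
Weston's RecordBatchStatistics could look like. The names are only the
placeholders from this thread, not an agreed design, and the existing Footer
fields are elided. Note that num_rows is a plain long here: a FlatBuffers
struct can only hold scalars and other structs, so a [long] vector would not
be allowed inside a struct in any case.

```
// Hypothetical sketch, not an agreed design. Names follow Weston's example.
table RecordBatchStatistics {
  // Number of rows in the batch
  num_rows: long;
  // Further statistics (min/max, cardinality, ...) could be appended here
  // later, since new fields can be added at the end of a table.
}

table Footer {
  // ... existing fields, ending with custom_metadata, stay unchanged ...

  // Assumed new field: one entry per record batch, in the same order as
  // recordBatches. Absent (null) in files written before the change.
  recordBatchStatistics: [RecordBatchStatistics];
}
```

The trade-off is that a vector of tables costs more bytes per batch than a
bare [long], in exchange for being extensible.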

On Mon, Mar 20, 2023 at 8:22 AM Weston Pace <we...@gmail.com> wrote:

> +1, I'm generally in favor of the idea.  I would prefer
> `recordBatchNumRows` (or, less favorably, `recordBatchSize`).  I don't
> think `recordBatchLengths` works because there are already places in the
> footer where "length" is interpreted as "number of bytes".
>
> I'm not an expert on flatbuffers evolution but I wonder if we want to
> create a new struct (table?) so that new statistics can be added in the
> future if we desire.
>
> ```
> struct RecordBatchStatistics {
>   // Number of rows in the batch
>   num_rows: [long];
>   // Open spot for extending with new statistics such as min/max,
>   // cardinality, bloom filter, etc.
> }
>
> ...
>
> recordBatchStatistics: [RecordBatchStatistics];
> ```
>
> On Sat, Mar 18, 2023 at 9:39 PM Steve Kim <ch...@gmail.com> wrote:
>
> > Hello everyone,
> >
> > I would like to be able to quickly seek to an arbitrary row in an Arrow
> > file.
> >
> > With the current file format, reading the file footer alone is not enough
> > to determine the record batch that contains a given row index. The row
> > counts of the record batches are only found in the metadata for each
> > record batch, which are scattered at different offsets in the file.
> > Multiple non-contiguous small reads can be costly (e.g., HTTP GET requests
> > to read byte ranges from an S3 object).
> >
> > This problem has been discussed in GitHub issues:
> >
> > https://github.com/apache/arrow/issues/18250
> >
> > https://github.com/apache/arrow/issues/24575
> >
> > To solve this problem, I propose a small backwards-compatible change to
> > the file format. We can add a
> >
> >     recordBatchLengths: [long];
> >
> > field to the Footer table (
> > https://github.com/apache/arrow/blob/main/format/File.fbs). The name and
> > type of this new field match the length field in the RecordBatch table.
> > This new field must be after the custom_metadata field in the Footer
> > table to satisfy constraints of FlatBuffers schema evolution. An Arrow
> > file whose footer lacks the recordBatchLengths field would be read with a
> > default value of null, which indicates that the row counts are not present.
> >
> > What do people think?
> >
> > Thanks,
> > Steve
> >
>