Posted to issues@arrow.apache.org by "Lawrence Chan (JIRA)" <ji...@apache.org> on 2018/03/12 17:38:00 UTC

[jira] [Comment Edited] (ARROW-2296) [C++] Add num_rows to file footer

    [ https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395564#comment-16395564 ] 

Lawrence Chan edited comment on ARROW-2296 at 3/12/18 5:37 PM:
---------------------------------------------------------------

Yeah, I was thinking somewhere in the Footer struct, so we don't need to walk all the batches to sum them up.

Also, the row counts are indeed in the existing RecordBatch metadata, but the code that reads them currently lives inside a .cc file, so I'd have to either copy+paste it or modify my build to expose more of the existing code. Maybe we could expose something like this on the RecordBatchFileReader?
{code:cpp}
Status ReadRecordBatchMessage(int i, const flatbuf::RecordBatch** metadata) const;
{code}
Then it'd be possible to read the length fields without copying a bunch of code. Not sure if this is a good idea though, since it seems that we don't usually expose the flatbuffers through the public API. Maybe just a
{code:cpp}
int64_t num_rows() const;
{code}
is all I really want, and that can read the new Footer field once it's in there, and walk the batches in the current format?
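
To make the fallback concrete, here is a rough sketch of the logic such a num_rows() could follow. The types below (BatchMetadata, FileFooter) and the function NumRows are hypothetical stand-ins invented for illustration, not the real Arrow classes or flatbuffer structs, and the optional num_rows member models the proposed footer field, which does not exist in the format today:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT Arrow types.
struct BatchMetadata {
  int64_t length;  // row count recorded in one record batch's metadata
};

struct FileFooter {
  // Proposed field (an assumption): unset for files written in the current format.
  std::optional<int64_t> num_rows;
  // Per-batch metadata, assumed readable without touching the batch bodies.
  std::vector<BatchMetadata> batches;
};

// Use the footer field when it is present; otherwise walk the batch
// metadata and sum the lengths.
int64_t NumRows(const FileFooter& footer) {
  if (footer.num_rows.has_value()) {
    return *footer.num_rows;
  }
  int64_t total = 0;
  for (const BatchMetadata& batch : footer.batches) {
    total += batch.length;
  }
  return total;
}
```

Either way the caller gets one number from the file metadata alone, so it can size a contiguous buffer before reading any batch body.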


was (Author: llchan):
Yeah, I was thinking somewhere in the Footer struct, so we don't need to walk all the batches to sum them up.

Also they are indeed in the existing RecordBatch metadata, but the current implementation is inside a .cc file and I'd have to either copy+paste or modify my build to expose more of the existing code. Maybe we could expose something like this on the RecordBatchFileReader?
{code:cpp}
Status ReadRecordBatchMessage(int i, const flatbuf::RecordBatch** metadata) const;
{code}
Then it'd be possible to read the length fields without copying some of the other stuff. Not sure if this is a good idea though, since it seems that we don't usually expose the flatbuffers through the public API. Maybe just a
{code:cpp}
int64_t num_rows() const;
{code}
is all I really want, and that can read the new Footer field once it's in there, and walk the batches in the current format?

> [C++] Add num_rows to file footer
> ---------------------------------
>
>                 Key: ARROW-2296
>                 URL: https://issues.apache.org/jira/browse/ARROW-2296
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Format
>            Reporter: Lawrence Chan
>            Priority: Minor
>
> Maybe I'm overlooking something, but I don't see anything on the API surface to get the number of rows in an Arrow file without reading all the record batches. This is useful when we want to read into contiguous buffers, because it allows us to allocate the right sizes up front.
> I'd like to propose that we add `num_rows` as a field in the file footer so it's easy to query without reading the whole file.
> Meanwhile, before we get that added to the official format fbs, it would be nice to have a method that iterates over the record batch headers and sums up the lengths without reading the actual record batch body.


