Posted to dev@arrow.apache.org by Michael Knopf <mk...@rapidminer.com> on 2018/09/21 09:33:38 UTC

[JAVA] Total row count of an Arrow file

Hi all,

I am looking for a quick way to look up the total row count of a data set stored in Arrow’s random access file format using the Java API. Basically, a quicker way to do this:

// The reader is an instance of ArrowFileReader
VectorSchemaRoot root = reader.getVectorSchemaRoot();
List<ArrowBlock> blocks = reader.getRecordBlocks();
long nRows = 0;
for (ArrowBlock block : blocks) {
    reader.loadRecordBatch(block);  // loads the batch data into root
    nRows += root.getRowCount();
}

My understanding is that the above snippet loads the entire data set instead of just the block headers.
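
For reference, the block entries themselves only seem to describe byte ranges, not row counts, which is why the loop above has to load each batch. A quick illustration, assuming the ArrowBlock getters I see in the API:

// Each ArrowBlock appears to carry only the byte layout of its record batch;
// the row count lives in the batch metadata itself, not in the block entry.
for (ArrowBlock block : reader.getRecordBlocks()) {
    System.out.printf("offset=%d metadataLength=%d bodyLength=%d%n",
            block.getOffset(), block.getMetadataLength(), block.getBodyLength());
}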

To give you some context, I am looking into using Arrow for IPC between a JVM (which uses a custom data format) and a Python interpreter (which uses PyArrow/Pandas). While the streaming API might be a better tool for this job, I started out with files to keep things simple.
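
For the record, the JVM-side handoff I have in mind is roughly the following sketch (the file name is made up, and root is assumed to be an already populated VectorSchemaRoot):

// Sketch only: write the populated root to an Arrow file that the Python
// side can then open with pyarrow's record batch file reader.
try (FileOutputStream out = new FileOutputStream("/tmp/handoff.arrow");
     ArrowFileWriter writer = new ArrowFileWriter(root, null, out.getChannel())) {
    writer.start();
    writer.writeBatch();  // in practice, repeated once per filled batch
    writer.end();
}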

Any help would be greatly appreciated – maybe I just missed the right bit of documentation.

Thanks,
Michael

Re: [JAVA] Total row count of an Arrow file

Posted by Michael Knopf <mk...@rapidminer.com>.
Hi Li,

Thanks for the explanation! I’ll keep the code as is for now (and an eye on ARROW-3283). 

As you pointed out, I’ll need another solution for streaming the table over a socket anyway.
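
For the socket case I would expect something along these lines with the stream format (host and port are made up, and root is again a populated VectorSchemaRoot):

// Rough sketch: the streaming format needs no seekable channel, so writing
// straight to a socket should work with ArrowStreamWriter.
try (SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000));
     ArrowStreamWriter writer = new ArrowStreamWriter(root, null, socket)) {
    writer.start();
    writer.writeBatch();
    writer.end();
}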

To clarify, my code does read the actual data in a second pass. However, doing so without knowing how many rows to expect is very expensive.
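
Concretely, knowing the count up front lets the second pass fill pre-sized structures instead of growing them. Something like this, reusing reader, root and nRows from the snippet in my first mail (the column name and vector type are just placeholders, and nulls are assumed absent):

// Hypothetical second pass: copy one double column into a pre-allocated array.
double[] values = new double[(int) nRows];
int offset = 0;
for (ArrowBlock block : reader.getRecordBlocks()) {
    reader.loadRecordBatch(block);
    Float8Vector vector = (Float8Vector) root.getVector("value");
    for (int i = 0; i < root.getRowCount(); i++) {
        values[offset + i] = vector.get(i);
    }
    offset += root.getRowCount();
}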

Thanks again,
Michael

Re: [JAVA] Total row count of an Arrow file

Posted by Wes McKinney <we...@gmail.com>.
It would be nice to have an API to look at the file footer (we don't have
one in C++ either). I opened:

https://issues.apache.org/jira/browse/ARROW-3283

Re: [JAVA] Total row count of an Arrow file

Posted by Li Jin <ic...@gmail.com>.
Hi Michael,

I think ArrowFileReader takes a SeekableByteChannel, so it would be possible
to only read the metadata for each record batch and skip the data. However,
this is not implemented.

If the input channel is not seekable (for example, a socket channel), then
you would need to read the body of each record batch to get to the next
batch, so my hunch is that the performance will be similar whether you read
the record batch body into a VectorSchemaRoot or just read the bytes.

If you don't assume your input data is always going to be seekable, I am
not sure there is a quicker way to do this.
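
To sketch what such a metadata-only pass could look like (untested; it assumes each block starts with the 4-byte metadata size prefix of the encapsulated message format, that ArrowBlock.getMetadataLength() includes that prefix, and that the generated org.apache.arrow.flatbuf classes Message, MessageHeader and RecordBatch are used directly; channel is the SeekableByteChannel the reader was built from):

// Read only the metadata flatbuffer of each block and pull the row count
// from it; the block bodies (the actual data) are never touched.
long totalRows = 0;
for (ArrowBlock block : reader.getRecordBlocks()) {
    ByteBuffer metadata = ByteBuffer.allocate(block.getMetadataLength());
    channel.position(block.getOffset());
    while (metadata.hasRemaining()) {
        channel.read(metadata);
    }
    metadata.flip();
    metadata.position(4);  // skip the size prefix (see assumptions above)
    Message message = Message.getRootAsMessage(metadata);
    if (message.headerType() == MessageHeader.RecordBatch) {
        RecordBatch batchMeta = (RecordBatch) message.header(new RecordBatch());
        totalRows += batchMeta.length();  // number of rows in this batch
    }
}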


