Posted to dev@parquet.apache.org by Keith Chapman <ke...@gmail.com> on 2017/12/14 19:49:28 UTC

[PARQUET-CPP] Performance of ReadBatch vs ReadBatchSpaced API's

I've looked at the Parquet read API and I see that it has two general APIs
that could be used:

1. ReadBatch
2. ReadBatchSpaced

I understand that ReadBatchSpaced does an extra copy while ReadBatch does
not. In terms of performance, which API would you recommend using?

Regards,
Keith.

http://keith-chapman.com

Re: [PARQUET-CPP] Performance of ReadBatch vs ReadBatchSpaced API's

Posted by Wes McKinney <we...@gmail.com>.
hi Keith,

+1 to what Uwe said. Also, you said "I understand that ReadBatchSpaced
does an extra copy while ReadBatch does not." That's not right -- both
APIs read into pre-allocated memory and do not do any further
allocations. ReadBatchSpaced has to "move" the values based on the
nulls in the definition levels, see

https://github.com/apache/parquet-cpp/blob/master/src/parquet/encoding.h#L112
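
To illustrate the "move" step, here is a minimal standalone sketch. It
assumes the non-null values have already been decoded into the front of a
buffer that has one slot per row. SpaceOutInPlace and the std::vector<bool>
validity mask are hypothetical names for illustration, not the parquet-cpp
implementation; the linked header is the authoritative version.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of parquet-cpp.
// On entry, buffer[0 .. num_non_null) holds the non-null values packed
// together, and buffer has at least valid.size() slots in total.
// valid[i] is true when row i is non-null.
// On exit, every non-null row i has its value at buffer[i].
template <typename T>
void SpaceOutInPlace(std::vector<T>& buffer, const std::vector<bool>& valid,
                     std::size_t num_non_null) {
  assert(buffer.size() >= valid.size());
  std::size_t next_dense = num_non_null;  // one past the last packed value
  // Walk from the back so that no packed value is overwritten before it
  // has been moved to its final slot.
  for (std::size_t i = valid.size(); i-- > 0;) {
    if (valid[i]) {
      buffer[i] = buffer[--next_dense];
    }
    // Slots for null rows are left as they are; their contents are
    // unspecified and must not be read as data.
  }
}
```

No memory is allocated here: values are only shifted within the buffer the
caller already provided, which is the extra work compared with the dense
layout.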

- Wes

On Fri, Dec 15, 2017 at 9:10 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
> Hello Keith,
>
> Which API to use mainly depends on how you want to have the data later on.
> With ReadBatch you get an array of values whose size is the number of
> non-null values. With ReadBatchSpaced, the array has one slot for every
> value, including unused memory for null entries. For random access to
> your data, you normally want to use ReadBatchSpaced. In contrast,
> ReadBatch should be faster, as it returns the data as it is stored in
> the Parquet format.
>
> Does this help you? Otherwise I can try to explain the difference in
> more detail.
>
> Uwe

Re: [PARQUET-CPP] Performance of ReadBatch vs ReadBatchSpaced API's

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Keith,

Which API to use mainly depends on how you want to have the data later on.
With ReadBatch you get an array of values whose size is the number of
non-null values. With ReadBatchSpaced, the array has one slot for every
value, including unused memory for null entries. For random access to
your data, you normally want to use ReadBatchSpaced. In contrast,
ReadBatch should be faster, as it returns the data as it is stored in
the Parquet format.
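
As a rough sketch of why the spaced layout suits random access, the
following standalone C++ compares looking up one row in each layout. It
assumes a flat optional column where a row is non-null exactly when its
definition level equals the maximum definition level. CountNonNull,
DenseLookup and SpacedLookup are hypothetical helpers written for this
example and are not parquet-cpp functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Size of the dense (ReadBatch-style) value array: one entry per non-null row.
std::size_t CountNonNull(const std::vector<std::int16_t>& def_levels,
                         std::int16_t max_def_level) {
  std::size_t count = 0;
  for (std::int16_t level : def_levels) {
    if (level == max_def_level) ++count;
  }
  return count;
}

// Dense layout: the value for `row` sits at the index equal to the number
// of non-null rows before it, so a single lookup scans the levels.
template <typename T>
bool DenseLookup(const std::vector<T>& dense,
                 const std::vector<std::int16_t>& def_levels,
                 std::int16_t max_def_level, std::size_t row, T* out) {
  assert(row < def_levels.size());
  if (def_levels[row] != max_def_level) return false;  // row is null
  std::size_t index = 0;
  for (std::size_t i = 0; i < row; ++i) {
    if (def_levels[i] == max_def_level) ++index;
  }
  *out = dense[index];
  return true;
}

// Spaced layout: one slot per row, so the row number is the index.
template <typename T>
bool SpacedLookup(const std::vector<T>& spaced,
                  const std::vector<std::int16_t>& def_levels,
                  std::int16_t max_def_level, std::size_t row, T* out) {
  assert(row < def_levels.size());
  if (def_levels[row] != max_def_level) return false;  // row is null
  *out = spaced[row];
  return true;
}
```

For definition levels {1, 0, 1, 1, 0} with a maximum level of 1, the dense
array holds 3 values and the spaced array holds 5 slots; the spaced lookup
is a direct index, while the dense lookup has to count the non-null rows
that come first.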

Does this help you? Otherwise I can try to explain the difference in
more detail.

Uwe
