Posted to dev@arrow.apache.org by Ying Zhou <yz...@gmail.com> on 2020/10/18 17:24:03 UTC

[C++] AppendValues for numeric types with invalid slots omitted from source

Hi,

Unlike Arrow, ORC records a null entry only in the PRESENT stream (equivalent to the validity bitmap in Arrow) and not in any DATA stream. This holds for every type, including numeric types. Hence the notNull (aka PRESENT) and data buffers from ORC generally don’t have the same size.

However, according to cpp/src/arrow/adapters/orc/adapter_util.cc line 126, it is possible to call builder->AppendValues(source, length, valid_bytes) directly, with builder being an Int64Builder and with source and valid_bytes having different sizes, which doesn’t seem reasonable. May I ask whether this is actually valid usage of AppendValues? Thanks!


Best,
Ying Zhou

Re: [C++] AppendValues for numeric types with invalid slots omitted from source

Posted by Ying Zhou <yz...@gmail.com>.
Thanks a lot!

After more experimentation with liborc::ColumnVectorBatch this morning, I found that it is actually spaced, so there is no need to write another function to efficiently append “compressed” values. This also simplifies the Arrow2ORC adapter I’m working on.



Re: [C++] AppendValues for numeric types with invalid slots omitted from source

Posted by Micah Kornfield <em...@gmail.com>.
For reference, the code that Parquet uses to space out values is in
rle_encoding.h [1].  It uses both BitBlockCounter and BitRunReader.
BitBlockCounter is faster than BitRunReader, but in micro-benchmarks
BitRunReader still provides some benefit, assuming nulls are fairly infrequent.

It is worth noting that this code assumes preallocated arrays (i.e. it
doesn't use builders).

[1]
https://github.com/apache/arrow/blob/e0a9d0f28affdccb45bf76fde58d0eec1328cd40/cpp/src/arrow/util/rle_encoding.h
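
To illustrate the idea of spacing out values into a preallocated array, here is a minimal C++ sketch. It is not the Parquet code: it scans one validity byte per slot rather than using BitBlockCounter or BitRunReader on a packed bitmap, and SpaceOutByRuns is a hypothetical name chosen for this example. It copies each run of consecutive valid slots with a single memcpy, which is the case that pays off when nulls are infrequent.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper, not part of Arrow or Parquet.
// `dense` holds only the non-null values, in order.
// `valid_bytes` holds one byte per logical slot (non-zero means valid).
// `spaced` is a preallocated output with room for `length` slots.
// Returns the number of values consumed from `dense`.
std::size_t SpaceOutByRuns(const int64_t* dense, const uint8_t* valid_bytes,
                           std::size_t length, int64_t* spaced) {
  assert(length == 0 || (valid_bytes != nullptr && spaced != nullptr));
  std::size_t consumed = 0;
  std::size_t i = 0;
  while (i < length) {
    if (valid_bytes[i] == 0) {
      spaced[i] = 0;  // placeholder for a null slot
      ++i;
      continue;
    }
    // Find the end of the current run of valid slots.
    std::size_t run_end = i;
    while (run_end < length && valid_bytes[run_end] != 0) {
      ++run_end;
    }
    const std::size_t run_length = run_end - i;
    // Copy the whole run in one call.
    std::memcpy(spaced + i, dense + consumed, run_length * sizeof(int64_t));
    consumed += run_length;
    i = run_end;
  }
  return consumed;
}
```

The caller is responsible for making sure `dense` contains at least as many values as there are valid slots. Comparing the return value with the expected count of non-null values is a cheap sanity check.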


Re: [C++] AppendValues for numeric types with invalid slots omitted from source

Posted by Wes McKinney <we...@gmail.com>.
hi Ying, the code in adapter_util.cc doesn't look right to me unless
the data in liborc::ColumnVectorBatch is spaced (has placeholder bytes
where there is a null). We have quite a bit of code in Parquet that
deals specifically with this issue -- I'm not sure if we have a
ready-made function that will efficiently append the "compressed"
values to a builder, but we certainly have all the tools
you need to do so (e.g. the BitRunReader is helpful here).
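
As a rough illustration of the difference between "compressed" and spaced data, the C++ sketch below expands a buffer that stores only the non-null values into one with a placeholder for every null slot. ExpandToSpaced is a hypothetical helper written for this example, not an Arrow or ORC function. It assumes one validity byte per slot, matching the valid_bytes argument of AppendValues.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not an Arrow or ORC API.
// `dense` stores only the non-null entries, in order.
// `valid_bytes` has one entry per logical row (non-zero means valid).
// The result has one slot per row, with a zero placeholder where a row is null.
std::vector<int64_t> ExpandToSpaced(const std::vector<int64_t>& dense,
                                    const std::vector<uint8_t>& valid_bytes) {
  std::vector<int64_t> spaced(valid_bytes.size(), 0);
  std::size_t next = 0;  // index of the next unread dense value
  for (std::size_t i = 0; i < valid_bytes.size(); ++i) {
    if (valid_bytes[i] != 0) {
      assert(next < dense.size());  // dense must cover every valid slot
      spaced[i] = dense[next++];
    }
  }
  assert(next == dense.size());  // no dense values should be left over
  return spaced;
}
```

For example, dense values {10, 20, 30} with validity {1, 0, 1, 1, 0} expand to {10, 0, 20, 30, 0}. A spaced buffer of this shape has the same length as valid_bytes, so the two can be passed together to builder->AppendValues(source, length, valid_bytes).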
