You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by "Uwe L. Korn" <uw...@xhochy.com> on 2018/05/01 10:40:01 UTC

Re: What is the maximum run length in the RLE encoding?

Hello Tim,

taking a brief look at what we have in parquet-cpp (which is probably very similar to Impala), we would also have problems with runs that are longer than 2^31. While supporting arbitrary long runs might be a really cool feature, I think it will come at a cost that we would have to refactor a lot of code in the current RLE implementations and it may lead to subtle bugs. I would therefore add a maximum run length to the spec. If there is really a need for having longer runs, then someone needs to step up and make the changes to the spec and the implementations. As long as there is no great need, I don't think we should pay the cost of supporting it.

Uwe

On Mon, Apr 30, 2018, at 11:18 PM, Tim Armstrong wrote:
> I'm looking at Impala bug with decoding Parquet RLE with run lengths >=
> 2^31. The bug was found by fuzz testing rather than a realistic file. 
> I'm
> trying to determine whether the Parquet spec actually allows runs of 
> that
> length, but Encodings.md does not seem to specify any upper bound. It
> mentions ULEB128 encoding, but that can encode arbitrarily large 
> numbers.
> See
> https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
> 
> Is there a practical limit I can assume? Should we amend the Parquet spec
> to clarify this?
> 
> The Impala bug is https://issues.apache.org/jira/browse/IMPALA-6946 if
> anyone is curious.
> 
> Thanks,
> Tim

Re: What is the maximum run length in the RLE encoding?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
+1. I think it's rare to have 2 billion values in a single page, plus
encoding values too densely causes problems. 2^31-1 should work fine for a
maximum. Plus, it maintains forward-compatibility with older readers that
are expecting this to be an int.

rb

On Tue, May 1, 2018 at 11:36 AM, Tim Armstrong <ta...@cloudera.com>
wrote:

> I created https://issues.apache.org/jira/browse/PARQUET-1290. My proposal
> is that we retrospectively limit the run lengths to (2^31 - 1). That
> technically breaks backwards compatibility but it sounds like none of the
> major implementations could read or write such files anyway.
>
> We could alternatively make it more of a suggestion that a rule and say
> that it's a valid Parquet file, but implementations are not required to
> support longer run lengths.
>
> On Tue, May 1, 2018 at 10:08 AM, Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> It looks like there is no length check in the Parquet Java code:
>> https://github.com/apache/parquet-mr/blob/master/parquet-
>> column/src/main/java/org/apache/parquet/column/values/
>> rle/RunLengthBitPackingHybridEncoder.java#L242
>>
>> But, that uses `writeUnsignedVarInt`, which uses an int:
>> https://github.com/apache/parquet-mr/blob/master/parquet-
>> common/src/main/java/org/apache/parquet/bytes/BytesUtils.java#L219-L225
>>
>> The run length is tracked as an int, so Java would also have a problem if
>> the int is 2^31 or greater. An overflow is possible when incrementing or
>> when shifting to add the type bit (RLE or bit-packed). Looks like we have
>> an effective max of 2^31-1.
>>
>> rb
>>
>> On Tue, May 1, 2018 at 3:40 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>>
>> > Hello Tim,
>> >
>> > taking a brief look at what we have in parquet-cpp (which is probably
>> very
>> > similar to Impala), we would also have problems with runs that are
>> longer
>> > than 2^31. While supporting arbitrary long runs might be a really cool
>> > feature, I think it will come at a cost that we would have to refactor a
>> > lot of code in the current RLE implementations and it may lead to subtle
>> > bugs. I would therefore add a maximum run length to the spec. If there
>> is
>> > really a need for having longer runs, then someone needs to step up and
>> > make the changes to the spec and the implementations. As long as there
>> is
>> > no great need, I don't think we should pay the cost of supporting it.
>> >
>> > Uwe
>> >
>> > On Mon, Apr 30, 2018, at 11:18 PM, Tim Armstrong wrote:
>> > > I'm looking at Impala bug with decoding Parquet RLE with run lengths
>> >=
>> > > 2^31. The bug was found by fuzz testing rather than a realistic file.
>> > > I'm
>> > > trying to determine whether the Parquet spec actually allows runs of
>> > > that
>> > > length, but Encodings.md does not seem to specify any upper bound. It
>> > > mentions ULEB128 encoding, but that can encode arbitrarily large
>> > > numbers.
>> > > See
>> > > https://github.com/apache/parquet-format/blob/master/
>> > Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
>> > >
>> > > Is there a practical limit I can assume? Should we amend the Parquet
>> spec
>> > > to clarify this?
>> > >
>> > > The Impala bug is https://issues.apache.org/jira/browse/IMPALA-6946
>> if
>> > > anyone is curious.
>> > >
>> > > Thanks,
>> > > Tim
>> >
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: What is the maximum run length in the RLE encoding?

Posted by Tim Armstrong <ta...@cloudera.com>.
I created https://issues.apache.org/jira/browse/PARQUET-1290. My proposal
is that we retrospectively limit the run lengths to (2^31 - 1). That
technically breaks backwards compatibility but it sounds like none of the
major implementations could read or write such files anyway.

We could alternatively make it more of a suggestion that a rule and say
that it's a valid Parquet file, but implementations are not required to
support longer run lengths.

On Tue, May 1, 2018 at 10:08 AM, Ryan Blue <rb...@netflix.com.invalid>
wrote:

> It looks like there is no length check in the Parquet Java code:
> https://github.com/apache/parquet-mr/blob/master/
> parquet-column/src/main/java/org/apache/parquet/column/values/rle/
> RunLengthBitPackingHybridEncoder.java#L242
>
> But, that uses `writeUnsignedVarInt`, which uses an int:
> https://github.com/apache/parquet-mr/blob/master/
> parquet-common/src/main/java/org/apache/parquet/bytes/
> BytesUtils.java#L219-L225
>
> The run length is tracked as an int, so Java would also have a problem if
> the int is 2^31 or greater. An overflow is possible when incrementing or
> when shifting to add the type bit (RLE or bit-packed). Looks like we have
> an effective max of 2^31-1.
>
> rb
>
> On Tue, May 1, 2018 at 3:40 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>
> > Hello Tim,
> >
> > taking a brief look at what we have in parquet-cpp (which is probably
> very
> > similar to Impala), we would also have problems with runs that are longer
> > than 2^31. While supporting arbitrary long runs might be a really cool
> > feature, I think it will come at a cost that we would have to refactor a
> > lot of code in the current RLE implementations and it may lead to subtle
> > bugs. I would therefore add a maximum run length to the spec. If there is
> > really a need for having longer runs, then someone needs to step up and
> > make the changes to the spec and the implementations. As long as there is
> > no great need, I don't think we should pay the cost of supporting it.
> >
> > Uwe
> >
> > On Mon, Apr 30, 2018, at 11:18 PM, Tim Armstrong wrote:
> > > I'm looking at Impala bug with decoding Parquet RLE with run lengths >=
> > > 2^31. The bug was found by fuzz testing rather than a realistic file.
> > > I'm
> > > trying to determine whether the Parquet spec actually allows runs of
> > > that
> > > length, but Encodings.md does not seem to specify any upper bound. It
> > > mentions ULEB128 encoding, but that can encode arbitrarily large
> > > numbers.
> > > See
> > > https://github.com/apache/parquet-format/blob/master/
> > Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
> > >
> > > Is there a practical limit I can assume? Should we amend the Parquet
> spec
> > > to clarify this?
> > >
> > > The Impala bug is https://issues.apache.org/jira/browse/IMPALA-6946 if
> > > anyone is curious.
> > >
> > > Thanks,
> > > Tim
> >
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: What is the maximum run length in the RLE encoding?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
It looks like there is no length check in the Parquet Java code:
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridEncoder.java#L242

But, that uses `writeUnsignedVarInt`, which uses an int:
https://github.com/apache/parquet-mr/blob/master/parquet-common/src/main/java/org/apache/parquet/bytes/BytesUtils.java#L219-L225

The run length is tracked as an int, so Java would also have a problem if
the int is 2^31 or greater. An overflow is possible when incrementing or
when shifting to add the type bit (RLE or bit-packed). Looks like we have
an effective max of 2^31-1.

rb

On Tue, May 1, 2018 at 3:40 AM, Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Tim,
>
> taking a brief look at what we have in parquet-cpp (which is probably very
> similar to Impala), we would also have problems with runs that are longer
> than 2^31. While supporting arbitrary long runs might be a really cool
> feature, I think it will come at a cost that we would have to refactor a
> lot of code in the current RLE implementations and it may lead to subtle
> bugs. I would therefore add a maximum run length to the spec. If there is
> really a need for having longer runs, then someone needs to step up and
> make the changes to the spec and the implementations. As long as there is
> no great need, I don't think we should pay the cost of supporting it.
>
> Uwe
>
> On Mon, Apr 30, 2018, at 11:18 PM, Tim Armstrong wrote:
> > I'm looking at Impala bug with decoding Parquet RLE with run lengths >=
> > 2^31. The bug was found by fuzz testing rather than a realistic file.
> > I'm
> > trying to determine whether the Parquet spec actually allows runs of
> > that
> > length, but Encodings.md does not seem to specify any upper bound. It
> > mentions ULEB128 encoding, but that can encode arbitrarily large
> > numbers.
> > See
> > https://github.com/apache/parquet-format/blob/master/
> Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
> >
> > Is there a practical limit I can assume? Should we amend the Parquet spec
> > to clarify this?
> >
> > The Impala bug is https://issues.apache.org/jira/browse/IMPALA-6946 if
> > anyone is curious.
> >
> > Thanks,
> > Tim
>



-- 
Ryan Blue
Software Engineer
Netflix