Posted to dev@parquet.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2016/08/31 02:26:21 UTC
[jira] [Commented] (PARQUET-698) Max buffer size for RLE encoding too small resulting in CheckBufferFull failure
[ https://issues.apache.org/jira/browse/PARQUET-698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15450826#comment-15450826 ]
Wes McKinney commented on PARQUET-698:
--------------------------------------
This may be a dup of PARQUET-676 -- that JIRA contains the makings of a test case. A patch would be most welcome, but I want to make sure that we understand exactly what the problem is.
https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
In Impala's implementation of RLE encoding, from which parquet-cpp's was derived, literal runs are capped at 504 values so that the indicator byte distinguishing RLE from literal runs stays a single byte (the LSB is 1 for a literal run and 0 for an RLE run).
See also the parquet-mr code:
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridEncoder.java#L105
and here's where the limit is enforced:
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridEncoder.java#L188
Here's where the limit is enforced in parquet-cpp:
https://github.com/apache/parquet-cpp/blob/master/src/parquet/util/rle-encoding.h#L447
In other words, when there are 63 groups of 8 bit-packed values, it flushes the literal run.
So MAX_VALUES_PER_LITERAL_RUN should be 504, not 512, but there seems to be little harm in the overestimate.
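To spell out where 504 comes from (a sketch based on the Encodings.md spec linked above; the constant name is parquet-cpp's, the arithmetic is mine):

```python
# Per the RLE/bit-packing hybrid spec, a literal-run header is a ULEB128
# varint encoding (num_groups << 1) | 1, where each group holds 8 values.
# Keeping the header to a single byte requires the encoded value to fit
# in 7 bits: (num_groups << 1) | 1 <= 127, so num_groups <= 63.
MAX_GROUPS = 63
assert (MAX_GROUPS << 1) | 1 == 127           # still a single-byte varint
max_values_per_literal_run = MAX_GROUPS * 8   # 63 groups of 8 values
print(max_values_per_literal_run)             # prints 504
print((1 << 6) * 8)                           # parquet-cpp's constant: 512
```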
That's as far as I can dig right now; if someone can find the bug and write a patch with a test case, that would be most helpful.
> Max buffer size for RLE encoding too small resulting in CheckBufferFull failure
> -------------------------------------------------------------------------------
>
> Key: PARQUET-698
> URL: https://issues.apache.org/jira/browse/PARQUET-698
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-0.1
> Reporter: Anand Mitra
>
> I have been serializing a dataset of 500 documents with a nested schema
> and repeated attributes, batching 50 records per row group.
> The check DCHECK_EQ(encoded, num_buffered_values_) in
> ColumnWriter::RleEncodeLevels() failed.
> We are running out of space in the allocated buffer. This seems
> unlikely, since we compute the worst-case maximum size required and
> allocate accordingly.
> Looking more closely at how the max size is computed and comparing
> it with how we write the RLE, I noticed the following inconsistency.
> The worst-case space required is when there are no repeats and
> everything is a literal run, with the overhead of the
> literal_indicator, which is 4 bytes (32 bits). The computation
> assumes that we can get MAX_VALUES_PER_LITERAL_RUN = (1 << 6) * 8
> literals encoded for the overhead of one literal_indicator.
> However, when I examine the actual code, it flushes the buffered
> literal values after every 8 literals, giving an overhead of 4 bytes
> for every 8 literal values. This can be ascertained from the
> following code fragment.
> RleEncoder::Put() {
>   ...
>   if (++num_buffered_values_ == 8) {
>     DCHECK_EQ(literal_count_ % 8, 0);
>     FlushBufferedValues(false);
>   }
>   ...
> }
> As a result, RleEncoder::FlushBufferedValues() will encode only one
> group at a time, not the maximum that is theoretically possible
> with the scheme.
> The suggested fix is to change MAX_VALUES_PER_LITERAL_RUN to 8 so that
> the buffer size is calculated accurately for the current encoding scheme.
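To illustrate how much the worst-case estimate shifts if the indicator overhead is really paid every 8 values rather than every 504 (the function and numbers below are mine, not parquet-cpp's; the 4-byte indicator is taken from the description above):

```python
import math

def worst_case_bytes(num_values, bit_width, max_run, indicator_bytes=4):
    # One indicator per literal run of up to max_run values, plus the
    # bit-packed data itself.
    runs = math.ceil(num_values / max_run)
    data = math.ceil(num_values * bit_width / 8)
    return runs * indicator_bytes + data

print(worst_case_bytes(1000, 1, 504))  # 133: 2 runs * 4 bytes + 125 data bytes
print(worst_case_bytes(1000, 1, 8))    # 625: 125 runs * 4 bytes + 125 data bytes
```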
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)