Posted to dev@parquet.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2016/08/31 02:26:21 UTC
[jira] [Commented] (PARQUET-698) Max buffer size for RLE encoding too small resulting in CheckBufferFull failure
[ https://issues.apache.org/jira/browse/PARQUET-698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15450826#comment-15450826 ]
Wes McKinney commented on PARQUET-698:
--------------------------------------
This may be a dup of PARQUET-676 -- that JIRA contains the makings of a test case. A patch would be most welcome, but I want to make sure that we understand exactly what the problem is.
https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
In Impala's implementation of RLE encoding, from which parquet-cpp's was derived, literal runs are capped at 504 values so that the indicator byte distinguishing RLE from literal runs stays a single byte (the LSB is 1 for a literal run and 0 for an RLE run).
See also the parquet-mr code:
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridEncoder.java#L105
and here's where the limit is enforced:
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridEncoder.java#L188
Here's where the limit is enforced in parquet-cpp:
https://github.com/apache/parquet-cpp/blob/master/src/parquet/util/rle-encoding.h#L447
In other words, when there are 63 groups of 8 bit-packed values, it flushes the literal run.
So MAX_VALUES_PER_LITERAL_RUN should be 504, not 512, but there seems to be little harm in the overestimate.
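To spell out where 504 comes from (a sketch based on the Encodings.md spec linked above; the constant name is parquet-cpp's, the arithmetic is mine):

```python
# Per the RLE/bit-packing hybrid spec, a literal-run header is a ULEB128
# varint encoding (num_groups << 1) | 1, where each group holds 8 values.
# Keeping the header to a single byte requires the encoded value to fit
# in 7 bits: (num_groups << 1) | 1 <= 127, so num_groups <= 63.
MAX_GROUPS = 63
assert (MAX_GROUPS << 1) | 1 == 127           # still a single-byte varint
max_values_per_literal_run = MAX_GROUPS * 8   # 63 groups of 8 values
print(max_values_per_literal_run)             # prints 504
print((1 << 6) * 8)                           # parquet-cpp's constant: 512
```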
That's as far as I can dig right now; if someone can find the bug and write a patch with a test case, that would be most helpful.
> Max buffer size for RLE encoding too small resulting in CheckBufferFull failure
> -------------------------------------------------------------------------------
>
> Key: PARQUET-698
> URL: https://issues.apache.org/jira/browse/PARQUET-698
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-0.1
> Reporter: Anand Mitra
>
> I have been serializing a dataset of 500 documents with a nested schema
> and repeated attributes, batching 50 records per row group.
> The check DCHECK_EQ(encoded, num_buffered_values_) in
> ColumnWriter::RleEncodeLevels() failed.
> We are running out of space in the allocated buffer. This seems
> unlikely, since we compute the worst-case maximum size required and
> allocate accordingly.
> Looking more closely at how the max size is computed and comparing
> it with how we write the RLE, I noticed the following inconsistency.
> The worst-case space required is when there are no repeats and
> everything is a literal run, with the overhead of the
> literal_indicator, which is 4 bytes (32 bits). The computation
> assumes that we can get MAX_VALUES_PER_LITERAL_RUN = (1 << 6) * 8
> literals encoded for the overhead of one literal_indicator.
> However, when I examine the actual code, it flushes the buffered
> literal values after every 8 literals, giving an overhead of 4 bytes
> for every 8 literal values. This can be ascertained from the
> following code fragment.
> RleEncoder::Put() {
>   ...
>   if (++num_buffered_values_ == 8) {
>     DCHECK_EQ(literal_count_ % 8, 0);
>     FlushBufferedValues(false);
>   }
>   ...
> }
> As a result, RleEncoder::FlushBufferedValues() will encode only one
> group at a time, not the maximum that is theoretically possible
> with the scheme.
> The suggested fix is to change MAX_VALUES_PER_LITERAL_RUN to 8 so that
> the buffer size is calculated accurately for the current encoding scheme.
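To illustrate how much the worst-case estimate shifts if the indicator overhead is really paid every 8 values rather than every 504 (the function and numbers below are mine, not parquet-cpp's; the 4-byte indicator is taken from the description above):

```python
import math

def worst_case_bytes(num_values, bit_width, max_run, indicator_bytes=4):
    # One indicator per literal run of up to max_run values, plus the
    # bit-packed data itself.
    runs = math.ceil(num_values / max_run)
    data = math.ceil(num_values * bit_width / 8)
    return runs * indicator_bytes + data

print(worst_case_bytes(1000, 1, 504))  # 133: 2 runs * 4 bytes + 125 data bytes
print(worst_case_bytes(1000, 1, 8))    # 625: 125 runs * 4 bytes + 125 data bytes
```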
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)