Posted to dev@parquet.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2016/09/02 20:15:20 UTC

[jira] [Commented] (PARQUET-698) Max buffer size for RLE encoding too small resulting in CheckBufferFull failure

    [ https://issues.apache.org/jira/browse/PARQUET-698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15459486#comment-15459486 ] 

Wes McKinney commented on PARQUET-698:
--------------------------------------

This should be fixed by https://github.com/apache/parquet-cpp/pull/150

> Max buffer size for RLE encoding too small resulting in CheckBufferFull failure
> -------------------------------------------------------------------------------
>
>                 Key: PARQUET-698
>                 URL: https://issues.apache.org/jira/browse/PARQUET-698
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>    Affects Versions: cpp-0.1
>            Reporter: Anand Mitra
>            Assignee: Wes McKinney
>
> I have been serializing a dataset of 500 documents with a nested schema
> and repeated attributes, batching 50 records per row group.
> The check DCHECK_EQ(encoded, num_buffered_values_) in
> ColumnWriter::RleEncodeLevels() failed.
> We are running out of space in the allocated buffer. This seems
> unlikely, since we compute the worst-case maximum size required and
> allocate accordingly.
> Looking more closely at how the maximum size is computed and comparing
> it with how we write the RLE, I noticed the following inconsistency.
> The worst-case space requirement occurs when there are no repeats and
> everything is a literal run, with the overhead of the
> "literal_indicator", which is 4 bytes (32 bits). The computation
> assumes that we can get MAX_VALUES_PER_LITERAL_RUN = (1 << 6) * 8
> literals encoded for the overhead of one literal_indicator.
> However, when I examine the actual code, it emits literal
> values after every 8 literals, giving an overhead of 4 bytes for
> every 8 literal values. This can be seen in the following
> code fragment:
> RleEncoder::Put() {
>   .....
>   if (++num_buffered_values_ == 8) {
>     DCHECK_EQ(literal_count_ % 8, 0);
>     FlushBufferedValues(false);
>   }
>   .....
> }
> As a result, in RleEncoder::FlushBufferedValues() we will
> encode only one group rather than the maximum that is
> theoretically possible with the scheme.
> The suggested fix is to change MAX_VALUES_PER_LITERAL_RUN to 8 so that
> the buffer size is calculated accurately for the current encoding scheme.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)