Posted to dev@parquet.apache.org by "Anand Mitra (JIRA)" <ji...@apache.org> on 2016/08/30 13:23:21 UTC

[jira] [Created] (PARQUET-698) Max buffer size for RLE encoding too small resulting in CheckBufferFull failure

Anand Mitra created PARQUET-698:
-----------------------------------

             Summary: Max buffer size for RLE encoding too small resulting in CheckBufferFull failure
                 Key: PARQUET-698
                 URL: https://issues.apache.org/jira/browse/PARQUET-698
             Project: Parquet
          Issue Type: Bug
          Components: parquet-cpp
    Affects Versions: cpp-0.1
            Reporter: Anand Mitra


I have been serializing a dataset of 500 documents with a nested schema
and repeated attributes, batching 50 records per row group.

The check DCHECK_EQ(encoded, num_buffered_values_) in
ColumnWriter::RleEncodeLevels() failed.

We are running out of space in the allocated buffer. This seems
unlikely, since we compute the worst-case maximum size required and
allocate accordingly.

Looking more closely at how the max size is computed, and comparing
it with how the RLE is actually written, I noticed the following
inconsistency.

The worst-case space is required when there are no repeats and
everything is encoded as literal runs, each carrying the overhead of a
literal indicator of 4 bytes (32 bits). The computation assumes that
MAX_VALUES_PER_LITERAL_RUN = (1 << 6) * 8 = 512 literals can be
encoded for the overhead of one literal indicator.

However, when I examine the actual code, it emits literal values after
every 8 literals, giving an overhead of 4 bytes for every 8 literal
values. This can be seen in the following fragment:

bool RleEncoder::Put(uint64_t value) {
  // ...
  if (++num_buffered_values_ == 8) {
    DCHECK_EQ(literal_count_ % 8, 0);
    // Flushes after every 8 buffered values.
    FlushBufferedValues(false);
  }
  // ...
}

As a result, RleEncoder::FlushBufferedValues() encodes only one group
of 8 at a time, never the maximum that is theoretically possible with
the scheme.

Suggested fix: change MAX_VALUES_PER_LITERAL_RUN to 8 so that the
buffer-size calculation matches the current encoding scheme.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)