You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@impala.apache.org by "Csaba Ringhofer (JIRA)" <ji...@apache.org> on 2018/03/19 16:35:00 UTC

[jira] [Created] (IMPALA-6700) Improve error handling of Parquet RLE decoding

Csaba Ringhofer created IMPALA-6700:
---------------------------------------

             Summary: Improve error handling of Parquet RLE decoding
                 Key: IMPALA-6700
                 URL: https://issues.apache.org/jira/browse/IMPALA-6700
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
            Reporter: Csaba Ringhofer


RleBatchDecoder's error handling should be stricter and more explicit in my opinion. My main problem with the current solution is readability - it is hard to tell at the first glance what will be the consequences of different decoding errors.

There are 4 things that can go wrong during RLE decoding:
1. repeated / literal run found with count 0 (now this leads to error in some cases, while is skipped in some others)
2. literal run that exceeds the input (currently GetLiteralValues returns false in this case)
3. more values are expected, but there are no more runs in the input (currently this has to be checked by the caller)
4. there shouldn't be any more value, but the input still has more bytes (this is currently not checked)

1 and 2 should set RleBatchDecoder to an error state, while there should be a function to check 4 (e.g. STATUS CheckIfAtTheEnd() ). Handling of 3 is ok as it is.

Changing 1 and 4  would mean that some Parquet pages that are currently accepted would be considered corrupt. 
- 1 is a valid change in my opinion, as parquet-mr also doesn't handle 0 counts - 0 sized literal runs wil lead to exception, while 0 sizes repeated runs will lead to incorrect results, see https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java
- 4 could be useful to detect errors in encoders / corrupted data  - nearly any random buffer can be decoded as RLE, but it is unlikely that the number of encoded values will match the number of expected values



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)