Posted to issues@impala.apache.org by "Tianyi Wang (JIRA)" <ji...@apache.org> on 2017/09/22 01:34:01 UTC

[jira] [Resolved] (IMPALA-5250) Non-deterministic error reporting for compressed corrupt Parquet files

     [ https://issues.apache.org/jira/browse/IMPALA-5250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tianyi Wang resolved IMPALA-5250.
---------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.11.0


IMPALA-5250: Unify decompressor output_length semantics

This patch makes the semantics of the output_length parameter in
Codec::ProcessBlock the same across all codecs. In the existing code the
decompressors treat output_length differently:
1. SnappyDecompressor needs output_length to be greater than or equal
   to the actual decompressed length, but it does not set it to the
   actual decompressed length after decompression.
2. SnappyBlockDecompressor and Lz4Decompressor require output_length to
   be exactly the same as the actual decompressed length, otherwise
   decompression fails.
3. Other decompressors need output_length to be greater than or equal to
   the actual decompressed length and will set it to the actual
   decompressed length if oversized.
This inconsistency leads to a bug where the error message is
non-deterministic when the compressed block is corrupted. This patch
makes all decompressors behave like a modified version of 3:
output_length must be greater than or equal to the actual decompressed
length, and it is set to the actual decompressed length if oversized. A
decompression failure sets it to 0.
Lz4Decompressor now uses the "safe" instead of the "fast" decompression
function, because the latter is unsafe on corrupted data and requires
the exact decompressed length to be known.
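For reference, the safe lz4 entry point only needs an upper bound on the
output size. A short sketch against the plain lz4 C API (not Impala code)
follows:

{code}
#include <lz4.h>

// Decompress 'src' into 'dst' given only an upper bound on the output size.
// LZ4_decompress_fast() would instead trust a caller-supplied exact size and
// can read out of bounds on corrupt input; LZ4_decompress_safe() bounds-checks
// against dst_capacity and returns a negative value on malformed data.
int DecompressSafely(const char* src, int compressed_size,
                     char* dst, int dst_capacity) {
  int n = LZ4_decompress_safe(src, dst, compressed_size, dst_capacity);
  return n < 0 ? -1 : n;  // -1 signals corrupt or truncated input
}
{code}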

Testing: A test case is added checking that the decompressors handle an
oversized output buffer correctly. A regression test for the exact case
described in IMPALA-5250 is also added. A benchmark of the
Lz4Decompressor change was run on a 16-node cluster and found no
performance regression.
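A hedged sketch of the kind of oversized-buffer check described above
(gtest-style; the helpers such as MakeTestData(), Compress(), and pool are
hypothetical, not the actual test added by the patch):

{code}
// Hypothetical oversized-buffer check illustrating the new contract.
std::vector<uint8_t> input = MakeTestData();
std::vector<uint8_t> compressed = Compress(codec, input);
int32_t output_length = static_cast<int32_t>(input.size()) * 2;  // deliberately oversized
uint8_t* output = pool->Allocate(output_length);
Status status = codec->ProcessBlock32(true, compressed.size(),
    compressed.data(), &output_length, &output);
ASSERT_TRUE(status.ok());
ASSERT_EQ(static_cast<int32_t>(input.size()), output_length);  // shrunk to the real size
{code}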

Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
Reviewed-on: http://gerrit.cloudera.org:8080/8030
Reviewed-by: Alex Behm <al...@cloudera.com>
Tested-by: Impala Public Jenkins

> Non-deterministic error reporting for compressed corrupt Parquet files
> ----------------------------------------------------------------------
>
>                 Key: IMPALA-5250
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5250
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.8.0
>            Reporter: Alexander Behm
>            Assignee: Tianyi Wang
>             Fix For: Impala 2.11.0
>
>
> Impala may return non-deterministic errors for certain corrupt Parquet files that are compressed. See the relevant snippet from BaseScalarColumnReader::ReadDataPage() below:
> {code}
>     if (decompressor_.get() != NULL) {
>       SCOPED_TIMER(parent_->decompress_timer_);
>       uint8_t* decompressed_buffer =
>           decompressed_data_pool_->TryAllocate(uncompressed_size);
>       if (UNLIKELY(decompressed_buffer == NULL)) {
>         string details = Substitute(PARQUET_COL_MEM_LIMIT_EXCEEDED, "ReadDataPage",
>             uncompressed_size, "decompressed data");
>         return decompressed_data_pool_->mem_tracker()->MemLimitExceeded(
>             parent_->state_, details, uncompressed_size);
>       }
>       RETURN_IF_ERROR(decompressor_->ProcessBlock32(true,
>           current_page_header_.compressed_page_size, data_, &uncompressed_size,
>           &decompressed_buffer));
>       VLOG_FILE << "Decompressed " << current_page_header_.compressed_page_size
>                 << " to " << uncompressed_size;
>       if (current_page_header_.uncompressed_page_size != uncompressed_size) {
>         return Status(Substitute("Error decompressing data page in file '$0'. "
>             "Expected $1 uncompressed bytes but got $2", filename(),
>             current_page_header_.uncompressed_page_size, uncompressed_size));
>       }
>       data_ = decompressed_buffer;
>       data_size = current_page_header_.uncompressed_page_size;
>       data_end_ = data_ + data_size;
> {code}
> The 'decompressed_buffer' is not initialized, and it is possible that decompressor_->ProcessBlock32() succeeds without writing all of the bytes in 'decompressed_buffer', leading to non-deterministic errors being reported later in the scan. For example, this may happen when the 'compressed_page_size' is corrupt and set to 1.
> We've seen the following errors being reported for files like this:
> {code}
> Could not read definition level, even though metadata states there are <some_number> values remaining in data page. 
> Corrupt Parquet file '<file>' <some_number> bytes of encoded levels but only <some_number> bytes left in page.
> {code}
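> As an editorial illustration of the uninitialized-buffer scenario (assuming the Snappy codec; this is not Impala code): snappy writes only the length recorded inside the compressed stream, so a corrupt page can leave the tail of the buffer untouched.
> {code}
> #include <memory>
> #include <snappy.h>
>
> size_t actual_len = 0;
> snappy::GetUncompressedLength(compressed, compressed_len, &actual_len);
> // Buffer sized from the page header's uncompressed_page_size, left uninitialized.
> std::unique_ptr<char[]> buffer(new char[uncompressed_page_size]);
> snappy::RawUncompress(compressed, compressed_len, buffer.get());
> // With a corrupt page, actual_len may be smaller than uncompressed_page_size, so
> // the tail of the buffer is never written and later reads see arbitrary bytes.
> {code}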



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)