Posted to issues@impala.apache.org by "Alexander Behm (JIRA)" <ji...@apache.org> on 2017/04/24 16:56:04 UTC
[jira] [Created] (IMPALA-5250) Non-deterministic error reporting for compressed corrupt Parquet files
Alexander Behm created IMPALA-5250:
--------------------------------------
Summary: Non-deterministic error reporting for compressed corrupt Parquet files
Key: IMPALA-5250
URL: https://issues.apache.org/jira/browse/IMPALA-5250
Project: IMPALA
Issue Type: Bug
Components: Backend
Affects Versions: Impala 2.8.0
Reporter: Alexander Behm
Impala may return non-deterministic errors for certain corrupt Parquet files that are compressed. See the relevant snippet from BaseScalarColumnReader::ReadDataPage() below:
{code}
if (decompressor_.get() != NULL) {
  SCOPED_TIMER(parent_->decompress_timer_);
  uint8_t* decompressed_buffer =
      decompressed_data_pool_->TryAllocate(uncompressed_size);
  if (UNLIKELY(decompressed_buffer == NULL)) {
    string details = Substitute(PARQUET_COL_MEM_LIMIT_EXCEEDED, "ReadDataPage",
        uncompressed_size, "decompressed data");
    return decompressed_data_pool_->mem_tracker()->MemLimitExceeded(
        parent_->state_, details, uncompressed_size);
  }
  RETURN_IF_ERROR(decompressor_->ProcessBlock32(true,
      current_page_header_.compressed_page_size, data_, &uncompressed_size,
      &decompressed_buffer));
  VLOG_FILE << "Decompressed " << current_page_header_.compressed_page_size
            << " to " << uncompressed_size;
  if (current_page_header_.uncompressed_page_size != uncompressed_size) {
    return Status(Substitute("Error decompressing data page in file '$0'. "
        "Expected $1 uncompressed bytes but got $2", filename(),
        current_page_header_.uncompressed_page_size, uncompressed_size));
  }
  data_ = decompressed_buffer;
  data_size = current_page_header_.uncompressed_page_size;
  data_end_ = data_ + data_size;
{code}
The 'decompressed_buffer' is not initialized, and decompressor_->ProcessBlock32() may report success without writing all of the bytes in 'decompressed_buffer', so later stages of the scan decode leftover garbage and report non-deterministic errors. For example, this can happen when the page header's 'compressed_page_size' is corrupt and set to 1.
We've seen the following errors being reported for files like this:
{code}
Could not read definition level, even though metadata states there are <some_number> values remaining in data page.
Corrupt Parquet file '<file>': <some_number> bytes of encoded levels but only <some_number> bytes left in page.
{code}
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)