Posted to commits@impala.apache.org by jo...@apache.org on 2020/10/14 22:48:32 UTC
[impala] 01/06: IMPALA-10055: Fix DCHECK hit on corrupt ORC file
This is an automated email from the ASF dual-hosted git repository.
joemcdonnell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git
commit 0fb234923c580ee39f3c296eebfc5062a71e8fb9
Author: Zoltan Borok-Nagy <bo...@cloudera.com>
AuthorDate: Tue Oct 13 13:54:57 2020 +0200
IMPALA-10055: Fix DCHECK hit on corrupt ORC file
Our ORC scanner could hit a DCHECK on corrupt ORC files. In
test_scanners_fuzz we randomly modify ORC files, so this test
might hit a DCHECK occasionally.
I converted the DCHECK to a parse error. This way the fuzz test
won't crash the Impala daemon.
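
The conversion described above can be sketched in isolation. This is a minimal illustration, not the Impala code: SimpleStatus, ReaderState and CheckEmptyChildren are hypothetical stand-ins, and Impala's real Status class and scanner members differ. It only shows the same condition being returned as a recoverable error instead of being asserted.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a status type; Impala's real Status class differs.
struct SimpleStatus {
  bool ok;
  std::string msg;
  static SimpleStatus OK() { return {true, ""}; }
  static SimpleStatus Error(const std::string& m) { return {false, m}; }
};

// Hypothetical reader state mirroring the flags used in the condition.
struct ReaderState {
  bool has_children;
  bool acid_original_file;
  bool row_batches_need_validation;
  bool zero_slot_scan;
  std::string filename;
};

// An assert on this condition would abort a debug build on corrupt input.
// Here the same condition is evaluated and reported to the caller instead.
SimpleStatus CheckEmptyChildren(const ReaderState& s) {
  if (s.has_children) return SimpleStatus::OK();
  bool valid_empty_children = s.acid_original_file ||
      (s.row_batches_need_validation && s.zero_slot_scan);
  if (!valid_empty_children) {
    return SimpleStatus::Error(
        "Parse error in possibly corrupt ORC file: '" + s.filename + "'");
  }
  return SimpleStatus::OK();
}
```

With this shape, a corrupt file only fails the query that reads it, while the process keeps running.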
Testing:
Unfortunately I don't have an ORC file on which we hit the DCHECK.
So I manually changed the code to always raise this error and
executed the fuzz test to see if it still succeeds.
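
The manual check described above, forcing the error on every scan and confirming the caller survives, might look roughly like the following. This is a sketch under assumptions: ScanResult, ScanFile, CountFailedScans and the force_parse_error switch are all hypothetical names for illustration and are not part of Impala or of test_scanners_fuzz.

```cpp
#include <string>
#include <vector>

// Hypothetical result type for one file scan.
struct ScanResult {
  bool ok;
  std::string msg;
};

// Hypothetical scan function. When force_parse_error is true, every scan
// reports the parse error, imitating the temporary code change used for testing.
ScanResult ScanFile(const std::string& filename, bool force_parse_error) {
  if (force_parse_error) {
    return {false, "Parse error in possibly corrupt ORC file: '" + filename + "'"};
  }
  return {true, ""};
}

// Scans every file and returns how many scans failed. A failed scan is
// counted and skipped, so the loop always runs to completion.
int CountFailedScans(const std::vector<std::string>& files, bool force_parse_error) {
  int failed = 0;
  for (const auto& f : files) {
    if (!ScanFile(f, force_parse_error).ok) ++failed;
  }
  return failed;
}
```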
Change-Id: I18d9f56c3c37afd1a4898ee36f8cc2ddb5049972
Reviewed-on: http://gerrit.cloudera.org:8080/16591
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
be/src/exec/orc-column-readers.cc | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/be/src/exec/orc-column-readers.cc b/be/src/exec/orc-column-readers.cc
index 2e6f664..138fa5a 100644
--- a/be/src/exec/orc-column-readers.cc
+++ b/be/src/exec/orc-column-readers.cc
@@ -418,9 +418,18 @@ Status OrcStructReader::TopLevelReadValueBatch(ScratchTupleBatch* scratch_batch,
   }
   int num_rows_read = scratch_batch->num_tuples - scratch_batch_idx;
   if (children_.empty()) {
-    DCHECK((scanner_->row_batches_need_validation_ &&
-            scanner_->scan_node_->IsZeroSlotTableScan()) ||
-           scanner_->acid_original_file_);
+    // We allow empty 'children_' for original files, because we might select the
+    // synthetic 'rowid' field which is not present in original files.
+    // We also allow empty 'children_' when we need to validate row batches of a zero slot
+    // scan. In that case 'children_' is empty and only 'row_validator_' owns an ORC
+    // vector batch (the write id batch).
+    bool valid_empty_children = scanner_->acid_original_file_ ||
+        (scanner_->row_batches_need_validation_ &&
+         scanner_->scan_node_->IsZeroSlotTableScan());
+    if (!valid_empty_children) {
+      return Status(Substitute("Parse error in possibly corrupt ORC file: '$0'",
+          scanner_->filename()));
+    }
     DCHECK_EQ(0, num_rows_read);
     num_rows_read = std::min(scratch_batch->capacity - scratch_batch->num_tuples,
         NumElements() - row_idx_);