Posted to commits@impala.apache.org by jo...@apache.org on 2020/10/14 22:48:32 UTC

[impala] 01/06: IMPALA-10055: Fix DCHECK hit on corrupt ORC file

This is an automated email from the ASF dual-hosted git repository.

joemcdonnell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 0fb234923c580ee39f3c296eebfc5062a71e8fb9
Author: Zoltan Borok-Nagy <bo...@cloudera.com>
AuthorDate: Tue Oct 13 13:54:57 2020 +0200

    IMPALA-10055: Fix DCHECK hit on corrupt ORC file
    
    Our ORC scanner could hit a DCHECK on corrupt ORC files. In
    test_scanners_fuzz we randomly modify ORC files, so this test
    might hit the DCHECK occasionally.
    
    I converted the DCHECK to a parse error. This way the fuzz test
    won't crash the Impala daemon.
    
    Testing:
    Unfortunately I don't have an ORC file on which we hit the DCHECK.
    So I manually changed the code to always raise this error and
    executed the fuzz test to see if it still succeeds.
    
    Change-Id: I18d9f56c3c37afd1a4898ee36f8cc2ddb5049972
    Reviewed-on: http://gerrit.cloudera.org:8080/16591
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 be/src/exec/orc-column-readers.cc | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/be/src/exec/orc-column-readers.cc b/be/src/exec/orc-column-readers.cc
index 2e6f664..138fa5a 100644
--- a/be/src/exec/orc-column-readers.cc
+++ b/be/src/exec/orc-column-readers.cc
@@ -418,9 +418,18 @@ Status OrcStructReader::TopLevelReadValueBatch(ScratchTupleBatch* scratch_batch,
   }
   int num_rows_read = scratch_batch->num_tuples - scratch_batch_idx;
   if (children_.empty()) {
-    DCHECK((scanner_->row_batches_need_validation_ &&
-            scanner_->scan_node_->IsZeroSlotTableScan()) ||
-            scanner_->acid_original_file_);
+    // We allow empty 'children_' for original files, because we might select the
+    // synthetic 'rowid' field which is not present in original files.
+    // We also allow empty 'children_' when we need to validate row batches of a zero slot
+    // scan. In that case 'children_' is empty and only 'row_validator_' owns an ORC
+    // vector batch (the write id batch).
+    bool valid_empty_children = scanner_->acid_original_file_ ||
+         (scanner_->row_batches_need_validation_ &&
+          scanner_->scan_node_->IsZeroSlotTableScan());
+    if (!valid_empty_children) {
+      return Status(Substitute("Parse error in possibly corrupt ORC file: '$0'",
+          scanner_->filename()));
+    }
     DCHECK_EQ(0, num_rows_read);
     num_rows_read = std::min(scratch_batch->capacity - scratch_batch->num_tuples,
                              NumElements() - row_idx_);
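
The pattern in this change, replacing a debug-only assertion with a returned error, can be sketched in isolation as follows. This is a minimal standalone sketch, not Impala code: the Status class, the ScanState struct, and the CheckEmptyChildren function are hypothetical stand-ins that only mirror the condition and the error message shown in the diff above.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for a status type; this is NOT Impala's Status API.
class Status {
 public:
  static Status OK() { return Status(); }
  explicit Status(std::string msg) : ok_(false), msg_(std::move(msg)) {}
  bool ok() const { return ok_; }
  const std::string& msg() const { return msg_; }

 private:
  Status() : ok_(true) {}
  bool ok_;
  std::string msg_;
};

// Hypothetical flattened view of the scanner state that the check consults.
struct ScanState {
  bool acid_original_file;
  bool row_batches_need_validation;
  bool is_zero_slot_table_scan;
  std::string filename;
};

// Returns an error for untrusted input instead of asserting. An assertion
// would abort a debug build whenever a corrupt file violates the condition.
Status CheckEmptyChildren(const ScanState& s) {
  bool valid_empty_children = s.acid_original_file ||
      (s.row_batches_need_validation && s.is_zero_slot_table_scan);
  if (!valid_empty_children) {
    return Status("Parse error in possibly corrupt ORC file: '" +
                  s.filename + "'");
  }
  return Status::OK();
}
```

A caller would propagate the non-OK status up to the query, so a corrupt file fails that one scan with a parse error rather than taking down the whole process.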