Posted to commits@impala.apache.org by bo...@apache.org on 2022/02/23 10:03:32 UTC

[impala] branch master updated (fe04c50 -> 8e5a42f)

This is an automated email from the ASF dual-hosted git repository.

boroknagyz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git.


    from fe04c50  IMPALA-10961: Implementing adaptive 3-way quicksort in sorter
     new f0d4afa  IMPALA-11144: fix testAggregationNodeGroupByCardinalityCapping
     new b60ccab  IMPALA-11134: Impala returns "Couldn't skip rows in file" error for old Parquet file
     new 8e5a42f  IMPALA-11133: Decode author of a commit with utf8 before printing it

The 3 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 be/src/exec/parquet/hdfs-parquet-scanner.cc            |   9 ++++++---
 be/src/exec/parquet/parquet-column-readers.cc          |   9 ++++++++-
 bin/compare_branches.py                                |   3 ++-
 common/thrift/generate_error_codes.py                  |   2 ++
 .../org/apache/impala/planner/CardinalityTest.java     |   4 ++--
 testdata/data/README                                   |   3 +++
 testdata/data/too_many_def_levels.parquet              | Bin 0 -> 500880 bytes
 tests/query_test/test_scanners.py                      |   9 +++++++++
 8 files changed, 32 insertions(+), 7 deletions(-)
 create mode 100644 testdata/data/too_many_def_levels.parquet

[impala] 01/03: IMPALA-11144: fix testAggregationNodeGroupByCardinalityCapping

Posted by bo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

boroknagyz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit f0d4afaff1e8ed93834f5198046d137bcd03a843
Author: noemi <np...@cloudera.com>
AuthorDate: Tue Feb 22 13:06:10 2022 +0100

    IMPALA-11144: fix testAggregationNodeGroupByCardinalityCapping
    
    IMPALA-10961 can decrease the size of the files in
    functional_parquet.alltypes, which broke the test that relies on the
    file size.
    Lowered the expected value so it fits within the current tolerance
    range of 0.05.
    See the jira for a more detailed analysis.
    
    Change-Id: I7956db444549c02fab3e56cb8bd281535f679776
    Reviewed-on: http://gerrit.cloudera.org:8080/18264
    Reviewed-by: Csaba Ringhofer <cs...@cloudera.com>
    Reviewed-by: Fang-Yu Rao <fa...@cloudera.com>
    Tested-by: Csaba Ringhofer <cs...@cloudera.com>
---
 fe/src/test/java/org/apache/impala/planner/CardinalityTest.java | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java b/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
index e9b6eaa..067d838 100644
--- a/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
+++ b/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
@@ -366,11 +366,11 @@ public class CardinalityTest extends PlannerTestBase {
     // The estimated number of rows caps the output cardinality.
     verifyApproxCardinality(
         "select distinct id, int_col from functional_parquet.alltypes",
-        12760, true, ImmutableSet.of(),
+        12400, true, ImmutableSet.of(),
         pathToFirstAggregationNode, AggregationNode.class);
     verifyApproxCardinality(
         "select distinct id, int_col from functional_parquet.alltypes",
-        12760, true, ImmutableSet.of(),
+        12400, true, ImmutableSet.of(),
         pathToSecondAggregationNode, AggregationNode.class);
     // No column stats available and row estimation disabled - no estimate is possible.
     verifyCardinality(
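
A side note on the numbers above: verifyApproxCardinality accepts an estimate that lies within a tolerance of 0.05 of the expected value. The Python sketch below only illustrates that kind of relative-tolerance check; the helper name and exact semantics are assumptions for illustration, not Impala's actual Java test harness.

    # Hedged sketch of an approximate-cardinality assertion with a 0.05
    # relative tolerance; the real check lives in Impala's Java tests and
    # may differ in its details.
    def verify_approx_cardinality(actual, expected, tolerance=0.05):
        # Accept the planner's estimate if it deviates from the expected
        # value by at most `tolerance` (relative error).
        assert abs(actual - expected) <= tolerance * expected, \
            "estimate {0} not within {1:.0%} of {2}".format(
                actual, tolerance, expected)

    # A hypothetical estimate of 12100 passes against the new expected
    # value of 12400 (300 <= 620) but would fail against the old 12760
    # (660 > 638), which is why the expectation was lowered.
    verify_approx_cardinality(actual=12100, expected=12400)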

[impala] 02/03: IMPALA-11134: Impala returns "Couldn't skip rows in file" error for old Parquet file

Posted by bo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

boroknagyz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit b60ccabd5b6f09842284c657b910bb65d2f30fe8
Author: Zoltan Borok-Nagy <bo...@cloudera.com>
AuthorDate: Sat Feb 19 18:21:49 2022 +0100

    IMPALA-11134: Impala returns "Couldn't skip rows in file" error for old Parquet file
    
    Impala returns "Couldn't skip rows in file" error for old Parquet
    file written by an old Impala (e.g. Impala 2.5, 2.6) In DEBUG build
    Impala crashes by a DCHECK:
    
     Check failed: num_buffered_values_ > 0 (-1 vs. 0)
    
    The problem is that in some old Parquet files there can be a mismatch
    between 'num_values' in a page and the encoded def/rep levels.
    There is usually one more def/rep level encoded in these files.
    
    In SkipTopLevelRows() we skipped values based on how many def levels
    there are:
    https://github.com/apache/impala/blob/92ce6fe48e75d7780efe9a275122554e59aac916/be/src/exec/parquet/parquet-column-readers.cc#L1308-L1314
    
    Since there are more def levels than values in some old files,
    num_buffered_values_ could become negative.
    
    This patch takes the value of num_buffered_values_ into account
    when calculating 'read_count', so we can deal with such files. With
    this patch we also include the column name in the "Couldn't skip rows"
    error message, so in the future it'll be easier to identify the
    problematic columns.
    
    Testing:
     * added Parquet file written by Impala 2.5 and e2e test for it
    
    Change-Id: I568fe59df720ea040be4926812412ba4c1510a26
    Reviewed-on: http://gerrit.cloudera.org:8080/18257
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 be/src/exec/parquet/hdfs-parquet-scanner.cc   |   9 ++++++---
 be/src/exec/parquet/parquet-column-readers.cc |   9 ++++++++-
 common/thrift/generate_error_codes.py         |   2 ++
 testdata/data/README                          |   3 +++
 testdata/data/too_many_def_levels.parquet     | Bin 0 -> 500880 bytes
 tests/query_test/test_scanners.py             |   9 +++++++++
 6 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/be/src/exec/parquet/hdfs-parquet-scanner.cc b/be/src/exec/parquet/hdfs-parquet-scanner.cc
index 87ad679..94f2b6e 100644
--- a/be/src/exec/parquet/hdfs-parquet-scanner.cc
+++ b/be/src/exec/parquet/hdfs-parquet-scanner.cc
@@ -2392,12 +2392,14 @@ Status HdfsParquetScanner::FillScratchMicroBatches(
       if (r == 0) {
         if (micro_batches[0].start > 0) {
           if (UNLIKELY(!col_reader->SkipRows(micro_batches[0].start, -1))) {
-            return Status(Substitute("Couldn't skip rows in file $0.", filename()));
+            return Status(ErrorMsg(TErrorCode::PARQUET_ROWS_SKIPPING,
+                col_reader->schema_element().name, filename()));
           }
         }
       } else {
         if (UNLIKELY(!col_reader->SkipRows(micro_batches[r].start - last - 1, -1))) {
-          return Status(Substitute("Couldn't skip rows in file $0.", filename()));
+          return Status(ErrorMsg(TErrorCode::PARQUET_ROWS_SKIPPING,
+              col_reader->schema_element().name, filename()));
         }
       }
       uint8_t* next_tuple_mem = scratch_batch_->tuple_mem
@@ -2430,7 +2432,8 @@ Status HdfsParquetScanner::FillScratchMicroBatches(
     }
     if (UNLIKELY(last < max_num_tuples - 1)) {
       if (UNLIKELY(!col_reader->SkipRows(max_num_tuples - 1 - last, -1))) {
-        return Status(Substitute("Couldn't skip rows in file $0.", filename()));
+        return Status(ErrorMsg(TErrorCode::PARQUET_ROWS_SKIPPING,
+            col_reader->schema_element().name, filename()));
       }
     }
   }
diff --git a/be/src/exec/parquet/parquet-column-readers.cc b/be/src/exec/parquet/parquet-column-readers.cc
index 3a8ac36..70e736e 100644
--- a/be/src/exec/parquet/parquet-column-readers.cc
+++ b/be/src/exec/parquet/parquet-column-readers.cc
@@ -1270,7 +1270,8 @@ Status BaseScalarColumnReader::StartPageFiltering() {
     int64_t skip_rows = range_start - current_row_ - 1;
     int64_t remaining = 0;
     if (!SkipTopLevelRows(skip_rows, &remaining)) {
-      return Status(Substitute("Couldn't skip rows in file $0.", filename()));
+      return Status(ErrorMsg(TErrorCode::PARQUET_ROWS_SKIPPING,
+          schema_element().name, filename()));
     }
     DCHECK_EQ(remaining, 0);
     DCHECK_EQ(current_row_, range_start - 1);
@@ -1308,12 +1309,18 @@ bool BaseScalarColumnReader::SkipTopLevelRows(int64_t num_rows, int64_t* remaini
       int repeated_run_length = def_levels_.NextRepeatedRunLength();
       if (repeated_run_length > 0) {
         int read_count = min<int64_t>(num_rows - i, repeated_run_length);
+        // IMPALA-11134: there can be a mismatch between page_header.num_values and
+        // encoded def levels in old Parquet files.
+        read_count = min(num_buffered_values_, read_count);
         int16_t def_level = def_levels_.GetRepeatedValue(read_count);
         if (def_level >= max_def_level_) num_values_to_skip += read_count;
         i += read_count;
         num_buffered_values_ -= read_count;
       } else if (def_levels_.CacheHasNext()) {
         int read_count = min<int64_t>(num_rows - i, def_levels_.CacheRemaining());
+        // IMPALA-11134: there can be a mismatch between page_header.num_values and
+        // encoded def levels in old Parquet files.
+        read_count = min(num_buffered_values_, read_count);
         for (int j = 0; j < read_count; ++j) {
           if (def_levels_.CacheGetNext() >= max_def_level_) ++num_values_to_skip;
         }
diff --git a/common/thrift/generate_error_codes.py b/common/thrift/generate_error_codes.py
index b4751c8..23e08a8 100755
--- a/common/thrift/generate_error_codes.py
+++ b/common/thrift/generate_error_codes.py
@@ -475,6 +475,8 @@ error_codes = (
   ("JWKS_PARSE_ERROR", 153, "Error parsing JWKS: $0."),
 
   ("JWT_VERIFY_FAILED", 154, "Error verifying JWT Token: $0."),
+
+  ("PARQUET_ROWS_SKIPPING", 155, "Couldn't skip rows in column '$0' in file '$1'."),
 )
 
 import sys
diff --git a/testdata/data/README b/testdata/data/README
index bfaa0c1..f687034 100644
--- a/testdata/data/README
+++ b/testdata/data/README
@@ -675,3 +675,6 @@ field ids.
 iceberg_test/hadoop_catalog/ice/airports_orc:
 Regular ORC table converted to Iceberg, which means that the data file doesn't contain
 field ids.
+
+too_many_def_levels.parquet:
+Written by Impala 2.5. The Parquet pages have more encoded def levels than num_values.
diff --git a/testdata/data/too_many_def_levels.parquet b/testdata/data/too_many_def_levels.parquet
new file mode 100644
index 0000000..fc00ab6
Binary files /dev/null and b/testdata/data/too_many_def_levels.parquet differ
diff --git a/tests/query_test/test_scanners.py b/tests/query_test/test_scanners.py
index f2b0c4d..33b433e 100644
--- a/tests/query_test/test_scanners.py
+++ b/tests/query_test/test_scanners.py
@@ -589,6 +589,15 @@ class TestParquet(ImpalaTestSuite):
     self.run_test_case('QueryTest/parquet-num-values-def-levels-mismatch',
         vector, unique_database)
 
+    """IMPALA-11134: Impala returns "Couldn't skip rows in file" error for old
+    (possibly corrupt) Parquet file where there are more def levels than num_values"""
+    create_table_from_parquet(self.client, unique_database,
+        "too_many_def_levels")
+    result = self.client.execute("select i_item_id from {0}."
+        "too_many_def_levels where i_item_sk = 350963".format(unique_database))
+    assert len(result.data) == 1
+    assert "AAAAAAAACPKFFAAA" in result.data
+
   @SkipIfS3.hdfs_block_size
   @SkipIfGCS.hdfs_block_size
   @SkipIfCOS.hdfs_block_size
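
To make the clamp in parquet-column-readers.cc above concrete, here is a hedged Python sketch (not the C++ reader itself; the function below is a simplified stand-in with no rep levels or null handling) of why limiting read_count to num_buffered_values_ keeps the counter non-negative when a page encodes more def levels than it advertises:

    # Hedged sketch: a page advertises `num_buffered_values` values, but in
    # some old Parquet files its def-level runs encode one extra entry.
    def skip_top_level_rows(num_rows, def_level_runs, num_buffered_values):
        i = 0
        for run_length in def_level_runs:
            if i >= num_rows or num_buffered_values == 0:
                break
            read_count = min(num_rows - i, run_length)
            # The IMPALA-11134 fix: never consume more values than the page
            # claims to contain, so the counter cannot go negative.
            read_count = min(num_buffered_values, read_count)
            i += read_count
            num_buffered_values -= read_count
        return num_buffered_values

    # A hypothetical page claiming 100 values but encoding a run of 101 def
    # levels: without the clamp the counter would end at -1 and trip the
    # DCHECK quoted in the commit message.
    assert skip_top_level_rows(101, [101], 100) == 0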

[impala] 03/03: IMPALA-11133: Decode author of a commit with utf8 before printing it

Posted by bo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

boroknagyz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 8e5a42fff432e7a81f0b403301f6f1cc789cd1a2
Author: Fang-Yu Rao <fa...@cloudera.com>
AuthorDate: Fri Feb 18 11:28:48 2022 -0800

    IMPALA-11133: Decode author of a commit with utf8 before printing it
    
    We found that compare_branches.py could fail if the author of a commit
    contains non-ASCII characters because the script attempts to print the
    field. This patch fixes the problem by explicitly decoding the value of
    author with the encoding 'utf8'. The commit message is also decoded with
    'utf8' to prevent similar problems from happening when there are
    non-ASCII characters in the commit message.
    
    Testing:
     - Manually verified that we won't get the UnicodeDecodeError after this
       patch.
    
    Change-Id: Ieb03b0937a994db2bf08e4199574d04f7fb99f5d
    Reviewed-on: http://gerrit.cloudera.org:8080/18256
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 bin/compare_branches.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/bin/compare_branches.py b/bin/compare_branches.py
index f57aad1..4027974 100755
--- a/bin/compare_branches.py
+++ b/bin/compare_branches.py
@@ -267,7 +267,8 @@ def main():
       logging.debug("NOT ignoring commit {0} since not in ignored commits ({1},{2})"
                    .format(commit_hash, options.source_branch, options.target_branch))
     if not change_in_target and not ignore_by_config and not ignore_by_commit_message:
-      print u'{0} {1} ({2}) - {3}'.format(commit_hash, msg, date, author)
+      print u'{0} {1} ({2}) - {3}'\
+          .format(commit_hash, msg.decode('utf8'), date, author.decode('utf8'))
       cherry_pick_hashes.append(commit_hash)
       jira_keys += jira_key_pat.findall(msg)
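
As a footnote to the compare_branches.py change above, a minimal Python sketch of the failure mode and the fix; the byte strings below are hypothetical values, not taken from a real commit:

    # `git log` output is read as raw bytes. Under Python 2, mixing such
    # bytes with a unicode format string triggers an implicit ASCII decode,
    # which raises UnicodeDecodeError when the author name contains
    # non-ASCII characters.
    author = b'J\xc3\xa1nos Example'  # hypothetical author, UTF-8 bytes
    msg = b'IMPALA-00000: hypothetical commit subject'

    # The fix mirrors the patch: decode explicitly with 'utf8' before
    # formatting and printing.
    print(u'{0} - {1}'.format(msg.decode('utf8'), author.decode('utf8')))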