You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@impala.apache.org by bo...@apache.org on 2022/02/23 10:03:32 UTC
[impala] branch master updated (fe04c50 -> 8e5a42f)
This is an automated email from the ASF dual-hosted git repository.
boroknagyz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git.
from fe04c50 IMPALA-10961: Implementing adaptive 3-way quicksort in sorter
new f0d4afa IMPALA-11144: fix testAggregationNodeGroupByCardinalityCapping
new b60ccab IMPALA-11134: Impala returns "Couldn't skip rows in file" error for old Parquet file
new 8e5a42f IMPALA-11133: Decode author of a commit with utf8 before printing it
The 3 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
Summary of changes:
be/src/exec/parquet/hdfs-parquet-scanner.cc | 9 ++++++---
be/src/exec/parquet/parquet-column-readers.cc | 9 ++++++++-
bin/compare_branches.py | 3 ++-
common/thrift/generate_error_codes.py | 2 ++
.../org/apache/impala/planner/CardinalityTest.java | 4 ++--
testdata/data/README | 3 +++
testdata/data/too_many_def_levels.parquet | Bin 0 -> 500880 bytes
tests/query_test/test_scanners.py | 9 +++++++++
8 files changed, 32 insertions(+), 7 deletions(-)
create mode 100644 testdata/data/too_many_def_levels.parquet
[impala] 01/03: IMPALA-11144: fix testAggregationNodeGroupByCardinalityCapping
Posted by bo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.
boroknagyz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git
commit f0d4afaff1e8ed93834f5198046d137bcd03a843
Author: noemi <np...@cloudera.com>
AuthorDate: Tue Feb 22 13:06:10 2022 +0100
IMPALA-11144: fix testAggregationNodeGroupByCardinalityCapping
IMPALA-10961 can decrease the size of the files in
functional_parquet.alltypes, which led to breaking the test that
rely on the file size.
Lowered the expected value, so it fits in the current tolerance range
of 0,05.
See the jira for more detailed analysis
Change-Id: I7956db444549c02fab3e56cb8bd281535f679776
Reviewed-on: http://gerrit.cloudera.org:8080/18264
Reviewed-by: Csaba Ringhofer <cs...@cloudera.com>
Reviewed-by: Fang-Yu Rao <fa...@cloudera.com>
Tested-by: Csaba Ringhofer <cs...@cloudera.com>
---
fe/src/test/java/org/apache/impala/planner/CardinalityTest.java | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java b/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
index e9b6eaa..067d838 100644
--- a/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
+++ b/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
@@ -366,11 +366,11 @@ public class CardinalityTest extends PlannerTestBase {
// The estimated number of rows caps the output cardinality.
verifyApproxCardinality(
"select distinct id, int_col from functional_parquet.alltypes",
- 12760, true, ImmutableSet.of(),
+ 12400, true, ImmutableSet.of(),
pathToFirstAggregationNode, AggregationNode.class);
verifyApproxCardinality(
"select distinct id, int_col from functional_parquet.alltypes",
- 12760, true, ImmutableSet.of(),
+ 12400, true, ImmutableSet.of(),
pathToSecondAggregationNode, AggregationNode.class);
// No column stats available and row estimation disabled - no estimate is possible.
verifyCardinality(
[impala] 02/03: IMPALA-11134: Impala returns "Couldn't skip rows in file" error for old Parquet file
Posted by bo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.
boroknagyz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git
commit b60ccabd5b6f09842284c657b910bb65d2f30fe8
Author: Zoltan Borok-Nagy <bo...@cloudera.com>
AuthorDate: Sat Feb 19 18:21:49 2022 +0100
IMPALA-11134: Impala returns "Couldn't skip rows in file" error for old Parquet file
Impala returns "Couldn't skip rows in file" error for old Parquet
file written by an old Impala (e.g. Impala 2.5, 2.6) In DEBUG build
Impala crashes by a DCHECK:
Check failed: num_buffered_values_ > 0 (-1 vs. 0)
The problem is that in some old Parquet files there can be a mismatch
between 'num_values' in a page and the encoded def/rep levels.
There is usually one more def/rep levels encoded in these files.
In SkipTopLevelRows() we skipped values based on how many def levels are
https://github.com/apache/impala/blob/92ce6fe48e75d7780efe9a275122554e59aac916/be/src/exec/parquet/parquet-column-readers.cc#L1308-L1314
Since there are more def levels than values in some old files,
num_buferred_values_ could become negative.
This patch also takes the value of num_buferred_values_ into account
when calculating 'read_count', so we can deal with such files. With
this patch we also include the column name in the "Couldn't skip rows"
error message, so in the future it'll be easier to identify the
problematic columns.
Testing:
* added Parquet file written by Impala 2.5 and e2e test for it
Change-Id: I568fe59df720ea040be4926812412ba4c1510a26
Reviewed-on: http://gerrit.cloudera.org:8080/18257
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
be/src/exec/parquet/hdfs-parquet-scanner.cc | 9 ++++++---
be/src/exec/parquet/parquet-column-readers.cc | 9 ++++++++-
common/thrift/generate_error_codes.py | 2 ++
testdata/data/README | 3 +++
testdata/data/too_many_def_levels.parquet | Bin 0 -> 500880 bytes
tests/query_test/test_scanners.py | 9 +++++++++
6 files changed, 28 insertions(+), 4 deletions(-)
diff --git a/be/src/exec/parquet/hdfs-parquet-scanner.cc b/be/src/exec/parquet/hdfs-parquet-scanner.cc
index 87ad679..94f2b6e 100644
--- a/be/src/exec/parquet/hdfs-parquet-scanner.cc
+++ b/be/src/exec/parquet/hdfs-parquet-scanner.cc
@@ -2392,12 +2392,14 @@ Status HdfsParquetScanner::FillScratchMicroBatches(
if (r == 0) {
if (micro_batches[0].start > 0) {
if (UNLIKELY(!col_reader->SkipRows(micro_batches[0].start, -1))) {
- return Status(Substitute("Couldn't skip rows in file $0.", filename()));
+ return Status(ErrorMsg(TErrorCode::PARQUET_ROWS_SKIPPING,
+ col_reader->schema_element().name, filename()));
}
}
} else {
if (UNLIKELY(!col_reader->SkipRows(micro_batches[r].start - last - 1, -1))) {
- return Status(Substitute("Couldn't skip rows in file $0.", filename()));
+ return Status(ErrorMsg(TErrorCode::PARQUET_ROWS_SKIPPING,
+ col_reader->schema_element().name, filename()));
}
}
uint8_t* next_tuple_mem = scratch_batch_->tuple_mem
@@ -2430,7 +2432,8 @@ Status HdfsParquetScanner::FillScratchMicroBatches(
}
if (UNLIKELY(last < max_num_tuples - 1)) {
if (UNLIKELY(!col_reader->SkipRows(max_num_tuples - 1 - last, -1))) {
- return Status(Substitute("Couldn't skip rows in file $0.", filename()));
+ return Status(ErrorMsg(TErrorCode::PARQUET_ROWS_SKIPPING,
+ col_reader->schema_element().name, filename()));
}
}
}
diff --git a/be/src/exec/parquet/parquet-column-readers.cc b/be/src/exec/parquet/parquet-column-readers.cc
index 3a8ac36..70e736e 100644
--- a/be/src/exec/parquet/parquet-column-readers.cc
+++ b/be/src/exec/parquet/parquet-column-readers.cc
@@ -1270,7 +1270,8 @@ Status BaseScalarColumnReader::StartPageFiltering() {
int64_t skip_rows = range_start - current_row_ - 1;
int64_t remaining = 0;
if (!SkipTopLevelRows(skip_rows, &remaining)) {
- return Status(Substitute("Couldn't skip rows in file $0.", filename()));
+ return Status(ErrorMsg(TErrorCode::PARQUET_ROWS_SKIPPING,
+ schema_element().name, filename()));
}
DCHECK_EQ(remaining, 0);
DCHECK_EQ(current_row_, range_start - 1);
@@ -1308,12 +1309,18 @@ bool BaseScalarColumnReader::SkipTopLevelRows(int64_t num_rows, int64_t* remaini
int repeated_run_length = def_levels_.NextRepeatedRunLength();
if (repeated_run_length > 0) {
int read_count = min<int64_t>(num_rows - i, repeated_run_length);
+ // IMPALA-11134: there can be a mismatch between page_header.num_values and
+ // encoded def levels in old Parquet files.
+ read_count = min(num_buffered_values_, read_count);
int16_t def_level = def_levels_.GetRepeatedValue(read_count);
if (def_level >= max_def_level_) num_values_to_skip += read_count;
i += read_count;
num_buffered_values_ -= read_count;
} else if (def_levels_.CacheHasNext()) {
int read_count = min<int64_t>(num_rows - i, def_levels_.CacheRemaining());
+ // IMPALA-11134: there can be a mismatch between page_header.num_values and
+ // encoded def levels in old Parquet files.
+ read_count = min(num_buffered_values_, read_count);
for (int j = 0; j < read_count; ++j) {
if (def_levels_.CacheGetNext() >= max_def_level_) ++num_values_to_skip;
}
diff --git a/common/thrift/generate_error_codes.py b/common/thrift/generate_error_codes.py
index b4751c8..23e08a8 100755
--- a/common/thrift/generate_error_codes.py
+++ b/common/thrift/generate_error_codes.py
@@ -475,6 +475,8 @@ error_codes = (
("JWKS_PARSE_ERROR", 153, "Error parsing JWKS: $0."),
("JWT_VERIFY_FAILED", 154, "Error verifying JWT Token: $0."),
+
+ ("PARQUET_ROWS_SKIPPING", 155, "Couldn't skip rows in column '$0' in file '$1'."),
)
import sys
diff --git a/testdata/data/README b/testdata/data/README
index bfaa0c1..f687034 100644
--- a/testdata/data/README
+++ b/testdata/data/README
@@ -675,3 +675,6 @@ field ids.
iceberg_test/hadoop_catalog/ice/airports_orc:
Regular ORC table converted to Iceberg, which means that the data file doesn't contain
field ids.
+
+too_many_def_levels.parquet:
+Written by Impala 2.5. The Parquet pages have more encoded def levels than num_values.
diff --git a/testdata/data/too_many_def_levels.parquet b/testdata/data/too_many_def_levels.parquet
new file mode 100644
index 0000000..fc00ab6
Binary files /dev/null and b/testdata/data/too_many_def_levels.parquet differ
diff --git a/tests/query_test/test_scanners.py b/tests/query_test/test_scanners.py
index f2b0c4d..33b433e 100644
--- a/tests/query_test/test_scanners.py
+++ b/tests/query_test/test_scanners.py
@@ -589,6 +589,15 @@ class TestParquet(ImpalaTestSuite):
self.run_test_case('QueryTest/parquet-num-values-def-levels-mismatch',
vector, unique_database)
+ """IMPALA-11134: Impala returns "Couldn't skip rows in file" error for old
+ (possibly corrupt) Parquet file where there are more def levels than num_values"""
+ create_table_from_parquet(self.client, unique_database,
+ "too_many_def_levels")
+ result = self.client.execute("select i_item_id from {0}."
+ "too_many_def_levels where i_item_sk = 350963".format(unique_database))
+ assert len(result.data) == 1
+ assert "AAAAAAAACPKFFAAA" in result.data
+
@SkipIfS3.hdfs_block_size
@SkipIfGCS.hdfs_block_size
@SkipIfCOS.hdfs_block_size
[impala] 03/03: IMPALA-11133: Decode author of a commit with utf8 before printing it
Posted by bo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.
boroknagyz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git
commit 8e5a42fff432e7a81f0b403301f6f1cc789cd1a2
Author: Fang-Yu Rao <fa...@cloudera.com>
AuthorDate: Fri Feb 18 11:28:48 2022 -0800
IMPALA-11133: Decode author of a commit with utf8 before printing it
We found that compare_branches.py could fail if the author of a commit
contains non-ASCII characters because the script attempts to print the
field. This patch fixes the problem by explicitly decoding the value of
author with the encoding 'utf8'. The commit message is also decoded with
'utf8' to prevent similar problems from happening when there are
non-ASCII characters in the commit message.
Testing:
- Manually verified that we won't get the UnicodeDecodeError after this
patch.
Change-Id: Ieb03b0937a994db2bf08e4199574d04f7fb99f5d
Reviewed-on: http://gerrit.cloudera.org:8080/18256
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
bin/compare_branches.py | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/bin/compare_branches.py b/bin/compare_branches.py
index f57aad1..4027974 100755
--- a/bin/compare_branches.py
+++ b/bin/compare_branches.py
@@ -267,7 +267,8 @@ def main():
logging.debug("NOT ignoring commit {0} since not in ignored commits ({1},{2})"
.format(commit_hash, options.source_branch, options.target_branch))
if not change_in_target and not ignore_by_config and not ignore_by_commit_message:
- print u'{0} {1} ({2}) - {3}'.format(commit_hash, msg, date, author)
+ print u'{0} {1} ({2}) - {3}'\
+ .format(commit_hash, msg.decode('utf8'), date, author.decode('utf8'))
cherry_pick_hashes.append(commit_hash)
jira_keys += jira_key_pat.findall(msg)