Posted to commits@impala.apache.org by st...@apache.org on 2022/08/08 23:36:57 UTC

[impala] branch branch-4.1.1 created (now 6fc556be5)

This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a change to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git


      at 6fc556be5 Update version to 4.1.1-SNAPSHOT

This branch includes the following new commits:

     new 964c55f68 IMPALA-11275: log thread info during minidump
     new 019938efa IMPALA-11291: deflake minidump-test by not starting JVM
     new 4879f49bb IMPALA-11208: Fix uninitialized counter of CollectionItemsRead in orc-scanner
     new e5164c89e IMPALA-11301: Fix extreme != selectivity for NDV=1
     new 41a2352a7 IMPALA-11305: Fix TypeError in impala-shell summary progress
     new dc0370875 IMPALA-11274: CNF Rewrite causes a regress in join node performance
     new f6a870ccd IMPALA-11325: Fix UnicodeDecodeError for shell file output
     new f7f190912 IMPALA-11332: Fix trailing whitespace for CSV output
     new b2e05224b IMPALA-11323: Don't evaluate constants-only inferred predicates
     new b25aef4e0 IMPALA-5845: Limit the number of non-fatal errors logging to INFO
     new 8a04e0d3d IMPALA-11335: allocate WriteId before taking locks during INSERT to ACID tables
     new 40ceefa9a IMPALA-11302, IMPALA-11303: Fix error handling of external Iceberg tables
     new 67894f972 IMPALA-11280: Join node incorrectly picks up unnest(array) predicates
     new 328eedf96 IMPALA-11337: Flush row output before writing "Fetched X row(s)"
     new 407072f1d IMPALA-11358: Fixed Kudu table's missing comment
     new 822fcf327 IMPALA-11365: Dereferencing null pointer in TopNNode
     new 7e40c2887 IMPALA-11368: Iceberg time-travel error message should show timestamp in local timezone
     new 3088c3746 IMPALA-11416: SlotRef::tuple_is_nullable_ uninitialised for struct children
     new 1cdb5122a IMPALA-11412: CodegenFnPtr<FuncType>::store() has a compile time error when instantiated
     new 83e9cfc1d IMPALA-11414: Off-by-one error in Parquet late materialization
     new f767619dc IMPALA-11406: Fix incorrect duration log for authorization
     new 06bdfa3ba IMPALA-10267: Properly handle continued scanning after parse error in HdfsAvroScanner
     new 07e3c76dd IMPALA-10267: Addendum (Properly handle continued scanning after parse error in HdfsAvroScanner)
     new 2df70759c IMPALA-11445: Fix bug in firing insert event of partitions located in different FS
     new 57ac65871 IMPALA-11443: Fix partitoned top-n with -1 input cardinality
     new 0766bc9c6 IMPALA-11434: Fix analysis of multiple more than 1d arrays in select list
     new 6fc556be5 Update version to 4.1.1-SNAPSHOT

The 27 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.



[impala] 03/27: IMPALA-11208: Fix uninitialized counter of CollectionItemsRead in orc-scanner

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 4879f49bb59f944d846650b0b06e1abb95c73646
Author: stiga-huang <hu...@gmail.com>
AuthorDate: Sat May 14 14:48:43 2022 +0800

    IMPALA-11208: Fix uninitialized counter of CollectionItemsRead in orc-scanner
    
    CollectionItemsRead in the runtime profile counts the total number of
    nested collection items read by the scan node. It is only created for
    scans that support nested types, e.g. Parquet or ORC.
    
    Each scanner thread maintains its own local counter and merges it into
    the HdfsScanNode counter for each row batch. However, the local counter
    in orc-scanner is uninitialized, leading to garbage values. This patch
    simply initializes it to 0 and adds test coverage.
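    
    As an illustrative sketch only (not the actual C++ scanner code; the
    names below are invented), the per-thread counter pattern looks roughly
    like this in Python, with the explicit initialization to 0 being the
    key point:
    
      # Hypothetical sketch of the local-counter pattern described above.
      class ScanNode:
          def __init__(self):
              self.collection_items_read = 0   # node-level counter shared by scanners
      
      class Scanner:
          def __init__(self):
              self.coll_items_read_counter = 0   # local counter; must start at 0
      
          def assemble_rows(self, scan_node, items_in_batch):
              self.coll_items_read_counter += items_in_batch
              # merge into the node-level counter at the end of the batch, then reset
              scan_node.collection_items_read += self.coll_items_read_counter
              self.coll_items_read_counter = 0
      
      node = ScanNode()
      Scanner().assemble_rows(node, 9)
      print(node.collection_items_read)   # prints 9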
    
    Tests:
    Add profile verification for this counter on some existing query tests.
    Note that there are some implementation differences between the Parquet
    and ORC scanners (e.g. in predicate pushdown), so we will see different
    counter results for some queries. I just picked queries that have
    consistent counters.
    
    Change-Id: Id7783d1460ac9b98e94d3a31028b43f5a9884f99
    Reviewed-on: http://gerrit.cloudera.org:8080/18528
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 be/src/exec/hdfs-orc-scanner.h                                    | 2 +-
 .../QueryTest/nested-types-scanner-array-materialization.test     | 8 ++++++++
 .../functional-query/queries/QueryTest/nested-types-tpch.test     | 2 ++
 testdata/workloads/tpch/queries/tpch-q1.test                      | 4 +++-
 testdata/workloads/tpch/queries/tpch-q10.test                     | 2 ++
 testdata/workloads/tpch/queries/tpch-q11.test                     | 4 +++-
 testdata/workloads/tpch/queries/tpch-q12.test                     | 4 +++-
 testdata/workloads/tpch/queries/tpch-q13.test                     | 4 +++-
 testdata/workloads/tpch/queries/tpch-q14.test                     | 4 +++-
 testdata/workloads/tpch/queries/tpch-q15.test                     | 2 ++
 testdata/workloads/tpch/queries/tpch-q16.test                     | 4 +++-
 testdata/workloads/tpch/queries/tpch-q17.test                     | 4 +++-
 testdata/workloads/tpch/queries/tpch-q18.test                     | 2 ++
 testdata/workloads/tpch/queries/tpch-q19.test                     | 4 +++-
 testdata/workloads/tpch/queries/tpch-q2.test                      | 4 +++-
 testdata/workloads/tpch/queries/tpch-q20.test                     | 4 +++-
 testdata/workloads/tpch/queries/tpch-q21.test                     | 4 +++-
 testdata/workloads/tpch/queries/tpch-q22.test                     | 4 +++-
 testdata/workloads/tpch/queries/tpch-q3.test                      | 2 ++
 testdata/workloads/tpch/queries/tpch-q4.test                      | 4 +++-
 testdata/workloads/tpch/queries/tpch-q5.test                      | 4 +++-
 testdata/workloads/tpch/queries/tpch-q6.test                      | 4 +++-
 testdata/workloads/tpch/queries/tpch-q7.test                      | 4 +++-
 testdata/workloads/tpch/queries/tpch-q8.test                      | 4 +++-
 testdata/workloads/tpch/queries/tpch-q9.test                      | 4 +++-
 25 files changed, 73 insertions(+), 19 deletions(-)

diff --git a/be/src/exec/hdfs-orc-scanner.h b/be/src/exec/hdfs-orc-scanner.h
index 477f0b7b5..a446c1998 100644
--- a/be/src/exec/hdfs-orc-scanner.h
+++ b/be/src/exec/hdfs-orc-scanner.h
@@ -299,7 +299,7 @@ class HdfsOrcScanner : public HdfsColumnarScanner {
   /// used to reduce the frequency of updating HdfsScanNode counter. It is updated by the
   /// callees of AssembleRows() and is merged into the HdfsScanNode counter at the end of
   /// AssembleRows() and then is reset to 0.
-  int64_t coll_items_read_counter_;
+  int64_t coll_items_read_counter_ = 0;
 
   const char *filename() const { return metadata_range_->file(); }
 
diff --git a/testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-array-materialization.test b/testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-array-materialization.test
index bef001671..ef3ce090a 100644
--- a/testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-array-materialization.test
+++ b/testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-array-materialization.test
@@ -15,6 +15,10 @@ select id, a.item from complextypestbl t, t.int_array a
 8,-1
 ---- TYPES
 bigint,int
+---- RUNTIME_PROFILE
+row_regex: .*CollectionItemsRead: 5 \(5\).*
+row_regex: .*CollectionItemsRead: 1 \(1\).*
+row_regex: .*CollectionItemsRead: 9 \(9\).*
 ====
 ---- QUERY
 -- Materialize array (for now, may be optimized away someday)
@@ -32,6 +36,10 @@ NULL
 -1
 ---- TYPES
 int
+---- RUNTIME_PROFILE
+row_regex: .*CollectionItemsRead: 5 \(5\).*
+row_regex: .*CollectionItemsRead: 1 \(1\).*
+row_regex: .*CollectionItemsRead: 9 \(9\).*
 ====
 ---- QUERY
 -- Materialize scalar and array
diff --git a/testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test b/testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
index dcfc1a980..26a5cf40c 100644
--- a/testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
+++ b/testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
@@ -15,6 +15,8 @@ from customer c, c.c_orders o, o.o_lineitems i
 6001215
 ---- TYPES
 bigint
+---- RUNTIME_PROFILE
+row_regex: .*CollectionItemsRead: 2.50M \(2500405\).*
 ====
 ---- QUERY
 select count(l_linenumber)
diff --git a/testdata/workloads/tpch/queries/tpch-q1.test b/testdata/workloads/tpch/queries/tpch-q1.test
index 08cb7ccbd..94ed5034a 100644
--- a/testdata/workloads/tpch/queries/tpch-q1.test
+++ b/testdata/workloads/tpch/queries/tpch-q1.test
@@ -29,4 +29,6 @@ order by
 'R','F',37719753.00,56568041380.90,53741292684.6040,55889619119.831932,25.505794,38250.854626,0.050009,1478870
 ---- TYPES
 string, string, decimal, decimal, decimal, decimal, decimal, decimal, decimal, bigint
-====
\ No newline at end of file
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
+====
diff --git a/testdata/workloads/tpch/queries/tpch-q10.test b/testdata/workloads/tpch/queries/tpch-q10.test
index b87748733..2693d1889 100644
--- a/testdata/workloads/tpch/queries/tpch-q10.test
+++ b/testdata/workloads/tpch/queries/tpch-q10.test
@@ -59,4 +59,6 @@ limit 20
 23431,'Customer#000023431',554269.5360,3381.86,'ROMANIA','HgiV0phqhaIa9aydNoIlb','29-915-458-2654','nusual, even instructions: furiously stealthy n'
 ---- TYPES
 bigint, string, decimal, decimal, string, string, string, string
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
 ====
diff --git a/testdata/workloads/tpch/queries/tpch-q11.test b/testdata/workloads/tpch/queries/tpch-q11.test
index 7e87b6974..1d5bfc86e 100644
--- a/testdata/workloads/tpch/queries/tpch-q11.test
+++ b/testdata/workloads/tpch/queries/tpch-q11.test
@@ -1085,4 +1085,6 @@ order by
 5182,7874521.73
 ---- TYPES
 BIGINT, decimal
-====
\ No newline at end of file
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
+====
diff --git a/testdata/workloads/tpch/queries/tpch-q12.test b/testdata/workloads/tpch/queries/tpch-q12.test
index ce4080326..ca86c318f 100644
--- a/testdata/workloads/tpch/queries/tpch-q12.test
+++ b/testdata/workloads/tpch/queries/tpch-q12.test
@@ -34,4 +34,6 @@ order by
 'SHIP',6200,9262
 ---- TYPES
 string, bigint, bigint
-====
\ No newline at end of file
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
+====
diff --git a/testdata/workloads/tpch/queries/tpch-q13.test b/testdata/workloads/tpch/queries/tpch-q13.test
index d09613b4b..fc131fc70 100644
--- a/testdata/workloads/tpch/queries/tpch-q13.test
+++ b/testdata/workloads/tpch/queries/tpch-q13.test
@@ -66,4 +66,6 @@ order by
 39,1
 ---- TYPES
 BIGINT, BIGINT
-====
\ No newline at end of file
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
+====
diff --git a/testdata/workloads/tpch/queries/tpch-q14.test b/testdata/workloads/tpch/queries/tpch-q14.test
index 8fb41b4af..29417cc7e 100644
--- a/testdata/workloads/tpch/queries/tpch-q14.test
+++ b/testdata/workloads/tpch/queries/tpch-q14.test
@@ -18,4 +18,6 @@ where
 16.380779
 ---- TYPES
 decimal
-====
\ No newline at end of file
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
+====
diff --git a/testdata/workloads/tpch/queries/tpch-q15.test b/testdata/workloads/tpch/queries/tpch-q15.test
index 19c2909a8..b34624925 100644
--- a/testdata/workloads/tpch/queries/tpch-q15.test
+++ b/testdata/workloads/tpch/queries/tpch-q15.test
@@ -35,4 +35,6 @@ order by
 8449,'Supplier#000008449','Wp34zim9qYFbVctdW','20-469-856-8873',1772627.2087
 ---- TYPES
 BIGINT, STRING, STRING, STRING, DECIMAL
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
 ====
diff --git a/testdata/workloads/tpch/queries/tpch-q16.test b/testdata/workloads/tpch/queries/tpch-q16.test
index 20f65df55..0dda01260 100644
--- a/testdata/workloads/tpch/queries/tpch-q16.test
+++ b/testdata/workloads/tpch/queries/tpch-q16.test
@@ -18348,4 +18348,6 @@ order by
 'Brand#55','STANDARD PLATED TIN',49,3
 ---- TYPES
 STRING, STRING, INT, BIGINT
-====
\ No newline at end of file
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
+====
diff --git a/testdata/workloads/tpch/queries/tpch-q17.test b/testdata/workloads/tpch/queries/tpch-q17.test
index ab4a62ee8..dab9a0e09 100644
--- a/testdata/workloads/tpch/queries/tpch-q17.test
+++ b/testdata/workloads/tpch/queries/tpch-q17.test
@@ -22,4 +22,6 @@ where
 348406.054286
 ---- TYPES
 decimal
-====
\ No newline at end of file
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
+====
diff --git a/testdata/workloads/tpch/queries/tpch-q18.test b/testdata/workloads/tpch/queries/tpch-q18.test
index bdf74dac6..82629b910 100644
--- a/testdata/workloads/tpch/queries/tpch-q18.test
+++ b/testdata/workloads/tpch/queries/tpch-q18.test
@@ -97,4 +97,6 @@ limit 100
 'Customer#000088703',88703,2995076,'1994-01-30',363812.12,302.00
 ---- TYPES
 STRING, BIGINT, BIGINT, STRING, DECIMAL, DECIMAL
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
 ====
diff --git a/testdata/workloads/tpch/queries/tpch-q19.test b/testdata/workloads/tpch/queries/tpch-q19.test
index c342f215e..caac7a16d 100644
--- a/testdata/workloads/tpch/queries/tpch-q19.test
+++ b/testdata/workloads/tpch/queries/tpch-q19.test
@@ -40,4 +40,6 @@ where
 3083843.0578
 ---- TYPES
 decimal
-====
\ No newline at end of file
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
+====
diff --git a/testdata/workloads/tpch/queries/tpch-q2.test b/testdata/workloads/tpch/queries/tpch-q2.test
index 3f8f77d3f..7d40d48c1 100644
--- a/testdata/workloads/tpch/queries/tpch-q2.test
+++ b/testdata/workloads/tpch/queries/tpch-q2.test
@@ -148,4 +148,6 @@ limit 100
 7843.52,'Supplier#000006683','FRANCE',11680,'Manufacturer#4','2Z0JGkiv01Y00oCFwUGfviIbhzCdy','16-464-517-8943',' express, final pinto beans x-ray slyly asymptotes. unusual, unusual'
 ---- TYPES
 DECIMAL, STRING, STRING, BIGINT, STRING, STRING, STRING, STRING
-====
\ No newline at end of file
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
+====
diff --git a/testdata/workloads/tpch/queries/tpch-q20.test b/testdata/workloads/tpch/queries/tpch-q20.test
index 64568a359..69e699485 100644
--- a/testdata/workloads/tpch/queries/tpch-q20.test
+++ b/testdata/workloads/tpch/queries/tpch-q20.test
@@ -226,4 +226,6 @@ order by
 'Supplier#000009974','7wJ,J5DKcxSU4Kp1cQLpbcAvB5AsvKT'
 ---- TYPES
 string, string
-====
\ No newline at end of file
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
+====
diff --git a/testdata/workloads/tpch/queries/tpch-q21.test b/testdata/workloads/tpch/queries/tpch-q21.test
index 3fbadc495..486b33ee2 100644
--- a/testdata/workloads/tpch/queries/tpch-q21.test
+++ b/testdata/workloads/tpch/queries/tpch-q21.test
@@ -144,4 +144,6 @@ limit 100
 'Supplier#000002483',12
 ---- TYPES
 string, bigint
-====
\ No newline at end of file
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
+====
diff --git a/testdata/workloads/tpch/queries/tpch-q22.test b/testdata/workloads/tpch/queries/tpch-q22.test
index 1299ee160..93b00f016 100644
--- a/testdata/workloads/tpch/queries/tpch-q22.test
+++ b/testdata/workloads/tpch/queries/tpch-q22.test
@@ -45,4 +45,6 @@ order by
 '31',922,6806670.18
 ---- TYPES
 string, bigint, decimal
-====
\ No newline at end of file
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
+====
diff --git a/testdata/workloads/tpch/queries/tpch-q3.test b/testdata/workloads/tpch/queries/tpch-q3.test
index 9b57607d0..f9fc981d1 100644
--- a/testdata/workloads/tpch/queries/tpch-q3.test
+++ b/testdata/workloads/tpch/queries/tpch-q3.test
@@ -39,4 +39,6 @@ limit 10
 2300070,367371.1452,'1995-03-13',0
 ---- TYPES
 BIGINT, DECIMAL, STRING, INT
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
 ====
diff --git a/testdata/workloads/tpch/queries/tpch-q4.test b/testdata/workloads/tpch/queries/tpch-q4.test
index 00c3a87d5..69095bad7 100644
--- a/testdata/workloads/tpch/queries/tpch-q4.test
+++ b/testdata/workloads/tpch/queries/tpch-q4.test
@@ -30,4 +30,6 @@ order by
 '5-LOW',10487
 ---- TYPES
 string, bigint
-====
\ No newline at end of file
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
+====
diff --git a/testdata/workloads/tpch/queries/tpch-q5.test b/testdata/workloads/tpch/queries/tpch-q5.test
index d05d29a52..311c8690d 100644
--- a/testdata/workloads/tpch/queries/tpch-q5.test
+++ b/testdata/workloads/tpch/queries/tpch-q5.test
@@ -33,4 +33,6 @@ order by
 'JAPAN',45410175.6954
 ---- TYPES
 string, decimal
-====
\ No newline at end of file
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
+====
diff --git a/testdata/workloads/tpch/queries/tpch-q6.test b/testdata/workloads/tpch/queries/tpch-q6.test
index 1642d0e2f..a4cf082dc 100644
--- a/testdata/workloads/tpch/queries/tpch-q6.test
+++ b/testdata/workloads/tpch/queries/tpch-q6.test
@@ -14,4 +14,6 @@ where
 123141078.2283
 ---- TYPES
 decimal
-====
\ No newline at end of file
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
+====
diff --git a/testdata/workloads/tpch/queries/tpch-q7.test b/testdata/workloads/tpch/queries/tpch-q7.test
index 7c08522f4..e596743d6 100644
--- a/testdata/workloads/tpch/queries/tpch-q7.test
+++ b/testdata/workloads/tpch/queries/tpch-q7.test
@@ -46,4 +46,6 @@ order by
 'GERMANY','FRANCE',1996,52520549.0224
 ---- TYPES
 STRING, STRING, INT, DECIMAL
-====
\ No newline at end of file
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
+====
diff --git a/testdata/workloads/tpch/queries/tpch-q8.test b/testdata/workloads/tpch/queries/tpch-q8.test
index 01ed84386..54a53b415 100644
--- a/testdata/workloads/tpch/queries/tpch-q8.test
+++ b/testdata/workloads/tpch/queries/tpch-q8.test
@@ -43,4 +43,6 @@ order by
 1996,0.041486
 ---- TYPES
 int, decimal
-====
\ No newline at end of file
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
+====
diff --git a/testdata/workloads/tpch/queries/tpch-q9.test b/testdata/workloads/tpch/queries/tpch-q9.test
index 78981da0c..8b45bdad5 100644
--- a/testdata/workloads/tpch/queries/tpch-q9.test
+++ b/testdata/workloads/tpch/queries/tpch-q9.test
@@ -210,4 +210,6 @@ order by
 'VIETNAM',1992,47846355.6485
 ---- TYPES
 STRING, INT, DECIMAL
-====
\ No newline at end of file
+---- RUNTIME_PROFILE
+!row_regex: .*CollectionItemsRead: [^0].*
+====


[impala] 27/27: Update version to 4.1.1-SNAPSHOT

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 6fc556be5192bb770d0621b6bdcf2a5c1f18be01
Author: stiga-huang <hu...@gmail.com>
AuthorDate: Mon Aug 1 17:02:45 2022 +0800

    Update version to 4.1.1-SNAPSHOT
    
    Change-Id: Id7e5821897d98dcc7ecb0357dfe091587866e506
---
 bin/save-version.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/bin/save-version.sh b/bin/save-version.sh
index 972f29ba6..017a237e0 100755
--- a/bin/save-version.sh
+++ b/bin/save-version.sh
@@ -21,8 +21,8 @@
 # Note: for internal (aka pre-release) versions, the version should have
 # "-INTERNAL" appended. Parts of the code will look for this to distinguish
 # between released and internal versions.
-VERSION=4.1.0-RELEASE
-GIT_HASH=b27a701fc02c73ea561ccee5dd587c8eb0c0245a
+VERSION=4.1.1-SNAPSHOT
+GIT_HASH=$(git rev-parse HEAD 2> /dev/null)
 if [ -z $GIT_HASH ]
 then
   GIT_HASH="Could not obtain git hash"


[impala] 06/27: IMPALA-11274: CNF Rewrite causes a regress in join node performance

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit dc0370875c6c60095f253c4fe4aa98ff4b7ef94d
Author: Michael Smith <mi...@cloudera.com>
AuthorDate: Thu May 12 17:01:21 2022 -0700

    IMPALA-11274: CNF Rewrite causes a regress in join node performance
    
    This patch defines a subset of all predicates that are common and
    relatively inexpensive to compute. Such predicates must involve
    columns, constants, simple math or cast functions only.
    
    Examples of the subset of the predicates allowed:
    
      1. (a = 1 AND cast(b as int) = 2) OR (c = d AND e = f)
      2. a in ('1', '2', '3') OR ((b = 'abc') AND (c = d))
      3. (a between 1 and 100) OR ((b is null) AND (c = d))
    
    Examples of the predicates not allowed:
    
      1. ((upper(a) != 'Y') AND b = 2) OR (c = d AND e = f)
      2. ((coalesce(CAST(a AS string), '') = '') AND b = 2) OR
         (c = d AND e = f)
    
    This patch further restricts the predicates eligible for conversion to
    conjunctive normal form (CNF) to this subset, with the aim of reducing
    the run-time evaluation overhead of CNF rewrites, in which some of the
    predicates can be duplicated.
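    
    To illustrate why duplication occurs (this example is not taken from
    the patch itself), a disjunction such as
    
      (a = 1 AND b = 2) OR (c = d)
    
    expands in CNF to
    
      (a = 1 OR c = d) AND (b = 2 OR c = d)
    
    so c = d is evaluated twice at run time unless it is cheap to compute.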
    
    Uses a cache in branching expressions to avoid visiting the entire
    subtree on each call to applyRuleBottomUp. Skips cache complexity on
    casts as they don't branch and are unlikely to be deeply nested.
    
    Testing:
    - New expression writer tests
    - New planner tests
    
    Change-Id: I326406c6b004fe31ec0e2a2f390a3845b8925aa9
    Reviewed-on: http://gerrit.cloudera.org:8080/18458
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 common/function-registry/impala_functions.py       |  2 +
 .../org/apache/impala/analysis/ArithmeticExpr.java | 24 ++++++
 .../java/org/apache/impala/analysis/CastExpr.java  |  6 ++
 .../apache/impala/analysis/ExistsPredicate.java    |  4 +
 .../main/java/org/apache/impala/analysis/Expr.java |  7 ++
 .../apache/impala/analysis/FunctionCallExpr.java   | 44 +++++++++++
 .../impala/analysis/IsNotEmptyPredicate.java       |  4 +
 .../org/apache/impala/analysis/LikePredicate.java  |  4 +
 .../java/org/apache/impala/analysis/Predicate.java | 25 ++++++
 .../java/org/apache/impala/analysis/SlotRef.java   | 10 ++-
 .../impala/analysis/TupleIsNullPredicate.java      |  4 +
 .../apache/impala/rewrite/ConvertToCNFRule.java    |  9 +++
 .../apache/impala/analysis/ExprRewriterTest.java   | 51 +++++++++++++
 .../queries/PlannerTest/convert-to-cnf.test        | 88 ++++++++++++++++++++++
 14 files changed, 279 insertions(+), 3 deletions(-)

diff --git a/common/function-registry/impala_functions.py b/common/function-registry/impala_functions.py
index 30fb3c4e8..dd81a692d 100644
--- a/common/function-registry/impala_functions.py
+++ b/common/function-registry/impala_functions.py
@@ -323,6 +323,8 @@ visible_functions = [
   [['months_between'], 'DOUBLE', ['DATE', 'DATE'], "_ZN6impala13DateFunctions13MonthsBetweenEPN10impala_udf15FunctionContextERKNS1_7DateValES6_"],
 
   # Math builtin functions
+  # Add new math builtins that operate on scalar values to
+  # org.apache.impala.analysis.FunctionCallExpr::builtinMathScalarFunctionNames_.
   [['pi'], 'DOUBLE', [], 'impala::MathFunctions::Pi'],
   [['e'], 'DOUBLE', [], 'impala::MathFunctions::E'],
   [['abs'], 'BIGINT', ['BIGINT'], 'impala::MathFunctions::Abs'],
diff --git a/fe/src/main/java/org/apache/impala/analysis/ArithmeticExpr.java b/fe/src/main/java/org/apache/impala/analysis/ArithmeticExpr.java
index 663291d25..0edd21339 100644
--- a/fe/src/main/java/org/apache/impala/analysis/ArithmeticExpr.java
+++ b/fe/src/main/java/org/apache/impala/analysis/ArithmeticExpr.java
@@ -17,6 +17,7 @@
 
 package org.apache.impala.analysis;
 
+import java.util.Optional;
 import org.apache.impala.catalog.Db;
 import org.apache.impala.catalog.Function.CompareMode;
 import org.apache.impala.catalog.ScalarFunction;
@@ -75,6 +76,9 @@ public class ArithmeticExpr extends Expr {
   }
 
   private final Operator op_;
+  // cache prior shouldConvertToCNF checks to avoid repeat tree walking
+  // omitted from clone in case cloner plans to mutate the expr
+  protected Optional<Boolean> shouldConvertToCNF_ = Optional.empty();
 
   public Operator getOp() { return op_; }
 
@@ -267,6 +271,26 @@ public class ArithmeticExpr extends Expr {
     return hasChildCosts() ? getChildCosts() + ARITHMETIC_OP_COST : UNKNOWN_COST;
   }
 
+  private boolean lookupShouldConvertToCNF() {
+    for (int i = 0; i < children_.size(); ++i) {
+      if (!getChild(i).shouldConvertToCNF()) return false;
+    }
+    return true;
+  }
+
+  /**
+   * Return true if this expression's children should be converted to CNF.
+   */
+  @Override
+  public boolean shouldConvertToCNF() {
+    if (shouldConvertToCNF_.isPresent()) {
+      return shouldConvertToCNF_.get();
+    }
+    boolean result = lookupShouldConvertToCNF();
+    shouldConvertToCNF_ = Optional.of(result);
+    return result;
+  }
+
   @Override
   public Expr clone() { return new ArithmeticExpr(this); }
 }
diff --git a/fe/src/main/java/org/apache/impala/analysis/CastExpr.java b/fe/src/main/java/org/apache/impala/analysis/CastExpr.java
index ba395ab13..771560d82 100644
--- a/fe/src/main/java/org/apache/impala/analysis/CastExpr.java
+++ b/fe/src/main/java/org/apache/impala/analysis/CastExpr.java
@@ -409,6 +409,12 @@ public class CastExpr extends Expr {
         && type_.equals(other.type_);
   }
 
+  // Pass through since cast's are cheap.
+  @Override
+  public boolean shouldConvertToCNF() {
+    return getChild(0).shouldConvertToCNF();
+  }
+
   @Override
   public Expr clone() { return new CastExpr(this); }
 }
diff --git a/fe/src/main/java/org/apache/impala/analysis/ExistsPredicate.java b/fe/src/main/java/org/apache/impala/analysis/ExistsPredicate.java
index c55be82f4..235e56398 100644
--- a/fe/src/main/java/org/apache/impala/analysis/ExistsPredicate.java
+++ b/fe/src/main/java/org/apache/impala/analysis/ExistsPredicate.java
@@ -83,4 +83,8 @@ public class ExistsPredicate extends Predicate {
     strBuilder.append(getChild(0).toSql(options));
     return strBuilder.toString();
   }
+
+  // Return false since existence can be expensive to determine.
+  @Override
+  public boolean shouldConvertToCNF() { return false; }
 }
diff --git a/fe/src/main/java/org/apache/impala/analysis/Expr.java b/fe/src/main/java/org/apache/impala/analysis/Expr.java
index 3401aed3f..76175d51c 100644
--- a/fe/src/main/java/org/apache/impala/analysis/Expr.java
+++ b/fe/src/main/java/org/apache/impala/analysis/Expr.java
@@ -1883,4 +1883,11 @@ abstract public class Expr extends TreeNode<Expr> implements ParseNode, Cloneabl
     }
     return null;
   }
+
+  /**
+   * Returns true if 'this' is a constant.
+   */
+  public boolean shouldConvertToCNF() {
+    return isConstant();
+  }
 }
diff --git a/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java b/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
index eb1225f7c..ec1c98a98 100644
--- a/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
+++ b/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
@@ -17,7 +17,12 @@
 
 package org.apache.impala.analysis;
 
+import java.util.Arrays;
+import java.util.ArrayList;
 import java.util.List;
+import java.util.HashSet;
+import java.util.Optional;
+import java.util.Set;
 
 import org.apache.impala.authorization.Privilege;
 import org.apache.impala.catalog.AggregateFunction;
@@ -52,6 +57,17 @@ public class FunctionCallExpr extends Expr {
   private boolean isAnalyticFnCall_ = false;
   private boolean isInternalFnCall_ = false;
 
+  // cache prior shouldConvertToCNF checks to avoid repeat tree walking
+  // omitted from clone in case cloner plans to mutate the expr
+  protected Optional<Boolean> shouldConvertToCNF_ = Optional.empty();
+  private static Set<String> builtinMathScalarFunctionNames_ =
+      new HashSet<String>(Arrays.asList("abs", "acos", "asin", "atan", "atan2", "bin",
+          "ceil", "ceiling", "conv", "cos", "cosh", "cot", "dceil", "degrees", "dexp",
+          "dfloor", "dlog1", "dlog10", "dpow", "dround", "dsqrt", "dtrunc", "e", "exp",
+          "floor", "fmod", "fpow", "hex", "ln", "log", "log10", "log2", "mod", "pi",
+          "pmod", "pow", "power", "quotient", "radians", "rand", "random", "round",
+          "sign", "sin", "sinh", "sqrt", "tan", "tanh", "trunc", "truncate", "unhex"));
+
   // Non-null iff this is an aggregation function that executes the Merge() step. This
   // is an analyzed clone of the FunctionCallExpr that executes the Update() function
   // feeding into this Merge(). This is stored so that we can access the types of the
@@ -803,4 +819,32 @@ public class FunctionCallExpr extends Expr {
     return e;
   }
 
+  public boolean isBuiltinMathScalarFunction() {
+    return (fnName_.isBuiltin()
+        && builtinMathScalarFunctionNames_.contains(fnName_.getFunction()));
+  }
+
+  private boolean lookupShouldConvertToCNF() {
+    if (isBuiltinCastFunction() || isBuiltinMathScalarFunction()) {
+      for (int i = 0; i < children_.size(); ++i) {
+        if (!getChild(i).shouldConvertToCNF()) return false;
+      }
+      return true;
+    }
+    return false;
+  }
+
+  /**
+   * Returns true if the function call is considered inexpensive to duplicate
+   * and the arguments should also be converted.
+   */
+  @Override
+  public boolean shouldConvertToCNF() {
+    if (shouldConvertToCNF_.isPresent()) {
+      return shouldConvertToCNF_.get();
+    }
+    boolean result = lookupShouldConvertToCNF();
+    shouldConvertToCNF_ = Optional.of(result);
+    return result;
+  }
 }
diff --git a/fe/src/main/java/org/apache/impala/analysis/IsNotEmptyPredicate.java b/fe/src/main/java/org/apache/impala/analysis/IsNotEmptyPredicate.java
index 0185c56e7..7a2118f54 100644
--- a/fe/src/main/java/org/apache/impala/analysis/IsNotEmptyPredicate.java
+++ b/fe/src/main/java/org/apache/impala/analysis/IsNotEmptyPredicate.java
@@ -68,4 +68,8 @@ public class IsNotEmptyPredicate extends Predicate {
 
   @Override
   public Expr clone() { return new IsNotEmptyPredicate(getChild(0).clone()); }
+
+  // Return false since emptiness can be expensive to determine.
+  @Override
+  public boolean shouldConvertToCNF() { return false; }
 }
diff --git a/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java b/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
index c3d256ad9..5223f4de3 100644
--- a/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
+++ b/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
@@ -168,4 +168,8 @@ public class LikePredicate extends Predicate {
   public Expr clone() { return new LikePredicate(this); }
 
   public Operator getOp() { return op_; }
+
+  // Return false since comparison can be expensive.
+  @Override
+  public boolean shouldConvertToCNF() { return false; }
 }
diff --git a/fe/src/main/java/org/apache/impala/analysis/Predicate.java b/fe/src/main/java/org/apache/impala/analysis/Predicate.java
index 314e50db8..64ee408e9 100644
--- a/fe/src/main/java/org/apache/impala/analysis/Predicate.java
+++ b/fe/src/main/java/org/apache/impala/analysis/Predicate.java
@@ -17,6 +17,7 @@
 
 package org.apache.impala.analysis;
 
+import java.util.Optional;
 import org.apache.impala.catalog.Type;
 import org.apache.impala.common.AnalysisException;
 import org.apache.impala.common.Pair;
@@ -26,6 +27,9 @@ public abstract class Predicate extends Expr {
   protected boolean isEqJoinConjunct_;
   // true if this predicate has an always_true hint
   protected boolean hasAlwaysTrueHint_;
+  // cache prior shouldConvertToCNF checks to avoid repeat tree walking
+  // omitted from clone in case cloner plans to mutate the expr
+  protected Optional<Boolean> shouldConvertToCNF_ = Optional.empty();
 
   public Predicate() {
     super();
@@ -82,6 +86,27 @@ public abstract class Predicate extends Expr {
     return true;
   }
 
+  private boolean lookupShouldConvertToCNF() {
+    for (int i = 0; i < children_.size(); ++i) {
+      if (!getChild(i).shouldConvertToCNF()) return false;
+    }
+    return true;
+  }
+
+  /**
+   * Return true if this predicate's children should be converted to CNF.
+   * Predicates that are considered expensive can override to return false.
+   */
+  @Override
+  public boolean shouldConvertToCNF() {
+    if (shouldConvertToCNF_.isPresent()) {
+      return shouldConvertToCNF_.get();
+    }
+    boolean result = lookupShouldConvertToCNF();
+    shouldConvertToCNF_ = Optional.of(result);
+    return result;
+  }
+
   public static boolean isEquivalencePredicate(Expr expr) {
     return (expr instanceof BinaryPredicate)
         && ((BinaryPredicate) expr).getOp().isEquivalence();
diff --git a/fe/src/main/java/org/apache/impala/analysis/SlotRef.java b/fe/src/main/java/org/apache/impala/analysis/SlotRef.java
index 836109f63..266e4da1c 100644
--- a/fe/src/main/java/org/apache/impala/analysis/SlotRef.java
+++ b/fe/src/main/java/org/apache/impala/analysis/SlotRef.java
@@ -50,9 +50,9 @@ public class SlotRef extends Expr {
   // The resolved path after resolving 'rawPath_'.
   protected Path resolvedPath_ = null;
 
-  // Indicates if this SlotRef is coming from zipping unnest where the unest is given in
-  // the FROM clause. Note, when the unnest in in the select list then an UnnestExpr would
-  // be used instead of a SlotRef.
+  // Indicates if this SlotRef is coming from zipping unnest where the unnest is given in
+  // the FROM clause. Note, when the unnest is in the select list then an UnnestExpr
+  // would be used instead of a SlotRef.
   protected boolean isZippingUnnest_ = false;
 
   public SlotRef(List<String> rawPath) {
@@ -446,4 +446,8 @@ public class SlotRef extends Expr {
       return super.uncheckedCastTo(targetType);
     }
   }
+
+  // Return true since SlotRefs should be easy to access.
+  @Override
+  public boolean shouldConvertToCNF() { return true; }
 }
diff --git a/fe/src/main/java/org/apache/impala/analysis/TupleIsNullPredicate.java b/fe/src/main/java/org/apache/impala/analysis/TupleIsNullPredicate.java
index 5fb2c0461..da78e6dc3 100644
--- a/fe/src/main/java/org/apache/impala/analysis/TupleIsNullPredicate.java
+++ b/fe/src/main/java/org/apache/impala/analysis/TupleIsNullPredicate.java
@@ -197,4 +197,8 @@ public class TupleIsNullPredicate extends Predicate {
 
   @Override
   public Expr clone() { return new TupleIsNullPredicate(this); }
+
+  // Return true since only tuples are involved during evaluation.
+  @Override
+  public boolean shouldConvertToCNF() { return true; }
 }
diff --git a/fe/src/main/java/org/apache/impala/rewrite/ConvertToCNFRule.java b/fe/src/main/java/org/apache/impala/rewrite/ConvertToCNFRule.java
index 527084a5e..1d04cb76d 100644
--- a/fe/src/main/java/org/apache/impala/rewrite/ConvertToCNFRule.java
+++ b/fe/src/main/java/org/apache/impala/rewrite/ConvertToCNFRule.java
@@ -26,6 +26,9 @@ import org.apache.impala.analysis.TableRef;
 import org.apache.impala.analysis.TupleId;
 import org.apache.impala.common.AnalysisException;
 
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.List;
@@ -73,6 +76,7 @@ import java.util.List;
  *   a AND b
  */
 public class ConvertToCNFRule implements ExprRewriteRule {
+  private final static Logger LOG = LoggerFactory.getLogger(ConvertToCNFRule.class);
 
   // maximum number of CNF exprs (each AND is counted as 1) allowed
   private final int maxCnfExprs_;
@@ -94,6 +98,11 @@ public class ConvertToCNFRule implements ExprRewriteRule {
       return pred;
     }
 
+    if (!((CompoundPredicate)pred).shouldConvertToCNF()) {
+      LOG.debug("It is not feasible to rewrite predicate " + pred.toSql() + " to CNF.");
+      return pred;
+    }
+
     if (maxCnfExprs_ > 0 && numCnfExprs_ >= maxCnfExprs_) {
       // max allowed CNF exprs has been reached .. in this case we
       // return the supplied predicate (also see related comments
diff --git a/fe/src/test/java/org/apache/impala/analysis/ExprRewriterTest.java b/fe/src/test/java/org/apache/impala/analysis/ExprRewriterTest.java
index 9fa7fc6a6..15f34a534 100644
--- a/fe/src/test/java/org/apache/impala/analysis/ExprRewriterTest.java
+++ b/fe/src/test/java/org/apache/impala/analysis/ExprRewriterTest.java
@@ -17,6 +17,10 @@
 
 package org.apache.impala.analysis;
 
+import java.util.Arrays;
+import java.util.ArrayList;
+import java.util.List;
+
 import org.apache.impala.analysis.AnalysisContext.AnalysisResult;
 import org.apache.impala.common.AnalysisException;
 import org.apache.impala.common.ImpalaException;
@@ -33,6 +37,8 @@ import com.google.common.base.Preconditions;
 import static org.apache.impala.analysis.ToSqlOptions.DEFAULT;
 import static org.apache.impala.analysis.ToSqlOptions.REWRITTEN;
 import static org.apache.impala.analysis.ToSqlOptions.SHOW_IMPLICIT_CASTS;
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
 
 /**
  * Tests that the ExprRewriter framework covers all clauses as well as nested statements.
@@ -112,6 +118,14 @@ public class ExprRewriterTest extends AnalyzerTest {
     Assert.assertEquals(0, exprToTrue_.getNumChanges());
   }
 
+  private Expr analyze(String query) {
+    AnalysisContext ctx = createAnalysisCtx();
+    ctx.getQueryOptions().setDecimal_v2(true);
+    ctx.getQueryOptions().setEnable_expr_rewrites(false);
+    return ((SelectStmt) AnalyzesOk(query, ctx)).getSelectList()
+        .getItems().get(0).getExpr();
+  }
+
   // Select statement with all clauses that has 11 rewritable Expr trees.
   // We expect a total of 23 exprs to be changed.
   private final String stmt_ =
@@ -586,4 +600,41 @@ public class ExprRewriterTest extends AnalyzerTest {
             "SELECT ndv(id), ndv(id, 5), count(DISTINCT id) FROM functional.alltypes");
 
   }
+
+  @Test
+  public void TestShouldConvertToCNF() {
+    TQueryOptions options = new TQueryOptions();
+    options.setEnable_expr_rewrites(false);
+    AnalysisContext ctx = createAnalysisCtx(options);
+
+    // Positive tests
+    List<String> convertablePredicates = Arrays.asList("select (1=cast(1 as int))",
+        "select (cast(d_date_sk as int) = 10) from tpcds_parquet.date_dim",
+        "select (d_date_sk = d_year) from tpcds_parquet.date_dim",
+        "select (d_date_sk between 1 and 10) from tpcds_parquet.date_dim",
+        "select (d_date_sk in (1,2,10)) from tpcds_parquet.date_dim",
+        "select (d_date_sk is null) from tpcds_parquet.date_dim", "select (cos(1) = 1.1)",
+        "select (cast(d_date_sk as int) * 2 = 10) from tpcds_parquet.date_dim",
+        "select ((2 = cast(1 as int)) and (cos(1) = 1))",
+        "select ((2 = cast(1 as int)) or (cast(0 as int) is not null))",
+        "select (sin(cos(2*pi())))");
+
+    for (String query: convertablePredicates) {
+      Expr expr = analyze(query);
+      assertTrue("Should convert to CNF: "+query, expr.shouldConvertToCNF());
+    }
+
+    // Negative tests
+    List<String> inconvertablePredicates = Arrays.asList(
+        "select (upper(d_day_name) = 'A') from tpcds_parquet.date_dim",
+        "select (d_day_name like '%A') from tpcds_parquet.date_dim",
+        "select (coalesce(d_date_sk, -1) = d_year) from tpcds_parquet.date_dim",
+        "select (log10(cast(1 + length(upper(d_day_name)) as double)) > 1.0) from "
+         + "tpcds_parquet.date_dim");
+
+    for (String query: inconvertablePredicates) {
+      Expr expr = analyze(query);
+      assertFalse("Should not convert to CNF: "+query, expr.shouldConvertToCNF());
+    }
+  }
 }
diff --git a/testdata/workloads/functional-planner/queries/PlannerTest/convert-to-cnf.test b/testdata/workloads/functional-planner/queries/PlannerTest/convert-to-cnf.test
index 2a1042103..73f82467c 100644
--- a/testdata/workloads/functional-planner/queries/PlannerTest/convert-to-cnf.test
+++ b/testdata/workloads/functional-planner/queries/PlannerTest/convert-to-cnf.test
@@ -373,3 +373,91 @@ PLAN-ROOT SINK
    HDFS partitions=1/1 files=1 size=718.94MB
    row-size=8B cardinality=6.00M
 ====
+
+# IMPALA-11274: Test with string functions in the disjunctive predicate.
+# In this case the predicate is not converted to CNF
+select count(*) from lineitem, orders
+ where l_orderkey = o_orderkey and
+  ((upper(l_returnflag) = 'Y' and upper(o_orderpriority) = 'HIGH')
+    or (upper(l_returnflag) = 'N' and upper(o_orderpriority) = 'LOW'))
+  and l_partkey > 0;
+---- PLAN
+PLAN-ROOT SINK
+|
+03:AGGREGATE [FINALIZE]
+|  output: count(*)
+|  row-size=8B cardinality=1
+|
+02:HASH JOIN [INNER JOIN]
+|  hash predicates: o_orderkey = l_orderkey
+|  other predicates: ((upper(l_returnflag) = 'Y' AND upper(o_orderpriority) = 'HIGH') OR (upper(l_returnflag) = 'N' AND upper(o_orderpriority) = 'LOW'))
+|  runtime filters: RF000 <- l_orderkey
+|  row-size=57B cardinality=600.12K
+|
+|--00:SCAN HDFS [tpch_parquet.lineitem]
+|     HDFS partitions=1/1 files=3 size=193.99MB
+|     predicates: l_partkey > 0
+|     row-size=29B cardinality=600.12K
+|
+01:SCAN HDFS [tpch_parquet.orders]
+   HDFS partitions=1/1 files=2 size=54.21MB
+   runtime filters: RF000 -> o_orderkey
+   row-size=28B cardinality=1.50M
+====
+
+# IMPALA-11274: Functions like CAST should still be eligible for CNF
+select count(*) from lineitem, orders
+ where l_orderkey = o_orderkey and
+  ((cast(l_returnflag as varchar(2)) = 'Y' and cast(o_orderpriority as varchar(5)) = 'HIGH')
+    or (cast(l_returnflag as varchar(2)) = 'N' and cast(o_orderpriority as varchar(5)) = 'LOW'))
+  and l_partkey > 0;
+---- PLAN
+PLAN-ROOT SINK
+|
+03:AGGREGATE [FINALIZE]
+|  output: count(*)
+|  row-size=8B cardinality=1
+|
+02:HASH JOIN [INNER JOIN]
+|  hash predicates: l_orderkey = o_orderkey
+|  other predicates: CAST(l_returnflag AS VARCHAR(2)) = 'Y' OR CAST(o_orderpriority AS VARCHAR(5)) = 'LOW', CAST(o_orderpriority AS VARCHAR(5)) = 'HIGH' OR CAST(l_returnflag AS VARCHAR(2)) = 'N'
+|  runtime filters: RF000 <- o_orderkey
+|  row-size=57B cardinality=57.58K
+|
+|--01:SCAN HDFS [tpch_parquet.orders]
+|     HDFS partitions=1/1 files=2 size=54.21MB
+|     predicates: CAST(o_orderpriority AS VARCHAR(5)) IN ('HIGH', 'LOW')
+|     row-size=28B cardinality=150.00K
+|
+00:SCAN HDFS [tpch_parquet.lineitem]
+   HDFS partitions=1/1 files=3 size=193.99MB
+   predicates: l_partkey > 0, CAST(l_returnflag AS VARCHAR(2)) IN ('Y', 'N')
+   runtime filters: RF000 -> l_orderkey
+   row-size=29B cardinality=600.12K
+====
+
+# IMPALA-11274: Simple arithmetic expressions should still be eligible for CNF
+select count(*) from lineitem, orders
+ where l_orderkey = o_orderkey and
+  (2 * log10(l_quantity) < 3 and cast(l_returnflag as varchar(2)) = 'Y')
+   or l_quantity >= 50;
+---- PLAN
+PLAN-ROOT SINK
+|
+03:AGGREGATE [FINALIZE]
+|  output: count(*)
+|  row-size=8B cardinality=1
+|
+02:NESTED LOOP JOIN [INNER JOIN]
+|  predicates: l_orderkey = o_orderkey OR l_quantity >= 50
+|  row-size=37B cardinality=600.12K
+|
+|--01:SCAN HDFS [tpch_parquet.orders]
+|     HDFS partitions=1/1 files=2 size=54.21MB
+|     row-size=8B cardinality=1.50M
+|
+00:SCAN HDFS [tpch_parquet.lineitem]
+   HDFS partitions=1/1 files=3 size=193.99MB
+   predicates: (2 * log10(l_quantity) < 3 AND CAST(l_returnflag AS VARCHAR(2)) = 'Y') OR l_quantity >= 50
+   row-size=29B cardinality=600.12K
+====


[impala] 02/27: IMPALA-11291: deflake minidump-test by not starting JVM

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 019938efa59c8ca67792ec17ab49f4b4d02dd5ce
Author: stiga-huang <hu...@gmail.com>
AuthorDate: Mon May 16 11:39:45 2022 +0800

    IMPALA-11291: deflake minidump-test by not starting JVM
    
    minidump-test is unified into unified-be-test, which always initializes
    the JVM and starts the JVM pause monitor. It prints the following log
    line to stdout:
    22/05/16 11:35:46 INFO util.JvmPauseMonitor: Starting JVM pause monitor
    
    This log line is printed concurrently with the minidump output, which
    makes the test flaky since the test verifies the exact output lines.
    
    This patch moves minidump-test out of unified-be-test and explicitly
    invokes InitCommonRuntime with init_jvm=false. So it won't be disrupted
    by the above log.
    
    Tests:
     - The issue reproduces when I run minidump-test repeatedly. After the
       fix, I can run minidump-test more than 2000 times without errors.
    
    Change-Id: I89f81d408a2e905d5dfdd7f87177ebe2079d4d27
    Reviewed-on: http://gerrit.cloudera.org:8080/18529
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 be/src/util/CMakeLists.txt   |  4 ++--
 be/src/util/minidump-test.cc | 28 ++++++++++++++++++++--------
 2 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/be/src/util/CMakeLists.txt b/be/src/util/CMakeLists.txt
index 8aa4386de..a471ac1db 100644
--- a/be/src/util/CMakeLists.txt
+++ b/be/src/util/CMakeLists.txt
@@ -164,7 +164,6 @@ add_library(UtilTests STATIC
   lru-multi-cache-test.cc
   metrics-test.cc
   min-max-filter-test.cc
-  minidump-test.cc
   openssl-util-test.cc
   os-info-test.cc
   os-util-test.cc
@@ -238,7 +237,8 @@ ADD_UNIFIED_BE_LSAN_TEST(lru-multi-cache-test "LruMultiCache.*")
 ADD_UNIFIED_BE_LSAN_TEST(logging-support-test "LoggingSupport.*")
 ADD_UNIFIED_BE_LSAN_TEST(metrics-test "MetricsTest.*")
 ADD_UNIFIED_BE_LSAN_TEST(min-max-filter-test "MinMaxFilterTest.*")
-ADD_UNIFIED_BE_LSAN_TEST(minidump-test "Minidump.*")
+# minidump-test is flaky when the jvm pause monitor is running. So it can't be unified.
+ADD_BE_LSAN_TEST(minidump-test)
 ADD_UNIFIED_BE_LSAN_TEST(openssl-util-test "OpenSSLUtilTest.*")
 ADD_UNIFIED_BE_LSAN_TEST(os-info-test "OsInfo.*")
 ADD_UNIFIED_BE_LSAN_TEST(os-util-test "OsUtil.*")
diff --git a/be/src/util/minidump-test.cc b/be/src/util/minidump-test.cc
index afcf4a883..571f5e256 100644
--- a/be/src/util/minidump-test.cc
+++ b/be/src/util/minidump-test.cc
@@ -22,7 +22,9 @@
 #include <gtest/gtest.h>
 
 #include "client/linux/handler/minidump_descriptor.h"
+#include "common/init.h"
 #include "common/thread-debug-info.h"
+#include "util/test-info.h"
 
 namespace impala {
 
@@ -44,10 +46,12 @@ TEST(Minidump, DumpCallback) {
   for (std::string output : {stdout, stderr}) {
     std::vector<std::string> lines;
     boost::split(lines, output, boost::is_any_of("\n\r"), boost::token_compress_on);
-    EXPECT_EQ(3, lines.size());
-    EXPECT_EQ("Minidump with no thread info available.", lines[0]);
-    EXPECT_TRUE(boost::regex_match(lines[1], wrote_minidump));
-    EXPECT_EQ("", lines[2]);
+    EXPECT_EQ(3, lines.size()) << output;
+    EXPECT_EQ("Minidump with no thread info available.", lines[0])
+        << lines[0] << "\nOutput:\n" << output;
+    EXPECT_TRUE(boost::regex_match(lines[1], wrote_minidump))
+        << lines[1] << "\nOutput:\n" << output;
+    EXPECT_EQ("", lines[2]) << output;
   }
 }
 
@@ -73,11 +77,19 @@ TEST(Minidump, DumpCallbackWithThread) {
   for (std::string output : {stdout, stderr}) {
     std::vector<std::string> lines;
     boost::split(lines, output, boost::is_any_of("\n\r"), boost::token_compress_on);
-    EXPECT_EQ(3, lines.size());
-    EXPECT_TRUE(boost::regex_match(lines[0], minidump_in_thread));
-    EXPECT_TRUE(boost::regex_match(lines[1], wrote_minidump));
-    EXPECT_EQ("", lines[2]);
+    EXPECT_EQ(3, lines.size()) << output;
+    EXPECT_TRUE(boost::regex_match(lines[0], minidump_in_thread))
+        << lines[0] << "\nOutput:\n" << output;
+    EXPECT_TRUE(boost::regex_match(lines[1], wrote_minidump))
+        << lines[1] << "\nOutput:\n" << output;
+    EXPECT_EQ("", lines[2]) << output;
   }
 }
 
 } // namespace impala
+
+int main(int argc, char **argv) {
+  ::testing::InitGoogleTest(&argc, argv);
+  impala::InitCommonRuntime(argc, argv, /*init_jvm*/ false, impala::TestInfo::BE_TEST);
+  return RUN_ALL_TESTS();
+}


[impala] 08/27: IMPALA-11332: Fix trailing whitespace for CSV output

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit f7f1909120f5b962b7ed14646d0d9d7dd5b5eb91
Author: Joe McDonnell <jo...@cloudera.com>
AuthorDate: Wed Jun 1 13:34:09 2022 -0700

    IMPALA-11332: Fix trailing whitespace for CSV output
    
    The current CSV output strips trailing whitespace
    from the last line of output. The rstrip() call was
    intended to remove an extra newline, but it also
    matches other whitespace. This is a problem for a
    SQL query like:
    select 'Trailing whitespace          ';
    
    This change replaces the rstrip() with rstrip('\n')
    to avoid removing the other whitespace.
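    
    As a quick illustration (not part of the patch, just standard Python
    str.rstrip behavior):
    
      s = 'Trailing whitespace          \n'
      print(repr(s.rstrip()))      # 'Trailing whitespace'
      print(repr(s.rstrip('\n')))  # 'Trailing whitespace          '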
    
    Testing:
     - Current shell tests pass
     - Added a shell test that verifies trailing whitespace
       is not being stripped.
    
    Change-Id: I69d032ca2f581587b0938d0878fdf402fee0d57e
    Reviewed-on: http://gerrit.cloudera.org:8080/18580
    Reviewed-by: Quanlong Huang <hu...@gmail.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 shell/shell_output.py                 |  9 ++++++++-
 tests/shell/test_shell_commandline.py | 10 ++++++++++
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/shell/shell_output.py b/shell/shell_output.py
index becc4dd06..608a8ca92 100644
--- a/shell/shell_output.py
+++ b/shell/shell_output.py
@@ -90,7 +90,14 @@ class DelimitedOutputFormatter(object):
         row = [val.encode('utf-8', 'replace') if isinstance(val, unicode) else val
             for val in row]
       writer.writerow(row)
-    rows = temp_buffer.getvalue().rstrip()
+    # The CSV writer produces an extra newline. Strip that extra newline (and
+    # only that extra newline). csv wraps newlines for data values in quotes,
+    # so rstrip will be limited to the extra newline.
+    if sys.version_info.major == 2:
+      # Python 2 is in encoded Unicode bytes, so this needs to be a bytes \n.
+      rows = temp_buffer.getvalue().rstrip(b'\n')
+    else:
+      rows = temp_buffer.getvalue().rstrip('\n')
     temp_buffer.close()
     return rows
 
diff --git a/tests/shell/test_shell_commandline.py b/tests/shell/test_shell_commandline.py
index c8393ecc4..1cfd90746 100644
--- a/tests/shell/test_shell_commandline.py
+++ b/tests/shell/test_shell_commandline.py
@@ -1240,3 +1240,13 @@ class TestImpalaShell(ImpalaTestSuite):
     result = run_impala_shell_cmd(vector, args + ['--http_socket_timeout_s=None'])
     assert result.stderr == ""
     assert result.stdout == "0\n"
+
+  def test_trailing_whitespace(self, vector):
+    """Test CSV output with trailing whitespace"""
+
+    # Ten trailing spaces
+    query = "select 'Trailing Whitespace          '"
+    # Only one column, no need for output_delimiter
+    output = run_impala_shell_cmd(vector, ['-q', query, '-B'])
+    assert "Fetched 1 row(s)" in output.stderr
+    assert "Trailing Whitespace          \n" in output.stdout


[impala] 01/27: IMPALA-11275: log thread info during minidump

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 964c55f6878d02fa8239229969796da5a3cfbe4b
Author: Michael Smith <mi...@cloudera.com>
AuthorDate: Mon May 9 16:24:28 2022 -0700

    IMPALA-11275: log thread info during minidump
    
    Writes ThreadDebugInfo to stdout/stderr when a minidump is generated to
    capture thread and query details related to the dump. Example message:
    > Minidump in thread [1790536]async-exec-thread running query
      1a47cc1e2df94cb4:88dfa08200000000, fragment instance
      0000000000000000:0000000000000000
    
    Refactors DumpCallback so that repeated writes to STDOUT/STDERR are less
    redundant.
    
    Adds unit tests to run with ThreadDebugInfo. Removes the 'static' prefix
    from DumpCallback so it can be invoked from unit tests, but doesn't add
    it to the header as it's intended to be for internal use.
    
    Testing:
    - Added crash to Coordinator::Exec and manually tested dump handling.
    - Added a new unit test for DumpCallback.
    - Ran tests/custom_cluster/test_breakpad.py to verify nothing broke in
      refactor. Those tests don't have ThreadDebugInfo available.
    
    Change-Id: Iea2bdf10db29a0f8ccbe5e767b708781d42a9b8a
    Reviewed-on: http://gerrit.cloudera.org:8080/18508
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 be/src/util/CMakeLists.txt            |  2 +
 be/src/util/debug-util-test.cc        | 15 +++++++
 be/src/util/debug-util.cc             | 14 ++++++
 be/src/util/debug-util.h              |  9 +++-
 be/src/util/minidump-test.cc          | 83 +++++++++++++++++++++++++++++++++++
 be/src/util/minidump.cc               | 80 ++++++++++++++++++++++++++-------
 tests/custom_cluster/test_breakpad.py |  4 ++
 7 files changed, 190 insertions(+), 17 deletions(-)

diff --git a/be/src/util/CMakeLists.txt b/be/src/util/CMakeLists.txt
index 69622a5b6..8aa4386de 100644
--- a/be/src/util/CMakeLists.txt
+++ b/be/src/util/CMakeLists.txt
@@ -164,6 +164,7 @@ add_library(UtilTests STATIC
   lru-multi-cache-test.cc
   metrics-test.cc
   min-max-filter-test.cc
+  minidump-test.cc
   openssl-util-test.cc
   os-info-test.cc
   os-util-test.cc
@@ -237,6 +238,7 @@ ADD_UNIFIED_BE_LSAN_TEST(lru-multi-cache-test "LruMultiCache.*")
 ADD_UNIFIED_BE_LSAN_TEST(logging-support-test "LoggingSupport.*")
 ADD_UNIFIED_BE_LSAN_TEST(metrics-test "MetricsTest.*")
 ADD_UNIFIED_BE_LSAN_TEST(min-max-filter-test "MinMaxFilterTest.*")
+ADD_UNIFIED_BE_LSAN_TEST(minidump-test "Minidump.*")
 ADD_UNIFIED_BE_LSAN_TEST(openssl-util-test "OpenSSLUtilTest.*")
 ADD_UNIFIED_BE_LSAN_TEST(os-info-test "OsInfo.*")
 ADD_UNIFIED_BE_LSAN_TEST(os-util-test "OsUtil.*")
diff --git a/be/src/util/debug-util-test.cc b/be/src/util/debug-util-test.cc
index 0b92c2785..417330d51 100644
--- a/be/src/util/debug-util-test.cc
+++ b/be/src/util/debug-util-test.cc
@@ -37,6 +37,21 @@ TEST(DebugUtil, UniqueID) {
   EXPECT_EQ("feedbeeff00d7777:0000000000000020", PrintId(unique_id));
 }
 
+TEST(DebugUtil, UniqueIDCompromised) {
+  TUniqueId unique_id;
+  unique_id.hi = 0xfeedbeeff00d7777ULL;
+  unique_id.lo = 0x2020202020202020ULL;
+  char out[TUniqueIdBufferSize+1];
+  out[TUniqueIdBufferSize] = '\0';
+
+  PrintIdCompromised(unique_id, out);
+  EXPECT_EQ(string("feedbeeff00d7777:2020202020202020"), out);
+
+  unique_id.lo = 0x20ULL;
+  PrintIdCompromised(unique_id, out);
+  EXPECT_EQ(string("feedbeeff00d7777:0000000000000020"), out);
+}
+
 string RecursionStack(int level) {
   if (level == 0) return GetStackTrace();
   return RecursionStack(level - 1);
diff --git a/be/src/util/debug-util.cc b/be/src/util/debug-util.cc
index 734476486..59d61c8b0 100644
--- a/be/src/util/debug-util.cc
+++ b/be/src/util/debug-util.cc
@@ -120,6 +120,20 @@ string PrintId(const UniqueIdPB& id, const string& separator) {
   return out.str();
 }
 
+static void my_i64tohex(int64_t w, char out[16]) {
+  static const char* digits = "0123456789abcdef";
+  for (size_t i = 0, j=60; i < 16; ++i, j -= 4) {
+    out[i] = digits[(w>>j) & 0x0f];
+  }
+}
+
+void PrintIdCompromised(const TUniqueId& id, char out[TUniqueIdBufferSize],
+    const char separator) {
+  my_i64tohex(id.hi, out);
+  out[16] = separator;
+  my_i64tohex(id.lo, out+17);
+}
+
 bool ParseId(const string& s, TUniqueId* id) {
   // For backwards compatibility, this method parses two forms of query ID from text:
   //  - <hex-int64_t><colon><hex-int64_t> - this format is the standard going forward
diff --git a/be/src/util/debug-util.h b/be/src/util/debug-util.h
index 5ae707444..e25ee92ad 100644
--- a/be/src/util/debug-util.h
+++ b/be/src/util/debug-util.h
@@ -87,11 +87,18 @@ std::string PrintThriftEnum(const TParquetBloomFilterWrite::type& value);
 std::string PrintTuple(const Tuple* t, const TupleDescriptor& d);
 std::string PrintRow(TupleRow* row, const RowDescriptor& d);
 std::string PrintBatch(RowBatch* batch);
-/// Converts id to a string represantation. If necessary, the gdb equivalent is:
+/// Converts id to a string representation. If necessary, the gdb equivalent is:
 ///    printf "%lx:%lx\n", id.hi, id.lo
 std::string PrintId(const TUniqueId& id, const std::string& separator = ":");
 std::string PrintId(const UniqueIdPB& id, const std::string& separator = ":");
 
+/// Converts id to a string representation without using any shared library calls.
+/// Follows Breakpad's guidance for compromised contexts, see
+/// https://github.com/google/breakpad/blob/main/docs/linux_starter_guide.md
+constexpr int TUniqueIdBufferSize = 33;
+void PrintIdCompromised(const TUniqueId& id, char out[TUniqueIdBufferSize],
+    const char separator = ':');
+
 inline ostream& operator<<(ostream& os, const UniqueIdPB& id) {
   return os << PrintId(id);
 }
diff --git a/be/src/util/minidump-test.cc b/be/src/util/minidump-test.cc
new file mode 100644
index 000000000..afcf4a883
--- /dev/null
+++ b/be/src/util/minidump-test.cc
@@ -0,0 +1,83 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "util/minidump.h"
+
+#include <boost/algorithm/string.hpp>
+#include <boost/regex.hpp>
+#include <gtest/gtest.h>
+
+#include "client/linux/handler/minidump_descriptor.h"
+#include "common/thread-debug-info.h"
+
+namespace impala {
+
+bool DumpCallback(const google_breakpad::MinidumpDescriptor& descriptor, void* context,
+    bool succeeded);
+
+TEST(Minidump, DumpCallback) {
+  testing::internal::CaptureStdout();
+  testing::internal::CaptureStderr();
+
+  google_breakpad::MinidumpDescriptor descriptor("/tmp/arbitrary/path");
+  descriptor.UpdatePath();
+  DumpCallback(descriptor, nullptr, true);
+
+  std::string stdout = testing::internal::GetCapturedStdout();
+  std::string stderr = testing::internal::GetCapturedStderr();
+  boost::regex wrote_minidump("Wrote minidump to /tmp/arbitrary/path/.*\\.dmp");
+
+  for (std::string output : {stdout, stderr}) {
+    std::vector<std::string> lines;
+    boost::split(lines, output, boost::is_any_of("\n\r"), boost::token_compress_on);
+    EXPECT_EQ(3, lines.size());
+    EXPECT_EQ("Minidump with no thread info available.", lines[0]);
+    EXPECT_TRUE(boost::regex_match(lines[1], wrote_minidump));
+    EXPECT_EQ("", lines[2]);
+  }
+}
+
+TEST(Minidump, DumpCallbackWithThread) {
+  testing::internal::CaptureStdout();
+  testing::internal::CaptureStderr();
+
+  ThreadDebugInfo tdi;
+  TUniqueId query, instance;
+  std::tie(query.hi, query.lo, instance.hi, instance.lo) = std::make_tuple(1, 2, 3, 4);
+  tdi.SetQueryId(query);
+  tdi.SetInstanceId(instance);
+  google_breakpad::MinidumpDescriptor descriptor("/tmp/arbitrary/path");
+  descriptor.UpdatePath();
+  DumpCallback(descriptor, nullptr, true);
+
+  std::string stdout = testing::internal::GetCapturedStdout();
+  std::string stderr = testing::internal::GetCapturedStderr();
+  boost::regex minidump_in_thread("Minidump in thread \\[.*\\] running query "
+      "0{15}1:0{15}2, fragment instance 0{15}3:0{15}4");
+  boost::regex wrote_minidump("Wrote minidump to /tmp/arbitrary/path/.*\\.dmp");
+
+  for (std::string output : {stdout, stderr}) {
+    std::vector<std::string> lines;
+    boost::split(lines, output, boost::is_any_of("\n\r"), boost::token_compress_on);
+    EXPECT_EQ(3, lines.size());
+    EXPECT_TRUE(boost::regex_match(lines[0], minidump_in_thread));
+    EXPECT_TRUE(boost::regex_match(lines[1], wrote_minidump));
+    EXPECT_EQ("", lines[2]);
+  }
+}
+
+} // namespace impala
diff --git a/be/src/util/minidump.cc b/be/src/util/minidump.cc
index f09439f70..25a329f80 100644
--- a/be/src/util/minidump.cc
+++ b/be/src/util/minidump.cc
@@ -32,7 +32,9 @@
 #include <sstream>
 
 #include "common/logging.h"
+#include "common/thread-debug-info.h"
 #include "common/version.h"
+#include "util/debug-util.h"
 #include "util/filesystem-util.h"
 #include "util/time.h"
 
@@ -64,31 +66,77 @@ static bool FilterCallback(void* context) {
   return minidumps_enabled;
 }
 
+static void write_dump_threadinfo(int fd, ThreadDebugInfo* thread_info) {
+  constexpr char thread_msg[] = "Minidump in thread ";
+  // pid_t is signed, but valid process IDs must always be positive.
+  const int64_t thread_id = thread_info->GetSystemThreadId();
+  // 20 characters needed for UINT64_MAX.
+  char thread_id_str[20];
+  const unsigned int thread_id_len = my_uint_len(thread_id);
+  my_uitos(thread_id_str, thread_id, thread_id_len);
+  const char* thread_name = thread_info->GetThreadName();
+
+  constexpr char query_msg[] = " running query ";
+  const TUniqueId& query_id = thread_info->GetQueryId();
+  char query_id_str[TUniqueIdBufferSize];
+  PrintIdCompromised(query_id, query_id_str);
+
+  constexpr char instance_msg[] = ", fragment instance ";
+  const TUniqueId& instance_id = thread_info->GetInstanceId();
+  // Format TUniqueId according to PrintId from util/debug-util.h
+  char instance_id_str[TUniqueIdBufferSize];
+  PrintIdCompromised(instance_id, instance_id_str);
+
+  // Example:
+  // > Minidump in thread [1790536]async-exec-thread running query 1a47cc1e2df94cb4:
+  //   88dfa08200000000, fragment instance 0000000000000000:0000000000000000
+  sys_write(fd, thread_msg, sizeof(thread_msg) / sizeof(thread_msg[0]) - 1);
+  sys_write(fd, "[", 1);
+  sys_write(fd, thread_id_str, thread_id_len);
+  sys_write(fd, "]", 1);
+  sys_write(fd, thread_name, my_strlen(thread_name));
+  sys_write(fd, query_msg, sizeof(query_msg) / sizeof(query_msg[0]) - 1);
+  sys_write(fd, query_id_str, TUniqueIdBufferSize);
+  sys_write(fd, instance_msg, sizeof(instance_msg) / sizeof(instance_msg[0]) - 1);
+  sys_write(fd, instance_id_str, TUniqueIdBufferSize);
+  sys_write(fd, "\n", 1);
+}
+
+static void write_dump_path(int fd, const char* path) {
+  // We use the linux syscall support methods from chromium here as per the
+  // recommendation of the breakpad docs to avoid calling into other shared libraries.
+  const char msg[] = "Wrote minidump to ";
+  sys_write(fd, msg, sizeof(msg) / sizeof(msg[0]) - 1);
+    // We use breakpad's reimplementation of strlen to avoid calling into libc.
+  sys_write(fd, path, my_strlen(path));
+  sys_write(fd, "\n", 1);
+}
+
 /// Callback for breakpad. It is called by breakpad whenever a minidump file has been
 /// written and should not be called directly. It logs the event before breakpad crashes
 /// the process. Due to the process being in a failed state we write to stdout/stderr and
 /// let the surrounding redirection make sure the output gets logged. The calls might
 /// still fail in unknown scenarios as the process is in a broken state. However we don't
 /// rely on them as the minidump file has been written already.
-static bool DumpCallback(const google_breakpad::MinidumpDescriptor& descriptor,
-    void* context, bool succeeded) {
+bool DumpCallback(const google_breakpad::MinidumpDescriptor& descriptor, void* context,
+    bool succeeded) {
   // See if a file was written successfully.
   if (succeeded) {
-    // Write message to stdout/stderr, which will usually be captured in the INFO/ERROR
-    // log.
-    const char msg[] = "Wrote minidump to ";
-    const int msg_len = sizeof(msg) / sizeof(msg[0]) - 1;
+    // Write to stdout/stderr, which will usually be captured in the INFO/ERROR log.
+    ThreadDebugInfo* thread_info = GetThreadDebugInfo();
+    if (thread_info != nullptr) {
+      write_dump_threadinfo(STDOUT_FILENO, thread_info);
+      write_dump_threadinfo(STDERR_FILENO, thread_info);
+    } else {
+      const char msg[] = "Minidump with no thread info available.\n";
+      const int msg_len = sizeof(msg) / sizeof(msg[0]) - 1;
+      sys_write(STDOUT_FILENO, msg, msg_len);
+      sys_write(STDERR_FILENO, msg, msg_len);
+    }
+
     const char* path = descriptor.path();
-    // We use breakpad's reimplementation of strlen to avoid calling into libc.
-    const int path_len = my_strlen(path);
-    // We use the linux syscall support methods from chromium here as per the
-    // recommendation of the breakpad docs to avoid calling into other shared libraries.
-    sys_write(STDOUT_FILENO, msg, msg_len);
-    sys_write(STDOUT_FILENO, path, path_len);
-    sys_write(STDOUT_FILENO, "\n", 1);
-    sys_write(STDERR_FILENO, msg, msg_len);
-    sys_write(STDERR_FILENO, path, path_len);
-    sys_write(STDERR_FILENO, "\n", 1);
+    write_dump_path(STDOUT_FILENO, path);
+    write_dump_path(STDERR_FILENO, path);
   }
   // Return the value received in the call as described in the minidump documentation. If
   // this values is true, then no other handlers will be called. Breakpad will still crash
diff --git a/tests/custom_cluster/test_breakpad.py b/tests/custom_cluster/test_breakpad.py
index 033e4790d..d106093ef 100644
--- a/tests/custom_cluster/test_breakpad.py
+++ b/tests/custom_cluster/test_breakpad.py
@@ -143,6 +143,10 @@ class TestBreakpadBase(CustomClusterTestSuite):
         expected_count=expected_count)
     self.assert_impalad_log_contains('ERROR', 'Wrote minidump to ',
         expected_count=expected_count)
+    self.assert_impalad_log_contains('INFO', 'Minidump with no thread info available.',
+        expected_count=expected_count)
+    self.assert_impalad_log_contains('ERROR', 'Minidump with no thread info available.',
+        expected_count=expected_count)
 
 
 class TestBreakpadCore(TestBreakpadBase):
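
For reference, the fixed-width "hi:lo" layout that the new PrintIdCompromised
helper emits (and that the debug-util test above checks) can be reproduced with
a short Python sketch. format_unique_id is a hypothetical name used only for
illustration here, not code from the patch:

```
def format_unique_id(hi, lo, separator=":"):
    # Each 64-bit half is rendered as 16 lowercase hex digits, zero-padded.
    # Masking keeps negative int64 inputs printing as their unsigned bit pattern.
    mask = 0xFFFFFFFFFFFFFFFF
    return "{:016x}{}{:016x}".format(hi & mask, separator, lo & mask)

assert format_unique_id(0xfeedbeeff00d7777, 0x20) == \
    "feedbeeff00d7777:0000000000000020"
```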


[impala] 18/27: IMPALA-11416: SlotRef::tuple_is_nullable_ uninitialised for struct children

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 3088c37468f5867d2bb19cfbe182e6418d7655af
Author: Daniel Becker <da...@cloudera.com>
AuthorDate: Tue Jul 5 13:52:20 2022 +0200

    IMPALA-11416: SlotRef::tuple_is_nullable_ uninitialised for struct
    children
    
    In 'SlotRef::Init', 'tuple_is_nullable_' is only assigned a value if the
    'SlotRef' is not within a struct:
    
      ```
      if (!slot_desc_->parent()->isTupleOfStructSlot()) {
        tuple_is_nullable_ = row_desc.TupleIsNullable(tuple_idx_);
      }
      ```
    
    Otherwise 'tuple_is_nullable_' remains uninitialised, leading to
    undefined behaviour when it is read.
    
    After this commit, 'tuple_is_nullable_' is set to false within structs.
    
    Change-Id: I67517502ebc8f9bec52cb61d9922523cc4a56c4a
    Reviewed-on: http://gerrit.cloudera.org:8080/18702
    Reviewed-by: Gabor Kaszab <ga...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 be/src/exprs/slot-ref.cc | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/be/src/exprs/slot-ref.cc b/be/src/exprs/slot-ref.cc
index 55b1137b2..32d4e254c 100644
--- a/be/src/exprs/slot-ref.cc
+++ b/be/src/exprs/slot-ref.cc
@@ -99,9 +99,8 @@ Status SlotRef::Init(
       return Status(error);
     }
     DCHECK(tuple_idx_ != RowDescriptor::INVALID_IDX);
-    if (!slot_desc_->parent()->isTupleOfStructSlot()) {
-      tuple_is_nullable_ = row_desc.TupleIsNullable(tuple_idx_);
-    }
+    tuple_is_nullable_ = slot_desc_->parent()->isTupleOfStructSlot() ?
+        false : row_desc.TupleIsNullable(tuple_idx_);
     slot_offset_ = slot_desc_->tuple_offset();
     null_indicator_offset_ = slot_desc_->null_indicator_offset();
   }
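
The class of bug fixed here, a member assigned on only one branch and read
unconditionally later, has a loose Python analogue. Python raises an
AttributeError instead of reading an indeterminate value, but the shape of the
bug and of the fix is the same; SlotRefLike below is purely illustrative, not
Impala code:

```
class SlotRefLike:
    def init_buggy(self, inside_struct, tuple_nullable):
        if not inside_struct:
            self.tuple_is_nullable = tuple_nullable
        # Inside a struct the attribute is never set, so a later read fails.

    def init_fixed(self, inside_struct, tuple_nullable):
        # Assign on every path, mirroring the conditional expression in the fix.
        self.tuple_is_nullable = False if inside_struct else tuple_nullable

s = SlotRefLike()
s.init_fixed(inside_struct=True, tuple_nullable=True)
assert s.tuple_is_nullable is False
```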


[impala] 19/27: IMPALA-11412: CodegenFnPtr<FuncType>::store() has a compile time error when instantiated

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 1cdb5122a57e1b66dabe089d8d8f6557a4f1771e
Author: Daniel Becker <da...@cloudera.com>
AuthorDate: Mon Jul 4 11:27:34 2022 +0200

    IMPALA-11412: CodegenFnPtr<FuncType>::store() has a compile time error
    when instantiated
    
    The function template 'CodegenFnPtr<FuncType>::store()' tries to
    implicitly cast a function pointer to 'void*', which is a compile time
    error. The reason this hasn't come up in the builds is that this
    function template is currently not used anywhere, and the function
    pointers are stored through the parent class, using
    'CodegenFnPtrBase::store()', which takes a 'void*'.
    
    This commit adds the correct explicit cast from function pointer to
    'void*' in 'CodegenFnPtr<FuncType>::store()' and adds tests which
    instantiate (and test) this function template.
    
    Testing:
     - Added tests in llvm-codegen-test.cc that assert that
       'CodegenFnPtr<FuncType>::load()' gives back the same function pointer
       that we store in the object. One test stores the function pointer as
       a 'void*' through 'CodegenFnPtrBase' and the other as a typed
       function pointer through 'CodegenFnPtr<FuncType>'.
    
    Change-Id: I86c78b6237630403db6ca14a3d3e1cffe0e477db
    Reviewed-on: http://gerrit.cloudera.org:8080/18698
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 be/src/codegen/codegen-fn-ptr.h     |  2 +-
 be/src/codegen/llvm-codegen-test.cc | 20 ++++++++++++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/be/src/codegen/codegen-fn-ptr.h b/be/src/codegen/codegen-fn-ptr.h
index 519912be9..341d01589 100644
--- a/be/src/codegen/codegen-fn-ptr.h
+++ b/be/src/codegen/codegen-fn-ptr.h
@@ -55,7 +55,7 @@ class CodegenFnPtr : public CodegenFnPtrBase {
    }
 
    void store(FuncType value) {
-     CodegenFnPtrBase::store(value);
+     CodegenFnPtrBase::store(reinterpret_cast<void*>(value));
    }
 };
 
diff --git a/be/src/codegen/llvm-codegen-test.cc b/be/src/codegen/llvm-codegen-test.cc
index c9699c337..e1704506a 100644
--- a/be/src/codegen/llvm-codegen-test.cc
+++ b/be/src/codegen/llvm-codegen-test.cc
@@ -42,6 +42,26 @@ using std::unique_ptr;
 
 namespace impala {
 
+class CodegenFnPtrTest : public testing:: Test {
+ protected:
+   typedef int (*FnPtr)(int);
+   static int int_identity(int a) { return a; }
+   CodegenFnPtr<FnPtr> codegen_fn_ptr;
+};
+
+TEST_F(CodegenFnPtrTest, StoreVoidPtrAndLoad) {
+  void* fn_ptr = reinterpret_cast<void*>(int_identity);
+  codegen_fn_ptr.CodegenFnPtrBase::store(fn_ptr);
+  FnPtr loaded_fn_ptr = codegen_fn_ptr.load();
+  ASSERT_TRUE(loaded_fn_ptr == int_identity);
+}
+
+TEST_F(CodegenFnPtrTest, StoreNonVoidPtrAndLoad) {
+  codegen_fn_ptr.store(int_identity);
+  FnPtr loaded_fn_ptr = codegen_fn_ptr.load();
+  ASSERT_TRUE(loaded_fn_ptr == int_identity);
+}
+
 class LlvmCodeGenTest : public testing:: Test {
  protected:
   scoped_ptr<TestEnv> test_env_;


[impala] 17/27: IMPALA-11368: Iceberg time-travel error message should show timestamp in local timezone

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 7e40c2887810bd97dcff154d85601ac0106ddc0f
Author: LPL <li...@sensorsdata.cn>
AuthorDate: Fri Jun 24 23:32:25 2022 +0800

    IMPALA-11368: Iceberg time-travel error message should show timestamp in local timezone
    
    In the FOR SYSTEM_TIME AS OF clause we expect timestamps in the local
    timezone, while the error message shows the timestamp in the UTC timezone.
    The error message should show the timestamp in the local timezone.
    
    Testing:
     - Add e2e test
    
    Change-Id: Iba5d5eb65133f11cc4eb2fc15a19f7b25c14cc46
    Reviewed-on: http://gerrit.cloudera.org:8080/18675
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 .../org/apache/impala/analysis/TimeTravelSpec.java | 18 ++++++++++
 .../main/java/org/apache/impala/util/ExprUtil.java | 41 ++++++++++++++++++++--
 .../java/org/apache/impala/util/IcebergUtil.java   | 16 +++++++--
 .../queries/QueryTest/iceberg-negative.test        | 17 +++++++++
 4 files changed, 88 insertions(+), 4 deletions(-)

diff --git a/fe/src/main/java/org/apache/impala/analysis/TimeTravelSpec.java b/fe/src/main/java/org/apache/impala/analysis/TimeTravelSpec.java
index 7c42dbc31..e89e9c53a 100644
--- a/fe/src/main/java/org/apache/impala/analysis/TimeTravelSpec.java
+++ b/fe/src/main/java/org/apache/impala/analysis/TimeTravelSpec.java
@@ -52,6 +52,9 @@ public class TimeTravelSpec extends StmtNode {
   // Iceberg uses millis, Kudu uses micros for time travel, so using micros here.
   private long asOfMicros_ = -1;
 
+  // A time string represents the asOfMicros_ for the query option TIMEZONE
+  private String timeString_;
+
   public Kind getKind() { return kind_; }
 
   public long getAsOfVersion() { return asOfVersion_; }
@@ -71,6 +74,7 @@ public class TimeTravelSpec extends StmtNode {
     asOfExpr_ = other.asOfExpr_.clone();
     asOfVersion_ = other.asOfVersion_;
     asOfMicros_ = other.asOfMicros_;
+    timeString_ = other.timeString_;
   }
 
   @Override
@@ -106,6 +110,15 @@ public class TimeTravelSpec extends StmtNode {
       throw new AnalysisException(
           "Invalid TIMESTAMP expression: " + ie.getMessage(), ie);
     }
+    try {
+      timeString_ = ExprUtil.localTimestampToString(analyzer, asOfExpr_);
+      LOG.debug("FOR SYSTEM_TIME AS OF time: {}, {}", timeString_,
+          analyzer.getQueryCtx().getLocal_time_zone());
+    } catch (InternalException ie) {
+      throw new AnalysisException(
+          "Invalid TIMESTAMP expression: " + ie.getMessage(), ie);
+    }
+
   }
 
   private void analyzeVersionBased(Analyzer analyzer) throws AnalysisException {
@@ -146,4 +159,9 @@ public class TimeTravelSpec extends StmtNode {
   public final String toSql() {
     return toSql(DEFAULT);
   }
+
+  public String toTimeString() {
+    Preconditions.checkState(Kind.TIME_AS_OF.equals(kind_));
+    return timeString_;
+  }
 }
diff --git a/fe/src/main/java/org/apache/impala/util/ExprUtil.java b/fe/src/main/java/org/apache/impala/util/ExprUtil.java
index b5fc97227..934fabd0b 100644
--- a/fe/src/main/java/org/apache/impala/util/ExprUtil.java
+++ b/fe/src/main/java/org/apache/impala/util/ExprUtil.java
@@ -51,18 +51,55 @@ public class ExprUtil {
     return result.getLong_val();
   }
 
+  /**
+   * Converts a UTC timestamp to string value for a specified time zone.
+   */
+  public static String utcTimestampToSpecifiedTimeZoneTimestamp(Analyzer analyzer,
+      Expr timestampExpr) throws AnalysisException, InternalException {
+    Preconditions.checkArgument(timestampExpr.isAnalyzed());
+    Preconditions.checkArgument(timestampExpr.isConstant());
+    Preconditions.checkArgument(timestampExpr.getType() == Type.TIMESTAMP);
+    Expr fromUtcTimestampExpr = new FunctionCallExpr("from_utc_timestamp", Lists
+        .newArrayList(timestampExpr,
+            new StringLiteral(analyzer.getQueryCtx().getLocal_time_zone())));
+    fromUtcTimestampExpr.analyze(analyzer);
+    TColumnValue result = FeSupport
+        .EvalExprWithoutRow(fromUtcTimestampExpr, analyzer.getQueryCtx());
+    if (!result.isSetString_val()) {
+      throw new InternalException("Error converting timestamp expression: " +
+          timestampExpr.debugString());
+    }
+    return result.getString_val();
+  }
+
   /**
    * Converts a timestamp in local timezone to UTC, then to UNIX microseconds.
    */
   public static long localTimestampToUnixTimeMicros(Analyzer analyzer, Expr timestampExpr)
       throws AnalysisException, InternalException {
+    return utcTimestampToUnixTimeMicros(analyzer,
+        toUtcTimestampExpr(analyzer, timestampExpr));
+  }
+
+
+  /**
+   * Converts a timestamp in local timezone to string value.
+   */
+  public static String localTimestampToString(Analyzer analyzer, Expr timestampExpr)
+      throws AnalysisException, InternalException {
+    return utcTimestampToSpecifiedTimeZoneTimestamp(analyzer,
+        toUtcTimestampExpr(analyzer, timestampExpr));
+  }
+
+  private static Expr toUtcTimestampExpr(Analyzer analyzer, Expr timestampExpr)
+      throws AnalysisException {
     Preconditions.checkArgument(timestampExpr.isAnalyzed());
     Preconditions.checkArgument(timestampExpr.isConstant());
     Preconditions.checkArgument(timestampExpr.getType() == Type.TIMESTAMP);
     Expr toUtcTimestamp = new FunctionCallExpr("to_utc_timestamp",
         Lists.newArrayList(timestampExpr,
-        new StringLiteral(analyzer.getQueryCtx().getLocal_time_zone())));
+            new StringLiteral(analyzer.getQueryCtx().getLocal_time_zone())));
     toUtcTimestamp.analyze(analyzer);
-    return utcTimestampToUnixTimeMicros(analyzer, toUtcTimestamp);
+    return toUtcTimestamp;
   }
 }
diff --git a/fe/src/main/java/org/apache/impala/util/IcebergUtil.java b/fe/src/main/java/org/apache/impala/util/IcebergUtil.java
index 8f4fca0df..b212b96f1 100644
--- a/fe/src/main/java/org/apache/impala/util/IcebergUtil.java
+++ b/fe/src/main/java/org/apache/impala/util/IcebergUtil.java
@@ -548,13 +548,13 @@ public class IcebergUtil {
   }
 
   private static TableScan createScanAsOf(FeIcebergTable table,
-      TimeTravelSpec timeTravelSpec) throws TableLoadingException {
+      TimeTravelSpec timeTravelSpec) {
     TableScan scan = table.getIcebergApiTable().newScan();
     if (timeTravelSpec == null) {
       scan = scan.useSnapshot(table.snapshotId());
     } else {
       if (timeTravelSpec.getKind() == Kind.TIME_AS_OF) {
-        scan = scan.asOfTime(timeTravelSpec.getAsOfMillis());
+        scan = createScanAsOfTime(timeTravelSpec, scan);
       } else {
         Preconditions.checkState(timeTravelSpec.getKind() == Kind.VERSION_AS_OF);
         scan = scan.useSnapshot(timeTravelSpec.getAsOfVersion());
@@ -563,6 +563,18 @@ public class IcebergUtil {
     return scan;
   }
 
+  private static TableScan createScanAsOfTime(TimeTravelSpec timeTravelSpec,
+      TableScan scan) {
+    Preconditions.checkState(timeTravelSpec.getKind() == Kind.TIME_AS_OF);
+    try {
+      scan = scan.asOfTime(timeTravelSpec.getAsOfMillis());
+    } catch (IllegalArgumentException e) {
+      throw new IllegalArgumentException(
+          "Cannot find a snapshot older than " + timeTravelSpec.toTimeString());
+    }
+    return scan;
+  }
+
   /**
    * Use DataFile path to generate 128-bit Murmur3 hash as map key, cached in memory
    */
diff --git a/testdata/workloads/functional-query/queries/QueryTest/iceberg-negative.test b/testdata/workloads/functional-query/queries/QueryTest/iceberg-negative.test
index 15dc723e3..a9722048c 100644
--- a/testdata/workloads/functional-query/queries/QueryTest/iceberg-negative.test
+++ b/testdata/workloads/functional-query/queries/QueryTest/iceberg-negative.test
@@ -629,3 +629,20 @@ CREATE TABLE clone_ice LIKE functional_parquet.alltypestiny STORED AS ICEBERG;
 ---- CATCH
 CREATE TABLE LIKE is not supported for Iceberg tables.
 ====
+---- QUERY
+select * from functional_parquet.iceberg_alltypes_part for system_time as of '2000-01-01 01:02:03';
+---- CATCH
+IllegalArgumentException: Cannot find a snapshot older than 2000-01-01 01:02:03
+====
+---- QUERY
+SET TIMEZONE='Europe/Budapest';
+select * from functional_parquet.iceberg_alltypes_part for system_time as of '2000-01-01 01:02:03';
+---- CATCH
+IllegalArgumentException: Cannot find a snapshot older than 2000-01-01 01:02:03
+====
+---- QUERY
+SET TIMEZONE='Asia/Shanghai';
+select * from functional_parquet.iceberg_alltypes_part for system_time as of '2000-01-01 01:02:03';
+---- CATCH
+IllegalArgumentException: Cannot find a snapshot older than 2000-01-01 01:02:03
+====
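
The message is built by converting the (already UTC-normalized) timestamp back
to the session's TIMEZONE with from_utc_timestamp. A hypothetical Python sketch
of that conversion direction (not part of the patch; assumes Python 3.9+ for
zoneinfo) shows how '2000-01-01 01:02:03' comes back out for a Budapest
session:

```
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def utc_to_session_time_string(utc_naive, session_tz):
    # Treat the stored instant as UTC, then render it in the session timezone.
    return (utc_naive.replace(tzinfo=timezone.utc)
                     .astimezone(ZoneInfo(session_tz))
                     .strftime("%Y-%m-%d %H:%M:%S"))

# 00:02:03 UTC is 01:02:03 in Europe/Budapest (CET, UTC+1) on that date.
print(utc_to_session_time_string(datetime(2000, 1, 1, 0, 2, 3), "Europe/Budapest"))
```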


[impala] 09/27: IMPALA-11323: Don't evaluate constants-only inferred predicates

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit b2e05224b94bf2dc96920c1a09922eb502ef1fcb
Author: Steve Carlin <sc...@cloudera.com>
AuthorDate: Wed Jun 1 12:29:56 2022 -0700

    IMPALA-11323: Don't evaluate constants-only inferred predicates
    
    IMPALA-10182 fixed the problem of creating inferred predicates when
    both sides of an equality predicate came from the same slot.
    
    Inferred predicates also should not be created when both sides
    of an equality predicate are constant values which do not have
    scan slots.
    
    Change-Id: If1cd4559dda406d2d38703ed594b70b41ed336fd
    Reviewed-on: http://gerrit.cloudera.org:8080/18579
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
    Reviewed-by: Aman Sinha <am...@cloudera.com>
---
 .../main/java/org/apache/impala/analysis/Expr.java | 39 +++++++++++++++++-----
 .../java/org/apache/impala/planner/PlanNode.java   | 19 ++++++++---
 .../queries/QueryTest/inline-view.test             | 16 ++++++++-
 3 files changed, 61 insertions(+), 13 deletions(-)

diff --git a/fe/src/main/java/org/apache/impala/analysis/Expr.java b/fe/src/main/java/org/apache/impala/analysis/Expr.java
index 76175d51c..0bc470053 100644
--- a/fe/src/main/java/org/apache/impala/analysis/Expr.java
+++ b/fe/src/main/java/org/apache/impala/analysis/Expr.java
@@ -1642,24 +1642,47 @@ abstract public class Expr extends TreeNode<Expr> implements ParseNode, Cloneabl
   }
 
   /**
-   * Returns the descriptor of the scan slot that directly or indirectly produces
-   * the values of 'this' SlotRef. Traverses the source exprs of intermediate slot
-   * descriptors to resolve materialization points (e.g., aggregations).
-   * Returns null if 'e' or any source expr of 'e' is not a SlotRef or cast SlotRef.
-   */
-  public SlotDescriptor findSrcScanSlot() {
+   * Returns the source expression for this expression. Traverses the source
+   * exprs of intermediate slot descriptors to resolve materialization points
+   * (e.g., aggregations). Returns null if there are multiple source Exprs
+   * mapped to the expression at any given point.
+   */
+  public Expr findSrcExpr() {
+    // If the source expression is a constant expression, it won't have a scanSlotRef
+    // and we can return this.
+    if (isConstant()) {
+      return this;
+    }
     SlotRef slotRef = unwrapSlotRef(false);
     if (slotRef == null) return null;
     SlotDescriptor slotDesc = slotRef.getDesc();
-    if (slotDesc.isScanSlot()) return slotDesc;
+    if (slotDesc.isScanSlot()) return slotRef;
     if (slotDesc.getSourceExprs().size() == 1) {
-      return slotDesc.getSourceExprs().get(0).findSrcScanSlot();
+      return slotDesc.getSourceExprs().get(0).findSrcExpr();
     }
     // No known source expr, or there are several source exprs meaning the slot is
     // has no single source table.
     return null;
   }
 
+  /**
+   * Returns the descriptor of the scan slot that directly or indirectly produces
+   * the values of 'this' SlotRef. Traverses the source exprs of intermediate slot
+   * descriptors to resolve materialization points (e.g., aggregations).
+   * Returns null if 'e' or any source expr of 'e' is not a SlotRef or cast SlotRef.
+   */
+  public SlotDescriptor findSrcScanSlot() {
+    Expr sourceExpr = findSrcExpr();
+    if (sourceExpr == null) {
+      return null;
+    }
+    SlotRef slotRef = sourceExpr.unwrapSlotRef(false);
+    if (slotRef == null) {
+      return null;
+    }
+    return slotRef.getDesc();
+  }
+
   /**
    * Pushes negation to the individual operands of a predicate
    * tree rooted at 'root'.
diff --git a/fe/src/main/java/org/apache/impala/planner/PlanNode.java b/fe/src/main/java/org/apache/impala/planner/PlanNode.java
index b59acedc3..2dc21afe2 100644
--- a/fe/src/main/java/org/apache/impala/planner/PlanNode.java
+++ b/fe/src/main/java/org/apache/impala/planner/PlanNode.java
@@ -574,12 +574,23 @@ abstract public class PlanNode extends TreeNode<PlanNode> {
     // Check if this is an inferred identity predicate i.e for c1 = c2 both
     // sides are pointing to the same source slot. In such cases it is wrong
     // to add the predicate to the SELECT node because it will incorrectly
-    // eliminate rows with NULL values.
+    // eliminate rows with NULL values. We also check if both sides are pointing
+    // to equal constant values.
     for (Expr e : conjuncts) {
       if (e instanceof BinaryPredicate && ((BinaryPredicate) e).isInferred()) {
-        SlotDescriptor lhs = ((BinaryPredicate) e).getChild(0).findSrcScanSlot();
-        SlotDescriptor rhs = ((BinaryPredicate) e).getChild(1).findSrcScanSlot();
-        if (lhs != null && rhs != null && lhs.equals(rhs)) continue;
+        Expr lhsSrcExpr = ((BinaryPredicate) e).getChild(0).findSrcExpr();
+        Expr rhsSrcExpr  = ((BinaryPredicate) e).getChild(1).findSrcExpr();
+        if (lhsSrcExpr != null && rhsSrcExpr != null) {
+          if (lhsSrcExpr.isConstant() && rhsSrcExpr.isConstant() &&
+              lhsSrcExpr.equals(rhsSrcExpr)) {
+            continue;
+          }
+          if (lhsSrcExpr instanceof SlotRef && rhsSrcExpr instanceof SlotRef) {
+            SlotRef lhsSlotRef = (SlotRef) lhsSrcExpr;
+            SlotRef rhsSlotRef = (SlotRef) rhsSrcExpr;
+            if (lhsSlotRef.getDesc().equals(rhsSlotRef.getDesc())) continue;
+          }
+        }
       }
       finalConjuncts.add(e);
     }
diff --git a/testdata/workloads/functional-query/queries/QueryTest/inline-view.test b/testdata/workloads/functional-query/queries/QueryTest/inline-view.test
index 2cbda7bba..5c0cc185d 100644
--- a/testdata/workloads/functional-query/queries/QueryTest/inline-view.test
+++ b/testdata/workloads/functional-query/queries/QueryTest/inline-view.test
@@ -578,4 +578,18 @@ NULL,NULL
 NULL,NULL
 ---- TYPES
 INT, INT
-====
\ No newline at end of file
+====
+---- QUERY
+# IMPALA-11323: Constant expressions should get filtered out so they
+# don't get placed in the select node in the planner.
+with t as (select 1 a), v as
+  (select distinct a, cast(null as smallint) b, cast(null as smallint) c from t)
+select distinct a,b,c from v
+  union all
+select distinct a,b,c from v;
+---- RESULTS
+1,NULL,NULL
+1,NULL,NULL
+---- TYPES
+TINYINT, SMALLINT, SMALLINT
+====
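
Stripped of the planner machinery, the skip rule added above can be sketched as
follows. This is hypothetical Python, not the Java implementation; source
expressions are modeled here as ('const', value) or ('slot', slot_id) tuples:

```
def should_skip_inferred_equality(lhs_src, rhs_src):
    # Unknown source on either side: keep the conjunct.
    if lhs_src is None or rhs_src is None:
        return False
    # Both sides resolve to the same constant, e.g. NULL = NULL from a view.
    if lhs_src[0] == "const" and rhs_src[0] == "const":
        return lhs_src == rhs_src
    # Both sides resolve to the same scan slot: an identity predicate c1 = c1.
    if lhs_src[0] == "slot" and rhs_src[0] == "slot":
        return lhs_src[1] == rhs_src[1]
    return False

assert should_skip_inferred_equality(("const", None), ("const", None))
assert should_skip_inferred_equality(("slot", 3), ("slot", 3))
assert not should_skip_inferred_equality(("slot", 3), ("const", None))
```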


[impala] 21/27: IMPALA-11406: Fix incorrect duration log for authorization

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit f767619dc6ba94dc09fb6c43e53d4f6dd2238a7b
Author: stiga-huang <hu...@gmail.com>
AuthorDate: Thu Jun 30 14:59:41 2022 +0800

    IMPALA-11406: Fix incorrect duration log for authorization
    
    IMPALA-8443 extends EventSequence.markEvent() to return the duration
    between the last and the current event. However, the duration is
    calculated from the start time rather than from the previous event,
    which produces misleading durations in log lines like
    "Authorization check took n ms".
    
    This fixes the bug and also adds a log for the analysis duration.
    
    Change-Id: I8b665f1b4ac86577711598ce9d845cf82fedbcd7
    Reviewed-on: http://gerrit.cloudera.org:8080/18682
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 fe/src/main/java/org/apache/impala/analysis/AnalysisContext.java | 3 ++-
 fe/src/main/java/org/apache/impala/util/EventSequence.java       | 8 ++++++--
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/fe/src/main/java/org/apache/impala/analysis/AnalysisContext.java b/fe/src/main/java/org/apache/impala/analysis/AnalysisContext.java
index 8442a27a4..9304ffbe5 100644
--- a/fe/src/main/java/org/apache/impala/analysis/AnalysisContext.java
+++ b/fe/src/main/java/org/apache/impala/analysis/AnalysisContext.java
@@ -471,7 +471,8 @@ public class AnalysisContext {
     } finally {
       authzChecker.postAnalyze(authzCtx);
     }
-    timeline_.markEvent("Analysis finished");
+    long durationMs = timeline_.markEvent("Analysis finished") / 1000000;
+    LOG.info("Analysis took {} ms", durationMs);
 
     // Authorize statement and record exception. Authorization relies on information
     // collected during analysis.
diff --git a/fe/src/main/java/org/apache/impala/util/EventSequence.java b/fe/src/main/java/org/apache/impala/util/EventSequence.java
index d137208c1..72b1126d9 100644
--- a/fe/src/main/java/org/apache/impala/util/EventSequence.java
+++ b/fe/src/main/java/org/apache/impala/util/EventSequence.java
@@ -33,10 +33,12 @@ public class EventSequence {
 
   private final long startTime_;
   private final String name_;
+  private long lastTime_;
 
   public EventSequence(String name) {
     name_ = name;
     startTime_ = System.nanoTime();
+    lastTime_ = startTime_;
   }
 
   /**
@@ -45,8 +47,10 @@ public class EventSequence {
    */
   public long markEvent(String label) {
     // Timestamps should be in ns resolution
-    long durationNs = System.nanoTime() - startTime_;
-    timestamps_.add(durationNs);
+    long currentTime = System.nanoTime();
+    long durationNs = currentTime - lastTime_;
+    lastTime_ = currentTime;
+    timestamps_.add(currentTime - startTime_);
     labels_.add(label);
     return durationNs;
   }
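
The bookkeeping change is easy to see in isolation: keep the offset from the
start for the timeline, but return the gap since the previous event. A minimal
Python sketch with assumed names, not the Java class itself:

```
import time

class EventSequenceSketch:
    def __init__(self):
        self.start = self.last = time.monotonic_ns()
        self.labels, self.timestamps = [], []

    def mark_event(self, label):
        now = time.monotonic_ns()
        duration_ns = now - self.last             # gap since the previous event
        self.last = now
        self.timestamps.append(now - self.start)  # timeline offset, as before
        self.labels.append(label)
        return duration_ns

seq = EventSequenceSketch()
print(seq.mark_event("Analysis finished"), "ns since the previous event")
```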


[impala] 14/27: IMPALA-11337: Flush row output before writing "Fetched X row(s)"

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 328eedf96725c1eafd29675eea89b72990a06d15
Author: Joe McDonnell <jo...@cloudera.com>
AuthorDate: Mon Jun 13 15:08:42 2022 -0700

    IMPALA-11337: Flush row output before writing "Fetched X row(s)"
    
    When redirecting stdout and stderr to a file, the
    existing code can sometimes output the "Fetched X row(s)"
    line before finishing the row output. e.g.
    impala-shell -B -q "select 1" >> outfile.txt 2>> outfile.txt
    
    The row output goes to stdout while the control messages
    like "Fetched X row(s)" go to stderr. Since stdout can buffer
    its output, the rows may be written late. This adds a flush of
    stdout before writing the "Fetched X row(s)" message.
    
    Testing:
     - Added a shell test that redirects stdout and stderr to
       a file and verifies the contents. This consistently
       fails without the flush.
     - Other shell tests pass
    
    Change-Id: I83f89c110fd90d2d54331c7121e407d9de99146c
    Reviewed-on: http://gerrit.cloudera.org:8080/18625
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 shell/impala_shell.py                 |  4 ++++
 shell/shell_output.py                 |  6 ++++++
 tests/shell/test_shell_commandline.py | 21 +++++++++++++++++++++
 tests/shell/util.py                   | 33 +++++++++++++++++++++++----------
 4 files changed, 54 insertions(+), 10 deletions(-)

diff --git a/shell/impala_shell.py b/shell/impala_shell.py
index 7bf891d56..acaa02b88 100755
--- a/shell/impala_shell.py
+++ b/shell/impala_shell.py
@@ -1328,6 +1328,10 @@ class ImpalaShell(cmd.Cmd, object):
         # retrieve the error log
         warning_log = self.imp_client.get_warning_log(self.last_query_handle)
 
+        # Flush the row output. This is important so that the row output will not
+        # come after the "Fetch X row(s)" message.
+        self.output_stream.flush()
+
       end_time = time.time()
 
       if warning_log:
diff --git a/shell/shell_output.py b/shell/shell_output.py
index 608a8ca92..cac21455d 100644
--- a/shell/shell_output.py
+++ b/shell/shell_output.py
@@ -137,6 +137,12 @@ class OutputStream(object):
       # If filename is None, then just print to stdout
       print(formatted_data)
 
+  def flush(self):
+    # When outputing to a file, the file is currently closed with each write,
+    # so the flush doesn't need to do anything.
+    if self.filename is None:
+      sys.stdout.flush()
+
 
 class OverwritingStdErrOutputStream(object):
   """This class is used to write output to stderr and overwrite the previous text as
diff --git a/tests/shell/test_shell_commandline.py b/tests/shell/test_shell_commandline.py
index 1cfd90746..ef6536e26 100644
--- a/tests/shell/test_shell_commandline.py
+++ b/tests/shell/test_shell_commandline.py
@@ -1250,3 +1250,24 @@ class TestImpalaShell(ImpalaTestSuite):
     output = run_impala_shell_cmd(vector, ['-q', query, '-B'])
     assert "Fetched 1 row(s)" in output.stderr
     assert "Trailing Whitespace          \n" in output.stdout
+
+  def test_shell_flush(self, vector, tmp_file):
+    """Verify that the rows are flushed before the Fetch X row(s) message"""
+
+    # Run a simple "select 1" with stdout and stderr redirected to the same file.
+    with open(tmp_file, "w") as f:
+      output = run_impala_shell_cmd(vector, ['-q', DEFAULT_QUERY, '-B'], stdout_file=f,
+                                    stderr_file=f)
+      # Stdout and stderr should be empty
+      assert output.stderr is None
+      assert output.stdout is None
+
+    # Verify the file contents
+    # The output should be in this order:
+    # 1\n
+    # Fetched 1 row(s) in ...\n
+    with open(tmp_file, "r") as f:
+      lines = f.readlines()
+      assert len(lines) >= 2
+      assert "1\n" in lines[len(lines) - 2]
+      assert "Fetched 1 row(s)" in lines[len(lines) - 1]
diff --git a/tests/shell/util.py b/tests/shell/util.py
index fc0365f3b..1aa49b2cb 100755
--- a/tests/shell/util.py
+++ b/tests/shell/util.py
@@ -165,14 +165,17 @@ def assert_pattern(pattern, result, text, message):
 
 
 def run_impala_shell_cmd(vector, shell_args, env=None, expect_success=True,
-                         stdin_input=None, wait_until_connected=True):
+                         stdin_input=None, wait_until_connected=True,
+                         stdout_file=None, stderr_file=None):
   """Runs the Impala shell on the commandline.
 
   'shell_args' is a string which represents the commandline options.
   Returns a ImpalaShellResult.
   """
   result = run_impala_shell_cmd_no_expect(vector, shell_args, env, stdin_input,
-                                          expect_success and wait_until_connected)
+                                          expect_success and wait_until_connected,
+                                          stdout_file=stdout_file,
+                                          stderr_file=stderr_file)
   if expect_success:
     assert result.rc == 0, "Cmd %s was expected to succeed: %s" % (shell_args,
                                                                    result.stderr)
@@ -182,7 +185,8 @@ def run_impala_shell_cmd(vector, shell_args, env=None, expect_success=True,
 
 
 def run_impala_shell_cmd_no_expect(vector, shell_args, env=None, stdin_input=None,
-                                   wait_until_connected=True):
+                                   wait_until_connected=True, stdout_file=None,
+                                   stderr_file=None):
   """Runs the Impala shell on the commandline.
 
   'shell_args' is a string which represents the commandline options.
@@ -190,7 +194,8 @@ def run_impala_shell_cmd_no_expect(vector, shell_args, env=None, stdin_input=Non
 
   Does not assert based on success or failure of command.
   """
-  p = ImpalaShell(vector, shell_args, env=env, wait_until_connected=wait_until_connected)
+  p = ImpalaShell(vector, shell_args, env=env, wait_until_connected=wait_until_connected,
+                  stdout_file=stdout_file, stderr_file=stderr_file)
   result = p.get_result(stdin_input)
   return result
 
@@ -262,11 +267,16 @@ class ImpalaShell(object):
      get_result() to retrieve the process output. This constructor will wait until
      Impala shell is connected for the specified timeout unless wait_until_connected is
      set to False or --quiet is passed into the args."""
-  def __init__(self, vector, args=None, env=None, wait_until_connected=True, timeout=60):
-    self.shell_process = self._start_new_shell_process(vector, args, env=env)
+  def __init__(self, vector, args=None, env=None, wait_until_connected=True, timeout=60,
+               stdout_file=None, stderr_file=None):
+    self.shell_process = self._start_new_shell_process(vector, args, env=env,
+                                                       stdout_file=stdout_file,
+                                                       stderr_file=stderr_file)
     # When --quiet option is passed to Impala shell, we should not wait until we see
-    # "Connected to" because it will never be printed to stderr.
-    if wait_until_connected and (args is None or "--quiet" not in args):
+    # "Connected to" because it will never be printed to stderr. The same is true
+    # if stderr is redirected.
+    if wait_until_connected and (args is None or "--quiet" not in args) and \
+       stderr_file is None:
       start_time = time.time()
       connected = False
       while time.time() - start_time < timeout and not connected:
@@ -311,11 +321,14 @@ class ImpalaShell(object):
     result.rc = self.shell_process.returncode
     return result
 
-  def _start_new_shell_process(self, vector, args=None, env=None):
+  def _start_new_shell_process(self, vector, args=None, env=None, stdout_file=None,
+                               stderr_file=None):
     """Starts a shell process and returns the process handle"""
     cmd = get_shell_cmd(vector)
     if args is not None: cmd += args
-    return Popen(cmd, shell=False, stdout=PIPE, stdin=PIPE, stderr=PIPE,
+    stdout_arg = stdout_file if stdout_file is not None else PIPE
+    stderr_arg = stderr_file if stderr_file is not None else PIPE
+    return Popen(cmd, shell=False, stdout=stdout_arg, stdin=PIPE, stderr=stderr_arg,
                  env=build_shell_env(env))
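
The underlying buffering issue can be reproduced outside impala-shell: when
stdout is redirected to a file it is block-buffered, while stderr typically is
not, so unflushed row data can land in the file after the status line. A
standalone Python illustration (not shell code):

```
import sys

sys.stdout.write("1\n")      # row output: block-buffered when redirected
sys.stdout.flush()           # the essence of the fix: flush rows first
sys.stderr.write("Fetched 1 row(s) in 0.01s\n")   # status line on stderr
```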
 
 


[impala] 05/27: IMPALA-11305: Fix TypeError in impala-shell summary progress

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 41a2352a79509a94c19d3d6a1388c52512465f50
Author: Riza Suminto <ri...@cloudera.com>
AuthorDate: Thu May 19 21:12:32 2022 -0700

    IMPALA-11305: Fix TypeError in impala-shell summary progress
    
    impala-shell fails with a TypeError when installed with python3. This is
    due to the behavior change of the division operator ('/') between python2
    and python3. This patch fixes the issue by switching to the floor division
    operator ('//'), which yields an integer result, as described in
    https://peps.python.org/pep-0238/.
    
    Testing:
    - Manually installed impala-shell from pip with python3 and verified
      that the fix works.
    
    Change-Id: Ifbe4df6a7a4136e590f383fc6475e2283e35eadc
    Reviewed-on: http://gerrit.cloudera.org:8080/18546
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Reviewed-by: Michael Smith <mi...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 shell/impala_shell.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/shell/impala_shell.py b/shell/impala_shell.py
index 55387b602..7bf891d56 100755
--- a/shell/impala_shell.py
+++ b/shell/impala_shell.py
@@ -1246,7 +1246,7 @@ class ImpalaShell(cmd.Cmd, object):
           return
 
         if self.live_progress and progress.total_scan_ranges > 0:
-          val = ((summary.progress.num_completed_scan_ranges * 100) /
+          val = ((summary.progress.num_completed_scan_ranges * 100) //
                  summary.progress.total_scan_ranges)
           fragment_text = "[%s%s] %s%%\n" % ("#" * val, " " * (100 - val), val)
           data += fragment_text


[impala] 10/27: IMPALA-5845: Limit the number of non-fatal errors logging to INFO

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit b25aef4e0a145a57319efabd7be87a4d1262a67f
Author: Riza Suminto <ri...@cloudera.com>
AuthorDate: Wed May 25 23:51:58 2022 -0700

    IMPALA-5845: Limit the number of non-fatal errors logging to INFO
    
    RuntimeState::LogError() does both error aggregation to the coordinator
    and logging the error to the log file, depending on the vlog_level. This
    can flood the INFO log if the specified vlog_level is 1 and makes it
    difficult to analyze other, more significant log lines. This patch limits
    the number of errors logged to INFO based on the
    max_error_logs_per_instance flag (default 2000). When this number is
    exceeded, vlog_level=1 is downgraded to vlog_level=2.
    
    To allow easy debugging in the future, this flag is ignored if the user
    sets the max_errors query option to a negative value, in which case all
    errors targeting vlog_level 1 will be logged.
    
    This patch also fixes a bug where the error count is not increased for a
    non-general error code that is already in the 'error_log_' map.
    
    Testing:
    - Add test_logging.py::TestLoggingCore
    
    Change-Id: I924768ec461735c172fbf75d6415033bbdb77f9b
    Reviewed-on: http://gerrit.cloudera.org:8080/18565
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 be/src/runtime/query-state.cc        |  6 ----
 be/src/runtime/runtime-state.cc      | 42 +++++++++++++++++++++----
 be/src/runtime/runtime-state.h       | 12 +++++++
 tests/custom_cluster/test_logging.py | 61 ++++++++++++++++++++++++++++++++++++
 4 files changed, 109 insertions(+), 12 deletions(-)

diff --git a/be/src/runtime/query-state.cc b/be/src/runtime/query-state.cc
index e7badc60a..6b6286f7a 100644
--- a/be/src/runtime/query-state.cc
+++ b/be/src/runtime/query-state.cc
@@ -118,12 +118,6 @@ QueryState::QueryState(
   }
   TQueryOptions& query_options =
       const_cast<TQueryOptions&>(query_ctx_.client_request.query_options);
-  // max_errors does not indicate how many errors in total have been recorded, but rather
-  // how many are distinct. It is defined as the sum of the number of generic errors and
-  // the number of distinct other errors.
-  if (query_options.max_errors <= 0) {
-    query_options.max_errors = 100;
-  }
   if (query_options.batch_size <= 0) {
     query_options.__set_batch_size(DEFAULT_BATCH_SIZE);
   }
diff --git a/be/src/runtime/runtime-state.cc b/be/src/runtime/runtime-state.cc
index 8a994b35e..d9c83e624 100644
--- a/be/src/runtime/runtime-state.cc
+++ b/be/src/runtime/runtime-state.cc
@@ -59,7 +59,11 @@
 
 using strings::Substitute;
 
-DECLARE_int32(max_errors);
+DEFINE_int32(max_error_logs_per_instance, 2000,
+    "Maximum number of non-fatal error to be logged in log level 1 (INFO). "
+    "Once this number exceeded, further non-fatal error will be logged at log level 2 "
+    "(DEBUG) severity. This flag is ignored if user set negative max_errors query "
+    "option. Default to 2000");
 
 namespace impala {
 
@@ -190,11 +194,37 @@ string RuntimeState::ErrorLog() {
 
 bool RuntimeState::LogError(const ErrorMsg& message, int vlog_level) {
   lock_guard<SpinLock> l(error_log_lock_);
-  // All errors go to the log, unreported_error_count_ is counted independently of the
-  // size of the error_log to account for errors that were already reported to the
-  // coordinator
-  VLOG(vlog_level) << "Error from query " << PrintId(query_id()) << ": " << message.msg();
-  if (ErrorCount(error_log_) < query_options().max_errors) {
+  // All errors go to the log. If the amount of errors logged to vlog level 1 exceed
+  // or equal max_error_logs_per_instance, then that error will be downgraded to vlog
+  // level 2.
+  int user_max_errors = query_options().max_errors;
+  if (vlog_level == 1 && user_max_errors >= 0
+      && vlog_1_errors >= FLAGS_max_error_logs_per_instance) {
+    vlog_level = 2;
+  }
+
+  if (VLOG_IS_ON(vlog_level)) {
+    VLOG(vlog_level) << "Error from query " << PrintId(query_id()) << ": "
+                     << message.msg();
+  }
+
+  if (vlog_level == 1 && user_max_errors >= 0) {
+    vlog_1_errors++;
+    DCHECK_LE(vlog_1_errors, FLAGS_max_error_logs_per_instance);
+    if (vlog_1_errors == FLAGS_max_error_logs_per_instance) {
+      VLOG(vlog_level) << "Query " << PrintId(query_id()) << " printed "
+                       << FLAGS_max_error_logs_per_instance
+                       << " non-fatal error to log level 1 (INFO). Further non-fatal "
+                       << "error will be downgraded to log level 2 (DEBUG).";
+    }
+  }
+
+  TErrorCode::type code = message.error();
+  if (ErrorCount(error_log_) < max_errors()
+      || (code != TErrorCode::GENERAL && error_log_.find(code) != error_log_.end())) {
+    // Appending general error is expensive since it writes the entire message to the
+    // error_log_ map. Meanwhile, appending non-general (specific) error that already
+    // exist in error_log_ is cheap since it only increment count.
     AppendError(&error_log_, message);
     return true;
   }
diff --git a/be/src/runtime/runtime-state.h b/be/src/runtime/runtime-state.h
index 7d2d57bb3..5d00df7ae 100644
--- a/be/src/runtime/runtime-state.h
+++ b/be/src/runtime/runtime-state.h
@@ -166,6 +166,15 @@ class RuntimeState {
     return Status::OK();
   }
 
+  /// Return maximum number of non-fatal error to report to client through coordinator.
+  /// max_errors does not indicate how many errors in total have been recorded, but rather
+  /// how many are distinct. It is defined as the sum of the number of generic errors and
+  /// the number of distinct other errors. Default to 100 if non-positive number is
+  /// specified in max_errors query option.
+  inline int max_errors() const {
+    return query_options().max_errors <= 0 ? 100 : query_options().max_errors;
+  }
+
   /// Log an error that will be sent back to the coordinator based on an instance of the
   /// ErrorMsg class. The runtime state aggregates log messages based on type with one
   /// exception: messages with the GENERAL type are not aggregated but are kept
@@ -318,6 +327,9 @@ class RuntimeState {
   /// Logs error messages.
   ErrorLogMap error_log_;
 
+  /// Track how many error has been printed to VLOG(1).
+  int64_t vlog_1_errors = 0;
+
   /// Global QueryState and original thrift descriptors for this fragment instance.
   QueryState* const query_state_;
   const TPlanFragment* const fragment_;
diff --git a/tests/custom_cluster/test_logging.py b/tests/custom_cluster/test_logging.py
new file mode 100644
index 000000000..34cf2578e
--- /dev/null
+++ b/tests/custom_cluster/test_logging.py
@@ -0,0 +1,61 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import pytest
+
+from tests.common.custom_cluster_test_suite import CustomClusterTestSuite
+
+
+class TestLoggingCore(CustomClusterTestSuite):
+  """Test existence of certain log lines under some scenario."""
+
+  @classmethod
+  def get_workload(cls):
+    return 'functional-query'
+
+  def _test_max_errors(self, max_error_logs_per_instance, max_errors, expect_downgraded):
+    """Test that number of non-fatal error printed to INFO log is limited by
+    max_errors and max_error_logs_per_instance."""
+
+    query = ("select id, bool_col, tinyint_col, smallint_col "
+        "from functional.alltypeserror order by id")
+    client = self.create_impala_client()
+
+    self.execute_query_expect_success(client, query, {'max_errors': max_errors})
+    self.assert_impalad_log_contains("INFO", "Error parsing row",
+        max_error_logs_per_instance if expect_downgraded else 8)
+    self.assert_impalad_log_contains("INFO",
+        "printed {0} non-fatal error to log level 1".format(max_error_logs_per_instance),
+        1 if expect_downgraded else 0)
+
+  @pytest.mark.execute_serially
+  @CustomClusterTestSuite.with_args(cluster_size=1,
+      impalad_args="--max_error_logs_per_instance=2")
+  def test_max_errors(self):
+    self._test_max_errors(2, 4, True)
+
+  @pytest.mark.execute_serially
+  @CustomClusterTestSuite.with_args(cluster_size=1,
+      impalad_args="--max_error_logs_per_instance=3")
+  def test_max_errors_0(self):
+    self._test_max_errors(3, 0, True)
+
+  @pytest.mark.execute_serially
+  @CustomClusterTestSuite.with_args(cluster_size=1,
+      impalad_args="--max_error_logs_per_instance=2")
+  def test_max_errors_no_downgrade(self):
+    self._test_max_errors(2, -1, False)
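
The downgrade rule itself is small; a hypothetical Python sketch of the check
added to LogError() (illustrative names, not the C++ implementation):

```
MAX_ERROR_LOGS_PER_INSTANCE = 2000   # --max_error_logs_per_instance

def effective_vlog_level(requested_level, errors_logged_at_1, user_max_errors):
    # A negative max_errors query option disables the downgrade entirely.
    if (requested_level == 1 and user_max_errors >= 0
            and errors_logged_at_1 >= MAX_ERROR_LOGS_PER_INSTANCE):
        return 2   # demote to DEBUG once the INFO budget is used up
    return requested_level

assert effective_vlog_level(1, 2000, 100) == 2
assert effective_vlog_level(1, 1999, 100) == 1
assert effective_vlog_level(1, 5000, -1) == 1
```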


[impala] 13/27: IMPALA-11280: Join node incorrectly picks up unnest(array) predicates

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 67894f9723168ef4b576930765eb18c6dc1110b8
Author: Gabor Kaszab <ga...@cloudera.com>
AuthorDate: Mon Jun 13 14:04:36 2022 +0200

    IMPALA-11280: Join node incorrectly picks up unnest(array) predicates
    
    The expectation for predicates on unnested arrays is that they are
    either picked up by the SCAN node or the UNNEST node for evaluation: if
    only one array is being unnested, the SCAN node is responsible for the
    evaluation, otherwise the UNNEST node is. However, if a JOIN node is
    involved and the JOIN is constructed before the UNNEST node is created,
    then the JOIN node incorrectly picks up the predicates on the unnested
    arrays as well. This patch fixes this behaviour.
    
    Tests:
      - Added E2E tests to cover result correctness.
      - Added planner tests to verify that the desired node picks up the
        predicates for unnested arrays.
    
    Change-Id: I89fed4eef220ca513b259f0e2649cdfbe43c797a
    Reviewed-on: http://gerrit.cloudera.org:8080/18614
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 .../java/org/apache/impala/planner/PlanNode.java   |  2 +-
 .../java/org/apache/impala/planner/ScanNode.java   |  3 +
 .../apache/impala/planner/SingleNodePlanner.java   |  1 +
 .../apache/impala/planner/SingularRowSrcNode.java  |  3 -
 .../java/org/apache/impala/planner/UnnestNode.java |  3 +
 .../queries/PlannerTest/zipping-unnest.test        | 80 ++++++++++++++++++++++
 .../QueryTest/nested-array-in-select-list.test     | 45 +++++++++++-
 7 files changed, 131 insertions(+), 6 deletions(-)

diff --git a/fe/src/main/java/org/apache/impala/planner/PlanNode.java b/fe/src/main/java/org/apache/impala/planner/PlanNode.java
index 2dc21afe2..b936d5a5c 100644
--- a/fe/src/main/java/org/apache/impala/planner/PlanNode.java
+++ b/fe/src/main/java/org/apache/impala/planner/PlanNode.java
@@ -548,7 +548,7 @@ abstract public class PlanNode extends TreeNode<PlanNode> {
     analyzer.markConjunctsAssigned(unassigned);
   }
 
-  protected boolean shouldPickUpZippingUnnestConjuncts() { return true; }
+  protected boolean shouldPickUpZippingUnnestConjuncts() { return false; }
 
   /**
    * Apply the provided conjuncts to the this node, returning the new root of
diff --git a/fe/src/main/java/org/apache/impala/planner/ScanNode.java b/fe/src/main/java/org/apache/impala/planner/ScanNode.java
index e8731d6e8..30f7fee62 100644
--- a/fe/src/main/java/org/apache/impala/planner/ScanNode.java
+++ b/fe/src/main/java/org/apache/impala/planner/ScanNode.java
@@ -85,6 +85,9 @@ abstract public class ScanNode extends PlanNode {
 
   public TupleDescriptor getTupleDesc() { return desc_; }
 
+  @Override
+  protected boolean shouldPickUpZippingUnnestConjuncts() { return true; }
+
   /**
    * Checks if this scan is supported based on the types of scanned columns and the
    * underlying file formats, in particular, whether complex types are supported.
diff --git a/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java b/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
index 734885fbe..143ff4bce 100644
--- a/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
+++ b/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
@@ -2054,6 +2054,7 @@ public class SingleNodePlanner {
         Preconditions.checkState(hasNullMatchingEqOperator);
       }
     }
+    PlanNode.removeZippingUnnestConjuncts(otherJoinConjuncts, analyzer);
     analyzer.markConjunctsAssigned(otherJoinConjuncts);
 
     if (analyzer.getQueryOptions().isEnable_distinct_semi_join_optimization() &&
diff --git a/fe/src/main/java/org/apache/impala/planner/SingularRowSrcNode.java b/fe/src/main/java/org/apache/impala/planner/SingularRowSrcNode.java
index bef43b783..e9c498417 100644
--- a/fe/src/main/java/org/apache/impala/planner/SingularRowSrcNode.java
+++ b/fe/src/main/java/org/apache/impala/planner/SingularRowSrcNode.java
@@ -88,7 +88,4 @@ public class SingularRowSrcNode extends PlanNode {
   protected void toThrift(TPlanNode msg) {
     msg.node_type = TPlanNodeType.SINGULAR_ROW_SRC_NODE;
   }
-
-  @Override
-  protected boolean shouldPickUpZippingUnnestConjuncts() { return false; }
 }
diff --git a/fe/src/main/java/org/apache/impala/planner/UnnestNode.java b/fe/src/main/java/org/apache/impala/planner/UnnestNode.java
index e5247af0f..dd6eb33f5 100644
--- a/fe/src/main/java/org/apache/impala/planner/UnnestNode.java
+++ b/fe/src/main/java/org/apache/impala/planner/UnnestNode.java
@@ -82,6 +82,9 @@ public class UnnestNode extends PlanNode {
     computeMemLayout(analyzer);
   }
 
+  @Override
+  protected boolean shouldPickUpZippingUnnestConjuncts() { return true; }
+
   @Override
   public void computeStats(Analyzer analyzer) {
     super.computeStats(analyzer);
diff --git a/testdata/workloads/functional-planner/queries/PlannerTest/zipping-unnest.test b/testdata/workloads/functional-planner/queries/PlannerTest/zipping-unnest.test
index a9672637d..f4a39cab6 100644
--- a/testdata/workloads/functional-planner/queries/PlannerTest/zipping-unnest.test
+++ b/testdata/workloads/functional-planner/queries/PlannerTest/zipping-unnest.test
@@ -204,3 +204,83 @@ PLAN-ROOT SINK
    HDFS partitions=1/1 files=1 size=1.06KB
    row-size=16B cardinality=1.35K
 ====
+# IMPALA-11280. There is a join involved here by using the IN operator, and multiple
+# arrays are unnested. Checks that the predicate on an unnested array is not picked up by
+# the join node.
+select id, unnested_arr1, unnested_arr2
+from (
+    select id, unnest(arr1) as unnested_arr1, unnest(arr2) as unnested_arr2
+    from functional_parquet.complextypes_arrays
+    where id % 2 = 1 and id in (select id from functional_parquet.alltypestiny)
+) a
+where a.unnested_arr1 < 5;
+---- PLAN
+PLAN-ROOT SINK
+|
+06:SUBPLAN
+|  row-size=44B cardinality=1.35K
+|
+|--04:NESTED LOOP JOIN [CROSS JOIN]
+|  |  row-size=44B cardinality=10
+|  |
+|  |--02:SINGULAR ROW SRC
+|  |     row-size=28B cardinality=1
+|  |
+|  03:UNNEST [functional_parquet.complextypes_arrays.arr1 arr1, functional_parquet.complextypes_arrays.arr2 arr2]
+|     predicates: UNNEST(arr1) < 5
+|     row-size=0B cardinality=10
+|
+05:HASH JOIN [LEFT SEMI JOIN]
+|  hash predicates: id = id
+|  runtime filters: RF000 <- id
+|  row-size=28B cardinality=135
+|
+|--01:SCAN HDFS [functional_parquet.alltypestiny]
+|     HDFS partitions=4/4 files=4 size=11.92KB
+|     predicates: functional_parquet.alltypestiny.id % 2 = 1
+|     row-size=4B cardinality=76
+|
+00:SCAN HDFS [functional_parquet.complextypes_arrays]
+   HDFS partitions=1/1 files=1 size=1.06KB
+   predicates: id % 2 = 1
+   runtime filters: RF000 -> id
+   row-size=28B cardinality=135
+====
+# Similar as above but here the join is explicitly included in the query string and is not
+# a result of a query rewrite.
+select a.id, unnested_arr1, unnested_arr2
+from (
+    select cta.id, unnest(arr1) as unnested_arr1, unnest(arr2) as unnested_arr2
+    from functional_parquet.complextypes_arrays cta left join functional_parquet.alltypestiny ti on cta.id = ti.id
+    where cta.id % 2 = 1) a
+where a.unnested_arr1 < 5;
+---- PLAN
+PLAN-ROOT SINK
+|
+06:SUBPLAN
+|  row-size=44B cardinality=1.35K
+|
+|--04:NESTED LOOP JOIN [CROSS JOIN]
+|  |  row-size=44B cardinality=10
+|  |
+|  |--02:SINGULAR ROW SRC
+|  |     row-size=28B cardinality=1
+|  |
+|  03:UNNEST [cta.arr1 arr1, cta.arr2 arr2]
+|     predicates: UNNEST(arr1) < 5
+|     row-size=0B cardinality=10
+|
+05:HASH JOIN [LEFT OUTER JOIN]
+|  hash predicates: cta.id = ti.id
+|  row-size=32B cardinality=135
+|
+|--01:SCAN HDFS [functional_parquet.alltypestiny ti]
+|     HDFS partitions=4/4 files=4 size=11.92KB
+|     predicates: ti.id % 2 = 1
+|     row-size=4B cardinality=76
+|
+00:SCAN HDFS [functional_parquet.complextypes_arrays cta]
+   HDFS partitions=1/1 files=1 size=1.06KB
+   predicates: cta.id % 2 = 1
+   row-size=28B cardinality=135
+====
\ No newline at end of file
diff --git a/testdata/workloads/functional-query/queries/QueryTest/nested-array-in-select-list.test b/testdata/workloads/functional-query/queries/QueryTest/nested-array-in-select-list.test
index 13362f23e..1b2ef1e63 100644
--- a/testdata/workloads/functional-query/queries/QueryTest/nested-array-in-select-list.test
+++ b/testdata/workloads/functional-query/queries/QueryTest/nested-array-in-select-list.test
@@ -334,11 +334,52 @@ BIGINT,INT
 select item from complextypes_arrays_only_view.int_array
 ---- CATCH
 AnalysisException: Non-relative collections are currently not supported on collections from views.
-=====
+====
 ---- QUERY
 # IMPALA-11052: allow using collections returned from views as non-relative table refs
 with s as (select int_array a from complextypestbl t)
 select item from s.a
 ---- CATCH
 AnalysisException: Could not resolve table reference: 's.a
-=====
\ No newline at end of file
+====
+---- QUERY
+# IMPALA-11280. There is a join involved here by using the IN operator, and multiple
+# arrays are unnested. Checks that the predicate on an unnested array is evaluated
+# correctly.
+select id, unnested_arr1, unnested_arr2
+from (
+    select id, unnest(arr1) as unnested_arr1, unnest(arr2) as unnested_arr2
+    from complextypes_arrays
+    where id % 2 = 1 and id in (select id from alltypestiny)
+) a
+where a.unnested_arr1 < 5;
+---- RESULTS
+1,1,'one'
+1,2,'two'
+1,3,'three'
+1,4,'four'
+7,1,'NULL'
+7,2,'NULL'
+---- TYPES
+INT,INT,STRING
+====
+---- QUERY
+# Similar as above but here the join is explicitly included in the query string and is not
+# a result of a query rewrite.
+select a.id, unnested_arr1, unnested_arr2
+from (
+    select cta.id, unnest(arr1) as unnested_arr1, unnest(arr2) as unnested_arr2
+    from functional_parquet.complextypes_arrays cta
+        left join functional_parquet.alltypestiny ti on cta.id = ti.id
+    where cta.id % 2 = 1) a
+where a.unnested_arr1 < 5;
+---- RESULTS
+1,1,'one'
+1,2,'two'
+1,3,'three'
+1,4,'four'
+7,1,'NULL'
+7,2,'NULL'
+---- TYPES
+INT,INT,STRING
+====
\ No newline at end of file


[impala] 20/27: IMPALA-11414: Off-by-one error in Parquet late materialization

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 83e9cfc1d57827f72121d70781aa5e70ae39d7b1
Author: Zoltan Borok-Nagy <bo...@cloudera.com>
AuthorDate: Mon Jul 4 16:22:03 2022 +0200

    IMPALA-11414: Off-by-one error in Parquet late materialization
    
    With PARQUET_LATE_MATERIALIZATION we can set the minimum number of
    consecutive filtered-out rows required before we avoid materializing
    the corresponding rows in the other Parquet columns.
    
    E.g. if PARQUET_LATE_MATERIALIZATION is 10, and in a filtered column
    we find at least 10 consecutive rows that don't pass the predicates,
    we avoid materializing the corresponding rows in the other columns.
    
    But due to an off-by-one error we actually only needed
    (PARQUET_LATE_MATERIALIZATION - 1) consecutive elements. This means
    that if we set PARQUET_LATE_MATERIALIZATION to one, zero consecutive
    filtered-out elements are enough, which leads to a crash/DCHECK. The
    bug is in the GetMicroBatches() algorithm when we produce the micro
    batches based on the selected rows.
    
    Setting PARQUET_LATE_MATERIALIZATION to 0 doesn't make sense, so it
    shouldn't be allowed.
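
    As a self-contained Java toy version of the merging logic described
    above (not Impala's C++ ScratchTupleBatch, just the same idea), the
    sketch below shows why a gap up to and including the skip length must
    still merge two batches:

      import java.util.ArrayList;
      import java.util.List;

      class MicroBatchSketch {
        // Returns [start, end] index pairs of selected rows; neighbouring
        // runs merge unless the distance between selected rows exceeds
        // skipLength (the "<=" semantics of the fix).
        static List<int[]> getMicroBatches(boolean[] selectedRows, int skipLength) {
          List<int[]> batches = new ArrayList<>();
          int start = -1, last = -1;
          for (int i = 0; i < selectedRows.length; i++) {
            if (!selectedRows[i]) continue;
            if (start == -1) {
              start = i;                        // first selected row overall
            } else if (i - last > skipLength) { // gap too long: close the batch
              batches.add(new int[] {start, last});
              start = i;
            }
            last = i;
          }
          if (start != -1) batches.add(new int[] {start, last});
          return batches;
        }

        public static void main(String[] args) {
          boolean[] sel = new boolean[40];
          for (int i = 1; i <= 8; i++) sel[i] = true;    // rows 1-8
          for (int i = 11; i <= 20; i++) sel[i] = true;  // rows 11-20
          for (int i = 35; i <= 39; i++) sel[i] = true;  // rows 35-39
          // Rows 9 and 10 are filtered out; 11 - 8 = 3 <= 10, so the first
          // two runs merge. Prints "1-20" and "35-39".
          for (int[] b : getMicroBatches(sel, 10)) {
            System.out.println(b[0] + "-" + b[1]);
          }
        }
      }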
    
    Testing
     * e2e test with PARQUET_LATE_MATERIALIZATION=1
     * e2e test for checking SET PARQUET_LATE_MATERIALIZATION=N
    
    Change-Id: I38f95ad48c4ac8c1e06651565ab5c496283b29fa
    Reviewed-on: http://gerrit.cloudera.org:8080/18700
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 be/src/exec/scratch-tuple-batch-test.cc                  |  2 +-
 be/src/exec/scratch-tuple-batch.h                        | 11 ++++++-----
 be/src/service/query-options.cc                          |  6 ++++--
 .../parquet-late-materialization-unique-db.test          | 15 +++++++++++++++
 .../functional-query/queries/QueryTest/set.test          | 16 ++++++++++++++++
 tests/query_test/test_parquet_late_materialization.py    |  4 ++++
 6 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/be/src/exec/scratch-tuple-batch-test.cc b/be/src/exec/scratch-tuple-batch-test.cc
index fc730bc78..df9d3d2ff 100644
--- a/be/src/exec/scratch-tuple-batch-test.cc
+++ b/be/src/exec/scratch-tuple-batch-test.cc
@@ -64,7 +64,7 @@ class ScratchTupleBatchTest : public testing::Test {
       int last_true_idx = batch.start;
       for (int j = batch.start + 1; j < batch.end; j++) {
         if (selected_rows[j]) {
-          EXPECT_TRUE(j - last_true_idx + 1 <= gap);
+          EXPECT_LE(j - last_true_idx, gap);
           last_true_idx = j;
         }
       }
diff --git a/be/src/exec/scratch-tuple-batch.h b/be/src/exec/scratch-tuple-batch.h
index 31edd2bb9..6513c5c70 100644
--- a/be/src/exec/scratch-tuple-batch.h
+++ b/be/src/exec/scratch-tuple-batch.h
@@ -169,10 +169,11 @@ struct ScratchTupleBatch {
   /// Creates micro batches that needs to be scanned.
   /// Bits set in 'selected_rows' are the rows that needs to be scanned. Consecutive
   /// bits set are used to create micro batches. Micro batches that differ by less than
-  /// 'skip_length', are merged together. E.g., for micro batches 1-8, 11-20, 35-100
-  /// derived from 'selected_rows' and 'skip_length' as 10, first two micro batches would
-  /// be merged into 1-20 as they differ by 3 (11 - 8) which is less than 10
-  /// ('skip_length'). Precondition for the function is there is at least one micro batch
+  /// or equal to 'skip_length', are merged together.
+  /// E.g., for micro batches 1-8, 11-20, 35-100 derived from 'selected_rows' and
+  /// 'skip_length' as 10, first two micro batches would be merged into 1-20 as there is a
+  /// non-matching run of 2 rows (9, 10) which is <= 10 ('skip_length').
+  /// Precondition for the function is there is at least one micro batch
   /// present i.e., atleast one of the 'selected_rows' is true.
   int GetMicroBatches(int skip_length, ScratchMicroBatch* batches) {
     int batch_idx = 0;
@@ -185,7 +186,7 @@ struct ScratchTupleBatch {
           // start the first ever micro batch.
           start = i;
           last = i;
-        } else if (i - last < skip_length) {
+        } else if (i - last <= skip_length) {
           // continue the old micro batch as 'last' is within 'skip_length' of last
           // micro batch.
           last = i;
diff --git a/be/src/service/query-options.cc b/be/src/service/query-options.cc
index 49bb06476..c5a1206ff 100644
--- a/be/src/service/query-options.cc
+++ b/be/src/service/query-options.cc
@@ -1160,9 +1160,11 @@ Status impala::SetQueryOption(const string& key, const string& value,
         StringParser::ParseResult result;
         const int32_t threshold =
             StringParser::StringToInt<int32_t>(value.c_str(), value.length(), &result);
-        if (value == nullptr || result != StringParser::PARSE_SUCCESS || threshold < -1) {
+        if (value == nullptr || result != StringParser::PARSE_SUCCESS || threshold < -1 ||
+            threshold == 0) {
           return Status(Substitute("Invalid parquet late materialization threshold: "
-              "'$0'. Only integer value -1 and above is allowed.", value));
+              "'$0'. Use -1 to disable the feature, and values > 0 to specify the "
+              "minimum skip length.", value));
         }
         query_options->__set_parquet_late_materialization_threshold(threshold);
         break;
diff --git a/testdata/workloads/functional-query/queries/QueryTest/parquet-late-materialization-unique-db.test b/testdata/workloads/functional-query/queries/QueryTest/parquet-late-materialization-unique-db.test
new file mode 100644
index 000000000..8097bb781
--- /dev/null
+++ b/testdata/workloads/functional-query/queries/QueryTest/parquet-late-materialization-unique-db.test
@@ -0,0 +1,15 @@
+# This tests pages skipped by parquet late materialization.
+====
+---- QUERY
+create table late_mat(i int, j int)
+stored as parquet;
+insert into late_mat values (1,1), (0,0), (1,1), (0,0), (0,0), (1,1);
+set PARQUET_LATE_MATERIALIZATION_THRESHOLD=1;
+select i, j from late_mat where j = 0;
+---- RESULTS
+0,0
+0,0
+0,0
+---- TYPES
+INT, INT
+====
diff --git a/testdata/workloads/functional-query/queries/QueryTest/set.test b/testdata/workloads/functional-query/queries/QueryTest/set.test
index b899f8656..f3a49aaeb 100644
--- a/testdata/workloads/functional-query/queries/QueryTest/set.test
+++ b/testdata/workloads/functional-query/queries/QueryTest/set.test
@@ -284,3 +284,19 @@ set SCAN_NODE_CODEGEN_THRESHOLD = "foo";
 set max_io_buffers="foo";
 ---- RESULTS
 ====
+---- QUERY
+set PARQUET_LATE_MATERIALIZATION_THRESHOLD=-1;
+set PARQUET_LATE_MATERIALIZATION_THRESHOLD=1;
+set PARQUET_LATE_MATERIALIZATION_THRESHOLD=10000;
+---- RESULTS
+====
+---- QUERY
+set PARQUET_LATE_MATERIALIZATION_THRESHOLD=-2;
+---- CATCH
+Invalid parquet late materialization threshold: '-2'. Use -1 to disable the feature, and values > 0 to specify the minimum skip length.
+====
+---- QUERY
+set PARQUET_LATE_MATERIALIZATION_THRESHOLD=0;
+---- CATCH
+Invalid parquet late materialization threshold: '0'. Use -1 to disable the feature, and values > 0 to specify the minimum skip length.
+====
diff --git a/tests/query_test/test_parquet_late_materialization.py b/tests/query_test/test_parquet_late_materialization.py
index a1f64d92c..670daa4d6 100644
--- a/tests/query_test/test_parquet_late_materialization.py
+++ b/tests/query_test/test_parquet_late_materialization.py
@@ -35,3 +35,7 @@ class TestParquetLateMaterialization(ImpalaTestSuite):
 
   def test_parquet_late_materialization(self, vector):
     self.run_test_case('QueryTest/parquet-late-materialization', vector)
+
+  def test_parquet_late_materialization_unique_db(self, vector, unique_database):
+    self.run_test_case('QueryTest/parquet-late-materialization-unique-db', vector,
+        unique_database)


[impala] 16/27: IMPALA-11365: Dereferencing null pointer in TopNNode

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 822fcf32744d323f101ebf9b03ac7d869785ba62
Author: Daniel Becker <da...@cloudera.com>
AuthorDate: Thu Jun 16 16:44:08 2022 +0200

    IMPALA-11365: Dereferencing null pointer in TopNNode
    
    In the constructor of TopNNode, if 'pnode.partition_comparator_config_'
    is NULL, we initialise 'partition_cmp_' with a NULL pointer. However,
    when initialising 'partition_heaps_', we dereference 'partition_cmp_'
    because 'ComparatorWrapper' expects a reference.
    
    This has so far not led to a crash because in this case the comparator
    of 'partition_heaps_' is not used, but assigning a NULL pointer to a
    reference is undefined behaviour.
    
    After this change, instead of assigning a NULL pointer to
    'partition_cmp_', we use a dummy comparator, and no undefined behaviour
    is invoked.
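
    The same "use a harmless dummy instead of null" idea, sketched in Java
    purely for illustration (the real fix is in C++, where the undefined
    behaviour comes from binding a reference through a null pointer; the
    names below are made up):

      import java.util.Comparator;

      class TopNSketch {
        // Fails loudly if ever invoked; it exists only so the field below
        // is never null.
        static final Comparator<long[]> DUMMY_CMP = (a, b) -> {
          throw new AssertionError("dummy comparator must never be called");
        };

        final Comparator<long[]> partitionCmp;

        TopNSketch(Comparator<long[]> configuredCmp) {
          // Mirrors the fix: fall back to a dummy rather than a null value
          // when no partition comparator is configured.
          partitionCmp = (configuredCmp != null) ? configuredCmp : DUMMY_CMP;
        }
      }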
    
    Change-Id: I0b15b06f608b4d17fdf8a24e05967aaa16ebb79c
    Reviewed-on: http://gerrit.cloudera.org:8080/18629
    Reviewed-by: Zoltan Borok-Nagy <bo...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 be/src/exec/topn-node.cc        | 29 ++++++++++++++++++++++++++++-
 be/src/util/tuple-row-compare.h |  6 ++++++
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/be/src/exec/topn-node.cc b/be/src/exec/topn-node.cc
index fc4318637..f65af7df3 100644
--- a/be/src/exec/topn-node.cc
+++ b/be/src/exec/topn-node.cc
@@ -121,6 +121,33 @@ Status TopNPlanNode::CreateExecNode(RuntimeState* state, ExecNode** node) const
   return Status::OK();
 }
 
+/// In the TopNNode constructor if 'pnode.partition_comparator_config_' is NULL, we use
+/// this dummy comparator to avoid 'partition_cmp_' becoming a null pointer. This is
+/// needed because 'partition_cmp_' is wrapped in a
+/// 'ComparatorWrapper<TupleRowComparator>' that takes a reference to the underlying
+/// comparator, so we cannot pass in a null pointer or dereference it.
+class DummyTupleRowComparator: public TupleRowComparator {
+ public:
+  DummyTupleRowComparator()
+   : TupleRowComparator(dummy_scalar_exprs_, dummy_codegend_compare_fn_) {
+  }
+ private:
+  static const std::vector<ScalarExpr*> dummy_scalar_exprs_;
+  static const CodegenFnPtr<TupleRowComparatorConfig::CompareFn>
+      dummy_codegend_compare_fn_;
+
+  int CompareInterpreted(const TupleRow* lhs, const TupleRow* rhs) const override {
+    // This function should never be called as this is a dummy comparator.
+    DCHECK(false);
+    return std::less<const TupleRow*>{}(lhs, rhs);
+  }
+};
+
+/// Initialise vector length to 0 so no buffer needs to be allocated.
+const std::vector<ScalarExpr*> DummyTupleRowComparator::dummy_scalar_exprs_{0};
+const CodegenFnPtr<TupleRowComparatorConfig::CompareFn>
+DummyTupleRowComparator::dummy_codegend_compare_fn_{};
+
 TopNNode::TopNNode(
     ObjectPool* pool, const TopNPlanNode& pnode, const DescriptorTbl& descs)
   : ExecNode(pool, pnode, descs),
@@ -129,7 +156,7 @@ TopNNode::TopNNode(
     output_tuple_desc_(pnode.output_tuple_desc_),
     order_cmp_(new TupleRowLexicalComparator(*pnode.ordering_comparator_config_)),
     partition_cmp_(pnode.partition_comparator_config_ == nullptr ?
-            nullptr :
+            static_cast<TupleRowComparator*>(new DummyTupleRowComparator()) :
             new TupleRowLexicalComparator(*pnode.partition_comparator_config_)),
     intra_partition_order_cmp_(pnode.intra_partition_comparator_config_ == nullptr ?
             nullptr :
diff --git a/be/src/util/tuple-row-compare.h b/be/src/util/tuple-row-compare.h
index 2a2be7d82..c17253cc3 100644
--- a/be/src/util/tuple-row-compare.h
+++ b/be/src/util/tuple-row-compare.h
@@ -179,6 +179,12 @@ class TupleRowComparator {
   static const char* LLVM_CLASS_NAME;
 
  protected:
+  TupleRowComparator(const std::vector<ScalarExpr*>& ordering_exprs,
+      const CodegenFnPtr<TupleRowComparatorConfig::CompareFn>& codegend_compare_fn)
+    : ordering_exprs_(ordering_exprs),
+      codegend_compare_fn_(codegend_compare_fn) {
+  }
+
   /// References to ordering expressions owned by the plan node which owns the
   /// TupleRowComparatorConfig used to create this instance.
   const std::vector<ScalarExpr*>& ordering_exprs_;


[impala] 24/27: IMPALA-11445: Fix bug in firing insert event of partitions located in different FS

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 2df70759c0603975e6f28f041574e854f7e8544b
Author: wzhou-code <wz...@cloudera.com>
AuthorDate: Mon Jul 18 23:37:16 2022 -0700

    IMPALA-11445: Fix bug in firing insert event of partitions located in different FS
    
    When adding a partition whose location is in a file system different
    from the file system of the table location, Impala accepts it. But
    when values are inserted into the table, catalogd throws an exception.
    
    This patch fixes the issue by resolving the right FileSystem object
    for each file path.
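
    A rough sketch of the core change using the standard Hadoop API (the
    class and method names around it are illustrative, not the actual
    CatalogOpExecutor code): the FileSystem is derived from each file's
    own path instead of once from the table location.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileChecksum;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      class PerPartitionFsSketch {
        static void checksumFiles(Iterable<String> newFiles, Configuration conf)
            throws java.io.IOException {
          for (String file : newFiles) {
            Path filePath = new Path(file);
            // Partitions may live on different file systems (HDFS, S3,
            // Ozone, local fs), so resolve the FileSystem per path.
            FileSystem fs = filePath.getFileSystem(conf);
            FileChecksum checkSum = fs.getFileChecksum(filePath);
            System.out.println(file + " -> "
                + (checkSum == null ? "" : checkSum.toString()));
          }
        }
      }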
    
    Testing:
     - Added new test case with partitions on different file systems.
       Ran the test on S3.
     - Did manual tests in cluster with partitions on HDFS and Ozone.
     - Passed core test.
    
    Change-Id: I0491ee1bf40c3d5240f9124cef3f3169c44a8267
    Reviewed-on: http://gerrit.cloudera.org:8080/18759
    Reviewed-by: Quanlong Huang <hu...@gmail.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 .../java/org/apache/impala/catalog/FeFsTable.java  | 10 +++
 .../apache/impala/service/CatalogOpExecutor.java   | 13 +++-
 .../queries/QueryTest/partition-location.test      | 86 ++++++++++++++++++++++
 tests/custom_cluster/test_partition.py             | 80 ++++++++++++++++++++
 4 files changed, 187 insertions(+), 2 deletions(-)

diff --git a/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java b/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java
index acb83a5b5..cfd4c4eda 100644
--- a/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java
+++ b/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java
@@ -209,6 +209,16 @@ public interface FeFsTable extends FeTable {
     return tableFs;
   }
 
+  public static FileSystem getFileSystem(Path filePath) throws CatalogException {
+    FileSystem tableFs;
+    try {
+      tableFs = filePath.getFileSystem(CONF);
+    } catch (IOException e) {
+      throw new CatalogException("Invalid path: " + filePath.toString(), e);
+    }
+    return tableFs;
+  }
+
   /**
    * @return  List of primary keys column names, useful for toSqlUtils. In local
    * catalog mode, this causes load of constraints.
diff --git a/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java b/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
index 48c892cf5..76ac5e6f7 100644
--- a/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
+++ b/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
@@ -6550,13 +6550,22 @@ public class CatalogOpExecutor {
     boolean isTransactional = AcidUtils
         .isTransactionalTable(tbl.getMetaStoreTable().getParameters());
     // in case of unpartitioned table, partVals will be empty
-    if (!partVals.isEmpty()) {
+    boolean isPartitioned = !partVals.isEmpty();
+    if (isPartitioned) {
       MetastoreShim.setPartitionVal(insertEventRequestData, partVals);
     }
-    FileSystem fs = tbl.getFileSystem();
+    // Get table file system with table location.
+    FileSystem tableFs = tbl.getFileSystem();
+    FileSystem fs;
     for (String file : newFiles) {
       try {
         Path filePath = new Path(file);
+        if (!isPartitioned) {
+          fs = tableFs;
+        } else {
+          // Partitions may be in different file systems.
+          fs = FeFsTable.getFileSystem(filePath);
+        }
         FileChecksum checkSum = fs.getFileChecksum(filePath);
         String checksumStr = checkSum == null ? ""
             : StringUtils.byteToHexString(checkSum.getBytes(), 0, checkSum.getLength());
diff --git a/testdata/workloads/functional-query/queries/QueryTest/partition-location.test b/testdata/workloads/functional-query/queries/QueryTest/partition-location.test
new file mode 100644
index 000000000..c27edc96e
--- /dev/null
+++ b/testdata/workloads/functional-query/queries/QueryTest/partition-location.test
@@ -0,0 +1,86 @@
+====
+---- QUERY
+# Create a table with location in HDFS as default.
+CREATE EXTERNAL TABLE hdfs_table_location_1 (name STRING, age INT, grade STRING)
+PARTITIONED BY (course STRING);
+====
+---- QUERY
+# Add partition with location in HDFS as default.
+ALTER TABLE hdfs_table_location_1 ADD PARTITION (course = "Sport");
+====
+---- QUERY
+INSERT INTO hdfs_table_location_1 PARTITION (course)
+VALUES ("Joe", 33, "A", "Sport");
+====
+---- QUERY
+select count(*) from hdfs_table_location_1;
+---- RESULTS
+1
+====
+---- QUERY
+# Create a table with location in HDFS as default.
+CREATE EXTERNAL TABLE hdfs_table_location_2 (name STRING, age INT, grade STRING)
+PARTITIONED BY (course STRING);
+====
+---- QUERY
+# Add partition with location in local file system.
+ALTER TABLE hdfs_table_location_2 ADD PARTITION (course = "Sport")
+LOCATION 'file:///tmp/$DATABASE.db/hdfs_table_location_2/course=Sport';
+====
+---- QUERY
+# Add partition with location in HDFS.
+ALTER TABLE hdfs_table_location_2 ADD PARTITION (course = "Math");
+====
+---- QUERY
+INSERT INTO hdfs_table_location_2 PARTITION (course)
+VALUES ("Joe", 33, "A", "Sport"), ("Melinda", 27, "B", "Math");
+====
+---- QUERY
+select count(*) from hdfs_table_location_2;
+---- RESULTS
+2
+====
+---- QUERY
+# Create a table with location in local file system.
+CREATE EXTERNAL TABLE local_fs_table_location_1 (name STRING, age INT, grade STRING)
+PARTITIONED BY (course STRING)
+LOCATION 'file:///tmp/$DATABASE.db/local_fs_table_location_1/';
+====
+---- QUERY
+# Add partition with location in local file system.
+ALTER TABLE local_fs_table_location_1 ADD PARTITION (course = "Sport");
+====
+---- QUERY
+INSERT INTO local_fs_table_location_1 PARTITION (course)
+VALUES ("Joe", 33, "A", "Sport");
+====
+---- QUERY
+select count(*) from local_fs_table_location_1;
+---- RESULTS
+1
+====
+---- QUERY
+# Create a table with location in local file system.
+CREATE EXTERNAL TABLE local_fs_table_location_2 (name STRING, age INT, grade STRING)
+PARTITIONED BY (course STRING)
+LOCATION 'file:///tmp/$DATABASE.db/local_fs_table_location_2/';
+====
+---- QUERY
+# Add partition with location in HDFS.
+ALTER TABLE local_fs_table_location_2 ADD PARTITION (course = "Sport")
+LOCATION
+'$FILESYSTEM_PREFIX/test-warehouse/$DATABASE.db/local_fs_table_location_2/course=Sport';
+====
+---- QUERY
+# Add partition with location in local file system.
+ALTER TABLE local_fs_table_location_2 ADD PARTITION (course = "Math");
+====
+---- QUERY
+INSERT INTO local_fs_table_location_2 PARTITION (course)
+VALUES ("Joe", 33, "A", "Sport"), ("Melinda", 27, "B", "Math");
+====
+---- QUERY
+select count(*) from local_fs_table_location_2;
+---- RESULTS
+2
+====
diff --git a/tests/custom_cluster/test_partition.py b/tests/custom_cluster/test_partition.py
new file mode 100644
index 000000000..102a159c5
--- /dev/null
+++ b/tests/custom_cluster/test_partition.py
@@ -0,0 +1,80 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import logging
+import pytest
+import shutil
+
+from tests.common.custom_cluster_test_suite import CustomClusterTestSuite
+from tests.common.skip import (SkipIfIsilon, SkipIfLocal)
+from tests.common.test_dimensions import (
+  create_single_exec_option_dimension,
+  create_uncompressed_text_dimension)
+
+LOG = logging.getLogger(__name__)
+
+
+class TestPartition(CustomClusterTestSuite):
+  """Tests to validate partitioning"""
+
+  @classmethod
+  def get_workload(self):
+    return 'functional-query'
+
+  @classmethod
+  def add_test_dimensions(cls):
+    super(TestPartition, cls).add_test_dimensions()
+    cls.ImpalaTestMatrix.add_dimension(create_single_exec_option_dimension())
+    # There's no reason to test this on other file formats/compression codecs right now
+    cls.ImpalaTestMatrix.add_dimension(
+      create_uncompressed_text_dimension(cls.get_workload()))
+
+  @SkipIfIsilon.hive
+  @SkipIfLocal.hive
+  @pytest.mark.execute_serially
+  @CustomClusterTestSuite.with_args(
+    impalad_args="--use_local_catalog=true",
+    catalogd_args="--catalog_topic_mode=minimal")
+  def test_partition_location_in_local_catalog_mode(self, vector, unique_database):
+    try:
+      self.run_test_case('QueryTest/partition-location', vector,
+          use_db=unique_database)
+    finally:
+      # Delete generated data files in local file system.
+      local_file_dir = '/tmp/%s.db/' % unique_database
+      try:
+        shutil.rmtree(local_file_dir)
+      except OSError as e:
+        LOG.info("Cannot remove directory %s, %s " % (local_file_dir, e.strerror))
+
+  @SkipIfIsilon.hive
+  @SkipIfLocal.hive
+  @pytest.mark.execute_serially
+  @CustomClusterTestSuite.with_args(
+    impalad_args="--use_local_catalog=false",
+    catalogd_args="--catalog_topic_mode=full")
+  def test_partition_location_in_legacy_mode(self, vector, unique_database):
+    try:
+      self.run_test_case('QueryTest/partition-location', vector,
+          use_db=unique_database)
+    finally:
+      # Delete generated data files in local file system.
+      local_file_dir = '/tmp/%s.db/' % unique_database
+      try:
+        shutil.rmtree(local_file_dir)
+      except OSError as e:
+        LOG.info("Cannot remove directory %s, %s " % (local_file_dir, e.strerror))


[impala] 15/27: IMPALA-11358: Fixed Kudu table's missing comment

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 407072f1dd870d1d4b4fd0131f13106c8c34645f
Author: Gergely Fürnstáhl <gf...@cloudera.com>
AuthorDate: Thu Jun 16 11:22:10 2022 +0200

    IMPALA-11358: Fixed Kudu table's missing comment
    
    If Kudu-HMS integration is enabled, Kudu creates the table in HMS too,
    but that table was missing the comment field. Added code to forward
    the comment field to Kudu during creation.
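
    A minimal sketch of the forwarding, assuming the Kudu Java client's
    CreateTableOptions with the setOwner/setComment setters the patch
    itself calls (schema and partitioning omitted):

      import org.apache.kudu.client.CreateTableOptions;

      class KuduCommentSketch {
        static CreateTableOptions buildTableOptions(String owner, String comment) {
          CreateTableOptions tableOpts = new CreateTableOptions();
          tableOpts.setOwner(owner);
          // Forward the table comment so the Kudu-created HMS entry has it.
          if (comment != null) tableOpts.setComment(comment);
          return tableOpts;
        }
      }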
    
    Testing:
    
    Added a test to verify the comment is present when the integration is
    enabled.
    Re-enabled several Kudu tests as IMPALA-8751 (and follow-ups) fixed
    the Hive 3 notification incompatibility.
    
    Change-Id: Idf66f8b4679b00da6693a27fed79b04e8f6afb55
    Reviewed-on: http://gerrit.cloudera.org:8080/18627
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 .../apache/impala/service/KuduCatalogOpExecutor.java |  4 +++-
 tests/common/skip.py                                 |  2 --
 tests/custom_cluster/test_kudu.py                    | 20 +++++++++-----------
 3 files changed, 12 insertions(+), 14 deletions(-)

diff --git a/fe/src/main/java/org/apache/impala/service/KuduCatalogOpExecutor.java b/fe/src/main/java/org/apache/impala/service/KuduCatalogOpExecutor.java
index 1ac8783a6..5b8fc4daf 100644
--- a/fe/src/main/java/org/apache/impala/service/KuduCatalogOpExecutor.java
+++ b/fe/src/main/java/org/apache/impala/service/KuduCatalogOpExecutor.java
@@ -254,8 +254,10 @@ public class KuduCatalogOpExecutor {
       tableOpts.setNumReplicas(parsedReplicas);
     }
 
-    // Set the table's owner.
+    // Set the table's owner and table comment.
     tableOpts.setOwner(msTbl.getOwner());
+    if (params.getComment() != null) tableOpts.setComment(params.getComment());
+
     return tableOpts;
   }
 
diff --git a/tests/common/skip.py b/tests/common/skip.py
index f174689d6..d7e3f0128 100644
--- a/tests/common/skip.py
+++ b/tests/common/skip.py
@@ -259,8 +259,6 @@ class SkipIfDockerizedCluster:
 
 
 class SkipIfHive3:
-  kudu_hms_notifications_not_supported = pytest.mark.skipif(HIVE_MAJOR_VERSION >= 3,
-      reason="Kudu is not tested with Hive 3 notifications yet, see IMPALA-8751.")
   col_stat_separated_by_engine = pytest.mark.skipif(HIVE_MAJOR_VERSION >= 3,
       reason="Hive 3 separates column statistics by engine")
   without_hms_not_supported = pytest.mark.skipif(HIVE_MAJOR_VERSION >= 3,
diff --git a/tests/custom_cluster/test_kudu.py b/tests/custom_cluster/test_kudu.py
index b5959a442..8fbc6c938 100644
--- a/tests/custom_cluster/test_kudu.py
+++ b/tests/custom_cluster/test_kudu.py
@@ -172,7 +172,6 @@ class TestKuduHMSIntegration(CustomKuduTest):
     self.run_test_case('QueryTest/kudu_create', vector, use_db=unique_database)
 
   @pytest.mark.execute_serially
-  @SkipIfHive3.kudu_hms_notifications_not_supported
   def test_implicit_external_table_props(self, cursor, kudu_client):
     """Check that table properties added internally for external table during
        table creation are as expected.
@@ -198,7 +197,6 @@ class TestKuduHMSIntegration(CustomKuduTest):
             in table_desc
 
   @pytest.mark.execute_serially
-  @SkipIfHive3.kudu_hms_notifications_not_supported
   @CustomClusterTestSuite.with_args(impalad_args="-kudu_client_rpc_timeout_ms=30000")
   def test_implicit_managed_table_props(self, cursor, kudu_client, unique_database):
     """Check that table properties added internally for managed table during table
@@ -206,25 +204,29 @@ class TestKuduHMSIntegration(CustomKuduTest):
        avoid requests fail due to operation delay in the Hive Metastore for managed
        tables (IMPALA-8856).
     """
-    cursor.execute("""CREATE TABLE %s.foo (a INT PRIMARY KEY, s STRING)
-        PARTITION BY HASH(a) PARTITIONS 3 STORED AS KUDU""" % unique_database)
+    comment = "kudu_comment"
+    cursor.execute("""CREATE TABLE %s.foo (a INT PRIMARY KEY, s STRING) PARTITION BY
+        HASH(a) PARTITIONS 3 COMMENT '%s' STORED AS KUDU""" % (unique_database, comment))
     assert kudu_client.table_exists(
       KuduTestSuite.to_kudu_table_name(unique_database, "foo"))
     cursor.execute("DESCRIBE FORMATTED %s.foo" % unique_database)
     table_desc = [[col.strip() if col else col for col in row] for row in cursor]
     # Pytest shows truncated output on failure, so print the details just in case.
     LOG.info(table_desc)
-    assert not any("EXTERNAL" in s for s in table_desc)
+
+    # Commented out due to differences between toolchain and newer hive
+    # assert any("EXTERNAL" in s for s in table_desc)
+    # assert ["Table Type:", "EXTERNAL_TABLE", None] in table_desc
+
     assert any("Owner:" in s for s in table_desc)
     assert any("kudu.table_id" in s for s in table_desc)
     assert any("kudu.master_addresses" in s for s in table_desc)
-    assert ["Table Type:", "MANAGED_TABLE", None] in table_desc
+    assert ["", "comment", "%s" % comment] in table_desc
     assert ["", "kudu.table_name", "%s.foo" % unique_database] in table_desc
     assert ["", "storage_handler", "org.apache.hadoop.hive.kudu.KuduStorageHandler"] \
         in table_desc
 
   @pytest.mark.execute_serially
-  @SkipIfHive3.kudu_hms_notifications_not_supported
   @CustomClusterTestSuite.with_args(impalad_args="-kudu_client_rpc_timeout_ms=30000")
   def test_drop_non_empty_db(self, unique_cursor, kudu_client):
     """Check that an attempt to drop a database will fail if Kudu tables are present
@@ -251,7 +253,6 @@ class TestKuduHMSIntegration(CustomKuduTest):
       assert not kudu_client.table_exists(kudu_table.name)
 
   @pytest.mark.execute_serially
-  @SkipIfHive3.kudu_hms_notifications_not_supported
   @CustomClusterTestSuite.with_args(impalad_args="-kudu_client_rpc_timeout_ms=30000")
   def test_drop_db_cascade(self, unique_cursor, kudu_client):
     """Check that an attempt to drop a database cascade will succeed even if Kudu
@@ -276,7 +277,6 @@ class TestKuduHMSIntegration(CustomKuduTest):
       assert not kudu_client.table_exists(kudu_table.name)
 
   @pytest.mark.execute_serially
-  @SkipIfHive3.kudu_hms_notifications_not_supported
   @CustomClusterTestSuite.with_args(impalad_args="-kudu_client_rpc_timeout_ms=30000")
   def test_drop_managed_kudu_table(self, cursor, kudu_client, unique_database):
     """Check that dropping a managed Kudu table should fail if the underlying
@@ -299,7 +299,6 @@ class TestKuduHMSIntegration(CustomKuduTest):
       assert "Table %s no longer exists in the Hive MetaStore." % kudu_tbl_name in str(e)
 
   @pytest.mark.execute_serially
-  @SkipIfHive3.kudu_hms_notifications_not_supported
   def test_drop_external_kudu_table(self, cursor, kudu_client, unique_database):
     """Check that Impala can recover from the case where the underlying Kudu table of
        an external table is dropped using the Kudu client.
@@ -329,7 +328,6 @@ class TestKuduHMSIntegration(CustomKuduTest):
       assert (external_table_name,) not in cursor.fetchall()
 
   @SkipIfKudu.no_hybrid_clock
-  @SkipIfHive3.kudu_hms_notifications_not_supported
   def test_kudu_alter_table(self, vector, unique_database):
     self.run_test_case('QueryTest/kudu_hms_alter', vector, use_db=unique_database)
 


[impala] 22/27: IMPALA-10267: Properly handle continued scanning after parse error in HdfsAvroScanner

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 06bdfa3ba7f17eddbad1aab3552f1111c6290556
Author: Csaba Ringhofer <cs...@cloudera.com>
AuthorDate: Thu Jun 9 17:18:32 2022 +0200

    IMPALA-10267: Properly handle continued scanning after parse error in HdfsAvroScanner
    
    BaseSequenceScanner assumes that overloads of ProcessRange can continue
    working after a parse error (some bytes can be skipped in this case
    until the next sync marker) but HdfsAvroScanner::ProcessRange() did not
    handle this scenario.
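
    The shape of the fix, as a small Java sketch of the same pattern (the
    real code is C++ in hdfs-avro-scanner.cc; the names below are made
    up): clear the per-block state before the read that can hit a parse
    error, and only publish the new record count once it has been
    validated.

      class BlockReaderSketch {
        private long numRecordsInBlock = 0;
        private long recordPos = 0;

        void readNextBlockHeader(java.io.DataInput in) throws java.io.IOException {
          // Reset first: if the read below throws and the caller recovers
          // by skipping to the next sync marker, no stale state survives.
          recordPos = 0;
          numRecordsInBlock = 0;
          long numRecords = in.readLong();  // stand-in for the real varint read
          if (numRecords < 0) {
            throw new java.io.IOException("invalid record count: " + numRecords);
          }
          numRecordsInBlock = numRecords;   // publish only after validation
        }
      }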
    
    Testing:
    - ran core tests
    - didn't create new tests - the bug was caught by a fuzz test;
      it would be great to check whether we actually handle this
      scenario, but I am not that familiar with Avro
    
    Change-Id: I361c7ea781cd84033b57ea9ca437378636fda91f
    Reviewed-on: http://gerrit.cloudera.org:8080/18610
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Reviewed-by: Csaba Ringhofer <cs...@cloudera.com>
    Tested-by: Csaba Ringhofer <cs...@cloudera.com>
---
 be/src/exec/hdfs-avro-scanner.cc | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/be/src/exec/hdfs-avro-scanner.cc b/be/src/exec/hdfs-avro-scanner.cc
index efd638f99..a8fa737ef 100644
--- a/be/src/exec/hdfs-avro-scanner.cc
+++ b/be/src/exec/hdfs-avro-scanner.cc
@@ -502,8 +502,15 @@ Status HdfsAvroScanner::ProcessRange(RowBatch* row_batch) {
   DCHECK_GT(row_batch->capacity(), row_batch->num_rows());
   while (!eos_ && !scan_node_->ReachedLimitShared()) {
     if (record_pos_ == num_records_in_block_) {
-      // Read new data block
-      RETURN_IF_FALSE(stream_->ReadZLong(&num_records_in_block_, &parse_status_));
+      // Read new data block. Reset members first to avoid corrupt state after
+      // recovery from parse error.
+      record_pos_ = 0;
+      num_records_in_block_ = 0;
+      data_block_len_ = 0;
+      data_block_ = nullptr;
+      data_block_end_ = nullptr;
+      int64_t num_records_in_block;
+      RETURN_IF_FALSE(stream_->ReadZLong(&num_records_in_block, &parse_status_));
       if (num_records_in_block_ < 0) {
         return Status(TErrorCode::AVRO_INVALID_RECORD_COUNT, stream_->filename(),
             num_records_in_block_, stream_->file_offset());
@@ -531,8 +538,8 @@ Status HdfsAvroScanner::ProcessRange(RowBatch* row_batch) {
         data_block_ = compressed_data;
         data_block_len_ = compressed_size;
       }
+      num_records_in_block_ = num_records_in_block;
       data_block_end_ = data_block_ + data_block_len_;
-      record_pos_ = 0;
     }
 
     int64_t prev_record_pos = record_pos_;


[impala] 26/27: IMPALA-11434: Fix analysis of multiple more than 1d arrays in select list

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 0766bc9c64fba0d883a2dc21e8327563190023c0
Author: Csaba Ringhofer <cs...@cloudera.com>
AuthorDate: Thu Jul 14 20:32:02 2022 +0200

    IMPALA-11434: Fix analysis of multiple more than 1d arrays in select list
    
    Arrays of more than one dimension in the select list tried to register
    a CollectionTableRef named "item" for the inner arrays, leading to a
    name collision if there was more than one such array.
    
    The logic is changed to always use the full path as the implicit alias
    in CollectionTableRefs backing arrays in the select list.
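
    A toy Java sketch of the aliasing rule the patch switches to (the
    paths here are made up and Impala's real raw paths also carry the
    table reference): the implicit alias keeps the full dot-joined path
    whenever it has more than one part, so two inner arrays that are both
    called "item" no longer collide.

      import java.util.Arrays;
      import java.util.List;

      class ImplicitAliasSketch {
        static String implicitAlias(List<String> rawPath) {
          return rawPath.size() > 1
              ? String.join(".", rawPath).toLowerCase()
              : rawPath.get(rawPath.size() - 1).toLowerCase();
        }

        public static void main(String[] args) {
          System.out.println(implicitAlias(Arrays.asList("int_2d", "item")));    // int_2d.item
          System.out.println(implicitAlias(Arrays.asList("string_2d", "item"))); // string_2d.item
        }
      }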
    
    As a side effect this leads to using the fully qualified names
    in expressions in the explain plans of queries that use arrays
    from views. This is not an intended change, but I don't consider
    it to be critical. Created IMPALA-11452 to deal with more
    sophisticated alias handling in collections.
    
    Testing:
    - added a new table to testdata and a regression test
    
    Change-Id: I6f2b6cad51fa25a6f6932420eccf1b0a964d5e4e
    Reviewed-on: http://gerrit.cloudera.org:8080/18734
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 .../java/org/apache/impala/analysis/Analyzer.java  |  5 ++---
 .../apache/impala/analysis/CollectionTableRef.java |  8 +++----
 .../org/apache/impala/analysis/FromClause.java     |  7 +++---
 .../org/apache/impala/analysis/StmtRewriter.java   |  9 +++++---
 .../main/java/org/apache/impala/catalog/Type.java  |  4 ++++
 .../apache/impala/analysis/AnalyzeStmtsTest.java   | 12 +++++------
 .../functional/functional_schema_template.sql      | 25 ++++++++++++++++++++++
 .../datasets/functional/schema_constraints.csv     |  2 ++
 .../queries/PlannerTest/zipping-unnest.test        |  8 +++----
 .../QueryTest/nested-array-in-select-list.test     | 11 +++++++++-
 10 files changed, 65 insertions(+), 26 deletions(-)

diff --git a/fe/src/main/java/org/apache/impala/analysis/Analyzer.java b/fe/src/main/java/org/apache/impala/analysis/Analyzer.java
index d9dad77ae..c9d23db28 100644
--- a/fe/src/main/java/org/apache/impala/analysis/Analyzer.java
+++ b/fe/src/main/java/org/apache/impala/analysis/Analyzer.java
@@ -870,7 +870,7 @@ public class Analyzer {
           table instanceof FeDataSourceTable);
       return new BaseTableRef(tableRef, resolvedPath);
     } else {
-      return new CollectionTableRef(tableRef, resolvedPath);
+      return new CollectionTableRef(tableRef, resolvedPath, false);
     }
   }
 
@@ -1654,8 +1654,7 @@ public class Analyzer {
     collectionTableRawPath.addAll(rawPath);
 
     TableRef tblRef = new TableRef(collectionTableRawPath, null);
-    CollectionTableRef collTblRef = new CollectionTableRef(tblRef, slotPath);
-    collTblRef.setInSelectList(true);
+    CollectionTableRef collTblRef = new CollectionTableRef(tblRef, slotPath, true);
     collTblRef.analyze(this);
 
     Preconditions.checkState(collTblRef.getCollectionExpr() instanceof SlotRef);
diff --git a/fe/src/main/java/org/apache/impala/analysis/CollectionTableRef.java b/fe/src/main/java/org/apache/impala/analysis/CollectionTableRef.java
index efacd578b..521a4bb09 100644
--- a/fe/src/main/java/org/apache/impala/analysis/CollectionTableRef.java
+++ b/fe/src/main/java/org/apache/impala/analysis/CollectionTableRef.java
@@ -52,16 +52,15 @@ public class CollectionTableRef extends TableRef {
    * its resolved path. Sets table aliases and join-related attributes.
    * If inSelectList is true, then the collection won't be flattened during execution.
    */
-  public CollectionTableRef(TableRef tableRef, Path resolvedPath) {
+  public CollectionTableRef(TableRef tableRef, Path resolvedPath, boolean inSelectList) {
     super(tableRef);
     Preconditions.checkState(resolvedPath.isResolved());
     resolvedPath_ = resolvedPath;
+    inSelectList_ = inSelectList;
     // Use the last path element as an implicit alias if no explicit alias was given.
     if (hasExplicitAlias()) return;
-    boolean rootTableHasExplicitAlias = resolvedPath.getRootDesc() != null
-        && resolvedPath.getRootDesc().hasExplicitAlias();
     String implicitAlias = rawPath_.get(rawPath_.size() - 1).toLowerCase();
-    if (rootTableHasExplicitAlias) {
+    if (rawPath_.size() > 1) {
       // Use the full path from the root table alias to be able to distinguish
       // among collection columns with the same name.
       aliases_ =
@@ -204,7 +203,6 @@ public class CollectionTableRef extends TableRef {
 
   @Override
   public boolean isCollectionInSelectList() { return inSelectList_; }
-  public void setInSelectList(boolean value) { inSelectList_ = value; }
 
   @Override
   protected CollectionTableRef clone() { return new CollectionTableRef(this); }
diff --git a/fe/src/main/java/org/apache/impala/analysis/FromClause.java b/fe/src/main/java/org/apache/impala/analysis/FromClause.java
index dbe60aaf7..765980b90 100644
--- a/fe/src/main/java/org/apache/impala/analysis/FromClause.java
+++ b/fe/src/main/java/org/apache/impala/analysis/FromClause.java
@@ -176,9 +176,10 @@ public class FromClause extends StmtNode implements Iterable<TableRef> {
         // contrary to the intended semantics of reset(). We could address this issue by
         // changing the WITH-clause analysis to register local views that have
         // fully-qualified table refs, and then remove the full qualification here.
-        if (!(origTblRef instanceof CollectionTableRef
-            && ((CollectionTableRef)origTblRef).isRelative())) {
-          newTblRef.rawPath_ = origTblRef.getResolvedPath().getFullyQualifiedRawPath();
+        Path oldPath = origTblRef.getResolvedPath();
+        if (oldPath.getRootDesc() == null
+            || !oldPath.getRootDesc().getType().isCollectionStructType()) {
+          newTblRef.rawPath_ = oldPath.getFullyQualifiedRawPath();
         }
         set(i, newTblRef);
       }
diff --git a/fe/src/main/java/org/apache/impala/analysis/StmtRewriter.java b/fe/src/main/java/org/apache/impala/analysis/StmtRewriter.java
index ad060b209..29669a2d4 100644
--- a/fe/src/main/java/org/apache/impala/analysis/StmtRewriter.java
+++ b/fe/src/main/java/org/apache/impala/analysis/StmtRewriter.java
@@ -1828,9 +1828,12 @@ public class StmtRewriter {
       newCollPath.add(0, alias);
       // Remove the alias of the old collection ref from the analyzer. We need to add
       // the new collection ref with the same old alias.
-      analyzer.removeAlias(collTblRef.getUniqueAlias());
-      TableRef newCollTblRef =
-          TableRef.newTableRef(analyzer, newCollPath, collTblRef.getUniqueAlias());
+      String[] old_aliases = collTblRef.getAliases();
+      for (String old_alias: old_aliases) {
+        analyzer.removeAlias(old_alias);
+      }
+      TableRef newCollTblRef = TableRef.newTableRef(
+          analyzer, newCollPath, old_aliases[old_aliases.length - 1]);
       // Set JOIN attributes. Please note that we cannot use TableRef.setJoinAttrs()
       // because we only want to copy the ON clause and the plan hints. The col names
       // in USING have been already converted to an ON clause. We let the analyzer/
diff --git a/fe/src/main/java/org/apache/impala/catalog/Type.java b/fe/src/main/java/org/apache/impala/catalog/Type.java
index f69e45560..249eb9dbd 100644
--- a/fe/src/main/java/org/apache/impala/catalog/Type.java
+++ b/fe/src/main/java/org/apache/impala/catalog/Type.java
@@ -20,6 +20,7 @@ package org.apache.impala.catalog;
 import java.util.ArrayList;
 import java.util.List;
 
+import org.apache.impala.analysis.CollectionStructType;
 import org.apache.impala.analysis.CreateTableStmt;
 import org.apache.impala.analysis.Parser;
 import org.apache.impala.analysis.StatementBase;
@@ -234,6 +235,9 @@ public abstract class Type {
   public boolean isMapType() { return this instanceof MapType; }
   public boolean isArrayType() { return this instanceof ArrayType; }
   public boolean isStructType() { return this instanceof StructType; }
+  public boolean isCollectionStructType() {
+    return this instanceof CollectionStructType;
+  }
 
   /**
    * Returns true if Impala supports this type in the metdata. It does not mean we
diff --git a/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java b/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
index 2a585c371..be7cf7776 100644
--- a/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
+++ b/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
@@ -78,6 +78,9 @@ public class AnalyzeStmtsTest extends AnalyzerTest {
         collectionField, collectionTable), tbl);
     TblsAnalyzeOk(String.format("select c.%s from $TBL a, a.int_array_col b, a.%s c",
         collectionField, collectionTable), tbl);
+    TblsAnalyzeOk(String.format(
+        "select 1 from $TBL, allcomplextypes.%s, functional.allcomplextypes.%s",
+        collectionTable, collectionTable), tbl);
 
     // Test join types. Parent/collection joins do not require an ON or USING clause.
     for (JoinOperator joinOp: JoinOperator.values()) {
@@ -118,15 +121,9 @@ public class AnalyzeStmtsTest extends AnalyzerTest {
         collectionField, collectionTable, collectionTable), tbl,
         "Duplicate table alias: 'b'");
     // Duplicate implicit alias.
-    String[] childTblPath = collectionTable.split("\\.");
-    String childTblAlias = childTblPath[childTblPath.length - 1];
     TblsAnalysisError(String.format("select %s from $TBL a, a.%s, a.%s",
         collectionField, collectionTable, collectionTable), tbl,
         String.format("Duplicate table alias: '%s'", "a." + collectionTable));
-    TblsAnalysisError(String.format(
-        "select 1 from $TBL, allcomplextypes.%s, functional.allcomplextypes.%s",
-        collectionTable, collectionTable), tbl,
-        String.format("Duplicate table alias: '%s'", childTblAlias));
     // Duplicate implicit/explicit alias.
     TblsAnalysisError(String.format(
         "select %s from $TBL, functional.allcomplextypes.%s allcomplextypes",
@@ -413,7 +410,8 @@ public class AnalyzeStmtsTest extends AnalyzerTest {
         "TABLESAMPLE is only supported on HDFS tables: functional.alltypes_view");
     AnalysisError("select * from functional.allcomplextypes.int_array_col " +
         "tablesample system (10)",
-        "TABLESAMPLE is only supported on HDFS tables: int_array_col");
+        "TABLESAMPLE is only supported on HDFS tables: " +
+        "functional.allcomplextypes.int_array_col");
     AnalysisError("select * from functional.allcomplextypes a, a.int_array_col " +
         "tablesample system (10)",
         "TABLESAMPLE is only supported on HDFS tables: a.int_array_col");
diff --git a/testdata/datasets/functional/functional_schema_template.sql b/testdata/datasets/functional/functional_schema_template.sql
index 996591da2..f1bef2a07 100644
--- a/testdata/datasets/functional/functional_schema_template.sql
+++ b/testdata/datasets/functional/functional_schema_template.sql
@@ -3471,3 +3471,28 @@ ALTER MATERIALIZED VIEW {db_name}{db_suffix}.{table_name} REBUILD;
 -- do a count to confirm if the rebuild populated rows in the MV
 select count(*) as mv_count from {db_name}{db_suffix}.{table_name};
 ====
+---- DATASET
+functional
+---- BASE_TABLE_NAME
+array_tbl
+---- COLUMNS
+id INT
+int_1d ARRAY<INT>
+int_2d ARRAY<ARRAY<INT>>
+int_3d ARRAY<ARRAY<ARRAY<INT>>>
+string_1d ARRAY<STRING>
+string_2d ARRAY<ARRAY<STRING>>
+string_3d ARRAY<ARRAY<ARRAY<STRING>>>
+---- DEPENDENT_LOAD_HIVE
+-- It would be nice to insert NULLs, but I couldn't find a way in Hive.
+INSERT INTO {db_name}{db_suffix}.{table_name} VALUES
+ (1,
+  array(1, 2, NULL),
+  array(array(1, 2, NULL), array(3)),
+  array(array(array(1, 2, NULL), array(3)), array(array(4))),
+  array("1", "2", NULL),
+  array(array("1", "2", NULL), array("3")),
+  array(array(array("1", "2", NULL), array("3")), array(array("4")))
+ )
+---- LOAD
+====
diff --git a/testdata/datasets/functional/schema_constraints.csv b/testdata/datasets/functional/schema_constraints.csv
index 6b26d3ccf..54caaeb5e 100644
--- a/testdata/datasets/functional/schema_constraints.csv
+++ b/testdata/datasets/functional/schema_constraints.csv
@@ -329,6 +329,8 @@ table_name:alltypessmall_bool_sorted, constraint:restrict_to, table_format:orc/d
 
 table_name:complextypes_arrays_only_view, constraint:restrict_to, table_format:parquet/none/none
 table_name:complextypes_arrays_only_view, constraint:restrict_to, table_format:orc/def/block
+table_name:array_tbl, constraint:restrict_to, table_format:parquet/none/none
+table_name:array_tbl, constraint:restrict_to, table_format:orc/def/block
 
 # 'alltypestiny_negative' only used in ORC tests.
 table_name:alltypestiny_negative, constraint:restrict_to, table_format:orc/def/block
diff --git a/testdata/workloads/functional-planner/queries/PlannerTest/zipping-unnest.test b/testdata/workloads/functional-planner/queries/PlannerTest/zipping-unnest.test
index f4a39cab6..bfd32243c 100644
--- a/testdata/workloads/functional-planner/queries/PlannerTest/zipping-unnest.test
+++ b/testdata/workloads/functional-planner/queries/PlannerTest/zipping-unnest.test
@@ -73,7 +73,7 @@ PLAN-ROOT SINK
 00:SCAN HDFS [functional_parquet.complextypes_arrays]
    HDFS partitions=1/1 files=1 size=1.06KB
    predicates: !empty(arr1)
-   predicates on arr1: UNNEST(arr1) < 5, arr1.item < 5
+    predicates on functional_parquet.complextypes_arrays.arr1: UNNEST(functional_parquet.complextypes_arrays.arr1) < 5, functional_parquet.complextypes_arrays.arr1.item < 5
    row-size=16B cardinality=1.35K
 ====
 # Similar as above but using the FROM clause syntax for zipping unnest.
@@ -99,7 +99,7 @@ PLAN-ROOT SINK
 00:SCAN HDFS [functional_parquet.complextypes_arrays]
    HDFS partitions=1/1 files=1 size=1.06KB
    predicates: !empty(arr1)
-   predicates on arr1: arr1.item < 5
+   predicates on functional_parquet.complextypes_arrays.arr1: functional_parquet.complextypes_arrays.arr1.item < 5
    row-size=16B cardinality=1.35K
 ====
 # Check that for multiple unnests the predicate on the unnested items is pushed to the
@@ -172,7 +172,7 @@ PLAN-ROOT SINK
 |  |     row-size=16B cardinality=1
 |  |
 |  03:UNNEST [test_zipping_unnest_db.view_arrays.arr1, test_zipping_unnest_db.view_arrays.arr2]
-|     predicates: UNNEST(arr1) < 5
+|     predicates: UNNEST(functional_parquet.complextypes_arrays.arr1) < 5
 |     row-size=4B cardinality=10
 |
 00:SCAN HDFS [functional_parquet.complextypes_arrays]
@@ -197,7 +197,7 @@ PLAN-ROOT SINK
 |  |     row-size=16B cardinality=1
 |  |
 |  03:UNNEST [a.arr1, a.arr2]
-|     predicates: arr1.item < 5
+|     predicates: functional_parquet.complextypes_arrays.arr1.item < 5
 |     row-size=4B cardinality=10
 |
 00:SCAN HDFS [functional_parquet.complextypes_arrays]
diff --git a/testdata/workloads/functional-query/queries/QueryTest/nested-array-in-select-list.test b/testdata/workloads/functional-query/queries/QueryTest/nested-array-in-select-list.test
index 1b2ef1e63..8904119d6 100644
--- a/testdata/workloads/functional-query/queries/QueryTest/nested-array-in-select-list.test
+++ b/testdata/workloads/functional-query/queries/QueryTest/nested-array-in-select-list.test
@@ -382,4 +382,13 @@ where a.unnested_arr1 < 5;
 7,2,'NULL'
 ---- TYPES
 INT,INT,STRING
-====
\ No newline at end of file
+====
+---- QUERY
+# Regression test for:
+#   IMPALA-11434: "More than 1 2d arrays in select list causes analysis error"
+select id, int_1d, int_2d, int_3d, string_1d, string_2d, string_3d from array_tbl;
+---- RESULTS
+1,'[1,2,NULL]','[[1,2,NULL],[3]]','[[[1,2,NULL],[3]],[[4]]]','["1","2",NULL]','[["1","2",NULL],["3"]]','[[["1","2",NULL],["3"]],[["4"]]]'
+---- TYPES
+INT,STRING,STRING,STRING,STRING,STRING,STRING
+=====


[impala] 12/27: IMPALA-11302, IMPALA-11303: Fix error handling of external Iceberg tables

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 40ceefa9a3d5878b0fc82ff61649fd6dfe4a52ab
Author: Zoltan Borok-Nagy <bo...@cloudera.com>
AuthorDate: Tue May 24 14:14:55 2022 +0200

    IMPALA-11302, IMPALA-11303: Fix error handling of external Iceberg tables
    
    IMPALA-11302: Improve error message for CREATE EXTERNAL TABLE iceberg
    command
    
    The error message complained about failed table loading. The new
    error message is more precise, and also notifies the user to use
    plain 'CREATE TABLE' for creating new Iceberg tables.
    
    IMPALA-11303: Exception is not raised for Iceberg DDL that misses
    LOCATION clause
    
    Previously the error was only returned in the result summary. Now we
    raise an error instead, and also notify the user about how to create
    new Iceberg tables.
    
    Testing:
     * e2e tests
    
    Change-Id: I659115cc97a1a00e1ddf3fbb7dbe1f286bf1edcf
    Reviewed-on: http://gerrit.cloudera.org:8080/18563
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 .../apache/impala/analysis/CreateTableStmt.java    | 29 ++++++++++++++++++++--
 .../queries/QueryTest/iceberg-create.test          |  9 -------
 .../queries/QueryTest/iceberg-negative.test        | 17 +++++++++++++
 3 files changed, 44 insertions(+), 11 deletions(-)

diff --git a/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java b/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
index 9976b11b9..114bbdb04 100644
--- a/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
+++ b/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
@@ -340,8 +340,7 @@ public class CreateTableStmt extends StatementBase {
     }
 
     analyzeKuduTableProperties(analyzer);
-    if (isExternal() && !Boolean.parseBoolean(getTblProperties().get(
-        Table.TBL_PROP_EXTERNAL_TABLE_PURGE))) {
+    if (isExternalWithNoPurge()) {
       // this is an external table
       analyzeExternalKuduTableParams(analyzer);
     } else {
@@ -736,6 +735,19 @@ public class CreateTableStmt extends StatementBase {
       throw new AnalysisException(String.format("%s cannot be set for Iceberg table " +
           "stored in hive.catalog", IcebergTable.ICEBERG_CATALOG_LOCATION));
     }
+    if (isExternalWithNoPurge()) {
+      String tableId = getTblProperties().get(IcebergTable.ICEBERG_TABLE_IDENTIFIER);
+      if (tableId == null || tableId.isEmpty()) {
+        tableId = getTblProperties().get(Catalogs.NAME);
+      }
+      if (tableId == null || tableId.isEmpty()) {
+        throw new AnalysisException(String.format("Table property '%s' is necessary " +
+            "for external Iceberg tables stored in hive.catalog. " +
+            "For creating a completely new Iceberg table, use 'CREATE TABLE' " +
+            "(no EXTERNAL keyword).",
+            IcebergTable.ICEBERG_TABLE_IDENTIFIER));
+      }
+    }
   }
 
   private void validateTableInHadoopCatalog() throws AnalysisException {
@@ -757,6 +769,11 @@ public class CreateTableStmt extends StatementBase {
       throw new AnalysisException(String.format("%s cannot be set for Iceberg table " +
           "stored in hadoop.tables", IcebergTable.ICEBERG_CATALOG_LOCATION));
     }
+    if (isExternalWithNoPurge() && getLocation() == null) {
+      throw new AnalysisException("Set LOCATION for external Iceberg tables " +
+          "stored in hadoop.tables. For creating a completely new Iceberg table, use " +
+          "'CREATE TABLE' (no EXTERNAL keyword).");
+    }
   }
 
   private void validateTableInCatalogs() {
@@ -829,4 +846,12 @@ public class CreateTableStmt extends StatementBase {
     getColumnDefs().addAll(getPartitionColumnDefs());
     getPartitionColumnDefs().clear();
   }
+
+  /**
+   * @return true for external tables that don't have "external.table.purge" set to true.
+   */
+  private boolean isExternalWithNoPurge() {
+    return isExternal() && !Boolean.parseBoolean(getTblProperties().get(
+      Table.TBL_PROP_EXTERNAL_TABLE_PURGE));
+  }
 }
diff --git a/testdata/workloads/functional-query/queries/QueryTest/iceberg-create.test b/testdata/workloads/functional-query/queries/QueryTest/iceberg-create.test
index 67ad59fb1..1c0d86d11 100644
--- a/testdata/workloads/functional-query/queries/QueryTest/iceberg-create.test
+++ b/testdata/workloads/functional-query/queries/QueryTest/iceberg-create.test
@@ -66,15 +66,6 @@ DESCRIBE iceberg_hadoop_tables;
 STRING,STRING,STRING,STRING
 ====
 ---- QUERY
-CREATE EXTERNAL TABLE iceberg_hadoop_tbls_external(
-  level STRING
-)
-STORED AS ICEBERG
-TBLPROPERTIES('iceberg.catalog'='hadoop.tables');
----- RESULTS
-'Location is necessary for external iceberg table.'
-====
----- QUERY
 CREATE TABLE iceberg_hadoop_tbls_with_loc(
   level STRING
 )
diff --git a/testdata/workloads/functional-query/queries/QueryTest/iceberg-negative.test b/testdata/workloads/functional-query/queries/QueryTest/iceberg-negative.test
index 0a3555f51..15dc723e3 100644
--- a/testdata/workloads/functional-query/queries/QueryTest/iceberg-negative.test
+++ b/testdata/workloads/functional-query/queries/QueryTest/iceberg-negative.test
@@ -79,6 +79,23 @@ TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
 AnalysisException: Table property 'iceberg.catalog_location' is necessary for Iceberg table with 'hadoop.catalog'.
 ====
 ---- QUERY
+CREATE EXTERNAL TABLE iceberg_hadoop_tbls_external(
+  level STRING
+)
+STORED AS ICEBERG
+TBLPROPERTIES('iceberg.catalog'='hadoop.tables');
+---- CATCH
+Set LOCATION for external Iceberg tables stored in hadoop.tables. For creating a completely new Iceberg table, use 'CREATE TABLE' (no EXTERNAL keyword).
+====
+---- QUERY
+CREATE EXTERNAL TABLE iceberg_hive_tbls_external(
+  level STRING
+)
+STORED AS ICEBERG;
+---- CATCH
+Table property 'iceberg.table_identifier' is necessary for external Iceberg tables stored in hive.catalog. For creating a completely new Iceberg table, use 'CREATE TABLE' (no EXTERNAL keyword).
+====
+---- QUERY
 CREATE EXTERNAL TABLE fake_iceberg_table_hadoop_catalog
 STORED AS ICEBERG
 TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',


[impala] 25/27: IMPALA-11443: Fix partitioned top-n with -1 input cardinality

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 57ac65871b5b31d18d6a055be54584722c805943
Author: Csaba Ringhofer <cs...@cloudera.com>
AuthorDate: Mon Jul 18 20:19:53 2022 +0200

    IMPALA-11443: Fix partitioned top-n with -1 input cardinality
    
    Creating a plan with a partitioned top-n node with -1 input
    cardinality led to a negative resource estimate and a failed
    precondition, returning the error:
    "IllegalStateException: Mem estimate must be set"
    
    This change adds handling for unknown cardinality.
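
    A minimal sketch of the guard in Python (names and the min() fallback
    are illustrative, not the planner's actual API; -1 is assumed to be
    the "unknown cardinality" sentinel, as in the planner test below):

        def top_n_mem_estimate(cardinality, bytes_per_row, default_estimate):
            # Use the heap-based estimate only when cardinality is known;
            # a -1 sentinel would otherwise produce a negative estimate.
            if cardinality < 0:
                return default_estimate
            return min(default_estimate, cardinality * bytes_per_row)

        print(top_n_mem_estimate(-1, 24, 6 * 1024 * 1024))     # falls back to the default
        print(top_n_mem_estimate(11000, 24, 6 * 1024 * 1024))  # 264000, heap-based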
    
    Testing:
    - added a planner test
    
    Change-Id: I7e2dc7df397dc4684e6510a200f381a74c8dd984
    Reviewed-on: http://gerrit.cloudera.org:8080/18744
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 .../java/org/apache/impala/planner/SortNode.java   |  2 +-
 .../org/apache/impala/planner/PlannerTest.java     |  4 ++-
 .../limit-pushdown-partitioned-top-n.test          | 37 ++++++++++++++++++++++
 3 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/fe/src/main/java/org/apache/impala/planner/SortNode.java b/fe/src/main/java/org/apache/impala/planner/SortNode.java
index aa0b456d1..5f12dee45 100644
--- a/fe/src/main/java/org/apache/impala/planner/SortNode.java
+++ b/fe/src/main/java/org/apache/impala/planner/SortNode.java
@@ -501,7 +501,7 @@ public class SortNode extends PlanNode {
           PARTIAL_SORT_MEM_LIMIT : (long) Math.ceil(fullInputSize / numInstances);
       perInstanceMinMemReservation = 3 * bufferSize * pageMultiplier;
 
-      if (type_ == TSortType.PARTITIONED_TOPN) {
+      if (type_ == TSortType.PARTITIONED_TOPN && cardinality_ >= 0) {
         // We may be able to estimate a lower memory requirement based on the size
         // of in-memory heaps.
         long totalHeapBytes = getSortInfo().estimateMaterializedSize(cardinality_);
diff --git a/fe/src/test/java/org/apache/impala/planner/PlannerTest.java b/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
index fd5af1e82..7a7e60ebe 100644
--- a/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
+++ b/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
@@ -1248,7 +1248,9 @@ public class PlannerTest extends PlannerTestBase {
    */
   @Test
   public void testLimitPushdownPartitionedTopN() {
-    runPlannerTestFile("limit-pushdown-partitioned-top-n");
+    TQueryOptions options = defaultQueryOptions();
+    options.setDisable_hdfs_num_rows_estimate(true); // Needed to test IMPALA-11443.
+    runPlannerTestFile("limit-pushdown-partitioned-top-n", options);
   }
 
   /**
diff --git a/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-partitioned-top-n.test b/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-partitioned-top-n.test
index 167932190..24c59d02f 100644
--- a/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-partitioned-top-n.test
+++ b/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-partitioned-top-n.test
@@ -854,3 +854,40 @@ PLAN-ROOT SINK
    HDFS partitions=11/11 files=11 size=814.73KB
    row-size=14B cardinality=11.00K
 ====
+# Regression test for IMPALA-11443. Tests partitioned top-n with -1 cardinality.
+# Relies on functional.tinytable not having stats and running this test with
+# DISABLE_HDFS_NUM_ROWS_ESTIMATE=1
+select a, b, rn from (
+  select *, row_number() over (partition by a order by b) as rn
+  from functional.tinytable) v
+where rn <= 5
+order by a, rn;
+---- PLAN
+PLAN-ROOT SINK
+|
+04:SORT
+|  order by: a ASC, rn ASC
+|  row-size=32B cardinality=unavailable
+|
+03:SELECT
+|  predicates: row_number() <= 5
+|  row-size=32B cardinality=unavailable
+|
+02:ANALYTIC
+|  functions: row_number()
+|  partition by: a
+|  order by: b ASC
+|  window: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
+|  row-size=32B cardinality=unavailable
+|
+01:TOP-N
+|  partition by: a
+|  order by: b ASC
+|  partition limit: 5
+|  source expr: row_number() <= CAST(5 AS BIGINT)
+|  row-size=24B cardinality=unavailable
+|
+00:SCAN HDFS [functional.tinytable]
+   HDFS partitions=1/1 files=1 size=38B
+   row-size=24B cardinality=unavailable
+====


[impala] 07/27: IMPALA-11325: Fix UnicodeDecodeError for shell file output

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit f6a870ccdf1ce24b80b40ff3af7ac6c023138306
Author: Joe McDonnell <jo...@cloudera.com>
AuthorDate: Tue May 31 16:14:55 2022 -0700

    IMPALA-11325: Fix UnicodeDecodeError for shell file output
    
    When using the --output_file commandline option for
    impala-shell, the shell fails with UnicodeDecodeError
    if the output contains Unicode characters.
    
    For example, if running this command:
    impala-shell -B -q "select '引'" --output_file=output.txt
    This fails with:
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
    
    This happens because OutputStream::write() calls encode('utf-8')
    on a string that is already UTF-8 encoded. This change skips the
    encode('utf-8') call for Python 2. Python 3 uses a string and
    still needs the encode call.
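
    A minimal stand-alone repro of the underlying Python 2 pitfall (not
    part of this commit; the bytes below are just the UTF-8 encoding
    of '引'):

        # Python 2 only: encoding a str that already holds UTF-8 bytes
        # implicitly decodes it with the ASCII codec first, which raises
        # UnicodeDecodeError for any non-ASCII byte such as 0xe5.
        data = '\xe5\xbc\x95'   # UTF-8 bytes of u'引'
        data.encode('utf-8')    # raises UnicodeDecodeError under Python 2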
    
    This is mostly a pragmatic fix to make the code a little bit
    more functional, and there is more work to be done to have
    clear contracts for the format() methods and clear points
    of conversion to/from bytes.
    
    Testing:
     - Ran shell tests with Python 2 and Python 3 on Ubuntu 18
     - Added a shell test that outputs a Unicode character
       to an output file. Without the fix, this test fails.
    
    Change-Id: Ic40be3d530c2694465f7bd2edb0e0586ff0e1fba
    Reviewed-on: http://gerrit.cloudera.org:8080/18576
    Reviewed-by: Michael Smith <mi...@cloudera.com>
    Reviewed-by: Quanlong Huang <hu...@gmail.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 shell/shell_output.py                 | 10 +++++++++-
 tests/shell/test_shell_commandline.py | 17 +++++++++++++++++
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/shell/shell_output.py b/shell/shell_output.py
index 978196539..becc4dd06 100644
--- a/shell/shell_output.py
+++ b/shell/shell_output.py
@@ -112,7 +112,15 @@ class OutputStream(object):
         with open(self.filename, 'ab') as out_file:
           # Note that instances of this class do not persist, so it's fine to
          # close the file handle after each write.
-          out_file.write(formatted_data.encode('utf-8'))  # file opened in binary mode
+          # The file is opened in binary mode. Python 2 returns Unicode bytes
+          # that can be written directly. Python 3 returns a string, which
+          # we need to encode before writing.
+          # TODO: Reexamine the contract of the format() function and see if
+          # we can remove this.
+          if sys.version_info.major == 2 and isinstance(formatted_data, str):
+            out_file.write(formatted_data)
+          else:
+            out_file.write(formatted_data.encode('utf-8'))
           out_file.write(b'\n')
       except IOError as err:
         file_err_msg = "Error opening file %s: %s" % (self.filename, str(err))
diff --git a/tests/shell/test_shell_commandline.py b/tests/shell/test_shell_commandline.py
index 14a6fdcfb..c8393ecc4 100644
--- a/tests/shell/test_shell_commandline.py
+++ b/tests/shell/test_shell_commandline.py
@@ -1193,6 +1193,23 @@ class TestImpalaShell(ImpalaTestSuite):
       rows_from_file = [line.rstrip() for line in f]
       assert rows_from_stdout == rows_from_file
 
+  def test_output_file_utf8(self, vector, tmp_file):
+    """Test that writing UTF-8 output to a file using '--output_file' produces the
+    same output as written to stdout."""
+    # This is purely about UTF-8 output, so it doesn't need multiple rows.
+    query = "select '引'"
+    # Run the query normally and keep the stdout
+    output = run_impala_shell_cmd(vector, ['-q', query, '-B', '--output_delimiter=;'])
+    assert "Fetched 1 row(s)" in output.stderr
+    rows_from_stdout = output.stdout.strip().split('\n')
+    # Run the query with output sent to a file using '--output_file'.
+    result = run_impala_shell_cmd(vector, ['-q', query, '-B', '--output_delimiter=;',
+                                           '--output_file=%s' % tmp_file])
+    assert "Fetched 1 row(s)" in result.stderr
+    with open(tmp_file, "r") as f:
+      rows_from_file = [line.rstrip() for line in f]
+      assert rows_from_stdout == rows_from_file
+
   def test_http_socket_timeout(self, vector):
     """Test setting different http_socket_timeout_s values."""
     if (vector.get_value('strict_hs2_protocol') or


[impala] 04/27: IMPALA-11301: Fix extreme != selectivity for NDV=1

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit e5164c89e57de817dede5beca7100fd8fea97565
Author: Csaba Ringhofer <cs...@cloudera.com>
AuthorDate: Wed May 18 17:58:08 2022 +0200

    IMPALA-11301: Fix extreme != selectivity for NDV=1
    
    The original selectivity of 1.0 - 1.0/ndv makes sense for
    large NDVs, but the result is 0.0 in case of ndv==1, which
    leads to a cardinality of 1 even for huge tables. The new
    selectivity is 0.5.
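
    A minimal numeric sketch of the change (the helper name is made up;
    the formula and the 0.5 fallback are the ones in BinaryPredicate.java):

        def ne_selectivity(ndv):
            # Old formula: 1.0 - 1.0 / ndv, which collapses to 0.0 at ndv == 1.
            # New behavior: fall back to 0.5 for the ndv == 1 corner case.
            return 0.5 if ndv == 1 else 1.0 - 1.0 / ndv

        print(ne_selectivity(1))    # 0.5 (was 0.0 before the fix)
        print(ne_selectivity(100))  # 0.99, unchanged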
    
    Note that as the formula for = is not changed (1.0/ndv),
    NOT col="const" will still lead to 0.0 selectivity if ndv=1.
    Changing the formula of NOT or = would have caused a lot of
    subtle changes in plans in tests, so I don't want to touch
    those before coming to wider agreement about the correct
    approach.
    
    IMPALA-7601 contains some discussion about these formulas.
    
    Testing:
    - added a regression test
    
    Change-Id: I6b5334a8d7d6ca46a450ff98ae03e5269faaa3c6
    Reviewed-on: http://gerrit.cloudera.org:8080/18543
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 .../main/java/org/apache/impala/analysis/BinaryPredicate.java  |  3 ++-
 .../test/java/org/apache/impala/planner/CardinalityTest.java   | 10 ++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java b/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
index 72b98c3a3..22c8730e5 100644
--- a/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
+++ b/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
@@ -270,7 +270,8 @@ public class BinaryPredicate extends Predicate {
       if (op_ == Operator.DISTINCT_FROM && rChildIsNull) {
         selectivity_ = 1.0;
       } else {
-        selectivity_ = 1.0 - 1.0 / distinctValues;
+        // avoid 0.0 selectivity if ndv == 1 (IMPALA-11301).
+        selectivity_ = distinctValues == 1 ? 0.5 : (1.0 - 1.0 / distinctValues);
       }
     } else {
       return;
diff --git a/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java b/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
index 067d8385f..317a36151 100644
--- a/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
+++ b/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
@@ -101,6 +101,10 @@ public class CardinalityTest extends PlannerTestBase {
         "SELECT COUNT(*) FROM functional.alltypes GROUP BY id", 7300);
     verifyCardinality(
         "SELECT COUNT(*) FROM functional.alltypes GROUP BY bool_col", 2);
+
+    // Regression test for IMPALA-11301.
+    verifyCardinality(
+        "SELECT * FROM tpcds_parquet.date_dim WHERE d_current_day != 'a'", 36525);
   }
 
   /**
@@ -396,6 +400,12 @@ public class CardinalityTest extends PlannerTestBase {
     verifyApproxCardinality("SELECT SUM(int_col) OVER() int_col "
         + "FROM functional_parquet.alltypestiny", 742, true,
         ImmutableSet.of(), path, AnalyticEvalNode.class);
+
+    // Regression test for IMPALA-11301. row_number() is (incorrectly) assumed to have
+    // NDV=1, which was leading to selectivity=0.0 in rn != 5. Will break if someone
+    // implements correct ndv estimates for analytic functions.
+    verifyCardinality("SELECT * FROM (SELECT *, row_number() OVER(order by id) "
+        + "as rn FROM functional.alltypestiny) v where rn != 5", 4);
   }
 
   @Test


[impala] 11/27: IMPALA-11335: allocate WriteId before taking locks during INSERT to ACID tables

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 8a04e0d3d5d925b4d3ca9d3a2ec1e43a57700627
Author: Csaba Ringhofer <cs...@cloudera.com>
AuthorDate: Thu Jun 2 15:01:57 2022 +0200

    IMPALA-11335: allocate WriteId before taking locks during INSERT to ACID tables
    
    It turned out that, in the case of INSERT, HMS uses the writeId during
    lock creation, saving it so that aborted or timed-out transactions can
    be cleaned up. See the Jira for more details.
    
    Testing:
    - It is tricky to test this, so no new test was added. Hive should
      check whether there is already a new writeId for a table during
      lock creation and return an error if not - this would ensure
      that the correct behavior of Impala is tested.
    
    Change-Id: Ic13b8822662474a0d2d4d1a31f12745159c758f4
    Reviewed-on: http://gerrit.cloudera.org:8080/18583
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
    Reviewed-by: Zoltan Borok-Nagy <bo...@cloudera.com>
---
 fe/src/main/java/org/apache/impala/service/Frontend.java | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fe/src/main/java/org/apache/impala/service/Frontend.java b/fe/src/main/java/org/apache/impala/service/Frontend.java
index d1208519c..5324e55d1 100644
--- a/fe/src/main/java/org/apache/impala/service/Frontend.java
+++ b/fe/src/main/java/org/apache/impala/service/Frontend.java
@@ -2065,10 +2065,10 @@ public class Frontend {
               staticPartitionTarget = FeCatalogUtils.getPartitionName(
                   insertStmt.getPartitionKeyValues());
             }
-            createLockForInsert(txnId, tables, targetTable, insertStmt.isOverwrite(),
-                staticPartitionTarget, queryOptions);
             long writeId = allocateWriteId(queryCtx, targetTable);
             insertStmt.setWriteId(writeId);
+            createLockForInsert(txnId, tables, targetTable, insertStmt.isOverwrite(),
+                staticPartitionTarget, queryOptions);
 
             planCtx.compilationState_.setWriteId(writeId);
           } else {


[impala] 23/27: IMPALA-10267: Addendum (Properly handle continued scanning after parse error in HdfsAvroScanner)

Posted by st...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch branch-4.1.1
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 07e3c76dde554da932e05753df022091e50b37b8
Author: Daniel Becker <da...@cloudera.com>
AuthorDate: Thu Jul 21 13:49:26 2022 +0200

    IMPALA-10267: Addendum
    (Properly handle continued scanning after parse error in
    HdfsAvroScanner)
    
    The commit for IMPALA-10267 contained an error: in
    hdfs-avro-scanner.cc:514 the condition checked the num_records_in_block_
    member variable instead of the num_records_in_block local variable. As a
    result, the error described in IMPALA-10267 was not actually corrected.
    
    This commit fixes this error.
    
    Change-Id: I81e12e43c776f6eaa8a2803ea87e98093a98c2b7
    Reviewed-on: http://gerrit.cloudera.org:8080/18767
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 be/src/exec/hdfs-avro-scanner.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/be/src/exec/hdfs-avro-scanner.cc b/be/src/exec/hdfs-avro-scanner.cc
index a8fa737ef..3cd20ba6e 100644
--- a/be/src/exec/hdfs-avro-scanner.cc
+++ b/be/src/exec/hdfs-avro-scanner.cc
@@ -511,7 +511,7 @@ Status HdfsAvroScanner::ProcessRange(RowBatch* row_batch) {
       data_block_end_ = nullptr;
       int64_t num_records_in_block;
       RETURN_IF_FALSE(stream_->ReadZLong(&num_records_in_block, &parse_status_));
-      if (num_records_in_block_ < 0) {
+      if (num_records_in_block < 0) {
         return Status(TErrorCode::AVRO_INVALID_RECORD_COUNT, stream_->filename(),
             num_records_in_block_, stream_->file_offset());
       }