You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Quanlong Huang (Code Review)" <ge...@cloudera.org> on 2022/04/20 12:46:06 UTC

[Impala-ASF-CR] IMPALA-11141: Use exact data types in IN-list filter

Quanlong Huang has uploaded this change for review. ( http://gerrit.cloudera.org:8080/18433


Change subject: IMPALA-11141: Use exact data types in IN-list filter
......................................................................

IMPALA-11141: Use exact data types in IN-list filter

Currently, we use a std::unordered_set<int64_t> for all numeric types
(including DATE type). It's a waste of space for small data types like
tinyint, smallint, int, etc. This patch extends the base InListFilter
class with different implementations for different data types.

For string type in-list filters, this patch uses impala::StringValue
instead of std::string. This simplifies the Insert() method, which
improves the codegen time. To use impala::StringValue, this patch
switches the set implementation to boost::unordered_set. Same as what we
use in InPredicate.

Another improvement of using impala::StringValue is that we can easily
maintain the strings in MemPool. When inserting a new batch of values,
the new values are inserted into a temp set. String pointers still
reference to the original tuple values. At the end of processing each
batch, MaterializeValues() is invoked to copy the strings into the
filter's own mem pool. This is more memory-friendly than the original
approach since we can allocate the string batch in once.

Tests:
 - Add unit tests for different types of in-list filters

Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/filter-context.cc
M be/src/exec/hdfs-orc-scanner.cc
M be/src/runtime/runtime-filter-bank.cc
M be/src/util/CMakeLists.txt
M be/src/util/in-list-filter-ir.cc
A be/src/util/in-list-filter-test.cc
M be/src/util/in-list-filter.cc
M be/src/util/in-list-filter.h
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-planner/queries/PlannerTest/runtime-filter-query-options.test
11 files changed, 586 insertions(+), 233 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/33/18433/1
-- 
To view, visit http://gerrit.cloudera.org:8080/18433
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
Gerrit-Change-Number: 18433
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-11141: Use exact data types in IN-list filter

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18433 )

Change subject: IMPALA-11141: Use exact data types in IN-list filter
......................................................................


Patch Set 2:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8057/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/18433
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
Gerrit-Change-Number: 18433
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 20 Apr 2022 13:58:16 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11141: Use exact data types in IN-list filter

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18433

to look at the new patch set (#3).

Change subject: IMPALA-11141: Use exact data types in IN-list filter
......................................................................

IMPALA-11141: Use exact data types in IN-list filter

Currently, we use a std::unordered_set<int64_t> for all numeric types
(including DATE type). It's a waste of space for small data types like
tinyint, smallint, int, etc. This patch extends the base InListFilter
class with different implementations for different data types.

For string type in-list filters, this patch uses impala::StringValue
instead of std::string. This simplifies the Insert() method, which
improves the codegen time. To use impala::StringValue, this patch
switches the set implementation to boost::unordered_set. Same as what we
use in InPredicate.

Another improvement of using impala::StringValue is that we can easily
maintain the strings in MemPool. When inserting a new batch of values,
the new values are inserted into a temp set. String pointers still
reference to the original tuple values. At the end of processing each
batch, MaterializeValues() is invoked to copy the strings into the
filter's own mem pool. This is more memory-friendly than the original
approach since we can allocate the string batch in once.

Tests:
 - Add unit tests for different types of in-list filters

Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/filter-context.cc
M be/src/exec/hdfs-orc-scanner.cc
M be/src/runtime/runtime-filter-bank.cc
M be/src/util/CMakeLists.txt
M be/src/util/in-list-filter-ir.cc
A be/src/util/in-list-filter-test.cc
M be/src/util/in-list-filter.cc
M be/src/util/in-list-filter.h
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-planner/queries/PlannerTest/runtime-filter-query-options.test
11 files changed, 615 insertions(+), 235 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/33/18433/3
-- 
To view, visit http://gerrit.cloudera.org:8080/18433
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
Gerrit-Change-Number: 18433
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-11141: Use exact data types in IN-list filter

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18433 )

Change subject: IMPALA-11141: Use exact data types in IN-list filter
......................................................................


Patch Set 3:

(14 comments)

Thank Qifan! Addressed the comments.

http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/exec/filter-context.cc
File be/src/exec/filter-context.cc:

http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/exec/filter-context.cc@451
PS2, Line 451: DCHECK
> DCHECK(false)
Done


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter-ir.cc
File be/src/util/in-list-filter-ir.cc:

http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter-ir.cc@32
PS2, Line 32: UNLIKELY(val
> UNLIKELY
Done


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter-ir.cc@41
PS2, Line 41:       }                                                                               \
> probably can be moved to a reset() method.
Done


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter-ir.cc@85
PS2, Line 85:  }                                                                                     \
            :                                                                                         \
            :   template<>                                                                            \
            :   bool InListFilterImpl<StringValue, SLOT_TYPE>::Find(const void* val,                  \
            :       const ColumnType& col_typ
> should be part of a reset() method.
Done


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter-test.cc
File be/src/util/in-list-filter-test.cc:

http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter-test.cc@173
PS2, Line 173: char str_buffer[] = "aabbccddeeff";
             :   const char* ptr = str_buffer;
             :   // Insert 3 values first
             :   for (int i = 0; i < 3; ++i) {
             :     filter->Insert(ptr);
             :     ptr += 2;
             :   }
> Some code duplication between CHAR test case and VARCHAR/STRING test case. 
The trouble is they use different Find() and Insert() methods. VARCHAR/STRING expects their first argument to be "const StringValue*". CHAR expects their first argument to be "const char*". Maybe it's ok to keep the current code.


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.h
File be/src/util/in-list-filter.h:

http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.h@70
PS2, Line 70: 
> nit: excluding
Done


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.h@113
PS2, Line 113:   InListFilter(entry_limit, contains_null) {}
> I wonder if this can be moved to the base class with the modification:
sure


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.h@145
PS2, Line 145:     total_len += (res.second? v.len : 0);
             :     return res.second;
             :   }
> See the above comment about moving the method to base class.
Done


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.h@163
PS2, Line 163: 
> nit. transferred
Done


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.h@165
PS2, Line 165: 
> nit. maybe newly_inserted_values_ to make it clear.
Done


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.h@165
PS2, Line 165: template<PrimitiveType SLOT_TYPE>
             : class InListFilterImpl<StringValue,
> These two data members can be combined into a new struct
Good point!


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.cc
File be/src/util/in-list-filter.cc:

http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.cc@65
PS2, Line 65: default:
> nit. At some point of time in future, we may need to support float and doub
I think it's not a best practise to use float/double as join keys since floating point computation is inaccurate. We can add them if users require them.


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.cc@143
PS2, Line 143:   for (const StringValue& s : newly_inserted_values_.values) {
> nit. May add a comment here: total_entries_ is updated during insert().
Done


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.cc@144
PS2, Line 144:   Ubsan::MemCpy(buffer, s.ptr, s.len);
             :     values_.insert(StringVal
> this could be part of a method (say reset()) on the struct SetOfStringsWIth
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/18433
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
Gerrit-Change-Number: 18433
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 25 Apr 2022 09:32:06 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11141: Use exact data types in IN-list filter

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18433 )

Change subject: IMPALA-11141: Use exact data types in IN-list filter
......................................................................


Patch Set 2: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/18433
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
Gerrit-Change-Number: 18433
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 20 Apr 2022 18:26:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11141: Use exact data types in IN-list filter

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18433

to look at the new patch set (#5).

Change subject: IMPALA-11141: Use exact data types in IN-list filter
......................................................................

IMPALA-11141: Use exact data types in IN-list filter

Currently, we use a std::unordered_set<int64_t> for all numeric types
(including DATE type). It's a waste of space for small data types like
tinyint, smallint, int, etc. This patch extends the base InListFilter
class with native implementations for different data types.

For string type in-list filters, this patch uses impala::StringValue
instead of std::string. This simplifies the Insert() method, which
improves the codegen time. To use impala::StringValue, this patch
switches the set implementation to boost::unordered_set. Same as what we
use in InPredicate.

Another improvement of using impala::StringValue is that we can easily
maintain the strings in MemPool. When inserting a new batch of values,
the new values are inserted into a temp set. String pointers still
reference to the original tuple values. At the end of processing each
batch, MaterializeValues() is invoked to copy the strings into the
filter's own mem pool. This is more memory-friendly than the original
approach since we can allocate the string batch at once.

Tests:
 - Add unit tests for different types of in-list filters

Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/filter-context.cc
M be/src/exec/hdfs-orc-scanner.cc
M be/src/runtime/runtime-filter-bank.cc
M be/src/util/CMakeLists.txt
M be/src/util/in-list-filter-ir.cc
A be/src/util/in-list-filter-test.cc
M be/src/util/in-list-filter.cc
M be/src/util/in-list-filter.h
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-planner/queries/PlannerTest/runtime-filter-query-options.test
11 files changed, 615 insertions(+), 235 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/33/18433/5
-- 
To view, visit http://gerrit.cloudera.org:8080/18433
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
Gerrit-Change-Number: 18433
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-11141: Use exact data types in IN-list filter

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18433 )

Change subject: IMPALA-11141: Use exact data types in IN-list filter
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10486/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18433
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
Gerrit-Change-Number: 18433
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 25 Apr 2022 09:51:06 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11141: Use exact data types in IN-list filter

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18433 )

Change subject: IMPALA-11141: Use exact data types in IN-list filter
......................................................................


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8064/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/18433
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
Gerrit-Change-Number: 18433
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 26 Apr 2022 23:05:36 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11141: Use exact data types in IN-list filter

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18433 )

Change subject: IMPALA-11141: Use exact data types in IN-list filter
......................................................................


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/18433/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18433/2//COMMIT_MSG@12
PS2, Line 12: class with native implementations for different data types.
> nit. native implementations
Done


http://gerrit.cloudera.org:8080/#/c/18433/2//COMMIT_MSG@26
PS2, Line 26: at
> nit at
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/18433
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
Gerrit-Change-Number: 18433
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 26 Apr 2022 08:57:55 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11141: Use exact data types in IN-list filter

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18433 )

Change subject: IMPALA-11141: Use exact data types in IN-list filter
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.cc
File be/src/util/in-list-filter.cc:

http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.cc@65
PS2, Line 65: default:
> Sure. But to the question whether to support float/double in InList filter,
I think min/max filters make more sense since they can be used in non-equal joins. IN-list filters can only be generated by equal joins. Maybe DECIMAL is better than float/double if users do want to join on decimal columns.

Found a related discussion on SQLServer:
https://social.msdn.microsoft.com/forums/sqlserver/en-US/9539c78e-3e63-4faf-b9fb-5163ce0980b5/joining-on-columns-with-float-data-types



-- 
To view, visit http://gerrit.cloudera.org:8080/18433
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
Gerrit-Change-Number: 18433
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 26 Apr 2022 01:07:18 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11141: Use exact data types in IN-list filter

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18433

to look at the new patch set (#2).

Change subject: IMPALA-11141: Use exact data types in IN-list filter
......................................................................

IMPALA-11141: Use exact data types in IN-list filter

Currently, we use a std::unordered_set<int64_t> for all numeric types
(including DATE type). It's a waste of space for small data types like
tinyint, smallint, int, etc. This patch extends the base InListFilter
class with different implementations for different data types.

For string type in-list filters, this patch uses impala::StringValue
instead of std::string. This simplifies the Insert() method, which
improves the codegen time. To use impala::StringValue, this patch
switches the set implementation to boost::unordered_set. Same as what we
use in InPredicate.

Another improvement of using impala::StringValue is that we can easily
maintain the strings in MemPool. When inserting a new batch of values,
the new values are inserted into a temp set. String pointers still
reference to the original tuple values. At the end of processing each
batch, MaterializeValues() is invoked to copy the strings into the
filter's own mem pool. This is more memory-friendly than the original
approach since we can allocate the string batch in once.

Tests:
 - Add unit tests for different types of in-list filters

Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/filter-context.cc
M be/src/exec/hdfs-orc-scanner.cc
M be/src/runtime/runtime-filter-bank.cc
M be/src/util/CMakeLists.txt
M be/src/util/in-list-filter-ir.cc
A be/src/util/in-list-filter-test.cc
M be/src/util/in-list-filter.cc
M be/src/util/in-list-filter.h
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-planner/queries/PlannerTest/runtime-filter-query-options.test
11 files changed, 588 insertions(+), 237 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/33/18433/2
-- 
To view, visit http://gerrit.cloudera.org:8080/18433
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
Gerrit-Change-Number: 18433
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-11141: Use exact data types in IN-list filter

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18433 )

Change subject: IMPALA-11141: Use exact data types in IN-list filter
......................................................................


Patch Set 5: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/18433
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
Gerrit-Change-Number: 18433
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Wed, 27 Apr 2022 03:30:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11141: Use exact data types in IN-list filter

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/18433 )

Change subject: IMPALA-11141: Use exact data types in IN-list filter
......................................................................

IMPALA-11141: Use exact data types in IN-list filter

Currently, we use a std::unordered_set<int64_t> for all numeric types
(including DATE type). It's a waste of space for small data types like
tinyint, smallint, int, etc. This patch extends the base InListFilter
class with native implementations for different data types.

For string type in-list filters, this patch uses impala::StringValue
instead of std::string. This simplifies the Insert() method, which
improves the codegen time. To use impala::StringValue, this patch
switches the set implementation to boost::unordered_set. Same as what we
use in InPredicate.

Another improvement of using impala::StringValue is that we can easily
maintain the strings in MemPool. When inserting a new batch of values,
the new values are inserted into a temp set. String pointers still
reference to the original tuple values. At the end of processing each
batch, MaterializeValues() is invoked to copy the strings into the
filter's own mem pool. This is more memory-friendly than the original
approach since we can allocate the string batch at once.

Tests:
 - Add unit tests for different types of in-list filters

Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
Reviewed-on: http://gerrit.cloudera.org:8080/18433
Reviewed-by: Qifan Chen <qc...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/filter-context.cc
M be/src/exec/hdfs-orc-scanner.cc
M be/src/runtime/runtime-filter-bank.cc
M be/src/util/CMakeLists.txt
M be/src/util/in-list-filter-ir.cc
A be/src/util/in-list-filter-test.cc
M be/src/util/in-list-filter.cc
M be/src/util/in-list-filter.h
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-planner/queries/PlannerTest/runtime-filter-query-options.test
11 files changed, 615 insertions(+), 235 deletions(-)

Approvals:
  Qifan Chen: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/18433
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
Gerrit-Change-Number: 18433
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-11141: Use exact data types in IN-list filter

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18433 )

Change subject: IMPALA-11141: Use exact data types in IN-list filter
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10472/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18433
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
Gerrit-Change-Number: 18433
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 20 Apr 2022 13:57:34 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11141: Use exact data types in IN-list filter

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18433 )

Change subject: IMPALA-11141: Use exact data types in IN-list filter
......................................................................


Patch Set 5:

(1 comment)

Thank Qifan for the review!

http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.cc
File be/src/util/in-list-filter.cc:

http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.cc@65
PS2, Line 65: default:
> Yes, if to preserve the precision is critical for join, then DECIMAL data t
Sure. I filed MPALA-11272 as a follow-up JIRA for this.



-- 
To view, visit http://gerrit.cloudera.org:8080/18433
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
Gerrit-Change-Number: 18433
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 26 Apr 2022 23:05:13 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11141: Use exact data types in IN-list filter

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/18433 )

Change subject: IMPALA-11141: Use exact data types in IN-list filter
......................................................................


Patch Set 4: Code-Review+2

(2 comments)

Looks great! Thanks a lot for the rework.

http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter-test.cc
File be/src/util/in-list-filter-test.cc:

http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter-test.cc@173
PS2, Line 173: char str_buffer[] = "aabbccddeeff";
             :   const char* ptr = str_buffer;
             :   // Insert 3 values first
             :   for (int i = 0; i < 3; ++i) {
             :     filter->Insert(ptr);
             :     ptr += 2;
             :   }
> The trouble is they use different Find() and Insert() methods. VARCHAR/STRI
Done


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.cc
File be/src/util/in-list-filter.cc:

http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.cc@65
PS2, Line 65: default:
> I think it's not a best practise to use float/double as join keys since flo
Sure. But to the question whether to support float/double in InList filter, my thinking is as follows.

1. Min/max filters are inclusive in nature;
2. In 32 or 64 bit IEEE floating representation, integers (i.e., when fraction part is 0) can be precise for a very large range. See https://en.wikipedia.org/wiki/Single-precision_floating-point_format and https://en.wikipedia.org/wiki/Double-precision_floating-point_format.



-- 
To view, visit http://gerrit.cloudera.org:8080/18433
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
Gerrit-Change-Number: 18433
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 25 Apr 2022 13:22:47 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11141: Use exact data types in IN-list filter

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18433 )

Change subject: IMPALA-11141: Use exact data types in IN-list filter
......................................................................


Patch Set 1:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/10471/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/18433
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
Gerrit-Change-Number: 18433
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 20 Apr 2022 13:06:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11141: Use exact data types in IN-list filter

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/18433 )

Change subject: IMPALA-11141: Use exact data types in IN-list filter
......................................................................


Patch Set 2:

(17 comments)

Looks very good!

http://gerrit.cloudera.org:8080/#/c/18433/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18433/2//COMMIT_MSG@12
PS2, Line 12: class with different implementations for different data types.
nit. native implementations


http://gerrit.cloudera.org:8080/#/c/18433/2//COMMIT_MSG@26
PS2, Line 26: in
nit at


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/exec/filter-context.cc
File be/src/exec/filter-context.cc:

http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/exec/filter-context.cc@451
PS2, Line 451: break;
DCHECK(false)


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter-ir.cc
File be/src/util/in-list-filter-ir.cc:

http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter-ir.cc@32
PS2, Line 32: always_true_
UNLIKELY


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter-ir.cc@41
PS2, Line 41:         always_true_ = true;                                                          \
probably can be moved to a reset() method.


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter-ir.cc@85
PS2, Line 85:          always_true_ = true;                                               \
            :           contains_null_ = false;                                            \
            :           values_.clear();                                                   \
            :           new_values_.clear();                                               \
            :           total_entries_ = 0;  
should be part of a reset() method.


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter-test.cc
File be/src/util/in-list-filter-test.cc:

http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter-test.cc@173
PS2, Line 173: char str_buffer[] = "aabbccddeeff";
             :   const char* ptr = str_buffer;
             :   // Insert 3 values first
             :   for (int i = 0; i < 3; ++i) {
             :     filter->Insert(ptr);
             :     ptr += 2;
             :   }
Some code duplication between CHAR test case and VARCHAR/STRING test case. 

Maybe we can remove it by passing ss1 and ss2 to TestStringInListFilter(). In this way, we only need the code to build ss1 and ss2 for CHAR and VARCHAR/STRING first.


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.h
File be/src/util/in-list-filter.h:

http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.h@70
PS2, Line 70: not including
nit: excluding


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.h@113
PS2, Line 113: return !always_true_ && !contains_null_ && values_.empty();
I wonder if this can be moved to the base class with the modification:

 bool AlwaysFalse() override {
    return !always_true_ && !contains_null_ && total_entries == 0;
}


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.h@145
PS2, Line 145:   bool AlwaysFalse() override {
             :     return !always_true_ && !contains_null_ && total_entries_ == 0;
             :   }
See the above comment about moving the method to base class.


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.h@163
PS2, Line 163: inserted
nit. transferred


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.h@165
PS2, Line 165: new_values_
nit. maybe newly_inserted_values_ to make it clear.


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.h@165
PS2, Line 165:   boost::unordered_set<StringValue> new_values_;
             :   size_t new_values_total_len_ = 0;
These two data members can be combined into a new struct

struct SetOfStringsWithTotalLenT {
 boost::unordered_set<StringValue> new_values_;
 size_t total_len_;
}

And then define two such objects 

SetOfStringsWithTotalLenT values_;
SetOfStringsWithTotalLenT newly_inserted_values_;


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.cc
File be/src/util/in-list-filter.cc:

http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.cc@41
PS2, Line 41: InListFilterImpl
cool!


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.cc@65
PS2, Line 65: default:
nit. At some point of time in future, we may need to support float and double, like in min/max filter (see util/min-max-filter.h).


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.cc@143
PS2, Line 143:   }
nit. May add a comment here: total_entries_ is updated during insert().


http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.cc@144
PS2, Line 144: new_values_.clear();
             :   new_values_total_len_ = 0;
this could be part of a method (say reset()) on the struct SetOfStringsWIthLengthT



-- 
To view, visit http://gerrit.cloudera.org:8080/18433
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
Gerrit-Change-Number: 18433
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Fri, 22 Apr 2022 15:15:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11141: Use exact data types in IN-list filter

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/18433 )

Change subject: IMPALA-11141: Use exact data types in IN-list filter
......................................................................


Patch Set 5: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.cc
File be/src/util/in-list-filter.cc:

http://gerrit.cloudera.org:8080/#/c/18433/2/be/src/util/in-list-filter.cc@65
PS2, Line 65: default:
> I think min/max filters make more sense since they can be used in non-equal
Yes, if to preserve the precision is critical for join, then DECIMAL data type is a good choice. 

Since impala/hive allows joins on float/double, the issue is whether it can be helped with some performance feature like this one. I am in favor of it. There could be good use case of such. 

A min/max filter can be derived from an equal join in Impala and the comparisons of column values with the boundary values in the filter is inclusive.



-- 
To view, visit http://gerrit.cloudera.org:8080/18433
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9
Gerrit-Change-Number: 18433
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 26 Apr 2022 14:47:48 +0000
Gerrit-HasComments: Yes