You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Quanlong Huang (Code Review)" <ge...@cloudera.org> on 2021/08/17 03:06:10 UTC

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions

Quanlong Huang has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17780


Change subject: IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions
......................................................................

IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions

This patch provides consistent masking behavior with Hive's on UTF-8
strings by turning on the UTF-8 mode, i.e. set UTF8_MODE=true. In UTF-8
mode, the masked unit of a string is a unicode code point.

Implementation
 - Extends the existing MaskTransform function to deal with unicode code
   points(represented by uint32_t).
 - Extends the existing GetFirstChar function to get the code point of
   given masked charactors in UTF-8 mode.
 - Implement a MaskSubStrUtf8 method as the core functionality.
 - Swith to use MaskSubStrUtf8 instead of MaskSubStr in UTF-8 mode.
 - For better testing, this patch also adds an overload for all mask
   functions for only masking other chars but keeping the
   upper/lower/digit chars unmasked. E.g. mask({col}, -1, -1, -1, 'X').

Tests
 - Add BE tests in expr-test
 - Add e2e tests in utf8-string-functions.test

Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
---
M be/src/exprs/expr-test.cc
M be/src/exprs/mask-functions-ir.cc
M be/src/exprs/mask-functions.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
5 files changed, 293 insertions(+), 49 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/17780/1
-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................


Patch Set 4:

(10 comments)

Thanks for your review, Qifan! Addressed the comments.

http://gerrit.cloudera.org:8080/#/c/17780/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17780/3//COMMIT_MSG@10
PS3, Line 10: under the UTF-8 mode, i.e., set UTF8_MODE=true. In UT
> nit. I wonder if this is true. Or you mean to say "This patch provides cons
Yeah, done.


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc
File be/src/exprs/mask-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@114
PS3, Line 114: CheckAndWarnCodePoint
> I wonder if there is a need to check here. Looks like we call Utf8CodePoint
In some places we don't call Utf8CodePointCount(), e.g. in MaskShowFirstNImpl(). So we still need this.


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@122
PS3, Line 122: char_cnt
> I think this should be reset to 0 prior to the WHILE loop.
We are collecting code points at index range [start, end - 1) here, so should not reset 'char_cnt'.


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@123
PS3, Line 123:  uint32_t c = utf8_codecvt<char>::to_unicode(cvt_state, p, p_end);
             :     if (CheckAndWarnCodePoint(ctx, c)) return StringVal::null();
> Redundant code to code block at line 112?
These are reading new code point from the string. Do you mean we should refactor these two-lines codes into a method?


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@142
PS3, Line 142: for (uint32_t c : masked_code_points) {
             :     uint32_t width = utf8_codecvt<char>::from_unicode(cvt_state, c, ptr, p_end);
             :     DCHECK(width != utf::illegal && width != utf::incomplete);
             :     ptr += width;
             :     DCHECK(ptr <= p_end);
             :   }
> nit. Maybe we can put the work in this loop in the loop about after line 12
I think we do need to iterate the result code points twice. The first pass is used to calculate the result byte length. Then we allocate the result StringVal and write bytes into it in the second pass.

We can do sth like what we do in StringFunctions::Replace() that guess the result byte length and double/reallocate it if it's not enough. But I think the current approach is better that we won't waste any mem space, though a bit slower.


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@159
PS3, Line 159: if 
> nit. remove.
Done


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@160
PS3, Line 160: GetUtf8CodePointCo
> nit. Maybe name it as GetUtf8ByteCount().
Sorry that we are calculating the number of code points, not bytes here. I renamed it to GetUtf8CodePointCount().


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@170
PS3, Line 170: c == utf::illegal ? "illegal" : "i
> This string should be placed as the 2nd argument to produce a better messag
Done


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@238
PS3, Line 238:  int start = GetUtf8CodePointCount(ctx, val) - mask_char_count;
             :   if (start < 0) start = 0;
> I wonder if this is intended. Report error immediately when start == -1?
Yes, this is intended and is consistent with Hive. If 'mask_char_count' is larger than the number of all characters, we mask all characters.

 0: jdbc:hive2://localhost:11050> select mask_last_n('abcd', 10);
 xxxx


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@388
PS3, Line 388:  string msg = Substitute("$0 unicode code point found in the beginning of $1",
             :         c == utf::illegal ? "Illegal" : "Incomplete", AnyValUtil::ToString(str));
> same comment above by inserting the code point string immediately after "un
Sorry that this is for another purpose. 'str' here is the whole string instead of the first code point. This function is used to get the first code point from a string, e.g. "abc" => "a".



-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 30 Aug 2021 06:39:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9308/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 17 Aug 2021 07:19:11 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................


Patch Set 4:

Note that in IMPALA-9962, the stated first goal is "It will be very useful to provide mask functions that can deal with UTF-8 characters".


-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 07 Sep 2021 14:57:07 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17780/2/be/src/exprs/mask-functions-ir.cc
File be/src/exprs/mask-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17780/2/be/src/exprs/mask-functions-ir.cc@125
PS2, Line 125:     c = MaskTransform(c, masked_upper_char, masked_lower_char, masked_digit_char, masked_other_char);
line too long (101 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 17 Aug 2021 05:26:35 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9453/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 14 Sep 2021 07:05:18 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................


Patch Set 4:

(6 comments)

Looks great. My only concern is whether we just want to mask ascii characters, or Unicode characters. For the former a simpler algorithm exists for the method MastSubStrUtf8(). 

We may need to decide the scope of the change (ascii characters only, or Unicode characters in general) first.

http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc
File be/src/exprs/mask-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@122
PS3, Line 122: char_cnt
> We are collecting code points at index range [start, end - 1) here, so shou
Done


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@123
PS3, Line 123:  uint32_t c = utf8_codecvt<char>::to_unicode(cvt_state, p, p_end);
             :     if (CheckAndWarnCodePoint(ctx, c)) return StringVal::null();
> These are reading new code point from the string. Do you mean we should ref
Got it. Done.


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@142
PS3, Line 142: for (uint32_t c : masked_code_points) {
             :     uint32_t width = utf8_codecvt<char>::from_unicode(cvt_state, c, ptr, p_end);
             :     DCHECK(width != utf::illegal && width != utf::incomplete);
             :     ptr += width;
             :     DCHECK(ptr <= p_end);
             :   }
> I think we do need to iterate the result code points twice. The first pass 
Thanks a lot for the explanation. 

I think we may not need 'result' of StringVal since in this patch we are not masking any non-ascii Unicode characters (see MaskTransform()). All we do is to change certain characters (byte for byte). So we could make a copy of source, and directly update the substring in question and thus save some CPU cycles and extra memory here. 

Unicode standard does cover more upper, lower and title letters (see for example https://www.compart.com/en/unicode/category/Lu). If we want to pursuit the masking of those in this patch, then this method is okay. We do need to beef up MaskTransform().


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@160
PS3, Line 160: GetUtf8CodePointCo
> Sorry that we are calculating the number of code points, not bytes here. I 
Done


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@238
PS3, Line 238:  int start = GetUtf8CodePointCount(ctx, val) - mask_char_count;
             :   if (start < 0) start = 0;
> Yes, this is intended and is consistent with Hive. If 'mask_char_count' is 
Done


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@388
PS3, Line 388:  string msg = Substitute("$0 unicode code point found in the beginning of $1",
             :         c == utf::illegal ? "Illegal" : "Incomplete", AnyValUtil::ToString(str));
> Sorry that this is for another purpose. 'str' here is the whole string inst
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 30 Aug 2021 15:49:20 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions
......................................................................


Patch Set 3:

(10 comments)

Looks good to me!

http://gerrit.cloudera.org:8080/#/c/17780/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17780/3//COMMIT_MSG@10
PS3, Line 10: by turning on the UTF-8 mode, i.e. set UTF8_MODE=true
nit. I wonder if this is true. Or you mean to say "This patch provides consistent masking behavior with Hive for
strings under the UTF-8 mode, i.e., set UTF8_MODE=true.".


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc
File be/src/exprs/mask-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@114
PS3, Line 114: CheckAndWarnCodePoint
I wonder if there is a need to check here. Looks like we call Utf8CodePointCount() in the first place. If we reach here, the string is good.


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@122
PS3, Line 122: char_cnt
I think this should be reset to 0 prior to the WHILE loop.


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@123
PS3, Line 123:  uint32_t c = utf8_codecvt<char>::to_unicode(cvt_state, p, p_end);
             :     if (CheckAndWarnCodePoint(ctx, c)) return StringVal::null();
Redundant code to code block at line 112?


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@142
PS3, Line 142: for (uint32_t c : masked_code_points) {
             :     uint32_t width = utf8_codecvt<char>::from_unicode(cvt_state, c, ptr, p_end);
             :     DCHECK(width != utf::illegal && width != utf::incomplete);
             :     ptr += width;
             :     DCHECK(ptr <= p_end);
             :   }
nit. Maybe we can put the work in this loop in the loop about after line 126. So masked_code_points can become converted_code_points. Furthermore, converted_code_points can be declared as std::string() and we append converted code points to it.


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@159
PS3, Line 159: for
nit. remove.


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@160
PS3, Line 160: Utf8CodePointCount
nit. Maybe name it as GetUtf8ByteCount().


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@170
PS3, Line 170: AnyValUtil::ToString(val)).c_str()
This string should be placed as the 2nd argument to produce a better message. For example, 

The 1-th code point <xxxx> is illegal.


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@238
PS3, Line 238:  int start = Utf8CodePointCount(ctx, val) - mask_char_count;
             :   if (start < 0) start = 0;
I wonder if this is intended. Report error immediately when start == -1?


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@388
PS3, Line 388:  string msg = Substitute("$0 unicode code point found in the beginning of $1",
             :         c == utf::illegal ? "Illegal" : "Incomplete", AnyValUtil::ToString(str));
same comment above by inserting the code point string immediately after "unicode code point".



-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Mon, 23 Aug 2021 16:06:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................


Patch Set 7:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7473/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 14 Sep 2021 22:52:36 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9306/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 17 Aug 2021 03:29:28 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions
......................................................................


Patch Set 1:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/17780/1/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/17780/1/be/src/exprs/expr-test.cc@10658
PS1, Line 10658:   TestStringValue("mask_show_first_n('hello李小龙', 6, 'x', 'x', 'x', 'X')", "hello李XX");
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17780/1/be/src/exprs/expr-test.cc@10660
PS1, Line 10660:   TestStringValue("mask_show_first_n('hello李小龙', 4, 'x', 'x', 'x', 'X')", "hellxXXX");
line too long (92 > 90)


http://gerrit.cloudera.org:8080/#/c/17780/1/be/src/exprs/expr-test.cc@10664
PS1, Line 10664:   TestStringValue("mask_first_n('hello李小龙', 6, 'x', 'x', 'x', 'X')", "xxxxxX小龙");
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17780/1/be/src/exprs/expr-test.cc@10665
PS1, Line 10665:   TestStringValue("mask_show_last_n('hello李小龙', 2, 'x', 'x', 'x', 'X')", "xxxxxX小龙");
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/17780/1/be/src/exprs/expr-test.cc@10666
PS1, Line 10666:   TestStringValue("mask_show_last_n('hello李小龙', 4, 'x', 'x', 'x', 'X')", "xxxxo李小龙");
line too long (97 > 90)


http://gerrit.cloudera.org:8080/#/c/17780/1/be/src/exprs/expr-test.cc@10668
PS1, Line 10668:   // Test masking to unicode code points. Specify -1(unmask) for masking upper/lower/digit chars.
line too long (97 > 90)


http://gerrit.cloudera.org:8080/#/c/17780/1/be/src/exprs/expr-test.cc@10670
PS1, Line 10670:   TestStringValue("mask_last_n('hello李小龙', 4, -1, -1, -1, '某')", "hello某某某");
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17780/1/be/src/exprs/expr-test.cc@10671
PS1, Line 10671:   TestStringValue("mask_last_n('hello李小龙', 2, -1, -1, -1, '某')", "hello李某某");
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17780/1/be/src/exprs/expr-test.cc@10672
PS1, Line 10672:   TestStringValue("mask_show_first_n('hello李小龙', 4, -1, -1, -1, '某')", "hello某某某");
line too long (97 > 90)


http://gerrit.cloudera.org:8080/#/c/17780/1/be/src/exprs/expr-test.cc@10673
PS1, Line 10673:   TestStringValue("mask_show_first_n('hello李小龙', 6, -1, -1, -1, '某')", "hello李某某");
line too long (97 > 90)


http://gerrit.cloudera.org:8080/#/c/17780/1/be/src/exprs/expr-test.cc@10674
PS1, Line 10674:   TestStringValue("mask_first_n('李小龙hello', 4, -1, -1, -1, '某')", "某某某hello");
line too long (92 > 90)


http://gerrit.cloudera.org:8080/#/c/17780/1/be/src/exprs/expr-test.cc@10675
PS1, Line 10675:   TestStringValue("mask_show_last_n('李小龙hello', 5, -1, -1, -1, '某')", "某某某hello");
line too long (96 > 90)


http://gerrit.cloudera.org:8080/#/c/17780/1/be/src/exprs/mask-functions-ir.cc
File be/src/exprs/mask-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17780/1/be/src/exprs/mask-functions-ir.cc@125
PS1, Line 125:     c = MaskTransform(c, masked_upper_char, masked_lower_char, masked_digit_char, masked_other_char);
line too long (101 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 17 Aug 2021 03:07:09 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................


Patch Set 4:

> Patch Set 4:
> 
> (6 comments)
> 
> Looks great. My only concern is whether we just want to mask ascii characters, or Unicode characters. For the former a simpler algorithm exists for the method MastSubStrUtf8(). 
> 
> We may need to decide the scope of the change (ascii characters only, or Unicode characters in general) first.

Sorry that I may misunderstanding this.. For masking only ascii characters, we already support it in non-UTF8 mode. However, Hive always masks unicode characters (as described in IMPALA-9662). So this patch aims to add support for masking unicode characters.

Or do you mean whether we should support masking characters using unicode characters?


-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Sat, 04 Sep 2021 11:46:13 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................


Patch Set 4:

Ah, I see. Yeah, the mask functions can mask characters based on their types, i.e. upper case, lower case, number digits, other chars etc.

I find that mask functions in Hive also recognize lower/upper cases non-ascii characters. As our first milestone (IMPALA-10407) is providing consistent UTF-8 behavior with Hive, I'll update the patch and add more descriptions in the JIRA to make it less confusing. Thanks for pointing this out!


-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 09 Sep 2021 00:45:04 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9454/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 14 Sep 2021 07:06:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17780

to look at the new patch set (#6).

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................

IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Mask functions are used in Ranger column masking policies to mask
sensitive data. There are 5 mask functions: mask(), mask_first_n(),
mask_last_n(), mask_show_first_n(), mask_show_last_n(). Take mask() as
an example, by default, it will mask uppercase to 'X', lowercase to 'x',
digits to 'n' and leave other characters unmasked. For masking all
characters to '*', we can use
  mask(my_col, '*', '*', '*', '*');
The current implementations mask strings byte-to-byte, which have
inconsistent results with Hive when the string contains unicode
characters:
  mask('中国', '*', '*', '*', '*') => '******'
Each Chinese character is encoded into 3 bytes in UTF-8 so we get the
above result. The result in Hive is '**' since there are two Chinese
characters.

This patch provides consistent masking behavior with Hive for
strings under the UTF-8 mode, i.e., set UTF8_MODE=true. In UTF-8 mode,
the masked unit of a string is a unicode code point.

Implementation
 - Extends the existing MaskTransform function to deal with unicode code
   points(represented by uint32_t).
 - Extends the existing GetFirstChar function to get the code point of
   given masked charactors in UTF-8 mode.
 - Implement a MaskSubStrUtf8 method as the core functionality.
 - Swith to use MaskSubStrUtf8 instead of MaskSubStr in UTF-8 mode.
 - For better testing, this patch also adds an overload for all mask
   functions for only masking other chars but keeping the
   upper/lower/digit chars unmasked. E.g. mask({col}, -1, -1, -1, 'X').

Tests
 - Add BE tests in expr-test
 - Add e2e tests in utf8-string-functions.test

Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
---
M CMakeLists.txt
M be/src/exprs/expr-test.cc
M be/src/exprs/mask-functions-ir.cc
M be/src/exprs/mask-functions.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
6 files changed, 364 insertions(+), 54 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/17780/6
-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................

IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Mask functions are used in Ranger column masking policies to mask
sensitive data. There are 5 mask functions: mask(), mask_first_n(),
mask_last_n(), mask_show_first_n(), mask_show_last_n(). Take mask() as
an example, by default, it will mask uppercase to 'X', lowercase to 'x',
digits to 'n' and leave other characters unmasked. For masking all
characters to '*', we can use
  mask(my_col, '*', '*', '*', '*');
The current implementations mask strings byte-to-byte, which have
inconsistent results with Hive when the string contains unicode
characters:
  mask('中国', '*', '*', '*', '*') => '******'
Each Chinese character is encoded into 3 bytes in UTF-8 so we get the
above result. The result in Hive is '**' since there are two Chinese
characters.

This patch provides consistent masking behavior with Hive for
strings under the UTF-8 mode, i.e., set UTF8_MODE=true. In UTF-8 mode,
the masked unit of a string is a unicode code point.

Implementation
 - Extends the existing MaskTransform function to deal with unicode code
   points(represented by uint32_t).
 - Extends the existing GetFirstChar function to get the code point of
   given masked charactors in UTF-8 mode.
 - Implement a MaskSubStrUtf8 method as the core functionality.
 - Swith to use MaskSubStrUtf8 instead of MaskSubStr in UTF-8 mode.
 - For better testing, this patch also adds an overload for all mask
   functions for only masking other chars but keeping the
   upper/lower/digit chars unmasked. E.g. mask({col}, -1, -1, -1, 'X').

Tests
 - Add BE tests in expr-test
 - Add e2e tests in utf8-string-functions.test

Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Reviewed-on: http://gerrit.cloudera.org:8080/17780
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M CMakeLists.txt
M be/src/exprs/expr-test.cc
M be/src/exprs/mask-functions-ir.cc
M be/src/exprs/mask-functions.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
6 files changed, 364 insertions(+), 54 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................


Patch Set 4:

Yes, masking Unicode character implies that we change any lower case unicode characters to upper ones or vice versa for a UTF8 string. 

My impression that this patch we only change any lower case Ascii characters to upper ones or vice versa for a UTF8 string.


-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Sun, 05 Sep 2021 14:15:57 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9307/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 17 Aug 2021 05:46:37 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................


Patch Set 6: Code-Review+2

Looks great. Thanks a lot for enhancing MaskTransform(). 

The use of std::locale is a very good choice. It helps to save huge effort to handle other NLP aspects such as collate.


-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 14 Sep 2021 17:47:03 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................


Patch Set 6:

(1 comment)

Updated the patch to mask upper/lower cases in non-ascii characters.

http://gerrit.cloudera.org:8080/#/c/17780/5/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/17780/5/be/src/exprs/expr-test.cc@10680
PS5, Line 10680:   TestStringValue(
> line too long (101 > 90)
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 14 Sep 2021 06:45:58 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................


Patch Set 6: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7472/


-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 14 Sep 2021 22:24:07 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17780

to look at the new patch set (#3).

Change subject: IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions
......................................................................

IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions

This patch provides consistent masking behavior with Hive's on UTF-8
strings by turning on the UTF-8 mode, i.e. set UTF8_MODE=true. In UTF-8
mode, the masked unit of a string is a unicode code point.

Implementation
 - Extends the existing MaskTransform function to deal with unicode code
   points(represented by uint32_t).
 - Extends the existing GetFirstChar function to get the code point of
   given masked charactors in UTF-8 mode.
 - Implement a MaskSubStrUtf8 method as the core functionality.
 - Swith to use MaskSubStrUtf8 instead of MaskSubStr in UTF-8 mode.
 - For better testing, this patch also adds an overload for all mask
   functions for only masking other chars but keeping the
   upper/lower/digit chars unmasked. E.g. mask({col}, -1, -1, -1, 'X').

Tests
 - Add BE tests in expr-test
 - Add e2e tests in utf8-string-functions.test

Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
---
M be/src/exprs/expr-test.cc
M be/src/exprs/mask-functions-ir.cc
M be/src/exprs/mask-functions.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
5 files changed, 306 insertions(+), 49 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/17780/3
-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17780

to look at the new patch set (#4).

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................

IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

This patch provides consistent masking behavior with Hive for
strings under the UTF-8 mode, i.e., set UTF8_MODE=true. In UTF-8 mode,
the masked unit of a string is a unicode code point.

Implementation
 - Extends the existing MaskTransform function to deal with unicode code
   points(represented by uint32_t).
 - Extends the existing GetFirstChar function to get the code point of
   given masked charactors in UTF-8 mode.
 - Implement a MaskSubStrUtf8 method as the core functionality.
 - Swith to use MaskSubStrUtf8 instead of MaskSubStr in UTF-8 mode.
 - For better testing, this patch also adds an overload for all mask
   functions for only masking other chars but keeping the
   upper/lower/digit chars unmasked. E.g. mask({col}, -1, -1, -1, 'X').

Tests
 - Add BE tests in expr-test
 - Add e2e tests in utf8-string-functions.test

Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
---
M be/src/exprs/expr-test.cc
M be/src/exprs/mask-functions-ir.cc
M be/src/exprs/mask-functions.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
5 files changed, 306 insertions(+), 49 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/17780/4
-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17780

to look at the new patch set (#5).

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................

IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Mask functions are used in Ranger column masking policies to mask
sensitive data. There are 5 mask functions: mask(), mask_first_n(),
mask_last_n(), mask_show_first_n(), mask_show_last_n(). Take mask() as
an example, by default, it will mask uppercase to 'X', lowercase to 'x',
digits to 'n' and leave other characters unmasked. For masking all
characters to '*', we can use
  mask(my_col, '*', '*', '*', '*');
The current implementations mask strings byte-to-byte, which have
inconsistent results with Hive when the string contains unicode
characters:
  mask('中国', '*', '*', '*', '*') => '******'
Each Chinese character is encoded into 3 bytes in UTF-8 so we get the
above result. The result in Hive is '**' since there are two Chinese
characters.

This patch provides consistent masking behavior with Hive for
strings under the UTF-8 mode, i.e., set UTF8_MODE=true. In UTF-8 mode,
the masked unit of a string is a unicode code point.

Implementation
 - Extends the existing MaskTransform function to deal with unicode code
   points(represented by uint32_t).
 - Extends the existing GetFirstChar function to get the code point of
   given masked charactors in UTF-8 mode.
 - Implement a MaskSubStrUtf8 method as the core functionality.
 - Swith to use MaskSubStrUtf8 instead of MaskSubStr in UTF-8 mode.
 - For better testing, this patch also adds an overload for all mask
   functions for only masking other chars but keeping the
   upper/lower/digit chars unmasked. E.g. mask({col}, -1, -1, -1, 'X').

Tests
 - Add BE tests in expr-test
 - Add e2e tests in utf8-string-functions.test

Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
---
M CMakeLists.txt
M be/src/exprs/expr-test.cc
M be/src/exprs/mask-functions-ir.cc
M be/src/exprs/mask-functions.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
6 files changed, 363 insertions(+), 54 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/17780/5
-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9397/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 30 Aug 2021 07:01:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................


Patch Set 6:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7472/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 14 Sep 2021 17:48:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................


Patch Set 7: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 14 Sep 2021 22:52:35 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................


Patch Set 7:

> Patch Set 6: Verified-1
> 
> Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7472/

The failure is unrelated: IMPALA-10316. I just restart the GVO job.
Thank Qifan's review!


-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 14 Sep 2021 23:22:13 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17780

to look at the new patch set (#2).

Change subject: IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions
......................................................................

IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions

This patch provides consistent masking behavior with Hive's on UTF-8
strings by turning on the UTF-8 mode, i.e. set UTF8_MODE=true. In UTF-8
mode, the masked unit of a string is a unicode code point.

Implementation
 - Extends the existing MaskTransform function to deal with unicode code
   points(represented by uint32_t).
 - Extends the existing GetFirstChar function to get the code point of
   given masked charactors in UTF-8 mode.
 - Implement a MaskSubStrUtf8 method as the core functionality.
 - Swith to use MaskSubStrUtf8 instead of MaskSubStr in UTF-8 mode.
 - For better testing, this patch also adds an overload for all mask
   functions for only masking other chars but keeping the
   upper/lower/digit chars unmasked. E.g. mask({col}, -1, -1, -1, 'X').

Tests
 - Add BE tests in expr-test
 - Add e2e tests in utf8-string-functions.test

Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
---
M be/src/exprs/expr-test.cc
M be/src/exprs/mask-functions-ir.cc
M be/src/exprs/mask-functions.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
5 files changed, 305 insertions(+), 49 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/17780/2
-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................


Patch Set 4:

> Patch Set 4:
> 
> Yes, masking Unicode character implies that we change any lower case unicode characters to upper ones or vice versa for a UTF8 string. 
> 
> My impression that this patch we only change any lower case Ascii characters to upper ones or vice versa for a UTF8 string.

This patch is not for case conversion(upper/lower)... Mask functions is another topic. Please see the examples in IMPALA-9662 or you can run some examples (using mask(), mask_first_n(), mask_show_first_n(), etc.) in Hive.


-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 07 Sep 2021 00:56:13 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................


Patch Set 4:

My understanding is that the following function will be used to return the masted character for the ASCII upper case or lower case. The concern is whether it should be enhanced to deal with Unicode upper case or lower case characters. For example, 00c0 is the latin upper case letter of A with grave. 

static inline uint32_t MaskTransform(uint32_t val, int masked_upper_char,
    int masked_lower_char, int masked_digit_char, int masked_other_char) {
  if ('A' <= val && val <= 'Z') {
    if (masked_upper_char == UNMASKED_VAL) return val;
    return masked_upper_char;
  }
  if ('a' <= val && val <= 'z') {
    if (masked_lower_char == UNMASKED_VAL) return val;
    return masked_lower_char;
  }
  if ('0' <= val && val <= '9') {
    if (masked_digit_char == UNMASKED_VAL) return val;
    return masked_digit_char;
  }
  if (masked_other_char == UNMASKED_VAL) return val;
  return masked_other_char;
}


-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 07 Sep 2021 14:54:29 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17780/5/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/17780/5/be/src/exprs/expr-test.cc@10680
PS5, Line 10680:   TestStringValue("mask('French àâæçéèêëïîôœùûüÿ ÀÂÆÇÉÈÊËÏÎÔŒÙÛÜŸ')",
line too long (101 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 14 Sep 2021 06:42:09 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................


Patch Set 7: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Wed, 15 Sep 2021 05:04:06 +0000
Gerrit-HasComments: No