You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Quanlong Huang (Code Review)" <ge...@cloudera.org> on 2021/06/11 06:40:47 UTC

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Quanlong Huang has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17580


Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................

IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Similar to the previous patch, this patch adds UTF-8 support in instr()
and locate() builtin functions so they can have consistent behaviors
with Hive's. These two string functions both have an optional argument
as position:
INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
LOCATE(STRING substr, STRING str[, INT pos])
Their return values are positions of the matched substring.

In UTF-8 mode (turned on by set UTF8_MODE=true), these positions are
counted by UTF-8 characters instead of bytes.

Tests:
 - Add BE unit tests and e2e tests

Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/util/CMakeLists.txt
M be/src/util/bit-util.h
M be/src/util/string-util-test.cc
M be/src/util/string-util.cc
M be/src/util/string-util.h
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
8 files changed, 235 insertions(+), 24 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/17580/1
-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 13: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 20 Jul 2021 00:13:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 9:

The following loop can be further optimized because 2nd, 3rd or 4th byte in a UTF8 character starts with 0x10. 

int find_one_CHAR_backwards(int pos) {
  int last_pos = pos;
  for (int i=0; i<4; i++) {
    if (BitUtil::IsUtf8StartByte(ptr[pos])) return pos;
    pos--;
    if (pos<0) break;
  }
  // non UTF8 character
  return last_pos;
}


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 15 Jul 2021 15:06:59 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 12: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7313/


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 19 Jul 2021 19:00:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 13: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7317/


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 20 Jul 2021 06:26:15 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 1:

(16 comments)

Looks good.

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@273
PS1, Line 273:  if (BitUtil::IsUtf8StartByte(ptr[i])) ++cnt;
nit. A better algorithm would be to use utf8CodePointLen() to skip non-starting bytes.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@673
PS1, Line 673:     int search_start_pos = start_position.val - 1;
             :     if (utf8_mode) {
             :       search_start_pos = FindUtf8Pos(str.ptr, str.len, search_start_pos);
             :     }
nit. may combine into one expression to avoid extra computation when in UTF8 mode:

int serach_start_pos = (utf8_mode) ? FindUtf8Pos(str.ptr, str.len, search_start_pos) : start_position.val - 1;


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@688
PS1, Line 688:  int search_start_pos = haystack.len + start_position.val;
             :     if (utf8_mode) {
             :       search_start_pos = FindLastUtf8Pos(str.ptr, str.len, -start_position.val);
             :     }
nit. same as above (to combine).


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@708
PS1, Line 708: if (!utf8_mode) return IntVal(match_pos + 1);
nit. to combine.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@710
PS1, Line 710: chars (code points)
nit. Unicode characters in UTF8 encoding.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/CMakeLists.txt
File be/src/util/CMakeLists.txt:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/CMakeLists.txt@221
PS1, Line 221: ADD_UNIFIED_BE_LSAN_TEST(string-util-test "TruncateDownTest.*:TruncateUpTest.*:CommaSeparatedContainsTest.*:FindUtf8PosTest.*:FindLastUtf8PosTest.*")
Good to know!


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h
File be/src/util/bit-util.h:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h@132
PS1, Line 132: Utf8CodePointLen
nit. Probably should be named as UTF8EncodeLen() as it returns the length of a Unicode character in UTF8 encoding. Code point normally implies the logical value of a Unicode character.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc@127
PS1, Line 127: EXPECT_EQ(10, FindUtf8Pos("李小龙 \x93 ", 12, 4));
nit. Add a new test case FindUtf8Pos("??? \x93 ", 12, 6)


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc@128
PS1, Line 128: }
nit. Add a couple test cases with a combination of UTF8 characters in 1, 2, and 4 bytes.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc@144
PS1, Line 144: }
nit. Add a couple test cases with a combination of UTF8 characters in 1,2 and 4 bytes.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h
File be/src/util/string-util.h:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@56
PS1, Line 56: /// position of it. If there are not enough characters, return 'str_len'. E.g.
nit. May need to describe 'index' here as follows.

The input string 'str_ptr' is viewed logically as sequence of Unicode characters encoded in UTF8. 'index' is the index (0-based) of Unicode character whose byte position in 'str_ptr' is to be found.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@60
PS1, Line 60: ///  - If the string is "你好" and index > 1, returns 6.
May add one more example.

"??Bob" and index is 4.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@63
PS1, Line 63: FindUtf8Pos
nit. Maybe named as FindUTF8PosForwared().


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@67
PS1, Line 67: 'index' must be positive and 1
nit. Making the index 0-based is consistent with the above function.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@75
PS1, Line 75: FindLastUtf8Pos
nit. Maybe named as FindUTF8PosBackward()


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.cc@99
PS1, Line 99: pos += BitUtil::Utf8CodePointLen(ptr[pos]);
We may need to call IsUtf8StartByte() first to handle non-UTF8 character situation.



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 21 Jun 2021 15:49:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 8: Code-Review+2

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17580/8/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/8/be/src/util/string-util-test.cc@249
PS8, Line 249: 0
nit: I am not sure that having such large things is that useful. This actually lowers the chance of being close to the edges.


http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@120
PS5, Line 120:       // Can't find any legal characters. Count each 
> Nice catching! So in this case, malformed_bytes < 0. We can handle it as 0,
yes, handling it as 0 makes sense to me



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 15 Jul 2021 10:10:24 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 9:

(3 comments)

Looks great!

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc@273
PS9, Line 273: if (BitUtil::IsUtf8StartByte(ptr[i])) ++cnt
nit. Performance-wise, I wonder if we can skip all its #bytes when the first byte of a UTF8 char is recognized. 

 for (int i = 0; i < len; ) {
    if (BitUtil::IsUtf8StartByte(ptr[i])) {
      ++cnt; 
      i += #uTF8bytes(ptr[i])
    } eles {
       i++;
    }


http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.h
File be/src/util/string-util.h:

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.h@79
PS9, Line 79: the last second
nit. second last


http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc@116
PS9, Line 116: while (pos >= 0) {
             :     // Point to the start byte of the last character.
             :     while (!BitUtil::IsUtf8StartByte(ptr[pos]) && pos >= 0) --pos;
             :     if (pos < 0) {
             :       // Can't find any legal characters. Count each byte from last_pos as one character.
             :       // Note that index is 0-based.
             :       if (index < last_pos) return last_pos - index - 1;
             :       return -1;
             :     }
             :     // Get bytes length of the located character.
             :     int bytes_len = BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
             :     // If there are not enough bytes after the first byte, i.e. last_pos-pos < bytes_len,
             :     // we consider the bytes belong to a malformed character, and count them as one
             :     // character.
             :     int malformed_bytes = max(last_pos - pos - bytes_len, 0);
             :     if (index < malformed_bytes) {
             :       // Count each malformed bytes as one character.
             :       return last_pos - index - 1;
             :     }
             :     // We found a legal character and 'malformed_bytes' malformed characters.
             :     // At this point, index >= malformed_bytes. So the lowest value of the updated index
             :     // is -1, which means 'pos' points at what we want.
             :     index -= malformed_bytes + 1;
             :     if (index < 0) return pos;
             :     last_pos = pos;
             :     --pos;
             :   }
nit. Compare to FindUtf8PosForward(), this loop looks complex, partially due to the first inner loop at line 118 that can over-count # of illegal characters beyond the desired number ('index'). 

I think if we can do something like FindUtf8PosForward(), this function can be simplified a little bit. 

The BNF grammar for the parsing work is 

list of <CHAR>

<CHAR> := <UTF8_CHAR> | <ILLEGAL_CHAR>

So the loop can be modeled as follows.

while (index > 0 && pos >= 0) {
  pos = find_one_CHAR_backwards(pos)-1;
  index--
}

// Find one CHAR in the range [pos-3, pos] and
// return the position of the first byte of the
// character.
int find_one_CHAR_backwards(int pos) {
  int last_pos = pos;
  for (int i=0; i<4; i++) {
    if (BitUtil::IsUtf8StartByte(ptr[pos])) return pos;
    pos--;
    if (pos<0) break;
  }
  // non UTF8 character
  return last_pos;
}



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 15 Jul 2021 14:59:04 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 12:

> Patch Set 10: Code-Review+2
> 
> Sure, resolving illegal UTF8 can be postponed to IMPALA-10761, where I hope that we can resolve it better. 
> 
> My concern is the complexity added to deal with such characters for the entire UTF8 feature. On paper, such complexity can be reduced/managed as follows. 
> 
> 1. A common place to check validity of UTF8 characters and raise error if necessary;
> 2. New UTF8 functions that only deal with UTF8 strings.
> 
> It is possible that we can do this in FE where a non-trusted source S as an input to a UTF8 func F() is translated to F(CHECK(S)), where CHECK() implements step 1).

Thank Qifan! Yeah, this sounds a much better solution.

I updated the header comments. The verify failure is due to Jenkins certificate expired. Let's wait for it come back to normal.


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 19 Jul 2021 02:33:00 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9095/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 15 Jul 2021 09:15:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 9:

(2 comments)

Thank Qifan's suggestion for simplifying the codes. Replied the comments.

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.h
File be/src/util/string-util.h:

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.h@79
PS9, Line 79: the last second
> nit. second last
Done


http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc@116
PS9, Line 116: while (pos >= 0) {
             :     // Point to the start byte of the last character.
             :     while (!BitUtil::IsUtf8StartByte(ptr[pos]) && pos >= 0) --pos;
             :     if (pos < 0) {
             :       // Can't find any legal characters. Count each byte from last_pos as one character.
             :       // Note that index is 0-based.
             :       if (index < last_pos) return last_pos - index - 1;
             :       return -1;
             :     }
             :     // Get bytes length of the located character.
             :     int bytes_len = BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
             :     // If there are not enough bytes after the first byte, i.e. last_pos-pos < bytes_len,
             :     // we consider the bytes belong to a malformed character, and count them as one
             :     // character.
             :     int malformed_bytes = max(last_pos - pos - bytes_len, 0);
             :     if (index < malformed_bytes) {
             :       // Count each malformed bytes as one character.
             :       return last_pos - index - 1;
             :     }
             :     // We found a legal character and 'malformed_bytes' malformed characters.
             :     // At this point, index >= malformed_bytes. So the lowest value of the updated index
             :     // is -1, which means 'pos' points at what we want.
             :     index -= malformed_bytes + 1;
             :     if (index < 0) return pos;
             :     last_pos = pos;
             :     --pos;
             :   }
> nit. Compare to FindUtf8PosForward(), this loop looks complex, partially du
Yeah, the complexity is due to handling "underflow" and "overflow" malformed cases. Here "underflow" means there are less bytes than what the start byte indicates. "overflow" means there are more bytes than what the start byte indicates.

I think the above suggested change is simpler because it only handles the "underflow" cases. Let me give an example of the "overflow" case:

U+00C3 is the Latin Capital Letter "A" with Tilde. It's encoded into 2 bytes in UTF-8: [\xc3 \x83]. (You can print it by "echo -e '\xc3\x83'" in bash). The byte "\x83" will be a malformed UTF-8 byte if given along. It will be replaced by U+FFFD
(REPLACEMENT CHARACTER) in Hive or bash.

So the string, [\xc3 \x83 \x83], will be treated as two characters, i.e. [U+00C3, U+FFFD]. Using the above find_one_CHAR_backwards method, we will locate at the start byte \xc3 and ignore the second \x83.



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Fri, 16 Jul 2021 01:17:12 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 5:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util-test.cc@155
PS5, Line 155:   EXPECT_EQ(10, FindUtf8PosForward("李小龙 \x93\x93 ", 10, 6));
Can you add a test for the "all malformed" case here and for FindUtf8PosBackward too?


http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@115
PS5, Line 115:   while (pos >= 0 && index >= 0) {
The loop seems correct to me, bit it was a bit hard for me to understand it. Some optional ideas to make it easier to read:
- replace the "break"s with "return"s. In this case the final return return could be return -1; + a DCHECK_EQ(pos, -1) could be added
- at line 119 add a separate "return" block for the pos<0 case


http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@115
PS5, Line 115: index >= 0
Is index >= 0 needed? We assume it to be >= 0 at the start and break the loop at line 127 if it becomes negative. If the intention is to defend from incorrect input in release mode, then I would prefer to return -1 if it index is negative.


http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@120
PS5, Line 120:     int malformed_bytes = last_pos - pos - bytes_len;
How should we handle the case when there is a valid utf8 start byte, but not enough characters after it since last_pos? This would have the weird effect of increasing index at line 127.



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 12 Jul 2021 12:27:19 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8980/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Wed, 23 Jun 2021 08:09:05 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 2: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7246/


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 22 Jun 2021 15:41:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 11: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7311/


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 11
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Sun, 18 Jul 2021 18:40:25 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 3:

(4 comments)

Thanks for looking into the test failure! Addressed the comments.

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h
File be/src/util/bit-util.h:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h@132
PS1, Line 132: NumBytesInUTF8En
> Yeah, it is a little bit trickle. I normally do not consider a particular u
OK, change to NumBytesInUTF8Encoding().


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc@134
PS2, Line 134: sizeof(byte_lens) / sizeof(int);
> nit. A general version would be to find the number of elements in the array
Sorry, do you mean using 24 directly?


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc@99
PS2, Line 99: / Counting malformed UTF8 characters.
            :     while (!BitUtil::IsUtf8Start
> nit. May move this to the .h file.
This explains the following while-loop. Let me mention the error handing in the header file.


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc@118
PS2, Line 118: nt bytes_len = BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
             :     int malformed_bytes = last_pos - pos - bytes_len;
> nit. We probably should check illegal utf8 characters here too.
OK, I was thinking that adding too much complexity here would block future optimizations. So planned to add the error handling in IMPALA-10761. But sure, let's make it right and consistent with FindUtf8PosForward() first.



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Wed, 23 Jun 2021 07:48:31 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 10:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9099/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Fri, 16 Jul 2021 01:44:01 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 10:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7305/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Fri, 16 Jul 2021 06:27:19 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 14: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 14
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 20 Jul 2021 07:14:19 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8966/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 22 Jun 2021 07:16:13 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 12:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9111/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 19 Jul 2021 02:58:10 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17580/4/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/4/be/src/util/string-util-test.cc@150
PS4, Line 150:   // Here we just need 4 characters, i.e. "李小龙 " in byte length 10. Set 'str_len' to 10
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/17580/4/be/src/util/string-util-test.cc@199
PS4, Line 199:   // Here we just need 4 characters, i.e. "李小龙 " in byte length 10. Set 'str_len' to 10
line too long (93 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 24 Jun 2021 07:11:45 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17580/8/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/8/be/src/util/string-util-test.cc@249
PS8, Line 249: ;
> nit: I am not sure that having such large things is that useful. This actua
Ah yeah, makes sense. Changed it to 100.



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 15 Jul 2021 13:38:49 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8894/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 11 Jun 2021 07:01:46 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 10: Code-Review+2

Sure, resolving illegal UTF8 can be postponed to IMPALA-10761, where I hope that we can resolve it better. 

My concern is the complexity added to deal with such characters for the entire UTF8 feature. On paper, such complexity can be reduced/managed as follows. 

1. A common place to check validity of UTF8 characters and raise error if necessary;
2. New UTF8 functions that only deal with UTF8 strings.

It is possible that we can do this in FE where a non-trusted source S as an input to a UTF8 func F() is translated to F(CHECK(S)), where CHECK() implements step 1).


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Sun, 18 Jul 2021 12:27:38 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 10: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Fri, 16 Jul 2021 09:53:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 13:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7317/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 20 Jul 2021 00:13:57 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 9:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9097/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 15 Jul 2021 14:00:33 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9093/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Wed, 14 Jul 2021 14:44:05 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 11: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 11
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Sun, 18 Jul 2021 12:29:06 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17580

to look at the new patch set (#12).

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................

IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Similar to the previous patch, this patch adds UTF-8 support in instr()
and locate() builtin functions so they can have consistent behaviors
with Hive's. These two string functions both have an optional argument
as position:
INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
LOCATE(STRING substr, STRING str[, INT pos])
Their return values are positions of the matched substring.

In UTF-8 mode (turned on by set UTF8_MODE=true), these positions are
counted by UTF-8 characters instead of bytes.

Error handling:
Malformed UTF-8 characters are counted as one byte per character. This
is consistent with Hive since Hive replaces those bytes to U+FFFD
(REPLACEMENT CHARACTER). E.g. GenericUDFInstr calls Text#toString(),
which performs the replacement. We can provide more behaviors on error
handling like ignoring them or reporting errors. IMPALA-10761 will focus
on this.

Tests:
 - Add BE unit tests and e2e tests
 - Add random tests to make sure malformed UTF-8 characters won't crash
   us.

Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/util/CMakeLists.txt
M be/src/util/bit-util.h
M be/src/util/string-util-test.cc
M be/src/util/string-util.cc
M be/src/util/string-util.h
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
8 files changed, 410 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/17580/12
-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 14: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 14
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 20 Jul 2021 13:28:29 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 10:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc@273
PS9, Line 273: if (BitUtil::IsUtf8StartByte(ptr[i])) ++cnt
> As we discussed in https://gerrit.cloudera.org/c/17580/5/be/src/exprs/strin
See my other comment. In the worst case of malformed characters, we should also make sure the 2nd, 3rd, and 4th byte start with 0x10.


http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc@116
PS9, Line 116: while (pos >= 0) {
             :     // Point to the start byte of the last character.
             :     while (!BitUtil::IsUtf8StartByte(ptr[pos]) && pos >= 0) --pos;
             :     if (pos < 0) {
             :       // Can't find any legal characters. Count each byte from last_pos as one character.
             :       // Note that index is 0-based.
             :       if (index < last_pos) return last_pos - index - 1;
             :       return -1;
             :     }
             :     // Get bytes length of the located character.
             :     int bytes_len = BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
             :     // If there are not enough bytes after the first byte, i.e. last_pos-pos < bytes_len,
             :     // we consider the bytes belong to a malformed character, and count them as one
             :     // character.
             :     int malformed_bytes = max(last_pos - pos - bytes_len, 0);
             :     if (index < malformed_bytes) {
             :       // Count each malformed bytes as one character.
             :       return last_pos - index - 1;
             :     }
             :     // We found a legal character and 'malformed_bytes' malformed characters.
             :     // At this point, index >= malformed_bytes. So the lowest value of the updated index
             :     // is -1, which means 'pos' points at what we want.
             :     index -= malformed_bytes + 1;
             :     if (index < 0) return pos;
             :     last_pos = pos;
             :     --pos;
             :   }
> Yeah, the complexity is due to handling "underflow" and "overflow" malforme
From the comment to this function
"/// Error handling:
/// Malformed UTF8 characters are counted as one byte per character. This is consistent
/// with Hive since Hive replaces those bytes to U+FFFD (REPLACEMENT CHARACTER)."

I have the impression that if in the span of at most 4 bytes, there are no valid utf8 characters, then we count as each byte as a malformed character. 

In the example of [\xc3 \x83 \x83], yes I think you are right about missing the 2nd \x83. This limitation can be remedied by checking the length of the first recognized utf8 char: if it occupies all the way to 'pos', it is valid utf8; otherwise, ptr[pos] is the invalid one. 

Also, it looks like we need to check the 2nd, 3rd or 4th byte after seeing a utf8 start byte, if malformed characters can be present and detected. 

Normally, I think we only need to check malformed characters when a string is from the user/client/risky_source/produced_by_some_SQL_functions and reject the string if so. After that, any utf8 strings can be assumed in utf8 and malformed character checking can be ignored.



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Fri, 16 Jul 2021 14:30:45 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 2:

(8 comments)

Looks very good!

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@673
PS1, Line 673:     // A positive starting position indicates regular searching from the left.
             :     int search_start_pos = start_position.val - 1;
             :     if (utf8_mode) {
             :      
> We always need to calculate 'start_position.val - 1' since it's the last ar
Good point. Done.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h
File be/src/util/bit-util.h:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h@132
PS1, Line 132: Utf8CodePointLen
> I think I read it as Utf8CodePoint-Len but not Utf8-CodePointLen so don't g
Yeah, it is a little bit trickle. I normally do not consider a particular utf8 sequence S of bytes, for one unicode character C, as a code point. See http://tutorials.jenkov.com/unicode/index.html. See also the encoding section in https://en.wikipedia.org/wiki/UTF-8.

The code point of interest, is thus that of C encoded by S. 

Would NumBytesInUTF8Encoding() sound better?


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc@134
PS2, Line 134: sizeof(byte_lens) / sizeof(int);
nit. A general version would be to find the number of elements in the array.


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc@136
PS2, Line 136: for (int i = 0; i < total_chars; ++i) {
             :     EXPECT_EQ(pos, FindUtf8PosForward(test_str, total_byte_len, i));
             :     pos += byte_lens[i];
             :   }
Good test case!


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc@179
PS2, Line 179: chars = sizeof(byte_lens) / sizeof(int);
nit. same comment as above.


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc@181
PS2, Line 181: for (int i = 0; i < total_chars; ++i) {
             :     pos -= byte_lens[total_chars - i - 1];
             :     EXPECT_EQ(pos, FindUtf8PosBackward(test_str, total_byte_len, i));
             :   }
cool


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc@99
PS2, Line 99: / Counting malformed UTF8 characters. This is consistent with Hive since Hive
            :     // replaces them by default.
nit. May move this to the .h file.


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc@118
PS2, Line 118: / We will get undesired results on malformed UTF8 characters.
             :     // TODO(IMPALA-10761): Add query option to control the error handling.
nit. We probably should check illegal utf8 characters here too.



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 22 Jun 2021 14:50:52 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc@273
PS9, Line 273: if (BitUtil::IsUtf8StartByte(ptr[i])) ++cnt
> nit. Performance-wise, I wonder if we can skip all its #bytes when the firs
As we discussed in https://gerrit.cloudera.org/c/17580/5/be/src/exprs/string-functions-ir.cc#269
this can't handle malformed characters that have less bytes than expected. BTW, for performance, the current version can use SIMD to speed up, i.e. check first bit of all bytes, if all of them are 0, just return the length (we can do this in another JIRA).



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 15 Jul 2021 23:09:57 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7313/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 19 Jul 2021 12:48:30 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................

IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Similar to the previous patch, this patch adds UTF-8 support in instr()
and locate() builtin functions so they can have consistent behaviors
with Hive's. These two string functions both have an optional argument
as position:
INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
LOCATE(STRING substr, STRING str[, INT pos])
Their return values are positions of the matched substring.

In UTF-8 mode (turned on by set UTF8_MODE=true), these positions are
counted by UTF-8 characters instead of bytes.

Error handling:
Malformed UTF-8 characters are counted as one byte per character. This
is consistent with Hive since Hive replaces those bytes to U+FFFD
(REPLACEMENT CHARACTER). E.g. GenericUDFInstr calls Text#toString(),
which performs the replacement. We can provide more behaviors on error
handling like ignoring them or reporting errors. IMPALA-10761 will focus
on this.

Tests:
 - Add BE unit tests and e2e tests
 - Add random tests to make sure malformed UTF-8 characters won't crash
   us.

Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Reviewed-on: http://gerrit.cloudera.org:8080/17580
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/util/CMakeLists.txt
M be/src/util/bit-util.h
M be/src/util/string-util-test.cc
M be/src/util/string-util.cc
M be/src/util/string-util.h
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
8 files changed, 410 insertions(+), 27 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 15
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 2:

The failure in verification test is in min/max filter on sorted columns. A fix (https://gerrit.cloudera.org/#/c/17618/) is uploaded moments ago.


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 22 Jun 2021 17:41:39 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 13:

> Patch Set 13: Verified-1
> 
> Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7317/

The failed job is https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/14381/
It seems an unrelated failure. I saw the same failures in another unrelated job: https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/14369/

Re-run the job.


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 20 Jul 2021 07:13:34 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17580

to look at the new patch set (#4).

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................

IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Similar to the previous patch, this patch adds UTF-8 support in instr()
and locate() builtin functions so they can have consistent behaviors
with Hive's. These two string functions both have an optional argument
as position:
INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
LOCATE(STRING substr, STRING str[, INT pos])
Their return values are positions of the matched substring.

In UTF-8 mode (turned on by set UTF8_MODE=true), these positions are
counted by UTF-8 characters instead of bytes.

Error handling:
Malformed UTF-8 characters are counted as one byte per character. This
is consistent with Hive since Hive replaces those bytes to U+FFFD
(REPLACEMENT CHARACTER). E.g. GenericUDFInstr calls Text#toString(),
which performs the replacement. We can provide more behaviors on error
handling like ignoring them or reporting errors. IMPALA-10761 will focus
on this.

Tests:
 - Add BE unit tests and e2e tests

Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/util/CMakeLists.txt
M be/src/util/bit-util.h
M be/src/util/string-util-test.cc
M be/src/util/string-util.cc
M be/src/util/string-util.h
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
8 files changed, 334 insertions(+), 25 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/17580/4
-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17580

to look at the new patch set (#10).

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................

IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Similar to the previous patch, this patch adds UTF-8 support in instr()
and locate() builtin functions so they can have consistent behaviors
with Hive's. These two string functions both have an optional argument
as position:
INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
LOCATE(STRING substr, STRING str[, INT pos])
Their return values are positions of the matched substring.

In UTF-8 mode (turned on by set UTF8_MODE=true), these positions are
counted by UTF-8 characters instead of bytes.

Error handling:
Malformed UTF-8 characters are counted as one byte per character. This
is consistent with Hive since Hive replaces those bytes to U+FFFD
(REPLACEMENT CHARACTER). E.g. GenericUDFInstr calls Text#toString(),
which performs the replacement. We can provide more behaviors on error
handling like ignoring them or reporting errors. IMPALA-10761 will focus
on this.

Tests:
 - Add BE unit tests and e2e tests
 - Add random tests to make sure malformed UTF-8 characters won't crash
   us.

Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/util/CMakeLists.txt
M be/src/util/bit-util.h
M be/src/util/string-util-test.cc
M be/src/util/string-util.cc
M be/src/util/string-util.h
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
8 files changed, 408 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/17580/10
-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 2:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7246/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 22 Jun 2021 09:38:07 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util-test.cc@150
PS3, Line 150:   // Here we just need 4 characters, i.e. "李小龙 " in byte length 10. Set 'str_len' to 10
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util-test.cc@199
PS3, Line 199:   // Here we just need 4 characters, i.e. "李小龙 " in byte length 10. Set 'str_len' to 10
line too long (93 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Wed, 23 Jun 2021 07:48:20 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 3:

(6 comments)

Looks very good! Thanks a lot.

http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc@134
PS2, Line 134: sizeof(byte_lens) / sizeof(int);
> Sorry, do you mean using 24 directly?
Never mind. sizeof(byte_lens) / sizeof(int) sounds fine. 

Done.


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc@118
PS2, Line 118: nt bytes_len = BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
             :     int malformed_bytes = last_pos - pos - bytes_len;
> OK, I was thinking that adding too much complexity here would block future 
Sounds like a good idea. Please refer to new comments in this method.


http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@102
PS3, Line 102: --index;
I wonder if we can remove --index here, since index is for the ith Unicode character in UTF8, right?

That is to say, we treat illegal UTF8 characters as noise and ignore them.


http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@118
PS3, Line 118:     int bytes_len = BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
Above line 118, we need to do the following to guard the case that a prefix of 'ptr' is not in UTF8.

if (pos < 0) break;


http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@120
PS3, Line 120:  if (index < malformed_bytes) {
If index counts only the legal UTF8 characters, it probably should not be compared with malformed_bytes.


http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@125
PS3, Line 125: ndex -= malformed_bytes + 1;
same comment as above.



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Wed, 23 Jun 2021 12:48:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17580

to look at the new patch set (#7).

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................

IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Similar to the previous patch, this patch adds UTF-8 support in instr()
and locate() builtin functions so they can have consistent behaviors
with Hive's. These two string functions both have an optional argument
as position:
INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
LOCATE(STRING substr, STRING str[, INT pos])
Their return values are positions of the matched substring.

In UTF-8 mode (turned on by set UTF8_MODE=true), these positions are
counted by UTF-8 characters instead of bytes.

Error handling:
Malformed UTF-8 characters are counted as one byte per character. This
is consistent with Hive since Hive replaces those bytes to U+FFFD
(REPLACEMENT CHARACTER). E.g. GenericUDFInstr calls Text#toString(),
which performs the replacement. We can provide more behaviors on error
handling like ignoring them or reporting errors. IMPALA-10761 will focus
on this.

Tests:
 - Add BE unit tests and e2e tests

Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/util/CMakeLists.txt
M be/src/util/bit-util.h
M be/src/util/string-util-test.cc
M be/src/util/string-util.cc
M be/src/util/string-util.h
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
8 files changed, 346 insertions(+), 25 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/17580/7
-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 14:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7321/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 14
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 20 Jul 2021 07:14:20 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17580

to look at the new patch set (#9).

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................

IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Similar to the previous patch, this patch adds UTF-8 support in instr()
and locate() builtin functions so they can have consistent behaviors
with Hive's. These two string functions both have an optional argument
as position:
INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
LOCATE(STRING substr, STRING str[, INT pos])
Their return values are positions of the matched substring.

In UTF-8 mode (turned on by set UTF8_MODE=true), these positions are
counted by UTF-8 characters instead of bytes.

Error handling:
Malformed UTF-8 characters are counted as one byte per character. This
is consistent with Hive since Hive replaces those bytes to U+FFFD
(REPLACEMENT CHARACTER). E.g. GenericUDFInstr calls Text#toString(),
which performs the replacement. We can provide more behaviors on error
handling like ignoring them or reporting errors. IMPALA-10761 will focus
on this.

Tests:
 - Add BE unit tests and e2e tests
 - Add random tests to make sure malformed UTF-8 characters won't crash
   us.

Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/util/CMakeLists.txt
M be/src/util/bit-util.h
M be/src/util/string-util-test.cc
M be/src/util/string-util.cc
M be/src/util/string-util.h
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
8 files changed, 408 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/17580/9
-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8995/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 24 Jun 2021 07:34:43 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 11:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7311/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 11
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Sun, 18 Jul 2021 12:29:07 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 5: Code-Review+1

(1 comment)

Looks great!

http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@102
PS3, Line 102: --index;
> Sorry that we need to be consistent with Hive that replaces all illegal UTF
Okay. Thanks for adding the comment.



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 24 Jun 2021 12:51:59 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 7:

My only expectation about malformed strings is that we shouldn't crash/hang. It would be great to return the same results as Hive, but this may turn out to be complex/inherently slow.

A way to catch unlucky cases that lead to crash would be to add a test that passes random strings (with malformed bytes included).


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Wed, 14 Jul 2021 17:00:01 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 8:

(3 comments)

> Patch Set 7:
> 
> My only expectation about malformed strings is that we shouldn't crash/hang. It would be great to return the same results as Hive, but this may turn out to be complex/inherently slow.
> 
> A way to catch unlucky cases that lead to crash would be to add a test that passes random strings (with malformed bytes included).

Agreed. Added more tests including random test.

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/exprs/string-functions-ir.cc@269
PS5, Line 269: CountUtf8Chars
> This can return different values compared to the original implementation of
Good question. I did an experiment in Hive and it shows up results of our old behavior. I'll rollback this change due to this. CC @Qifan.


http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util-test.cc@155
PS5, Line 155:   EXPECT_EQ(10, FindUtf8PosForward("李小龙 \x93\x93 ", 10, 5));
> Can you add a test for the "all malformed" case here and for FindUtf8PosBac
Done


http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@105
PS5, Line 105:     pos += BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
> Nice catching! One wrong length will mess up the whole processing since we 
Done in PS8. If there are less bytes than expected, the next character will be messed up with this malformed one, and being counted as one character with this. It would be inconsistent with Hive but I think it's ok for now. I just add a check at the end to deal with pos > len.



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 15 Jul 2021 08:53:44 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17580

to look at the new patch set (#2).

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................

IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Similar to the previous patch, this patch adds UTF-8 support in instr()
and locate() builtin functions so they can have consistent behaviors
with Hive's. These two string functions both have an optional argument
as position:
INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
LOCATE(STRING substr, STRING str[, INT pos])
Their return values are positions of the matched substring.

In UTF-8 mode (turned on by set UTF8_MODE=true), these positions are
counted by UTF-8 characters instead of bytes.

Tests:
 - Add BE unit tests and e2e tests

Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/util/CMakeLists.txt
M be/src/util/bit-util.h
M be/src/util/string-util-test.cc
M be/src/util/string-util.cc
M be/src/util/string-util.h
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
8 files changed, 294 insertions(+), 25 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/17580/2
-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/exprs/string-functions-ir.cc@269
PS5, Line 269: CountUtf8Chars
This can return different values compared to the original implementation of Utf8Length if there are malformed characters - is this intentional? E.g. a two start bytes where NumBytesInUTF8Encoding==2 (without the following bytes) would return 2 in the original but 1 in the new implementation.

I don't know what is the correct implementation in this case, but the original version seems slightly safer.


http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@105
PS5, Line 105:     pos += BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
I realized that issue mentioned in line 120 also exists here: if NumBytesInUTF8Encoding() returns more bytes than there are  in the string, then pos can point out of the array.



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 12 Jul 2021 12:43:48 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17580

to look at the new patch set (#5).

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................

IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Similar to the previous patch, this patch adds UTF-8 support in instr()
and locate() builtin functions so they can have consistent behaviors
with Hive's. These two string functions both have an optional argument
as position:
INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
LOCATE(STRING substr, STRING str[, INT pos])
Their return values are positions of the matched substring.

In UTF-8 mode (turned on by set UTF8_MODE=true), these positions are
counted by UTF-8 characters instead of bytes.

Error handling:
Malformed UTF-8 characters are counted as one byte per character. This
is consistent with Hive since Hive replaces those bytes to U+FFFD
(REPLACEMENT CHARACTER). E.g. GenericUDFInstr calls Text#toString(),
which performs the replacement. We can provide more behaviors on error
handling like ignoring them or reporting errors. IMPALA-10761 will focus
on this.

Tests:
 - Add BE unit tests and e2e tests

Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/util/CMakeLists.txt
M be/src/util/bit-util.h
M be/src/util/string-util-test.cc
M be/src/util/string-util.cc
M be/src/util/string-util.h
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
8 files changed, 334 insertions(+), 25 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/17580/5
-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 12: Code-Review+2

Carrying Qifan and Csaba's +2.


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 20 Jul 2021 00:11:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 7:

(4 comments)

Thank Csaba's feedback and pointing out other corner cases of malformed UTF8 characters! I'm still modifying the patch but would like to post an update in time.

TBO, this patch got more and more complex due to dealing with malformed UTF8 characters. At first, I tend to not doing any error handling and just mentioning that illegal UTF8 characters will have undesired results. Take other systems as examples,
* in Presto, there are no explicit checks for valid UTF-8 and the functions may return incorrect results on invalid UTF-8: https://prestodb.io/docs/0.257/functions/string.html#string-functions
* in ClickHouse, the UTF8 functions also assume that the string contains a set of bytes that make up UTF-8 encoded text. https://clickhouse.tech/docs/en/sql-reference/functions/string-functions/

They both provide special functions for dealing with invalid UTF8, e.g. from_utf8() in Presto and isValidUTF8()/toValidUTF8() in ClickHouse. I think the reason for this approach is for efficiency. Assuming valid UTF-8 characters helps to vectorise the implementation by SIMD instructions.

However, Hive implicitly converts malformed UTF-8 characters. To be consistent with Hive, it seems we need to do the same at many places, which will results in inefficient and complex codes(branches). What do you think? Does it make sense if we choose the same behavior like Presto and ClickHouse that assume wrong results on invalid UTF-8 characters, and leave the consistent work done in IMPALA-10761?

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@105
PS5, Line 105:     pos += BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
> I realized that issue mentioned in line 120 also exists here: if NumBytesIn
Nice catching! One wrong length will mess up the whole processing since we can't jump to the next start byte correctly. I'm still thinking how to fix this in an efficient way.


http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@115
PS5, Line 115:   while (pos >= 0) {
> The loop seems correct to me, bit it was a bit hard for me to understand it
Good points! Done.


http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@115
PS5, Line 115: 
> Is index >= 0 needed? We assume it to be >= 0 at the start and break the lo
Sure. It's added at the first PS that we don't deal with illegal cases. We can remove it now.


http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@120
PS5, Line 120:       // Note that index is 0-based.
> How should we handle the case when there is a valid utf8 start byte, but no
Nice catching! So in this case, malformed_bytes < 0. We can handle it as 0, so the bytes of [pos, last_pos) are considered as one character.



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Wed, 14 Jul 2021 14:23:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8994/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 24 Jun 2021 07:32:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 10: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Fri, 16 Jul 2021 12:42:27 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 10:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc@273
PS9, Line 273: if (BitUtil::IsUtf8StartByte(ptr[i])) ++cnt
> See my other comment. In the worst case of malformed characters, we should 
I agree that there are lot more we can/should do on malformed characters. But they will lead to more inefficient codes, e.g. checking remaining bytes instead of just the start byte, which isn't more performant than the current version.

Can we leave this as it is since it's consistent with Hive? I think IMPALA-10761 is a better JIRA to solve all these corner cases.


http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc@116
PS9, Line 116: while (pos >= 0) {
             :     // Point to the start byte of the last character.
             :     while (!BitUtil::IsUtf8StartByte(ptr[pos]) && pos >= 0) --pos;
             :     if (pos < 0) {
             :       // Can't find any legal characters. Count each byte from last_pos as one character.
             :       // Note that index is 0-based.
             :       if (index < last_pos) return last_pos - index - 1;
             :       return -1;
             :     }
             :     // Get bytes length of the located character.
             :     int bytes_len = BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
             :     // If there are not enough bytes after the first byte, i.e. last_pos-pos < bytes_len,
             :     // we consider the bytes belong to a malformed character, and count them as one
             :     // character.
             :     int malformed_bytes = max(last_pos - pos - bytes_len, 0);
             :     if (index < malformed_bytes) {
             :       // Count each malformed bytes as one character.
             :       return last_pos - index - 1;
             :     }
             :     // We found a legal character and 'malformed_bytes' malformed characters.
             :     // At this point, index >= malformed_bytes. So the lowest value of the updated index
             :     // is -1, which means 'pos' points at what we want.
             :     index -= malformed_bytes + 1;
             :     if (index < 0) return pos;
             :     last_pos = pos;
             :     --pos;
             :   }
> From the comment to this function
Yeah, sorry that the comment is confusing, I think I should change it to

 Bytes that don't belong any UTF8 characters are considered malformed UTF8 characters as one byte per character. This is consistent with Hive since Hive replaces each of those bytes to U+FFFD (REPLACEMENT CHARACTER).

Is it ok for you?

To avoid back-and-forth on dealing with malformed utf8, can we address them in IMPALA-10761?



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Sat, 17 Jul 2021 01:34:54 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 1:

(16 comments)

Thank Qifan's feekback! Addressed the comments.

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@273
PS1, Line 273:  if (BitUtil::IsUtf8StartByte(ptr[i])) ++cnt;
> nit. A better algorithm would be to use utf8CodePointLen() to skip non-star
Done


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@673
PS1, Line 673:     int search_start_pos = start_position.val - 1;
             :     if (utf8_mode) {
             :       search_start_pos = FindUtf8Pos(str.ptr, str.len, search_start_pos);
             :     }
> nit. may combine into one expression to avoid extra computation when in UTF
We always need to calculate 'start_position.val - 1' since it's the last argument for FindUtf8Pos. So I think the current code is more readable.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@688
PS1, Line 688:  int search_start_pos = haystack.len + start_position.val;
             :     if (utf8_mode) {
             :       search_start_pos = FindLastUtf8Pos(str.ptr, str.len, -start_position.val);
             :     }
> nit. same as above (to combine).
Done


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@708
PS1, Line 708: if (!utf8_mode) return IntVal(match_pos + 1);
> nit. to combine.
Done


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@710
PS1, Line 710: chars (code points)
> nit. Unicode characters in UTF8 encoding.
Done


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/CMakeLists.txt
File be/src/util/CMakeLists.txt:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/CMakeLists.txt@221
PS1, Line 221: ADD_UNIFIED_BE_LSAN_TEST(string-util-test "TruncateDownTest.*:TruncateUpTest.*:CommaSeparatedContainsTest.*:FindUtf8PosTest.*:FindLastUtf8PosTest.*")
> Good to know!
Thanks!


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h
File be/src/util/bit-util.h:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h@132
PS1, Line 132: Utf8CodePointLen
> nit. Probably should be named as UTF8EncodeLen() as it returns the length o
I think I read it as Utf8CodePoint-Len but not Utf8-CodePointLen so don't get confused. It seems other systems also use this name, e.g. https://www.freepascal.org/docs-html/rtl/system/utf8codepointlen.html
If we mean the number of code points, the function name will be numUtf8CodePoints. Is it ok for you to keep the original name given we have method comments mentioning byte length above?


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc@127
PS1, Line 127: EXPECT_EQ(10, FindUtf8Pos("李小龙 \x93 ", 12, 4));
> nit. Add a new test case FindUtf8Pos("??? \x93 ", 12, 6)
Done


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc@128
PS1, Line 128: }
> nit. Add a couple test cases with a combination of UTF8 characters in 1, 2,
Good point!


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc@144
PS1, Line 144: }
> nit. Add a couple test cases with a combination of UTF8 characters in 1,2 a
Done


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h
File be/src/util/string-util.h:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@56
PS1, Line 56: /// position of it. If there are not enough characters, return 'str_len'. E.g.
> nit. May need to describe 'index' here as follows.
Done


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@60
PS1, Line 60: ///  - If the string is "你好" and index > 1, returns 6.
> May add one more example.
Done


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@63
PS1, Line 63: FindUtf8Pos
> nit. Maybe named as FindUTF8PosForwared().
Done. Change to FindUtf8PosForward since we use camel case.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@67
PS1, Line 67: 'index' must be positive and 1
> nit. Making the index 0-based is consistent with the above function.
OK, I was thinking negative index previously so just make them positive to ease the use. Change it to 0-based.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@75
PS1, Line 75: FindLastUtf8Pos
> nit. Maybe named as FindUTF8PosBackward()
Done


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.cc@99
PS1, Line 99: pos += BitUtil::Utf8CodePointLen(ptr[pos]);
> We may need to call IsUtf8StartByte() first to handle non-UTF8 character si
Done. I think the better way is doing validation in the boundary. Then we can do this more efficiently. The desired behaviors can be replacing illegal UTF-8 characters with U+FFFD (REPLACEMENT CHARACTER), or just ignoring them, or reporting errors. Filed IMPALA-10761 for this.



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 22 Jun 2021 06:54:29 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 9:

The index argument may need to be the input/output of the function. 

int find_one_CHAR_backwards(int pos)


-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 15 Jul 2021 16:55:50 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17580

to look at the new patch set (#8).

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................

IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Similar to the previous patch, this patch adds UTF-8 support in instr()
and locate() builtin functions so they can have consistent behaviors
with Hive's. These two string functions both have an optional argument
as position:
INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
LOCATE(STRING substr, STRING str[, INT pos])
Their return values are positions of the matched substring.

In UTF-8 mode (turned on by set UTF8_MODE=true), these positions are
counted by UTF-8 characters instead of bytes.

Error handling:
Malformed UTF-8 characters are counted as one byte per character. This
is consistent with Hive since Hive replaces those bytes to U+FFFD
(REPLACEMENT CHARACTER). E.g. GenericUDFInstr calls Text#toString(),
which performs the replacement. We can provide more behaviors on error
handling like ignoring them or reporting errors. IMPALA-10761 will focus
on this.

Tests:
 - Add BE unit tests and e2e tests
 - Add random tests to make sure malformed UTF-8 characters won't crash
   us.

Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/util/CMakeLists.txt
M be/src/util/bit-util.h
M be/src/util/string-util-test.cc
M be/src/util/string-util.cc
M be/src/util/string-util.h
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
8 files changed, 408 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/17580/8
-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 5:

(6 comments)

Thanks for the feedback! Addressed the comments.

http://gerrit.cloudera.org:8080/#/c/17580/4/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/4/be/src/util/string-util-test.cc@150
PS4, Line 150:   // Here we just need 4 characters, i.e. "李小龙 " in byte length 10. Set 'str_len'
> line too long (93 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/17580/4/be/src/util/string-util-test.cc@199
PS4, Line 199:   // Here we just need 4 characters, i.e. "李小龙 " in byte length 10. Set 'str_len'
> line too long (93 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@102
PS3, Line 102: --index;
> I wonder if we can remove --index here, since index is for the ith Unicode 
Sorry that we need to be consistent with Hive that replaces all illegal UTF8 bytes with U+FFFD (REPLACEMENT CHARACTER). I moved the comments to string-util.h based on the previous comment. Past it here:

/// Error handling:
/// Malformed UTF8 characters are counted as one byte per character. This is consistent
/// with Hive since Hive replaces those bytes to U+FFFD (REPLACEMENT CHARACTER).
/// E.g. GenericUDFInstr calls Text#toString(), which performs the replacement.
/// TODO(IMPALA-10761): Add query option to control the error handling.

We can introduce the ignoring behavior in IMPALA-10761. Do you think it's ok?


http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@118
PS3, Line 118:     // Get bytes length of the located character.
> Above line 118, we need to do the following to guard the case that a prefix
Oops! Thanks for catching this! Added test cases to cover this.


http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@120
PS3, Line 120:  int malformed_bytes = last_pos
> If index counts only the legal UTF8 characters, it probably should not be c
Replied above


http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@125
PS3, Line 125: / We found a legal character
> same comment as above.
Replied above



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 24 Jun 2021 07:13:58 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17580

to look at the new patch set (#3).

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................

IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Similar to the previous patch, this patch adds UTF-8 support in instr()
and locate() builtin functions so they can have consistent behaviors
with Hive's. These two string functions both have an optional argument
as position:
INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
LOCATE(STRING substr, STRING str[, INT pos])
Their return values are positions of the matched substring.

In UTF-8 mode (turned on by set UTF8_MODE=true), these positions are
counted by UTF-8 characters instead of bytes.

Tests:
 - Add BE unit tests and e2e tests

Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/util/CMakeLists.txt
M be/src/util/bit-util.h
M be/src/util/string-util-test.cc
M be/src/util/string-util.cc
M be/src/util/string-util.h
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
8 files changed, 329 insertions(+), 25 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/17580/3
-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>