You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org> on 2021/07/12 12:27:19 UTC

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 5:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util-test.cc@155
PS5, Line 155:   EXPECT_EQ(10, FindUtf8PosForward("李小龙 \x93\x93 ", 10, 6));
Can you add a test for the "all malformed" case here and for FindUtf8PosBackward too?


http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@115
PS5, Line 115:   while (pos >= 0 && index >= 0) {
The loop seems correct to me, bit it was a bit hard for me to understand it. Some optional ideas to make it easier to read:
- replace the "break"s with "return"s. In this case the final return return could be return -1; + a DCHECK_EQ(pos, -1) could be added
- at line 119 add a separate "return" block for the pos<0 case


http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@115
PS5, Line 115: index >= 0
Is index >= 0 needed? We assume it to be >= 0 at the start and break the loop at line 127 if it becomes negative. If the intention is to defend from incorrect input in release mode, then I would prefer to return -1 if it index is negative.


http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@120
PS5, Line 120:     int malformed_bytes = last_pos - pos - bytes_len;
How should we handle the case when there is a valid utf8 start byte, but not enough characters after it since last_pos? This would have the weird effect of increasing index at line 127.



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 12 Jul 2021 12:27:19 +0000
Gerrit-HasComments: Yes