Posted to reviews@impala.apache.org by "Quanlong Huang (Code Review)" <ge...@cloudera.org> on 2020/11/02 02:05:05 UTC

[native-toolchain-CR] IMPALA-10145,IMPALA-10299: Apply unicode decoding bug fixes to thrift-0.11.0

Quanlong Huang has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16688 )


Change subject: IMPALA-10145,IMPALA-10299: Apply unicode decoding bug fixes to thrift-0.11.0
......................................................................

IMPALA-10145,IMPALA-10299: Apply unicode decoding bug fixes to thrift-0.11.0

After we bumped the thrift version that impala-shell depends on to
0.11.0, we hit bugs in decoding malformed UTF-8 characters, which crash
impala-shell or cause it to hang forever. Before the thrift version
bump, impala-shell was able to print incomplete UTF-8 characters as
replacement symbols, e.g.

impala-shell> select substr("引擎", 1, 4);
引�
impala-shell> select unhex("aa");
�

The cause is that thrift changed its internal string representation
from bytes to unicode after 0.10 (THRIFT-3503) to support Python 3,
following the "unicode sandwich" rule -- namely "bytes on the outside,
unicode on the inside, encode/decode at the edges". However, no error
handling method is specified for the decoding step, so we hit decoding
errors. We need the patches from THRIFT-2087 and THRIFT-5303 to improve
its robustness. THRIFT-5303 is enough to resolve the issue we hit, since
we mostly use the _fast_decode code path. THRIFT-2087 is backported as
well in case we use the normal decoding code path somewhere.
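
The decoding behaviour described above can be sketched in plain Python 3.
This is only an illustration of strict versus lenient UTF-8 decoding, not
the actual thrift patch. The helper names decode_strict and decode_lenient
are hypothetical, and it is an assumption that substr(..., 1, 4) returns
the first four bytes of the string.

```python
# Illustration only: plain Python 3, no thrift involved.
data = "引擎".encode("utf-8")   # 6 bytes, 3 per character
truncated = data[:4]            # cuts the second character after its first byte


def decode_strict(raw):
    """Decode with the default 'strict' handler; raises on malformed input."""
    return raw.decode("utf-8")


def decode_lenient(raw):
    """Hypothetical helper: substitute U+FFFD for bytes that cannot be decoded."""
    return raw.decode("utf-8", errors="replace")


try:
    decode_strict(truncated)
    strict_failed = False
except UnicodeDecodeError:
    # This is the kind of error that surfaces when no handler is specified.
    strict_failed = True

print(strict_failed)
print(decode_lenient(truncated))
print(decode_lenient(b"\xaa"))
```

With the "replace" handler, the truncated input decodes to the first
character followed by the U+FFFD replacement character, which matches the
output impala-shell printed before the thrift version bump.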

Tests:
 - Verify the issue is resolved after bumping the impala-shell dependent
   thrift version to 0.11.0-p4.

Change-Id: Id16b04248f2db3033bef3ab26b7ba8205768c9af
---
M buildall.sh
A source/thrift/thrift-0.11.0-patches/0003-THRIFT-2087-Python-compiler-replace-non-utf-8-char-w.patch
A source/thrift/thrift-0.11.0-patches/0004-THRIFT-5303-Fix-missing-error-handling-in-using-PyUn.patch
3 files changed, 55 insertions(+), 1 deletion(-)



  git pull ssh://gerrit.cloudera.org:29418/native-toolchain refs/changes/88/16688/1
-- 
To view, visit http://gerrit.cloudera.org:8080/16688
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Id16b04248f2db3033bef3ab26b7ba8205768c9af
Gerrit-Change-Number: 16688
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>

[native-toolchain-CR] IMPALA-10145,IMPALA-10299: Apply unicode decoding bug fixes to thrift-0.11.0

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16688 )

Change subject: IMPALA-10145,IMPALA-10299: Apply unicode decoding bug fixes to thrift-0.11.0
......................................................................


Patch Set 1: Verified+1



Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id16b04248f2db3033bef3ab26b7ba8205768c9af
Gerrit-Change-Number: 16688
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Wed, 04 Nov 2020 07:23:35 +0000
Gerrit-HasComments: No

[native-toolchain-CR] IMPALA-10145,IMPALA-10299: Apply unicode decoding bug fixes to thrift-0.11.0

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16688 )

Change subject: IMPALA-10145,IMPALA-10299: Apply unicode decoding bug fixes to thrift-0.11.0
......................................................................

IMPALA-10145,IMPALA-10299: Apply unicode decoding bug fixes to thrift-0.11.0

After we bumped the thrift version that impala-shell depends on to
0.11.0, we hit bugs in decoding malformed UTF-8 characters, which crash
impala-shell or cause it to hang forever. Before the thrift version
bump, impala-shell was able to print incomplete UTF-8 characters as
replacement symbols, e.g.

impala-shell> select substr("引擎", 1, 4);
引�
impala-shell> select unhex("aa");
�

The cause is that thrift changed its internal string representation
from bytes to unicode after 0.10 (THRIFT-3503) to support Python 3,
following the "unicode sandwich" rule -- namely "bytes on the outside,
unicode on the inside, encode/decode at the edges". However, no error
handling method is specified for the decoding step, so we hit decoding
errors. We need the patches from THRIFT-2087 and THRIFT-5303 to improve
its robustness. THRIFT-5303 is enough to resolve the issue we hit, since
we mostly use the _fast_decode code path. THRIFT-2087 is backported as
well in case we use the normal decoding code path somewhere.

Tests:
 - Verify the issue is resolved after bumping the impala-shell dependent
   thrift version to 0.11.0-p4.

Change-Id: Id16b04248f2db3033bef3ab26b7ba8205768c9af
Reviewed-on: http://gerrit.cloudera.org:8080/16688
Reviewed-by: Csaba Ringhofer <cs...@cloudera.com>
Tested-by: Quanlong Huang <hu...@gmail.com>
---
M buildall.sh
A source/thrift/thrift-0.11.0-patches/0003-THRIFT-2087-Python-compiler-replace-non-utf-8-char-w.patch
A source/thrift/thrift-0.11.0-patches/0004-THRIFT-5303-Fix-missing-error-handling-in-using-PyUn.patch
3 files changed, 55 insertions(+), 1 deletion(-)

Approvals:
  Csaba Ringhofer: Looks good to me, approved
  Quanlong Huang: Verified


Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Id16b04248f2db3033bef3ab26b7ba8205768c9af
Gerrit-Change-Number: 16688
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[native-toolchain-CR] IMPALA-10145,IMPALA-10299: Apply unicode decoding bug fixes to thrift-0.11.0

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16688 )

Change subject: IMPALA-10145,IMPALA-10299: Apply unicode decoding bug fixes to thrift-0.11.0
......................................................................


Patch Set 1: Code-Review+2



Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id16b04248f2db3033bef3ab26b7ba8205768c9af
Gerrit-Change-Number: 16688
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Comment-Date: Mon, 02 Nov 2020 16:04:16 +0000
Gerrit-HasComments: No