You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Quanlong Huang (Code Review)" <ge...@cloudera.org> on 2021/08/30 06:39:01 UTC

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Hello Qifan Chen, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17780

to look at the new patch set (#4).

Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
......................................................................

IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

This patch provides consistent masking behavior with Hive for
strings under the UTF-8 mode, i.e., set UTF8_MODE=true. In UTF-8 mode,
the masked unit of a string is a unicode code point.

Implementation
 - Extends the existing MaskTransform function to deal with unicode code
   points(represented by uint32_t).
 - Extends the existing GetFirstChar function to get the code point of
   given masked charactors in UTF-8 mode.
 - Implement a MaskSubStrUtf8 method as the core functionality.
 - Swith to use MaskSubStrUtf8 instead of MaskSubStr in UTF-8 mode.
 - For better testing, this patch also adds an overload for all mask
   functions for only masking other chars but keeping the
   upper/lower/digit chars unmasked. E.g. mask({col}, -1, -1, -1, 'X').

Tests
 - Add BE tests in expr-test
 - Add e2e tests in utf8-string-functions.test

Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
---
M be/src/exprs/expr-test.cc
M be/src/exprs/mask-functions-ir.cc
M be/src/exprs/mask-functions.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
5 files changed, 306 insertions(+), 49 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/17780/4
-- 
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>