Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/09/15 13:17:00 UTC
[jira] [Commented] (IMPALA-2019) Proper UTF-8 support in string functions

    [ https://issues.apache.org/jira/browse/IMPALA-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415505#comment-17415505 ] 

ASF subversion and git services commented on IMPALA-2019:
---------------------------------------------------------

Commit 3850d49711b88091101cfc3d89da28c76a17b04d in impala's branch refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=3850d49 ]

IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Mask functions are used in Ranger column masking policies to mask
sensitive data. There are 5 mask functions: mask(), mask_first_n(),
mask_last_n(), mask_show_first_n(), mask_show_last_n(). Taking mask() as
an example: by default, it masks uppercase letters to 'X', lowercase
letters to 'x', and digits to 'n', and leaves other characters unmasked.
To mask all characters to '*', we can use
  mask(my_col, '*', '*', '*', '*');
The current implementations mask strings byte by byte, which gives
results inconsistent with Hive when the string contains Unicode
characters:
  mask('中国', '*', '*', '*', '*') => '******'
Each Chinese character is encoded as 3 bytes in UTF-8, so we get the
above result. The result in Hive is '**' since there are two Chinese
characters.
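
The difference between the two behaviors can be illustrated with a small
Python sketch. This is not Impala's code: the function names are
hypothetical, and the character classes (Unicode categories Lu, Ll, Nd)
are an assumption chosen to reproduce the results quoted above.

```python
import unicodedata

def mask_bytewise(s, upper="X", lower="x", digit="n", other=None):
    """Mask each UTF-8 byte on its own; other=None leaves the byte as-is."""
    out = bytearray()
    for b in s.encode("utf-8"):
        if ord("A") <= b <= ord("Z"):
            repl = upper
        elif ord("a") <= b <= ord("z"):
            repl = lower
        elif ord("0") <= b <= ord("9"):
            repl = digit
        else:
            repl = other  # every byte of a multi-byte character lands here
        out += bytes([b]) if repl is None else repl.encode("utf-8")
    return out.decode("utf-8", errors="replace")

def mask_by_code_point(s, upper="X", lower="x", digit="n", other=None):
    """Mask each Unicode code point as one unit."""
    out = []
    for ch in s:  # a Python str iterates by code point
        cat = unicodedata.category(ch)
        if cat == "Lu":
            repl = upper
        elif cat == "Ll":
            repl = lower
        elif cat == "Nd":
            repl = digit
        else:
            repl = other
        out.append(ch if repl is None else repl)
    return "".join(out)

print(mask_bytewise("中国", "*", "*", "*", "*"))       # six mask characters
print(mask_by_code_point("中国", "*", "*", "*", "*"))  # two mask characters
```

For pure ASCII input the two functions agree, which is why the
inconsistency only shows up with multi-byte characters.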

This patch makes masking behavior consistent with Hive for strings in
UTF-8 mode, i.e., when UTF8_MODE=true is set. In UTF-8 mode, the unit
of masking is a Unicode code point.
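
As a rough illustration of what "code point as the masked unit" means at
the byte level, the sketch below splits a UTF-8 byte string into
per-code-point chunks using the lead byte of each sequence. The function
names are hypothetical, and treating a stray or invalid byte as a
one-byte unit is an assumption of this sketch; the commit message does
not say how malformed input is handled.

```python
def utf8_sequence_length(lead_byte):
    """Return the length in bytes of the UTF-8 sequence starting at lead_byte."""
    if lead_byte < 0x80:
        return 1  # ASCII
    if 0xC0 <= lead_byte < 0xE0:
        return 2
    if 0xE0 <= lead_byte < 0xF0:
        return 3  # e.g. the Chinese characters in the example above
    if 0xF0 <= lead_byte < 0xF8:
        return 4
    return 1  # assumption: continuation or invalid byte is its own unit

def split_code_points(data):
    """Split UTF-8 bytes into a list of byte chunks, one per code point."""
    units = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        units.append(data[i:i + n])
        i += n
    return units

units = split_code_points("a中国1".encode("utf-8"))
print([len(u) for u in units])  # byte length of each masked unit
```

Masking then replaces each chunk as a whole, so a 3-byte character
yields one mask character instead of three.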

Implementation
 - Extend the existing MaskTransform function to deal with Unicode code
   points (represented by uint32_t).
 - Extend the existing GetFirstChar function to get the code point of
   the given mask characters in UTF-8 mode.
 - Implement a MaskSubStrUtf8 method as the core functionality.
 - Switch to MaskSubStrUtf8 instead of MaskSubStr in UTF-8 mode.
 - For better testing, this patch also adds an overload to all mask
   functions that masks only other chars while keeping the
   upper/lower/digit chars unmasked. E.g. mask({col}, -1, -1, -1, 'X').
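
A minimal Python sketch of how a single code-point-based substring
masker could back all of the mask variants, including the -1 "leave
unmasked" convention from the overload above. The Python names only
loosely mirror the C++ ones and are hypothetical, the Unicode category
mapping (Lu, Ll, Nd) is an assumption, and n is always passed explicitly
here rather than defaulted.

```python
import unicodedata

KEEP = -1  # mirrors the -1 argument: leave this character class unmasked

def mask_code_point(ch, upper="X", lower="x", digit="n", other=KEEP):
    """Return the replacement for one code point, or ch itself if kept."""
    cat = unicodedata.category(ch)
    if cat == "Lu":
        repl = upper
    elif cat == "Ll":
        repl = lower
    elif cat == "Nd":
        repl = digit
    else:
        repl = other
    return ch if repl == KEEP else repl

def mask_sub_str(s, start, end, **chars):
    """Mask code points with index in [start, end); leave the rest alone."""
    return "".join(
        mask_code_point(ch, **chars) if start <= i < end else ch
        for i, ch in enumerate(s)
    )

def mask_first_n(s, n, **chars):
    return mask_sub_str(s, 0, n, **chars)

def mask_last_n(s, n, **chars):
    return mask_sub_str(s, max(len(s) - n, 0), len(s), **chars)

def mask_show_first_n(s, n, **chars):
    return mask_sub_str(s, n, len(s), **chars)

def mask_show_last_n(s, n, **chars):
    return mask_sub_str(s, 0, max(len(s) - n, 0), **chars)

print(mask_first_n("中国abc", 2, other="*"))
print(mask_show_first_n("中国abc", 2))
print(mask_sub_str("Ab1中", 0, 4, upper=KEEP, lower=KEEP, digit=KEEP, other="X"))
```

Because indices count code points rather than bytes, the first-n and
last-n boundaries never fall inside a multi-byte character.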

Tests
 - Add BE tests in expr-test
 - Add e2e tests in utf8-string-functions.test

Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Reviewed-on: http://gerrit.cloudera.org:8080/17780
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Proper UTF-8 support in string functions
> ----------------------------------------
>
>                 Key: IMPALA-2019
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2019
>             Project: IMPALA
>          Issue Type: New Feature
>          Components: Backend
>    Affects Versions: Impala 2.1, Impala 2.2
>            Reporter: Andrés Cordero
>            Assignee: Quanlong Huang
>            Priority: Critical
>              Labels: sql-language
>
> As documented here: https://impala.apache.org/docs/build/html/topics/impala_string.html
> Impala does not properly handle non-ASCII UTF-8 characters, and will return results in string functions such as length that are inconsistent with Hive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
