Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2024/02/02 12:24:00 UTC

[jira] [Commented] (IMPALA-12718) trim() functions are lack of utf-8 support

    [ https://issues.apache.org/jira/browse/IMPALA-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813649#comment-17813649 ] 

ASF subversion and git services commented on IMPALA-12718:
----------------------------------------------------------

Commit e489ab35b1610aa387bc10caf466bbe0aafba19b in impala's branch refs/heads/master from Eyizoha
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=e489ab35b ]

IMPALA-12718: Provides UTF-8 support for the trim functions

Currently, the trim functions (BTRIM, LTRIM, RTRIM) cannot correctly
handle strings containing multi-byte UTF-8 characters. Each multi-byte
UTF-8 character is interpreted as several single-byte characters,
leading to unexpected results.

This patch provides UTF-8 support for the trim functions, enabling them
to correctly handle multi-byte UTF-8 characters when utf8_mode=true is
set. It also introduces a set of trim functions with the 'utf8_' prefix,
offering the same capability even when utf8_mode is not enabled.
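
Trimming by character rather than by byte can be sketched roughly as
follows. This is an illustrative Python sketch, not the C++ code from the
patch: the helper names (utf8_char_len, split_utf8, rtrim_utf8) are
hypothetical, and the input is assumed to be well-formed UTF-8.

```python
def utf8_char_len(first_byte):
    """Byte length of a UTF-8 sequence, judged from its leading byte."""
    if first_byte < 0x80:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    return 1  # invalid leading byte: treat it as a single byte


def split_utf8(data):
    """Split a bytes object into a list of per-character byte sequences."""
    chars = []
    i = 0
    while i < len(data):
        n = utf8_char_len(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars


def rtrim_utf8(data, chars_to_trim):
    """Remove trailing characters (not bytes) that occur in chars_to_trim."""
    trim_set = set(split_utf8(chars_to_trim))
    chars = split_utf8(data)
    while chars and chars[-1] in trim_set:
        chars.pop()
    return b"".join(chars)
```

Because whole characters are compared, a byte that happens to be shared
between two different characters no longer causes a partial trim.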

Testing:
 - Added new BE test case in ExprTest#Utf8Test
 - Added new E2E test case in TestUtf8StringFunctions

Change-Id: I5cfaffd71009f16eae75910af835bd2a34410856
Reviewed-on: http://gerrit.cloudera.org:8080/20926
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> trim() functions are lack of utf-8 support
> ------------------------------------------
>
>                 Key: IMPALA-12718
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12718
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Quanlong Huang
>            Assignee: Ye Zihao
>            Priority: Critical
>              Labels: ramp-up
>
> The following string functions lack UTF-8 support:
> {noformat}
> BTRIM(STRING a, STRING chars_to_trim)
> LTRIM(STRING a, STRING chars_to_trim)
> RTRIM(STRING a, STRING chars_to_trim)
> {noformat}
> Here is an issue reported by our user:
> {noformat}
> [localhost:21050] default> select rtrim('价格，', '，');
> +-----------------------+
> | rtrim('价格，', '，') |
> +-----------------------+
> | 价�                   |
> +-----------------------+{noformat}
> The result is the same with utf8_mode=true. Note that the comma in the strings above is the Chinese (full-width) punctuation mark '，', not the English (ASCII) comma ','.
> The cause is that the bytes of the Chinese character '，' are treated as the set of characters to trim. The UTF-8 encodings of these characters are:
>  * '价': 0xe4 0xbb 0xb7
>  * '格': 0xe6 0xa0 0xbc
>  * '，': 0xef 0xbc 0x8c
> Each character is encoded as 3 bytes. The last byte of '格' is 0xbc, which also appears among the bytes of '，', so it is removed as well. The result is '价' followed by the first two bytes of '格'. That trailing fragment is malformed UTF-8, so it is displayed as '�'.
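
The byte-level behaviour described above can be reproduced outside Impala.
The following is a minimal Python sketch under the assumption that the trim
set is treated as individual bytes; rtrim_bytewise is a hypothetical helper
written for illustration, not Impala code.

```python
def rtrim_bytewise(data, chars_to_trim):
    """Strip trailing bytes that occur anywhere in chars_to_trim."""
    trim_bytes = set(chars_to_trim)  # for '，' this is {0xef, 0xbc, 0x8c}
    end = len(data)
    while end > 0 and data[end - 1] in trim_bytes:
        end -= 1
    return data[:end]


s = "价格，".encode("utf-8")        # e4 bb b7 | e6 a0 bc | ef bc 8c
result = rtrim_bytewise(s, "，".encode("utf-8"))

# The trailing 0xbc of '格' is stripped along with the three bytes of '，'.
print(result.hex())  # e4bbb7e6a0

# Decoding the truncated sequence yields '价' plus a replacement character.
print(result.decode("utf-8", errors="replace"))
```

The loop stops at 0xa0, the first byte from the end that is not in the trim
set, which leaves '格' cut off after two of its three bytes.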


