Posted to reviews@impala.apache.org by "Quanlong Huang (Code Review)" <ge...@cloudera.org> on 2022/01/28 12:09:17 UTC

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17785

to look at the new patch set (#3).

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................

IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Three built-in string functions perform case conversion: upper, lower,
and initcap. Previously they converted only English alphabetic
characters. This patch adds support for Unicode characters.
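
To illustrate the old ASCII-only behavior, here is a minimal sketch. It is
not the code Impala used; AsciiUpper is a hypothetical helper that
upper-cases a-z byte by byte and copies every other byte of a UTF-8
string unchanged.

```cpp
#include <string>

// Hypothetical helper (not Impala's implementation): upper-cases only the
// ASCII letters a-z. Bytes of multi-byte UTF-8 sequences fall outside that
// range and are copied as-is.
std::string AsciiUpper(const std::string& s) {
  std::string out = s;
  for (char& c : out) {
    if (c >= 'a' && c <= 'z') c = static_cast<char>(c - 'a' + 'A');
  }
  return out;
}
```

Applied to the UTF-8 bytes of "grüßen", such a function yields "GRüßEN":
the ASCII letters change, but "ü" and "ß" pass through untouched.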

Case conversion has many corner cases that depend on locale and
context. E.g.
1) Case conversion is locale-sensitive.
Turkish has four letter "I"s, while English has only two: a lowercase
dotted i and an uppercase dotless I. Turkish has lowercase and uppercase
forms of both the dotted and the dotless I, so simply converting "i" to
"I" for upper case is wrong in Turkish:
    +-------+--------+---------+
    |       | Dotted | Dotless |
    +-------+--------+---------+
    | Upper | İ      | I       |
    +-------+--------+---------+
    | Lower | i      | ı       |
    +-------+--------+---------+
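
The table can be sketched as a per-code-point mapping that takes the
locale into account. This is only an illustration under assumptions:
UpperCodePoint and its boolean turkish flag are hypothetical names, and
the function covers just ASCII letters and the two Turkish pairs.

```cpp
// Hypothetical helper (not Impala's implementation): upper-cases a single
// code point, handling only ASCII letters and the dotted/dotless i pairs.
char32_t UpperCodePoint(char32_t c, bool turkish) {
  // In a Turkish locale, dotted i maps to capital I with dot above (U+0130).
  if (turkish && c == U'i') return char32_t{0x0130};
  // Dotless i (U+0131) maps to plain capital I.
  if (c == char32_t{0x0131}) return U'I';
  // Plain ASCII letters, including i outside a Turkish locale.
  if (c >= U'a' && c <= U'z') return static_cast<char32_t>(c - U'a' + U'A');
  return c;
}
```

With turkish set to false, "i" becomes "I" as in English; with it set to
true, "i" becomes "İ" and only "ı" becomes "I".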

2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper case:
the letter "ß" should be converted to "SS".
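
A one-to-one character mapping cannot express this, because one input
character produces two output characters. The sketch below shows the
idea on code points. UpperFull is a hypothetical helper covering only
ASCII, the Latin-1 lowercase letters U+00E0..U+00FE, and "ß"; it is not
the approach taken in this patch.

```cpp
#include <string>

// Hypothetical helper (not Impala's implementation): upper-cases a string of
// code points, expanding sharp s (U+00DF) into the two letters "SS".
std::u32string UpperFull(const std::u32string& in) {
  std::u32string out;
  out.reserve(in.size());
  for (char32_t c : in) {
    if (c == 0x00DF) {
      out += U"SS";  // One code point in, two code points out.
    } else if (c >= U'a' && c <= U'z') {
      out.push_back(static_cast<char32_t>(c - U'a' + U'A'));
    } else if (c >= 0x00E0 && c <= 0x00FE && c != 0x00F7) {
      // Latin-1 lowercase letters sit 0x20 above their uppercase forms;
      // U+00F7 is the division sign, not a letter.
      out.push_back(static_cast<char32_t>(c - 0x20));
    } else {
      out.push_back(c);
    }
  }
  return out;
}
```

For "grüßen" (6 code points) the result is "GRÜSSEN" (7 code points), so
callers cannot assume the output has the same number of characters as
the input.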

3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.
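
A simplified sketch of the final-sigma rule follows. Both function names
are hypothetical, and the letter check is a rough assumption that treats
every code point in the Greek and Greek Extended blocks as a letter; the
full Unicode rule is more involved. Only capital sigma is converted here.

```cpp
#include <cstddef>
#include <string>

// Simplifying assumption: any code point in the Greek and Coptic block
// (U+0370..U+03FF) or Greek Extended block (U+1F00..U+1FFF) is a letter.
bool IsGreekLetter(char32_t c) {
  return (c >= 0x0370 && c <= 0x03FF) || (c >= 0x1F00 && c <= 0x1FFF);
}

// Hypothetical helper (not Impala's implementation): lower-cases each capital
// sigma (U+03A3), choosing final sigma (U+03C2) when it ends a word and
// regular small sigma (U+03C3) otherwise. Other code points are unchanged.
std::u32string LowerCapitalSigmas(const std::u32string& in) {
  std::u32string out = in;
  for (std::size_t i = 0; i < in.size(); ++i) {
    if (in[i] != 0x03A3) continue;
    const bool letter_before = i > 0 && IsGreekLetter(in[i - 1]);
    const bool letter_after = i + 1 < in.size() && IsGreekLetter(in[i + 1]);
    out[i] = (letter_before && !letter_after) ? char32_t{0x03C2}
                                              : char32_t{0x03C3};
  }
  return out;
}
```

In "ὈΔΥΣΣΕΎΣ" the two inner sigmas are followed by a letter and become
"σ", while the last one ends the word and becomes "ς".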

This patch currently uses Boost.Locale for case conversion.
ICU (International Components for Unicode) is not integrated yet since
the Boost build in native-toolchain is not built with ICU. So currently
the localization backend of Boost.Locale is iconv, and the above corner
cases are not handled. We will consider integrating ICU in a follow-up
JIRA.

Test:
 - Add BE unit tests and e2e tests.

Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/expr-test.cc
M be/src/exprs/mask-functions-ir.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
7 files changed, 327 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/85/17785/3
-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>