You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Quanlong Huang (Code Review)" <ge...@cloudera.org> on 2021/08/18 08:10:42 UTC

[Impala-ASF-CR] WIP IMPALA-2019(part-3): Add UTF-8 support for case conversion functions

Quanlong Huang has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17785


Change subject: WIP IMPALA-2019(part-3): Add UTF-8 support for case conversion functions
......................................................................

WIP IMPALA-2019(part-3): Add UTF-8 support for case conversion functions

There are 3 builtin string functions doing case conversion: upper,
lower, and initcap. Previously they only convert English alphabetic
characters. This patch adds support to deal with unicode characters.

There are many corner cases in case conversion depending on the locale
and context. E.g.
1) Case conversion is locale-sensitive.
Turkish has 4 letter "I"s. English has only two, a lowercase dotted i
and an uppercase dotless I. Turkish has lowercase and uppercase forms of
both dotted and dotless I. So simply converting "i" to "I" for upper
case is wrong in Turkish:
    +-------+--------+---------+
    |       | Dotted | Dotless |
    +-------+--------+---------+
    | Upper | İ      | I       |
    +-------+--------+---------+
    | Lower | i      | ı       |
    +-------+--------+---------+

2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper case:
the letter "ß" should be converted to "SS".

3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.

This patch currently uses Boost.Locale in case conversion.
ICU(International Components for Unicode) is not integrated yet since
our boost in native-toolchain is not built with ICU. So currently the
localization backend of Boost.Locale is iconv, and the above corner
cases are not handled. We will consider integrating ICU in a follow-up
JIRA.

TODO: Add query option to specify the locale

Test:
 - Add BE unit tests
 - TODO: add more tests

Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
---
M CMakeLists.txt
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
3 files changed, 105 insertions(+), 15 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/85/17785/1
-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 4:

> Patch Set 3: Verified-1
> 
> Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7769/

The failure already exists in PS2:

F0128 15:17:45.989331 13575 llvm-codegen.cc:152] 64458eac382abfb9:85324b2c00000000] LLVM hit fatal error: Program used external function '_ZN5boost6locale9converterIcE2idE' which could not be resolved!
*** Check failure stack trace: ***
    @          0x56d70ec  google::LogMessage::Fail()
    @          0x56d899c  google::LogMessage::SendToLog()
    @          0x56d6a4a  google::LogMessage::Flush()
    @          0x56da608  google::LogMessageFatal::~LogMessageFatal()
    @          0x2ad61a5  impala::LlvmCodegenHandleError()
    @          0x55ddccd  llvm::report_fatal_error()
    @          0x52fdba4  llvm::RuntimeDyldImpl::resolveExternalSymbols()
    @          0x52fdcf4  llvm::RuntimeDyldImpl::resolveRelocations()
    @          0x50c1358  llvm::MCJIT::finalizeLoadedModules()
    @          0x50c1a31  llvm::MCJIT::finalizeObject()
    @          0x2ae0b0a  impala::LlvmCodeGen::FinalizeModule()
    @          0x24cd76b  impala::FragmentState::CodegenHelper()
    @          0x24cceb4  impala::FragmentState::InvokeCodegen()
    @          0x24d620e  impala::FragmentInstanceState::Open()
    @          0x24d2a74  impala::FragmentInstanceState::Exec()
    @          0x240f291  impala::QueryState::ExecFInstance()
    @          0x240d68f  _ZZN6impala10QueryState15StartFInstancesEvENKUlvE_clEv
    @          0x241231b  _ZN5boost6detail8function26void_function_obj_invoker0IZN6impala10QueryState15StartFInstancesEvEUlvE_vE6invokeERNS1_15function_bufferE
    @          0x22af13b  boost::function0<>::operator()()
    @          0x2a79976  impala::Thread::SuperviseThread()
    @          0x2a822c6  boost::_bi::list5<>::operator()<>()
    @          0x2a821ea  boost::_bi::bind_t<>::operator()()
    @          0x2a821ab  boost::detail::thread_data<>::run()
    @          0x4380a90  thread_proxy
    @     0x7f552827e6b9  start_thread
    @     0x7f5524cb951c  clone


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Sat, 29 Jan 2022 03:36:57 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 10: Code-Review+1

Thanks a lot for taking care of the comments.


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 14 Feb 2022 17:06:41 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10117/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 08 Feb 2022 08:29:06 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 11:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/17785/10/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17785/10/be/src/exprs/string-functions-ir.cc@379
PS10, Line 379:   uint8_t* result_ptr = result.ptr;
> This will be null if the allocation fails in the constructor. We should han
Thanks for catching this!


http://gerrit.cloudera.org:8080/#/c/17785/10/be/src/exprs/string-functions-ir.cc@383
PS10, Line 383: byte sequ
> My understanding is that we should use the same mbstate_t during the proces
Done


http://gerrit.cloudera.org:8080/#/c/17785/10/be/src/exprs/string-functions-ir.cc@422
PS10, Line 422: eturn StringVal::nul
> We shouldn't need this, Resize() should return false on error.
You are right. We probably need to change L599 as well.



-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 11
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 15 Feb 2022 07:07:34 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10131/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 10 Feb 2022 06:03:23 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 9:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/10138/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Fri, 11 Feb 2022 07:30:35 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10071/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Sat, 29 Jan 2022 00:01:41 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17785

to look at the new patch set (#3).

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................

IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

There are 3 builtin string functions doing case conversion: upper,
lower, and initcap. Previously they only convert English alphabetic
characters. This patch adds support to deal with unicode characters.

There are many corner cases in case conversion depending on the locale
and context. E.g.
1) Case conversion is locale-sensitive.
Turkish has 4 letter "I"s. English has only two, a lowercase dotted i
and an uppercase dotless I. Turkish has lowercase and uppercase forms of
both dotted and dotless I. So simply converting "i" to "I" for upper
case is wrong in Turkish:
    +-------+--------+---------+
    |       | Dotted | Dotless |
    +-------+--------+---------+
    | Upper | İ      | I       |
    +-------+--------+---------+
    | Lower | i      | ı       |
    +-------+--------+---------+

2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper case:
the letter "ß" should be converted to "SS".

3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.

This patch currently uses Boost.Locale in case conversion.
ICU(International Components for Unicode) is not integrated yet since
our boost in native-toolchain is not built with ICU. So currently the
localization backend of Boost.Locale is iconv, and the above corner
cases are not handled. We will consider integrating ICU in a follow-up
JIRA.

Test:
 - Add BE unit tests and e2e tests.

Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/expr-test.cc
M be/src/exprs/mask-functions-ir.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
7 files changed, 327 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/85/17785/3
-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10079/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 01 Feb 2022 10:07:51 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 7: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7822/


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Wed, 09 Feb 2022 14:04:42 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 11:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10157/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 11
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 15 Feb 2022 07:29:59 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 12:

Thank Qifan and Csaba for your review!


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 15 Feb 2022 11:57:01 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17785

to look at the new patch set (#5).

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................

IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

There are 3 builtin string functions doing case conversion: upper,
lower, and initcap. Previously they only convert English alphabetic
characters. This patch adds support to deal with unicode characters.

There are many corner cases in case conversion depending on the locale
and context. E.g.
1) Case conversion is locale-sensitive.
Turkish has 4 letter "I"s. English has only two, a lowercase dotted i
and an uppercase dotless I. Turkish has lowercase and uppercase forms of
both dotted and dotless I. So simply converting "i" to "I" for upper
case is wrong in Turkish:
    +-------+--------+---------+
    |       | Dotted | Dotless |
    +-------+--------+---------+
    | Upper | İ      | I       |
    +-------+--------+---------+
    | Lower | i      | ı       |
    +-------+--------+---------+

2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper case:
the letter "ß" should be converted to "SS".

3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.

This patch uses the simple wchar_t conversions of std::towupper and
stdd::towlower so the above cases are not handled. We will try
supporting them in follow-up JIRAs.

Test:
 - Add BE unit tests and e2e tests.

Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
6 files changed, 165 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/85/17785/5
-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17785

to look at the new patch set (#4).

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................

IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

There are 3 builtin string functions doing case conversion: upper,
lower, and initcap. Previously they only convert English alphabetic
characters. This patch adds support to deal with unicode characters.

There are many corner cases in case conversion depending on the locale
and context. E.g.
1) Case conversion is locale-sensitive.
Turkish has 4 letter "I"s. English has only two, a lowercase dotted i
and an uppercase dotless I. Turkish has lowercase and uppercase forms of
both dotted and dotless I. So simply converting "i" to "I" for upper
case is wrong in Turkish:
    +-------+--------+---------+
    |       | Dotted | Dotless |
    +-------+--------+---------+
    | Upper | İ      | I       |
    +-------+--------+---------+
    | Lower | i      | ı       |
    +-------+--------+---------+

2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper case:
the letter "ß" should be converted to "SS".

3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.

This patch currently uses Boost.Locale in case conversion.
ICU(International Components for Unicode) is not integrated yet since
our boost in native-toolchain is not built with ICU. So currently the
localization backend of Boost.Locale is iconv, and the above corner
cases are not handled. We will consider integrating ICU in a follow-up
JIRA.

Test:
 - Add BE unit tests and e2e tests.

Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/expr-test.cc
M be/src/exprs/mask-functions-ir.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
7 files changed, 332 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/85/17785/4
-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 6:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7788/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Wed, 02 Feb 2022 00:12:58 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17785

to look at the new patch set (#10).

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................

IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

There are 3 builtin case conversion string functions: upper(), lower(),
and initcap(). Previously they only convert English alphabetic
characters. This patch adds support to deal with Unicode characters.

There are many corner cases in case conversion depending on the locale
and context. E.g.
1) Case conversion is locale-sensitive.
Turkish has 4 letter "I"s. English has only two, a lowercase dotted i
and an uppercase dotless I. Turkish has lowercase and uppercase forms of
both dotted and dotless I. So simply converting "i" to "I" for upper
case is wrong in Turkish:
    +-------+--------+---------+
    |       | Dotted | Dotless |
    +-------+--------+---------+
    | Upper | İ      | I       |
    +-------+--------+---------+
    | Lower | i      | ı       |
    +-------+--------+---------+

2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper case:
the letter "ß" should be converted to "SS".

3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.

The above cases will be focus in follow-up JIRAs. This patch addes the
initial implementation of UTF-8 aware case conversion functions.

--------
Implementation:
In UTF-8 mode (turned on by set UTF8_MODE=true) of these functions, the
bytes in strings are converted to wide characters using std::mbrtowc().
Each wide character (wchar_t) will then be converted using std::towupper
or std::towlower correspondingly. We then convert them back to multi
bytes using std::wcrtomb().

Note that these builtins are locale aware. If impalad is launched
without a UTF-8 aware locale, e.g. LC_ALL="C", these builtins can't
recognize non-ascii characters, which will return unexpected results.
Thus we modify our docker images to set LC_ALL="C.UTF-8" instead of "C".
This patch also logs the current locale when launching impala daemons
for better debugging. We will support customized locale in IMPALA-11080.

Test:
 - Add BE unit tests and e2e tests.

Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
---
M be/src/common/init.cc
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
M docker/daemon_entrypoint.sh
M docker/test-with-docker.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
8 files changed, 251 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/85/17785/10
-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] WIP IMPALA-2019(part-3): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: WIP IMPALA-2019(part-3): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9313/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 18 Aug 2021 08:34:32 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17785

to look at the new patch set (#8).

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................

IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

There are 3 builtin case conversion string functions: upper(), lower(),
and initcap(). Previously they only convert English alphabetic
characters. This patch adds support to deal with Unicode characters.

There are many corner cases in case conversion depending on the locale
and context. E.g.
1) Case conversion is locale-sensitive.
Turkish has 4 letter "I"s. English has only two, a lowercase dotted i
and an uppercase dotless I. Turkish has lowercase and uppercase forms of
both dotted and dotless I. So simply converting "i" to "I" for upper
case is wrong in Turkish:
    +-------+--------+---------+
    |       | Dotted | Dotless |
    +-------+--------+---------+
    | Upper | İ      | I       |
    +-------+--------+---------+
    | Lower | i      | ı       |
    +-------+--------+---------+

2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper case:
the letter "ß" should be converted to "SS".

3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.

This patch uses the simple wchar_t conversions of std::towupper and
stdd::towlower so the above cases are not handled. We will try
supporting them in follow-up JIRAs.

Test:
 - Add BE unit tests and e2e tests.

Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
---
M be/src/common/init.cc
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
6 files changed, 213 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/85/17785/8
-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-2019(part-3): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-3): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 2:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/10038/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 25 Jan 2022 03:30:41 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17785

to look at the new patch set (#6).

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................

IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

There are 3 builtin string functions doing case conversion: upper,
lower, and initcap. Previously they only convert English alphabetic
characters. This patch adds support to deal with unicode characters.

There are many corner cases in case conversion depending on the locale
and context. E.g.
1) Case conversion is locale-sensitive.
Turkish has 4 letter "I"s. English has only two, a lowercase dotted i
and an uppercase dotless I. Turkish has lowercase and uppercase forms of
both dotted and dotless I. So simply converting "i" to "I" for upper
case is wrong in Turkish:
    +-------+--------+---------+
    |       | Dotted | Dotless |
    +-------+--------+---------+
    | Upper | İ      | I       |
    +-------+--------+---------+
    | Lower | i      | ı       |
    +-------+--------+---------+

2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper case:
the letter "ß" should be converted to "SS".

3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.

This patch uses the simple wchar_t conversions of std::towupper and
stdd::towlower so the above cases are not handled. We will try
supporting them in follow-up JIRAs.

Test:
 - Add BE unit tests and e2e tests.

Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
5 files changed, 164 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/85/17785/6
-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 10: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Fri, 11 Feb 2022 15:46:37 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-3): Add UTF-8 support for case conversion functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-3): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 2:

> Patch Set 2:
> 
> Build Failed 
> 
> https://jenkins.impala.io/job/gerrit-code-review-checks/10038/ : Initial code review checks failed. See linked job for details on the failure.

The failure is https://jenkins.impala.io/job/clang-tidy-ub1604/17726/

/home/ubuntu/Impala/be/build/release//service/unifiedbetests: symbol lookup error: /home/ubuntu/Impala/be/build/release/exprs/libExprs.so: undefined symbol: _ZN5boost6locale9converterIcE2idE
FAILED: Unified backend test executable returned an error when trying
        to list tests.
Command: /home/ubuntu/Impala/bin/run-binary.sh /home/ubuntu/Impala/be/build/release//service/unifiedbetests --gtest_list_tests
Return Code: 127


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 25 Jan 2022 04:24:34 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 3: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7769/


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Fri, 28 Jan 2022 18:25:06 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 6: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7788/


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Wed, 02 Feb 2022 06:37:27 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 10:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/17785/10/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17785/10/be/src/exprs/string-functions-ir.cc@379
PS10, Line 379:   uint8_t* result_ptr = result.ptr;
This will be null if the allocation fails in the constructor. We should handle it similarly to the failure of resize()


http://gerrit.cloudera.org:8080/#/c/17785/10/be/src/exprs/string-functions-ir.cc@383
PS10, Line 383: mbstate_t
My understanding is that we should use the same mbstate_t during the processing of a string, as it's goal is to allow a conversion function to depend on the previous characters. This probably doesn't matter in utf8 though.


http://gerrit.cloudera.org:8080/#/c/17785/10/be/src/exprs/string-functions-ir.cc@422
PS10, Line 422: context->has_error()
We shouldn't need this, Resize() should return false on error.



-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 14 Feb 2022 14:45:44 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17785

to look at the new patch set (#11).

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................

IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

There are 3 builtin case conversion string functions: upper(), lower(),
and initcap(). Previously they only convert English alphabetic
characters. This patch adds support to deal with Unicode characters.

There are many corner cases in case conversion depending on the locale
and context. E.g.
1) Case conversion is locale-sensitive.
Turkish has 4 letter "I"s. English has only two, a lowercase dotted i
and an uppercase dotless I. Turkish has lowercase and uppercase forms of
both dotted and dotless I. So simply converting "i" to "I" for upper
case is wrong in Turkish:
    +-------+--------+---------+
    |       | Dotted | Dotless |
    +-------+--------+---------+
    | Upper | İ      | I       |
    +-------+--------+---------+
    | Lower | i      | ı       |
    +-------+--------+---------+

2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper case:
the letter "ß" should be converted to "SS".

3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.

The above cases will be focus in follow-up JIRAs. This patch addes the
initial implementation of UTF-8 aware case conversion functions.

--------
Implementation:
In UTF-8 mode (turned on by set UTF8_MODE=true) of these functions, the
bytes in strings are converted to wide characters using std::mbrtowc().
Each wide character (wchar_t) will then be converted using std::towupper
or std::towlower correspondingly. We then convert them back to multi
bytes using std::wcrtomb().

Note that these builtins are locale aware. If impalad is launched
without a UTF-8 aware locale, e.g. LC_ALL="C", these builtins can't
recognize non-ascii characters, which will return unexpected results.
Thus we modify our docker images to set LC_ALL="C.UTF-8" instead of "C".
This patch also logs the current locale when launching impala daemons
for better debugging. We will support customized locale in IMPALA-11080.

Test:
 - Add BE unit tests and e2e tests.

Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
---
M be/src/common/init.cc
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
M docker/daemon_entrypoint.sh
M docker/test-with-docker.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
8 files changed, 249 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/85/17785/11
-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 11
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7844/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 15 Feb 2022 11:57:46 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-3): Add UTF-8 support for case conversion functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17785

to look at the new patch set (#2).

Change subject: IMPALA-2019(part-3): Add UTF-8 support for case conversion functions
......................................................................

IMPALA-2019(part-3): Add UTF-8 support for case conversion functions

There are 3 builtin string functions doing case conversion: upper,
lower, and initcap. Previously they only convert English alphabetic
characters. This patch adds support to deal with unicode characters.

There are many corner cases in case conversion depending on the locale
and context. E.g.
1) Case conversion is locale-sensitive.
Turkish has 4 letter "I"s. English has only two, a lowercase dotted i
and an uppercase dotless I. Turkish has lowercase and uppercase forms of
both dotted and dotless I. So simply converting "i" to "I" for upper
case is wrong in Turkish:
    +-------+--------+---------+
    |       | Dotted | Dotless |
    +-------+--------+---------+
    | Upper | İ      | I       |
    +-------+--------+---------+
    | Lower | i      | ı       |
    +-------+--------+---------+

2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper case:
the letter "ß" should be converted to "SS".

3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.

This patch currently uses Boost.Locale in case conversion.
ICU(International Components for Unicode) is not integrated yet since
our boost in native-toolchain is not built with ICU. So currently the
localization backend of Boost.Locale is iconv, and the above corner
cases are not handled. We will consider integrating ICU in a follow-up
JIRA.

Test:
 - Add BE unit tests and e2e tests.

Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
---
M be/src/exprs/expr-test.cc
M be/src/exprs/mask-functions-ir.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
6 files changed, 323 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/85/17785/2
-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 3:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7769/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Fri, 28 Jan 2022 12:46:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 12: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 15 Feb 2022 10:42:01 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 10:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10139/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Fri, 11 Feb 2022 08:13:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 10:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7832/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Fri, 11 Feb 2022 08:58:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10080/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 01 Feb 2022 10:26:12 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10058/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Fri, 28 Jan 2022 12:32:48 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-2019(part-3): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: WIP IMPALA-2019(part-3): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17785/1/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/17785/1/be/src/exprs/expr-test.cc@10793
PS1, Line 10793:   TestStringValue("initcap('abcd áäèü ABCD ÁÄÈÜ')", "Abcd Áäèü Abcd Áäèü");
line too long (91 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 18 Aug 2021 08:11:42 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 6:

(7 comments)

Looks good!

http://gerrit.cloudera.org:8080/#/c/17785/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17785/6//COMMIT_MSG@9
PS6, Line 9: string functions doing case conversion
nit. case conversion string functions


http://gerrit.cloudera.org:8080/#/c/17785/6//COMMIT_MSG@11
PS6, Line 11: unicode
nit Unicode


http://gerrit.cloudera.org:8080/#/c/17785/6/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17785/6/be/src/exprs/string-functions-ir.cc@366
PS6, Line 366: if (wc_bytes == 0) break;
Per https://en.cppreference.com/w/cpp/string/multibyte/mbtowc, on  return value:

If s is not a null pointer, returns the number of bytes that are contained in the multibyte character or -1 if the first bytes pointed to by s do not form a valid multibyte character or ?0? if s is pointing at the null charcter '\0'.

Since '\0' is valid in Unicode, we should continue instead of break.


http://gerrit.cloudera.org:8080/#/c/17785/6/be/src/exprs/string-functions-ir.cc@370
PS6, Line 370: ;
nit. 0xFFFD.


http://gerrit.cloudera.org:8080/#/c/17785/6/be/src/exprs/string-functions-ir.cc@371
PS6, Line 371:       wc_bytes = 3;
nit. should it be 4?


http://gerrit.cloudera.org:8080/#/c/17785/6/be/src/exprs/string-functions-ir.cc@408
PS6, Line 408: iswspace
nit. UNLIKELY()


http://gerrit.cloudera.org:8080/#/c/17785/6/testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
File testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test:

http://gerrit.cloudera.org:8080/#/c/17785/6/testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test@183
PS6, Line 183: ---- QUERY
             : set utf8_mode=false;
             : select upper('abcd áäèü'), lower('ABCD ÁÄÈÜ'), initcap('abcd áäèü ABCD ÁÄÈÜ');
             : ---- RESULTS: RAW_STRING
             : 'ABCD áäèü','abcd ÁÄÈÜ','Abcd áäèü Abcd ÁÄÈÜ'
             : ---- TYPES
             : STRING,STRING,STRING
             : ====
             : ---- QUERY
             : set utf8_mode=true;
             : select upper('abcd áäèü'), lower('ABCD ÁÄÈÜ'), initcap('abcd áäèü ABCD ÁÄÈÜ');
             : ---- RESULTS: RAW_STRING
             : 'ABCD ÁÄÈÜ','abcd áäèü','Abcd Áäèü Abcd Áäèü'
             : ---- TYPES
             : STRING,STRING,STRING
             : ====
good tests.



-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Fri, 04 Feb 2022 18:50:47 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 7:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7822/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Wed, 09 Feb 2022 07:22:38 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17785

to look at the new patch set (#7).

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................

IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

There are 3 builtin case conversion string functions: upper(), lower(),
and initcap(). Previously they only convert English alphabetic
characters. This patch adds support to deal with Unicode characters.

There are many corner cases in case conversion depending on the locale
and context. E.g.
1) Case conversion is locale-sensitive.
Turkish has 4 letter "I"s. English has only two, a lowercase dotted i
and an uppercase dotless I. Turkish has lowercase and uppercase forms of
both dotted and dotless I. So simply converting "i" to "I" for upper
case is wrong in Turkish:
    +-------+--------+---------+
    |       | Dotted | Dotless |
    +-------+--------+---------+
    | Upper | İ      | I       |
    +-------+--------+---------+
    | Lower | i      | ı       |
    +-------+--------+---------+

2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper case:
the letter "ß" should be converted to "SS".

3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.

This patch uses the simple wchar_t conversions of std::towupper and
stdd::towlower so the above cases are not handled. We will try
supporting them in follow-up JIRAs.

Test:
 - Add BE unit tests and e2e tests.

Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
5 files changed, 212 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/85/17785/7
-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 6:

(6 comments)

Thanks for the feedback, Qifan!

http://gerrit.cloudera.org:8080/#/c/17785/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17785/6//COMMIT_MSG@9
PS6, Line 9: string functions doing case conversion
> nit. case conversion string functions
Done


http://gerrit.cloudera.org:8080/#/c/17785/6//COMMIT_MSG@11
PS6, Line 11: unicode
> nit Unicode
Done


http://gerrit.cloudera.org:8080/#/c/17785/6/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17785/6/be/src/exprs/string-functions-ir.cc@366
PS6, Line 366: if (wc_bytes == 0) break;
> Per https://en.cppreference.com/w/cpp/string/multibyte/mbtowc, on  return v
Thanks for catching this!


http://gerrit.cloudera.org:8080/#/c/17785/6/be/src/exprs/string-functions-ir.cc@370
PS6, Line 370: ;
> nit. 0xFFFD.
Done


http://gerrit.cloudera.org:8080/#/c/17785/6/be/src/exprs/string-functions-ir.cc@371
PS6, Line 371:       wc_bytes = 3;
> nit. should it be 4?
Thanks for catching this! I think we should jump to the next legal UTF-8 start byte.


http://gerrit.cloudera.org:8080/#/c/17785/6/be/src/exprs/string-functions-ir.cc@408
PS6, Line 408: iswspace
> nit. UNLIKELY()
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 08 Feb 2022 08:06:42 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................

IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

There are 3 builtin case conversion string functions: upper(), lower(),
and initcap(). Previously they only convert English alphabetic
characters. This patch adds support to deal with Unicode characters.

There are many corner cases in case conversion depending on the locale
and context. E.g.
1) Case conversion is locale-sensitive.
Turkish has 4 letter "I"s. English has only two, a lowercase dotted i
and an uppercase dotless I. Turkish has lowercase and uppercase forms of
both dotted and dotless I. So simply converting "i" to "I" for upper
case is wrong in Turkish:
    +-------+--------+---------+
    |       | Dotted | Dotless |
    +-------+--------+---------+
    | Upper | İ      | I       |
    +-------+--------+---------+
    | Lower | i      | ı       |
    +-------+--------+---------+

2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper case:
the letter "ß" should be converted to "SS".

3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.

The above cases will be focus in follow-up JIRAs. This patch addes the
initial implementation of UTF-8 aware case conversion functions.

--------
Implementation:
In UTF-8 mode (turned on by set UTF8_MODE=true) of these functions, the
bytes in strings are converted to wide characters using std::mbrtowc().
Each wide character (wchar_t) will then be converted using std::towupper
or std::towlower correspondingly. We then convert them back to multi
bytes using std::wcrtomb().

Note that these builtins are locale aware. If impalad is launched
without a UTF-8 aware locale, e.g. LC_ALL="C", these builtins can't
recognize non-ascii characters, which will return unexpected results.
Thus we modify our docker images to set LC_ALL="C.UTF-8" instead of "C".
This patch also logs the current locale when launching impala daemons
for better debugging. We will support customized locale in IMPALA-11080.

Test:
 - Add BE unit tests and e2e tests.

Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Reviewed-on: http://gerrit.cloudera.org:8080/17785
Reviewed-by: Csaba Ringhofer <cs...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/common/init.cc
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
M docker/daemon_entrypoint.sh
M docker/test-with-docker.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
8 files changed, 249 insertions(+), 1 deletion(-)

Approvals:
  Csaba Ringhofer: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17785 )

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................


Patch Set 12: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 15 Feb 2022 18:40:57 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17785

to look at the new patch set (#9).

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................

IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

There are 3 builtin case conversion string functions: upper(), lower(),
and initcap(). Previously they only convert English alphabetic
characters. This patch adds support to deal with Unicode characters.

There are many corner cases in case conversion depending on the locale
and context. E.g.
1) Case conversion is locale-sensitive.
Turkish has 4 letter "I"s. English has only two, a lowercase dotted i
and an uppercase dotless I. Turkish has lowercase and uppercase forms of
both dotted and dotless I. So simply converting "i" to "I" for upper
case is wrong in Turkish:
    +-------+--------+---------+
    |       | Dotted | Dotless |
    +-------+--------+---------+
    | Upper | İ      | I       |
    +-------+--------+---------+
    | Lower | i      | ı       |
    +-------+--------+---------+

2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper case:
the letter "ß" should be converted to "SS".

3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.

The above cases will be focus in follow-up JIRAs. This patch addes the
initial implementation of UTF-8 aware case conversion functions.

--------
Implementation:
In UTF-8 mode (turned on by set UTF8_MODE=true) of these functions, the
bytes in strings are converted to wide characters using std::mbrtowc().
Each wide character (wchar_t) will then be converted using std::towupper
or std::towlower correspondingly. We then convert them back to multi
bytes using std::wcrtomb().

Note that these builtins are locale aware. If impalad is launched
without a UTF-8 aware locale, e.g. LC_ALL="C", these builtins can't
recognize non-ascii characters, which will return unexpected results.
Thus we modify our docker images to set LC_ALL="C.UTF-8" instead of "C".
This patch also logs the current locale when launching impala daemons
for better debugging. We will support customized locale in IMPALA-11080.

Test:
 - Add BE unit tests and e2e tests.

Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
---
M be/src/common/init.cc
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
M docker/daemon_entrypoint.sh
M docker/test-with-docker.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
8 files changed, 251 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/85/17785/9
-- 
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>