You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Quanlong Huang (Code Review)" <ge...@cloudera.org> on 2021/02/02 14:43:49 UTC
[Impala-ASF-CR] [WIP] IMPALA-5675: Support UTF-8 Varchar and Char

Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16909

to look at the new patch set (#8).

Change subject: [WIP] IMPALA-5675: Support UTF-8 Varchar and Char
......................................................................

[WIP] IMPALA-5675: Support UTF-8 Varchar and Char

This patch adds support for UTF-8 aware varchar and char types. In
UTF-8 mode, when truncating UTF-8 varchar(N) and char(N) strings,
lengths will be counted by UTF-8 characters instead of bytes. So the
result string will have up to N UTF-8 characters.

The UTF8_MODE query option is first detected in FE when analyzing the
query. A 'is_utf8' label is added in Exprs and SlotDescriptors. They are
used in generating thrift objects and computing the tuple layouts. A
char(N) slot will occupy 4 * N bytes if it's in UTF-8 type, because a
UTF-8 character can be encoded into 1~4 bytes. The slot will store up to
N UTF-8 characters.

There is a gotcha that we should not add the label in Type.java, because
Type instances are shared across the FE. Query compilation reuses the
Type instances from the metadata. If we modify Type instances during
compilation, other queries in non-UTF8 mode will be affected.

However, in BE, we need the type related classes (e.g. ColumnType,
TypeDesc) to carry in the utf8 markers. It's impractical to check the
UTF8_MODE query option everywhere it needs to be. E.g. in
AnyValUtil::SetAnyVal we can't access the query options. So we add the
'is_utf8' marker in TScalarType, ColumnType, TypeDesc to conveniently
distinguish char(N) and varchar(N) types in UTF-8 mode. When generating
thrift objects in FE, Exprs and SlotDescriptors deliver 'is_utf8'
markers to TScalaTypes. They finally landed in ColumnType and TypeDesc
instances.

Given the correct UTF-8 mode checked, we just need to truncate/pad the
char/varchar strings with their length counted by UTF-8 characters.

Since char(N) slots always occupy 4N bytes, when converting char(N) to
other string types, we need to re-calculate the actual length
corresponding to N UTF-8 characters. We can optimize this in later
patches, e.g. store the UTF-8 length in the slot, or deal with UTF-8
char(N) by the same way as varchar(N), i.e. reallocate the string space
and just store the pointer and length in the slot.

TODO: correct codegen for varchar in text-converter.cc
TODO: Create varchar, char columns on text files with longer strings.
Verify truncating.
TODO: kudu hasn't supported CHAR(N) yet. Add individual tests for Kudu
VARCHAR types.
TODO: test writing CHAR(N)/VARCHAR(N) in UTF-8 mode.
TODO: revisit computing stats in UTF-8 mode.

Tests:
 - Add tests for reading char(N) and varchar(N) columns in UTF8_MODE.

Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f
---
M be/src/codegen/codegen-anyval.cc
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/llvm-codegen.cc
M be/src/exec/grouping-aggregator.cc
M be/src/exec/hdfs-avro-scanner-ir.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/kudu-scanner.cc
M be/src/exec/orc-column-readers.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.inline.h
M be/src/exprs/agg-fn-evaluator.cc
M be/src/exprs/anyval-util.cc
M be/src/exprs/anyval-util.h
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/scalar-expr-evaluator.cc
M be/src/exprs/scalar-fn-call.cc
M be/src/exprs/slot-ref.cc
M be/src/runtime/raw-value-ir.cc
M be/src/runtime/raw-value.cc
M be/src/runtime/raw-value.inline.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/fe-support.cc
M be/src/service/hs2-util.cc
M be/src/udf/udf-internal.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M be/src/util/CMakeLists.txt
M be/src/util/bit-util.h
M be/src/util/string-util-test.cc
M be/src/util/string-util.cc
M be/src/util/string-util.h
M be/src/util/tuple-row-compare.cc
M common/thrift/Types.thrift
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/analysis/SlotDescriptor.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M testdata/datasets/functional/functional_schema_template.sql
A testdata/workloads/functional-query/queries/QueryTest/utf8-chars-casting.test
A testdata/workloads/functional-query/queries/QueryTest/utf8-chars.test
M tests/query_test/test_utf8_strings.py
48 files changed, 590 insertions(+), 122 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/09/16909/8
-- 
To view, visit http://gerrit.cloudera.org:8080/16909
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f
Gerrit-Change-Number: 16909
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>