You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org> on 2022/06/01 16:14:21 UTC

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#11).

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................

IMPALA-9482 Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type
- Ran exhaustive tests.

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
74 files changed, 829 insertions(+), 370 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/11
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>