You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org> on 2020/06/11 12:00:42 UTC

[Impala-ASF-CR] WIP IMPALA-9482 Support for BINARY columns

Csaba Ringhofer has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16066


Change subject: WIP IMPALA-9482 Support for BINARY columns
......................................................................

WIP IMPALA-9482 Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

The reason for the WIP status is that:
- I plan to add more tests, mainly in FE.
- No successful exhaustive test run yet.
- There are a few differences compared to Hive:
  - In INSERT ... VALUES () string literals need to be explicitly
    cast to BINARY, while this is not needed in Hive.
  - UDF/UDAFs that expect STRING argument accept BINARY too, while in
    Hive explicit cast is needed in this case.
  - Hive doesn't calculate NDV during COMPUTE STATISTICS for BINARY
    columns. Impala still calculates it but  throws away the results.

I think that review of most parts can start in this state, so it makes
sense to upload it without the missing pieces.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added a basic coverage in FE/EE tests, will continue adding new
  ones while the patch is in WIP state.
- Ran core tests.

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M be/src/util/symbols-util.cc
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/impala_connection.py
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
55 files changed, 488 insertions(+), 301 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/2
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>

[Impala-ASF-CR] WIP IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: WIP IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 2:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/6284/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 11 Jun 2020 12:46:00 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 20:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/11132/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 20
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 10 Aug 2022 18:37:45 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 11:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8171/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Jun 2022 16:23:59 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 14:

(4 comments)

PS 14 extended the testdata and EE tests and also added some new functions: concat, murmur_hash (this was needed for multiple count distinct).

In the next patch I plan to fix the older comments but Gabor and Tim

http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@44
PS13, Line 44: import org.apache.impala.catalog.Type;
> nit: unused import?
Done


http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/catalog/Type.java
File fe/src/main/java/org/apache/impala/catalog/Type.java:

http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/catalog/Type.java@681
PS13, Line 681: if
> nit: need one space after "if"
Done


http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
File fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java@409
PS13, Line 409:     RewritesOk("cast(concat('a', 'b') as binary)", rule, "'ab'");
> nit: Can we also test concat() on binary literals? i.e.
The concat for binary was not added yet. Added it in PS 14 + some tests in expr-test.cc


http://gerrit.cloudera.org:8080/#/c/16066/13/testdata/workloads/functional-query/queries/QueryTest/udf.test
File testdata/workloads/functional-query/queries/QueryTest/udf.test:

http://gerrit.cloudera.org:8080/#/c/16066/13/testdata/workloads/functional-query/queries/QueryTest/udf.test@236
PS13, Line 236: bytes(
> bytes()? in case of utf8_mode=true.
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 13 Jun 2022 18:42:52 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#13).

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................

IMPALA-9482 Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type
- Ran exhaustive tests.

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveLegacyJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/JavaUdfDataType.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
M java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/load-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
83 files changed, 918 insertions(+), 373 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/13
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 13:

(4 comments)

Thanks for reviving this patch! I just went throught the FE changes and some tests and they LGTM. I'm going to dig into more details.

Not sure if I missed it, but do we have tests on writing BINARY columns? Assuming the data loading of functional_parquet.binary_tbl is the test on writing BINARY to parquet, we still need test on writing BINARY to text files.

http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@44
PS13, Line 44: import org.apache.impala.catalog.PrimitiveType;
nit: unused import?


http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/catalog/Type.java
File fe/src/main/java/org/apache/impala/catalog/Type.java:

http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/catalog/Type.java@681
PS13, Line 681: if
nit: need one space after "if"


http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
File fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java@409
PS13, Line 409:     RewritesOk("cast(concat('a', 'b') as binary)", rule, "'ab'");
nit: Can we also test concat() on binary literals? i.e.

 concat(cast('a' as binary), cast('b' as binary))


http://gerrit.cloudera.org:8080/#/c/16066/13/testdata/workloads/functional-query/queries/QueryTest/udf.test
File testdata/workloads/functional-query/queries/QueryTest/udf.test:

http://gerrit.cloudera.org:8080/#/c/16066/13/testdata/workloads/functional-query/queries/QueryTest/udf.test@236
PS13, Line 236: length
bytes()? in case of utf8_mode=true.



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 13 Jun 2022 08:29:15 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 22:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16066/22/tests/hs2/test_fetch.py
File tests/hs2/test_fetch.py:

http://gerrit.cloudera.org:8080/#/c/16066/22/tests/hs2/test_fetch.py@69
PS22, Line 69: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/22/tests/hs2/test_fetch.py@89
PS22, Line 89: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/22/tests/hs2/test_fetch.py@100
PS22, Line 100: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/22/tests/hs2/test_fetch.py@113
PS22, Line 113: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/22/tests/hs2/test_fetch.py@123
PS22, Line 123: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/22/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/22/tests/query_test/test_udfs.py@116
PS22, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 22
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 11 Aug 2022 09:46:34 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/8/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/8/tests/query_test/test_udfs.py@116
PS8, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 02 May 2022 09:26:31 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 14:

(11 comments)

Ps 15 is rebase + conflict resolution, ps 16 contains fixes for the comments.

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/hbase-table-writer.cc
File be/src/exec/hbase-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/hbase-table-writer.cc@161
PS14, Line 161:               data_len = encoded_value.length();
> I just realized we support writing to HBase. It worths adding tests for thi
Thanks for the comment, it turned out that the writer was buggy as 'base64_encoded_value' went out of scope while 'data' kept a pointer to its buffer.

Added a binary column to insertalltypesagg/insertalltypesaggbinary (only used by HBase tests) and added few tests to hbase-inserts.test

Also checked manually that Hive can read our data.


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/text-converter.cc
File be/src/exec/text-converter.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/text-converter.cc@359
PS14, Line 359:       return auxType.string_subtype != AuxColumnType::StringSubtype::BINARY;
> Add JIRA for codegen on reading BINARY from text files? I guess we just nee
Created a jira.


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256
PS14, Line 256:     return col_descs_[slot_desc->col_path().back()];
> Not sure if this is correct for nested types. Can we add tests on BINARY in
Added tests with binary in struct, binary in array was only tested manually so far.


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/types.cc
File be/src/runtime/types.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/types.cc@141
PS14, Line 141:       return TPrimitiveType::BINARY;
> Should we return INVALID_TYPE here because of DCHECK(false) so that we catc
Agree with returning INVALID_TYPE to improve error handling in release builds.


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc
File be/src/service/hs2-util.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc@872
PS14, Line 872:   AuxColumnType aux_type(columnType);
> If this is only used in the TYPE_STRING, perhaps it should be initialized t
Done


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc@950
PS14, Line 950:       type_entry.__set_type(thrift::TTypeId::STRING_TYPE);
> If DCHECK is false, do we want to set this to STRING_TYPE?  Or maybe we sho
There is no "invalid type" concept in the HS2 thrift api + this field is required, so I don't see a better idea here (maybe we could return some status in case of error, but that seems a bit too much effort for the unlikely case of not catching this scenario in debug builds.


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/impala-beeswax-server.cc
File be/src/service/impala-beeswax-server.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/impala-beeswax-server.cc@212
PS14, Line 212:     return "string";
> How come we don't return TypeToOdbcString here?
What do you mean for inlineing? It is also used in some tests, so we can't keep it in this file.

About returning TypeToOdbcString - while odbc does support binary type, my guess is that existing beeswax clients don't, so it seemed safer to return "string"


http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java
File fe/src/main/java/org/apache/impala/analysis/CastExpr.java:

http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@187
PS14, Line 187:         // No built-in function needed for BINARY <-> STRING conversion.
> I think the comment should also reflect other datatypes versus binary are n
Done


http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@357
PS14, Line 357:     boolean isStringBinaryCast =
> I may be understanding this wrong...
Added a precondition to ensure that if one is binary, the other must be string.


http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
File fe/src/main/java/org/apache/impala/analysis/LikePredicate.java:

http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java@116
PS14, Line 116:   private static boolean isLikeableType(Type type) {
> No real comment here. But I feel the other types might get upset for not be
:D poor other types - tbh I don't remember whether the naming was intentional


http://gerrit.cloudera.org:8080/#/c/16066/14/testdata/workloads/functional-query/queries/QueryTest/binary-type.test
File testdata/workloads/functional-query/queries/QueryTest/binary-type.test:

http://gerrit.cloudera.org:8080/#/c/16066/14/testdata/workloads/functional-query/queries/QueryTest/binary-type.test@68
PS14, Line 68:   where binary_col_with_nulls = cast("01/02/09" as binary)
> This is identical to the above query.
oops, I think I wanted to test > here



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 03 Aug 2022 18:10:42 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 23:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16066/23/tests/hs2/test_fetch.py
File tests/hs2/test_fetch.py:

http://gerrit.cloudera.org:8080/#/c/16066/23/tests/hs2/test_fetch.py@69
PS23, Line 69: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/23/tests/hs2/test_fetch.py@89
PS23, Line 89: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/23/tests/hs2/test_fetch.py@100
PS23, Line 100: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/23/tests/hs2/test_fetch.py@113
PS23, Line 113: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/23/tests/hs2/test_fetch.py@123
PS23, Line 123: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/23/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/23/tests/query_test/test_udfs.py@116
PS23, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 23
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Aug 2022 12:15:08 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 26: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 26
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Aug 2022 13:55:41 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-9482 Support for BINARY columns

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: WIP IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 2:

(5 comments)

Did a first pass. Only had some questions and a few nits.

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-scan-node.cc
File be/src/exec/hbase-scan-node.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-scan-node.cc@49
PS2, Line 49: NULL
nit: nullptr? I see there's a lot of NULLs in the file, so maybe just update the NULLs in this init-list?


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-table-writer.cc
File be/src/exec/hbase-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-table-writer.cc@159
PS2, Line 159: Base64Encode
Why do we encode it for HBase?


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-rcfile-scanner.cc
File be/src/exec/hdfs-rcfile-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-rcfile-scanner.cc@593
PS2, Line 593:         const ColumnDescriptor& col_desc =
             :             scan_node_->hdfs_table()->GetColumnDesc(slot_desc);
Wouldn't it be better to have a BINARY type in the BE so we could save passing 'col_desc' all the time? Maybe it would be a cleaner design overall.

What will happen when we start supporting UTF-8 strings?

But I'm OK with it if that'd require too much changes.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.h
File be/src/runtime/types.h:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.h@67
PS2, Line 67: Aux
nit: 'Aux' is not too descriptive. Probably EncodedColumnType/StoredColumnType?


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/service/hs2-util.cc
File be/src/service/hs2-util.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/service/hs2-util.cc@289
PS2, Line 289: Same as for STRING with the exception of using a different
             : // Thrift field.
nit: refactor to a common code that takes the Thrift field as parameter?



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 16 Jun 2020 16:42:13 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has uploaded a new patch set (#6). ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................

IMPALA-9482 Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type
- Ran exhaustive tests.

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
72 files changed, 748 insertions(+), 363 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/6
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10519/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 02 May 2022 09:45:27 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 14:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/14/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/14/tests/query_test/test_udfs.py@116
PS14, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 13 Jun 2022 18:38:29 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Steve Carlin (Code Review)" <ge...@cloudera.org>.
Steve Carlin has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 14:

(7 comments)

Hopefully these comments are helpful.  Logic seems good, these are all pretty much nits.

Thanks!

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/types.cc
File be/src/runtime/types.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/types.cc@141
PS14, Line 141:       return TPrimitiveType::BINARY;
Should we return INVALID_TYPE here because of DCHECK(false) so that we catch the problem earlier on?


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc
File be/src/service/hs2-util.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc@872
PS14, Line 872:   AuxColumnType aux_type(columnType);
If this is only used in the TYPE_STRING, perhaps it should be initialized there.


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc@950
PS14, Line 950:       type_entry.__set_type(thrift::TTypeId::STRING_TYPE);
If DCHECK is false, do we want to set this to STRING_TYPE?  Or maybe we should set it to INVALID_TYPE?  Not sure what is best.


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/impala-beeswax-server.cc
File be/src/service/impala-beeswax-server.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/impala-beeswax-server.cc@212
PS14, Line 212:     return "string";
How come we don't return TypeToOdbcString here?

(sidenote, I'm rusty on my C++, but why is TypeToOdbcString not inlined?)


http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java
File fe/src/main/java/org/apache/impala/analysis/CastExpr.java:

http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@187
PS14, Line 187:         // No built-in function needed for BINARY <-> STRING conversion.
I think the comment should also reflect other datatypes versus binary are not in the support matrix.


http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@357
PS14, Line 357:     boolean isStringBinaryCast =
I may be understanding this wrong...

This is slightly inconsistent with the above code.  In the above code, we assume that if one is binary, the other must be either binary or string.  Do we need to check explicitly here that the other type is string?  I understand why, but then maybe we should have a Preconditions check sayin git is either of type string or binary?

Not sure what is best here.


http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
File fe/src/main/java/org/apache/impala/analysis/LikePredicate.java:

http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java@116
PS14, Line 116:   private static boolean isLikeableType(Type type) {
No real comment here. But I feel the other types might get upset for not being liked.



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 14 Jul 2022 00:01:00 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 19:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/11131/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 19
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 10 Aug 2022 18:36:28 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#17).

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveLegacyJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/JavaUdfDataType.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
M java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/hbase-inserts.test
M testdata/workloads/functional-query/queries/QueryTest/java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/load-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/struct-in-select-list.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
88 files changed, 1,164 insertions(+), 399 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/17
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 17
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 18:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16066/18/tests/hs2/test_fetch.py
File tests/hs2/test_fetch.py:

http://gerrit.cloudera.org:8080/#/c/16066/18/tests/hs2/test_fetch.py@69
PS18, Line 69: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/18/tests/hs2/test_fetch.py@89
PS18, Line 89: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/18/tests/hs2/test_fetch.py@100
PS18, Line 100: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/18/tests/hs2/test_fetch.py@113
PS18, Line 113: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/18/tests/hs2/test_fetch.py@123
PS18, Line 123: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/18/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/18/tests/query_test/test_udfs.py@116
PS18, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 18
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 05 Aug 2022 10:44:48 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 15:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/11085/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 15
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 03 Aug 2022 13:51:05 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#16).

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveLegacyJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/JavaUdfDataType.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
M java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/hbase-inserts.test
M testdata/workloads/functional-query/queries/QueryTest/java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/load-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/struct-in-select-list.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
88 files changed, 1,163 insertions(+), 399 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/16
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 16
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#19).

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveLegacyJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/JavaUdfDataType.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
M java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/hbase-inserts.test
M testdata/workloads/functional-query/queries/QueryTest/java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/load-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
89 files changed, 1,190 insertions(+), 389 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/19
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 19
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 22:

(1 comment)

Some tests were broken by recent changes related fixing comments, ps 21 fixes those (and ps 22 fixes some whitespace issues)

http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java
File fe/src/main/java/org/apache/impala/analysis/CastExpr.java:

http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@357
PS14, Line 357:     Preconditions.checkState(!childType.isNull());
> Added a precondition to ensure that if one is binary, the other must be str
This change actually broke a number of tests, for example when casting NULL to binary or tests the rely on an exact error message about the invalid cast.

Changed the logic to do noop cast only in case of binary<->string cast, and let the subsequent logic catch the error that there is no casting functions to/from other types.



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 22
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 11 Aug 2022 09:47:05 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 25:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/11189/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 25
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Aug 2022 07:12:28 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#8).

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................

IMPALA-9482 Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type
- Ran exhaustive tests.

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
74 files changed, 768 insertions(+), 348 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/8
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 8:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8073/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 02 May 2022 09:26:22 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6920/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 13 Aug 2020 19:25:18 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: WIP IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 3:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/16066/3/be/src/exec/hbase-table-writer.cc
File be/src/exec/hbase-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16066/3/be/src/exec/hbase-table-writer.cc@157
PS3, Line 157:             if (col_desc.auxType().string_subtype == AuxColumnType::StringSubtype::BINARY) {
line too long (92 > 90)


http://gerrit.cloudera.org:8080/#/c/16066/3/be/src/exec/hdfs-text-table-writer.cc
File be/src/exec/hdfs-text-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16066/3/be/src/exec/hdfs-text-table-writer.cc@103
PS3, Line 103:             const ColumnDescriptor& col_desc = table_desc_->col_descs()[num_partition_cols + j];
line too long (96 > 90)


http://gerrit.cloudera.org:8080/#/c/16066/3/be/src/exec/hdfs-text-table-writer.cc@105
PS3, Line 105:             if (col_desc.auxType().string_subtype ==  AuxColumnType::StringSubtype::BINARY) {
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/16066/3/be/src/exec/hdfs-text-table-writer.cc@107
PS3, Line 107:               Base64Encode(string_value->ptr , string_value->len, &rowbatch_stringstream_);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/16066/3/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/16066/3/tests/query_test/test_scanners.py@1604
PS3, Line 1604: class TestBinaryType(ImpalaTestSuite):
flake8: E302 expected 2 blank lines, found 1


http://gerrit.cloudera.org:8080/#/c/16066/3/tests/query_test/test_scanners.py@1617
PS3, Line 1617: 
flake8: W292 no newline at end of file


http://gerrit.cloudera.org:8080/#/c/16066/3/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/3/tests/query_test/test_udfs.py@116
PS3, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 13 Aug 2020 17:24:58 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 14:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/hbase-table-writer.cc
File be/src/exec/hbase-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/hbase-table-writer.cc@161
PS14, Line 161:               data_len = encoded_value.length();
I just realized we support writing to HBase. It worths adding tests for this.


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/text-converter.cc
File be/src/exec/text-converter.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/text-converter.cc@359
PS14, Line 359:       return auxType.string_subtype != AuxColumnType::StringSubtype::BINARY;
Add JIRA for codegen on reading BINARY from text files? I guess we just need the codegen of base64 decoding.


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256
PS14, Line 256:     return col_descs_[slot_desc->col_path().back()];
Not sure if this is correct for nested types. Can we add tests on BINARY inside nested types?


http://gerrit.cloudera.org:8080/#/c/16066/14/testdata/workloads/functional-query/queries/QueryTest/binary-type.test
File testdata/workloads/functional-query/queries/QueryTest/binary-type.test:

http://gerrit.cloudera.org:8080/#/c/16066/14/testdata/workloads/functional-query/queries/QueryTest/binary-type.test@68
PS14, Line 68:   where binary_col_with_nulls = cast("01/02/09" as binary)
This is identical to the above query.



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 15 Jun 2022 03:17:24 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/11/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/11/tests/query_test/test_udfs.py@116
PS11, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Jun 2022 16:15:22 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 11: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/8171/


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Jun 2022 20:48:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 21:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/11135/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 21
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 11 Aug 2022 09:57:15 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#20).

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveLegacyJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/JavaUdfDataType.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
M java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/hbase-inserts.test
M testdata/workloads/functional-query/queries/QueryTest/java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/load-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
89 files changed, 1,190 insertions(+), 389 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/20
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 20
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 26:

(6 comments)

Thanks for the comments!
PS 25 was a rebase that included https://gerrit.cloudera.org/#/c/18815/ , which is a major refactoring that moved several h/cc files.

PS 26 tries to fix the comments.

http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/exec/orc-metadata-utils.cc
File be/src/exec/orc-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/exec/orc-metadata-utils.cc@477
PS24, Line 477: 
> Could you add a comment about why we don't need TYPE_BINARY here?
added a comment - note that file was moved during the rebase


http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/runtime/descriptors.h@256
PS24, Line 256: 0]];
> nit: front() or [0] seems better (as line 248 uses). Same comment to line 2
Done


http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
File java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java:

http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@40
PS24, Line 40: import org.apache.hadoop.io.Writable;
> duplicated line
Done


http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@319
PS24, Line 319: t " + 
> BytesWritable
Done


http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@321
PS24, Line 321: get
> nit: get() is deprecated by getBytes().
Done


http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@326
PS24, Line 326: nary.s
> nit: handle null 'result' when 'inputs' is empty.
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 26
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Aug 2022 08:16:26 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 26:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8473/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 26
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Aug 2022 09:07:18 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 24:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/11184/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 24
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Aug 2022 14:53:48 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 25:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/25/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/25/tests/query_test/test_udfs.py@116
PS25, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 25
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Aug 2022 06:52:39 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 18:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256
PS14, Line 256:     return col_descs_[slot_desc->col_path().back()];
> Hmm, I realized that I have to think nested binaries through more carefully
Checked how Hive handles this, it seems to utf8 encode the nested binary with replace:
select named_struct("b", unhex("00112233445566778899AABBCCDDEEFF"));
result: {"b":"3DUfw��������}

select array(unhex("00112233445566778899AABBCCDDEEFF"));
result: ["3DUfw��������] 

A weird behavior around nested binary is that it is not quoted (unlike string):
select named_struct("s", "a", "b", cast("a" as binary));
{"s":"a","b":a}

I have also checked what happens when we read these back from Impala:
- if we return the whole struct, the non-valid utf8 binary is replaced with an empty string
- if we return the whole array, it is returned escaped:
"\000\021\"3DUfw��������"]
- if the member/item is returned directly, we print it the same way as Hive
- if we hex() the result back, it returns it correctly, so Hive does not write it in a lossy way

I am thinking about not allowing nested binary for now, and enable them once the struct/array printing functions are unified + I have verified whether not quoting them is intentional in Hive.



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 18
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 05 Aug 2022 11:22:57 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 18:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/11105/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 18
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 05 Aug 2022 11:04:27 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 16:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/11100/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 16
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 04 Aug 2022 19:28:02 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 17:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256
PS14, Line 256:     return col_descs_[slot_desc->col_path().back()];
> Added tests with binary in struct, binary in array was only tested manually
Hmm, I realized that I have to think nested binaries through more carefully and check how Hive deals with them - the issue is that complex types are expected to be returned as utf8 strings, and I don't know what to do with a nested BINARY with special characters - probably they are handled by escaping them.



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 17
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 04 Aug 2022 15:18:12 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#21).

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveLegacyJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/JavaUdfDataType.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
M java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/hbase-inserts.test
M testdata/workloads/functional-query/queries/QueryTest/java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/load-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
89 files changed, 1,196 insertions(+), 389 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/21
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 21
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#23).

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveLegacyJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/JavaUdfDataType.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
M java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/hbase-inserts.test
M testdata/workloads/functional-query/queries/QueryTest/java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/load-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-basic.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
90 files changed, 1,238 insertions(+), 389 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/23
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 23
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 23:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/bin/generate-schema-statements.py@222
PS22, Line 222: 'BINARY': 'bytes'
> nit: it'd be nice to add a trailing comma so future changes don't need to t
Done


http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/datasets/functional/functional_schema_template.sql@3532
PS22, Line 3532: binary_in_complex_types
> Can we add some data to this table and add some e2e tests? e.g.
Added minimal data and these simple tests.



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 23
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Aug 2022 12:20:43 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 8: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/8073/


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 02 May 2022 13:51:50 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 9:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10521/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 9
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 02 May 2022 15:51:23 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 6:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/6919/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 13 Aug 2020 18:21:06 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: WIP IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 3:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/6918/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 13 Aug 2020 17:45:28 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 7:

(1 comment)

I only checked the FE part of the patch. Will continue with the rest tomorrow.

http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java:

http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@989
PS7, Line 989: // BINARY is only supported for count + min + max.
Lucky that those were the first 3 operators in the for-loop's body.
What if later someone adds another function above this line? Let's consider adding these functions for binary after the for loop similarly to ds_hll_union (or avg).



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 25 Aug 2020 15:10:02 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] WIP IMPALA-9482 Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#3).

Change subject: WIP IMPALA-9482 Support for BINARY columns
......................................................................

WIP IMPALA-9482 Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

The reason for the WIP status is that:
- I plan to add more tests, mainly in FE.
- No successful exhaustive test run yet.
- There are a few differences compared to Hive:
  - In INSERT ... VALUES () string literals need to be explicitly
    cast to BINARY, while this is not needed in Hive.
  - UDF/UDAFs that expect STRING argument accept BINARY too, while in
    Hive explicit cast is needed in this case.
  - Hive doesn't calculate NDV during COMPUTE STATISTICS for BINARY
    columns. Impala still calculates it but  throws away the results.

I think that review of most parts can start in this state, so it makes
sense to upload it without the missing pieces.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added a basic coverage in FE/EE tests, will continue adding new
  ones while the patch is in WIP state.
- Ran core tests.

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M be/src/util/symbols-util.cc
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/impala_connection.py
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
56 files changed, 503 insertions(+), 309 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/3
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] WIP IMPALA-9482 Support for BINARY columns

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: WIP IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@18
PS2, Line 18:   - UDF/UDAFs that expect STRING argument accept BINARY too, while in
Are we planning to change this? I think implicit casts from BINARY to STRING are a can of worms that we should really avoid, cause it'll cause major headaches when we start treating STRING as UTF-8.

With CHAR, a similar thing of automatically promoting to string was attempted to get the builtin functions "for free" but it didn't really work out because the semantics aren't actually the same.


http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@52
PS2, Line 52: - Added a basic coverage in FE/EE tests, will continue adding new
I think we want a lot more coverage in analysis tests, to make sure that we're doing all the implicit casts correctly (e.g. when invoking functions with BINARY and a different type).


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-rcfile-scanner.cc
File be/src/exec/hdfs-rcfile-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-rcfile-scanner.cc@593
PS2, Line 593:         const ColumnDescriptor& col_desc =
             :             scan_node_->hdfs_table()->GetColumnDesc(slot_desc);
> Wouldn't it be better to have a BINARY type in the BE so we could save pass
Yeah I wondered about this too. I think we should consider this in the context of UTF-8 and what we'd want to do there.

I think the situation for UTF-8 is actually similar, in that most of the string operations are not changed, and most of the changes would be in the frontend, to call a UTF-8-aware version of a function instead of the current ASCII function.

UTF-8 is also orthogonal to CHAR/VARCHAR/STRING, so having the encoding auxiliary to the type does have some advantages.

It's similar in that the writers and readers need to know about it to add file metadata and validate the data they're reading/writing.

So I think the infrastructure for auxiliary column info that this patch adds is actually useful for future work.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc
File be/src/runtime/types.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@185
PS2, Line 185:   PrimitiveType primitiveType = ThriftToType(col_type);
nit: here and elsewhere, I'm seeing a bunch of camelCase variable names in C++ where it should be either CamelCase or separated_by_underscores.



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 16 Jun 2020 17:24:22 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/text-converter.cc
File be/src/exec/text-converter.cc:

http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/text-converter.cc@355
PS7, Line 355: switch (col_type.type) {
             :     case TYPE_CHAR:
             :       return false;
             :     case TYPE_STRING:
             :       return auxType.string_subtype != AuxColumnType::StringSubtype::BINARY;
             :     default:
             :       return true;
This could be simplified a bit in my opinion:
if (col_type.type == TYPE_STRING) return auxType... != BINARY;
return col_type.type != TYPE_CHAR;


http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exprs/utility-functions-ir.cc
File be/src/exprs/utility-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exprs/utility-functions-ir.cc@209
PS7, Line 209: const StringVal& /*input_val*/
Is this part needed?



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 26 Aug 2020 09:20:16 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 20:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16066/20/tests/hs2/test_fetch.py
File tests/hs2/test_fetch.py:

http://gerrit.cloudera.org:8080/#/c/16066/20/tests/hs2/test_fetch.py@69
PS20, Line 69: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/20/tests/hs2/test_fetch.py@89
PS20, Line 89: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/20/tests/hs2/test_fetch.py@100
PS20, Line 100: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/20/tests/hs2/test_fetch.py@113
PS20, Line 113: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/20/tests/hs2/test_fetch.py@123
PS20, Line 123: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/20/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/20/tests/query_test/test_udfs.py@116
PS20, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 20
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 10 Aug 2022 18:10:21 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 20:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256
PS14, Line 256:     return col_descs_[slot_desc->col_path().back()];
> Checked how Hive handles this, it seems to utf8 encode the nested binary wi
Change the analyzer to disallow complex types in select list if they have binary members. Created follow up Jira IMPALA-11491

Also added a test tables + FE tests to check this behavior.



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 20
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 10 Aug 2022 18:12:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#24).

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveLegacyJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/JavaUdfDataType.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
M java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/hbase-inserts.test
M testdata/workloads/functional-query/queries/QueryTest/java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/load-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-basic.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
90 files changed, 1,238 insertions(+), 389 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/24
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 24
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#11).

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................

IMPALA-9482 Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type
- Ran exhaustive tests.

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
74 files changed, 829 insertions(+), 370 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/11
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#15).

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveLegacyJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/JavaUdfDataType.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
M java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/load-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
86 files changed, 1,119 insertions(+), 375 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/15
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 15
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 12:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10721/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Jun 2022 15:31:04 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/13/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/13/tests/query_test/test_udfs.py@116
PS13, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 10 Jun 2022 18:42:00 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 13:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10753/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 10 Jun 2022 19:00:01 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-9482 Support for BINARY columns

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: WIP IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc
File be/src/runtime/types.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@122
PS2, Line 122: ToThrift(PrimitiveType ptype)
Should this function work as the inverse of ThriftToType() ?
If so, shouldn't it take a AuxColumnType parameter as well?


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@222
PS2, Line 222: ColumnType::ToThrift(TColumnType* thrift_type)
Same as above, maybe it needs now an additional AuxColumnType parameter?


http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java
File fe/src/main/java/org/apache/impala/analysis/CastExpr.java:

http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@126
PS2, Line 126:        // No built-in function needed for BINARY <-> STRING conversion.
             :         if (fromType.getPrimitiveType() == PrimitiveType.BINARY ||
             :             toType.getPrimitiveType() == PrimitiveType.BINARY){
             :           continue;
             :         }
BINARY<->STRING conversions are no-op conversins, so maybe this block should be moved after L191.


http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/catalog/Type.java
File fe/src/main/java/org/apache/impala/catalog/Type.java:

http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/catalog/Type.java@820
PS2, Line 820: // STRING <->  BINARY conversion is not lossy, but implicit cast is not allowed.
I'm probably misunderstanding something but the commit msg suggests that BINARY to STRING implicit conversion is supported.

"UDF/UDAFs that expect STRING argument accept BINARY too, while in
    Hive explicit cast is needed in this case."

Please clarify.


http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
File fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java:

http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java@154
PS2, Line 154: case BINARY: return Schema.create(Schema.Type.STRING);
I'm not sure about this, maybe binary should be converted to avro bytes. The avro documentation on primitive types states that:
bytes: sequence of 8-bit unsigned bytes
string: unicode character sequence


http://gerrit.cloudera.org:8080/#/c/16066/2/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/16066/2/testdata/bin/generate-schema-statements.py@213
PS2, Line 213: string
Again, avro has bytes type which might be a better fit for binary.



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Jun 2020 15:24:12 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/service/hs2-util.cc
File be/src/service/hs2-util.cc:

http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/service/hs2-util.cc@273
PS7, Line 273: static void StringExprValuesToHS2TColumnHelper(ScalarExprEvaluator* expr_eval, 
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/16066/7/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/7/tests/query_test/test_udfs.py@116
PS7, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 13 Aug 2020 19:05:08 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] WIP IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: WIP IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-table-writer.cc
File be/src/exec/hbase-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-table-writer.cc@157
PS2, Line 157:             if (col_desc.auxType().string_subtype == AuxColumnType::StringSubtype::BINARY) {
line too long (92 > 90)


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc
File be/src/exec/hdfs-text-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc@103
PS2, Line 103:             const ColumnDescriptor& col_desc = table_desc_->col_descs()[num_partition_cols + j];
line too long (96 > 90)


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc@105
PS2, Line 105:             if (col_desc.auxType().string_subtype ==  AuxColumnType::StringSubtype::BINARY) {
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc@107
PS2, Line 107:               Base64Encode(string_value->ptr , string_value->len, &rowbatch_stringstream_);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/16066/2/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/16066/2/tests/query_test/test_scanners.py@1535
PS2, Line 1535: class TestBinaryType(ImpalaTestSuite):
flake8: E302 expected 2 blank lines, found 1


http://gerrit.cloudera.org:8080/#/c/16066/2/tests/query_test/test_scanners.py@1548
PS2, Line 1548: 
flake8: W292 no newline at end of file


http://gerrit.cloudera.org:8080/#/c/16066/2/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/2/tests/query_test/test_udfs.py@115
PS2, Line 115: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 11 Jun 2020 12:01:30 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#9).

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................

IMPALA-9482 Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type
- Ran exhaustive tests.

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
74 files changed, 764 insertions(+), 348 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/9
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 9
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/9/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/9/tests/query_test/test_udfs.py@116
PS9, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 9
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 02 May 2022 15:33:02 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#18).

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveLegacyJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/JavaUdfDataType.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
M java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/hbase-inserts.test
M testdata/workloads/functional-query/queries/QueryTest/java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/load-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/struct-in-select-list.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
88 files changed, 1,163 insertions(+), 399 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/18
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 18
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 21:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/main/java/org/apache/impala/analysis/CastExpr.java
File fe/src/main/java/org/apache/impala/analysis/CastExpr.java:

http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@368
PS21, Line 368: 	  noOp_ = true;
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@369
PS21, Line 369: 	  return;
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@370
PS21, Line 370: 	}
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
File fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java@4278
PS21, Line 4278: 		"cast(string_col as binary), bool_col, " +
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
File fe/src/test/java/org/apache/impala/catalog/CatalogTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java@632
PS21, Line 632: 	// NDV is not calculated for BINARY columns
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java@633
PS21, Line 633: 	assertFalse(binaryCol.getStats().hasNumDistinctValues());
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
File fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java@243
PS21, Line 243: 		stats.get(0).toString());
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/16066/21/tests/hs2/test_fetch.py
File tests/hs2/test_fetch.py:

http://gerrit.cloudera.org:8080/#/c/16066/21/tests/hs2/test_fetch.py@69
PS21, Line 69: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/21/tests/hs2/test_fetch.py@89
PS21, Line 89: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/21/tests/hs2/test_fetch.py@100
PS21, Line 100: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/21/tests/hs2/test_fetch.py@113
PS21, Line 113: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/21/tests/hs2/test_fetch.py@123
PS21, Line 123: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/21/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/21/tests/query_test/test_udfs.py@116
PS21, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 21
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 11 Aug 2022 09:36:04 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 15:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16066/15/tests/hs2/test_fetch.py
File tests/hs2/test_fetch.py:

http://gerrit.cloudera.org:8080/#/c/16066/15/tests/hs2/test_fetch.py@69
PS15, Line 69: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/15/tests/hs2/test_fetch.py@89
PS15, Line 89: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/15/tests/hs2/test_fetch.py@100
PS15, Line 100: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/15/tests/hs2/test_fetch.py@113
PS15, Line 113: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/15/tests/hs2/test_fetch.py@123
PS15, Line 123: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/15/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/15/tests/query_test/test_udfs.py@116
PS15, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 15
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 03 Aug 2022 13:30:46 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 7:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/16066/7//COMMIT_MSG
Commit Message:

PS7: 
I think you're missing some handling for the parquet_annotate_strings_utf8 option - it looks like we might set this for BINARY types, which is wrong.


http://gerrit.cloudera.org:8080/#/c/16066/7//COMMIT_MSG@29
PS7, Line 29: - HS2/Beeswax service
Do we need to update clients? Looks like impala-shell might have some logic for HS2 but not sure that it's tested.


http://gerrit.cloudera.org:8080/#/c/16066/7//COMMIT_MSG@55
PS7, Line 55: - Added FE/EE tests mainly based on the ones added to the DATE type
Can we add some more sanity tests for different parts of query processing. I think the fact that it's handled as string internally reduces the amount of testing required, but some sanity checks would be good. E.g. a join on BINARY, a group by on aggregation and a sort by BINARY?


http://gerrit.cloudera.org:8080/#/c/16066/7//COMMIT_MSG@56
PS7, Line 56: - Ran exhaustive tests.
Can we also add a shell test?


http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/hbase-scan-node.h
File be/src/exec/hbase-scan-node.h:

http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/hbase-scan-node.h@84
PS7, Line 84:   /// Descriptor of the HBase table.
Can you mention it's set in Prepare()


http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/hdfs-text-table-writer.cc
File be/src/exec/hdfs-text-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/hdfs-text-table-writer.cc@107
PS7, Line 107:               // TODO: try to find a more efficient imlementation
Maybe remove the TODO? Or file a JIRA? Not sure that it adds much value here.


http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/text-converter.h
File be/src/exec/text-converter.h:

http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/text-converter.h@65
PS7, Line 65: auxType
aux_type here and elsewhere


http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/text-converter.cc
File be/src/exec/text-converter.cc:

http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/text-converter.cc@353
PS7, Line 353: bool TextConverter::SupportsCodegenWriteSlot(const ColumnType& col_type,
I don't love adding more logic without corresponding codegen, we spent so much time filling in gaps in the past and it feels bad to add more back in.


http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/runtime/descriptors.h@209
PS7, Line 209: (
aux_type


http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@259
PS7, Line 259:         // NDV is not calculated for BINARY columns (similarly to Hive).
This seems kinda arbitrary - is there a real reason not to calculate it? I guess not a big deal.


http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
File fe/src/main/java/org/apache/impala/analysis/LikePredicate.java:

http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java@116
PS7, Line 116:   private static boolean isLikeableType(Type type) {
lol at the function name (no need to change).


http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
File fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java@192
PS7, Line 192: C
nit: Column


http://gerrit.cloudera.org:8080/#/c/16066/7/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/16066/7/tests/query_test/test_scanners.py@1611
PS7, Line 1611:   def add_test_dimensions(cls):
It'd be good to run this for HS2 as well.



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 04 Sep 2020 04:20:59 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#12).

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................

IMPALA-9482 Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type
- Ran exhaustive tests.

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveLegacyJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/JavaUdfDataType.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
M java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/load-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
83 files changed, 918 insertions(+), 373 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/12
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 26:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/26/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/26/tests/query_test/test_udfs.py@116
PS26, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 26
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Aug 2022 08:13:01 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 26: Code-Review+2

Thanks for the great work!


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 26
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Aug 2022 08:47:57 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 24:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256
PS14, Line 256:     return col_descs_[slot_desc->col_path().back()];
> Thanks for digging into this!
It turned out that while function would not work for nested columns, we do not use it at any place where this could be a problem. Changed the dcheck to ensure that the type is not nested (the path's size is 1).



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 24
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Aug 2022 14:36:05 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 6:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16066/6/be/src/service/hs2-util.cc
File be/src/service/hs2-util.cc:

http://gerrit.cloudera.org:8080/#/c/16066/6/be/src/service/hs2-util.cc@273
PS6, Line 273: static void StringExprValuesToHS2TColumnHelper(ScalarExprEvaluator* expr_eval, RowBatch* batch,
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/16066/6/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
File fe/src/test/java/org/apache/impala/catalog/CatalogTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/6/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java@624
PS6, Line 624:     HdfsTable binaryTable = (HdfsTable) catalog_.getOrLoadTable("functional", "binary_tbl",
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/16066/6/fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
File fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/6/fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java@433
PS6, Line 433:         "WITH SERDEPROPERTIES ('hbase.columns.mapping'=':key,d:string_col,d:binary_col', " +
line too long (92 > 90)


http://gerrit.cloudera.org:8080/#/c/16066/6/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/6/tests/query_test/test_udfs.py@116
PS6, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 13 Aug 2020 18:00:18 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 6:

(23 comments)

Thanks for the comments and sorry for the huge time before resolving them!

Patch 3 is just a rebase with some conflict resolution.

http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@18
PS2, Line 18: - No NDV is calculated during COMPUTE STATISTICS.
> Are we planning to change this? I think implicit casts from BINARY to STRIN
I have changed the casting rules to always need explicit casting between STRING and BINARY. Added more info in the commit message (line 42).


http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@47
PS2, Line 47: INSERT 
> Do you plan to do interop tests between Hive and Impala?
We generally don't seem to have such tests, so I didn't add it here. I think it would be great to add some simple tests that test all types and fileformats with writer as Hive and reader as Impala + vica versa, but I would do this in jira.

Note that for some file formats the Hive writer -> Impala reader path is tested as Hive writes the tables during dataload.


http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@52
PS2, Line 52:   to test scanning.
> I think we want a lot more coverage in analysis tests, to make sure that we
I have added much more FE tests (similar ones as Attila added for the Date type) + a few to specifically check casts.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-scan-node.cc
File be/src/exec/hbase-scan-node.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-scan-node.cc@49
PS2, Line 49: null
> nit: nullptr? I see there's a lot of NULLs in the file, so maybe just updat
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-table-writer.cc
File be/src/exec/hbase-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-table-writer.cc@157
PS2, Line 157:             if (col_desc.auxType().IsBinaryStringSubtype()) {
> line too long (92 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-table-writer.cc@159
PS2, Line 159: Base64Encode
> Why do we encode it for HBase?
This how Hive does it, BINARY columns are base64 encoded as in text files.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-rcfile-scanner.cc
File be/src/exec/hdfs-rcfile-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-rcfile-scanner.cc@593
PS2, Line 593:         const AuxColumnType& aux_type =
             :             scan_node_->hdfs_table()->GetColumnDesc(slot_desc).
> Yeah I wondered about this too. I think we should consider this in the cont
Yes, I was thinking similarly as what Tim wrote - it would be much nicer if string encoding could be handled mainly in the frontend and the backend would just need to think about byte arrays. Probably VARCHAR could be also handled this way, as AFAIK most code do not enforce the length limit, only scanners/writers/client handlers need to deal with it.

Another type that could probably profit from AuxType is timestamp, if we'll implement TIMESTAMP WITH (LOCAL) TIMEZONE. It should be probably the same data type underneath, but the frontend could pick different functions where the conversion matters + scanners would also need the info as different files may be written in different timezones.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc
File be/src/exec/hdfs-text-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc@103
PS2, Line 103:             const ColumnDescriptor& col_desc =
> line too long (96 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc@105
PS2, Line 105:             const StringValue* string_value = reinterpret_cast<const StringValue*>(value);
> line too long (93 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc@107
PS2, Line 107:               // TODO: try to find a more efficient imlementation
> line too long (91 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.h
File be/src/runtime/types.h:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.h@67
PS2, Line 67: Aux
> nit: 'Aux' is not too descriptive. Probably EncodedColumnType/StoredColumnT
I didn't change it in the last patch to avoid the additional noise, but I am not attached to the current name.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc
File be/src/runtime/types.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@122
PS2, Line 122: ToThrift(PrimitiveType ptype)
> Should this function work as the inverse of ThriftToType() ?
It makes sense to create a version which also uses AuxType, but this was not needed so far. Where the backend needs to send the full type info somewhere as thrift, it seems to have the original thrift structures so no reverse conversion is needed.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@180
PS2, Line 180:   DCHECK_EQ(1, type.types.size());
> nit: I'd move this comment right before the switch to L186.
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@185
PS2, Line 185:   AuxColumnType aux_type(type);
> nit: here and elsewhere, I'm seeing a bunch of camelCase variable names in 
done here and hopefully everywhere where it's needed


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@222
PS2, Line 222: ColumnType::ToThrift(TColumnType* thrift_type)
> Same as above, maybe it needs now an additional AuxColumnType parameter?
Same as above.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/service/hs2-util.cc
File be/src/service/hs2-util.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/service/hs2-util.cc@289
PS2, Line 289: HS2TColumn(ScalarExprEvaluator* expr_eval, RowBatch* batch,
             :     int start_id
> nit: refactor to a common code that takes the Thrift field as parameter?
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/service/hs2-util.cc@729
PS2, Line 729: se TYPE_BINARY:
             :     default:
> Shouldn't you change this comment to reflect that now BINARY can also trigg
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java
File fe/src/main/java/org/apache/impala/analysis/CastExpr.java:

http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@126
PS2, Line 126:        // Disable casting from string to boolean
             :         if (fromType.isStringType() && toType.isBoolean()) continue;
             :         // Casting from date is only allowed when to-type is timestamp or string.
             :         if (fromType.isDate() && !toType.isTimestamp() && !toType.isStringType()) {
             :          
> BINARY<->STRING conversions are no-op conversins, so maybe this block shoul
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/catalog/Type.java
File fe/src/main/java/org/apache/impala/catalog/Type.java:

http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/catalog/Type.java@820
PS2, Line 820: // STRING <->  BINARY conversion is not lossy, but implicit cast is not allowed 
> I'm probably misunderstanding something but the commit msg suggests that BI
I added more info about the casts to the commit message. The last patch set compatibility of BINARY <-> STRING to INVALID_TYPE to avoid casting implicitly from BINARY to STRING in function calls.


http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
File fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java:

http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java@154
PS2, Line 154: case BINARY: return Schema.create(Schema.Type.BYTES);
> I'm not sure about this, maybe binary should be converted to avro bytes. Th
Thanks, I changed it to bytes. I tried it that way first, but there was some error during decoding, which doesn't occur now, so I must have made a mistake somewhere.


http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
File fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java@263
PS2, Line 263: 
> This comment is no longer valid, right?
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/16066/2/testdata/bin/generate-schema-statements.py@213
PS2, Line 213: PERTIE
> Again, avro has bytes type which might be a better fit for binary.
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/tests/common/impala_connection.py
File tests/common/impala_connection.py:

http://gerrit.cloudera.org:8080/#/c/16066/2/tests/common/impala_connection.py@426
PS2, Line 426: 
> nit: typo
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 13 Aug 2020 18:59:58 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 12:

PS 12 adds support for Hive UDFs.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Jun 2022 15:13:02 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 14:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10761/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 13 Jun 2022 18:59:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#14).

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveLegacyJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/JavaUdfDataType.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
M java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/load-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
83 files changed, 1,116 insertions(+), 373 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/14
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 24: Code-Review+1

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/exec/orc-metadata-utils.cc
File be/src/exec/orc-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/exec/orc-metadata-utils.cc@477
PS24, Line 477:           || type.type == TYPE_CHAR) {
Could you add a comment about why we don't need TYPE_BINARY here?


http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/runtime/descriptors.h@256
PS24, Line 256: back()
nit: front() or [0] seems better (as line 248 uses). Same comment to line 255.


http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
File java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java:

http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@40
PS24, Line 40: import org.apache.hadoop.io.Text;
duplicated line


http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@319
PS24, Line 319: String
BytesWritable


http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@321
PS24, Line 321: get
nit: get() is deprecated by getBytes().


http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@326
PS24, Line 326: result
nit: handle null 'result' when 'inputs' is empty.



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 24
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Aug 2022 02:30:29 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#25).

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase/hbase-scan-node.cc
M be/src/exec/hbase/hbase-scan-node.h
M be/src/exec/hbase/hbase-table-writer.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/rcfile/hdfs-rcfile-scanner.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exec/text/hdfs-text-scanner.cc
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveLegacyJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/JavaUdfDataType.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
M java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/hbase-inserts.test
M testdata/workloads/functional-query/queries/QueryTest/java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/load-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-basic.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
90 files changed, 1,238 insertions(+), 389 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/25
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 25
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Reviewed-on: http://gerrit.cloudera.org:8080/16066
Reviewed-by: Quanlong Huang <hu...@gmail.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase/hbase-scan-node.cc
M be/src/exec/hbase/hbase-scan-node.h
M be/src/exec/hbase/hbase-table-writer.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/rcfile/hdfs-rcfile-scanner.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exec/text/hdfs-text-scanner.cc
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveLegacyJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/JavaUdfDataType.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
M java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/hbase-inserts.test
M testdata/workloads/functional-query/queries/QueryTest/java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/load-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-basic.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
90 files changed, 1,242 insertions(+), 391 deletions(-)

Approvals:
  Quanlong Huang: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 27
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 18:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc
File be/src/service/hs2-util.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc@872
PS14, Line 872:   AuxColumnType aux_type(columnType);
> Done
Reverted this, as while gcc compiled it, clang had issue I could not solve. This is strange to me, as case TYPE_DECIMAL: also has some local variables, but I did not want to dive too deeply.



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 18
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 05 Aug 2022 10:46:34 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 19:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/16066/19/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
File fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/19/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java@297
PS19, Line 297:     // Returning complex types with BINARY in select list is not yet implemented (IMPALA-11491).
line too long (96 > 90)


http://gerrit.cloudera.org:8080/#/c/16066/19/tests/hs2/test_fetch.py
File tests/hs2/test_fetch.py:

http://gerrit.cloudera.org:8080/#/c/16066/19/tests/hs2/test_fetch.py@69
PS19, Line 69: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/19/tests/hs2/test_fetch.py@89
PS19, Line 89: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/19/tests/hs2/test_fetch.py@100
PS19, Line 100: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/19/tests/hs2/test_fetch.py@113
PS19, Line 113: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/19/tests/hs2/test_fetch.py@123
PS19, Line 123: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/19/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/19/tests/query_test/test_udfs.py@116
PS19, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 19
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 10 Aug 2022 18:07:31 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#22).

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveLegacyJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/JavaUdfDataType.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
M java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/hbase-inserts.test
M testdata/workloads/functional-query/queries/QueryTest/java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/load-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
89 files changed, 1,196 insertions(+), 389 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/22
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 22
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 16:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/11087/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 16
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 03 Aug 2022 17:55:45 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 17:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16066/17/tests/hs2/test_fetch.py
File tests/hs2/test_fetch.py:

http://gerrit.cloudera.org:8080/#/c/16066/17/tests/hs2/test_fetch.py@69
PS17, Line 69: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/17/tests/hs2/test_fetch.py@89
PS17, Line 89: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/17/tests/hs2/test_fetch.py@100
PS17, Line 100: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/17/tests/hs2/test_fetch.py@113
PS17, Line 113: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/17/tests/hs2/test_fetch.py@123
PS17, Line 123: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/17/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/17/tests/query_test/test_udfs.py@116
PS17, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 17
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 03 Aug 2022 18:16:49 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 11:

Ps 11 fixed length() behavior when utf8_mode=1 and added BE tests for functions and predicates on BINARY.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Jun 2022 16:23:38 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 7:

Let me know when I should take another look


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 01 Sep 2020 23:20:57 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 11:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10675/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Jun 2022 16:34:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 12: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Jun 2022 19:36:52 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 12:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/12/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/12/tests/query_test/test_udfs.py@116
PS12, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Jun 2022 15:13:20 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8194/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Jun 2022 15:13:22 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 14:

> (4 comments)
 > 
 > Thanks for reviving this patch! I just went throught the FE changes
 > and some tests and they LGTM. I'm going to dig into more details.
 > 
 > Not sure if I missed it, but do we have tests on writing BINARY
 > columns? Assuming the data loading of functional_parquet.binary_tbl
 > is the test on writing BINARY to parquet, we still need test on
 > writing BINARY to text files.

At first I thought that with the new table writing to text is covered now, but then realized that LOAD always uses Hive. Will add some table writing tests in the next patch.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 13 Jun 2022 19:12:29 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 16:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16066/16/tests/hs2/test_fetch.py
File tests/hs2/test_fetch.py:

http://gerrit.cloudera.org:8080/#/c/16066/16/tests/hs2/test_fetch.py@69
PS16, Line 69: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/16/tests/hs2/test_fetch.py@89
PS16, Line 89: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/16/tests/hs2/test_fetch.py@100
PS16, Line 100: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/16/tests/hs2/test_fetch.py@113
PS16, Line 113: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/16/tests/hs2/test_fetch.py@123
PS16, Line 123: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/16/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/16/tests/query_test/test_udfs.py@116
PS16, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 16
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 03 Aug 2022 17:46:05 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 22:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/11136/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 22
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 11 Aug 2022 10:05:17 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 7:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/text-converter.h
File be/src/exec/text-converter.h:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/text-converter.h@63
PS2, Line 63: 
> Could you mention the new param in the comment above?
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/text-converter.h@88
PS2, Line 88: hat cont
> Could you mention the new param in the comment above?
Done


http://gerrit.cloudera.org:8080/#/c/16066/6/be/src/service/hs2-util.cc
File be/src/service/hs2-util.cc:

http://gerrit.cloudera.org:8080/#/c/16066/6/be/src/service/hs2-util.cc@273
PS6, Line 273: static void StringExprValuesToHS2TColumnHelper(ScalarExprEvaluator* expr_eval, 
> line too long (95 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16066/6/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
File fe/src/test/java/org/apache/impala/catalog/CatalogTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/6/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java@624
PS6, Line 624:     HdfsTable binaryTable = (HdfsTable) catalog_.getOrLoadTable("functional",
> line too long (91 > 90)
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 13 Aug 2020 19:05:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#7).

Change subject: IMPALA-9482 Support for BINARY columns
......................................................................

IMPALA-9482 Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type
- Ran exhaustive tests.

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
73 files changed, 752 insertions(+), 363 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/7
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] WIP IMPALA-9482 Support for BINARY columns

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: WIP IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 2:

(8 comments)

I did a go through on this patch and left some minor comments. However, I feel that a second look from someone more competent to introducing new types should take a look. I see that there are a high number of places you had to touch but I can't judge if there is anything else missing.

http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@47
PS2, Line 47: Testing
Do you plan to do interop tests between Hive and Impala?


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/text-converter.h
File be/src/exec/text-converter.h:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/text-converter.h@63
PS2, Line 63: col_desc
Could you mention the new param in the comment above?


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/text-converter.h@88
PS2, Line 88: col_desc
Could you mention the new param in the comment above?


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc
File be/src/runtime/types.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@180
PS2, Line 180:   // ODBC driver requires types in lower case
nit: I'd move this comment right before the switch to L186.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/service/hs2-util.cc
File be/src/service/hs2-util.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/service/hs2-util.cc@729
PS2, Line 729: // HiveServer2 does not have a type for invalid, date, datetime or
             :       // fixed_uda_intermediate.
Shouldn't you change this comment to reflect that now BINARY can also trigger this branch?


http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java
File fe/src/main/java/org/apache/impala/analysis/CastExpr.java:

http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@341
PS2, Line 341:     boolean isStringBinaryCast =
This can go after the Preconditions check, L347.


http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
File fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java@263
PS2, Line 263: For tables with unsupported primitive types (e.g., binary)
This comment is no longer valid, right?


http://gerrit.cloudera.org:8080/#/c/16066/2/tests/common/impala_connection.py
File tests/common/impala_connection.py:

http://gerrit.cloudera.org:8080/#/c/16066/2/tests/common/impala_connection.py@426
PS2, Line 426: `
nit: typo



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 16 Jun 2020 13:39:03 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 26:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/11190/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 26
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Aug 2022 08:35:16 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#26).

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase/hbase-scan-node.cc
M be/src/exec/hbase/hbase-scan-node.h
M be/src/exec/hbase/hbase-table-writer.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/rcfile/hdfs-rcfile-scanner.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exec/text/hdfs-text-scanner.cc
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveLegacyJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/JavaUdfDataType.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
M fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
M fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M fe/src/test/java/org/apache/impala/hive/executor/TestGenericUdf.java
M fe/src/test/java/org/apache/impala/service/JdbcTest.java
M java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/generic-java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/hbase-inserts.test
M testdata/workloads/functional-query/queries/QueryTest/java-udf.test
M testdata/workloads/functional-query/queries/QueryTest/load-generic-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/load-java-udfs.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-basic.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
90 files changed, 1,242 insertions(+), 391 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/26
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 26
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 22:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256
PS14, Line 256:     return col_descs_[slot_desc->col_path().back()];
> Change the analyzer to disallow complex types in select list if they have b
Thanks for digging into this!

Sorry that my initial confusion is that "col_descs_" are the top-level columns of the table, but the last item in SchemaPath is not always the top-level column index. Usually the first item of SchemaPath is the top-level column index, and the next items are the index inside the nested type. E.g. the 6th column in table complextypestbl is

 nested_struct	struct<
   a: int,
   b: array<int>,
   ...
 >

If the query selects "nested_struct.a" in the SelectList, the corresponding SchemaPath is [5, 0]. Here [5] is the SchemaPath of "nested_struct". But we are using 0 (the last item) here as the index of col_descs_. So I hope we can add a test of selecting the binary column directly inside a struct<binary> top level column.

Maybe I've missed something. Just explaning my confusion.


http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
File fe/src/main/java/org/apache/impala/analysis/LikePredicate.java:

http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java@116
PS14, Line 116:   private static boolean isLikeableType(Type type) {
> :D poor other types - tbh I don't remember whether the naming was intention
haha


http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/bin/generate-schema-statements.py@222
PS22, Line 222: 'BINARY': 'bytes'
nit: it'd be nice to add a trailing comma so future changes don't need to touch this line.


http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/datasets/functional/functional_schema_template.sql@3532
PS22, Line 3532: binary_in_complex_types
Can we add some data to this table and add some e2e tests? e.g.

 select binary_member_col.b from binary_in_complex_types;
 select a.item from binary_in_complex_types t, t.binary_item_col;
 select m.key from binary_in_complex_types t, t.binary_key_col;
 select m.value from binary_in_complex_types t, t.binary_value_col;



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 22
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Aug 2022 09:43:26 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 24:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16066/24/tests/hs2/test_fetch.py
File tests/hs2/test_fetch.py:

http://gerrit.cloudera.org:8080/#/c/16066/24/tests/hs2/test_fetch.py@69
PS24, Line 69: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/24/tests/hs2/test_fetch.py@89
PS24, Line 89: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/24/tests/hs2/test_fetch.py@100
PS24, Line 100: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/24/tests/hs2/test_fetch.py@113
PS24, Line 113: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/24/tests/hs2/test_fetch.py@123
PS24, Line 123: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/24/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/24/tests/query_test/test_udfs.py@116
PS24, Line 116: y
flake8: E501 line too long (92 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 24
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Aug 2022 14:33:23 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
......................................................................


Patch Set 23:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/11181/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 23
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Steve Carlin <sc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Aug 2022 12:35:30 +0000
Gerrit-HasComments: No