Posted to reviews@impala.apache.org by "Quanlong Huang (Code Review)" <ge...@cloudera.org> on 2019/02/03 13:10:32 UTC

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Hello Zoltan Borok-Nagy, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#11).

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................

IMPALA-6503: Support reading complex types from ORC format files

Impala already supports reading primitive types from ORC files
(IMPALA-5717). This patch adds support for complex types
(struct/array/map).

In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC
lib can materialize ORC column binaries into its own representation
(orc::ColumnVectorBatch), so hdfs-orc-scanner does not need to handle
decoding or decompression itself. Since the ORC lib already supports
complex types, we'll keep depending on it.

IMPALA-6503 needs to add two things:
1. Tell the ORC lib which nested columns we need
2. Transform the outputs of the ORC lib (nested orc::ColumnVectorBatch)
  into Impala's representation
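
A minimal sketch of what step 1 could involve, assuming the selection is
expressed as ORC type ids (for example through
orc::RowReaderOptions::includeTypes). ORC numbers the nodes of a schema
tree in pre-order, with the root as 0. SchemaNode, AssignIds, ResolvePath
and MakeSampleSchema are hypothetical names used only for illustration;
they are not taken from this patch or from the ORC lib.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one node of an ORC schema tree (not orc::Type).
struct SchemaNode {
  std::string name;
  std::vector<SchemaNode> children;
  uint64_t id = 0;  // filled in by AssignIds()
};

// Numbers the tree in pre-order starting at 'next'.
// Returns the first id that was not used.
uint64_t AssignIds(SchemaNode* node, uint64_t next) {
  assert(node != nullptr);
  node->id = next++;
  for (SchemaNode& child : node->children) next = AssignIds(&child, next);
  return next;
}

// Walks 'path' (a list of field names) down from 'root'.
// Returns the id of the node reached, or -1 if a field does not exist.
int64_t ResolvePath(const SchemaNode& root,
                    const std::vector<std::string>& path) {
  const SchemaNode* cur = &root;
  for (const std::string& field : path) {
    const SchemaNode* found = nullptr;
    for (const SchemaNode& child : cur->children) {
      if (child.name == field) {
        found = &child;
        break;
      }
    }
    if (found == nullptr) return -1;
    cur = found;
  }
  return static_cast<int64_t>(cur->id);
}

// Illustrative schema:
//   struct<id:bigint, items:array<struct<name:string, price:double>>>
// The array's single child is called "element" here by assumption.
SchemaNode MakeSampleSchema() {
  SchemaNode name{"name", {}};
  SchemaNode price{"price", {}};
  SchemaNode element{"element", {name, price}};
  SchemaNode items{"items", {element}};
  SchemaNode id{"id", {}};
  return SchemaNode{"root", {id, items}};
}
```

After AssignIds(&root, 0) on the sample schema, the path
items/element/price resolves to type id 5. The resolved ids could then be
collected into the list handed to the ORC lib.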

To structure the materialization, we implement several ORC column
readers in hdfs-orc-scanner. Each kind of reader handles one column
type. The ORC column readers differ from the Parquet readers (used in
hdfs-parquet-scanner), which materialize Parquet column binaries into
tuple values directly. The ORC column readers only need to transform
outputs of the ORC lib into tuple/slot values.
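
The idea can be sketched as follows, under simplifying assumptions. The
batch, slot and reader types below are hypothetical stand-ins and are not
the classes added in orc-column-readers.h. The point is only that each
reader copies already-decoded values from a batch into a slot, with no
decoding or decompression of its own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for decoded column batches (not orc::ColumnVectorBatch).
struct LongBatch {
  std::vector<int64_t> data;
  std::vector<bool> not_null;
};
struct StringBatch {
  std::vector<std::string> data;
  std::vector<bool> not_null;
};

// Hypothetical slot and tuple layout, far simpler than Impala's real one.
struct Slot {
  bool is_null = true;
  int64_t int_val = 0;
  std::string str_val;
};
using Tuple = std::vector<Slot>;

// One reader per column type. A reader only copies decoded values.
class ColumnReader {
 public:
  virtual ~ColumnReader() = default;
  virtual void ReadValue(size_t row_idx, Tuple* tuple) const = 0;
};

class LongColumnReader : public ColumnReader {
 public:
  LongColumnReader(const LongBatch* batch, size_t slot_idx)
      : batch_(batch), slot_idx_(slot_idx) {}
  void ReadValue(size_t row_idx, Tuple* tuple) const override {
    assert(row_idx < batch_->data.size());
    Slot& slot = (*tuple)[slot_idx_];
    slot.is_null = !batch_->not_null[row_idx];
    if (!slot.is_null) slot.int_val = batch_->data[row_idx];
  }

 private:
  const LongBatch* batch_;
  size_t slot_idx_;
};

class StringColumnReader : public ColumnReader {
 public:
  StringColumnReader(const StringBatch* batch, size_t slot_idx)
      : batch_(batch), slot_idx_(slot_idx) {}
  void ReadValue(size_t row_idx, Tuple* tuple) const override {
    assert(row_idx < batch_->data.size());
    Slot& slot = (*tuple)[slot_idx_];
    slot.is_null = !batch_->not_null[row_idx];
    if (!slot.is_null) slot.str_val = batch_->data[row_idx];
  }

 private:
  const StringBatch* batch_;
  size_t slot_idx_;
};

// Builds one tuple per row by letting every reader fill its own slot.
std::vector<Tuple> Materialize(const std::vector<const ColumnReader*>& readers,
                               size_t num_rows, size_t num_slots) {
  std::vector<Tuple> tuples(num_rows, Tuple(num_slots));
  for (size_t row = 0; row < num_rows; ++row) {
    for (const ColumnReader* reader : readers) {
      reader->ReadValue(row, &tuples[row]);
    }
  }
  return tuples;
}
```

In this sketch a nested type would be handled by a reader that owns child
readers and delegates to them, which is left out to keep the example
short.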

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
30 files changed, 1,828 insertions(+), 462 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/11
-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 11
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>