You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Quanlong Huang (Code Review)" <ge...@cloudera.org> on 2019/01/07 14:54:42 UTC

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Quanlong Huang has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12168


Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................

IMPALA-6503: Support reading complex types from ORC format files

We’ve supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC
lib can materialize ORC column binaries into its representation
(orc::ColumnVectorBatch), so we don’t need to do anything about
decoding/decompression in hdfs-orc-scanner. Since it already supports
complex types, we’ll still depend on it.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need to the ORC lib
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala’s representation

To format the materialization, we implement several ORC column readers
used in hdfs-orc-scanner. Each kind of reader treats a column type.
Don’t like the Parquet readers (used in hdfs-parquet-scanner) which
materializes Parquet column binaries into tuple values directly, the ORC
readers (in hdfs-orc-scanner) just need to transform outputs of the ORC
lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
29 files changed, 1,656 insertions(+), 442 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/5
-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 12: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/3702/


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Feb 2019 00:55:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 24: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 24
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 08 Mar 2019 00:24:34 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Borok-Nagy, Tim Armstrong, Joe McDonnell, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#18).

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................

IMPALA-6503: Support reading complex types from ORC

We've supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we leverage the ORC lib to parse ORC binaries (data in
io buffer read from DiskIoMgr). The ORC lib can materialize ORC column
binaries into its representation (orc::ColumnVectorBatch). Then we
transform values in orc::ColumnVectorBatch into impala::Tuples in
hdfs-orc-scanner. We don't need to do anything about decoding/decompression
since they are handled by the ORC lib. Fortunately, the ORC lib already
supports complex types, we can still leverage it to support complex types.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need in the form required by the ORC
  lib (Get list of ORC type ids from tuple descriptors)
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala's representation (Slots/Tuples/RowBatches)

To format the materialization, we implement several ORC column readers
in hdfs-orc-scanner. Each kind of reader treats a column type and
transforms outputs of the ORC lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.
* Run exhaustive tests in DEBUG and RELEASE builds.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
29 files changed, 1,898 insertions(+), 462 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/18
-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 18
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 24:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3892/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 24
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 08 Mar 2019 00:24:36 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 14:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3768/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 14
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 13 Feb 2019 01:00:38 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1813/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 18 Jan 2019 14:41:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 22: Code-Review+2

Thank you for your great work and patience!
It's very cool to have this feature!


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 22
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 07 Mar 2019 15:24:03 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 18:

(14 comments)

I've added some slides but still have more things to cover: https://docs.google.com/presentation/d/1uj8m7y69o47MhpqCc0SJ03GDTtPDrg4m04eAFVmq34A

Will add more slides later.

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@178
PS18, Line 178:   std::unordered_set<const SlotDescriptor*> missing_field_slots_;
> Please add comment to this member.
Done. Some members are relative (reader_, row_reader_, reader_options_, row_reader_options_) so I skip the blank lines between them.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@287
PS18, Line 287: continue
> nit: please put it into braces
Done


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@320
PS18, Line 320: RETURN_IF_ERROR
> nit: braces
Done


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@398
PS18, Line 398:       CreateChildForSlot(node, child_slot);
> nit: put into braces.
Done


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@411
PS18, Line 411: CreateChildForSlot
> nit: braces
Done


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@451
PS18, Line 451:     RETURN_IF_ERROR(child->ReadValue(array_start_ + array_idx_, tuple, pool));
> nit: put it into braces
Done


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@467
PS18, Line 467: RETURN_IF_ERROR
> nit: braces
Done


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@502
PS18, Line 502: (field == SchemaPathConstants::MAP_KEY) key_readers_.push_back(child);
              :   else value_readers_.push_back(chil
> nit: the whole 'if' statement doesn't fit into one line, so please put it i
Done


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@512
PS18, Line 512:       CreateChildForSlot(node, child_slot);
> nit: braces
Done


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@527
PS18, Line 527: y_selected) key_readers_.push_back(child);
              :     else value_readers_
> nit: braces
Done


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@539
PS18, Line 539: CreateChildForSlot
> nit: braces
Done


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@540
PS18, Line 540: CreateChildForSlot
> nit: braces
Done


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@579
PS18, Line 579:     RETURN_IF_ERROR(child->ReadValue(array_offset_, tuple, pool));
> nit: braces
Done


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@590
PS18, Line 590:     RETURN_IF_ERROR(child->ReadValue(offset + tuple_idx, tuple, pool));
> nit: braces
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 18
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Mar 2019 13:40:49 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 15:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2153/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 15
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 18 Feb 2019 14:48:50 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 25:

(15 comments)

> Patch Set 25:
> 
> (33 comments)
> 
> Ah, I missed that this was almost committed. I had started reviewing but hadn't finished so hadn't posted the comments yet. I'll post my comments here for now.

Todd, thanks for your helpful comments! I should notify you before I run GVO... What if I can get this several hours earlier!

> Quanlong, do you mind if I build a patch on this to take care of my suggested cleanups?

I feel like we still need to polish the codes (especially some comments) to improve readablity. Most of the reasons and examples are given in the Google Docs and slides (https://docs.google.com/presentation/d/1uj8m7y69o47MhpqCc0SJ03GDTtPDrg4m04eAFVmq34A). We may need to write them down in comments.

It'll be great if Cloudera folks can join the development of the ORC support. Todd, feel free to create a follow-up JIRA for this. I'll reply more comments in details later.

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@182
PS18, Line 182: tch batch updated
> I noticed that the ORC docs are not very consistent on terminology here. Th
Yes. "Type" and "node" in the ORC relative codes all means "column".


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@235
PS18, Line 235:   /// Advances 'stripe_idx_' to the next non-empty stripe and initializes
> can you update the doc to explain the 'column_reader' parameter here? The d
Sorry for my misleading comments.
The "orc_root_reader_" keeps reference of the "orc_root_batch_". Each child of "orc_root_reader_" keeps the reference of the corresponding sub batch of "orc_root_batch_" and so on recursively. 
What I want to explain is that the "column_reader" will read values of the orc batch it refers and then materialize tuples of "dst_batch".


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@243
PS18, Line 243: 
> 'selected_indices' doesn't seem to exist here. Should it be "selected nodes
Yes. Fixed in a later version of the patch.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@244
PS18, Line 244:   /// Materialize collection(list/map) tuples belong to the 'row_idx'-th row of
> I'm having trouble understanding this documentation. Maybe worth giving an 
Yes. There're some examples here: https://docs.google.com/presentation/d/1uj8m7y69o47MhpqCc0SJ03GDTtPDrg4m04eAFVmq34A


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@72
PS18, Line 72:  OrcColumnReader(const orc::Type* node, const SlotDescriptor* slot_desc,
             :       HdfsOrcScanner* scanner);
> instead of having both a public constructor and a public factory function a
Yes! We can make it protected.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@79
PS18, Line 79:   virtual bool MaterializeTuple() const { return true; }
> Can we rename this to sound more like a boolean? As is, it sounds like a fu
Sure. Actually, I struggled to think of a name. Maybe better to be "shouldMaterializeTuples". This function is mainly used in complex type readers.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@88
PS18, Line 88: orc::ColumnVectorBatch
> can this be const?
I tried to do this before but failed because orc::ColumnVectorBatch::notNull is a "orc::DataBuffer<char>". We need to use "notNull[row_idx]" but the operator[] of "orc::DataBuffer" is lake of a const version: https://github.com/apache/orc/blob/fa0a33e4a7331dd21485bf4ccc0a93edf4b3ae16/c++/include/orc/MemoryPool.hh#L72


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@122
PS18, Line 122: class OrcBoolColumnReader : public OrcColumnReader {
> do we need to define all the subtypes of readers in the .h file or could th
I think they sould be here since HdfsOrcScanner reference them as friend classes and hdfs-orc-scanner.h only includes this .h file.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@46
PS18, Line 46:     switch (slot_desc->type().type) {
> I think it's worth a comment explaining why we switched above on node->getK
Yes. There's a reason why we switch on Impala slot type here. Will sumarize it later.
We've verify the compatibility before we create the column readers. The verification is in OrcSchemaResolver::ResolveColumn.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@102
PS18, Line 102:         DCHECK(false) << slot_desc->type().DebugString();
> I think we should be more defensive here and actually return a bad Status i
Since we swith on Impala slot type, we won't hit UNION here. But we really need to consider DATE type here after we merge https://gerrit.cloudera.org/c/12481/


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@112
PS18, Line 112:   if (node->getKind() == orc::TypeKind::STRUCT) {
> I'm a bit confused. Isn't the top-level reader always going to be a STRUCT?
No. "top-level reader" is not the root reader (orc_root_reader_ in hdfs-orc-scanner). I struggled to name it too. Maybe you can find me a better name :)
The column reader that materializes the table-level (top-level) tuple is a "top-level reader". All the ancestors of this reader are also defined as "top-level readers".
Note that the table-level tuple is not neccessarily mapped to a row of the table. It depends on the TupleDescriptors (so depends on the query). My slides have some examples about this.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc
File be/src/exec/orc-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@25
PS18, Line 25: void OrcMetadataUtils::BuildSchemaPaths(const orc::Type& root, int num_partition_keys,
> it's nice that you factored out these utility functions. Would it be possib
Yes. Good point!


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@40
PS18, Line 40:   DCHECK_EQ(paths->size(), node.getColumnId());
> Is this guaranteed that the column IDs in an ORC file are in sequential ord
Yes. The ORC column IDs are the pre-order traversal ids of the schema tree.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@65
PS18, Line 65:   }
> do we need an 'else { return Status(...); } here? What if the ORC file has 
Ah, yes! We should take care of this!


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@184
PS18, Line 184:       "Type mismatch: table column $0 is map to column $1 in ORC file '$2'",
> would be good if you could include the string column paths here (both the i
Sure!



-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 25
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 08 Mar 2019 15:23:43 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 9:

(10 comments)

I looked through the headers and test files initially. I think this makes sense at a high level. I haven't looked at the .cc files in parallel - I think Zoltan will probably get to that before me.

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.h@175
PS9, Line 175:   std::list<uint64_t> selected_type_ids_;
Mention that RowReaderOptions.includeTypes() expects a list (otherwise it's confusing why this is not a vector). Actually it's still confusing why it's not a vector but at least we can blame the ORC library for that decision :)


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@301
PS9, Line 301:     VLOG_QUERY << "Add ORC column " << node->getMaximumColumnId() << " for empty tuple "
This logging will be too verbose for most purposes - I'd suggest making it VLOG(3) or removing it.


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363
PS9, Line 363:   for (uint64_t id : selected_type_ids)
nit: braces for multi-line if


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@395
PS9, Line 395: selected_type_ids_.size()
It doesn't really matter here, but with gcc4.9.2 list::size() is actually O(n). We've been burned by that before.


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@680
PS9, Line 680:     for (int i = 0; i < total_tuples; ++i)
Use braces for multi-line if


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h@19
PS9, Line 19: #ifndef IMPALA_EXEC_ORC_COLUMN_READERS_H
We started using "#pragma once" instead of the traditional include guards in many places - it would be nice to switch.


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h@61
PS9, Line 61:   virtual void UpdateInputBatch(orc::ColumnVectorBatch* orc_batch) = 0;
Maybe mention the relationship with ReadValue() - do you need to call it before calling ReadValue()? Maybe it would help to document how callers should use this in a class comment for OrcColumnReader?


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.cc@330
PS9, Line 330:   for (OrcColumnReader* child : children_)
Need braces around multi-line for loop - here and elsewhere


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.cc@331
PS9, Line 331:     RETURN_IF_ERROR(child->ReadValue(row_idx_, tuple, pool));
This probably performs similarly to the previous code with the switch, but ultimately to make this performance we'd want to do batched calls that read multiple values in one try.

We changed the Parquet reader to do things that way a while back and it makes it faster automatically and unlocks more optimisation opportunities.

Nothing to fix here, just sharing


http://gerrit.cloudera.org:8080/#/c/12168/9/tests/query_test/test_nested_types.py
File tests/query_test/test_nested_types.py:

http://gerrit.cloudera.org:8080/#/c/12168/9/tests/query_test/test_nested_types.py@128
PS9, Line 128:     # TODO(IMPALA-6505): support predicates push down on ORC stripe statistics
Maybe move this to a string argument to skip()



-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Sat, 26 Jan 2019 01:19:46 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 23:

Thanks for all your review! Really glad that we reach another milestone!


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 23
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 07 Mar 2019 23:41:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 13:

(15 comments)

Thanks for your new feedbacks! Let's keep reviewing it so things could be simpler:)

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363
PS9, Line 363:     // We only deal with collection columns (ARRAY/MAP) and primitive columns here.
> Yep that's the rule - use braces if the whole thing doesn't fit on a single
Sure


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@198
PS12, Line 198: niqu
> nit:std:: not needed since we import it into the namespace with common/name
Done


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@316
PS12, Line 316:     VLOG(3) << "Add ORC column " << node->getMaximumColumnId() << " for empty tuple "
> Can we remove these VLOG_FILE statements here and below? It could get quite
Yes, they're mainly used for debugging. VLOG(3) is equal to VLOG_ROW but this function will only be called once for an orc-scanner, not for each row or RowBatch. Isn't it a VLOG(2) level info?

However, I still change it to VLOG(3) to respect your opinions :)


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@380
PS12, Line 380:     const list<uint64_t>& selected_type_ids) {
> Can this argument be const?
Yes. Should be const.


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@593
PS12, Line 593: 
> How do we know that we're finished processing the previous batch? What happ
If we come here, the previous batch should have been drained. The relative codes are in GetNextInternal:

  // Transfer remaining values in current orc batches tracked by readers.
  if (!orc_root_reader_->EndOfBatch()) {
    assemble_rows_timer_.Start();
    RETURN_IF_ERROR(TransferTuples(orc_root_reader_, row_batch));
    assemble_rows_timer_.Stop();
    if (row_batch->AtCapacity()) return Status::OK();
    DCHECK(orc_root_reader_->EndOfBatch());
  }

If we exited the while() loop because row_batch() was at capacity, the remaining rows will be processed in the above codes. Since the batch size equals to the capacity of RowBatch, the remaining rows should no more than the capacity of RowBatch. Thus calling TransferTuples once can drain the original batch.


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@638
PS12, Line 638: 
> typo: materialization
Done


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@638
PS12, Line 638: alizeTuple()) {
> grammar: "that are not materializing tuples"
Done


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@688
PS12, Line 688: 
> Pass in complex_col_reader by pointer since it's mutable.
It can be const. We don't change any states of it in this function. I just add 'const' for it and add const for functions we called here.


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc@133
PS12, Line 133:   }
> VLOG_FILE is also probably too verbose here - VLOG(3) would be more appropr
Done


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc@219
PS12, Line 219: materiali
> same comment about logging
Done


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/runtime/descriptors.cc
File be/src/runtime/descriptors.cc:

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/runtime/descriptors.cc@65
PS12, Line 65: const int SchemaPathConstants::ARRAY_ITEM;
> I think it's better to define the constant value in the header and just ins
Done


http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/create-load-data.sh@581
PS12, Line 581:   run-step "Loading nested parquet data" load-nested.log \
              :     ${IMPALA_HOME}/testdata/bin/load_nested.py \
              :     -t tpch_nested_parquet -f parquet/none ${LOAD_NESTED_ARGS:-}
              :   run-step "Loading nested orc data" load-nested.log \
              :     ${IMPALA_HOME}/testdata/bin/load_nested.py \
> I think it would make sense to specify the arguments for the call to run th
Sure. It's clearer.


http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py
File testdata/bin/load_nested.py:

http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py@286
PS12, Line 286:       hive.execute("""
              :         CREATE TABLE part
              :         STORED AS ORC
              :         TBLPROPERTIES('{compression_key}'='{compression_value}')
              :         AS SELECT * FROM {source_db}.part""".format(**sql_params))
              : 
> I would prefer this to be right next to the create statement for parquet, e
Done


http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py@340
PS12, Line 340:     impala.invalidate_metadata(table_name="customer")
              :     impala.invalidate_metadata(table_name="part")
              :     impala.invalidate_metadata(table_name="region")
              :     impala.invalidate_metadata(table_name="supplier")
              :     impala.compute_stats()
              : 
              :   LOG.info("Done loading neste
> Please throw an error if the file_format/compression_value is not one of th
My original intention is that only these two cases needs transformation (i.e. parquet/none means parquet/snappy, orc/def means using zlib). Actually, "snap" needs to be transformed to "SNAPPY" and some other codecs are not supported by Hive (e.g. bzip). Based on these, I refactor this part to be able to load more table formats (e.g. parquet/gzip, orc/snap, orc/none, etc.)


http://gerrit.cloudera.org:8080/#/c/12168/12/tests/query_test/test_nested_types.py
File tests/query_test/test_nested_types.py:

http://gerrit.cloudera.org:8080/#/c/12168/12/tests/query_test/test_nested_types.py@611
PS12, Line 611:     if file_format == 'parquet':
              :       self.__create_parquet_tables(unique_database)
              :     elif file_format == 'orc':
              :       self.__create_orc_tables(unique_database)
              :     self.run_test_case('QueryTest/max-nesting-depth', vector, unique_database)
              : 
              :   def __create_parquet_tables(self, unique_data
> It might be cleaner to have __create_orc_table() call __create_parquet_tabl
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 12 Feb 2019 23:56:07 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#7).

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................

IMPALA-6503: Support reading complex types from ORC format files

We’ve supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC
lib can materialize ORC column binaries into its representation
(orc::ColumnVectorBatch), so we don’t need to do anything about
decoding/decompression in hdfs-orc-scanner. Since it already supports
complex types, we’ll still depend on it.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need to the ORC lib
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala’s representation

To format the materialization, we implement several ORC column readers
used in hdfs-orc-scanner. Each kind of reader treats a column type.
Don’t like the Parquet readers (used in hdfs-parquet-scanner) which
materializes Parquet column binaries into tuple values directly, the ORC
readers (in hdfs-orc-scanner) just need to transform outputs of the ORC
lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
29 files changed, 1,685 insertions(+), 444 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/7
-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 20:

(5 comments)

Thanks again for the docs! Might be worth to explain it in more detail how you create the row readers, it took me a while to understand it. Maybe extend the comments with a step-by-step walkthrough.

So e.g., if I understood it correctly:

Resolve columns => get a list of orc::Nodes => get column ids from the nodes => configure orc_row_reader_ with column ids (now they are called types) => create temporary orc::RowReader to get the selected subset of the schema => create impala::OrcColumnReaders

http://gerrit.cloudera.org:8080/#/c/12168/20/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12168/20/be/src/exec/hdfs-orc-scanner.h@247
PS20, Line 247: 'selected_indices'
nit: 'selected_nodes'


http://gerrit.cloudera.org:8080/#/c/12168/20/be/src/exec/orc-metadata-utils.h
File be/src/exec/orc-metadata-utils.h:

http://gerrit.cloudera.org:8080/#/c/12168/20/be/src/exec/orc-metadata-utils.h@27
PS20, Line 27: class OrcMetadataUtils {
nit: please add some comment to the class and functions.


http://gerrit.cloudera.org:8080/#/c/12168/20/be/src/exec/orc-metadata-utils.h@36
PS20, Line 36: class OrcSchemaResolver {
nit: please add class comment


http://gerrit.cloudera.org:8080/#/c/12168/20/be/src/exec/orc-metadata-utils.h@42
PS20, Line 42: 'pos_field' is true
I think it would be clearer to say "'pos_field' is set to true" to emphasize that this is an output parameter.


http://gerrit.cloudera.org:8080/#/c/12168/20/be/src/exec/orc-metadata-utils.h@44
PS20, Line 44: 'missing_field' is true
"'missing_field' is set to true"



-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 20
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Mar 2019 17:28:16 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 18:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/12168/17//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12168/17//COMMIT_MSG@30
PS17, Line 30: Tests:
> Did you run exhaustive tests? I think there are some nested types tests tha
Passed exhaustive tests in DEBUG and RELEASE build:
https://jenkins.impala.io/job/pre-review-test/318/
https://jenkins.impala.io/job/pre-review-test/319/


http://gerrit.cloudera.org:8080/#/c/12168/17/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/17/be/src/exec/hdfs-orc-scanner.cc@379
PS17, Line 379:     if
> Doubt inline provides any measurable benefit it. I'd remove it, but ok to i
Done


http://gerrit.cloudera.org:8080/#/c/12168/17/be/src/exec/hdfs-orc-scanner.cc@402
PS17, Line 402:     pos_slots.pop();
> I think the data flow would be clearer if pos_slots was a local and Resolve
Good point!


http://gerrit.cloudera.org:8080/#/c/12168/17/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/17/be/src/exec/orc-column-readers.cc@30
PS17, Line 30: string
> inline seems unnecessary
Done


http://gerrit.cloudera.org:8080/#/c/12168/17/be/src/exec/orc-metadata-utils.h
File be/src/exec/orc-metadata-utils.h:

http://gerrit.cloudera.org:8080/#/c/12168/17/be/src/exec/orc-metadata-utils.h@48
PS17, Line 48:   const HdfsTableDescriptor& tbl_desc_;
> All the members are not assigned except for constructor, so you could add a
Done


http://gerrit.cloudera.org:8080/#/c/12168/17/be/src/exec/orc-metadata-utils.cc
File be/src/exec/orc-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12168/17/be/src/exec/orc-metadata-utils.cc@130
PS17, Line 130:           || type.type == TYPE_INT || type.type == TYPE_BIGINT) {
> Can you put braces on the multi-line if statements here and below.
Done


http://gerrit.cloudera.org:8080/#/c/12168/17/be/src/exec/orc-metadata-utils.cc@175
PS17, Line 175: 
> "Decimal value" instead of "It"
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 18
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Mar 2019 05:24:00 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 16:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2154/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 16
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 19 Feb 2019 16:08:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Borok-Nagy, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#10).

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................

IMPALA-6503: Support reading complex types from ORC format files

We've supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC
lib can materialize ORC column binaries into its representation
(orc::ColumnVectorBatch), so we don't need to do anything about
decoding/decompression in hdfs-orc-scanner. Since it already supports
complex types, we'll still depend on it.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need to the ORC lib
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala's representation

To format the materialization, we implement several ORC column readers
in hdfs-orc-scanner. Each kind of reader treats a column type. The ORC
column readers differ from the Parquet readers (used in
hdfs-parquet-scanner) which materializes Parquet column binaries into
tuple values directly. They just need to transform outputs of the ORC
lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
30 files changed, 1,789 insertions(+), 454 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/10
-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 25:

(2 comments)

Todd, I just added some more replies. Do you have time to keep reviewing these codes? If so, I'd like to create a JIRA and a patch for refactoring the comments and addressing some minor changes (add DCHECKs, resolve "nit"s).

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@46
PS18, Line 46:     switch (slot_desc->type().type) {
> Yes. There's a reason why we switch on Impala slot type here. Will sumarize
About why we switch on impala slot type here but switch on ORC type above for complex types:
- An OrcColumnReader should be the same type corresponding to the ORC node in its constructor. So it's reasonable to switch on ORC type above.
- For primitive types, it's been verified in OrcSchemaResolver::ResolveColumn that the ORC type can be converted to the slot type.
- Some column readers are template classes (e.g. OrcIntColumnReader). The template variable depends on the slot type. If we switch on the ORC type, we should still check the slot type to create the reader. So codes will be very ugly and verbose like:


switch (node->getKind()) {
  ...
  case orc::TypeKind::BYTE:
    if (slot_desc->type().type == TYPE_TINYINT) {
      reader = new OrcIntColumnReader<int8_t>(node, slot_desc, scanner);
    } else if (slot_desc->type().type == TYPE_SMALLINT) {
      reader = new OrcIntColumnReader<int16_t>(node, slot_desc, scanner);
    } else if (slot_desc->type().type == TYPE_INT) {
      reader = new OrcIntColumnReader<int32_t>(node, slot_desc, scanner);
    } else if (slot_desc->type().type == TYPE_BIGINT) {
      reader = new OrcIntColumnReader<int64_t>(node, slot_desc, scanner);
    } else {
        DCHECK(false);
    }
    break;
  case orc::TypeKind::SHORT:
    if (slot_desc->type().type == TYPE_SMALLINT) {
      reader = new OrcIntColumnReader<int16_t>(node, slot_desc, scanner);
    } else if (slot_desc->type().type == TYPE_INT) {
      reader = new OrcIntColumnReader<int32_t>(node, slot_desc, scanner);
    } else if (slot_desc->type().type == TYPE_BIGINT) {
      reader = new OrcIntColumnReader<int64_t>(node, slot_desc, scanner);
    } else {
        DCHECK(false);
    }
    break;
  case orc::TypeKind::INT:
    if (slot_desc->type().type == TYPE_INT) {
      reader = new OrcIntColumnReader<int32_t>(node, slot_desc, scanner);
    } else if (slot_desc->type().type == TYPE_BIGINT) {
      reader = new OrcIntColumnReader<int64_t>(node, slot_desc, scanner);
    } else {
        DCHECK(false);
    }
    break;
  case orc::TypeKind::LONG:
    if (slot_desc->type().type == TYPE_BIGINT) {
      reader = new OrcIntColumnReader<int64_t>(node, slot_desc, scanner);
    } else {
        DCHECK(false);
    }
    break;
  case ...
}

Hence, we switch on impala slot type here.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc
File be/src/exec/orc-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@27
PS18, Line 27:   SchemaPath path;
             :   paths->push_back(path);
             :   DCHECK_EQ(root.getKind(), orc::TypeKind::STRUCT);
             :   int num_columns = root.getSubtypeCount();
             :   for (int i = 0; i < num_columns; ++i) {
             :     path.push_back(i + num_partition_keys);
             :     BuildSchemaPath(*root.getSubtype(i), &path, paths);
             :     path.pop_back();
             :   }
> it seems like this code is the same handling as the 'STRUCT' case below in 
No, they're different. Here we're dealing with table level columns. The column indices in this level should add on 'num_partition_keys'. For columns on other levels, the column index always starts at 0.

I'd like to not refactor them into the same code path, so readers can distinguish these two cases easily. Maybe I should add some more comments :)



-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 25
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 11 Mar 2019 03:43:14 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 9:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363
PS9, Line 363:   for (uint64_t id : selected_type_ids)
> I'm quite confused about the definition of "multi-line". The if statement j
Tim will correct me if I'm wrong, but I guess he was also thinking about the braces of the for-loop.

I should've pasted the url of Impala's style guide since we diverged a little from the Google style guide: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65868536

My experience is that we don't like any control statements without braces, unless the whole statement fits into a single line.


http://gerrit.cloudera.org:8080/#/c/12168/10/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/12168/10/testdata/bin/create-load-data.sh@34
PS10, Line 34: set -x
I guess you left it here accidentally



-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 28 Jan 2019 16:46:20 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Borok-Nagy, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#11).

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................

IMPALA-6503: Support reading complex types from ORC format files

We've supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC
lib can materialize ORC column binaries into its representation
(orc::ColumnVectorBatch), so we don't need to do anything about
decoding/decompression in hdfs-orc-scanner. Since it already supports
complex types, we'll still depend on it.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need to the ORC lib
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala's representation

To format the materialization, we implement several ORC column readers
in hdfs-orc-scanner. Each kind of reader treats a column type. The ORC
column readers differ from the Parquet readers (used in
hdfs-parquet-scanner) which materializes Parquet column binaries into
tuple values directly. They just need to transform outputs of the ORC
lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
30 files changed, 1,828 insertions(+), 462 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/11
-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 11
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 18:

(14 comments)

Found some style-violations. Besides those the reader and scanner code looks good to me.
On Monday I'll try to take a deeper look at the OrcSchemaResolver. Thanks for the google doc and slides that explain the details!

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@178
PS18, Line 178:   std::unordered_set<const SlotDescriptor*> missing_field_slots_;
Please add comment to this member.

Nit: please insert blank lines after member declarations.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@287
PS18, Line 287: continue
nit: please put it into braces


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@320
PS18, Line 320: RETURN_IF_ERROR
nit: braces


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@398
PS18, Line 398:       CreateChildForSlot(node, child_slot);
nit: put into braces.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@411
PS18, Line 411: CreateChildForSlot
nit: braces


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@451
PS18, Line 451:     RETURN_IF_ERROR(child->ReadValue(array_start_ + array_idx_, tuple, pool));
nit: put it into braces


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@467
PS18, Line 467: RETURN_IF_ERROR
nit: braces


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@502
PS18, Line 502: (field == SchemaPathConstants::MAP_KEY) key_readers_.push_back(child);
              :   else value_readers_.push_back(chil
nit: the whole 'if' statement doesn't fit into one line, so please put it into braces.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@512
PS18, Line 512:       CreateChildForSlot(node, child_slot);
nit: braces


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@527
PS18, Line 527: y_selected) key_readers_.push_back(child);
              :     else value_readers_
nit: braces


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@539
PS18, Line 539: CreateChildForSlot
nit: braces


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@540
PS18, Line 540: CreateChildForSlot
nit: braces


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@579
PS18, Line 579:     RETURN_IF_ERROR(child->ReadValue(array_offset_, tuple, pool));
nit: braces


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@590
PS18, Line 590:     RETURN_IF_ERROR(child->ReadValue(offset + tuple_idx, tuple, pool));
nit: braces



-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 18
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Mar 2019 17:26:05 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 17:

(8 comments)

I'm pretty happy with this, I have some fairly minor comments but very happy with the code and the functional coverage from existing tests.

Do other reviewers plan to continue the review? Haven't seen activity for a bit.

http://gerrit.cloudera.org:8080/#/c/12168/17//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12168/17//COMMIT_MSG@30
PS17, Line 30: Tests:
Did you run exhaustive tests? I think there are some nested types tests that aren't run in core.  I can run it on our infrastructure if that would help.


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@593
PS12, Line 593:   DCHECK(parse_status_.ok());
> If we come here, the previous batch should have been drained. The relative 
I should have remembered that, I think I reviewed the original version of the code in the Parquet scanner. Lots of state to keep in my head.


http://gerrit.cloudera.org:8080/#/c/12168/17/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/17/be/src/exec/hdfs-orc-scanner.cc@379
PS17, Line 379: inline
Doubt inline provides any measurable benefit it. I'd remove it, but ok to ignore this comment.


http://gerrit.cloudera.org:8080/#/c/12168/17/be/src/exec/hdfs-orc-scanner.cc@402
PS17, Line 402:   while (!pos_slots_.empty()) {
I think the data flow would be clearer if pos_slots was a local and ResolveColumns() took it as an output argument.


http://gerrit.cloudera.org:8080/#/c/12168/17/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/17/be/src/exec/orc-column-readers.cc@30
PS17, Line 30: inline
inline seems unnecessary


http://gerrit.cloudera.org:8080/#/c/12168/17/be/src/exec/orc-metadata-utils.h
File be/src/exec/orc-metadata-utils.h:

http://gerrit.cloudera.org:8080/#/c/12168/17/be/src/exec/orc-metadata-utils.h@48
PS17, Line 48:   const HdfsTableDescriptor& tbl_desc_;
All the members are not assigned except for constructor, so you could add an additional const to the pointers, e.g. const orc::Type* const root_.


http://gerrit.cloudera.org:8080/#/c/12168/17/be/src/exec/orc-metadata-utils.cc
File be/src/exec/orc-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12168/17/be/src/exec/orc-metadata-utils.cc@130
PS17, Line 130:           || type.type == TYPE_INT || type.type == TYPE_BIGINT)
Can you put braces on the multi-line if statements here and below.


http://gerrit.cloudera.org:8080/#/c/12168/17/be/src/exec/orc-metadata-utils.cc@175
PS17, Line 175: It
"Decimal value" instead of "It"



-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 17
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Feb 2019 23:52:51 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 17:

Thank you for your review! I'll refactor the codes soon.

About exhaustive test, yesterday I ran it in DEBUG mode and it passed: https://jenkins.impala.io/job/pre-review-test/318/

I just triggered it again in RELEASE mode: https://jenkins.impala.io/job/pre-review-test/319/


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 17
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 26 Feb 2019 23:39:23 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 20:

We're going to release Impala-3.2. Hopes that this patch can catch up the release. Thank you, friends!


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 20
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Mar 2019 15:05:31 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 17:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2169/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 17
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 20 Feb 2019 15:41:59 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 16: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 16
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 20 Feb 2019 09:04:51 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 23:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3889/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 23
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 07 Mar 2019 19:57:15 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Borok-Nagy, Tim Armstrong, Joe McDonnell, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#19).

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................

IMPALA-6503: Support reading complex types from ORC

We've supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we leverage the ORC lib to parse ORC binaries (data in
io buffer read from DiskIoMgr). The ORC lib can materialize ORC column
binaries into its representation (orc::ColumnVectorBatch). Then we
transform values in orc::ColumnVectorBatch into impala::Tuples in
hdfs-orc-scanner. We don't need to do anything about decoding/decompression
since they are handled by the ORC lib. Fortunately, the ORC lib already
supports complex types, we can still leverage it to support complex types.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need in the form required by the ORC
  lib (Get list of ORC type ids from tuple descriptors)
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala's representation (Slots/Tuples/RowBatches)

To format the materialization, we implement several ORC column readers
in hdfs-orc-scanner. Each kind of reader treats a column type and
transforms outputs of the ORC lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.
* Run exhaustive tests in DEBUG and RELEASE builds.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
29 files changed, 1,921 insertions(+), 462 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/19
-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 19
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 23: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 23
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 08 Mar 2019 00:20:23 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Borok-Nagy, Tim Armstrong, Joe McDonnell, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#15).

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................

IMPALA-6503: Support reading complex types from ORC format files

We've supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we leverage the ORC lib to parse ORC binaries (data in
io buffer read from DiskIoMgr). The ORC lib can materialize ORC column
binaries into its representation (orc::ColumnVectorBatch). Then we
transform values in orc::ColumnVectorBatch into impala::Tuples in
hdfs-orc-scanner. We don't need to do anything about decoding/decompression
since they are handled by the ORC lib. Fortunately, the ORC lib already
supports complex types, we can still leverage it to support complex types.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need in the form required by the ORC
  lib (Get list of ORC type ids from tuple descriptors)
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala's representation (Slots/Tuples/RowBatches)

To format the materialization, we implement several ORC column readers
in hdfs-orc-scanner. Each kind of reader treats a column type and
transforms outputs of the ORC lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
29 files changed, 1,864 insertions(+), 461 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/15
-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 15
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 22:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3882/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 22
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 06 Mar 2019 05:51:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Borok-Nagy, Tim Armstrong, Joe McDonnell, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#17).

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................

IMPALA-6503: Support reading complex types from ORC

We've supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we leverage the ORC lib to parse ORC binaries (data in
io buffer read from DiskIoMgr). The ORC lib can materialize ORC column
binaries into its representation (orc::ColumnVectorBatch). Then we
transform values in orc::ColumnVectorBatch into impala::Tuples in
hdfs-orc-scanner. We don't need to do anything about decoding/decompression
since they are handled by the ORC lib. Fortunately, the ORC lib already
supports complex types, we can still leverage it to support complex types.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need in the form required by the ORC
  lib (Get list of ORC type ids from tuple descriptors)
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala's representation (Slots/Tuples/RowBatches)

To format the materialization, we implement several ORC column readers
in hdfs-orc-scanner. Each kind of reader treats a column type and
transforms outputs of the ORC lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
29 files changed, 1,894 insertions(+), 462 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/17
-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 17
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 10:

(1 comment)

> Patch Set 9:
> 
> (2 comments)

http://gerrit.cloudera.org:8080/#/c/12168/10/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/12168/10/testdata/bin/create-load-data.sh@34
PS10, Line 34: set -x
> I guess you left it here accidentally
This is introduced by a recent commit: 236b9194d345bbfdfb177dd7ef4908170eaff259
I just rebase and it appears :)



-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 28 Jan 2019 22:54:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 12:

(12 comments)

It's a big change so it's taking a while to get my head around!

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363
PS9, Line 363:     // We only deal with collection columns (ARRAY/MAP) and primitive columns here.
> Tim will correct me if I'm wrong, but I guess he was also thinking about th
Yep that's the rule - use braces if the whole thing doesn't fit on a single line.


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@198
PS12, Line 198: td::
nit:std:: not needed since we import it into the namespace with common/names.h


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@316
PS12, Line 316:     VLOG_FILE << "Add ORC column " << node->getMaximumColumnId() << " for empty tuple "
Can we remove these VLOG_FILE statements here and below? It could get quite noisy. I think we could either remove them or make them VLOG(3) since they're mainly for debugging I think?


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@380
PS12, Line 380:     list<uint64_t>& selected_type_ids) {
Can this argument be const?


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@593
PS12, Line 593:   // We're going to free the previous batch. Clear the reference first.
How do we know that we're finished processing the previous batch? What happens if we exited the while() loop because row_batch() was at capacity? Don't we lose some rows?


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@638
PS12, Line 638: hat not materializing tuples 
grammar: "that are not materializing tuples"


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@638
PS12, Line 638: materializtion
typo: materialization


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@688
PS12, Line 688: OrcComplexColumnReader& complex_col_reade
Pass in complex_col_reader by pointer since it's mutable.


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc@133
PS12, Line 133:   VLOG_FILE << "Created reader for " << PrintNode(orc_type) << ": slot_desc_="
VLOG_FILE is also probably too verbose here - VLOG(3) would be more appropriate.


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc@170
PS12, Line 170: TryAllocate
This is minor, but TryAllocateUnaligned() is better for strings - it is marginally more efficient and avoids wasting space.


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc@219
PS12, Line 219: VLOG_FILE
same comment about logging


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/runtime/descriptors.cc
File be/src/runtime/descriptors.cc:

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/runtime/descriptors.cc@65
PS12, Line 65: const int SchemaPathConstants::ARRAY_ITEM = 0;
I think it's better to define the constant value in the header and just instantiate it here, i.e. revert the change in the header and make these like:

  const int SchemaPathConstants::ARRAY_ITEM;

That just means the constant value can be inlined in callers.



-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 12 Feb 2019 00:08:33 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 12:

The jenkins job still checkout the cdh5-trunk branch of Impala-lzo, which cause the build failures.


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Feb 2019 01:31:15 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 9:

(22 comments)

Thanks for your feedback! I've added more comments and refactor some codes as required.

http://gerrit.cloudera.org:8080/#/c/12168/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12168/8//COMMIT_MSG@25
PS8, Line 25: Don’t like
> nit: 'They are not like', or 'They differ from'
Done


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.h@175
PS9, Line 175:   std::list<uint64_t> selected_type_ids_;
> Mention that RowReaderOptions.includeTypes() expects a list (otherwise it's
Yeah, I have no ideas about this too.


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.h@160
PS8, Line 160:   orc::RowReaderOptions row_reader_options;
> nit: put '_' at the end of the variable name.
Oops! Missed this in the previous patch. Done.


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@181
PS8, Line 181: base
> nit: based
Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@329
PS8, Line 329:         *template_tuple =
             :             Tuple::Create(tuple_desc.byte_size(), template_tuple_pool_.get());
> nit: put curly braces around it
Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@364
PS8, Line 364:     if (id >= node.getColumnId() && id <= node.getMaximumColumnId()) return true;
> nit: put curly braces around it
Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@681
PS8, Line 681:       RETURN_IF_ERROR(AssembleCollection(*child_reader, child_batch_offset + i,
             :           coll_value_builder));
> nit: add curly braces around it
Done


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@301
PS9, Line 301:     VLOG_QUERY << "Add ORC column " << node->getMaximumColumnId() << " for empty tuple "
> This logging will be too verbose for most purposes - I'd suggest making it 
This is quite helpful in debug. I think VLOG_FILE is more suiteable since this function will only be called once for each file.


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363
PS9, Line 363:   for (uint64_t id : selected_type_ids)
> nit: braces for multi-line if
I'm quite confused about the definition of "multi-line". The if statement just occupies one line. Isn't it a single-line statement? Zoltan also comments that the for loop needs a curly brace. Aren't braces optional for single-line statement loops?


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@395
PS9, Line 395: selected_type_ids_.size()
> It doesn't really matter here, but with gcc4.9.2 list::size() is actually O
Sure


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@680
PS9, Line 680:     for (int i = 0; i < total_tuples; ++i)
> Use braces for multi-line if
Done


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h@19
PS9, Line 19: #ifndef IMPALA_EXEC_ORC_COLUMN_READERS_H
> We started using "#pragma once" instead of the traditional include guards i
Sure. Done


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h@61
PS9, Line 61:   virtual void UpdateInputBatch(orc::ColumnVectorBatch* orc_batch) = 0;
> Maybe mention the relationship with ReadValue() - do you need to call it be
Sure. Added class comments. Feel free to polish them since they're not written by a native speaker :)


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h@101
PS8, Line 101:     batch_ = dynamic_cast<orc::LongVectorBatch*>(orc_batch);
> We avoid doing dynamic casts in release mode.
Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h@126
PS8, Line 126:     int64_t val = batch_->data.data()[row_idx];
> Maybe you could add a DCHECK that batch_ is not null.
Sure. Add DCHECK_NOTNULL at the first use of batch_. Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@190
PS8, Line 190:   *slot = TimestampValue::FromUnixTimeNanos(secs, nanos,
ORC timestamps are not timezone-aware. Timestamps read/write by the ORC lib are assumed in GMT timezone. We used to have a long debate about this in https://github.com/apache/orc/pull/233 when I encountered a bug of the writter (ORC-320, ORC-322). The short conclusion is updated in the ORC web site (https://orc.apache.org/docs/core-cpp.html):

> TimestampVectorBatch handles timestamp values. The data is represented as two buffers of int64_t for seconds and nanoseconds respectively. Note that we always assume data is in GMT timezone; therefore it is user’s responsibility to convert wall clock time from local timezone to GMT.


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@326
PS8, Line 326:     children_[c]->UpdateInputBatch(batch_->fields[children_fields_[c]]);
> nit: add curly brackets around it since it is a multi-line for stmt.
Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@331
PS8, Line 331:     RETURN_IF_ERROR(child->ReadValue(row_idx_, tuple, pool));
> nit: add curly brackets around it
Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@395
PS8, Line 395:   } else CreateChildForSlot(node, slot_desc);
> nit: it's against the Google style guide to have braces on only one branch 
Done


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.cc@330
PS9, Line 330:   for (OrcColumnReader* child : children_)
> Need braces around multi-line for loop - here and elsewhere
Done


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.cc@331
PS9, Line 331:     RETURN_IF_ERROR(child->ReadValue(row_idx_, tuple, pool));
> This probably performs similarly to the previous code with the switch, but 
I think the ORC lib already materializes column values in batch, which corresponds to the generation of scratch_batch_ in the parquet scanner. Is it optimised enough?


http://gerrit.cloudera.org:8080/#/c/12168/9/tests/query_test/test_nested_types.py
File tests/query_test/test_nested_types.py:

http://gerrit.cloudera.org:8080/#/c/12168/9/tests/query_test/test_nested_types.py@128
PS9, Line 128:     # TODO(IMPALA-6505): support predicates push down on ORC stripe statistics
> Maybe move this to a string argument to skip()
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 28 Jan 2019 02:22:47 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 12:

(4 comments)

A couple comments about the data loading and tests.

http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/create-load-data.sh@581
PS12, Line 581:   run-step "Loading nested parquet data" load-nested.log \
              :     ${IMPALA_HOME}/testdata/bin/load_nested.py ${LOAD_NESTED_ARGS:-}
              :   run-step "Loading nested orc data" load-nested.log \
              :     ${IMPALA_HOME}/testdata/bin/load_nested.py \
              :     -t tpch_nested_orc_def -f orc/def ${LOAD_NESTED_ARGS:-}
I think it would make sense to specify the arguments for the call to run the load_nested.py for parquet. i.e.
  run-step "Loading nested parquet data" load-nested.log \
    ${IMPALA_HOME}/testdata/bin/load_nested.py \
    -t tpch_nested_parquet -f parquet/none ${LOAD_NESTED_ARGS:-}


http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py
File testdata/bin/load_nested.py:

http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py@286
PS12, Line 286:     # For ORC format, we create the 'part' table by Hive
              :     if file_format == "orc":
              :       hive.execute("""
              :         CREATE TABLE part
              :         STORED AS ORC
              :         AS SELECT * FROM {source_db}.part""".format(**sql_params))
I would prefer this to be right next to the create statement for parquet, even if you need to open a separate hive cursor.


http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py@340
PS12, Line 340:   file_format, compression_value = args.table_format.split("/")
              :   if file_format == "parquet" and compression_value == "none":
              :     compression_key = "parquet.compression"
              :     compression_value = "SNAPPY"
              :   elif file_format == "orc" and compression_value == "def":
              :     compression_key = "orc.compress"
              :     compression_value = "ZLIB"
Please throw an error if the file_format/compression_value is not one of the supported combinations. i.e. add an else and raise an error.


http://gerrit.cloudera.org:8080/#/c/12168/12/tests/query_test/test_nested_types.py
File tests/query_test/test_nested_types.py:

http://gerrit.cloudera.org:8080/#/c/12168/12/tests/query_test/test_nested_types.py@611
PS12, Line 611:     if file_format == 'parquet':
              :       self.__create_parquet_tables(unique_database, True)
              :     elif file_format == 'orc':
              :       # Creating ORC tables from ORC files (IMPALA-8046) has not been supported.
              :       # We create the Parquet tables first and then transform them into ORC tables.
              :       self.__create_parquet_tables(unique_database, False)
              :       self.__create_orc_tables(unique_database)
It might be cleaner to have __create_orc_table() call __create_parquet_tables(), so that this code becomes:

if file_format == 'parquet':
  self.__create_parquet_tables(unique_database)
elif file_format == 'orc':
  self.__create_orc_tables(unique_database)

The comment here about "Creating ORC tables from..." becomes part of __create_orc_tables(), and the as_target parameter is only needed from __create_orc_tables().



-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 06 Feb 2019 18:44:58 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 5:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/1723/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 07 Jan 2019 15:28:52 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 11:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/1974/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 11
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Sun, 03 Feb 2019 13:50:49 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 17:

It's a big enough change that more eyes on it would be useful, let's wait until you can complete the review.

Mainly I just want to make sure we're on the same page and avoid deadlocks :)


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 17
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 26 Feb 2019 17:38:22 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 17:

I was planning to do another round, but got a little busy. Hopefully I can look at it later this week.

But I also don't want to hold it if you think it's OK.


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 17
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 26 Feb 2019 17:30:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 16:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3799/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 16
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 20 Feb 2019 04:26:09 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 8:

(12 comments)

Started to look at it. Currently mostly have comments about style.
I plan to look at it deeper.

http://gerrit.cloudera.org:8080/#/c/12168/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12168/8//COMMIT_MSG@25
PS8, Line 25: Don’t like
nit: 'They are not like', or 'They differ from'


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.h@160
PS8, Line 160:   orc::RowReaderOptions row_reader_options;
nit: put '_' at the end of the variable name.


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@181
PS8, Line 181: base
nit: based


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@329
PS8, Line 329:         *template_tuple =
             :             Tuple::Create(tuple_desc.byte_size(), template_tuple_pool_.get());
nit: put curly braces around it


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@364
PS8, Line 364:     if (id >= node.getColumnId() && id <= node.getMaximumColumnId()) return true;
nit: put curly braces around it


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@681
PS8, Line 681:       RETURN_IF_ERROR(AssembleCollection(*child_reader, child_batch_offset + i,
             :           coll_value_builder));
nit: add curly braces around it


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h@101
PS8, Line 101:     batch_ = dynamic_cast<orc::LongVectorBatch*>(orc_batch);
We avoid doing dynamic casts in release mode.
You can do a dynamic_cast inside a DCHECK, then use static_cast here.


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h@126
PS8, Line 126:     int64_t val = batch_->data.data()[row_idx];
Maybe you could add a DCHECK that batch_ is not null.


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@190
PS8, Line 190:   *slot = TimestampValue::FromUnixTimeNanos(secs, nanos,
Do we know that orc timestamps are timezone-aware or not?


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@326
PS8, Line 326:     children_[c]->UpdateInputBatch(batch_->fields[children_fields_[c]]);
nit: add curly brackets around it since it is a multi-line for stmt.


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@331
PS8, Line 331:     RETURN_IF_ERROR(child->ReadValue(row_idx_, tuple, pool));
nit: add curly brackets around it


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@395
PS8, Line 395:   } else CreateChildForSlot(node, slot_desc);
nit: it's against the Google style guide to have braces on only one branch of an if-stmt:
https://google.github.io/styleguide/cppguide.html#Conditionals



-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 25 Jan 2019 17:25:23 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 18:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2315/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 18
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Mar 2019 06:06:48 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1728/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 08 Jan 2019 00:43:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#9).

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................

IMPALA-6503: Support reading complex types from ORC format files

We’ve supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC
lib can materialize ORC column binaries into its representation
(orc::ColumnVectorBatch), so we don’t need to do anything about
decoding/decompression in hdfs-orc-scanner. Since it already supports
complex types, we’ll still depend on it.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need to the ORC lib
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala’s representation

To format the materialization, we implement several ORC column readers
used in hdfs-orc-scanner. Each kind of reader treats a column type.
Don’t like the Parquet readers (used in hdfs-parquet-scanner) which
materializes Parquet column binaries into tuple values directly, the ORC
readers (in hdfs-orc-scanner) just need to transform outputs of the ORC
lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
30 files changed, 1,698 insertions(+), 453 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/9
-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 22: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 22
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 06 Mar 2019 10:14:10 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Borok-Nagy, Tim Armstrong, Joe McDonnell, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#16).

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................

IMPALA-6503: Support reading complex types from ORC

We've supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we leverage the ORC lib to parse ORC binaries (data in
io buffer read from DiskIoMgr). The ORC lib can materialize ORC column
binaries into its representation (orc::ColumnVectorBatch). Then we
transform values in orc::ColumnVectorBatch into impala::Tuples in
hdfs-orc-scanner. We don't need to do anything about decoding/decompression
since they are handled by the ORC lib. Fortunately, the ORC lib already
supports complex types, we can still leverage it to support complex types.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need in the form required by the ORC
  lib (Get list of ORC type ids from tuple descriptors)
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala's representation (Slots/Tuples/RowBatches)

To format the materialization, we implement several ORC column readers
in hdfs-orc-scanner. Each kind of reader treats a column type and
transforms outputs of the ORC lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
29 files changed, 1,889 insertions(+), 462 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/16
-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 16
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................

IMPALA-6503: Support reading complex types from ORC

We've supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we leverage the ORC lib to parse ORC binaries (data in
io buffer read from DiskIoMgr). The ORC lib can materialize ORC column
binaries into its representation (orc::ColumnVectorBatch). Then we
transform values in orc::ColumnVectorBatch into impala::Tuples in
hdfs-orc-scanner. We don't need to do anything about decoding/decompression
since they are handled by the ORC lib. Fortunately, the ORC lib already
supports complex types, we can still leverage it to support complex types.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need in the form required by the ORC
  lib (Get list of ORC type ids from tuple descriptors)
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala's representation (Slots/Tuples/RowBatches)

To format the materialization, we implement several ORC column readers
in hdfs-orc-scanner. Each kind of reader treats a column type and
transforms outputs of the ORC lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.
* Run exhaustive tests in DEBUG and RELEASE builds.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Reviewed-on: http://gerrit.cloudera.org:8080/12168
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
30 files changed, 1,943 insertions(+), 466 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 25
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 9:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1854/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 23 Jan 2019 16:15:31 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 21:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2355/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 21
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Mar 2019 16:38:11 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Borok-Nagy, Tim Armstrong, Joe McDonnell, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#13).

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................

IMPALA-6503: Support reading complex types from ORC format files

We've supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC
lib can materialize ORC column binaries into its representation
(orc::ColumnVectorBatch), so we don't need to do anything about
decoding/decompression in hdfs-orc-scanner. Since it already supports
complex types, we'll still depend on it.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need to the ORC lib
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala's representation

To format the materialization, we implement several ORC column readers
in hdfs-orc-scanner. Each kind of reader treats a column type. The ORC
column readers differ from the Parquet readers (used in
hdfs-parquet-scanner) which materializes Parquet column binaries into
tuple values directly. They just need to transform outputs of the ORC
lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
29 files changed, 1,864 insertions(+), 461 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/13
-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12168/5/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/5/be/src/exec/hdfs-orc-scanner.cc@291
PS5, Line 291:   RETURN_IF_ERROR(schema_resolver_->ResolveColumn(tuple_desc.tuple_path(), &node, &pos_field,
line too long (93 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 07 Jan 2019 14:55:52 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 14: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 14
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 13 Feb 2019 05:06:49 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3726/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 06 Feb 2019 00:37:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 12:

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/3726/


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 06 Feb 2019 04:35:25 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 20:

(5 comments)

> Patch Set 20:
> 
> (5 comments)
> 
> Thanks again for the docs! Might be worth to explain it in more detail how you create the row readers, it took me a while to understand it. Maybe extend the comments with a step-by-step walkthrough.
> 
> So e.g., if I understood it correctly:
> 
> Resolve columns => get a list of orc::Nodes => get column ids from the nodes => configure orc_row_reader_ with column ids (now they are called types) => create temporary orc::RowReader to get the selected subset of the schema => create impala::OrcColumnReaders

Yes, you are right! Sorry for taking you too much time to understand it. I just add such kind of comments before class HdfsOrcScanner. Feel free to correct my comments.

http://gerrit.cloudera.org:8080/#/c/12168/20/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12168/20/be/src/exec/hdfs-orc-scanner.h@247
PS20, Line 247: 'selected_indices'
> nit: 'selected_nodes'
Done


http://gerrit.cloudera.org:8080/#/c/12168/20/be/src/exec/orc-metadata-utils.h
File be/src/exec/orc-metadata-utils.h:

http://gerrit.cloudera.org:8080/#/c/12168/20/be/src/exec/orc-metadata-utils.h@27
PS20, Line 27: class OrcMetadataUtils {
> nit: please add some comment to the class and functions.
Done


http://gerrit.cloudera.org:8080/#/c/12168/20/be/src/exec/orc-metadata-utils.h@36
PS20, Line 36: class OrcSchemaResolver {
> nit: please add class comment
Done


http://gerrit.cloudera.org:8080/#/c/12168/20/be/src/exec/orc-metadata-utils.h@42
PS20, Line 42: 'pos_field' is true
> I think it would be clearer to say "'pos_field' is set to true" to emphasiz
Good point! Done


http://gerrit.cloudera.org:8080/#/c/12168/20/be/src/exec/orc-metadata-utils.h@44
PS20, Line 44: 'missing_field' is true
> "'missing_field' is set to true"
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 20
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Mar 2019 15:54:06 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#8).

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................

IMPALA-6503: Support reading complex types from ORC format files

We’ve supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC
lib can materialize ORC column binaries into its representation
(orc::ColumnVectorBatch), so we don’t need to do anything about
decoding/decompression in hdfs-orc-scanner. Since it already supports
complex types, we’ll still depend on it.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need to the ORC lib
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala’s representation

To format the materialization, we implement several ORC column readers
used in hdfs-orc-scanner. Each kind of reader treats a column type.
Don’t like the Parquet readers (used in hdfs-parquet-scanner) which
materializes Parquet column binaries into tuple values directly, the ORC
readers (in hdfs-orc-scanner) just need to transform outputs of the ORC
lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
29 files changed, 1,695 insertions(+), 449 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/8
-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 25:

(33 comments)

Ah, I missed that this was almost committed. I had started reviewing but hadn't finished so hadn't posted the comments yet. I'll post my comments here for now. Quanlong, do you mind if I build a patch on this to take care of my suggested cleanups?

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@29
PS18, Line 29: #include "exec/orc-metadata-utils.h"
nit: do you need to include this here or could you get by with forward declarations?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@155
PS18, Line 155:   int64_t stripe_rows_read_ = 0;
nit: nullptr is the default initializer already for unique_ptr


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@174
PS18, Line 174: 
nit: "owned by"


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@182
PS18, Line 182: tch batch updated
I noticed that the ORC docs are not very consistent on terminology here. These are equivalent to "column" IDs as referenced above, right?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@235
PS18, Line 235:   /// Advances 'stripe_idx_' to the next non-empty stripe and initializes
can you update the doc to explain the 'column_reader' parameter here? The docs say this transfers tuples from orc_root_batch_, but in fact maybe it is transferring from the provided column reader?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@243
PS18, Line 243: 
'selected_indices' doesn't seem to exist here. Should it be "selected nodes"?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@244
PS18, Line 244:   /// Materialize collection(list/map) tuples belong to the 'row_idx'-th row of
I'm having trouble understanding this documentation. Maybe worth giving an example here?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@249
PS18, Line 249: 
row_reader_options_


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@33
PS18, Line 33: types
nit: typo: "type"


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@41
PS18, Line 41: ///     while ( /* has more rows to read */ ) {
Performance-wise, would it be possible to actually materialize a column at a time? Or do you think doing it a tuple at a time will be fast once we later introduce codegen?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@57
PS18, Line 57: guaranteed
typo: guarantee


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@72
PS18, Line 72:  OrcColumnReader(const orc::Type* node, const SlotDescriptor* slot_desc,
             :       HdfsOrcScanner* scanner);
instead of having both a public constructor and a public factory function above, could this be another factory function, with an appropriate name? Or could this be made private if it's only meant to be used by the factory function itself?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@79
PS18, Line 79:   virtual bool MaterializeTuple() const { return true; }
Can we rename this to sound more like a boolean? As is, it sounds like a function which materializes a tuple.

I'm also a little confused on the description -- I would think that as described, this function returns whether or not the reader is responsible for materializing a specific slot?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@88
PS18, Line 88: orc::ColumnVectorBatch
can this be const?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@122
PS18, Line 122: class OrcBoolColumnReader : public OrcColumnReader {
do we need to define all the subtypes of readers in the .h file or could they be moved to the .cc file?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@46
PS18, Line 46:     switch (slot_desc->type().type) {
I think it's worth a comment explaining why we switched above on node->getKind() but from here downward we're switching on the Impala slot type. Have we already verified the compatibility elsewhere? Or will we verify compatibility inside the constructor?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@102
PS18, Line 102:         DCHECK(false) << slot_desc->type().DebugString();
I think we should be more defensive here and actually return a bad Status in the case of an unknown type. It seems we're missing UNION and DATE at least here. We don't want to crash in the case that someone passes an ORC file with such a bad type.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@112
PS18, Line 112:   if (node->getKind() == orc::TypeKind::STRUCT) {
I'm a bit confused. Isn't the top-level reader always going to be a STRUCT?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.h
File be/src/exec/orc-metadata-utils.h:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.h@29
PS18, Line 29: class OrcMetadataUtils {
docs


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.h@39
PS18, Line 39: class OrcSchemaResolver {
can you document the lifetime requirements of all these parameters? It seems all of the pointers passed in must be kept alive while the OrcSchemaResolver is in use.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.h@44
PS18, Line 44:   /// Resolve SchemaPath into orc::Type (ORC column representation)
in this case, would 'node' still be returned?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.h@48
PS18, Line 48:   Status ResolveColumn(const SchemaPath& col_path, const orc::Type** node,
reference-typed members are against the style guide. I think this should be a const pointer instead


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc
File be/src/exec/orc-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@25
PS18, Line 25: void OrcMetadataUtils::BuildSchemaPaths(const orc::Type& root, int num_partition_keys,
it's nice that you factored out these utility functions. Would it be possible to write a unit test for them?  It seems like 'orc::Type' is pretty easy to construct for a test using orc::Type::buildTypeFromString(...): https://github.com/apache/orc/blob/d81131b9d608779f48d7de6a3c11ece03a3bc0a0/c%2B%2B/test/TestType.cc#L280


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@26
PS18, Line 26:     vector<SchemaPath>* paths) {
worth a DCHECK(paths->empty());


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@27
PS18, Line 27:   SchemaPath path;
             :   paths->push_back(path);
             :   DCHECK_EQ(root.getKind(), orc::TypeKind::STRUCT);
             :   int num_columns = root.getSubtypeCount();
             :   for (int i = 0; i < num_columns; ++i) {
             :     path.push_back(i + num_partition_keys);
             :     BuildSchemaPath(*root.getSubtype(i), &path, paths);
             :     path.pop_back();
             :   }
it seems like this code is the same handling as the 'STRUCT' case below in BuildSchemaPath. Could you just implement this as:

SchemaPath path;
BuildSchemaPath(root, &path, paths);


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@40
PS18, Line 40:   DCHECK_EQ(paths->size(), node.getColumnId());
Is this guaranteed that the column IDs in an ORC file are in sequential order with no gaps?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@65
PS18, Line 65:   }
do we need an 'else { return Status(...); } here? What if the ORC file has some corruption, we don't want to crash. Also seems this is missing UNION, so even if we can't support UNION initially we should make sure to return a good error back to the user


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@89
PS18, Line 89:       table_col_type = &table_col_type->children[table_idx];
can you add a DCHECK_LT(table_idx, table_col_type->children.size()) here so that in debug builds if we got this wrong we'd get an easier crash?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@100
PS18, Line 100:         return Status(Substitute("File '$0' has an incompatible ORC schema for column "
would be good in unit tests to make sure these cases are covered, too


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@184
PS18, Line 184:       "Type mismatch: table column $0 is map to column $1 in ORC file '$2'",
would be good if you could include the string column paths here (both the impala-side and orc-side). I was trying to use the existing ORC support last night and got very confused for a while by "table column bigint is mapped to column int" since I couldn't figure out which column I'd messed up the table definition for.


http://gerrit.cloudera.org:8080/#/c/12168/18/testdata/ComplexTypesTbl/README
File testdata/ComplexTypesTbl/README:

http://gerrit.cloudera.org:8080/#/c/12168/18/testdata/ComplexTypesTbl/README@10
PS18, Line 10: The ORC files can be regenerated by running the following commands in current directory:
maybe it's worth providing a simple python script to regenerate the nested orc data, including performing the above transformation of the JSON files?


http://gerrit.cloudera.org:8080/#/c/12168/18/testdata/bin/load_nested.py
File testdata/bin/load_nested.py:

http://gerrit.cloudera.org:8080/#/c/12168/18/testdata/bin/load_nested.py@358
PS18, Line 358:   parser.add_argument("-f", "--table-format", default="parquet/none")  # can be "orc/def"
instead of the 'can be orc/def' comment here, maybe include that as a help=... kwarg, so that users of the script can see it?


http://gerrit.cloudera.org:8080/#/c/12168/18/tests/query_test/test_nested_types.py
File tests/query_test/test_nested_types.py:

http://gerrit.cloudera.org:8080/#/c/12168/18/tests/query_test/test_nested_types.py@613
PS18, Line 613:     elif file_format == 'orc':
worth an 'else' raise exception



-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 25
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 08 Mar 2019 05:56:51 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 19:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2335/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 19
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Mar 2019 14:27:52 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 22: Code-Review+1


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 22
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 06 Mar 2019 23:56:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 10:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1904/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 28 Jan 2019 06:03:01 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 20:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3864/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 20
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Mar 2019 14:59:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 23: Code-Review+2

I'll kick off a dry run to make sure rebased commit is tested. Hopefully Quanlong can hit submit when he wakes up.


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 23
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 07 Mar 2019 19:57:09 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 13:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2091/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 13 Feb 2019 00:33:11 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 22:

Agreed, thanks a lot for the contribution!


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 22
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 07 Mar 2019 16:41:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3702/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Sun, 03 Feb 2019 23:50:49 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Borok-Nagy, Tim Armstrong, Joe McDonnell, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#21).

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................

IMPALA-6503: Support reading complex types from ORC

We've supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we leverage the ORC lib to parse ORC binaries (data in
io buffer read from DiskIoMgr). The ORC lib can materialize ORC column
binaries into its representation (orc::ColumnVectorBatch). Then we
transform values in orc::ColumnVectorBatch into impala::Tuples in
hdfs-orc-scanner. We don't need to do anything about decoding/decompression
since they are handled by the ORC lib. Fortunately, the ORC lib already
supports complex types, we can still leverage it to support complex types.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need in the form required by the ORC
  lib (Get list of ORC type ids from tuple descriptors)
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala's representation (Slots/Tuples/RowBatches)

To format the materialization, we implement several ORC column readers
in hdfs-orc-scanner. Each kind of reader treats a column type and
transforms outputs of the ORC lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.
* Run exhaustive tests in DEBUG and RELEASE builds.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
30 files changed, 1,943 insertions(+), 466 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/21
-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 21
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 20: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/3864/


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 20
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Mar 2019 19:23:10 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 24: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 24
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 08 Mar 2019 04:39:06 +0000
Gerrit-HasComments: No