You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Tim Armstrong (Code Review)" <ge...@cloudera.org> on 2018/04/12 01:13:06 UTC

[Impala-ASF-CR](2.x) IMPALA-5717: Support for reading ORC data files

Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/9988

to review the following change.


Change subject: IMPALA-5717: Support for reading ORC data files
......................................................................

IMPALA-5717: Support for reading ORC data files

This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.

A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.

Currently, we only support reading primitive types. Writing into ORC
table has not been supported neither.

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
   is not robust for corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <hu...@gmail.com>
Reviewed-by: Tim Armstrong <ta...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/CMakeLists.txt
A be/src/exec/hdfs-orc-scanner.cc
A be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-mt.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/util/backend-gflag-util.cc
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindOrc.cmake
M common/thrift/BackendGflags.thrift
M common/thrift/CatalogObjects.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/jflex/sql-scanner.flex
M testdata/LineItemMultiBlock/README.dox
A testdata/LineItemMultiBlock/lineitem_orc_multiblock_one_stripe.orc
A testdata/LineItemMultiBlock/lineitem_sixblocks.orc
A testdata/LineItemMultiBlock/lineitem_threeblocks.orc
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/run-hive-server.sh
M testdata/cluster/node_templates/common/etc/hadoop/conf/hdfs-site.xml.tmpl
A testdata/data/chars-formats.orc
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_dimensions.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/impala_test_suite.py
M tests/common/test_dimensions.py
M tests/common/test_vector.py
M tests/comparison/cli_options.py
M tests/query_test/test_chars.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
62 files changed, 1,745 insertions(+), 307 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/88/9988/1
-- 
To view, visit http://gerrit.cloudera.org:8080/9988
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9988
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR](2.x) IMPALA-5717: Support for reading ORC data files

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/9988 )

Change subject: IMPALA-5717: Support for reading ORC data files
......................................................................

IMPALA-5717: Support for reading ORC data files

This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.

A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.

Currently, we only support reading primitive types. Writing into ORC
table has not been supported neither.

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
   is not robust for corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <hu...@gmail.com>
Reviewed-by: Tim Armstrong <ta...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-on: http://gerrit.cloudera.org:8080/9988
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/CMakeLists.txt
A be/src/exec/hdfs-orc-scanner.cc
A be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-mt.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/util/backend-gflag-util.cc
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindOrc.cmake
M common/thrift/BackendGflags.thrift
M common/thrift/CatalogObjects.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/jflex/sql-scanner.flex
M testdata/LineItemMultiBlock/README.dox
A testdata/LineItemMultiBlock/lineitem_orc_multiblock_one_stripe.orc
A testdata/LineItemMultiBlock/lineitem_sixblocks.orc
A testdata/LineItemMultiBlock/lineitem_threeblocks.orc
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/run-hive-server.sh
M testdata/cluster/node_templates/common/etc/hadoop/conf/hdfs-site.xml.tmpl
A testdata/data/chars-formats.orc
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_dimensions.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/impala_test_suite.py
M tests/common/test_dimensions.py
M tests/common/test_vector.py
M tests/comparison/cli_options.py
M tests/query_test/test_chars.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
62 files changed, 1,745 insertions(+), 307 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Tim Armstrong: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/9988
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: merged
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9988
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR](2.x) IMPALA-5717: Support for reading ORC data files

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/9988 )

Change subject: IMPALA-5717: Support for reading ORC data files
......................................................................


Patch Set 1:

Tianyi's change is in. Submitting.


-- 
To view, visit http://gerrit.cloudera.org:8080/9988
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9988
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 13 Apr 2018 03:25:15 +0000
Gerrit-HasComments: No

[Impala-ASF-CR](2.x) IMPALA-5717: Support for reading ORC data files

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9988 )

Change subject: IMPALA-5717: Support for reading ORC data files
......................................................................


Patch Set 1:

Here's the ORC patch backported to 2.x with a minor conflict resolved. Joe, I think you were looking at the backed up commits on 2.x, so I'll leave you to decide when to submit this.


-- 
To view, visit http://gerrit.cloudera.org:8080/9988
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9988
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 12 Apr 2018 01:14:25 +0000
Gerrit-HasComments: No

[Impala-ASF-CR](2.x) IMPALA-5717: Support for reading ORC data files

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9988 )

Change subject: IMPALA-5717: Support for reading ORC data files
......................................................................


Patch Set 1: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/9988
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9988
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 12 Apr 2018 01:13:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR](2.x) IMPALA-5717: Support for reading ORC data files

Posted by "Philip Zeyliger (Code Review)" <ge...@cloudera.org>.
Philip Zeyliger has posted comments on this change. ( http://gerrit.cloudera.org:8080/9988 )

Change subject: IMPALA-5717: Support for reading ORC data files
......................................................................


Patch Set 1:

Thanks for pro-actively doing the cherry-pick. I unblocked most of the cherrypick job and now the next failed cherrypick is Tianyi's change and then this one.

818cd8fa2721cd8205b304563b728952bffc8b2f IMPALA-5717: Support for reading ORC data files (Wed Apr 11 05:13:02 2018 +0000) - stiga-huang
d28b39afae5b8f1d621dcd2e1884c0353f1c058f IMPALA-6820: Remove _impala_builtins from catalogd (Wed Apr 11 03:55:07 2018 +0000) - Tianyi Wang


-- 
To view, visit http://gerrit.cloudera.org:8080/9988
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9988
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 12 Apr 2018 03:02:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR](2.x) IMPALA-5717: Support for reading ORC data files

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/9988 )

Change subject: IMPALA-5717: Support for reading ORC data files
......................................................................


Patch Set 2:

I just find that I ignore this... Thank you, Tim, Joe, Philip, and Tianyi for doing this for me!


-- 
To view, visit http://gerrit.cloudera.org:8080/9988
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9988
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 13 Apr 2018 04:55:20 +0000
Gerrit-HasComments: No