Posted to reviews@impala.apache.org by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org> on 2019/11/15 18:58:26 UTC

[Impala-ASF-CR] [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table

Yanjia Gary Li has uploaded this change for review. ( http://gerrit.cloudera.org:8080/14711 )


Change subject: [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................

[WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table

Hudi Read Optimized Table contains multiple versions of parquet files,
in order to load the correct table, Impala needs to recognize Hudi Read
Optimized Table as a HdfsTable and load the latest version of the file
using HoodieROTablePathFilter.

Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
---
M fe/pom.xml
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
A fe/src/main/java/org/apache/impala/util/HudiUtil.java
5 files changed, 81 insertions(+), 3 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14711/2
-- 
To view, visit http://gerrit.cloudera.org:8080/14711
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 2
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 21:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/14711/21/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/14711/21/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1561
PS21, Line 1561: (format == HdfsFileFormat.PARQUET) || (format == HdfsFileFormat.HUDI_PARQUET)
> can be replaced by isParquetBased(format)?
Done


http://gerrit.cloudera.org:8080/#/c/14711/21/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/14711/21/testdata/data/README@487
PS21, Line 487: file footer.
> file name?
Done


http://gerrit.cloudera.org:8080/#/c/14711/21/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/14711/21/testdata/datasets/functional/functional_schema_template.sql@2763
PS21, Line 2763: S
> You need to place a ';' at the end of all your statements, otherwise the ge
Done


http://gerrit.cloudera.org:8080/#/c/14711/21/testdata/datasets/functional/functional_schema_template.sql@2776
PS21, Line 2776: LOCATION '/test-warehouse/hudi_parquet'
> Add ';'
Done


http://gerrit.cloudera.org:8080/#/c/14711/21/testdata/datasets/functional/functional_schema_template.sql@2786
PS21, Line 2786: LOCATION '/test-warehouse/hudi_parquet'
> Add ';'
Done


http://gerrit.cloudera.org:8080/#/c/14711/21/testdata/workloads/functional-query/queries/QueryTest/hudi-parquet.test
File testdata/workloads/functional-query/queries/QueryTest/hudi-parquet.test:

http://gerrit.cloudera.org:8080/#/c/14711/21/testdata/workloads/functional-query/queries/QueryTest/hudi-parquet.test@71
PS21, Line 71: USE functional_parquet;
> nit: either put it at the beginning of the file and you don't need to spell
Done




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 21
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 07 Feb 2020 04:35:06 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 25:

Congrats, Yanjia! Your first commit to Apache Impala has just been merged. Great job!



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 25
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 11 Feb 2020 15:10:44 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has uploaded a new patch set (#14). ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................

IMPALA-8778: Support Apache Hudi Read Optimized Table

Hudi Read Optimized Table contains multiple versions of parquet files,
in order to load the correct table, Impala needs to recognize Hudi Read
Optimized Table as a HdfsTable and load the latest version of the file
using HoodieROTablePathFilter.
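
As a rough illustration of "load the latest version of the file", the sketch below assumes the file naming visible in the test data of this change, where a name has the form <file id>_<token>_<commit timestamp>.parquet, and keeps only the newest file per file id. It is a simplified Python stand-in and not the logic of HoodieROTablePathFilter, which is a Hudi class and may apply further checks. The helper names are hypothetical, and the middle field is treated as an opaque token.

```python
import os


def parse_base_file(name):
    """Split '<file id>_<token>_<timestamp>.parquet' into its three fields.

    Returns None for anything that does not match this assumed layout,
    e.g. '.hoodie_partition_metadata'.
    """
    stem, ext = os.path.splitext(name)
    if ext != ".parquet":
        return None
    parts = stem.split("_")
    if len(parts) != 3:
        return None
    return tuple(parts)


def latest_versions(names):
    """Keep, for every file id, only the name with the newest timestamp."""
    latest = {}
    for name in names:
        parsed = parse_base_file(name)
        if parsed is None:
            continue
        file_id, _token, timestamp = parsed
        # Timestamps are fixed-width digit strings, so string comparison
        # orders them chronologically.
        if file_id not in latest or timestamp > latest[file_id][0]:
            latest[file_id] = (timestamp, name)
    return sorted(name for _timestamp, name in latest.values())


partition_files = [
    ".hoodie_partition_metadata",
    "ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-38-282_20200112194529.parquet",
    "ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-5-10_20200112194517.parquet",
]
print(latest_versions(partition_files))
```

With the two versions from the year=2015/month=03/day=16 partition, only the file written at 20200112194529 is kept.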

Tests
 - Unit test for Hudi in FileMetadataLoader
 - Query tests in create-table.test
 - Query tests in hudiparquet.test

Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node.cc
M bin/impala-config.sh
M bin/rat_exclude_files.txt
M common/thrift/CatalogObjects.thrift
M fe/pom.xml
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
A fe/src/main/java/org/apache/impala/util/HudiUtil.java
M fe/src/main/jflex/sql-scanner.flex
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
M impala-parent/pom.xml
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/hudicow/.hoodie/20200112194517.clean
A testdata/data/hudicow/.hoodie/20200112194517.clean.inflight
A testdata/data/hudicow/.hoodie/20200112194517.clean.requested
A testdata/data/hudicow/.hoodie/20200112194517.commit
A testdata/data/hudicow/.hoodie/20200112194517.commit.requested
A testdata/data/hudicow/.hoodie/20200112194517.inflight
A testdata/data/hudicow/.hoodie/20200112194529.clean
A testdata/data/hudicow/.hoodie/20200112194529.clean.inflight
A testdata/data/hudicow/.hoodie/20200112194529.clean.requested
A testdata/data/hudicow/.hoodie/20200112194529.commit
A testdata/data/hudicow/.hoodie/20200112194529.commit.requested
A testdata/data/hudicow/.hoodie/20200112194529.inflight
A testdata/data/hudicow/.hoodie/hoodie.properties
A testdata/data/hudicow/year=2015/month=03/day=16/.hoodie_partition_metadata
A testdata/data/hudicow/year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-38-282_20200112194529.parquet
A testdata/data/hudicow/year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-5-10_20200112194517.parquet
A testdata/data/hudicow/year=2015/month=03/day=17/.hoodie_partition_metadata
A testdata/data/hudicow/year=2015/month=03/day=17/45c9fa97-e514-41e8-91d2-6098e5995cdb-0_0-38-281_20200112194529.parquet
A testdata/data/hudicow/year=2015/month=03/day=17/45c9fa97-e514-41e8-91d2-6098e5995cdb-0_0-5-9_20200112194517.parquet
A testdata/data/hudicow/year=2016/month=03/day=15/.hoodie_partition_metadata
A testdata/data/hudicow/year=2016/month=03/day=15/17dda230-e48a-4110-8c29-c613a3ac0b70-0_2-38-283_20200112194529.parquet
A testdata/data/hudicow/year=2016/month=03/day=15/17dda230-e48a-4110-8c29-c613a3ac0b70-0_2-5-11_20200112194517.parquet
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M testdata/workloads/functional-query/queries/QueryTest/create-table.test
A testdata/workloads/functional-query/queries/QueryTest/hudiparquet.test
M tests/common/test_dimensions.py
M tests/query_test/test_scanners.py
50 files changed, 585 insertions(+), 48 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14711/14

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 14
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 20:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5583/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 20
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 03 Feb 2020 00:01:27 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has uploaded a new patch set (#19). ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................

IMPALA-8778: Support Apache Hudi Read Optimized Table

Hudi Read Optimized Table contains multiple versions of parquet files,
in order to load the table correctly, Impala needs to recognize Hudi Read
Optimized Table as a HdfsTable and load the latest version of the file
using HoodieROTablePathFilter.

Tests
 - Unit test for Hudi in FileMetadataLoader
 - Query tests in create-table.test
 - Query tests in hudiparquet.test

Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node.cc
M bin/impala-config.sh
M bin/rat_exclude_files.txt
M common/thrift/CatalogObjects.thrift
M fe/pom.xml
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
A fe/src/main/java/org/apache/impala/util/HudiUtil.java
M fe/src/main/jflex/sql-scanner.flex
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
M impala-parent/pom.xml
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/hudi_parquet/.hoodie/20200112194517.clean
A testdata/data/hudi_parquet/.hoodie/20200112194517.clean.inflight
A testdata/data/hudi_parquet/.hoodie/20200112194517.clean.requested
A testdata/data/hudi_parquet/.hoodie/20200112194517.commit
A testdata/data/hudi_parquet/.hoodie/20200112194517.commit.requested
A testdata/data/hudi_parquet/.hoodie/20200112194517.inflight
A testdata/data/hudi_parquet/.hoodie/20200112194529.clean
A testdata/data/hudi_parquet/.hoodie/20200112194529.clean.inflight
A testdata/data/hudi_parquet/.hoodie/20200112194529.clean.requested
A testdata/data/hudi_parquet/.hoodie/20200112194529.commit
A testdata/data/hudi_parquet/.hoodie/20200112194529.commit.requested
A testdata/data/hudi_parquet/.hoodie/20200112194529.inflight
A testdata/data/hudi_parquet/.hoodie/hoodie.properties
A testdata/data/hudi_parquet/year=2015/month=03/day=16/.hoodie_partition_metadata
A testdata/data/hudi_parquet/year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-38-282_20200112194529.parquet
A testdata/data/hudi_parquet/year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-5-10_20200112194517.parquet
A testdata/data/hudi_parquet/year=2015/month=03/day=17/.hoodie_partition_metadata
A testdata/data/hudi_parquet/year=2015/month=03/day=17/45c9fa97-e514-41e8-91d2-6098e5995cdb-0_0-38-281_20200112194529.parquet
A testdata/data/hudi_parquet/year=2015/month=03/day=17/45c9fa97-e514-41e8-91d2-6098e5995cdb-0_0-5-9_20200112194517.parquet
A testdata/data/hudi_parquet/year=2016/month=03/day=15/.hoodie_partition_metadata
A testdata/data/hudi_parquet/year=2016/month=03/day=15/17dda230-e48a-4110-8c29-c613a3ac0b70-0_2-38-283_20200112194529.parquet
A testdata/data/hudi_parquet/year=2016/month=03/day=15/17dda230-e48a-4110-8c29-c613a3ac0b70-0_2-5-11_20200112194517.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/hudi-parquet.test
M tests/common/test_dimensions.py
M tests/query_test/test_scanners.py
45 files changed, 631 insertions(+), 43 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14711/19

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 19
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 22:

(1 comment)

Sorry for joining late and thanks for the great work!

I have some questions / concerns about the example parquet files.

http://gerrit.cloudera.org:8080/#/c/14711/22/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/14711/22/testdata/data/README@489
PS22, Line 489: `ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-5-10_20200112194517.parquet`
              : `ca51fa17-681b-4497-85b7-4f68e7a63ee7-0` is the bloom index hash of this file
              : `20200112194517` is the timestamp of this version
I would prefer not to add such large (~0.5MB) files to the .git repo if possible. There are already some > 1MB files there, and there are some tests that really need large files, but it would be good to think about possibilities to avoid it in this case.

We looked at the files with Zoltan - it only has a few rows, and the whole file is compressible to ~7K, so we were not sure about the reason behind the large size. We noticed a large Bloom filter in the metadata - maybe the uncompressed Bloom filter is the reason behind the large size?

It would also be nice to add some more information here about the way these files were created. According to https://issues.apache.org/jira/browse/PARQUET-41 Parquet bloom filters are not yet finished. Did you use a released parquet-mr to create the files, or does Hoodie have its own fork of parquet-mr or its own writer?




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 22
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 07 Feb 2020 15:44:43 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 7:

> Patch Set 7:
> 
> (3 comments)
> 
> Hi! Thanks for working on this, left some comments there.

Thanks for reviewing. I now realize this change needs to be far more involved than the current commit, but the good thing is I can get some insights from https://jira.apache.org/jira/browse/IMPALA-5717
I will let you know when this is ready for review.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 7
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Sat, 11 Jan 2020 06:04:45 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 16:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14711/16/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/14711/16/be/src/exec/hdfs-scan-node-base.cc@379
PS16, Line 379: HUDI_PARQUET
> I see, but in the backend you just create "low-level" operators, such as sc
Agree with Zoltan. It would be nice if there were no necessary backend changes here. While this might change in the future (e.g. when Hudi near real-time format is added), I think we can worry about that in a separate patch.

I'm actually surprised this doesn't already work with your current patch. The code Zoltan linked (https://gerrit.cloudera.org/#/c/14711/7/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java line 180) serializes all THdfsFileFormat#HUDI_PARQUET as THdfsFileFormat#PARQUET.

That being said, I can't point to the place in the planner where you would need to make this change (maybe someone more familiar with the planner code would know). So this isn't a blocker for me.
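
The idea discussed here, a frontend-only format that is collapsed to plain Parquet before it reaches the backend, can be sketched as follows. This is a Python analogy with hypothetical enum and method names, not Impala's Java HdfsFileFormat code. It only illustrates why the backend would not need to know about HUDI_PARQUET if the serialization step maps it to PARQUET.

```python
from enum import Enum


class ThriftFileFormat(Enum):
    """Stand-in for the wire-level format the backend sees (hypothetical subset)."""
    TEXT = "TEXT"
    PARQUET = "PARQUET"


class FileFormat(Enum):
    """Stand-in for the frontend catalog format (hypothetical subset)."""
    TEXT = "TEXT"
    PARQUET = "PARQUET"
    HUDI_PARQUET = "HUDI_PARQUET"

    def is_parquet_based(self):
        # Both formats store their data in ordinary Parquet files.
        return self in (FileFormat.PARQUET, FileFormat.HUDI_PARQUET)

    def to_thrift(self):
        # Collapse every Parquet-based format to PARQUET on the wire, so the
        # scanner selection in the backend stays unchanged.
        if self.is_parquet_based():
            return ThriftFileFormat.PARQUET
        return ThriftFileFormat[self.name]


print(FileFormat.HUDI_PARQUET.to_thrift())
```

A helper such as is_parquet_based() also replaces repeated "PARQUET or HUDI_PARQUET" comparisons in the planner with a single call.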


http://gerrit.cloudera.org:8080/#/c/14711/16/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/14711/16/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@354
PS16, Line 354: isParquet
> maybe 'isParquetBased'?
+1




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 16
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 05 Feb 2020 16:35:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 14:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java@68
PS14, Line 68: THdfsFileFormat.HUDI_PARQUET
> Thanks for reviewing! I put HUDI_PARQUET here because the comment above say
This is just about the CREATE TABLE AS SELECT statement, which currently cannot be supported by Impala to create Hudi tables. This statement creates a new table based on the results in the SELECT statement. Therefore it would require Impala to write data in Hudi format.

You will still be able to define a Hudi table in a plain CREATE TABLE statement with partitions as well.


http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java@74
PS14, Line 74: HUDI_PARQUET
> Make sense. statement like "CREATE TABLE LIKE PARQUET xxx STORED AS HUDIPAR
Right.


http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/jflex/sql-scanner.flex
File fe/src/main/jflex/sql-scanner.flex:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/jflex/sql-scanner.flex@149
PS14, Line 149: hudiparquet
> I am following the naming convention of SEQUENCEFILE. 
I see. I don't have a strong opinion about it, my personal preference is for HUDI_PARQUET, but I don't mind if you choose the other for consistency. Maybe other reviewers will also have an opinion about it.


http://gerrit.cloudera.org:8080/#/c/14711/14/tests/common/test_dimensions.py
File tests/common/test_dimensions.py:

http://gerrit.cloudera.org:8080/#/c/14711/14/tests/common/test_dimensions.py@33
PS14, Line 33: hudiparquet
> I am not quite sure if I understand the .test execution sequence correctly.
OK, so Impala's end-to-end tests are organized around workloads. A workload is basically the following:

* a set of tests
* test dimensions (contains the parameter space of the tests, e.g. file formats, compression, etc.) See testdata/workloads/functional-query/functional-query_dimensions.csv
* dataset, e.g. the 'functional' dataset belongs to the 'functional-query' workload. This dataset needs to be loaded in multiple file formats using different compressions. Not all combinations need to be loaded; this is restricted by 'schema_constraints.csv'

Before you run any tests you need to load data into your cluster. There are several ways of doing this, one way is to invoke buildall.sh -testdata. Among other things it will load the 'functional' dataset required for the 'functional-query' workload. You can look around in Impala shell and see what databases are created using 'show databases'.

Of course there are some exceptions, e.g. some tests load their own data, create their own tables, especially when we want to test CREATE or INSERT statements. These typically run in a temporary 'unique database'.

Now you can run the test cases. You run a test with an 'exploration strategy' which means it will run the test multiple times with different combinations of test parameters.

If you run a test with 'exhaustive' exploration strategy, it means it will run all the allowed combinations of the test parameters (as mentioned above, it is restricted by 'schema_constraints.csv'). E.g. the file 'testdata/workloads/functional-query/functional-query_exhaustive.csv' contains these auto-generated parameter lists. A manually selected subset of it is in 'functional-query_core.csv'. These are the parameter combinations that we think are the most important, we refer to it as the 'core' exploration strategy or just 'core tests'.
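
The relationship between the 'exhaustive' and 'core' strategies can be sketched in a few lines of Python. The dimension values, the constraint and the core list below are made up for illustration and are not taken from Impala's CSV files. The sketch only shows that 'exhaustive' is the full cross product minus disallowed combinations, while 'core' is a hand-picked subset of it.

```python
from itertools import product

# Hypothetical dimensions; the real ones live in functional-query_dimensions.csv.
DIMENSIONS = {
    "file_format": ["text", "parquet"],
    "compression": ["none", "snap"],
}


def allowed(vector):
    # Made-up constraint: pretend the dataset is never loaded as compressed text.
    return not (vector["file_format"] == "text" and vector["compression"] != "none")


def exhaustive(dimensions, is_allowed):
    """Return every allowed combination of the dimension values."""
    names = sorted(dimensions)
    combos = product(*(dimensions[name] for name in names))
    vectors = (dict(zip(names, values)) for values in combos)
    return [vector for vector in vectors if is_allowed(vector)]


# A manually selected subset, analogous to the 'core' strategy.
CORE = [
    {"compression": "none", "file_format": "text"},
    {"compression": "snap", "file_format": "parquet"},
]

all_vectors = exhaustive(DIMENSIONS, allowed)
print(len(all_vectors), "exhaustive vectors,", len(CORE), "core vectors")
```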

If you added HUDI_PARQUET to the test matrix of functional-query, all tests belonging to it would be executed with file_format=HUDI_PARQUET as well. But most of them would fail simply because we cannot load the 'functional' dataset in Hudi format. So you'd need to manually exclude those tests from the test matrix with the help of 'cls.ImpalaTestMatrix.add_constraint()'. E.g. tests in test_scanners.py::TestParquet run only with file_format=='parquet'.
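
To make the effect of such a constraint concrete, here is a toy matrix class in Python. It is not Impala's ImpalaTestMatrix, and the real predicate most likely receives a test vector object rather than a plain dict. The class name and the dict-based vectors are assumptions for this sketch.

```python
class ToyTestMatrix:
    """Minimal stand-in for a test matrix that supports constraints."""

    def __init__(self, vectors):
        self._vectors = list(vectors)
        self._constraints = []

    def add_constraint(self, predicate):
        # A vector is kept only if every registered predicate accepts it.
        self._constraints.append(predicate)

    def generate(self):
        return [v for v in self._vectors
                if all(check(v) for check in self._constraints)]


matrix = ToyTestMatrix(
    {"file_format": fmt} for fmt in ("text", "parquet", "hudi_parquet"))
# Restrict this toy test class to plain Parquet only.
matrix.add_constraint(lambda v: v["file_format"] == "parquet")
print(matrix.generate())
```

Every test class that cannot run on Hudi data would need a constraint of this kind, which is why adding a new format to the shared matrix touches many tests.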

Another option is to define your own separate Hudi workload. This might be the cleanest, but it would probably require the most boilerplate code.

You could also cheat and add your python test functions to the 'functional-query' workload (without touching the test dimensions) and hardcode the file format in the query statements. At this point it might be OK to choose this option. I'll let you choose the option that works best for you, and we can iterate on it later.




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 14
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 16 Jan 2020 19:23:55 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 11:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5424/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 11
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 14 Jan 2020 08:11:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 14:

(16 comments)

Thanks for applying the changes, I think it's getting into shape.

To answer your question about the test data loading: 'create-table.test' is independent from the other tests. It creates the tables in a temporary database (a so-called "unique database") that is created at the beginning of the test and gets dropped after it.

So you could create the Hudi tables during data loading, or at the beginning of hudiparquet.test

Some additional observations:
* I think we shouldn't allow Hudi + ACID at the same time (see 'transactional_properties'), but maybe we can handle it in a follow-up Jira
* Don't forget to update the testdata/data/README file

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java@68
PS14, Line 68: THdfsFileFormat.HUDI_PARQUET
I think Impala should not support inserting into HUDI_PARQUET because currently it doesn't understand the underlying file structure.


http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java@74
PS14, Line 74: HUDI_PARQUET
We can only create a table from a Parquet file; it doesn't matter whether the file is inside Hudi's directory structure.


http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
File fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java@85
PS14, Line 85: udi
nit: HUDI


http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@372
PS14, Line 372: fileFormats_.contains(HdfsFileFormat.PARQUET)
              :         || fileFormats_.contains(HdfsFileFormat.HUDI_PARQUET)
Maybe you could introduce an IsParquetBased() method for simplicity.
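
A rough sketch of what such a helper could look like, not code from the patch. The enum and class names below are hypothetical stand-ins (the real enum is HdfsFileFormat and has more members), and the method is spelled isParquetBased() to follow Java naming conventions:

```java
import java.util.Set;

// Hypothetical, trimmed-down stand-in for Impala's HdfsFileFormat enum.
enum FileFormatSketch {
  TEXT, ORC, PARQUET, HUDI_PARQUET;

  // True for formats whose data files are plain Parquet files on disk.
  boolean isParquetBased() {
    return this == PARQUET || this == HUDI_PARQUET;
  }
}

// Hypothetical stand-in for the check done in the scan node.
final class ScanNodeSketch {
  private ScanNodeSketch() {}

  // Replaces the repeated contains(PARQUET) || contains(HUDI_PARQUET) test.
  static boolean hasParquetBasedFormat(Set<FileFormatSketch> fileFormats) {
    for (FileFormatSketch format : fileFormats) {
      if (format.isParquetBased()) return true;
    }
    return false;
  }
}
```

With this shape, adding another Parquet-based format later would only touch isParquetBased(), not every call site.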


http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/util/HudiUtil.java
File fe/src/main/java/org/apache/impala/util/HudiUtil.java:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/util/HudiUtil.java@27
PS14, Line 27: class
nit: method. You could also describe what effect this function has.


http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/jflex/sql-scanner.flex
File fe/src/main/jflex/sql-scanner.flex:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/jflex/sql-scanner.flex@149
PS14, Line 149: hudiparquet
nit: I'm in favor of "hudi_parquet" because it's more readable.


http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/bin/create-load-data.sh@216
PS14, Line 216:   # Hudi Table
              :   mkdir ${TMP_DIR}/hudicow
              :   cp -r ${IMPALA_HOME}/testdata/data/hudicow/. ${TMP_DIR}/hudicow
Maybe load-custom-data() is a better place for it.

You could also CREATE the Hudi tables from this file. Maybe create a new .sql file that contains the CREATE TABLE statements and execute this SQL file from here.


http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/bin/generate-schema-statements.py@151
PS14, Line 151: hudiparquet
We don't need Hudi-related changes in this file because we don't want to (and cannot) create Hudi workloads during the data load.

In other words, we can only copy some already written Hudi tables to HDFS and issue some queries on them. But we cannot write tables in Hudi format with Impala or Hive yet.


http://gerrit.cloudera.org:8080/#/c/14711/11/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/14711/11/testdata/data/README@458
PS11, Line 458: 
Please extend this file with a description of the Hudi table.


http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/functional-query_core.csv
File testdata/workloads/functional-query/functional-query_core.csv:

http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/functional-query_core.csv@7
PS14, Line 7: file_format:hudiparquet, dataset: functional, compression_codec: none, compression_type: none
We cannot write tables in Hudi format yet.


http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/functional-query_dimensions.csv
File testdata/workloads/functional-query/functional-query_dimensions.csv:

http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/functional-query_dimensions.csv@1
PS14, Line 1: hudiparquet
We cannot write tables in Hudi format.


http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/functional-query_exhaustive.csv
File testdata/workloads/functional-query/functional-query_exhaustive.csv:

http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/functional-query_exhaustive.csv@26
PS14, Line 26: hudiparquet
We cannot write tables in Hudi format


http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/functional-query_pairwise.csv
File testdata/workloads/functional-query/functional-query_pairwise.csv:

http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/functional-query_pairwise.csv@7
PS14, Line 7: hudiparquet
We cannot write tables in Hudi format


http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/queries/QueryTest/create-table.test
File testdata/workloads/functional-query/queries/QueryTest/create-table.test:

http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/queries/QueryTest/create-table.test@303
PS14, Line 303: hudiparquet '/test-warehouse/hudicow/year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-38-282_20200112194529.parquet'
It's really just a Parquet file. We could already use it in a CREATE TABLE LIKE PARQUET statement.


http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/queries/QueryTest/create-table.test@303
PS14, Line 303: partitioned 
nit: line too long; you could break it here and below


http://gerrit.cloudera.org:8080/#/c/14711/14/tests/common/test_dimensions.py
File tests/common/test_dimensions.py:

http://gerrit.cloudera.org:8080/#/c/14711/14/tests/common/test_dimensions.py@33
PS14, Line 33: hudiparquet
We cannot run these tests on Hudi tables because we cannot load the workloads in Hudi format.



-- 

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 14
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 15 Jan 2020 17:01:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5385/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 7
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 09 Jan 2020 02:59:37 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has uploaded a new patch set (#13). ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................

IMPALA-8778: Support Apache Hudi Read Optimized Table

Hudi Read Optimized Table contains multiple versions of parquet files,
in order to load the correct table, Impala needs to recognize Hudi Read
Optimized Table as a HdfsTable and load the latest version of the file
using HoodieROTablePathFilter.

Tests
 - Unit test for Hudi in FileMetadataLoader
 - Query tests in create-table.test
 - Query tests in hudiparquet.test

Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node.cc
M bin/impala-config.sh
M bin/rat_exclude_files.txt
M common/thrift/CatalogObjects.thrift
M fe/pom.xml
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
A fe/src/main/java/org/apache/impala/util/HudiUtil.java
M fe/src/main/jflex/sql-scanner.flex
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
M impala-parent/pom.xml
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/hudicow/.hoodie/20200112194517.clean
A testdata/data/hudicow/.hoodie/20200112194517.clean.inflight
A testdata/data/hudicow/.hoodie/20200112194517.clean.requested
A testdata/data/hudicow/.hoodie/20200112194517.commit
A testdata/data/hudicow/.hoodie/20200112194517.commit.requested
A testdata/data/hudicow/.hoodie/20200112194517.inflight
A testdata/data/hudicow/.hoodie/20200112194529.clean
A testdata/data/hudicow/.hoodie/20200112194529.clean.inflight
A testdata/data/hudicow/.hoodie/20200112194529.clean.requested
A testdata/data/hudicow/.hoodie/20200112194529.commit
A testdata/data/hudicow/.hoodie/20200112194529.commit.requested
A testdata/data/hudicow/.hoodie/20200112194529.inflight
A testdata/data/hudicow/.hoodie/hoodie.properties
A testdata/data/hudicow/year=2015/month=03/day=16/.hoodie_partition_metadata
A testdata/data/hudicow/year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-38-282_20200112194529.parquet
A testdata/data/hudicow/year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-5-10_20200112194517.parquet
A testdata/data/hudicow/year=2015/month=03/day=17/.hoodie_partition_metadata
A testdata/data/hudicow/year=2015/month=03/day=17/45c9fa97-e514-41e8-91d2-6098e5995cdb-0_0-38-281_20200112194529.parquet
A testdata/data/hudicow/year=2015/month=03/day=17/45c9fa97-e514-41e8-91d2-6098e5995cdb-0_0-5-9_20200112194517.parquet
A testdata/data/hudicow/year=2016/month=03/day=15/.hoodie_partition_metadata
A testdata/data/hudicow/year=2016/month=03/day=15/17dda230-e48a-4110-8c29-c613a3ac0b70-0_2-38-283_20200112194529.parquet
A testdata/data/hudicow/year=2016/month=03/day=15/17dda230-e48a-4110-8c29-c613a3ac0b70-0_2-5-11_20200112194517.parquet
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M testdata/workloads/functional-query/queries/QueryTest/create-table.test
A testdata/workloads/functional-query/queries/QueryTest/hudiparquet.test
M tests/common/test_dimensions.py
M tests/query_test/test_scanners.py
50 files changed, 582 insertions(+), 48 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14711/13
-- 

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 13
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 16:

(12 comments)

does this allow writing to Hudi Parquet tables? if not, it would be good to have a test that validates that any write from Impala to a Hudi Parquet table fails
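
To make the idea concrete, here is a minimal sketch of a write guard plus the kind of check a test could make against it. It is only an illustration under assumptions: every name below (WriteGuardSketch, UnsupportedWriteException, checkInsertTarget) is hypothetical, and the real change would presumably report the error through Impala's own analysis error path rather than a custom exception.

```java
// Hypothetical sketch: reject INSERT targets stored as HUDI_PARQUET.
final class WriteGuardSketch {
  private WriteGuardSketch() {}

  // Trimmed-down stand-in for the table's file format.
  enum Format { PARQUET, HUDI_PARQUET }

  // Hypothetical error type used only in this sketch.
  static final class UnsupportedWriteException extends RuntimeException {
    UnsupportedWriteException(String msg) { super(msg); }
  }

  // Throws if the target table's format cannot be written.
  static void checkInsertTarget(String tableName, Format format) {
    if (format == Format.HUDI_PARQUET) {
      throw new UnsupportedWriteException(
          "Writing to table '" + tableName + "' is not supported: HUDI_PARQUET");
    }
  }

  // Helper a test could use: true if the guard rejected the insert.
  static boolean insertIsRejected(String tableName, Format format) {
    try {
      checkInsertTarget(tableName, format);
      return false;
    } catch (UnsupportedWriteException e) {
      return true;
    }
  }
}
```

A negative test would then assert that the Hudi case is rejected while the plain Parquet case still passes.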

http://gerrit.cloudera.org:8080/#/c/14711/16/fe/pom.xml
File fe/pom.xml:

http://gerrit.cloudera.org:8080/#/c/14711/16/fe/pom.xml@140
PS16, Line 140: org.apache.hudi
is the dependency on Avro jar (org.apache.avro) needed?


http://gerrit.cloudera.org:8080/#/c/14711/16/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

http://gerrit.cloudera.org:8080/#/c/14711/16/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@892
PS16, Line 892:    * only,
              :    * false otherwise.
nit: collapse to a single line


http://gerrit.cloudera.org:8080/#/c/14711/16/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
File fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java:

http://gerrit.cloudera.org:8080/#/c/14711/16/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java@197
PS16, Line 197: fileFormat_
since fileFormat_ can be null, should there be a null check here?
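
A small sketch of one way to keep the check null-safe, assuming the field is an enum and may be left null by callers that do not know the format. The class and enum names are hypothetical; only the field name fileFormat_ comes from the patch.

```java
// Hypothetical stand-in for the file format enum.
enum LoaderFormatSketch { PARQUET, HUDI_PARQUET }

// Hypothetical stand-in for the metadata loader.
final class LoaderSketch {
  // May be null when the caller does not pass a format (assumption).
  private final LoaderFormatSketch fileFormat_;

  LoaderSketch(LoaderFormatSketch fileFormat) {
    fileFormat_ = fileFormat;
  }

  boolean needsHudiFiltering() {
    // Comparing enum references with == is safe when fileFormat_ is null.
    // Calling fileFormat_.equals(...) instead would throw a
    // NullPointerException in that case.
    return fileFormat_ == LoaderFormatSketch.HUDI_PARQUET;
  }
}
```

If the comparison is already written with ==, an explicit null check would not be needed; it matters when a method is invoked on the field.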


http://gerrit.cloudera.org:8080/#/c/14711/16/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
File fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java:

http://gerrit.cloudera.org:8080/#/c/14711/16/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java@73
PS16, Line 73:       "org.apache.hadoop.hive.kudu.KuduSerDe", false, false, false)
             :   ,
nit: combine these two lines


http://gerrit.cloudera.org:8080/#/c/14711/16/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/14711/16/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@354
PS16, Line 354: isParquet
similar comment to Zoltan's above.


http://gerrit.cloudera.org:8080/#/c/14711/16/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1566
PS16, Line 1566:         // file.
               :         // I.e. excluding partition columns and columns that are populated from file
               :         // metadata.
nit: lines can be collapsed


http://gerrit.cloudera.org:8080/#/c/14711/16/fe/src/main/java/org/apache/impala/util/HudiUtil.java
File fe/src/main/java/org/apache/impala/util/HudiUtil.java:

http://gerrit.cloudera.org:8080/#/c/14711/16/fe/src/main/java/org/apache/impala/util/HudiUtil.java@26
PS16, Line 26: /**
would be good to clarify that this method does not modify the given list of FileStatuses. in fact, it might be best to declare the `stats` parameter as final.
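
As an illustration of the "does not modify its input" contract, here is a self-contained sketch that works on file names instead of FileStatus objects. It is not HoodieROTablePathFilter and does not use the Hudi API (the real filter consults the Hudi timeline); it simply assumes names shaped like the ones in the test data, where the text before the first '_' identifies a file group and the text after the last '_' is a fixed-width commit timestamp. All names in the sketch are hypothetical. Note that `final` on a Java parameter only prevents reassigning the reference; the no-mutation guarantee comes from building a new list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keep only the newest file of each file group.
final class LatestFileSketch {
  private static final String SUFFIX = ".parquet";

  private LatestFileSketch() {}

  // Returns a new list; the given list is only read, never modified.
  static List<String> keepLatest(final List<String> names) {
    Map<String, String> latestByGroup = new LinkedHashMap<>();
    for (String name : names) {
      int first = name.indexOf('_');
      int last = name.lastIndexOf('_');
      // Skip names that do not look like <group>_<token>_<timestamp>.parquet.
      if (first < 0 || first == last || !name.endsWith(SUFFIX)) continue;
      String group = name.substring(0, first);
      String current = latestByGroup.get(group);
      // Timestamps are assumed to be fixed width, so string order matches time order.
      if (current == null || instantOf(name).compareTo(instantOf(current)) > 0) {
        latestByGroup.put(group, name);
      }
    }
    return new ArrayList<>(latestByGroup.values());
  }

  // Extracts the trailing timestamp from a name that passed the shape check.
  private static String instantOf(String name) {
    return name.substring(name.lastIndexOf('_') + 1, name.length() - SUFFIX.length());
  }
}
```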


http://gerrit.cloudera.org:8080/#/c/14711/16/fe/src/main/java/org/apache/impala/util/HudiUtil.java@37
PS16, Line 37: ;
nit: is this needed?


http://gerrit.cloudera.org:8080/#/c/14711/16/testdata/bin/create-hudi.sql
File testdata/bin/create-hudi.sql:

http://gerrit.cloudera.org:8080/#/c/14711/16/testdata/bin/create-hudi.sql@22
PS16, Line 22:  
nit: white space


http://gerrit.cloudera.org:8080/#/c/14711/16/testdata/data/hudicow/.hoodie/hoodie.properties
File testdata/data/hudicow/.hoodie/hoodie.properties:

http://gerrit.cloudera.org:8080/#/c/14711/16/testdata/data/hudicow/.hoodie/hoodie.properties@1
PS16, Line 1: Properties
it would be good to have some more documentation on the format of this file, as well as the `.hoodie` folder in general, or at least a link to some documentation explaining the format of this folder and what this file means

testdata/data/README should be updated as well to document the hudicow folder

also what does hudicow stand for?


http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/queries/QueryTest/hudiparquet.test
File testdata/workloads/functional-query/queries/QueryTest/hudiparquet.test:

http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/queries/QueryTest/hudiparquet.test@1
PS14, Line 1: ====
would be good to have some additional test cases:
* some tests that don't just do a count(*); IIRC count(*) queries on Parquet files follow a special path that avoids reading the entire file (it just reads the footer stats to get the count)
* some queries that read from Hudi Parquet tables and regular Parquet tables, and perhaps join them together


http://gerrit.cloudera.org:8080/#/c/14711/16/testdata/workloads/functional-query/queries/QueryTest/hudiparquet.test
File testdata/workloads/functional-query/queries/QueryTest/hudiparquet.test:

http://gerrit.cloudera.org:8080/#/c/14711/16/testdata/workloads/functional-query/queries/QueryTest/hudiparquet.test@1
PS16, Line 1: ====
nit: rename file to 'hudi-parquet.test' to be consistent with other file names



-- 

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 16
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 21 Jan 2020 17:30:13 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 25:

Thanks everyone for reviewing this PR and guiding me through this! Looking forward to having more contributions in the future!


-- 

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 25
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 11 Feb 2020 18:43:45 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14711/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/14711/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@609
PS2, Line 609:           oldFds, hostIndex_, validTxnList, writeIds, e.getValue().get(0).getFileFormat());
line too long (91 > 90)



-- 

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 2
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 15 Nov 2019 18:59:13 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has uploaded a new patch set (#21). ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................

IMPALA-8778: Support Apache Hudi Read Optimized Table

Hudi Read Optimized Table contains multiple versions of parquet files,
in order to load the table correctly, Impala needs to recognize Hudi Read
Optimized Table as a HdfsTable and load the latest version of the file
using HoodieROTablePathFilter.

Tests
 - Unit test for Hudi in FileMetadataLoader
 - Create table tests in functional_schema_template.sql
 - Query tests in hudi-parquet.test

Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
---
M bin/impala-config.sh
M bin/rat_exclude_files.txt
M common/thrift/CatalogObjects.thrift
M fe/pom.xml
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
A fe/src/main/java/org/apache/impala/util/HudiUtil.java
M fe/src/main/jflex/sql-scanner.flex
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
M impala-parent/pom.xml
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/hudi_parquet/.hoodie/20200112194517.clean
A testdata/data/hudi_parquet/.hoodie/20200112194517.clean.inflight
A testdata/data/hudi_parquet/.hoodie/20200112194517.clean.requested
A testdata/data/hudi_parquet/.hoodie/20200112194517.commit
A testdata/data/hudi_parquet/.hoodie/20200112194517.commit.requested
A testdata/data/hudi_parquet/.hoodie/20200112194517.inflight
A testdata/data/hudi_parquet/.hoodie/20200112194529.clean
A testdata/data/hudi_parquet/.hoodie/20200112194529.clean.inflight
A testdata/data/hudi_parquet/.hoodie/20200112194529.clean.requested
A testdata/data/hudi_parquet/.hoodie/20200112194529.commit
A testdata/data/hudi_parquet/.hoodie/20200112194529.commit.requested
A testdata/data/hudi_parquet/.hoodie/20200112194529.inflight
A testdata/data/hudi_parquet/.hoodie/hoodie.properties
A testdata/data/hudi_parquet/year=2015/month=03/day=16/.hoodie_partition_metadata
A testdata/data/hudi_parquet/year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-38-282_20200112194529.parquet
A testdata/data/hudi_parquet/year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-5-10_20200112194517.parquet
A testdata/data/hudi_parquet/year=2015/month=03/day=17/.hoodie_partition_metadata
A testdata/data/hudi_parquet/year=2015/month=03/day=17/45c9fa97-e514-41e8-91d2-6098e5995cdb-0_0-38-281_20200112194529.parquet
A testdata/data/hudi_parquet/year=2015/month=03/day=17/45c9fa97-e514-41e8-91d2-6098e5995cdb-0_0-5-9_20200112194517.parquet
A testdata/data/hudi_parquet/year=2016/month=03/day=15/.hoodie_partition_metadata
A testdata/data/hudi_parquet/year=2016/month=03/day=15/17dda230-e48a-4110-8c29-c613a3ac0b70-0_2-38-283_20200112194529.parquet
A testdata/data/hudi_parquet/year=2016/month=03/day=15/17dda230-e48a-4110-8c29-c613a3ac0b70-0_2-5-11_20200112194517.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/hudi-parquet.test
M tests/query_test/test_scanners.py
42 files changed, 625 insertions(+), 39 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14711/21
-- 

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 21
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 3:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/5029/ : Initial code review checks failed. See linked job for details on the failure.


-- 

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 3
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 15 Nov 2019 21:33:57 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 14:

Hello, I am not able to load all the tests in my local VM, but I used impala-shell to manually run all the tests in the .test files I edited. I assumed that create-table.test will be executed before test_scanners.py; please let me know if that is not the case. It would be great if I could run all the tests using Impala's infrastructure.


-- 

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 14
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 15 Jan 2020 06:26:09 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 20:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14711/20/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/14711/20/tests/query_test/test_scanners.py@320
PS20, Line 320:     self.run_test_case('QueryTest/hudi-parquet', vector)
Thank you all for reviewing. I am able to use
./bin/load-data.py -w functional-query -f --table_formats=hudiparquet/none/none --table_names=hudi_partitioned
to create the Hudi tables, but I am not able to execute the test query here.
Could you point me in the right direction? All I need is to run the query in QueryTest/hudi-parquet.test without breaking other tests.
Thanks a lot.



-- 

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 20
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Sun, 02 Feb 2020 23:19:57 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 24:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5317/ DRY_RUN=false


-- 

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 24
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 11 Feb 2020 10:15:57 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has uploaded a new patch set (#7). ( http://gerrit.cloudera.org:8080/14711 )

Change subject: [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................

[WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table

Hudi Read Optimized Table contains multiple versions of parquet files,
in order to load the correct table, Impala needs to recognize Hudi Read
Optimized Table as a HdfsTable and load the latest version of the file
using HoodieROTablePathFilter.

Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
---
M bin/rat_exclude_files.txt
M fe/pom.xml
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
A fe/src/main/java/org/apache/impala/util/HudiUtil.java
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
M testdata/bin/create-load-data.sh
A testdata/data/hudicow/.hoodie/20191218160747.clean
A testdata/data/hudicow/.hoodie/20191218160747.clean.inflight
A testdata/data/hudicow/.hoodie/20191218160747.clean.requested
A testdata/data/hudicow/.hoodie/20191218160747.commit
A testdata/data/hudicow/.hoodie/20191218160747.commit.requested
A testdata/data/hudicow/.hoodie/20191218160747.inflight
A testdata/data/hudicow/.hoodie/20191218160804.clean
A testdata/data/hudicow/.hoodie/20191218160804.clean.inflight
A testdata/data/hudicow/.hoodie/20191218160804.clean.requested
A testdata/data/hudicow/.hoodie/20191218160804.commit
A testdata/data/hudicow/.hoodie/20191218160804.commit.requested
A testdata/data/hudicow/.hoodie/20191218160804.inflight
A testdata/data/hudicow/.hoodie/hoodie.properties
A testdata/data/hudicow/2015/03/16/.hoodie_partition_metadata
A testdata/data/hudicow/2015/03/16/75ca3e88-0a2c-4592-b1c5-4d4812c80388-0_0-38-281_20191218160804.parquet
A testdata/data/hudicow/2015/03/16/75ca3e88-0a2c-4592-b1c5-4d4812c80388-0_1-5-10_20191218160747.parquet
A testdata/data/hudicow/2015/03/17/.hoodie_partition_metadata
A testdata/data/hudicow/2015/03/17/ea972399-ebd0-4447-a01b-6356fd76e43f-0_2-38-283_20191218160804.parquet
A testdata/data/hudicow/2015/03/17/ea972399-ebd0-4447-a01b-6356fd76e43f-0_2-5-11_20191218160747.parquet
A testdata/data/hudicow/2016/03/15/.hoodie_partition_metadata
A testdata/data/hudicow/2016/03/15/e867a156-5330-4a05-a188-ec39f131b1d0-0_0-5-9_20191218160747.parquet
A testdata/data/hudicow/2016/03/15/e867a156-5330-4a05-a188-ec39f131b1d0-0_1-38-282_20191218160804.parquet
30 files changed, 424 insertions(+), 20 deletions(-)
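
For readers unfamiliar with the layout: the parquet file names listed above appear to follow a `<fileId>_<writeToken>_<commitTime>.parquet` pattern, with two commit times per file ID. The sketch below only illustrates the idea of keeping the newest file per file ID under that assumed naming pattern. It is not HoodieROTablePathFilter and not the code in this patch; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not Hudi's HoodieROTablePathFilter.
class LatestHudiFileSketch {
  // Assumes names look like <fileId>_<writeToken>_<commitTime>.parquet.
  static String fileId(String name) {
    return name.substring(0, name.indexOf('_'));
  }

  // Assumes the commit time is a fixed-width timestamp, so that
  // lexicographic order matches chronological order.
  static String commitTime(String name) {
    int start = name.lastIndexOf('_') + 1;
    int end = name.lastIndexOf('.');
    return name.substring(start, end);
  }

  // Keeps, for each file ID, the name with the greatest commit time.
  static List<String> latestPerFileId(List<String> names) {
    Map<String, String> latest = new TreeMap<>();
    for (String name : names) {
      String id = fileId(name);
      String current = latest.get(id);
      if (current == null || commitTime(name).compareTo(commitTime(current)) > 0) {
        latest.put(id, name);
      }
    }
    return new ArrayList<>(latest.values());
  }

  public static void main(String[] args) {
    List<String> names = List.of(
        "75ca3e88-0a2c-4592-b1c5-4d4812c80388-0_0-38-281_20191218160804.parquet",
        "75ca3e88-0a2c-4592-b1c5-4d4812c80388-0_1-5-10_20191218160747.parquet");
    // Only the file with the later commit time should remain.
    System.out.println(latestPerFileId(names));
  }
}
```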


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14711/7

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 7
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 21: Code-Review+1

(1 comment)

One minor comment, otherwise LGTM

http://gerrit.cloudera.org:8080/#/c/14711/21/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/14711/21/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1561
PS21, Line 1561: (format == HdfsFileFormat.PARQUET) || (format == HdfsFileFormat.HUDI_PARQUET)
can be replaced by isParquetBased(format)?
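
A minimal sketch of the suggested refactoring, assuming a helper named isParquetBased that takes the format as its argument. The enum below is a hypothetical stand-in for HdfsFileFormat, and the real helper's location and signature may differ.

```java
// Hypothetical stand-in types; not the actual Impala classes.
class ParquetBasedSketch {
  enum Format { TEXT, PARQUET, HUDI_PARQUET }

  // Centralizes the "is this a Parquet-backed format?" check so that
  // callers do not repeat the two-way comparison.
  static boolean isParquetBased(Format format) {
    return format == Format.PARQUET || format == Format.HUDI_PARQUET;
  }

  public static void main(String[] args) {
    for (Format f : Format.values()) {
      System.out.println(f + " -> " + isParquetBased(f));
    }
  }
}
```

With such a helper, adding another Parquet-backed format later would only require changing one method.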



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 21
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 06 Feb 2020 20:14:33 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 2:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/5028/ : Initial code review checks failed. See linked job for details on the failure.


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 2
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 15 Nov 2019 19:40:12 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 23: Code-Review+2


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 23
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 11 Feb 2020 10:09:28 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 24: Code-Review+2


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 24
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 11 Feb 2020 10:15:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 16:

(2 comments)

Zoltan asked me to look at the data loading piece, and I have some opinions here. Hopefully it is not too difficult to move the data loading into our data loading framework.

http://gerrit.cloudera.org:8080/#/c/14711/16/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/14711/16/testdata/bin/create-load-data.sh@530
PS16, Line 530:   hadoop fs -rm -r /test-warehouse/hudicow
Loading data here is a bit of an anti-pattern because it prevents us from loading in parallel and also doesn't allow developers to load individual tables. E.g. I can load functional_parquet.customer_multiblock like this:

  ./bin/load-data.py -w functional-query -f --table_formats=parquet/none --table_names=customer_multiblock

It would be preferable to do this for this table.

I think it's similar to customer_multiblock: https://github.com/apache/impala/blob/master/testdata/datasets/functional/functional_schema_template.sql#L2455
https://github.com/apache/impala/blob/master/testdata/datasets/functional/schema_constraints.csv#L58

Except you might need to specify a custom create statement like: https://github.com/apache/impala/blob/master/testdata/datasets/functional/functional_schema_template.sql#L2555


http://gerrit.cloudera.org:8080/#/c/14711/16/testdata/data/hudicow/.hoodie/hoodie.properties
File testdata/data/hudicow/.hoodie/hoodie.properties:

http://gerrit.cloudera.org:8080/#/c/14711/16/testdata/data/hudicow/.hoodie/hoodie.properties@1
PS16, Line 1: Properties
> it would be good to have some more documentation on the format of this file
+1



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 16
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 21 Jan 2020 22:37:48 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 16:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14711/16/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/14711/16/be/src/exec/hdfs-scan-node-base.cc@379
PS16, Line 379: HUDI_PARQUET
> Agree with Zoltan. It would be nice if there were no necessary backend chan
The code I linked is from PS7 and I think it worked just fine. The current code sets the file format to HUDI_PARQUET in the Thrift structure, which is why it needs backend changes.

So I think we just need to restore that one line to reflect the state of PS7 and then we don't need changes in the be.
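
One way to read this suggestion, sketched under assumptions: the frontend keeps a Hudi-specific format for its own decisions but reports plain Parquet when serializing for the backend. The line from PS7 is not shown in this thread, so the enums and the toWire method below are hypothetical and only illustrate the idea.

```java
// Hypothetical stand-in types; not the actual Impala or Thrift classes.
class WireFormatSketch {
  // Formats the frontend distinguishes.
  enum FeFormat { TEXT, PARQUET, HUDI_PARQUET }

  // Formats the backend understands (no Hudi entry).
  enum WireFormat { TEXT, PARQUET }

  // Maps the frontend format to what is sent to the backend. The Hudi
  // table is scanned as ordinary Parquet, so the backend needs no new case.
  static WireFormat toWire(FeFormat format) {
    switch (format) {
      case TEXT: return WireFormat.TEXT;
      case PARQUET:
      case HUDI_PARQUET: return WireFormat.PARQUET;
      default: throw new IllegalArgumentException("Unknown format: " + format);
    }
  }

  public static void main(String[] args) {
    System.out.println(toWire(FeFormat.HUDI_PARQUET));
  }
}
```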



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 16
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 05 Feb 2020 17:15:42 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 21: Code-Review+1

(5 comments)

Found some minor issues, but other than that looks good to me.

Once the issues are fixed, and the change looks good to Sahil or Tim as well, I'm ready to give it a +2.

http://gerrit.cloudera.org:8080/#/c/14711/21/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/14711/21/testdata/data/README@487
PS21, Line 487: file footer.
file name?


http://gerrit.cloudera.org:8080/#/c/14711/21/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/14711/21/testdata/datasets/functional/functional_schema_template.sql@2763
PS21, Line 2763: S
You need to place a ';' at the end of all your statements, otherwise the generated SQL file will be syntactically incorrect when you want to load more tables, or the whole workload, e.g.:

 ./bin/load-data.py -w functional-query  --table_formats=parquet/none/none


http://gerrit.cloudera.org:8080/#/c/14711/21/testdata/datasets/functional/functional_schema_template.sql@2776
PS21, Line 2776: LOCATION '/test-warehouse/hudi_parquet'
Add ';'


http://gerrit.cloudera.org:8080/#/c/14711/21/testdata/datasets/functional/functional_schema_template.sql@2786
PS21, Line 2786: LOCATION '/test-warehouse/hudi_parquet'
Add ';'


http://gerrit.cloudera.org:8080/#/c/14711/21/testdata/workloads/functional-query/queries/QueryTest/hudi-parquet.test
File testdata/workloads/functional-query/queries/QueryTest/hudi-parquet.test:

http://gerrit.cloudera.org:8080/#/c/14711/21/testdata/workloads/functional-query/queries/QueryTest/hudi-parquet.test@71
PS21, Line 71: USE functional_parquet;
nit: either put it at the beginning of the file, so you don't need to spell out 'functional_parquet.' in each query, or use fully qualified table names in this query as well



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 21
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 06 Feb 2020 15:19:23 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 22: Code-Review+2

LGTM! Thanks for your great work and patience, Yanjia! Supporting Hudi is a really cool addition to Impala.


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 22
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 07 Feb 2020 11:23:23 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 16:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/14711/16/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/14711/16/be/src/exec/hdfs-scan-node-base.cc@379
PS16, Line 379: HUDI_PARQUET
> My logic was:
I see, but in the backend you just create "low-level" operators, such as scan nodes that need to process some input splits in file format X. So they don't need to be too smart; the planner will tell them what to do.

That said, I don't have too strong an opinion about it.


http://gerrit.cloudera.org:8080/#/c/14711/16/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
File fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java:

http://gerrit.cloudera.org:8080/#/c/14711/16/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java@197
PS16, Line 197: fileFormat_
> Done
If fileFormat_ is null, then the equality check will just return false, which is the expected behavior.
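
A small sketch of the point about null, assuming the check is a reference comparison (`==`) against an enum constant. The enum and method names here are hypothetical. Note that calling `.equals(...)` on the possibly-null field itself would instead throw a NullPointerException.

```java
// Hypothetical stand-in types; not the actual Impala classes.
class NullEnumCheckSketch {
  enum Format { PARQUET, HUDI_PARQUET }

  // Reference comparison on an enum: a null operand simply yields false,
  // so no explicit null check is needed.
  static boolean isHudi(Format fileFormat) {
    return fileFormat == Format.HUDI_PARQUET;
  }

  public static void main(String[] args) {
    System.out.println(isHudi(null));
    System.out.println(isHudi(Format.HUDI_PARQUET));
  }
}
```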


http://gerrit.cloudera.org:8080/#/c/14711/16/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/14711/16/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@354
PS16, Line 354: isParquet
> any suggestion? I couldn't come out a better name here
maybe 'isParquetBased'?


http://gerrit.cloudera.org:8080/#/c/14711/20/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/14711/20/testdata/data/README@482
PS20, Line 482: 
nit: if possible, please keep to the 90-character line length limit


http://gerrit.cloudera.org:8080/#/c/14711/20/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/14711/20/testdata/datasets/functional/functional_schema_template.sql@2762
PS20, Line 2762: 
               : 
               : 
               : 
               : 
Since you are using a custom CREATE statement you'll need to define the partitions in the CREATE TABLE stmt.


http://gerrit.cloudera.org:8080/#/c/14711/20/testdata/datasets/functional/schema_constraints.csv
File testdata/datasets/functional/schema_constraints.csv:

http://gerrit.cloudera.org:8080/#/c/14711/20/testdata/datasets/functional/schema_constraints.csv@59
PS20, Line 59: 
hudiparquet is not part of the test dimensions of the functional workload. Since most of the tests would fail with hudiparquet, we can cheat here and create the hudi table in the functional_parquet database, i.e. switch to parquet here.


http://gerrit.cloudera.org:8080/#/c/14711/20/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/14711/20/tests/query_test/test_scanners.py@313
PS20, Line 313: un_test_cas
TestHudiParquet


http://gerrit.cloudera.org:8080/#/c/14711/20/tests/query_test/test_scanners.py@320
PS20, Line 320: 
> Thank you all for reviewing. I am able to use
If in 'schema_constraints.csv' you switch to parquet then you can load the hudi tables with --table_formats=parquet/none/none. It's necessary because hudiparquet is not part of the test dimensions, so we'll just put the table in the functional_parquet database. You already only run this test when file_format == parquet.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 16
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 03 Feb 2020 15:39:05 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 24: Verified+1


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 24
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 11 Feb 2020 15:08:38 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 22: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/5301/


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 22
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 07 Feb 2020 16:15:33 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 13:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/14711/13/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
File fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java:

http://gerrit.cloudera.org:8080/#/c/14711/13/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@94
PS13, Line 94:     assertEquals("year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-38-282"
line too long (92 > 90)


http://gerrit.cloudera.org:8080/#/c/14711/13/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@97
PS13, Line 97:     assertEquals("year=2015/month=03/day=17/45c9fa97-e514-41e8-91d2-6098e5995cdb-0_0-38-281"
line too long (92 > 90)


http://gerrit.cloudera.org:8080/#/c/14711/13/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@100
PS13, Line 100:     assertEquals("year=2016/month=03/day=15/17dda230-e48a-4110-8c29-c613a3ac0b70-0_2-38-283"
line too long (92 > 90)



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 13
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 15 Jan 2020 06:15:42 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 11:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14711/11/tests/common/test_dimensions.py
File tests/common/test_dimensions.py:

http://gerrit.cloudera.org:8080/#/c/14711/11/tests/common/test_dimensions.py@32
PS11, Line 32:  
flake8: W291 trailing whitespace


http://gerrit.cloudera.org:8080/#/c/14711/11/tests/common/test_dimensions.py@32
PS11, Line 32:   KNOWN_FILE_FORMATS = ['text', 'seq', 'rc', 'parquet', 
line has trailing whitespace



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 11
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 14 Jan 2020 07:41:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 10:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5423/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 10
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 14 Jan 2020 07:55:23 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 10:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/14711/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14711/7//COMMIT_MSG@12
PS7, Line 12: Filte
> nit: Filter
Done


http://gerrit.cloudera.org:8080/#/c/14711/7//COMMIT_MSG@13
PS7, Line 13: 
> Could you please list the tests you did here?
Done


http://gerrit.cloudera.org:8080/#/c/14711/7/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
File fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java:

http://gerrit.cloudera.org:8080/#/c/14711/7/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java@88
PS7, Line 88: leDescript
> Could you update the comments above?
Done


http://gerrit.cloudera.org:8080/#/c/14711/7/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
File fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java:

http://gerrit.cloudera.org:8080/#/c/14711/7/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java@165
PS7, Line 165:       case HUDI_PARQUET: return HdfsFileFormat.HUDI_PARQUET;
> Maybe it would be better to add an empty HUDI_PARQUET case over the default
Done



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 10
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 14 Jan 2020 07:27:54 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has uploaded a new patch set (#23). ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................

IMPALA-8778: Support Apache Hudi Read Optimized Table

Hudi Read Optimized Table contains multiple versions of parquet files,
in order to load the table correctly, Impala needs to recognize Hudi Read
Optimized Table as a HdfsTable and load the latest version of the file
using HoodieROTablePathFilter.

Tests
 - Unit test for Hudi in FileMetadataLoader
 - Create table tests in functional_schema_template.sql
 - Query tests in hudi-parquet.test

Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
---
M be/src/service/query-options-test.cc
M bin/impala-config.sh
M bin/rat_exclude_files.txt
M common/thrift/CatalogObjects.thrift
M fe/pom.xml
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
A fe/src/main/java/org/apache/impala/util/HudiUtil.java
M fe/src/main/jflex/sql-scanner.flex
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
M impala-parent/pom.xml
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/hudi_parquet/.hoodie/20200210090610.clean
A testdata/data/hudi_parquet/.hoodie/20200210090610.clean.inflight
A testdata/data/hudi_parquet/.hoodie/20200210090610.clean.requested
A testdata/data/hudi_parquet/.hoodie/20200210090610.commit
A testdata/data/hudi_parquet/.hoodie/20200210090610.commit.requested
A testdata/data/hudi_parquet/.hoodie/20200210090610.inflight
A testdata/data/hudi_parquet/.hoodie/20200210090618.clean
A testdata/data/hudi_parquet/.hoodie/20200210090618.clean.inflight
A testdata/data/hudi_parquet/.hoodie/20200210090618.clean.requested
A testdata/data/hudi_parquet/.hoodie/20200210090618.commit
A testdata/data/hudi_parquet/.hoodie/20200210090618.commit.requested
A testdata/data/hudi_parquet/.hoodie/20200210090618.inflight
A testdata/data/hudi_parquet/.hoodie/hoodie.properties
A testdata/data/hudi_parquet/year=2015/month=03/day=16/.hoodie_partition_metadata
A testdata/data/hudi_parquet/year=2015/month=03/day=16/5f541af5-ca07-4329-ad8c-40fa9b353f35-0_1-70-118_20200210090610.parquet
A testdata/data/hudi_parquet/year=2015/month=03/day=16/5f541af5-ca07-4329-ad8c-40fa9b353f35-0_2-103-391_20200210090618.parquet
A testdata/data/hudi_parquet/year=2015/month=03/day=17/.hoodie_partition_metadata
A testdata/data/hudi_parquet/year=2015/month=03/day=17/675e035d-c146-4658-9404-fe590e296d80-0_0-103-389_20200210090618.parquet
A testdata/data/hudi_parquet/year=2015/month=03/day=17/675e035d-c146-4658-9404-fe590e296d80-0_0-70-117_20200210090610.parquet
A testdata/data/hudi_parquet/year=2016/month=03/day=15/.hoodie_partition_metadata
A testdata/data/hudi_parquet/year=2016/month=03/day=15/940359ee-cc79-4974-8a2a-5d133a81a3fd-0_1-103-390_20200210090618.parquet
A testdata/data/hudi_parquet/year=2016/month=03/day=15/940359ee-cc79-4974-8a2a-5d133a81a3fd-0_2-70-119_20200210090610.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/hudi-parquet.test
M testdata/workloads/functional-query/queries/QueryTest/set.test
M tests/query_test/test_scanners.py
44 files changed, 626 insertions(+), 41 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14711/23

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 23
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 22:

(3 comments)

The verify job failed. The upside is that we have the opportunity to lower the file sizes as Csaba suggested. I also added comments on the test failures.

http://gerrit.cloudera.org:8080/#/c/14711/22/be/src/service/query-options-test.cc
File be/src/service/query-options-test.cc:

http://gerrit.cloudera.org:8080/#/c/14711/22/be/src/service/query-options-test.cc@223
PS22, Line 223: )
you need to add HUDI_PARQUET here


http://gerrit.cloudera.org:8080/#/c/14711/22/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/14711/22/testdata/data/README@489
PS22, Line 489: `ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-5-10_20200112194517.parquet`
              : `ca51fa17-681b-4497-85b7-4f68e7a63ee7-0` is the bloom index hash of this file
              : `20200112194517` is the timestamp of this version
> I would prefer not to add such large (~0.5MB) files to the .git repo if pos
I agree with Csaba, and it seems we can easily make the file sizes smaller. I'm not sure if we can disable bloom filter writing based on Hudi's source, but the following options should definitely lower the size of the bloom filter:

From https://github.com/apache/incubator-hudi/blob/c1516df8ac55757ebd07d8aa459a0ceedeccab7b/hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java#L41-L44

 hoodie.index.bloom.num_entries
 hoodie.index.bloom.fpp
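
To see why these two options should shrink the files, here is a small Python sketch based on the textbook Bloom filter sizing formula, m = -n * ln(p) / (ln 2)^2 bits for n entries and false-positive probability p. Whether Hudi sizes its filter exactly this way is an assumption that has not been checked against its source, and the option values below are purely illustrative.

```python
import math


def bloom_filter_bits(num_entries, fpp):
    """Textbook optimal size in bits of a Bloom filter: -n * ln(p) / (ln 2)^2."""
    return math.ceil(-num_entries * math.log(fpp) / math.log(2) ** 2)


# Illustrative values only (assumptions, not taken from this change or from Hudi).
small_fixture_options = {
    "hoodie.index.bloom.num_entries": 100,
    "hoodie.index.bloom.fpp": 0.1,
}

small_bits = bloom_filter_bits(
    small_fixture_options["hoodie.index.bloom.num_entries"],
    small_fixture_options["hoodie.index.bloom.fpp"],
)
```

Under this formula the size grows linearly with the number of entries and with the logarithm of 1/fpp, so fewer entries and a looser false-positive rate both reduce it.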


http://gerrit.cloudera.org:8080/#/c/14711/22/testdata/workloads/functional-query/queries/QueryTest/set.test
File testdata/workloads/functional-query/queries/QueryTest/set.test:

http://gerrit.cloudera.org:8080/#/c/14711/22/testdata/workloads/functional-query/queries/QueryTest/set.test@145
PS22, Line 145: .
Need to add HUDI_PARQUET(7)




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 22
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 07 Feb 2020 16:35:07 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has uploaded a new patch set (#11). ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................

IMPALA-8778: Support Apache Hudi Read Optimized Table

Hudi Read Optimized Table contains multiple versions of parquet files.
In order to load the correct table, Impala needs to recognize Hudi Read
Optimized Table as an HdfsTable and load only the latest version of each
file using HoodieROTablePathFilter.

Tests
 - Unit test for Hudi in FileMetadataLoader
 - Query tests in create-table.test
 - Query tests in hudiparquet.test

Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node.cc
M bin/impala-config.sh
M bin/rat_exclude_files.txt
M common/thrift/CatalogObjects.thrift
M fe/pom.xml
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
A fe/src/main/java/org/apache/impala/util/HudiUtil.java
M fe/src/main/jflex/sql-scanner.flex
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
M impala-parent/pom.xml
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/hudicow/.hoodie/20200112194517.clean
A testdata/data/hudicow/.hoodie/20200112194517.clean.inflight
A testdata/data/hudicow/.hoodie/20200112194517.clean.requested
A testdata/data/hudicow/.hoodie/20200112194517.commit
A testdata/data/hudicow/.hoodie/20200112194517.commit.requested
A testdata/data/hudicow/.hoodie/20200112194517.inflight
A testdata/data/hudicow/.hoodie/20200112194529.clean
A testdata/data/hudicow/.hoodie/20200112194529.clean.inflight
A testdata/data/hudicow/.hoodie/20200112194529.clean.requested
A testdata/data/hudicow/.hoodie/20200112194529.commit
A testdata/data/hudicow/.hoodie/20200112194529.commit.requested
A testdata/data/hudicow/.hoodie/20200112194529.inflight
A testdata/data/hudicow/.hoodie/hoodie.properties
A testdata/data/hudicow/year=2015/month=03/day=16/.hoodie_partition_metadata
A testdata/data/hudicow/year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-38-282_20200112194529.parquet
A testdata/data/hudicow/year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-5-10_20200112194517.parquet
A testdata/data/hudicow/year=2015/month=03/day=17/.hoodie_partition_metadata
A testdata/data/hudicow/year=2015/month=03/day=17/45c9fa97-e514-41e8-91d2-6098e5995cdb-0_0-38-281_20200112194529.parquet
A testdata/data/hudicow/year=2015/month=03/day=17/45c9fa97-e514-41e8-91d2-6098e5995cdb-0_0-5-9_20200112194517.parquet
A testdata/data/hudicow/year=2016/month=03/day=15/.hoodie_partition_metadata
A testdata/data/hudicow/year=2016/month=03/day=15/17dda230-e48a-4110-8c29-c613a3ac0b70-0_2-38-283_20200112194529.parquet
A testdata/data/hudicow/year=2016/month=03/day=15/17dda230-e48a-4110-8c29-c613a3ac0b70-0_2-5-11_20200112194517.parquet
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M testdata/workloads/functional-query/queries/QueryTest/create-table.test
A testdata/workloads/functional-query/queries/QueryTest/hudiparquet.test
M tests/common/test_dimensions.py
M tests/query_test/test_scanners.py
50 files changed, 582 insertions(+), 48 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14711/11

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 11
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 7:

(1 comment)

Thanks for applying the changes. You could also add some scanner tests that create a Hudi table and issue a few queries against it.

To add those tests please look at the 'test_scanners.py' file and the .test files in the QueryTest directory.

http://gerrit.cloudera.org:8080/#/c/14711/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14711/7//COMMIT_MSG@12
PS7, Line 12: Filte
nit: Filter




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 7
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 10 Jan 2020 15:17:56 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 7:

(3 comments)

Hi! Thanks for working on this. I left some comments.

http://gerrit.cloudera.org:8080/#/c/14711/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14711/7//COMMIT_MSG@13
PS7, Line 13: 
Could you please list the tests you did here?


http://gerrit.cloudera.org:8080/#/c/14711/7/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
File fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java:

http://gerrit.cloudera.org:8080/#/c/14711/7/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java@88
PS7, Line 88: fileFormat
Could you update the comments above?


http://gerrit.cloudera.org:8080/#/c/14711/7/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
File fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java:

http://gerrit.cloudera.org:8080/#/c/14711/7/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java@165
PS7, Line 165:       // HUDI_PARQUET doesn't support converting from thrift
Maybe it would be better to add an empty HUDI_PARQUET case above the default case, just to have it listed here.




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 7
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 09 Jan 2020 17:18:32 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 5:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/5322/ : Initial code review checks failed. See linked job for details on the failure.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 5
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 20 Dec 2019 02:42:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 16:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14711/16/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/14711/16/be/src/exec/hdfs-scan-node-base.cc@379
PS16, Line 379: HUDI_PARQUET
It would be nice (although maybe as a follow-up JIRA) if none of the changes to the be/ code were necessary. It should be possible for all the logic to live only in the fe/ code since, as far as the be/ code is concerned, it is just reading a regular Parquet file. The fe/ code does the proper filtering of the Hudi directory to determine which files to read, and then the be/ is just responsible for reading them. The be/ doesn't really need to be aware that Hudi exists.

One way to do this might be to change the THdfsFileFormat passed to the be/ code from HUDI_PARQUET to just PARQUET. This can probably be done in the fe/ after resolving which Parquet files need to be read.

I'm not 100% sure how this would work, but it would definitely be nice to have and would IMO make the code cleaner. I won't hold up merging the patch on this change though.
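
A minimal Python sketch of the idea, only to make the suggestion concrete: the `FileFormat` enum and `backend_format` function are hypothetical stand-ins, not Impala's actual THdfsFileFormat or any existing fe/ method, and the real change would live in the Java frontend.

```python
from enum import Enum


class FileFormat(Enum):
    # Hypothetical stand-in for a subset of THdfsFileFormat.
    TEXT = "TEXT"
    PARQUET = "PARQUET"
    HUDI_PARQUET = "HUDI_PARQUET"


def backend_format(fmt):
    """Format the frontend would report to the backend.

    Assumes the frontend has already resolved which Parquet files to read,
    so the backend can treat a Hudi table as plain Parquet.
    """
    return FileFormat.PARQUET if fmt is FileFormat.HUDI_PARQUET else fmt
```

With a mapping like this the backend would never see HUDI_PARQUET, so checks such as `file_format == PARQUET` there would need no Hudi-specific branch.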




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 16
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 22 Jan 2020 17:29:47 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has uploaded a new patch set (#22). ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................

IMPALA-8778: Support Apache Hudi Read Optimized Table

Hudi Read Optimized Table contains multiple versions of parquet files.
In order to load the table correctly, Impala needs to recognize Hudi Read
Optimized Table as an HdfsTable and load only the latest version of each
file using HoodieROTablePathFilter.

Tests
 - Unit test for Hudi in FileMetadataLoader
 - Create table tests in functional_schema_template.sql
 - Query tests in hudi-parquet.test

Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
---
M bin/impala-config.sh
M bin/rat_exclude_files.txt
M common/thrift/CatalogObjects.thrift
M fe/pom.xml
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
A fe/src/main/java/org/apache/impala/util/HudiUtil.java
M fe/src/main/jflex/sql-scanner.flex
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
M impala-parent/pom.xml
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/hudi_parquet/.hoodie/20200112194517.clean
A testdata/data/hudi_parquet/.hoodie/20200112194517.clean.inflight
A testdata/data/hudi_parquet/.hoodie/20200112194517.clean.requested
A testdata/data/hudi_parquet/.hoodie/20200112194517.commit
A testdata/data/hudi_parquet/.hoodie/20200112194517.commit.requested
A testdata/data/hudi_parquet/.hoodie/20200112194517.inflight
A testdata/data/hudi_parquet/.hoodie/20200112194529.clean
A testdata/data/hudi_parquet/.hoodie/20200112194529.clean.inflight
A testdata/data/hudi_parquet/.hoodie/20200112194529.clean.requested
A testdata/data/hudi_parquet/.hoodie/20200112194529.commit
A testdata/data/hudi_parquet/.hoodie/20200112194529.commit.requested
A testdata/data/hudi_parquet/.hoodie/20200112194529.inflight
A testdata/data/hudi_parquet/.hoodie/hoodie.properties
A testdata/data/hudi_parquet/year=2015/month=03/day=16/.hoodie_partition_metadata
A testdata/data/hudi_parquet/year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-38-282_20200112194529.parquet
A testdata/data/hudi_parquet/year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-5-10_20200112194517.parquet
A testdata/data/hudi_parquet/year=2015/month=03/day=17/.hoodie_partition_metadata
A testdata/data/hudi_parquet/year=2015/month=03/day=17/45c9fa97-e514-41e8-91d2-6098e5995cdb-0_0-38-281_20200112194529.parquet
A testdata/data/hudi_parquet/year=2015/month=03/day=17/45c9fa97-e514-41e8-91d2-6098e5995cdb-0_0-5-9_20200112194517.parquet
A testdata/data/hudi_parquet/year=2016/month=03/day=15/.hoodie_partition_metadata
A testdata/data/hudi_parquet/year=2016/month=03/day=15/17dda230-e48a-4110-8c29-c613a3ac0b70-0_2-38-283_20200112194529.parquet
A testdata/data/hudi_parquet/year=2016/month=03/day=15/17dda230-e48a-4110-8c29-c613a3ac0b70-0_2-5-11_20200112194517.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/hudi-parquet.test
M tests/query_test/test_scanners.py
42 files changed, 624 insertions(+), 39 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14711/22

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 22
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 16:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5477/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 16
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 21 Jan 2020 08:33:36 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has uploaded a new patch set (#20). ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................

IMPALA-8778: Support Apache Hudi Read Optimized Table

Hudi Read Optimized Table contains multiple versions of parquet files.
In order to load the table correctly, Impala needs to recognize Hudi Read
Optimized Table as an HdfsTable and load only the latest version of each
file using HoodieROTablePathFilter.

Tests
 - Unit test for Hudi in FileMetadataLoader
 - Query tests in create-table.test
 - Query tests in hudiparquet.test

Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node.cc
M bin/impala-config.sh
M bin/rat_exclude_files.txt
M common/thrift/CatalogObjects.thrift
M fe/pom.xml
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
A fe/src/main/java/org/apache/impala/util/HudiUtil.java
M fe/src/main/jflex/sql-scanner.flex
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
M impala-parent/pom.xml
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/hudi_parquet/.hoodie/20200112194517.clean
A testdata/data/hudi_parquet/.hoodie/20200112194517.clean.inflight
A testdata/data/hudi_parquet/.hoodie/20200112194517.clean.requested
A testdata/data/hudi_parquet/.hoodie/20200112194517.commit
A testdata/data/hudi_parquet/.hoodie/20200112194517.commit.requested
A testdata/data/hudi_parquet/.hoodie/20200112194517.inflight
A testdata/data/hudi_parquet/.hoodie/20200112194529.clean
A testdata/data/hudi_parquet/.hoodie/20200112194529.clean.inflight
A testdata/data/hudi_parquet/.hoodie/20200112194529.clean.requested
A testdata/data/hudi_parquet/.hoodie/20200112194529.commit
A testdata/data/hudi_parquet/.hoodie/20200112194529.commit.requested
A testdata/data/hudi_parquet/.hoodie/20200112194529.inflight
A testdata/data/hudi_parquet/.hoodie/hoodie.properties
A testdata/data/hudi_parquet/year=2015/month=03/day=16/.hoodie_partition_metadata
A testdata/data/hudi_parquet/year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-38-282_20200112194529.parquet
A testdata/data/hudi_parquet/year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-5-10_20200112194517.parquet
A testdata/data/hudi_parquet/year=2015/month=03/day=17/.hoodie_partition_metadata
A testdata/data/hudi_parquet/year=2015/month=03/day=17/45c9fa97-e514-41e8-91d2-6098e5995cdb-0_0-38-281_20200112194529.parquet
A testdata/data/hudi_parquet/year=2015/month=03/day=17/45c9fa97-e514-41e8-91d2-6098e5995cdb-0_0-5-9_20200112194517.parquet
A testdata/data/hudi_parquet/year=2016/month=03/day=15/.hoodie_partition_metadata
A testdata/data/hudi_parquet/year=2016/month=03/day=15/17dda230-e48a-4110-8c29-c613a3ac0b70-0_2-38-283_20200112194529.parquet
A testdata/data/hudi_parquet/year=2016/month=03/day=15/17dda230-e48a-4110-8c29-c613a3ac0b70-0_2-5-11_20200112194517.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/hudi-parquet.test
M tests/common/test_dimensions.py
M tests/query_test/test_scanners.py
45 files changed, 632 insertions(+), 43 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14711/20

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 20
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 20:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14711/20/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/14711/20/testdata/data/README@485
PS20, Line 485:  
nit: whitespace




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 20
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 05 Feb 2020 16:38:32 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 14:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5434/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 14
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 15 Jan 2020 06:50:10 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 10:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/14711/10/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/14711/10/be/src/exec/hdfs-scan-node-base.cc@379
PS10, Line 379:   for (int format = THdfsFileFormat::TEXT; format <= THdfsFileFormat::HUDI_PARQUET; ++format) {
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/14711/10/be/src/exec/hdfs-scan-node-base.cc@1010
PS10, Line 1010:           if (file_format == THdfsFileFormat::PARQUET || file_format == THdfsFileFormat::HUDI_PARQUET) {
line too long (104 > 90)


http://gerrit.cloudera.org:8080/#/c/14711/10/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

http://gerrit.cloudera.org:8080/#/c/14711/10/be/src/exec/hdfs-scan-node.cc@525
PS10, Line 525:       if (partition->file_format() != THdfsFileFormat::PARQUET && partition->file_format() != THdfsFileFormat::HUDI_PARQUET) {
line too long (126 > 90)


http://gerrit.cloudera.org:8080/#/c/14711/10/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/14711/10/testdata/bin/generate-schema-statements.py@303
PS10, Line 303: w
flake8: E501 line too long (98 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/14711/10/tests/common/test_dimensions.py
File tests/common/test_dimensions.py:

http://gerrit.cloudera.org:8080/#/c/14711/10/tests/common/test_dimensions.py@32
PS10, Line 32: s
flake8: E501 line too long (94 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/14711/10/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/14711/10/tests/query_test/test_scanners.py@304
PS10, Line 304: class TestHudiParquet(ImpalaTestSuite):
flake8: E302 expected 2 blank lines, found 1



-- 
To view, visit http://gerrit.cloudera.org:8080/14711
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 10
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 14 Jan 2020 07:26:00 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has uploaded a new patch set (#10). ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................

IMPALA-8778: Support Apache Hudi Read Optimized Table

Hudi Read Optimized Table contains multiple versions of parquet files,
in order to load the correct table, Impala needs to recognize Hudi Read
Optimized Table as a HdfsTable and load the latest version of the file
using HoodieROTablePathFilter.

Tests
 - Unit test for Hudi in FileMetadataLoader
 - Query tests in create-table.test
 - Query tests in hudiparquet.test

Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node.cc
M bin/impala-config.sh
M bin/rat_exclude_files.txt
M common/thrift/CatalogObjects.thrift
M fe/pom.xml
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
A fe/src/main/java/org/apache/impala/util/HudiUtil.java
M fe/src/main/jflex/sql-scanner.flex
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
M impala-parent/pom.xml
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
A testdata/data/hudicow/.hoodie/20200112194517.clean
A testdata/data/hudicow/.hoodie/20200112194517.clean.inflight
A testdata/data/hudicow/.hoodie/20200112194517.clean.requested
A testdata/data/hudicow/.hoodie/20200112194517.commit
A testdata/data/hudicow/.hoodie/20200112194517.commit.requested
A testdata/data/hudicow/.hoodie/20200112194517.inflight
A testdata/data/hudicow/.hoodie/20200112194529.clean
A testdata/data/hudicow/.hoodie/20200112194529.clean.inflight
A testdata/data/hudicow/.hoodie/20200112194529.clean.requested
A testdata/data/hudicow/.hoodie/20200112194529.commit
A testdata/data/hudicow/.hoodie/20200112194529.commit.requested
A testdata/data/hudicow/.hoodie/20200112194529.inflight
A testdata/data/hudicow/.hoodie/hoodie.properties
A testdata/data/hudicow/year=2015/month=03/day=16/.hoodie_partition_metadata
A testdata/data/hudicow/year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-38-282_20200112194529.parquet
A testdata/data/hudicow/year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-5-10_20200112194517.parquet
A testdata/data/hudicow/year=2015/month=03/day=17/.hoodie_partition_metadata
A testdata/data/hudicow/year=2015/month=03/day=17/45c9fa97-e514-41e8-91d2-6098e5995cdb-0_0-38-281_20200112194529.parquet
A testdata/data/hudicow/year=2015/month=03/day=17/45c9fa97-e514-41e8-91d2-6098e5995cdb-0_0-5-9_20200112194517.parquet
A testdata/data/hudicow/year=2016/month=03/day=15/.hoodie_partition_metadata
A testdata/data/hudicow/year=2016/month=03/day=15/17dda230-e48a-4110-8c29-c613a3ac0b70-0_2-38-283_20200112194529.parquet
A testdata/data/hudicow/year=2016/month=03/day=15/17dda230-e48a-4110-8c29-c613a3ac0b70-0_2-5-11_20200112194517.parquet
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M testdata/workloads/functional-query/queries/QueryTest/create-table.test
A testdata/workloads/functional-query/queries/QueryTest/hudiparquet.test
M tests/common/test_dimensions.py
M tests/query_test/test_scanners.py
50 files changed, 576 insertions(+), 48 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14711/10
-- 
To view, visit http://gerrit.cloudera.org:8080/14711
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 10
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 22:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14711/22/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/14711/22/testdata/data/README@489
PS22, Line 489: `ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-5-10_20200112194517.parquet`
              : `ca51fa17-681b-4497-85b7-4f68e7a63ee7-0` is the bloom index hash of this file
              : `20200112194517` is the timestamp of this version
> I agree with Csaba, and it seems we can easily make the file sizes smaller.
Thanks for pointing this out.
I definitely agree here. Those Parquet files were generated by a test in Hudi, and bloom.num_entries was left at its default of 60000. I am not familiar with the indexing part of Hudi's code, so I am not sure whether this uses any built-in bloom filter feature of Parquet. However, reducing this number to 100 brings each Parquet file down to ~10KB. If this size is acceptable, I will update those files.
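
For readers unfamiliar with the file naming discussed in the README excerpt above, here is a rough Python sketch of how the "latest version" of each file could be picked from names alone. It is only an illustration: it assumes the three underscore-separated segments seen in the test data (the leading identifier, an unexplained middle segment, and a fixed-width timestamp), and the helper names are made up. It does not reproduce HoodieROTablePathFilter, which may also take the .hoodie commit metadata into account.

```python
def split_hudi_file_name(name):
    """Split '<id>_<middle>_<timestamp>.parquet' into (id, middle, timestamp).

    The three-segment layout is an assumption based on the test data names.
    """
    stem, dot, ext = name.rpartition(".")
    if not dot or ext != "parquet":
        raise ValueError("not a parquet file name: %r" % name)
    parts = stem.split("_")
    if len(parts) != 3:
        raise ValueError("unexpected file name layout: %r" % name)
    return tuple(parts)


def latest_versions(names):
    """Keep, per leading identifier, only the name with the newest timestamp.

    Timestamps are compared as strings, which is only valid because they are
    assumed to be fixed-width digit strings such as '20200112194517'.
    """
    newest = {}
    for name in names:
        file_id, _middle, timestamp = split_hudi_file_name(name)
        best = newest.get(file_id)
        if best is None or timestamp > best[0]:
            newest[file_id] = (timestamp, name)
    return sorted(name for _timestamp, name in newest.values())
```

Given the two `ca51fa17-...` files from the test data, `latest_versions` would keep only the one ending in `20200112194529.parquet`.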



-- 
To view, visit http://gerrit.cloudera.org:8080/14711
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 22
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 07 Feb 2020 21:48:56 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 16:

(5 comments)

Thanks for applying the changes!

http://gerrit.cloudera.org:8080/#/c/14711/16//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14711/16//COMMIT_MSG@10
PS16, Line 10: load the correct table
nit: load the correct files / load the table correctly


http://gerrit.cloudera.org:8080/#/c/14711/16/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/14711/16/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@346
PS16, Line 346: isParquetBased
nit: maybe 'hasParquet()' is a better name


http://gerrit.cloudera.org:8080/#/c/14711/16/testdata/bin/create-hudi.sql
File testdata/bin/create-hudi.sql:

http://gerrit.cloudera.org:8080/#/c/14711/16/testdata/bin/create-hudi.sql@17
PS16, Line 17: DROP TABLE IF EXISTS hudicow_partitioned;
Please add 'USE functional' to load it into the functional database.


http://gerrit.cloudera.org:8080/#/c/14711/16/testdata/workloads/functional-query/queries/QueryTest/hudiparquet.test
File testdata/workloads/functional-query/queries/QueryTest/hudiparquet.test:

http://gerrit.cloudera.org:8080/#/c/14711/16/testdata/workloads/functional-query/queries/QueryTest/hudiparquet.test@4
PS16, Line 4: default
Please switch to 'functional'.


http://gerrit.cloudera.org:8080/#/c/14711/16/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/14711/16/tests/query_test/test_scanners.py@306
PS16, Line 306: class TestHudiParquet(ImpalaTestSuite):
Currently these tests will be invoked once for every file format, even though they only ever use Hudi Parquet. Hudi Parquet is not part of the 'file_format' test dimension, but you could add a constraint that selects one arbitrary file format so that the tests don't run multiple times. E.g. you could add something like this:

  @classmethod
  def add_test_dimensions(cls):
    super(TestHudiParquet, cls).add_test_dimensions()
    cls.ImpalaTestMatrix.add_constraint(
      lambda v: v.get_value('table_format').file_format == 'parquet')
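
To make the effect of such a constraint concrete, here is a small self-contained sketch. ToyTableFormat, ToyVector and ToyMatrix are stand-ins invented for this illustration, not Impala's test framework classes; they only mimic the get_value/add_constraint shape used in the snippet above.

```python
from collections import namedtuple
from itertools import product

# Hypothetical stand-in for a table format entry in the test dimension.
ToyTableFormat = namedtuple("ToyTableFormat", ["file_format"])


class ToyVector(object):
    """One combination of dimension values."""

    def __init__(self, values):
        self._values = dict(values)

    def get_value(self, name):
        return self._values[name]


class ToyMatrix(object):
    """Builds every combination of dimension values, then applies constraints."""

    def __init__(self):
        self._dimensions = {}
        self._constraints = []

    def add_dimension(self, name, values):
        self._dimensions[name] = list(values)

    def add_constraint(self, predicate):
        self._constraints.append(predicate)

    def generate(self):
        names = sorted(self._dimensions)
        combos = product(*(self._dimensions[n] for n in names))
        vectors = [ToyVector(zip(names, combo)) for combo in combos]
        # A vector survives only if every constraint accepts it.
        return [v for v in vectors if all(c(v) for c in self._constraints)]
```

With a 'table_format' dimension of text, parquet and orc, adding the constraint from the snippet leaves a single vector, so a test body that ignores the dimension would run once instead of three times.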



-- 
To view, visit http://gerrit.cloudera.org:8080/14711
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 16
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 21 Jan 2020 16:29:59 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 23:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5189/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14711
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 23
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 10 Feb 2020 19:06:58 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 21:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5136/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14711
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 21
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 06 Feb 2020 02:35:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 16:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14711/16/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/14711/16/be/src/exec/hdfs-scan-node-base.cc@379
PS16, Line 379: HUDI_PARQUET
> I definitely agree with you that not changing anything on the backend would
Unfortunately, there are some misnomers in Impala. By "frontend" we mean the parser and planner, which are written in Java. By "backend" we mean the code responsible for the actual query execution (scanning, joining, aggregating, etc.); these parts are written in C++.

After the Hudi filtering is done, we can tell the backend that it just needs to scan Parquet files. You already did this here https://gerrit.cloudera.org/#/c/14711/7/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java in line 180. I think you only need to restore that one line, and the "backend" will work just fine thinking it's scanning Parquet.
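
A rough sketch of the idea, written in Python for brevity: the catalog keeps a distinct Hudi format so that file listing can be filtered, but the format handed to the execution side collapses to plain Parquet. CatalogFormat, WireFormat and to_wire are hypothetical names for this illustration; the real code is the Java HdfsFileFormat enum and the Thrift THdfsFileFormat, and the actual line referenced above is not reproduced here.

```python
from enum import Enum


class WireFormat(Enum):
    """Stand-in for the format enum the execution side understands."""
    TEXT = "TEXT"
    PARQUET = "PARQUET"


class CatalogFormat(Enum):
    """Stand-in for the catalog-side format, which also knows about Hudi."""
    TEXT = "TEXT"
    PARQUET = "PARQUET"
    HUDI_PARQUET = "HUDI_PARQUET"

    def to_wire(self):
        # Once file listing has been filtered, a Hudi table is just a set of
        # Parquet files, so the execution side never needs a Hudi-specific value.
        if self is CatalogFormat.HUDI_PARQUET:
            return WireFormat.PARQUET
        return WireFormat[self.name]
```

With a mapping of this kind, checks on the execution side such as `file_format == PARQUET` would not need an extra HUDI_PARQUET branch.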



-- 
To view, visit http://gerrit.cloudera.org:8080/14711
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 16
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 27 Jan 2020 16:31:19 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 22:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14711/22/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/14711/22/testdata/data/README@489
PS22, Line 489: `ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-5-10_20200112194517.parquet`
              : `ca51fa17-681b-4497-85b7-4f68e7a63ee7-0` is the bloom index hash of this file
              : `20200112194517` is the timestamp of this version
> Thanks for pointing this out.
10KB seems fine to me; there are a lot of files in the repo around that size.



-- 
To view, visit http://gerrit.cloudera.org:8080/14711
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 22
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 10 Feb 2020 12:34:23 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 3:

Thanks, Yanjia. I think that makes sense, so I recommend that you continue with the implementation and testing.


-- 
To view, visit http://gerrit.cloudera.org:8080/14711
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 3
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 19 Nov 2019 16:04:42 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 4:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/5321/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/14711
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 4
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 20 Dec 2019 02:25:32 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 3:

(1 comment)

Thanks for working on this. So currently you are creating Hudi tables from Hive and only reading them from Impala?
The code makes sense to me, but of course, needs testing.

http://gerrit.cloudera.org:8080/#/c/14711/2/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
File fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java:

http://gerrit.cloudera.org:8080/#/c/14711/2/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java@106
PS2, Line 106:       ListMap<TNetworkAddress> hostIndex, @Nullable ValidTxnList validTxnList,
             :       @Nullable ValidWriteIdList writeIds, @Nullable HdfsFileFormat fileFormat) {
             :     Preconditions.checkState((validTxnList == null && writeIds == null)
             :         || (validTxnList != null && writeIds != null));
             :     partDir_ = Preconditions.checkNotNull(partDir);
             :     recursive_ = recursive;
             :     hostIndex_ = Preconditions.checkNotNull(hostIndex);
             :     oldFdsByRelPath_ = Maps.uniqueIndex(oldFds, FileDescriptor::getRelativePath);
             :     writeIds_ = writeIds;
             :     validTxnList_ = validTxnList;
             :     fileFormat_ = fileFormat;
             : 
nit: instead of duplicating the code, you could make one of your constructors call the other



-- 
To view, visit http://gerrit.cloudera.org:8080/14711
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 3
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 15 Nov 2019 22:39:33 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 14:

(4 comments)

Thanks for reviewing! I still have some questions left. Once everything is clear I will address all the comments.

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java@68
PS14, Line 68: THdfsFileFormat.HUDI_PARQUET
> I think Impala should not support inserting into HUDI_PARQUET because curre
Thanks for reviewing! I put HUDI_PARQUET here because the comment above says "The statement supports an optional PARTITIONED BY clause." So my question is: if we don't support HUDI_PARQUET here, does that mean we are not able to partition a HUDI_PARQUET table?


http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java@74
PS14, Line 74: HUDI_PARQUET
> We can only create a table from a Parquet file, it doesn't matter if the fi
Makes sense. A statement like "CREATE TABLE LIKE PARQUET xxx STORED AS HUDIPARQUET LOCATION xxx" would work.


http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/jflex/sql-scanner.flex
File fe/src/main/jflex/sql-scanner.flex:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/jflex/sql-scanner.flex@149
PS14, Line 149: hudiparquet
> nit: I'm in favor of "hudi_parquet" because it's more readable.
I am following the naming convention of SEQUENCEFILE.
In HdfsFileFormat it is SEQUENCE_FILE, but in SQL it is SEQUENCEFILE. The same goes for RCFILE. Does it make sense to keep HUDIPARQUET here?


http://gerrit.cloudera.org:8080/#/c/14711/14/tests/common/test_dimensions.py
File tests/common/test_dimensions.py:

http://gerrit.cloudera.org:8080/#/c/14711/14/tests/common/test_dimensions.py@33
PS14, Line 33: hudiparquet
> We cannot run these tests on Hudi tables because we cannot load the workloa
I am not quite sure I understand the .test execution sequence correctly.
I first looked into test_scanners.py and saw that TestParquet uses the 'functional-query' workload, so I tried to do the same for HUDIPARQUET.
Then I created a class TestHudiParquet to execute the queries under QueryTest/hudiparquet, in the same way that QueryTest/parquet gets executed.
If my understanding is wrong here, would you point me in the right direction?



-- 
To view, visit http://gerrit.cloudera.org:8080/14711
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 14
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 15 Jan 2020 20:49:53 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/14711 )

Change subject: [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................

[WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table

Hudi Read Optimized Table contains multiple versions of parquet files,
in order to load the correct table, Impala needs to recognize Hudi Read
Optimized Table as a HdfsTable and load the latest version of the file
using HoodieROTablePathFilter.

Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
---
M fe/pom.xml
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
A fe/src/main/java/org/apache/impala/util/HudiUtil.java
5 files changed, 92 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14711/3
-- 
To view, visit http://gerrit.cloudera.org:8080/14711
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 3
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 4:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/14711/4/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
File fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java:

http://gerrit.cloudera.org:8080/#/c/14711/4/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@95
PS4, Line 95:         "2015/03/16/75ca3e88-0a2c-4592-b1c5-4d4812c80388-0_0-38-281_20191218160804.parquet",
line too long (92 > 90)


http://gerrit.cloudera.org:8080/#/c/14711/4/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@98
PS4, Line 98:         "2015/03/17/ea972399-ebd0-4447-a01b-6356fd76e43f-0_2-38-283_20191218160804.parquet",
line too long (92 > 90)


http://gerrit.cloudera.org:8080/#/c/14711/4/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@101
PS4, Line 101:         "2016/03/15/e867a156-5330-4a05-a188-ec39f131b1d0-0_1-38-282_20191218160804.parquet",
line too long (92 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/14711
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 4
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 20 Dec 2019 01:55:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has uploaded a new patch set (#16). ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................

IMPALA-8778: Support Apache Hudi Read Optimized Table

Hudi Read Optimized Table contains multiple versions of parquet files,
in order to load the correct table, Impala needs to recognize Hudi Read
Optimized Table as a HdfsTable and load the latest version of the file
using HoodieROTablePathFilter.

Tests
 - Unit test for Hudi in FileMetadataLoader
 - Query tests in create-table.test
 - Query tests in hudiparquet.test

Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node.cc
M bin/impala-config.sh
M bin/rat_exclude_files.txt
M common/thrift/CatalogObjects.thrift
M fe/pom.xml
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
A fe/src/main/java/org/apache/impala/util/HudiUtil.java
M fe/src/main/jflex/sql-scanner.flex
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
M impala-parent/pom.xml
A testdata/bin/create-hudi.sql
M testdata/bin/create-load-data.sh
A testdata/data/hudicow/.hoodie/20200112194517.clean
A testdata/data/hudicow/.hoodie/20200112194517.clean.inflight
A testdata/data/hudicow/.hoodie/20200112194517.clean.requested
A testdata/data/hudicow/.hoodie/20200112194517.commit
A testdata/data/hudicow/.hoodie/20200112194517.commit.requested
A testdata/data/hudicow/.hoodie/20200112194517.inflight
A testdata/data/hudicow/.hoodie/20200112194529.clean
A testdata/data/hudicow/.hoodie/20200112194529.clean.inflight
A testdata/data/hudicow/.hoodie/20200112194529.clean.requested
A testdata/data/hudicow/.hoodie/20200112194529.commit
A testdata/data/hudicow/.hoodie/20200112194529.commit.requested
A testdata/data/hudicow/.hoodie/20200112194529.inflight
A testdata/data/hudicow/.hoodie/hoodie.properties
A testdata/data/hudicow/year=2015/month=03/day=16/.hoodie_partition_metadata
A testdata/data/hudicow/year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-38-282_20200112194529.parquet
A testdata/data/hudicow/year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-5-10_20200112194517.parquet
A testdata/data/hudicow/year=2015/month=03/day=17/.hoodie_partition_metadata
A testdata/data/hudicow/year=2015/month=03/day=17/45c9fa97-e514-41e8-91d2-6098e5995cdb-0_0-38-281_20200112194529.parquet
A testdata/data/hudicow/year=2015/month=03/day=17/45c9fa97-e514-41e8-91d2-6098e5995cdb-0_0-5-9_20200112194517.parquet
A testdata/data/hudicow/year=2016/month=03/day=15/.hoodie_partition_metadata
A testdata/data/hudicow/year=2016/month=03/day=15/17dda230-e48a-4110-8c29-c613a3ac0b70-0_2-38-283_20200112194529.parquet
A testdata/data/hudicow/year=2016/month=03/day=15/17dda230-e48a-4110-8c29-c613a3ac0b70-0_2-5-11_20200112194517.parquet
M testdata/workloads/functional-query/queries/QueryTest/create-table.test
A testdata/workloads/functional-query/queries/QueryTest/hudiparquet.test
M tests/query_test/test_scanners.py
43 files changed, 604 insertions(+), 41 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14711/16

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 16
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 22:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5151/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 22
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 07 Feb 2020 05:27:39 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 20:

(6 comments)

Finally able to run the tests!
To load the table:
./bin/load-data.py -w functional-query -f --table_formats=parquet/none/none --table_names=hudi_partitioned
(Then load hudi_as_parquet and hudi_non_partitioned. These three tables use the same HDFS files. I loaded the files in hudi_partitioned only, so I assume the three tables will be created in sequence based on the csv file.)
./tests/run-tests.py query_test/test_scanners.py --table_formats=parquet/none/none -k "hudi"
8 test cases passed.

http://gerrit.cloudera.org:8080/#/c/14711/20/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/14711/20/testdata/data/README@482
PS20, Line 482: .ap
> nit: if possible, please keep the 90 chars line length limit
Done


http://gerrit.cloudera.org:8080/#/c/14711/20/testdata/data/README@485
PS20, Line 485:  
> nit: whitespace
Done


http://gerrit.cloudera.org:8080/#/c/14711/20/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/14711/20/testdata/datasets/functional/functional_schema_template.sql@2762
PS20, Line 2762: ---- PARTITION_COLUMNS
               : year int
               : month int
               : day int
               : hour int
> Since you are using a custom CREATE statement you'll need to define the par
Done


http://gerrit.cloudera.org:8080/#/c/14711/20/testdata/datasets/functional/schema_constraints.csv
File testdata/datasets/functional/schema_constraints.csv:

http://gerrit.cloudera.org:8080/#/c/14711/20/testdata/datasets/functional/schema_constraints.csv@59
PS20, Line 59: hudiparquet
> hudiparquet is not part of the test dimensions of the functional workload. 
nice trick!


http://gerrit.cloudera.org:8080/#/c/14711/20/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/14711/20/tests/query_test/test_scanners.py@313
PS20, Line 313: TestParquet
> TestHudiParquet
Done


http://gerrit.cloudera.org:8080/#/c/14711/20/tests/query_test/test_scanners.py@320
PS20, Line 320:     self.run_test_case('QueryTest/hudi-parquet', vector)
> If in 'schema_constraints.csv' you switch to parquet then you can load the 
Done




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 20
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 06 Feb 2020 02:07:53 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 16:

(1 comment)

Thanks for all the comments. I will update this PR in the next patch.

http://gerrit.cloudera.org:8080/#/c/14711/16/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/14711/16/be/src/exec/hdfs-scan-node-base.cc@379
PS16, Line 379: HUDI_PARQUET
> it would be nice (although maybe as a follow up JIRA), if none of the chang
I definitely agree that not changing anything on the backend would be the best option, but I am not sure there is a way to do so, since we need to let Impala know that the file format is HUDI_PARQUET instead of PARQUET in order to trigger the filter in the SQL statement. I couldn't find a way to do this without editing the thrift code.




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 16
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 22 Jan 2020 23:45:02 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 20:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14711/16/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/14711/16/be/src/exec/hdfs-scan-node-base.cc@379
PS16, Line 379: HUDI_PARQUET
> The code I linked is from PS7 and I think it worked just fine. The current 
OK, I see. So what we are going to do is:
Keep HUDIPARQUET as a SQL keyword.
Keep HUDI_PARQUET in the thrift structure.
In the frontend, when we convert THdfsFileFormat to HdfsFileFormat, we map HUDI_PARQUET to itself, but when we convert HdfsFileFormat back to THdfsFileFormat, we map HUDI_PARQUET to PARQUET. The backend will then treat it as normal PARQUET. Did I understand correctly?
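
A minimal sketch of the asymmetric mapping described above, for illustration only. The two enums are hypothetical stand-ins (ThriftFormatSketch for THdfsFileFormat, CatalogFormatSketch for HdfsFileFormat) with just three members each; the method names fromThrift and toThrift are assumptions and may not match the actual patch.

```java
// Hypothetical stand-in for the Thrift enum THdfsFileFormat (assumption: reduced
// to three members for illustration).
enum ThriftFormatSketch { TEXT, PARQUET, HUDI_PARQUET }

// Hypothetical stand-in for the frontend enum HdfsFileFormat.
enum CatalogFormatSketch {
  TEXT, PARQUET, HUDI_PARQUET;

  // Thrift -> frontend: HUDI_PARQUET is preserved, so the frontend still knows
  // the table is a Hudi table and can apply the Hudi file filter.
  static CatalogFormatSketch fromThrift(ThriftFormatSketch t) {
    switch (t) {
      case TEXT: return TEXT;
      case PARQUET: return PARQUET;
      case HUDI_PARQUET: return HUDI_PARQUET;
      default: throw new IllegalArgumentException("Unknown format: " + t);
    }
  }

  // Frontend -> Thrift: HUDI_PARQUET collapses to PARQUET, so the receiver of
  // this value only ever sees plain PARQUET.
  ThriftFormatSketch toThrift() {
    switch (this) {
      case TEXT: return ThriftFormatSketch.TEXT;
      case PARQUET:
      case HUDI_PARQUET: return ThriftFormatSketch.PARQUET;
      default: throw new IllegalStateException("Unknown format: " + this);
    }
  }
}
```

With this shape the round trip is deliberately lossy in one direction only: a HUDI_PARQUET value converted with toThrift() cannot be told apart from PARQUET on the receiving side.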




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 20
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 05 Feb 2020 21:36:02 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has uploaded a new patch set (#5). ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................

IMPALA-8778: Support Apache Hudi Read Optimized Table

Hudi Read Optimized Table contains multiple versions of parquet files.
In order to load the correct table, Impala needs to recognize Hudi Read
Optimized Table as a HdfsTable and load the latest version of the file
using HoodieROTablePathFilter.

Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
---
M fe/pom.xml
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
A fe/src/main/java/org/apache/impala/util/HudiUtil.java
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
A testdata/data/hudicow/.hoodie/20191218160747.clean
A testdata/data/hudicow/.hoodie/20191218160747.clean.inflight
A testdata/data/hudicow/.hoodie/20191218160747.clean.requested
A testdata/data/hudicow/.hoodie/20191218160747.commit
A testdata/data/hudicow/.hoodie/20191218160747.commit.requested
A testdata/data/hudicow/.hoodie/20191218160747.inflight
A testdata/data/hudicow/.hoodie/20191218160804.clean
A testdata/data/hudicow/.hoodie/20191218160804.clean.inflight
A testdata/data/hudicow/.hoodie/20191218160804.clean.requested
A testdata/data/hudicow/.hoodie/20191218160804.commit
A testdata/data/hudicow/.hoodie/20191218160804.commit.requested
A testdata/data/hudicow/.hoodie/20191218160804.inflight
A testdata/data/hudicow/.hoodie/hoodie.properties
A testdata/data/hudicow/2015/03/16/.hoodie_partition_metadata
A testdata/data/hudicow/2015/03/16/75ca3e88-0a2c-4592-b1c5-4d4812c80388-0_0-38-281_20191218160804.parquet
A testdata/data/hudicow/2015/03/16/75ca3e88-0a2c-4592-b1c5-4d4812c80388-0_1-5-10_20191218160747.parquet
A testdata/data/hudicow/2015/03/17/.hoodie_partition_metadata
A testdata/data/hudicow/2015/03/17/ea972399-ebd0-4447-a01b-6356fd76e43f-0_2-38-283_20191218160804.parquet
A testdata/data/hudicow/2015/03/17/ea972399-ebd0-4447-a01b-6356fd76e43f-0_2-5-11_20191218160747.parquet
A testdata/data/hudicow/2016/03/15/.hoodie_partition_metadata
A testdata/data/hudicow/2016/03/15/e867a156-5330-4a05-a188-ec39f131b1d0-0_0-5-9_20191218160747.parquet
A testdata/data/hudicow/2016/03/15/e867a156-5330-4a05-a188-ec39f131b1d0-0_1-38-282_20191218160804.parquet
A testdata/data/hudimor/.hoodie/20191218161117.clean
A testdata/data/hudimor/.hoodie/20191218161117.clean.inflight
A testdata/data/hudimor/.hoodie/20191218161117.clean.requested
A testdata/data/hudimor/.hoodie/20191218161117.deltacommit
A testdata/data/hudimor/.hoodie/20191218161117.deltacommit.inflight
A testdata/data/hudimor/.hoodie/20191218161117.deltacommit.requested
A testdata/data/hudimor/.hoodie/hoodie.properties
A testdata/data/hudimor/2015/03/16/.hoodie_partition_metadata
A testdata/data/hudimor/2015/03/16/8ffb177f-3c02-461d-925e-032664bfde7d-0_1-5-10_20191218161117.parquet
A testdata/data/hudimor/2015/03/17/.hoodie_partition_metadata
A testdata/data/hudimor/2015/03/17/6edbfe56-d673-443f-870c-c4383c58ecea-0_2-5-11_20191218161117.parquet
A testdata/data/hudimor/2016/03/15/.hoodie_partition_metadata
A testdata/data/hudimor/2016/03/15/893da3b0-996c-483b-ad40-14c54e0e65ab-0_0-5-9_20191218161117.parquet
41 files changed, 542 insertions(+), 20 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14711/5

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 5
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................

IMPALA-8778: Support Apache Hudi Read Optimized Table

Hudi Read Optimized Table contains multiple versions of parquet files.
In order to load the correct table, Impala needs to recognize Hudi Read
Optimized Table as a HdfsTable and load the latest version of the file
using HoodieROTablePathFilter.

Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
---
M fe/pom.xml
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/common/TransactionKeepalive.java
A fe/src/main/java/org/apache/impala/util/HudiUtil.java
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
A testdata/data/hudicow/.hoodie/20191218160747.clean
A testdata/data/hudicow/.hoodie/20191218160747.clean.inflight
A testdata/data/hudicow/.hoodie/20191218160747.clean.requested
A testdata/data/hudicow/.hoodie/20191218160747.commit
A testdata/data/hudicow/.hoodie/20191218160747.commit.requested
A testdata/data/hudicow/.hoodie/20191218160747.inflight
A testdata/data/hudicow/.hoodie/20191218160804.clean
A testdata/data/hudicow/.hoodie/20191218160804.clean.inflight
A testdata/data/hudicow/.hoodie/20191218160804.clean.requested
A testdata/data/hudicow/.hoodie/20191218160804.commit
A testdata/data/hudicow/.hoodie/20191218160804.commit.requested
A testdata/data/hudicow/.hoodie/20191218160804.inflight
A testdata/data/hudicow/.hoodie/hoodie.properties
A testdata/data/hudicow/2015/03/16/.hoodie_partition_metadata
A testdata/data/hudicow/2015/03/16/75ca3e88-0a2c-4592-b1c5-4d4812c80388-0_0-38-281_20191218160804.parquet
A testdata/data/hudicow/2015/03/16/75ca3e88-0a2c-4592-b1c5-4d4812c80388-0_1-5-10_20191218160747.parquet
A testdata/data/hudicow/2015/03/17/.hoodie_partition_metadata
A testdata/data/hudicow/2015/03/17/ea972399-ebd0-4447-a01b-6356fd76e43f-0_2-38-283_20191218160804.parquet
A testdata/data/hudicow/2015/03/17/ea972399-ebd0-4447-a01b-6356fd76e43f-0_2-5-11_20191218160747.parquet
A testdata/data/hudicow/2016/03/15/.hoodie_partition_metadata
A testdata/data/hudicow/2016/03/15/e867a156-5330-4a05-a188-ec39f131b1d0-0_0-5-9_20191218160747.parquet
A testdata/data/hudicow/2016/03/15/e867a156-5330-4a05-a188-ec39f131b1d0-0_1-38-282_20191218160804.parquet
A testdata/data/hudimor/.hoodie/20191218161117.clean
A testdata/data/hudimor/.hoodie/20191218161117.clean.inflight
A testdata/data/hudimor/.hoodie/20191218161117.clean.requested
A testdata/data/hudimor/.hoodie/20191218161117.deltacommit
A testdata/data/hudimor/.hoodie/20191218161117.deltacommit.inflight
A testdata/data/hudimor/.hoodie/20191218161117.deltacommit.requested
A testdata/data/hudimor/.hoodie/hoodie.properties
A testdata/data/hudimor/2015/03/16/.hoodie_partition_metadata
A testdata/data/hudimor/2015/03/16/8ffb177f-3c02-461d-925e-032664bfde7d-0_1-5-10_20191218161117.parquet
A testdata/data/hudimor/2015/03/17/.hoodie_partition_metadata
A testdata/data/hudimor/2015/03/17/6edbfe56-d673-443f-870c-c4383c58ecea-0_2-5-11_20191218161117.parquet
A testdata/data/hudimor/2016/03/15/.hoodie_partition_metadata
A testdata/data/hudimor/2016/03/15/893da3b0-996c-483b-ad40-14c54e0e65ab-0_0-5-9_20191218161117.parquet
42 files changed, 551 insertions(+), 21 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14711/4

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 4
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 22:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5301/ DRY_RUN=false



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 22
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 07 Feb 2020 11:23:51 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 19:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14711/19/tests/common/test_dimensions.py
File tests/common/test_dimensions.py:

http://gerrit.cloudera.org:8080/#/c/14711/19/tests/common/test_dimensions.py@32
PS19, Line 32: s
flake8: E501 line too long (94 > 90 characters)




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 19
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Sun, 02 Feb 2020 23:08:44 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 19:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5582/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 19
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Sun, 02 Feb 2020 23:54:03 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: [WIP]IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 3:

> (1 comment)
 > 
 > Thanks for working on this. So currently you are creating Hudi
 > tables from Hive and only read it from Impala?
 > The code makes sense to me, but of course, needs testing.

Thanks for reviewing. This PR tries to recognize Hudi-format parquet directly from Impala, without using Hive. My idea is to send Impala the query CREATE EXTERNAL TABLE xxx LIKE PARQUET `path` PARTITION BY (xxx) STORED AS HUDI LOCATION `path`; RECOVER PARTITION; REFRESH TABLE xxx
Then build a sync tool on the Hudi side that sends queries to the Impala server to update the changed partitions.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 3
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 15 Nov 2019 23:04:26 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................

IMPALA-8778: Support Apache Hudi Read Optimized Table

Hudi Read Optimized Table contains multiple versions of parquet files.
In order to load the table correctly, Impala needs to recognize Hudi Read
Optimized Table as a HdfsTable and load the latest version of the file
using HoodieROTablePathFilter.

Tests
 - Unit test for Hudi in FileMetadataLoader
 - Create table tests in functional_schema_template.sql
 - Query tests in hudi-parquet.test
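
To illustrate what "load the latest version of the file" means, here is a simplified sketch. It is not HoodieROTablePathFilter and does not use any Hudi API; in particular it ignores the commit timeline under .hoodie. It assumes, based only on the test data file names in this change, that a base file is named <fileId>_<writeToken>_<commitTime>.parquet and that commit times are fixed-width timestamps. The class LatestHudiFileSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: keeps only the newest parquet file per Hudi file id.
class LatestHudiFileSketch {
  private static final String SUFFIX = ".parquet";

  // Assumption: the file id is everything before the first underscore.
  static String fileId(String name) {
    return name.substring(0, name.indexOf('_'));
  }

  // Assumption: the commit time sits between the last underscore and ".parquet".
  static String commitTime(String name) {
    return name.substring(name.lastIndexOf('_') + 1, name.length() - SUFFIX.length());
  }

  // Returns, in sorted order, one file name per file id: the one with the
  // greatest commit time. Names that do not look like base files are skipped.
  static List<String> latestVersions(List<String> names) {
    Map<String, String> latest = new HashMap<>();
    for (String name : names) {
      if (!name.endsWith(SUFFIX) || name.indexOf('_') < 0) continue;
      // Fixed-width timestamps compare correctly as strings.
      latest.merge(fileId(name), name,
          (a, b) -> commitTime(a).compareTo(commitTime(b)) >= 0 ? a : b);
    }
    List<String> result = new ArrayList<>(latest.values());
    Collections.sort(result);
    return result;
  }
}
```

Applied to one partition directory of the test data, a filter of this shape would keep a single parquet file per file id and drop the older commit as well as .hoodie_partition_metadata.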

Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Reviewed-on: http://gerrit.cloudera.org:8080/14711
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/service/query-options-test.cc
M bin/impala-config.sh
M bin/rat_exclude_files.txt
M common/thrift/CatalogObjects.thrift
M fe/pom.xml
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
A fe/src/main/java/org/apache/impala/util/HudiUtil.java
M fe/src/main/jflex/sql-scanner.flex
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
M impala-parent/pom.xml
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/hudi_parquet/.hoodie/20200210090610.clean
A testdata/data/hudi_parquet/.hoodie/20200210090610.clean.inflight
A testdata/data/hudi_parquet/.hoodie/20200210090610.clean.requested
A testdata/data/hudi_parquet/.hoodie/20200210090610.commit
A testdata/data/hudi_parquet/.hoodie/20200210090610.commit.requested
A testdata/data/hudi_parquet/.hoodie/20200210090610.inflight
A testdata/data/hudi_parquet/.hoodie/20200210090618.clean
A testdata/data/hudi_parquet/.hoodie/20200210090618.clean.inflight
A testdata/data/hudi_parquet/.hoodie/20200210090618.clean.requested
A testdata/data/hudi_parquet/.hoodie/20200210090618.commit
A testdata/data/hudi_parquet/.hoodie/20200210090618.commit.requested
A testdata/data/hudi_parquet/.hoodie/20200210090618.inflight
A testdata/data/hudi_parquet/.hoodie/hoodie.properties
A testdata/data/hudi_parquet/year=2015/month=03/day=16/.hoodie_partition_metadata
A testdata/data/hudi_parquet/year=2015/month=03/day=16/5f541af5-ca07-4329-ad8c-40fa9b353f35-0_1-70-118_20200210090610.parquet
A testdata/data/hudi_parquet/year=2015/month=03/day=16/5f541af5-ca07-4329-ad8c-40fa9b353f35-0_2-103-391_20200210090618.parquet
A testdata/data/hudi_parquet/year=2015/month=03/day=17/.hoodie_partition_metadata
A testdata/data/hudi_parquet/year=2015/month=03/day=17/675e035d-c146-4658-9404-fe590e296d80-0_0-103-389_20200210090618.parquet
A testdata/data/hudi_parquet/year=2015/month=03/day=17/675e035d-c146-4658-9404-fe590e296d80-0_0-70-117_20200210090610.parquet
A testdata/data/hudi_parquet/year=2016/month=03/day=15/.hoodie_partition_metadata
A testdata/data/hudi_parquet/year=2016/month=03/day=15/940359ee-cc79-4974-8a2a-5d133a81a3fd-0_1-103-390_20200210090618.parquet
A testdata/data/hudi_parquet/year=2016/month=03/day=15/940359ee-cc79-4974-8a2a-5d133a81a3fd-0_2-70-119_20200210090610.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/hudi-parquet.test
M testdata/workloads/functional-query/queries/QueryTest/set.test
M tests/query_test/test_scanners.py
44 files changed, 626 insertions(+), 41 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 25
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Yanjia Gary Li (Code Review)" <ge...@cloudera.org>.
Yanjia Gary Li has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 14:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java@68
PS14, Line 68: THdfsFileFormat.HUDI_PARQUET
> This is just about the CREATE TABLE AS SELECT statement, which currently ca
Done


http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java@74
PS14, Line 74: HUDI_PARQUET
> We can only create a table from a Parquet file, it doesn't matter if the fi
Done


http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
File fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java@85
PS14, Line 85: udi
> nit: HUDI
Done


http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@372
PS14, Line 372: fileFormats_.contains(HdfsFileFormat.PARQUET)
              :         || fileFormats_.contains(HdfsFileFormat.HUDI_PARQUET)
> Maybe you could introduce an IsParquetBased() method for simplicity.
Good point!
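
A minimal sketch of what such a helper might look like. FileFormatSketch and ScanNodeSketch are hypothetical stand-ins, not Impala's HdfsFileFormat or HdfsScanNode, and the method name follows the reviewer's suggestion in Java casing.

```java
import java.util.Set;

// Hypothetical stand-in for HdfsFileFormat, reduced to three members.
enum FileFormatSketch {
  TEXT, PARQUET, HUDI_PARQUET;

  // True for every format whose data files are Parquet files.
  boolean isParquetBased() {
    return this == PARQUET || this == HUDI_PARQUET;
  }
}

// Hypothetical stand-in for the check in the scan node.
class ScanNodeSketch {
  // Replaces the pair of contains() calls with a single helper-based check.
  static boolean hasParquetBased(Set<FileFormatSketch> fileFormats) {
    for (FileFormatSketch format : fileFormats) {
      if (format.isParquetBased()) return true;
    }
    return false;
  }
}
```

Keeping the check on the enum means a future Parquet-based format only needs to be added in one place.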


http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/util/HudiUtil.java
File fe/src/main/java/org/apache/impala/util/HudiUtil.java:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/util/HudiUtil.java@27
PS14, Line 27: class
> nit: method, you could also describe what is the effect of this function.
Done


http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/jflex/sql-scanner.flex
File fe/src/main/jflex/sql-scanner.flex:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/jflex/sql-scanner.flex@149
PS14, Line 149: hudiparquet
> I see. I don't have a strong opinion about it, my personal preference is fo
Happy to get more inputs here.


http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/bin/create-load-data.sh@216
PS14, Line 216:   # Hudi Table
              :   mkdir ${TMP_DIR}/hudicow
              :   cp -r ${IMPALA_HOME}/testdata/data/hudicow/. ${TMP_DIR}/hudicow
> Maybe load-custom-data() is a better place for it.
Done


http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/bin/generate-schema-statements.py@151
PS14, Line 151: hudiparquet
> We don't need Hudi related changes in this file because we don't want to (a
Done


http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/functional-query_core.csv
File testdata/workloads/functional-query/functional-query_core.csv:

http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/functional-query_core.csv@7
PS14, Line 7: file_format:hudiparquet, dataset: functional, compression_codec: none, compression_type: none
> We cannot write tables in Hudi format yet.
Done


http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/functional-query_dimensions.csv
File testdata/workloads/functional-query/functional-query_dimensions.csv:

http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/functional-query_dimensions.csv@1
PS14, Line 1: hudiparquet
> We cannot write tables in Hudi format.
Done


http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/functional-query_exhaustive.csv
File testdata/workloads/functional-query/functional-query_exhaustive.csv:

http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/functional-query_exhaustive.csv@26
PS14, Line 26: hudiparquet
> We cannot write tables in Hudi format
Done


http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/functional-query_pairwise.csv
File testdata/workloads/functional-query/functional-query_pairwise.csv:

http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/functional-query_pairwise.csv@7
PS14, Line 7: hudiparquet
> We cannot write tables in Hudi format
Done


http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/queries/QueryTest/create-table.test
File testdata/workloads/functional-query/queries/QueryTest/create-table.test:

http://gerrit.cloudera.org:8080/#/c/14711/14/testdata/workloads/functional-query/queries/QueryTest/create-table.test@303
PS14, Line 303: hudiparquet '/test-warehouse/hudicow/year=2015/month=03/day=16/ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-38-282_20200112194529.parquet'
> It's really just a Parquet file. We could already use it in a CREATE TABLE 
Done


http://gerrit.cloudera.org:8080/#/c/14711/14/tests/common/test_dimensions.py
File tests/common/test_dimensions.py:

http://gerrit.cloudera.org:8080/#/c/14711/14/tests/common/test_dimensions.py@33
PS14, Line 33: hudiparquet
> OK, so Impala's end-to-end tests are organized around workloads. A workload
Thanks for all the details! I can add this to the README in the next iteration.




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 14
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 21 Jan 2020 07:41:47 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8778: Support Apache Hudi Read Optimized Table

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 13:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5433/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 13
Gerrit-Owner: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <ya...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 15 Jan 2020 06:44:33 +0000
Gerrit-HasComments: No