You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org> on 2020/06/15 16:47:48 UTC

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Zoltan Borok-Nagy has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16082


Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................

IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Hive ACID supports row-level DELETE and UPDATE operatations on a table.
It achieves it via assigning a unique row-id for each row, and
maintinaining two sets of files in a table. The first set is in the
base/delta directories, they contain the INSERTed rows. The second set
of files are in the delete-delta directories, they contain the DELETEd
rows.

(UPDATE operations are implemented via DELETE+INSERT.)

In the filesystem it looks like e.g.:
 * full_acid/delta_0000001_0000001_0000/0000_0
 * full_acid/delta_0000002_0000002_0000/0000_0
 * full_acid/delete_delta_0000003_0000003_0000/0000_0

During scanning we need to return INSERTed rows minus DELETEd rows.
This patch implements it by creating an ANTI JOIN between the INSERT and
DELETE sets. It is a planner-only modification. Every HDFS SCAN
that scans full ACID tables (that also have deleted rows) are converted
to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE
deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is
created above them.

Later we can add support for other distribution modes if the performance
requires it. E.g. if we have too many deleted rows then probably we are
better off with PARTITIONED distribution mode. We could estimate the
number of deleted rows by sampling the delete delta files.

The current patch only works for primitive types. I.e. we cannot select
nested data if the table has deleted rows.

Testing:
 * added planner test
 * added e2e tests

Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
---
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/util/AcidUtils.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test
A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test
M tests/query_test/test_acid.py
13 files changed, 708 insertions(+), 88 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/1
-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Aman Sinha, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16082

to look at the new patch set (#5).

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................

IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Hive ACID supports row-level DELETE and UPDATE operations on a table.
It achieves it via assigning a unique row-id for each row, and
maintaining two sets of files in a table. The first set is in the
base/delta directories, they contain the INSERTed rows. The second set
of files are in the delete-delta directories, they contain the DELETEd
rows.

(UPDATE operations are implemented via DELETE+INSERT.)

In the filesystem it looks like e.g.:
 * full_acid/delta_0000001_0000001_0000/0000_0
 * full_acid/delta_0000002_0000002_0000/0000_0
 * full_acid/delete_delta_0000003_0000003_0000/0000_0

During scanning we need to return INSERTed rows minus DELETEd rows.
This patch implements it by creating an ANTI JOIN between the INSERT and
DELETE sets. It is a planner-only modification. Every HDFS SCAN
that scans full ACID tables (that also have deleted rows) are converted
to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE
deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is
created above them.

Later we can add support for other distribution modes if the performance
requires it. E.g. if we have too many deleted rows then probably we are
better off with PARTITIONED distribution mode. We could estimate the
number of deleted rows by sampling the delete delta files.

The current patch only works for primitive types. I.e. we cannot select
nested data if the table has deleted rows.

Testing:
 * added planner test
 * added e2e tests

Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
---
M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/util/AcidUtils.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test
A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test
M tests/query_test/test_acid.py
15 files changed, 1,147 insertions(+), 89 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/5
-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16082/11/tests/custom_cluster/test_local_catalog.py
File tests/custom_cluster/test_local_catalog.py:

http://gerrit.cloudera.org:8080/#/c/16082/11/tests/custom_cluster/test_local_catalog.py@30
PS11, Line 30: from tests.common.skip import (SkipIfHive2, SkipIfCatalogV2, SkipIfS3, SkipIfABFS,
flake8: F401 'tests.common.skip.SkipIfCatalogV2' imported but unused



-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 11
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Jul 2020 15:01:58 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 11:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java
File fe/src/main/java/org/apache/impala/catalog/FeFsTable.java:

http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java@376
PS10, Line 376: 
> Could you add a TODO in this method for IMPALA-9883?
Done


http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@2010
PS10, Line 2010: getNumFileDescriptors();
> Please change this to getNumFileDescriptors() after we revise its implement
Done


http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java
File fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java:

http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java@56
PS10, Line 56: import org.apache.thrift.TException;
> nit: unused import
Thanks for pointing these out! Done.


http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java@255
PS10, Line 255:       }
> nit: 4 spaces indent
Done


http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java@409
PS10, Line 409:     public ImmutableList<FileDescriptor> getFileDescriptors() {
              :       if (insertFds_.isEmpty()) return fds_;
              :       List<FileDescriptor> ret = Lists.newArrayListWithCapacity(
              :           insertFds_.size() + deleteFds_.size());
              :       ret.addAll(insertFds_);
              :       ret.addAll(deleteFds_);
              :       return ImmutableList.copyOf(ret);
              :     }
              : 
              :     @Override
              :     public ImmutableList<FileDescriptor> getInsertFileDescriptors() {
              :       return insertFds_;
              :     }
              : 
> Need to update these
Done


http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java
File fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java:

http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java@133
PS10, Line 133: !fileDescriptors_.isEmpty() ||
> nit: It's inefficient to construct the list for this. Maybe change to 
Done


http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java@138
PS10, Line 138: 
> nit: It's inefficient to construct the list for this. Maybe change to
Instead of the ternary operator, I rather just sum up the sizes to keep it more readable.


http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@45
PS9, Line 45: import org.apache.impala.analysis.NullLiteral;
> Looks like IntelliJ is more intelligent in this :D
Yeah :D It's a shame vscode fails on such a trivial task. But apart from this it's a great IDE.


http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1518
PS9, Line 1518:       feTable.getMetaStoreTable().getParameters()));
> PS10 looks good to me. I think the insert files and delete files are used s
I'm OK with this if you think it's good.

Yeah, I was also thinking about merging fileDescriptors_ and insertFileDescriptors_. My only concern is that it would be a bit weird for genDeleteDeltaPartition(), because it would put the delete delta descriptors into the regular descriptors.


http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1536
PS10, Line 1536:         hdfsTblRef.getDesc(), conjuncts, insertDeltaPartitions, hdfsTblRef,
> line too long (93 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
File fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java:

http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java@497
PS10, Line 497: () {
> not *Fail* anymore
Done


http://gerrit.cloudera.org:8080/#/c/16082/10/tests/custom_cluster/test_local_catalog.py
File tests/custom_cluster/test_local_catalog.py:

http://gerrit.cloudera.org:8080/#/c/16082/10/tests/custom_cluster/test_local_catalog.py@467
PS10, Line 467:   @SkipIfHive2.acid
> Could you add a test here for reading functional_orc_def.alltypes_deleted_r
Copied the existing full acid scans test here.



-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 11
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Jul 2020 15:03:05 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 10:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6499/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 Jul 2020 16:05:15 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 14: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 14
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 14 Jul 2020 12:53:50 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 11:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java:

http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@1368
PS11, Line 1368: 
               :   @Override
               :   public long getWriteId() {
               :     return writeId_;
               :   }
               : 
               :   @Override
               :   public HdfsPartition genInsertDeltaPartition() {
               :     ImmutableList<byte[]> fileDescriptors = !encodedInsertFileDescriptors_.isEmpty() ?
               :         encodedInsertFileDescriptors_ : encodedFileDescriptors_;
               :     return new HdfsPartition.Builder(this)
               :         .setId(id_)
               :         .setFileDescriptors(fileDescriptors)
               :         .build();
               :   }
               : 
               :   @Override
               :   public HdfsPartition genDeleteDeltaPartition() {
               :     if (encodedDeleteFileDescriptors_.isEmpty()) return null;
               :     return new HdfsPartition.Builder(this)
               :         .setId(id_)
               :         .setFileDescriptors(encodedDeleteFileDescriptors_)
               :         .build();
               :   }
nit: Could you move these non-static methods to somewhere above the HdfsPartition.Builder? Other codes here are static subclass, fields or methods.


http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1453
PS11, Line 1453:     final String NOT_SUPPORTED_YET = "This query is not supported on full ACID tables " +
               :         "that have deleted rows. As a workaround you can run a major compaction.";
Maybe mention that the query references complex types?

Not sure how complex it is to add a test for this. It may worths some test coverage on it. Or maybe we'll support reading full-ACID complex types soon?


http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1468
PS11, Line 1468: List<? extends FeFsPartition> partitions
unused arguement


http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1471
PS11, Line 1471: hdfsTblRef.getUniqueAlias()
Looks like if the tableRef doesn't have an explict alias, this is the full qualified table name. Could you add some tests to make sure the resolution works as expected? I.e. tests that the full-acid table is used more than once and all occurances don't have explicit alias, e.g.

 select count(*) from (
   select * from functional_orc_def.alltypes_deleted_rows where id % 2 = 0
   union all
   select * from functional_orc_def.alltypes_deleted_rows where id % 2 != 0
 ) t;

 select id from functional_orc_def.alltypes_deleted_rows where id % 2 = 0 limit 10
 union all
 select max(id) from functional_orc_def.alltypes_deleted_rows;


http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1537
PS11, Line 1537: isPartitionKeyScan
Could you add some tests for isPartitionKeyScan=true? IIUC, we only scan one row per partition when isPartitionKeyScan=true. If the first row from the insert delta doesn't match the first row from the delete delta, we will consider this partition as non-empty, no matter whether there are other delete rows that could match the first row from the insert delta.

I think it's worth to add a test like this:

 create table my_acid (id int) partitioned by (p int) stored as orc
   tblproperties('transactional'='true');
 insert into my_acid partition(p=0) values (0), (1), (2);
 // (p=1, id=0) may be the first row of insert delta
 insert into my_acid partition(p=1) values (0), (1), (2);
 // (p=1, id=2) may be the first row of delete delta
 delete from my_acid where p = 1 and id = 2;
 delete from my_acid where p = 1 and id = 1;
 delete from my_acid where p = 1 and id = 0;
 // Run this in Impala. The result should be 0 since partition(p=1) is empty.
 select max(p) from my_acid;


http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1541
PS11, Line 1541: isPartitionKeyScan
Should we always set this to true to scan all delete rows?


http://gerrit.cloudera.org:8080/#/c/16082/11/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test:

http://gerrit.cloudera.org:8080/#/c/16082/11/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@714
PS11, Line 714: ====
Could you add some tests to make sure that the query hints are respected? Also make sure the "using" syntax is correctly handled. E.g.

 select straight_join a.id from functional.alltypes a
   join /* +BROADCAST */ functional_orc_def.alltypes_deleted_rows b
   using (id);

  select straight_join a.id from functional.alltypes a
   join /* +SHUFFLE */ functional_orc_def.alltypes_deleted_rows b
   using (id);


http://gerrit.cloudera.org:8080/#/c/16082/11/testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test
File testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test:

http://gerrit.cloudera.org:8080/#/c/16082/11/testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test@21
PS11, Line 21: insert into acid values (5), (5), (5);
Could you add a test for insert overwrite?


http://gerrit.cloudera.org:8080/#/c/16082/11/testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test@52
PS11, Line 52: select count(*) from functional_orc_def.alltypes_deleted_rows;
Could you add a test for show partitions on this table? Will the deleted partitions still show up?



-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 11
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 10 Jul 2020 02:59:55 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 4:

(11 comments)

Thanks for the comments!

http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java:

http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@673
PS2, Line 673:     fileFormatDescriptor_ = other.fileFormatDescriptor_;
> Looking at the changes in https://issues.apache.org/jira/browse/IMPALA-9778
Resolved it during the rebase (PS3).


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
File fe/src/main/java/org/apache/impala/planner/HashJoinNode.java:

http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java@178
PS2, Line 178:       if (!isAcidJoin_ || detailLevel.ordinal() >= TExplainLevel.EXTENDED.ordinal()) {
> For the anti join should we not show the join conditions in the Impala quer
Done


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/JoinNode.java
File fe/src/main/java/org/apache/impala/planner/JoinNode.java:

http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/JoinNode.java@83
PS2, Line 83:   // True if this join is used to do the join between insert and delete delta files.
> nit: spelling 'deleta'
Done


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/JoinNode.java@88
PS2, Line 88:     displayName_ = "DELETE TABLE " + displayName_;
> The  'ACID' prefix for a HashJoin could be misleading to end users since a 
Yeah, I just wanted to somehow mark this JOIN to make it obvious what's going on, but probably 'ACID' is not the best term to use.

I think 'DELETE TABLE HASH JOIN' makes sense.


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1453
PS2, Line 1453:       for (HdfsPartition.FileDescriptor fd : partition.getFileDescriptors()) {
> The casting  is redundant since 'partition' is already of type FeFsPartitio
Done


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1454
PS2, Line 1454:         if (fd.getRelativePath().startsWith("delete_delta_")) {
> Is there any indicator in the HMS's partition object that indicates whether
Unfortunately there's no such metadata AFAICT. It could be added during data loading, though it'd complicate the code a little, especially when local catalog is being used since it reuses loaded partitions while trying to filter out file descriptors based on an ACID write id list, see IMPALA-9791 for more details.

Also, I did some measurements. Created a table with 1000 partitions and 10 files per each partition. Then I've nested this for-loop into another for-loop with 100 iterations to simulate a table with 100K partitions and 1 million files. This whole check was evaluated between 100-200 milliseconds.

So for now I'm leaning towards keeping it simple unless you feel strongly against it.


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1480
PS2, Line 1480:     rawPath.add("row__id");
> The design doc suggests the anti-join is on 4 fields..is the partition id n
It's needed and it's added by the above for loop at L1474-L1478.


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1551
PS2, Line 1551:     HdfsScanNode deltaScanNode = new HdfsScanNode(ctx_.getNextNodeId(),
> The  'aggInfo' is meant for the simple 'select count(*) from table'  type o
Yeah, this makes sense.
It didn't cause a bug because the backend only optimizes count(*) if it's a "zero-slot table scan", and we've added ACID slots to materialize.


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1591
PS2, Line 1591:             insertSlotDesc.getMaterializedPath())) {
> Call analyze for the BinaryPredicate.  Also, could you add a simple test wi
Done. Added planner tests for explain_level 2 and 3.


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1605
PS2, Line 1605:   /**
> nit: spelling 'partitoin'
Done


http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test:

http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@58
PS2, Line 58: |  row-size=100B cardinality=143
> The cardinality estimate for this hash join would need some adjustment in t
Maybe it's not that wrong. If I use compute stats then the actual cardinalities are stored, i.e. the number of values that are not deleted.

So the left side of the JOIN will have the cardinality that is expected after the JOIN.

The cardinality of the delete table SCAN can be quite off currently because we don't pass the 'conjuncts' to it since there'se no data in the deleta delta files. So it's good that we don't take its cardinality into account. I think the only way to get reasonable cardinality number for the delete SCAN node is to do some sampling on the delete files.



-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Jun 2020 16:40:09 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Aman Sinha (Code Review)" <ge...@cloudera.org>.
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test:

http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@58
PS2, Line 58: |  row-size=100B cardinality=143
> Yeah I don't think we should pass the conjuncts to the delete deltas. Also 
>>  It's because COMPUTE STATS is query-based, so it runs count(*) and count(distinct) queries on a table and stores the results of it.
Makes sense..so Compute Stats is also doing the left anti join.  Do you know if the sampling clause of Compute Stats will be supported for the ACID table ? i.e a 10% sample on the delta table would need to be anti-joined with the delete_delta.  I suppose that should work. In any case, that's a separate investigation.

Regarding the naming, yes, 'EVENTS'  is more accurate than 'TABLE' since it is not really a table.


http://gerrit.cloudera.org:8080/#/c/16082/4/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test:

http://gerrit.cloudera.org:8080/#/c/16082/4/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@286
PS4, Line 286: |  |  hash predicates: functional_orc_def.alltypes_deleted_rows.month = functional_orc_def.alltypes_deleted_rows-delete-delta.month, functional_orc_def.alltypes_deleted_rows.row__id.bucket = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.bucket, functional_orc_def.alltypes_deleted_rows.row__id.originaltransaction = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.originaltransaction, functional_orc_def.alltypes_deleted_rows.row__id.rowid = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.rowid, functional_orc_def.alltypes_deleted_rows.year = functional_orc_def.alltypes_deleted_rows-delete-delta.year
> It only adds the hidden ACID columns + the partitioning columns (year, mont
I see..I missed that the other columns were partitioning cols. I would think that not all the columns need to be in the 'hash predicate' .. as long as the ones with high NDV are considered for hashing, the rest could be part of the 'other join predicates' to avoid hashing on too many columns.   But this could be a future enhancement.



-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 22 Jun 2020 17:54:44 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 10: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 Jul 2020 21:16:51 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Aman Sinha (Code Review)" <ge...@cloudera.org>.
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 7: Code-Review+1

Couple of minor comments. Overall, this LGTM.


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 22 Jun 2020 17:55:41 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 12:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6576/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 12
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 13 Jul 2020 18:58:00 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 6:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/6389/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 22 Jun 2020 16:41:44 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Aman Sinha, Quanlong Huang, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16082

to look at the new patch set (#12).

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................

IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Hive ACID supports row-level DELETE and UPDATE operations on a table.
It achieves it via assigning a unique row-id for each row, and
maintaining two sets of files in a table. The first set is in the
base/delta directories, they contain the INSERTed rows. The second set
of files are in the delete-delta directories, they contain the DELETEd
rows.

(UPDATE operations are implemented via DELETE+INSERT.)

In the filesystem it looks like e.g.:
 * full_acid/delta_0000001_0000001_0000/0000_0
 * full_acid/delta_0000002_0000002_0000/0000_0
 * full_acid/delete_delta_0000003_0000003_0000/0000_0

During scanning we need to return INSERTed rows minus DELETEd rows.
This patch implements it by creating an ANTI JOIN between the INSERT and
DELETE sets. It is a planner-only modification. Every HDFS SCAN
that scans full ACID tables (that also have deleted rows) are converted
to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE
deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is
created above them.

Later we can add support for other distribution modes if the performance
requires it. E.g. if we have too many deleted rows then probably we are
better off with PARTITIONED distribution mode. We could estimate the
number of deleted rows by sampling the delete delta files.

The current patch only works for primitive types. I.e. we cannot select
nested data if the table has deleted rows.

Testing:
 * added planner test
 * added e2e tests

Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
---
M common/thrift/CatalogObjects.thrift
M common/thrift/CatalogService.thrift
M fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java
M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java
M fe/src/main/java/org/apache/impala/catalog/FeFsTable.java
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java
M fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsTable.java
M fe/src/main/java/org/apache/impala/catalog/local/MetaProvider.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/util/AcidUtils.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test
A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test
M tests/custom_cluster/test_local_catalog.py
M tests/query_test/test_acid.py
27 files changed, 1,710 insertions(+), 151 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/12
-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 12
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 9:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@45
PS9, Line 45: import org.apache.impala.analysis.Path.PathType;
nit: keep the import list sorted in groups (usually the IDE will do this for you).


http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1418
PS9, Line 1418:       if (addAcidSlotsIfNeeded(analyzer, hdfsTblRef, partitions)) {
nit: what about merging this if-statement with its outer scope so they are

 if (isPartitionKeyScan && queryOpts.optimize_partition_key_scans) {
   ...
 } else if (addAcidSlotsIfNeeded(analyzer, hdfsTblRef, partitions)) {
   ...
 } else {
   ...
 }


http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1518
PS9, Line 1518:     // Let's separate insert delta File Descriptors from delete delta FDs.
I think we should separate the file descriptors in catalogd after loading them from HDFS instead of doing it here for each query. We can introduce two fileds in HdfsPartition: encodedInsertDeltaFileDescriptors_ and encodedDeleteDeltaFileDescriptors_ (and related fields in THdfsPartition and TPartialPartitionInfo). If a partition contains delete deltas, we separate them by setting these two fields and leaving encodedFileDescriptors_ null. We can also introduce two methods for FeFsPartition: genInsertDeltaPartition() and genDeleteDeltaPartition() using HdfsPartition.Builder in this way:

 public HdfsPartition genInsertDeltaPartition() {
    return new HdfsPartition.Builder(this)
        .setFileDescriptors(InsertDeltaFileDescriptors_)
        .build();
 }

With this we don't need to remove the "final" marker of encodedFileDescriptors_ and add back the setFileDescriptor() method, which violates our goal to make HdfsPartition immutable. The setFileDescriptor() method may encourage future developers to modify HdfsPartitions in-place in catalogd, which will break IMPALA-7533.



-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Jul 2020 06:57:52 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6390/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 22 Jun 2020 16:48:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6399/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 8
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 23 Jun 2020 16:28:22 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Aman Sinha, Quanlong Huang, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16082

to look at the new patch set (#10).

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................

IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Hive ACID supports row-level DELETE and UPDATE operations on a table.
It achieves it via assigning a unique row-id for each row, and
maintaining two sets of files in a table. The first set is in the
base/delta directories, they contain the INSERTed rows. The second set
of files are in the delete-delta directories, they contain the DELETEd
rows.

(UPDATE operations are implemented via DELETE+INSERT.)

In the filesystem it looks like e.g.:
 * full_acid/delta_0000001_0000001_0000/0000_0
 * full_acid/delta_0000002_0000002_0000/0000_0
 * full_acid/delete_delta_0000003_0000003_0000/0000_0

During scanning we need to return INSERTed rows minus DELETEd rows.
This patch implements it by creating an ANTI JOIN between the INSERT and
DELETE sets. It is a planner-only modification. Every HDFS SCAN
that scans full ACID tables (that also have deleted rows) are converted
to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE
deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is
created above them.

Later we can add support for other distribution modes if the performance
requires it. E.g. if we have too many deleted rows then probably we are
better off with PARTITIONED distribution mode. We could estimate the
number of deleted rows by sampling the delete delta files.

The current patch only works for primitive types. I.e. we cannot select
nested data if the table has deleted rows.

Testing:
 * added planner test
 * added e2e tests

Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
---
M common/thrift/CatalogObjects.thrift
M common/thrift/CatalogService.thrift
M fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java
M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java
M fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsTable.java
M fe/src/main/java/org/apache/impala/catalog/local/MetaProvider.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/util/AcidUtils.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test
A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test
M tests/query_test/test_acid.py
25 files changed, 1,429 insertions(+), 142 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/10
-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................

IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Hive ACID supports row-level DELETE and UPDATE operations on a table.
It achieves it via assigning a unique row-id for each row, and
maintaining two sets of files in a table. The first set is in the
base/delta directories, they contain the INSERTed rows. The second set
of files are in the delete-delta directories, they contain the DELETEd
rows.

(UPDATE operations are implemented via DELETE+INSERT.)

In the filesystem it looks like e.g.:
 * full_acid/delta_0000001_0000001_0000/0000_0
 * full_acid/delta_0000002_0000002_0000/0000_0
 * full_acid/delete_delta_0000003_0000003_0000/0000_0

During scanning we need to return INSERTed rows minus DELETEd rows.
This patch implements it by creating an ANTI JOIN between the INSERT and
DELETE sets. It is a planner-only modification. Every HDFS SCAN
that scans full ACID tables (that also have deleted rows) are converted
to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE
deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is
created above them.

Later we can add support for other distribution modes if the performance
requires it. E.g. if we have too many deleted rows then probably we are
better off with PARTITIONED distribution mode. We could estimate the
number of deleted rows by sampling the delete delta files.

The current patch only works for primitive types. I.e. we cannot select
nested data if the table has deleted rows.

Testing:
 * added planner test
 * added e2e tests

Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Reviewed-on: http://gerrit.cloudera.org:8080/16082
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M common/thrift/CatalogObjects.thrift
M common/thrift/CatalogService.thrift
M fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java
M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java
M fe/src/main/java/org/apache/impala/catalog/FeFsTable.java
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java
M fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsTable.java
M fe/src/main/java/org/apache/impala/catalog/local/MetaProvider.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/util/AcidUtils.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test
A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test
M tests/custom_cluster/test_local_catalog.py
M tests/query_test/test_acid.py
27 files changed, 1,710 insertions(+), 151 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 15
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 14: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 14
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 14 Jul 2020 07:49:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1536
PS10, Line 1536:         hdfsTblRef.getDesc(), conjuncts, insertDeltaPartitions, hdfsTblRef, /*aggInfo=*/null,
line too long (93 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 Jul 2020 15:42:35 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 10:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6094/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 Jul 2020 16:09:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 12:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16082/12/tests/custom_cluster/test_local_catalog.py
File tests/custom_cluster/test_local_catalog.py:

http://gerrit.cloudera.org:8080/#/c/16082/12/tests/custom_cluster/test_local_catalog.py@30
PS12, Line 30: from tests.common.skip import (SkipIfHive2, SkipIfCatalogV2, SkipIfS3, SkipIfABFS,
flake8: F401 'tests.common.skip.SkipIfCatalogV2' imported but unused



-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 12
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 13 Jul 2020 18:29:34 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 10:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java
File fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java:

http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java@409
PS10, Line 409:     @Override
              :     public ImmutableList<FileDescriptor> getFileDescriptors() {
              :       return fds_;
              :     }
              : 
              :     @Override
              :     public ImmutableList<FileDescriptor> getInsertFileDescriptors() {
              :       return fds_;
              :     }
              : 
              :     @Override
              :     public ImmutableList<FileDescriptor> getDeleteFileDescriptors() {
              :       return fds_;
              :     }
Need to update these


http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
File fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java:

http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java@497
PS10, Line 497: Fail
not *Fail* anymore



-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 Jul 2020 17:07:44 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6365/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Jun 2020 12:50:19 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Aman Sinha, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16082

to look at the new patch set (#4).

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................

IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Hive ACID supports row-level DELETE and UPDATE operatations on a table.
It achieves it via assigning a unique row-id for each row, and
maintinaining two sets of files in a table. The first set is in the
base/delta directories, they contain the INSERTed rows. The second set
of files are in the delete-delta directories, they contain the DELETEd
rows.

(UPDATE operations are implemented via DELETE+INSERT.)

In the filesystem it looks like e.g.:
 * full_acid/delta_0000001_0000001_0000/0000_0
 * full_acid/delta_0000002_0000002_0000/0000_0
 * full_acid/delete_delta_0000003_0000003_0000/0000_0

During scanning we need to return INSERTed rows minus DELETEd rows.
This patch implements it by creating an ANTI JOIN between the INSERT and
DELETE sets. It is a planner-only modification. Every HDFS SCAN
that scans full ACID tables (that also have deleted rows) are converted
to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE
deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is
created above them.

Later we can add support for other distribution modes if the performance
requires it. E.g. if we have too many deleted rows then probably we are
better off with PARTITIONED distribution mode. We could estimate the
number of deleted rows by sampling the delete delta files.

The current patch only works for primitive types. I.e. we cannot select
nested data if the table has deleted rows.

Testing:
 * added planner test
 * added e2e tests

Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
---
M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/util/AcidUtils.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test
A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test
M tests/query_test/test_acid.py
15 files changed, 1,147 insertions(+), 89 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/4
-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 9:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6464/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Jun 2020 09:28:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 3:

PS3 is only a rebase


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Jun 2020 12:28:44 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6329/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 15 Jun 2020 17:36:10 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Aman Sinha, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16082

to look at the new patch set (#3).

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................

IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Hive ACID supports row-level DELETE and UPDATE operatations on a table.
It achieves it via assigning a unique row-id for each row, and
maintinaining two sets of files in a table. The first set is in the
base/delta directories, they contain the INSERTed rows. The second set
of files are in the delete-delta directories, they contain the DELETEd
rows.

(UPDATE operations are implemented via DELETE+INSERT.)

In the filesystem it looks like e.g.:
 * full_acid/delta_0000001_0000001_0000/0000_0
 * full_acid/delta_0000002_0000002_0000/0000_0
 * full_acid/delete_delta_0000003_0000003_0000/0000_0

During scanning we need to return INSERTed rows minus DELETEd rows.
This patch implements it by creating an ANTI JOIN between the INSERT and
DELETE sets. It is a planner-only modification. Every HDFS SCAN
that scans full ACID tables (that also have deleted rows) are converted
to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE
deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is
created above them.

Later we can add support for other distribution modes if the performance
requires it. E.g. if we have too many deleted rows then probably we are
better off with PARTITIONED distribution mode. We could estimate the
number of deleted rows by sampling the delete delta files.

The current patch only works for primitive types. I.e. we cannot select
nested data if the table has deleted rows.

Testing:
 * added planner test
 * added e2e tests

Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
---
M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/util/AcidUtils.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test
A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test
M tests/query_test/test_acid.py
15 files changed, 745 insertions(+), 89 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/3
-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Aman Sinha, Quanlong Huang, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16082

to look at the new patch set (#11).

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................

IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Hive ACID supports row-level DELETE and UPDATE operations on a table.
It achieves it via assigning a unique row-id for each row, and
maintaining two sets of files in a table. The first set is in the
base/delta directories, they contain the INSERTed rows. The second set
of files are in the delete-delta directories, they contain the DELETEd
rows.

(UPDATE operations are implemented via DELETE+INSERT.)

In the filesystem it looks like e.g.:
 * full_acid/delta_0000001_0000001_0000/0000_0
 * full_acid/delta_0000002_0000002_0000/0000_0
 * full_acid/delete_delta_0000003_0000003_0000/0000_0

During scanning we need to return INSERTed rows minus DELETEd rows.
This patch implements it by creating an ANTI JOIN between the INSERT and
DELETE sets. It is a planner-only modification. Every HDFS SCAN
that scans full ACID tables (that also have deleted rows) are converted
to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE
deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is
created above them.

Later we can add support for other distribution modes if the performance
requires it. E.g. if we have too many deleted rows then probably we are
better off with PARTITIONED distribution mode. We could estimate the
number of deleted rows by sampling the delete delta files.

The current patch only works for primitive types. I.e. we cannot select
nested data if the table has deleted rows.

Testing:
 * added planner test
 * added e2e tests

Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
---
M common/thrift/CatalogObjects.thrift
M common/thrift/CatalogService.thrift
M fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java
M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java
M fe/src/main/java/org/apache/impala/catalog/FeFsTable.java
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java
M fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsTable.java
M fe/src/main/java/org/apache/impala/catalog/local/MetaProvider.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/util/AcidUtils.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test
A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test
M tests/custom_cluster/test_local_catalog.py
M tests/query_test/test_acid.py
27 files changed, 1,460 insertions(+), 146 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/11
-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 11
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 9: Code-Review+1

PS9 is a rebase. Carrying +1.


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Jun 2020 09:02:37 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 12:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java:

http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@1368
PS11, Line 1368:         // Validate all the partition key/values to ensure you can convert them toThrift()
               :         Expr.treesToThrift(getPartitionValues());
               :       } catch (Exception e) {
               :         throw new CatalogException("Partition (" + getPartitionName() +
               :             ") has invalid partition column values: ", e);
               :       }
               :     }
               :   }
               : 
               :   /**
               :    * Comparator to allow ordering of partitions by their partition-key values.
               :    */
               :   public static final KeyValueComparator KV_COMPARATOR = new KeyValueComparator();
               :   public static class KeyValueComparator implements Comparator<FeFsPartition> {
               :     @Override
               :     public int compare(FeFsPartition o1, FeFsPartition o2) {
               :       return comparePartitionKeyValues(o1.getPartitionValues(), o2.getPartitionValues());
               :     }
               :   }
               : 
               :   @VisibleForTesting
               :   public static int comparePartitionKeyValues(List<LiteralExpr> lhs,
               :       List<LiteralExpr> rhs) {
               :    
> nit: Could you move these non-static methods to somewhere above the HdfsPar
Done


http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1453
PS11, Line 1453:         "that have deleted rows and complex types. As a workaround you can run a " +
               :         "major compaction.";
> Maybe mention that the query references complex types?
Done. Full-ACID complex types are coming soon, but it was low effort to add some tests for it so I added.


http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1468
PS11, Line 1468: throws AnalysisException {
> unused arguement
Done


http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1471
PS11, Line 1471: hdfsTblRef.getUniqueAlias()
> Looks like if the tableRef doesn't have an explict alias, this is the full 
It works well, added some tests.


http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1537
PS11, Line 1537: /*isPartitionKeySc
> Could you add some tests for isPartitionKeyScan=true? IIUC, we only scan on
Thanks for pointing this out. The code really had a bug, though I needed a bit different statements to hit it.

When 'isPartitionKeyScan=true' HdfsScanNode creates only one scan range for each file descriptor. Then the backend scanners only return a single row from each scan range.

So to hit the bug I had to issue the following statements:

 create table my_acid (id int) partitioned by (p int) stored as orc
   tblproperties('transactional'='true');
 insert into my_acid partition(p=0) values (0), (1), (2);
 insert into my_acid partition(p=1) values (0), (1), (2);
 delete from my_acid where p = 1 and id = 0;
 // Run this in Impala. The result should be 1 since partition(p=1) is NOT empty.
 select max(p) from my_acid;

In p=1 we have only one insert delta file which contains (0), (1), and (2), all of them fit into one scan range. So the scanner only returns the first element (0) that we have deleted.

Added this example as e2e test.


http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1541
PS11, Line 1541: /*isPartitionKeySc
> Should we always set this to true to scan all delete rows?
I think you meant 'false', and you are right. And we also need to set it to false for the insert delta files as well.


http://gerrit.cloudera.org:8080/#/c/16082/11/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test:

http://gerrit.cloudera.org:8080/#/c/16082/11/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@714
PS11, Line 714: ====
> Could you add some tests to make sure that the query hints are respected? A
Done


http://gerrit.cloudera.org:8080/#/c/16082/11/testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test
File testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test:

http://gerrit.cloudera.org:8080/#/c/16082/11/testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test@21
PS11, Line 21: insert into acid values (5), (5), (5);
> Could you add a test for insert overwrite?
Thanks, it actually revealed a bug related to the splitted file descriptors.

If the table had delete deltas and I issued a REFRESH, then the old deltas could shadow the files coming from a newer base directory because I didn't clear the old file descriptors in ParallelFileMetadataLoader.loadAndSet().


http://gerrit.cloudera.org:8080/#/c/16082/11/testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test@52
PS11, Line 52: use $DATABASE;
> Could you add a test for show partitions on this table? Will the deleted pa
alltypes_deleted_rows doesn't have deleted partitions, but I added a test for a temporary table.

SHOW PARTITIONS shows the partitions with no data, similar to Hive.



-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 12
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 13 Jul 2020 18:30:23 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Aman Sinha, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16082

to look at the new patch set (#8).

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................

IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Hive ACID supports row-level DELETE and UPDATE operations on a table.
It achieves it via assigning a unique row-id for each row, and
maintaining two sets of files in a table. The first set is in the
base/delta directories, they contain the INSERTed rows. The second set
of files are in the delete-delta directories, they contain the DELETEd
rows.

(UPDATE operations are implemented via DELETE+INSERT.)

In the filesystem it looks like e.g.:
 * full_acid/delta_0000001_0000001_0000/0000_0
 * full_acid/delta_0000002_0000002_0000/0000_0
 * full_acid/delete_delta_0000003_0000003_0000/0000_0

During scanning we need to return INSERTed rows minus DELETEd rows.
This patch implements it by creating an ANTI JOIN between the INSERT and
DELETE sets. It is a planner-only modification. Every HDFS SCAN
that scans full ACID tables (that also have deleted rows) are converted
to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE
deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is
created above them.

Later we can add support for other distribution modes if the performance
requires it. E.g. if we have too many deleted rows then probably we are
better off with PARTITIONED distribution mode. We could estimate the
number of deleted rows by sampling the delete delta files.

The current patch only works for primitive types. I.e. we cannot select
nested data if the table has deleted rows.

Testing:
 * added planner test
 * added e2e tests

Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
---
M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/util/AcidUtils.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test
A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test
M tests/query_test/test_acid.py
15 files changed, 1,148 insertions(+), 89 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/8
-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 8
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 14:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6127/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 14
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 14 Jul 2020 07:49:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 9:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6078/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Jun 2020 09:02:53 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Aman Sinha, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16082

to look at the new patch set (#7).

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................

IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Hive ACID supports row-level DELETE and UPDATE operations on a table.
It achieves it via assigning a unique row-id for each row, and
maintaining two sets of files in a table. The first set is in the
base/delta directories, they contain the INSERTed rows. The second set
of files are in the delete-delta directories, they contain the DELETEd
rows.

(UPDATE operations are implemented via DELETE+INSERT.)

In the filesystem it looks like e.g.:
 * full_acid/delta_0000001_0000001_0000/0000_0
 * full_acid/delta_0000002_0000002_0000/0000_0
 * full_acid/delete_delta_0000003_0000003_0000/0000_0

During scanning we need to return INSERTed rows minus DELETEd rows.
This patch implements it by creating an ANTI JOIN between the INSERT and
DELETE sets. It is a planner-only modification. Every HDFS SCAN
that scans full ACID tables (that also have deleted rows) are converted
to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE
deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is
created above them.

Later we can add support for other distribution modes if the performance
requires it. E.g. if we have too many deleted rows then probably we are
better off with PARTITIONED distribution mode. We could estimate the
number of deleted rows by sampling the delete delta files.

The current patch only works for primitive types. I.e. we cannot select
nested data if the table has deleted rows.

Testing:
 * added planner test
 * added e2e tests

Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
---
M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/util/AcidUtils.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test
A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test
M tests/query_test/test_acid.py
15 files changed, 1,143 insertions(+), 89 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/7
-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 11:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6513/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 11
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Jul 2020 15:29:09 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 13: Code-Review+2

(1 comment)

Thanks for adding bunch of tests! LGTM. Let's ship it!

http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1541
PS11, Line 1541: /*isPartitionKeySc
> I think you meant 'false', and you are right. And we also need to set it to
Oops, yes, I mean 'false'... Thanks for addressing this.



-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 13
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 14 Jul 2020 00:20:26 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 9:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@45
PS9, Line 45: import org.apache.impala.analysis.Path.PathType;
> nit: keep the import list sorted in groups (usually the IDE will do this fo
Yeah, I use my IDE to do that for me. Interestingly, even if I delete this line, and ask VSCode to import PathType for me, it puts the import to this line again...

Same with HdfsPartition.FileDescriptor. Seems like if an import path has more components it can confuse vscode.

Anyway, fixed it manually :D


http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1418
PS9, Line 1418:       if (addAcidSlotsIfNeeded(analyzer, hdfsTblRef, partitions)) {
> nit: what about merging this if-statement with its outer scope so they are
Done


http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1518
PS9, Line 1518:     // Let's separate insert delta File Descriptors from delete delta FDs.
> I think we should separate the file descriptors in catalogd after loading t
I implemented it in PS10, but I'm afraid that the change became a bit too invasive. An alternative approach is just to add the genInsertDeltaPartition() and genDeleteDeltaPartition() to the partition classes and those would filter out the unneeded file descs and return a new Partition object. This way the change would be smaller, but we'd still have to generate these insert/delta partitions for each query. What do you think?



-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 Jul 2020 15:46:36 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 7:

PS 6&7 are only rebases.


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 22 Jun 2020 16:21:57 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 8: Code-Review+1

(2 comments)

Carrying Aman's +1

http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test:

http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@58
PS2, Line 58: |  row-size=100B cardinality=143
> >>  It's because COMPUTE STATS is query-based, so it runs count(*) and coun
Thanks for mentioning stats extrapolation. I created IMPALA-9883 for that.

Renamed DELETA TABLE to DELETE EVENTS.


http://gerrit.cloudera.org:8080/#/c/16082/4/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test:

http://gerrit.cloudera.org:8080/#/c/16082/4/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@286
PS4, Line 286: |  |  hash predicates: functional_orc_def.alltypes_deleted_rows.month = functional_orc_def.alltypes_deleted_rows-delete-delta.month, functional_orc_def.alltypes_deleted_rows.row__id.bucket = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.bucket, functional_orc_def.alltypes_deleted_rows.row__id.originaltransaction = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.originaltransaction, functional_orc_def.alltypes_deleted_rows.row__id.rowid = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.rowid, functional_orc_def.alltypes_deleted_rows.year = functional_orc_def.alltypes_deleted_rows-delete-delta.year
> I see..I missed that the other columns were partitioning cols. I would thin
Added a TODO comment about it in SingleNodePlanner.



-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 8
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 23 Jun 2020 16:06:12 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6371/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Jun 2020 17:05:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 10:

(9 comments)

Thanks for making the change! I'm still looking into the join related logic and tests.

http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java
File fe/src/main/java/org/apache/impala/catalog/FeFsTable.java:

http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java@376
PS10, Line 376: getFilesSample
Could you add a TODO in this method for IMPALA-9883?


http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@2010
PS10, Line 2010: getFileDescriptors().size()
Please change this to getNumFileDescriptors() after we revise its implementation.


http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java
File fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java:

http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java@56
PS10, Line 56: import org.apache.orc.FileMetadata;
nit: unused import


http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java@255
PS10, Line 255:         
nit: 4 spaces indent


http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java
File fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java:

http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java@133
PS10, Line 133: !getFileDescriptors().isEmpty()
nit: It's inefficient to construct the list for this. Maybe change to 

 return !fileDescriptors_.isEmpty() || !insertFileDescriptors_.isEmpty() || !deleteFileDescriptors_.isEmpty().


http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java@138
PS10, Line 138: getFileDescriptors().size()
nit: It's inefficient to construct the list for this. Maybe change to

 return fileDescriptors_.isEmpty()? insertFileDescriptors_.size() + deleteFileDescriptors_.size() : fileDescriptors_.size();


http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@45
PS9, Line 45: import org.apache.impala.analysis.NullLiteral;
> Yeah, I use my IDE to do that for me. Interestingly, even if I delete this 
Looks like IntelliJ is more intelligent in this :D


http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1518
PS9, Line 1518:       feTable.getMetaStoreTable().getParameters()));
> I implemented it in PS10, but I'm afraid that the change became a bit too i
PS10 looks good to me. I think the insert files and delete files are used separately so it makes sense to store them in different fileds. However, It'd be good to ask more feedbacks on this decision.

Maybe we can merge fileDescriptors_ and insertFileDescriptors_ to one field (since only one of them is valid at a time) to reduce the complexity.


http://gerrit.cloudera.org:8080/#/c/16082/10/tests/custom_cluster/test_local_catalog.py
File tests/custom_cluster/test_local_catalog.py:

http://gerrit.cloudera.org:8080/#/c/16082/10/tests/custom_cluster/test_local_catalog.py@467
PS10, Line 467:   def test_full_acid_support(self):
Could you add a test here for reading functional_orc_def.alltypes_deleted_rows in LocalCatalog mode?



-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Jul 2020 09:59:38 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Aman Sinha, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16082

to look at the new patch set (#9).

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................

IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Hive ACID supports row-level DELETE and UPDATE operations on a table.
It achieves it via assigning a unique row-id for each row, and
maintaining two sets of files in a table. The first set is in the
base/delta directories, they contain the INSERTed rows. The second set
of files are in the delete-delta directories, they contain the DELETEd
rows.

(UPDATE operations are implemented via DELETE+INSERT.)

In the filesystem it looks like e.g.:
 * full_acid/delta_0000001_0000001_0000/0000_0
 * full_acid/delta_0000002_0000002_0000/0000_0
 * full_acid/delete_delta_0000003_0000003_0000/0000_0

During scanning we need to return INSERTed rows minus DELETEd rows.
This patch implements it by creating an ANTI JOIN between the INSERT and
DELETE sets. It is a planner-only modification. Every HDFS SCAN
that scans full ACID tables (that also have deleted rows) are converted
to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE
deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is
created above them.

Later we can add support for other distribution modes if the performance
requires it. E.g. if we have too many deleted rows then probably we are
better off with PARTITIONED distribution mode. We could estimate the
number of deleted rows by sampling the delete delta files.

The current patch only works for primitive types. I.e. we cannot select
nested data if the table has deleted rows.

Testing:
 * added planner test
 * added e2e tests

Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
---
M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/util/AcidUtils.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test
A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test
M tests/query_test/test_acid.py
15 files changed, 1,148 insertions(+), 89 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/9
-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6350/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 17 Jun 2020 16:16:25 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 12:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16082/12/tests/custom_cluster/test_local_catalog.py
File tests/custom_cluster/test_local_catalog.py:

http://gerrit.cloudera.org:8080/#/c/16082/12/tests/custom_cluster/test_local_catalog.py@30
PS12, Line 30: from tests.common.skip import (SkipIfHive2, SkipIfCatalogV2, SkipIfS3, SkipIfABFS,
> flake8: F401 'tests.common.skip.SkipIfCatalogV2' imported but unused
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 12
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 13 Jul 2020 18:31:36 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Aman Sinha (Code Review)" <ge...@cloudera.org>.
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 4:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/JoinNode.java
File fe/src/main/java/org/apache/impala/planner/JoinNode.java:

http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/JoinNode.java@88
PS2, Line 88:     displayName_ = "DELETE TABLE " + displayName_;
> Yeah, I just wanted to somehow mark this JOIN to make it obvious what's goi
Sounds good.


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1454
PS2, Line 1454:         if (fd.getRelativePath().startsWith("delete_delta_")) {
> Unfortunately there's no such metadata AFAICT. It could be added during dat
I think for this patch it is fine to keep it this way.  I saw in the doc that there's a performance analysis phase ..perhaps it could be investigated/improved in that phase.  Since this extra cost is added during planning time which typically is in hundreds of ms, It would be interesting to see for typical workloads what is the percentage change in planning time.


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1480
PS2, Line 1480:     rawPath.add("row__id");
> It's needed and it's added by the above for loop at L1474-L1478.
ah, i see..thanks.


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1551
PS2, Line 1551:     HdfsScanNode deltaScanNode = new HdfsScanNode(ctx_.getNextNodeId(),
> Yeah, this makes sense.
Actually, the planner does this optimization as part of SanNode.applyCountStarOptimization() which is triggered for Parquet file formats currently..so I suppose in the ORC case it doesn't even matter (although I could foresee doing such optimization for ORC as well).


http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test:

http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@58
PS2, Line 58: |  row-size=100B cardinality=143
> Maybe it's not that wrong. If I use compute stats then the actual cardinali
Regarding the conjuncts, wouldn't it be incorrect to  even pass the conjuncts to the delete delta scan ?  we don't want to prune out deleted records too early. 
For the cardinality, do you mean if the original cardinality was 100 and 10 rows were deleted,  does COMPUTE STATS show  90 ?  Even if no compaction was done ?


http://gerrit.cloudera.org:8080/#/c/16082/4/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test:

http://gerrit.cloudera.org:8080/#/c/16082/4/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@286
PS4, Line 286: |  |  hash predicates: functional_orc_def.alltypes_deleted_rows.month = functional_orc_def.alltypes_deleted_rows-delete-delta.month, functional_orc_def.alltypes_deleted_rows.row__id.bucket = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.bucket, functional_orc_def.alltypes_deleted_rows.row__id.originaltransaction = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.originaltransaction, functional_orc_def.alltypes_deleted_rows.row__id.rowid = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.rowid, functional_orc_def.alltypes_deleted_rows.year = functional_orc_def.alltypes_deleted_rows-delete-delta.year
This looks like it is adding all the other Slots to hash predicates, apart from the hidden slots.  Isn't that wrong ? My understanding was the anti join should only join on the ACID specific columns.



-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Jun 2020 01:27:58 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Aman Sinha (Code Review)" <ge...@cloudera.org>.
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 2:

(11 comments)

Did a first pass..some comments below.

http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java:

http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@673
PS2, Line 673:   protected HdfsPartition(HdfsPartition other) {
Looking at the changes in https://issues.apache.org/jira/browse/IMPALA-9778,  the HdfsPartition class  has been restructured quite a bit.. although it seems your changes should be ok since you are only creating a clone.


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
File fe/src/main/java/org/apache/impala/planner/HashJoinNode.java:

http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java@178
PS2, Line 178:       if (!isAcidJoin_ || detailLevel.ordinal() == TExplainLevel.VERBOSE.ordinal()) {
For the anti join should we not show the join conditions in the Impala query profile ? (query profile output is in EXTENDED mode, not VERBOSE).


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/JoinNode.java
File fe/src/main/java/org/apache/impala/planner/JoinNode.java:

http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/JoinNode.java@83
PS2, Line 83:   // True if this join is used to do the join between insert and deleta delta files.
nit: spelling 'deleta'


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/JoinNode.java@88
PS2, Line 88:     displayName_ = "ACID " + displayName_;
The  'ACID' prefix for a HashJoin could be misleading to end users since a plan operator itself is not ACID in DBMS terms.  Could we use something like. 'DELTA TABLE HASH JOIN'  (open to other suggestions).


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1453
PS2, Line 1453:       FeFsPartition fsPartition = (FeFsPartition)partition;
The casting  is redundant since 'partition' is already of type FeFsPartition.


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1454
PS2, Line 1454:       for (HdfsPartition.FileDescriptor fd : fsPartition.getFileDescriptors()) {
Is there any indicator in the HMS's partition object that indicates whether any delete was done ?  It does seem expensive to  check all the file descriptors within a partition since in vast majority of cases nothing may have been deleted or the 'delete_delta' is found only towards the end. However,  if such metadata does not exist,  then I suppose this is the only alternative.


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1480
PS2, Line 1480:     String[] acidFields = {"originaltransaction", "bucket", "rowid"};
The design doc suggests the anti-join is on 4 fields..is the partition id not needed in this join ?


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1551
PS2, Line 1551:         hdfsTblRef.getDesc(), conjuncts, insertPartitions, hdfsTblRef, aggInfo,
The  'aggInfo' is meant for the simple 'select count(*) from table'  type of query where the aggregation value can be found from the row count in the metadata.  But in this case, since  we want to compute this value only after the anti-join has been done,  I think it should be null.


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1591
PS2, Line 1591:           ret.add(new BinaryPredicate(
Call analyze for the BinaryPredicate.  Also, could you add a simple test with explain_level=3 to check for the join condition ? I don't think I saw one.


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1605
PS2, Line 1605:    * @return the cloned partitoin
nit: spelling 'partitoin'


http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test:

http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@58
PS2, Line 58: |  row-size=100B cardinality=143
The cardinality estimate for this hash join would need some adjustment in the future..right now it is showing the default behavior of a Left Outer Join, which is not accurate for the join between the delta and delete_deltas.



-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Jun 2020 00:27:49 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 5:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16082/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16082/4//COMMIT_MSG@9
PS4, Line 9: Hive ACID supports row-level DELETE and UPDATE operations on a table.
> nit: spelling 'operatations'
Done


http://gerrit.cloudera.org:8080/#/c/16082/4//COMMIT_MSG@11
PS4, Line 11: maintaining two sets of files in a table. The first set is in the
> nit: spelling 'maintinaining'
Done


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1454
PS2, Line 1454:         if (fd.getRelativePath().startsWith("delete_delta_")) {
> I think for this patch it is fine to keep it this way.  I saw in the doc th
Yeah I agree. Will do that during perf analysis.


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1551
PS2, Line 1551:     HdfsScanNode deltaScanNode = new HdfsScanNode(ctx_.getNextNodeId(),
> Actually, the planner does this optimization as part of SanNode.applyCountS
You are right, I was thinking about a different optimization in the ORC scanner which is also for count(*):

https://github.com/apache/impala/blob/17fd15c6e4981499932c02d541c76757a5fdf87d/be/src/exec/hdfs-orc-scanner.cc#L531-L548

Anyway, now we have tests for count(*) + ACID so if someone implements the Parquet-kind count(*) optimization for ORC in the future, they'll know if they break stg.


http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test:

http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@58
PS2, Line 58: |  row-size=100B cardinality=143
> Regarding the conjuncts, wouldn't it be incorrect to  even pass the conjunc
Yeah I don't think we should pass the conjuncts to the delete deltas. Also it wouldn't have any effect since the delete delta files contain no real data, only the ACID fields (originaltransaction, rowid, etc.) which serve as a tombstone.

To answer the question about cardinality, yes, COMPUTE STATS would show 90. It's because COMPUTE STATS is query-based, so it runs count(*) and count(distinct) queries on a table and stores the results of it. And the "DELETE TABLE" is not a separate table. The delete files are stored under the table's directory in so called "delete delta" directories next to the "delta" directories, e.g.:

  /test-warehouse/managed/full_acid_part/p=1/delete_delta_0000003_0000003_0000/bucket_00000
  /test-warehouse/managed/full_acid_part/p=1/delta_0000001_0000001_0000/bucket_00000_0

Given that do you think we should rename "DELETE TABLE HASH JOIN" to "DELETE EVENTS HASH JOIN"? (The Hive docs usually refer to the rows in delete delta files as "delete events")


http://gerrit.cloudera.org:8080/#/c/16082/4/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test:

http://gerrit.cloudera.org:8080/#/c/16082/4/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@286
PS4, Line 286: |  |  hash predicates: functional_orc_def.alltypes_deleted_rows.month = functional_orc_def.alltypes_deleted_rows-delete-delta.month, functional_orc_def.alltypes_deleted_rows.row__id.bucket = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.bucket, functional_orc_def.alltypes_deleted_rows.row__id.originaltransaction = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.originaltransaction, functional_orc_def.alltypes_deleted_rows.row__id.rowid = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.rowid, functional_orc_def.alltypes_deleted_rows.year = functional_orc_def.alltypes_deleted_rows-delete-delta.year
> This looks like it is adding all the other Slots to hash predicates, apart 
It only adds the hidden ACID columns + the partitioning columns (year, month).



-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 22 Jun 2020 16:13:53 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16082

to look at the new patch set (#2).

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................

IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Hive ACID supports row-level DELETE and UPDATE operatations on a table.
It achieves it via assigning a unique row-id for each row, and
maintinaining two sets of files in a table. The first set is in the
base/delta directories, they contain the INSERTed rows. The second set
of files are in the delete-delta directories, they contain the DELETEd
rows.

(UPDATE operations are implemented via DELETE+INSERT.)

In the filesystem it looks like e.g.:
 * full_acid/delta_0000001_0000001_0000/0000_0
 * full_acid/delta_0000002_0000002_0000/0000_0
 * full_acid/delete_delta_0000003_0000003_0000/0000_0

During scanning we need to return INSERTed rows minus DELETEd rows.
This patch implements it by creating an ANTI JOIN between the INSERT and
DELETE sets. It is a planner-only modification. Every HDFS SCAN
that scans full ACID tables (that also have deleted rows) are converted
to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE
deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is
created above them.

Later we can add support for other distribution modes if the performance
requires it. E.g. if we have too many deleted rows then probably we are
better off with PARTITIONED distribution mode. We could estimate the
number of deleted rows by sampling the delete delta files.

The current patch only works for primitive types. I.e. we cannot select
nested data if the table has deleted rows.

Testing:
 * added planner test
 * added e2e tests

Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
---
M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/util/AcidUtils.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test
A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test
M tests/query_test/test_acid.py
15 files changed, 733 insertions(+), 88 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/2
-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Aman Sinha, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16082

to look at the new patch set (#6).

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................

IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Hive ACID supports row-level DELETE and UPDATE operations on a table.
It achieves it via assigning a unique row-id for each row, and
maintaining two sets of files in a table. The first set is in the
base/delta directories, they contain the INSERTed rows. The second set
of files are in the delete-delta directories, they contain the DELETEd
rows.

(UPDATE operations are implemented via DELETE+INSERT.)

In the filesystem it looks like e.g.:
 * full_acid/delta_0000001_0000001_0000/0000_0
 * full_acid/delta_0000002_0000002_0000/0000_0
 * full_acid/delete_delta_0000003_0000003_0000/0000_0

During scanning we need to return INSERTed rows minus DELETEd rows.
This patch implements it by creating an ANTI JOIN between the INSERT and
DELETE sets. It is a planner-only modification. Every HDFS SCAN
that scans full ACID tables (that also have deleted rows) are converted
to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE
deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is
created above them.

Later we can add support for other distribution modes if the performance
requires it. E.g. if we have too many deleted rows then probably we are
better off with PARTITIONED distribution mode. We could estimate the
number of deleted rows by sampling the delete delta files.

The current patch only works for primitive types. I.e. we cannot select
nested data if the table has deleted rows.

Testing:
 * added planner test
 * added e2e tests

Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
---
M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/util/AcidUtils.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test
A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test
M tests/query_test/test_acid.py
15 files changed, 1,142 insertions(+), 89 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/6
-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 9: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Jun 2020 14:02:31 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 13:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6577/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 13
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 13 Jul 2020 18:57:44 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Aman Sinha (Code Review)" <ge...@cloudera.org>.
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16082/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16082/4//COMMIT_MSG@9
PS4, Line 9: Hive ACID supports row-level DELETE and UPDATE operatations on a table.
nit: spelling 'operatations'


http://gerrit.cloudera.org:8080/#/c/16082/4//COMMIT_MSG@11
PS4, Line 11: maintinaining two sets of files in a table. The first set is in the
nit: spelling 'maintinaining'



-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Jun 2020 19:12:43 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Aman Sinha, Quanlong Huang, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16082

to look at the new patch set (#13).

Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
......................................................................

IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Hive ACID supports row-level DELETE and UPDATE operations on a table.
It achieves it via assigning a unique row-id for each row, and
maintaining two sets of files in a table. The first set is in the
base/delta directories, they contain the INSERTed rows. The second set
of files are in the delete-delta directories, they contain the DELETEd
rows.

(UPDATE operations are implemented via DELETE+INSERT.)

In the filesystem it looks like e.g.:
 * full_acid/delta_0000001_0000001_0000/0000_0
 * full_acid/delta_0000002_0000002_0000/0000_0
 * full_acid/delete_delta_0000003_0000003_0000/0000_0

During scanning we need to return INSERTed rows minus DELETEd rows.
This patch implements it by creating an ANTI JOIN between the INSERT and
DELETE sets. It is a planner-only modification. Every HDFS SCAN
that scans full ACID tables (that also have deleted rows) are converted
to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE
deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is
created above them.

Later we can add support for other distribution modes if the performance
requires it. E.g. if we have too many deleted rows then probably we are
better off with PARTITIONED distribution mode. We could estimate the
number of deleted rows by sampling the delete delta files.

The current patch only works for primitive types. I.e. we cannot select
nested data if the table has deleted rows.

Testing:
 * added planner test
 * added e2e tests

Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
---
M common/thrift/CatalogObjects.thrift
M common/thrift/CatalogService.thrift
M fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java
M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java
M fe/src/main/java/org/apache/impala/catalog/FeFsTable.java
M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java
M fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsTable.java
M fe/src/main/java/org/apache/impala/catalog/local/MetaProvider.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/util/AcidUtils.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test
A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test
M tests/custom_cluster/test_local_catalog.py
M tests/query_test/test_acid.py
27 files changed, 1,710 insertions(+), 151 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/13
-- 
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 13
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>