You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Alex Behm (Code Review)" <ge...@cloudera.org> on 2017/09/20 15:28:28 UTC

[Impala-ASF-CR] IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.

Alex Behm has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/8110

Change subject: IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.
......................................................................

IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.

Today, Impala populates the 'rawDataSize' property
during COMPUTE STATS for the purpose of extrapolating
row counts based on file sizes.

Intended meaning/use of tblproperties:
- rawDataSize' is the estimated in-memory size of a table
  (without encoding and compression)
- 'totalSize' represents the on-disk size

Using the fields correctly is important for compatibility
with other users of the HMS such as Hive and SparkSQL.
For example, SparkSQL relies on the 'totalSize' for
join ordering.

Testing:
- core/hdfs run passed

Change-Id: If7c2c4e1e99b297c849f9f0d18b2bef34ad811c6
---
M fe/src/main/java/org/apache/impala/catalog/Table.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/test/java/org/apache/impala/planner/StatsExtrapolationTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
4 files changed, 23 insertions(+), 23 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/10/8110/1
-- 
To view, visit http://gerrit.cloudera.org:8080/8110
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: If7c2c4e1e99b297c849f9f0d18b2bef34ad811c6
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>

[Impala-ASF-CR] IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8110 )

Change subject: IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.
......................................................................


Patch Set 3:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1257/


-- 
To view, visit http://gerrit.cloudera.org:8080/8110
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If7c2c4e1e99b297c849f9f0d18b2bef34ad811c6
Gerrit-Change-Number: 8110
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Comment-Date: Thu, 21 Sep 2017 22:02:29 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8110 )

Change subject: IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8110/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8110/1//COMMIT_MSG@9
PS1, Line 9: Today, Impala populates the 'rawDataSize' property
           : during COMPUTE STATS for the purpose of extrapolating
           : row counts based on file sizes.
           : 
           : Intended meaning/use of tblproperties:
           : - rawDataSize' is the estimated in-memory size of a table
           :   (without encoding and compression)
           : - 'totalSize' represents the on-disk size
           : 
           : Using the fields correctly is important for compatibility
           : with other users of the HMS such as Hive and SparkSQL.
           : For example, SparkSQL relies on the 'totalSize' for
           : join ordering.
> Although this is very informative, I don't think I understand what this com
The title says "Use totalSize tblproperty instead of rawDataSize".

Added extra info in commit msg body.


http://gerrit.cloudera.org:8080/#/c/8110/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java:

http://gerrit.cloudera.org:8080/#/c/8110/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@1178
PS1, Line 1178: droppedRawDataSize
> rename to droppedTotalSize to be consistent with the tblproperty being upda
Good catch. I missed that one.



-- 
To view, visit http://gerrit.cloudera.org:8080/8110
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If7c2c4e1e99b297c849f9f0d18b2bef34ad811c6
Gerrit-Change-Number: 8110
Gerrit-PatchSet: 1
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Sep 2017 20:53:00 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change. ( http://gerrit.cloudera.org:8080/8110 )

Change subject: IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8110/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8110/1//COMMIT_MSG@9
PS1, Line 9: Today, Impala populates the 'rawDataSize' property
           : during COMPUTE STATS for the purpose of extrapolating
           : row counts based on file sizes.
           : 
           : Intended meaning/use of tblproperties:
           : - rawDataSize' is the estimated in-memory size of a table
           :   (without encoding and compression)
           : - 'totalSize' represents the on-disk size
           : 
           : Using the fields correctly is important for compatibility
           : with other users of the HMS such as Hive and SparkSQL.
           : For example, SparkSQL relies on the 'totalSize' for
           : join ordering.
Although this is very informative, I don't think I understand what this commit changes. Will we be populating both rawDataSize and totalSize, replace one with the other, or something else?


http://gerrit.cloudera.org:8080/#/c/8110/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java:

http://gerrit.cloudera.org:8080/#/c/8110/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@1178
PS1, Line 1178: droppedRawDataSize
rename to droppedTotalSize to be consistent with the tblproperty being updated?



-- 
To view, visit http://gerrit.cloudera.org:8080/8110
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If7c2c4e1e99b297c849f9f0d18b2bef34ad811c6
Gerrit-Change-Number: 8110
Gerrit-PatchSet: 1
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Sep 2017 18:21:43 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Hello Bharath Vissapragada, Dimitris Tsirogiannis, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8110

to look at the new patch set (#2).

Change subject: IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.
......................................................................

IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.

Today, Impala populates the 'rawDataSize' property
during COMPUTE STATS for the purpose of extrapolating
row counts based on file sizes.

After this patch Impala will populate 'totalSize' instead of
'rawDataSize'. The 'rawDataSize' is not populated or used.

Intended meaning/use of tblproperties:
- rawDataSize' is the estimated in-memory size of a table
  (without encoding and compression)
- 'totalSize' represents the on-disk size

Using the fields correctly is important for compatibility
with other users of the HMS such as Hive and SparkSQL.
For example, SparkSQL relies on the 'totalSize' for
join ordering.

Testing:
- core/hdfs run passed

Change-Id: If7c2c4e1e99b297c849f9f0d18b2bef34ad811c6
---
M fe/src/main/java/org/apache/impala/catalog/Table.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/test/java/org/apache/impala/planner/StatsExtrapolationTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
4 files changed, 25 insertions(+), 25 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/10/8110/2
-- 
To view, visit http://gerrit.cloudera.org:8080/8110
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If7c2c4e1e99b297c849f9f0d18b2bef34ad811c6
Gerrit-Change-Number: 8110
Gerrit-PatchSet: 2
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>

[Impala-ASF-CR] IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8110 )

Change subject: IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.
......................................................................


Patch Set 3: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/8110
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If7c2c4e1e99b297c849f9f0d18b2bef34ad811c6
Gerrit-Change-Number: 8110
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Comment-Date: Fri, 22 Sep 2017 02:07:33 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change. ( http://gerrit.cloudera.org:8080/8110 )

Change subject: IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.
......................................................................


Patch Set 2: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/8110
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If7c2c4e1e99b297c849f9f0d18b2bef34ad811c6
Gerrit-Change-Number: 8110
Gerrit-PatchSet: 2
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Sep 2017 21:22:07 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8110 )

Change subject: IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.
......................................................................

IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.

Today, Impala populates the 'rawDataSize' property
during COMPUTE STATS for the purpose of extrapolating
row counts based on file sizes.

After this patch Impala will populate 'totalSize' instead of
'rawDataSize'. The 'rawDataSize' is not populated or used.

Intended meaning/use of tblproperties:
- rawDataSize' is the estimated in-memory size of a table
  (without encoding and compression)
- 'totalSize' represents the on-disk size

Using the fields correctly is important for compatibility
with other users of the HMS such as Hive and SparkSQL.
For example, SparkSQL relies on the 'totalSize' for
join ordering.

Testing:
- core/hdfs run passed

Change-Id: If7c2c4e1e99b297c849f9f0d18b2bef34ad811c6
Reviewed-on: http://gerrit.cloudera.org:8080/8110
Tested-by: Impala Public Jenkins
Reviewed-by: Alex Behm <al...@cloudera.com>
---
M fe/src/main/java/org/apache/impala/catalog/Table.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/test/java/org/apache/impala/planner/StatsExtrapolationTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
4 files changed, 25 insertions(+), 25 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Alex Behm: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/8110
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: If7c2c4e1e99b297c849f9f0d18b2bef34ad811c6
Gerrit-Change-Number: 8110
Gerrit-PatchSet: 4
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins

[Impala-ASF-CR] IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change.

Change subject: IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.
......................................................................


Patch Set 1: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/8110
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If7c2c4e1e99b297c849f9f0d18b2bef34ad811c6
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8110 )

Change subject: IMPALA-5955: Use totalSize tblproperty instead of rawDataSize.
......................................................................


Patch Set 3: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/8110
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If7c2c4e1e99b297c849f9f0d18b2bef34ad811c6
Gerrit-Change-Number: 8110
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Comment-Date: Fri, 22 Sep 2017 03:38:46 +0000
Gerrit-HasComments: No