You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2018/07/13 00:21:50 UTC

[Impala-ASF-CR] IMPALA-7294. TABLESAMPLE should not allocate array based on total table file count

Hello Tianyi Wang, Vuk Ercegovac,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/10936

to review the following change.


Change subject: IMPALA-7294. TABLESAMPLE should not allocate array based on total table file count
......................................................................

IMPALA-7294. TABLESAMPLE should not allocate array based on total table file count

This changes HdfsTable.getFilesSample() to allocate its intermediate
sampling array based on the number of files in the selected
(post-pruning) partitions, rather than the total number of files in the
table. While the former behavior was correct (the total file count is of
course an upper bound on the pruned file count), it was an unnecessarily
large allocation, which has some downsides around garbage collection.

In addition, this is important for the LocalCatalog implementation of
table sampling, since we do not want to have to load all partition file
lists in order to compute a sample over a pruned subset of partitions.

The original code indicated that this was an optimization to avoid
looping over the partition list an extra time. However, typical
partition lists are relatively small even in the worst case (order of
100k) and looping over 100k in-memory Java objects is not likely to be
the bottleneck in planning any query. This is especially true
considering that we loop over that same list later in the function
anyway, so we probably aren't saving page faults or LLC cache misses
either.

In testing this change I noticed that the existing test for TABLESAMPLE
didn't test TABLESAMPLE when applied in conjunction with a predicate.
I added a new dimension to the test which employs a predicate which
prunes some partitions to ensure that the code works in that case.
I also added coverage of the "100%" sampling parameter as a sanity check
that it returns the same results as a non-sampled query.

Change-Id: I0248d89bcd9dd4ff8b4b85fef282c19e3fe9bdd5
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M tests/query_test/test_tablesample.py
2 files changed, 29 insertions(+), 13 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/36/10936/1
-- 
To view, visit http://gerrit.cloudera.org:8080/10936
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I0248d89bcd9dd4ff8b4b85fef282c19e3fe9bdd5
Gerrit-Change-Number: 10936
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>

[Impala-ASF-CR] IMPALA-7294. TABLESAMPLE should not allocate array based on total table file count

Posted by "Philip Zeyliger (Code Review)" <ge...@cloudera.org>.
Philip Zeyliger has posted comments on this change. ( http://gerrit.cloudera.org:8080/10936 )

Change subject: IMPALA-7294. TABLESAMPLE should not allocate array based on total table file count
......................................................................


Patch Set 1: Code-Review+2

Yep, this seems fine. Thanks for the detailed commit message.


-- 
To view, visit http://gerrit.cloudera.org:8080/10936
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0248d89bcd9dd4ff8b4b85fef282c19e3fe9bdd5
Gerrit-Change-Number: 10936
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Tue, 17 Jul 2018 18:47:16 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7294. TABLESAMPLE should not allocate array based on total table file count

Posted by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org>.
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/10936 )

Change subject: IMPALA-7294. TABLESAMPLE should not allocate array based on total table file count
......................................................................


Patch Set 1: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/10936
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0248d89bcd9dd4ff8b4b85fef282c19e3fe9bdd5
Gerrit-Change-Number: 10936
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Tue, 17 Jul 2018 19:03:27 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7294. TABLESAMPLE should not allocate array based on total table file count

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/10936 )

Change subject: IMPALA-7294. TABLESAMPLE should not allocate array based on total table file count
......................................................................

IMPALA-7294. TABLESAMPLE should not allocate array based on total table file count

This changes HdfsTable.getFilesSample() to allocate its intermediate
sampling array based on the number of files in the selected
(post-pruning) partitions, rather than the total number of files in the
table. While the former behavior was correct (the total file count is of
course an upper bound on the pruned file count), it was an unnecessarily
large allocation, which has some downsides around garbage collection.

In addition, this is important for the LocalCatalog implementation of
table sampling, since we do not want to have to load all partition file
lists in order to compute a sample over a pruned subset of partitions.

The original code indicated that this was an optimization to avoid
looping over the partition list an extra time. However, typical
partition lists are relatively small even in the worst case (order of
100k) and looping over 100k in-memory Java objects is not likely to be
the bottleneck in planning any query. This is especially true
considering that we loop over that same list later in the function
anyway, so we probably aren't saving page faults or LLC cache misses
either.

In testing this change I noticed that the existing test for TABLESAMPLE
didn't test TABLESAMPLE when applied in conjunction with a predicate.
I added a new dimension to the test which employs a predicate which
prunes some partitions to ensure that the code works in that case.
I also added coverage of the "100%" sampling parameter as a sanity check
that it returns the same results as a non-sampled query.

Change-Id: I0248d89bcd9dd4ff8b4b85fef282c19e3fe9bdd5
Reviewed-on: http://gerrit.cloudera.org:8080/10936
Reviewed-by: Philip Zeyliger <ph...@cloudera.com>
Reviewed-by: Vuk Ercegovac <ve...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M tests/query_test/test_tablesample.py
2 files changed, 29 insertions(+), 13 deletions(-)

Approvals:
  Philip Zeyliger: Looks good to me, approved
  Vuk Ercegovac: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/10936
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I0248d89bcd9dd4ff8b4b85fef282c19e3fe9bdd5
Gerrit-Change-Number: 10936
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>

[Impala-ASF-CR] IMPALA-7294. TABLESAMPLE should not allocate array based on total table file count

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10936 )

Change subject: IMPALA-7294. TABLESAMPLE should not allocate array based on total table file count
......................................................................


Patch Set 1:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2827/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/10936
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0248d89bcd9dd4ff8b4b85fef282c19e3fe9bdd5
Gerrit-Change-Number: 10936
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Tue, 17 Jul 2018 18:54:04 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7294. TABLESAMPLE should not allocate array based on total table file count

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10936 )

Change subject: IMPALA-7294. TABLESAMPLE should not allocate array based on total table file count
......................................................................


Patch Set 1: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/10936
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0248d89bcd9dd4ff8b4b85fef282c19e3fe9bdd5
Gerrit-Change-Number: 10936
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Tue, 17 Jul 2018 22:17:30 +0000
Gerrit-HasComments: No