You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org> on 2018/06/12 01:58:57 UTC

[Impala-ASF-CR](2.x) IMPALA-5931: Generates scan ranges in planner for s3/adls

Hello Impala Public Jenkins,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/10692

to review the following change.


Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
......................................................................

IMPALA-5931: Generates scan ranges in planner for s3/adls

Currently, for filesystems that do not include physical
block information (e.g., block replica locations, caching),
synthetic blocks are generated and stored in the catalog
when metadata is loaded. Example file systems for which this is done
includes S3, ADLS, and local fs.

This change avoids generating these blocks when metadata is loaded.
Instead, scan ranges are directly generated from such files by the
backend coordinator. Previously, all scan ranges were produced by
the planner in HDFSScanNode in the frontend. Now, those files without
block information are sent to the coordinator represented by a split
specification that determines how the coordinator will create scan ranges
to send to executors.

This change reduces the space needed in the catalog and reduces the
scan range data structures that are passed from the frontend to the
backend when planning and coordinating a query.
In addition a bug is avoided where non-splittable files were being
split anyways to support the query parameter that places a limit on
scan ranges.

Testing:
- added backend scheduler tests
- mixed-filesystems test covers tables/queries with multiple fs's.
- local fs tests cover the code paths in this change
- all core tests pass when configured with s3
- manually tried larger local filesystem tables (tpch) with multiple
  partitions and observed the same scan ranges.
    - TODO: adls testing

Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Reviewed-on: http://gerrit.cloudera.org:8080/8523
Reviewed-by: Vuk Ercegovac <ve...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M CMakeLists.txt
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M be/src/scheduling/scheduler.h
M be/src/util/CMakeLists.txt
A be/src/util/flat_buffer.cc
A be/src/util/flat_buffer.h
M common/thrift/Frontend.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Planner.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java
M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/ScanNode.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
22 files changed, 677 insertions(+), 245 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/92/10692/1
-- 
To view, visit http://gerrit.cloudera.org:8080/10692
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: newchange
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 10692
Gerrit-PatchSet: 1
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR](2.x) IMPALA-5931: Generates scan ranges in planner for s3/adls

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/10692 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
......................................................................

IMPALA-5931: Generates scan ranges in planner for s3/adls

Currently, for filesystems that do not include physical
block information (e.g., block replica locations, caching),
synthetic blocks are generated and stored in the catalog
when metadata is loaded. Example file systems for which this is done
includes S3, ADLS, and local fs.

This change avoids generating these blocks when metadata is loaded.
Instead, scan ranges are directly generated from such files by the
backend coordinator. Previously, all scan ranges were produced by
the planner in HDFSScanNode in the frontend. Now, those files without
block information are sent to the coordinator represented by a split
specification that determines how the coordinator will create scan ranges
to send to executors.

This change reduces the space needed in the catalog and reduces the
scan range data structures that are passed from the frontend to the
backend when planning and coordinating a query.
In addition a bug is avoided where non-splittable files were being
split anyways to support the query parameter that places a limit on
scan ranges.

Testing:
- added backend scheduler tests
- mixed-filesystems test covers tables/queries with multiple fs's.
- local fs tests cover the code paths in this change
- all core tests pass when configured with s3
- manually tried larger local filesystem tables (tpch) with multiple
  partitions and observed the same scan ranges.
    - TODO: adls testing

Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Reviewed-on: http://gerrit.cloudera.org:8080/8523
Reviewed-by: Vuk Ercegovac <ve...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-on: http://gerrit.cloudera.org:8080/10692
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
---
M CMakeLists.txt
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M be/src/scheduling/scheduler.h
M be/src/util/CMakeLists.txt
A be/src/util/flat_buffer.cc
A be/src/util/flat_buffer.h
M common/thrift/Frontend.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Planner.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java
M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/ScanNode.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
22 files changed, 677 insertions(+), 245 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/10692
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: merged
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 10692
Gerrit-PatchSet: 3
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>

[Impala-ASF-CR](2.x) IMPALA-5931: Generates scan ranges in planner for s3/adls

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10692 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
......................................................................


Patch Set 2:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2647/


-- 
To view, visit http://gerrit.cloudera.org:8080/10692
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 10692
Gerrit-PatchSet: 2
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Comment-Date: Tue, 12 Jun 2018 22:53:23 +0000
Gerrit-HasComments: No

[Impala-ASF-CR](2.x) IMPALA-5931: Generates scan ranges in planner for s3/adls

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10692 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
......................................................................


Patch Set 2: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/10692
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 10692
Gerrit-PatchSet: 2
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Comment-Date: Wed, 13 Jun 2018 02:30:15 +0000
Gerrit-HasComments: No

[Impala-ASF-CR](2.x) IMPALA-5931: Generates scan ranges in planner for s3/adls

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10692 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
......................................................................


Patch Set 2: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/10692
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 10692
Gerrit-PatchSet: 2
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Comment-Date: Tue, 12 Jun 2018 22:53:22 +0000
Gerrit-HasComments: No

[Impala-ASF-CR](2.x) IMPALA-5931: Generates scan ranges in planner for s3/adls

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/10692 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
......................................................................


Patch Set 1: Code-Review+2

I looked at the diff between this patch and the original patch in master and convinced myself that the merge conflicts have been addressed properly.


-- 
To view, visit http://gerrit.cloudera.org:8080/10692
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: 2.x
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 10692
Gerrit-PatchSet: 1
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Comment-Date: Tue, 12 Jun 2018 21:28:22 +0000
Gerrit-HasComments: No