You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org> on 2021/03/19 18:25:51 UTC

[Impala-ASF-CR] IMPALA-10594: Handle failed coordinators in admissiond

Thomas Tauber-Marshall has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17209


Change subject: IMPALA-10594: Handle failed coordinators in admissiond
......................................................................

IMPALA-10594: Handle failed coordinators in admissiond

This patch adds a statestore callback for the admissiond that monitors
for coordinators that have been removed from the cluster membership
and releases all of the resources for queries running on those
coordinators.

Testing:
- Added a custom cluster test that kills a coordinator and verifies
  that resources for queries running on it are eventually released.

Change-Id: I883f323bb765680ef24b3c3f51fb209dea15f0b0
---
M be/src/scheduling/admission-control-service.cc
M be/src/scheduling/admission-control-service.h
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admission-controller.h
M be/src/scheduling/admissiond-env.cc
M tests/custom_cluster/test_admission_controller.py
6 files changed, 103 insertions(+), 3 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/09/17209/1
-- 
To view, visit http://gerrit.cloudera.org:8080/17209
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I883f323bb765680ef24b3c3f51fb209dea15f0b0
Gerrit-Change-Number: 17209
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-10594: Handle failed coordinators in admissiond

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/17209 )

Change subject: IMPALA-10594: Handle failed coordinators in admissiond
......................................................................

IMPALA-10594: Handle failed coordinators in admissiond

This patch adds a statestore callback for the admissiond that monitors
for coordinators that have been removed from the cluster membership
and releases all of the resources for queries running on those
coordinators.

Testing:
- Added a custom cluster test that kills a coordinator and verifies
  that resources for queries running on it are eventually released.

Change-Id: I883f323bb765680ef24b3c3f51fb209dea15f0b0
Reviewed-on: http://gerrit.cloudera.org:8080/17209
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/scheduling/admission-control-service.cc
M be/src/scheduling/admission-control-service.h
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admission-controller.h
M be/src/scheduling/admissiond-env.cc
M tests/custom_cluster/test_admission_controller.py
6 files changed, 125 insertions(+), 9 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/17209
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I883f323bb765680ef24b3c3f51fb209dea15f0b0
Gerrit-Change-Number: 17209
Gerrit-PatchSet: 4
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>

[Impala-ASF-CR] IMPALA-10594: Handle failed coordinators in admissiond

Posted by "Wenzhe Zhou (Code Review)" <ge...@cloudera.org>.
Wenzhe Zhou has posted comments on this change. ( http://gerrit.cloudera.org:8080/17209 )

Change subject: IMPALA-10594: Handle failed coordinators in admissiond
......................................................................


Patch Set 2: Code-Review+1


-- 
To view, visit http://gerrit.cloudera.org:8080/17209
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I883f323bb765680ef24b3c3f51fb209dea15f0b0
Gerrit-Change-Number: 17209
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Mar 2021 20:29:43 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10594: Handle failed coordinators in admissiond

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/17209 )

Change subject: IMPALA-10594: Handle failed coordinators in admissiond
......................................................................


Patch Set 2: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17209/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17209/2//COMMIT_MSG@13
PS2, Line 13: 
> There won't be any overadmission because the executors also monitor the sta
Got it. Thanks for the explanation.



-- 
To view, visit http://gerrit.cloudera.org:8080/17209
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I883f323bb765680ef24b3c3f51fb209dea15f0b0
Gerrit-Change-Number: 17209
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Mon, 22 Mar 2021 20:08:54 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10594: Handle failed coordinators in admissiond

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17209 )

Change subject: IMPALA-10594: Handle failed coordinators in admissiond
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8409/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17209
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I883f323bb765680ef24b3c3f51fb209dea15f0b0
Gerrit-Change-Number: 17209
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Mar 2021 19:19:36 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10594: Handle failed coordinators in admissiond

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17209

to look at the new patch set (#2).

Change subject: IMPALA-10594: Handle failed coordinators in admissiond
......................................................................

IMPALA-10594: Handle failed coordinators in admissiond

This patch adds a statestore callback for the admissiond that monitors
for coordinators that have been removed from the cluster membership
and releases all of the resources for queries running on those
coordinators.

Testing:
- Added a custom cluster test that kills a coordinator and verifies
  that resources for queries running on it are eventually released.

Change-Id: I883f323bb765680ef24b3c3f51fb209dea15f0b0
---
M be/src/scheduling/admission-control-service.cc
M be/src/scheduling/admission-control-service.h
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admission-controller.h
M be/src/scheduling/admissiond-env.cc
M tests/custom_cluster/test_admission_controller.py
6 files changed, 125 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/09/17209/2
-- 
To view, visit http://gerrit.cloudera.org:8080/17209
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I883f323bb765680ef24b3c3f51fb209dea15f0b0
Gerrit-Change-Number: 17209
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-10594: Handle failed coordinators in admissiond

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17209 )

Change subject: IMPALA-10594: Handle failed coordinators in admissiond
......................................................................


Patch Set 3: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/17209
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I883f323bb765680ef24b3c3f51fb209dea15f0b0
Gerrit-Change-Number: 17209
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Thu, 01 Apr 2021 22:19:24 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10594: Handle failed coordinators in admissiond

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17209 )

Change subject: IMPALA-10594: Handle failed coordinators in admissiond
......................................................................


Patch Set 3:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7039/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/17209
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I883f323bb765680ef24b3c3f51fb209dea15f0b0
Gerrit-Change-Number: 17209
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Thu, 01 Apr 2021 16:27:39 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10594: Handle failed coordinators in admissiond

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17209 )

Change subject: IMPALA-10594: Handle failed coordinators in admissiond
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8407/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17209
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I883f323bb765680ef24b3c3f51fb209dea15f0b0
Gerrit-Change-Number: 17209
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Mar 2021 18:45:53 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10594: Handle failed coordinators in admissiond

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/17209 )

Change subject: IMPALA-10594: Handle failed coordinators in admissiond
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17209/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17209/2//COMMIT_MSG@13
PS2, Line 13: 
> its also possible that the coordinator probably never went down but due to 
There won't be any overadmission because the executors also monitor the statestore for coordinator failures and fail any fragments: https://github.com/apache/impala/blob/master/be/src/runtime/query-exec-mgr.cc#L225

So, if a coordinator is temporarily unreachable, when it comes back up it'll find that all of its fragments are cancelled and it will just fail all of its queries.

The statestore is pretty conservative about when it concludes an impalad is down, so it should be very rare that it decides an impalad is down when its actually not.



-- 
To view, visit http://gerrit.cloudera.org:8080/17209
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I883f323bb765680ef24b3c3f51fb209dea15f0b0
Gerrit-Change-Number: 17209
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Sat, 20 Mar 2021 00:01:24 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10594: Handle failed coordinators in admissiond

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/17209 )

Change subject: IMPALA-10594: Handle failed coordinators in admissiond
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17209/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17209/2//COMMIT_MSG@13
PS2, Line 13: 
its also possible that the coordinator probably never went down but due to network issues the statestore marks it as down. In this case there can be fragments running on the other coordinators/executors and removing those queries can result in overadmission on the ones that are running those fragments. This should ideally be prevented by the host level memory accounting in admission controller, but I need to re-check if it works here.

Also once the coord comes back up on the statestore, what will happen to the already running queries?? do they get accounted for again and potentially cause overadmission?



-- 
To view, visit http://gerrit.cloudera.org:8080/17209
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I883f323bb765680ef24b3c3f51fb209dea15f0b0
Gerrit-Change-Number: 17209
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Mar 2021 21:00:18 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10594: Handle failed coordinators in admissiond

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17209 )

Change subject: IMPALA-10594: Handle failed coordinators in admissiond
......................................................................


Patch Set 3: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/17209
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I883f323bb765680ef24b3c3f51fb209dea15f0b0
Gerrit-Change-Number: 17209
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Thu, 01 Apr 2021 16:27:38 +0000
Gerrit-HasComments: No