You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org> on 2019/07/16 04:30:32 UTC

[Impala-ASF-CR] [WIP] IMPALA-8339: Add local node blacklist to coordinators

Thomas Tauber-Marshall has uploaded this change for review. ( http://gerrit.cloudera.org:8080/13868


Change subject: [WIP] IMPALA-8339: Add local node blacklist to coordinators
......................................................................

[WIP] IMPALA-8339: Add local node blacklist to coordinators

This patch adds the concept of a blacklist of executors to the
coordinator, which removes nodes from consideration for query
scheduling. Blacklisting decisions are local to a given coordinator
and are not included in statestore updates.

The intention is to allow coordinators to be more aggressive about
deciding that a node is unhealthy or unavailable, to minimize failed
queries in environments where cluster membership may be more variable,
rather than having to wait on the statestore heartbeat mechanism to
decide that the node is down.

For the first patch, nodes will only be blacklisted if the KRPC status
for Exec() is an error. Followup work will add blacklisting of nodes
in more complex scenarios, eg. if a node appears to be a straggler.

TODO:
- Metrics/logs around blacklisting decisions
- Mechanism for manual override of blacklisting decisions
- Automated testing
- Manual testing at scale/stress

Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/scheduling/CMakeLists.txt
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/scheduling/cluster-membership-mgr.h
A be/src/scheduling/node-blacklist.cc
A be/src/scheduling/node-blacklist.h
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler.cc
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M tests/common/impala_cluster.py
13 files changed, 321 insertions(+), 17 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/13868/1
-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 9:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/4072/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 9
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Jul 2019 04:47:17 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP] IMPALA-8339: Add local node blacklist to coordinators

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: [WIP] IMPALA-8339: Add local node blacklist to coordinators
......................................................................


Patch Set 1:

(16 comments)

Some comments on the first pass. May be helpful to document the interaction between blacklisted node and cluster membership change.

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/runtime/coordinator-backend-state.h@284
PS1, Line 284: Exec()
ExecQueryFInstance()


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/runtime/coordinator.cc@394
PS1, Line 394:       // The Exec() rpc failed, so blacklist the node.
             :       const TBackendDescriptor& be_desc = backend_state->exec_params()->be_desc;
             :       ExecEnv::GetInstance()->cluster_membership_mgr()->Blacklist(be_desc);
One interesting case is what if the coordinator cannot RPC to itself. I suppose the assumption we make here is that blacklisting certain nodes will prevent scheduler from scheduling scans and thus the rest of the fragments on them. The best practice is to have dedicated coordinators and executors so a coordinator usually doesn't take the role of an executor. However, for certain queries, there may always be a root fragment which will run on the coordinator.

This may get better as we move the root fragments off the coordinator.


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.h
File be/src/scheduling/cluster-membership-mgr.h:

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.h@96
PS1, Line 96: Ohly
typo


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.h@206
PS1, Line 206:   boost::mutex update_membership_lock_;
Does this also need the mutable modifier like the other two locks ?

Also, membership update seems to touch quite a number of fields in this class. May be it's better to clearly document the thread safety of each field.


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.h@206
PS1, Line 206: update_membership_lock_;
Is there any chance this mutex may be held at the same time as various other mutex in this class ? If so, we had better document the lock ordering.


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.cc
File be/src/scheduling/cluster-membership-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.cc@119
PS1, Line 119:   // Get any previously blacklisted nodes that should be unblacklisted.
             :   std::list<TBackendDescriptor> unblacklisted = node_blacklist_.Maintenance();
I suppose this relies on StateStore being alive and keeps sending periodic cluster membership update. It's a fair assumption that StateStore should recover within a reasonable amount of time.


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.cc@168
PS1, Line 168:     // Deleted item
How would this work if the deleted item is in the list unblacklisted ?


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.cc@175
PS1, Line 175:  !blacklisted) {
How would it work for a node which is already removed from the executor group due to blacklisting ? Should we remove it altogether from the blacklist if the cluster memebership update indicates this is already deleted ? Otherwise, won't it eventually get into the probation state and get added back with the stale BE descriptor ?

In theory, a blacklisted node can get removed from the cluster and restarts with the same or different IP address.


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.cc@265
PS1, Line 265: for (const TBackendDescriptor& be_desc : unblacklisted) {
How would this work if the cluster membership in the past indicates that this node is removed from the cluster already ? This probably warrants some more detailed documentation.


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.h
File be/src/scheduling/node-blacklist.h:

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.h@30
PS1, Line 30: Class to maintain a local blacklist of backends.
Can you please describe the thread safety of this class ? Is there any assumption about the locks held by caller when calling into functions of this class ?


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.h@38
PS1, Line 38: come
typo


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.h@48
PS1, Line 48: class NodeBlacklist {
Is this class called only from the ClusterMembershipMgr class ? If so, may be worth moving this entire file and implementation to cluster-membership-mgr.cc ?


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.h@78
PS1, Line 78:     /// If false, this node is on probation, meaning that it was previously on the
            :     /// blacklist but was removed due to the timeout. We still keep the entry for awhile
            :     /// in case it gets blacklisted again.
            :     bool blacklisted;
IMHO, this may be clearer if we use a enum instead to represent the state instead of just relying on a single boolean. It seems a bit error prone to interpret the return value of certain function (e.g. Remove() removing false can mean it's on probation but not blacklisted or it's not in the blacklist to begin with).

 enum {
    NOT_BLACKLISTED,
    BLACKLISTED,
    ON_PROBATION,
  };


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.cc
File be/src/scheduling/node-blacklist.cc:

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.cc@25
PS1, Line 25: blacklist_timeout_padding
What's the reasoning behind having this timeout padding ? May be worth documenting it.


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.cc@42
PS1, Line 42:     // If this node exists in the list, it must be on probation. Re-blacklisk it now.
            :     DCHECK(!it->blacklisted);
Is it possible for multiple queries to attempt to blacklist the same backend at the same time so it's not necessarily on probation, right ?


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.cc@83
PS1, Line 83: 10
May be helpful to define this as a constant with meaningful name.



-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Jul 2019 00:49:08 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc
File be/src/scheduling/executor-blacklist.cc:

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@29
PS3, Line 29: 0 disables blacklisting
> My problem with setting it to 120 is that it makes it possible for users to
I see. May be it's simpler to not have this padding flag at all and use a constant padding percentage instead (e.g. 20 ~ 30%). The consequence of having a padding value too small is that we may end up removing a bad executor from a node too quickly. However, this is not catastrophic once we implement the transparent query retry in the follow-up patch, right ?

I am okay with a a flag which disables blacklisting in case something goes tremendously wrong.



-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Fri, 26 Jul 2019 02:30:58 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Lars Volker, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13868

to look at the new patch set (#9).

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................

IMPALA-8339: Add local executor blacklist to coordinators

This patch adds the concept of a blacklist of executors to the
coordinator, which removes executors from consideration for query
scheduling. Blacklisting decisions are local to a given coordinator
and are not included in statestore updates.

The intention is to allow coordinators to be more aggressive about
deciding that an exeutor is unhealthy or unavailable, to minimize
failed queries in environments where cluster membership may be more
variable, rather than having to wait on the statestore heartbeat
mechanism to decide that the executor is down.

For the first patch, executors will only be blacklisted if the KRPC
status for Exec() is an error. Followup work will add blacklisting of
executors in more complex scenarios, eg. if an executor appears to be
a straggler.

When a query is scheduled and there is currently some blacklisted
executors, a new line 'Blacklisted Executors:' is added to the profile
listing the hostnames of all such executors.

Testing:
- Added a case to the cluster mgr BE unit test that uses blacklisting.
- Added e2e test cases for killing and restarting an impalad.
- Manual randomized testing locally with iptables.
TODO
- Add an e2e test case where an impalad becomes briefly unreachable.
- Manual/stress tests on a real cluster.

Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/scheduling/CMakeLists.txt
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/cluster-membership-mgr-test.cc
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/scheduling/cluster-membership-mgr.h
A be/src/scheduling/executor-blacklist.cc
A be/src/scheduling/executor-blacklist.h
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler.cc
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
A tests/custom_cluster/test_blacklist.py
15 files changed, 735 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/13868/9
-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 9
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] [WIP] IMPALA-8339: Add local node blacklist to coordinators

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: [WIP] IMPALA-8339: Add local node blacklist to coordinators
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.cc
File be/src/scheduling/cluster-membership-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.cc@186
PS1, Line 186:               (*new_executor_groups)[group].LookUpBackendDesc(be_desc.address) == nullptr);
line too long (91 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 16 Jul 2019 04:31:57 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 2:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/3955/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 23 Jul 2019 05:56:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 6: Code-Review+1

(3 comments)

Not sure if Lars wants to take one more pass ?

http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/executor-blacklist.h
File be/src/scheduling/executor-blacklist.h:

http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/executor-blacklist.h@43
PS6, Line 43:  WHen
When


http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/executor-blacklist.cc
File be/src/scheduling/executor-blacklist.cc:

http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/executor-blacklist.cc@191
PS6, Line 191: 1.2 
This can be a variable for clarity.


http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/statestore/statestore.h
File be/src/statestore/statestore.h:

http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/statestore/statestore.h@183
PS6, Line 183: time 
Please specify the time unit.



-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Sat, 27 Jul 2019 00:21:08 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 3:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/3956/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 23 Jul 2019 06:02:12 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 10: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 10
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Jul 2019 04:06:01 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 7:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/13868/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13868/6//COMMIT_MSG@23
PS6, Line 23: a straggler.
> Mention the change to the profile in the commit message?
Done


http://gerrit.cloudera.org:8080/#/c/13868/6//COMMIT_MSG@28
PS6, Line 28: 
> Are you planning to do these in this change?
I was intending do this as follow up work


http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/cluster-membership-mgr-test.cc
File be/src/scheduling/cluster-membership-mgr-test.cc:

http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/cluster-membership-mgr-test.cc@350
PS6, Line 350:   const int BLACKLIST_TIMEOUT_SLEEP_US = 100000;
> const int, uppercase? I think that the frequencies could also be closer to 
Done


http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/cluster-membership-mgr-test.cc@354
PS6, Line 354:   EXPECT_EQ(NUM_BACKENDS, backends_.size());
> nit: single line
Done


http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/cluster-membership-mgr-test.cc@374
PS6, Line 374: NU
> nit: BE, or backend. I tripped over "a be to" :)
Done


http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/cluster-membership-mgr-test.cc@415
PS6, Line 415: 
> nit: extra word?
Done


http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/executor-blacklist.h
File be/src/scheduling/executor-blacklist.h:

http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/executor-blacklist.h@39
PS6, Line 39: probation
> I think "greylisting" is another term that we could use which sounds less l
I think 'probation' is clearer since its a more common term, but willing to change it if others want


http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/executor-blacklist.h@43
PS6, Line 43:  When
> When
Done


http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/executor-blacklist.cc
File be/src/scheduling/executor-blacklist.cc:

http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/executor-blacklist.cc@86
PS6, Line 86:   }
> Does this not need to check the timeouts? Or do we assume that the caller c
ClusterMembershipMgr uses this to determine if the removed BE would be present in the snapshot's 'executor_groups', i.e. if it was previously blacklisted it won't be in 'executor_groups'.

We only transition things to probation during Maitenance() because we need to do the 'unblacklisting' step at the same time to keep things consistent.


http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/executor-blacklist.cc@191
PS6, Line 191: g re
> This can be a variable for clarity.
Done


http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/statestore/statestore.h
File be/src/statestore/statestore.h:

http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/statestore/statestore.h@183
PS6, Line 183: time 
> Please specify the time unit.
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 7
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Mon, 29 Jul 2019 18:13:24 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 6:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/13868/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13868/6//COMMIT_MSG@23
PS6, Line 23: a straggler.
Mention the change to the profile in the commit message?


http://gerrit.cloudera.org:8080/#/c/13868/6//COMMIT_MSG@28
PS6, Line 28: TODO
Are you planning to do these in this change?


http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/cluster-membership-mgr-test.cc
File be/src/scheduling/cluster-membership-mgr-test.cc:

http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/cluster-membership-mgr-test.cc@350
PS6, Line 350:   int blacklist_timeout_sleep_us = 5000;
const int, uppercase? I think that the frequencies could also be closer to the real world ones, given we only sleep 3 times during this test.


http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/cluster-membership-mgr-test.cc@354
PS6, Line 354:     CreateBackend();
nit: single line


http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/cluster-membership-mgr-test.cc@374
PS6, Line 374: be
nit: BE, or backend. I tripped over "a be to" :)


http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/cluster-membership-mgr-test.cc@415
PS6, Line 415: it
nit: extra word?


http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/executor-blacklist.h
File be/src/scheduling/executor-blacklist.h:

http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/executor-blacklist.h@39
PS6, Line 39: probation
I think "greylisting" is another term that we could use which sounds less like law enforcement, but I don't feel strongly about it.


http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/executor-blacklist.cc
File be/src/scheduling/executor-blacklist.cc:

http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/executor-blacklist.cc@86
PS6, Line 86:   State removed_state = remove_it->state;
Does this not need to check the timeouts? Or do we assume that the caller called Maintenance() before calling this method?



-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Sat, 27 Jul 2019 03:40:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/13868/7/be/src/scheduling/executor-blacklist.h
File be/src/scheduling/executor-blacklist.h:

http://gerrit.cloudera.org:8080/#/c/13868/7/be/src/scheduling/executor-blacklist.h@139
PS7, Line 139: Percent
nit: this is really a multiplier (e.g. multiply by 1.2, not add 1.2%)


http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/executor-blacklist.cc
File be/src/scheduling/executor-blacklist.cc:

http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/executor-blacklist.cc@86
PS6, Line 86:   }
> ClusterMembershipMgr uses this to determine if the removed BE would be pres
This means there is a window for false positives here, right? If so, I think it would be good to mention it in the header comment.



-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 7
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Mon, 29 Jul 2019 20:23:28 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 10:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4688/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 10
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Jul 2019 04:06:02 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 9: Code-Review+2

(1 comment)

carrying forward

http://gerrit.cloudera.org:8080/#/c/13868/8/be/src/scheduling/executor-blacklist.h
File be/src/scheduling/executor-blacklist.h:

http://gerrit.cloudera.org:8080/#/c/13868/8/be/src/scheduling/executor-blacklist.h@41
PS8, Line 41: Maintenan
> nit: typo
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 9
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Jul 2019 04:05:44 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc
File be/src/scheduling/executor-blacklist.cc:

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@29
PS3, Line 29: 
> I see. May be it's simpler to not have this padding flag at all and use a c
Works for me



-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Fri, 26 Jul 2019 17:51:19 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Lars Volker, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13868

to look at the new patch set (#3).

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................

IMPALA-8339: Add local executor blacklist to coordinators

This patch adds the concept of a blacklist of executors to the
coordinator, which removes executors from consideration for query
scheduling. Blacklisting decisions are local to a given coordinator
and are not included in statestore updates.

The intention is to allow coordinators to be more aggressive about
deciding that an exeutor is unhealthy or unavailable, to minimize
failed queries in environments where cluster membership may be more
variable, rather than having to wait on the statestore heartbeat
mechanism to decide that the executor is down.

For the first patch, executors will only be blacklisted if the KRPC
status for Exec() is an error. Followup work will add blacklisting of
executors in more complex scenarios, eg. if an executor appears to be
a straggler.

Testing:
- Added a case to the cluster mgr BE unit test that uses blacklisting.
- Added an e2e test case that kills an impalad.
TODO
- Add an e2e test case where an impalad becomes briefly unreachable.
- Manual/stress tests on a real cluster.

Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/scheduling/CMakeLists.txt
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/cluster-membership-mgr-test.cc
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/scheduling/cluster-membership-mgr.h
A be/src/scheduling/executor-blacklist.cc
A be/src/scheduling/executor-blacklist.h
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler.cc
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
A tests/custom_cluster/test_blacklist.py
15 files changed, 656 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/13868/3
-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>

[Impala-ASF-CR] [WIP] IMPALA-8339: Add local node blacklist to coordinators

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: [WIP] IMPALA-8339: Add local node blacklist to coordinators
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3886/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 16 Jul 2019 05:10:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 8: Code-Review+2

(1 comment)

Spotted a typo but otherwise LGTM. Thanks for adding the comment!

http://gerrit.cloudera.org:8080/#/c/13868/8/be/src/scheduling/executor-blacklist.h
File be/src/scheduling/executor-blacklist.h:

http://gerrit.cloudera.org:8080/#/c/13868/8/be/src/scheduling/executor-blacklist.h@41
PS8, Line 41: Maintence
nit: typo



-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 8
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Jul 2019 03:29:30 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Lars Volker, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13868

to look at the new patch set (#4).

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................

IMPALA-8339: Add local executor blacklist to coordinators

This patch adds the concept of a blacklist of executors to the
coordinator, which removes executors from consideration for query
scheduling. Blacklisting decisions are local to a given coordinator
and are not included in statestore updates.

The intention is to allow coordinators to be more aggressive about
deciding that an exeutor is unhealthy or unavailable, to minimize
failed queries in environments where cluster membership may be more
variable, rather than having to wait on the statestore heartbeat
mechanism to decide that the executor is down.

For the first patch, executors will only be blacklisted if the KRPC
status for Exec() is an error. Followup work will add blacklisting of
executors in more complex scenarios, eg. if an executor appears to be
a straggler.

Testing:
- Added a case to the cluster mgr BE unit test that uses blacklisting.
- Added e2e test cases for killing and restarting an impalad.
TODO
- Add an e2e test case where an impalad becomes briefly unreachable.
- Manual/stress tests on a real cluster.

Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/scheduling/CMakeLists.txt
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/cluster-membership-mgr-test.cc
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/scheduling/cluster-membership-mgr.h
A be/src/scheduling/executor-blacklist.cc
A be/src/scheduling/executor-blacklist.h
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler.cc
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
A tests/custom_cluster/test_blacklist.py
15 files changed, 728 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/13868/4
-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 4
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Lars Volker, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13868

to look at the new patch set (#8).

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................

IMPALA-8339: Add local executor blacklist to coordinators

This patch adds the concept of a blacklist of executors to the
coordinator, which removes executors from consideration for query
scheduling. Blacklisting decisions are local to a given coordinator
and are not included in statestore updates.

The intention is to allow coordinators to be more aggressive about
deciding that an exeutor is unhealthy or unavailable, to minimize
failed queries in environments where cluster membership may be more
variable, rather than having to wait on the statestore heartbeat
mechanism to decide that the executor is down.

For the first patch, executors will only be blacklisted if the KRPC
status for Exec() is an error. Followup work will add blacklisting of
executors in more complex scenarios, eg. if an executor appears to be
a straggler.

When a query is scheduled and there is currently some blacklisted
executors, a new line 'Blacklisted Executors:' is added to the profile
listing the hostnames of all such executors.

Testing:
- Added a case to the cluster mgr BE unit test that uses blacklisting.
- Added e2e test cases for killing and restarting an impalad.
- Manual randomized testing locally with iptables.
TODO
- Add an e2e test case where an impalad becomes briefly unreachable.
- Manual/stress tests on a real cluster.

Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/scheduling/CMakeLists.txt
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/cluster-membership-mgr-test.cc
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/scheduling/cluster-membership-mgr.h
A be/src/scheduling/executor-blacklist.cc
A be/src/scheduling/executor-blacklist.h
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler.cc
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
A tests/custom_cluster/test_blacklist.py
15 files changed, 735 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/13868/8
-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 8
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 3:

(19 comments)

Looking good.

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/admission-controller.cc
File be/src/scheduling/admission-controller.cc:

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/admission-controller.cc@737
PS3, Line 737: std::s
nit: std:: not needed


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc
File be/src/scheduling/cluster-membership-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc@272
PS3, Line 272: already be on the blacklist or probation.
This may be a case worth double checking. If a node is restarted and re-register with the Statestore, is it guaranteed that the "delete" update will happen before the "create" update ? Can updates arrive out of order ?

If it's somehow possible that the node being added is still on blacklist. It may be safer to call FindAndRemove() unconditionally and log it if we happen to find it on the blacklist in the "create" path.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc@273
PS3, Line 273: DCHECK
DCHECK_EQ


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc@350
PS3, Line 350: return;
Given the for-loop above, can an executor belong to more than one executor groups ?  If so, why is it okay to return early if it's not found in only one of the groups ?

Also, may help to comment on the intention for doing this check. I suspect it's to skip the rather heavy weight copying of the membership state below.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc@355
PS3, Line 355: recovering_membership_.get() != nullptr
This is repeated at multiple places in this function. May make sense to factor it out to a local variable for clarity.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h
File be/src/scheduling/executor-blacklist.h:

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@32
PS3, Line 32: Remove()
FindAndRemove()


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@38
PS3, Line 38: When a blacklisted executor has passed this timeout
            : /// and Maintenance() is called, the executor is put on 'probation'
May help to also document what if an executor on blacklist is removed during cluster membership updates.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@42
PS3, Line 42: THere
nit: typo


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@79
PS3, Line 79: unblacklisted
May be clearer to call it "probation_list" or something so it can relates to the state in which the backends on this list are in.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@124
PS3, Line 124: pzrobation
typo


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc
File be/src/scheduling/executor-blacklist.cc:

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@29
PS3, Line 29: 0 disables blacklisting
This seems a bit weird. Setting it to 0 means the blacklist timeout == statestore_max_missed_heartbeats * statestore_heartbeat_frequency_ms, right ?

Would it make more sense to set this to 120 or something by default so it means (statestore_max_missed_heartbeats * statestore_heartbeat_frequency_ms) * (this flag / 100.0) ? So setting this flag to 0 means the timeout is 0 which makes sense to mean blacklisting is disabled.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@39
PS3, Line 39:  ExecutorBlacklist::Blacklist(const TBackendDescriptor& be_desc) {
Should this check if blacklisting is disabled at the beginning ?


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@54
PS3, Line 54: UnixMillis()
Why not MonotonicMillis() ? This field shouldn't need to be shared across multiple hosts, right ? MonotonicMillis() makes sure the clock will not go backward in case the TSC on different cores are slightly out of sync.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@83
PS3, Line 83:   auto eq = [&be_desc](const Entry& existing) {
            :     // The IP addresses must already match, so it is sufficient to check the port.
            :     DCHECK_EQ(existing.be_desc.ip_address, be_desc.ip_address);
            :     return existing.be_desc.address.port == be_desc.address.port;
            :   };
This seems to be used at more than one functions. May be worth factoring it out as a separate function.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@99
PS3, Line 99:  return remove_it->state;
Is remove_it still valid after calling erase above ?


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@110
PS3, Line 110: unblacklisted
put on probation


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@138
PS3, Line 138: take it off probation.
remove it from the blacklist.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@184
PS3, Line 184: unblacklisted
"put on probation"  ?


http://gerrit.cloudera.org:8080/#/c/13868/3/tests/custom_cluster/test_blacklist.py
File tests/custom_cluster/test_blacklist.py:

http://gerrit.cloudera.org:8080/#/c/13868/3/tests/custom_cluster/test_blacklist.py@62
PS3, Line 62:     assert re.search("Blacklisted Executors: (.*)", profile) is None
Does it make sense to also add a test case when the node is re-added back to the cluster ?

Also, does it make sense to test for transient failure (e.g. with debug action) so we can randomly fail the ExecFInstanceRPC() and exercise the path of probation and eventual removal from blacklist.



-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Wed, 24 Jul 2019 23:33:22 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/4063/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 7
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Mon, 29 Jul 2019 18:53:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................

IMPALA-8339: Add local executor blacklist to coordinators

This patch adds the concept of a blacklist of executors to the
coordinator, which removes executors from consideration for query
scheduling. Blacklisting decisions are local to a given coordinator
and are not included in statestore updates.

The intention is to allow coordinators to be more aggressive about
deciding that an exeutor is unhealthy or unavailable, to minimize
failed queries in environments where cluster membership may be more
variable, rather than having to wait on the statestore heartbeat
mechanism to decide that the executor is down.

For the first patch, executors will only be blacklisted if the KRPC
status for Exec() is an error. Followup work will add blacklisting of
executors in more complex scenarios, eg. if an executor appears to be
a straggler.

When a query is scheduled and there is currently some blacklisted
executors, a new line 'Blacklisted Executors:' is added to the profile
listing the hostnames of all such executors.

Testing:
- Added a case to the cluster mgr BE unit test that uses blacklisting.
- Added e2e test cases for killing and restarting an impalad.
- Manual randomized testing locally with iptables.
TODO
- Add an e2e test case where an impalad becomes briefly unreachable.
- Manual/stress tests on a real cluster.

Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Reviewed-on: http://gerrit.cloudera.org:8080/13868
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/scheduling/CMakeLists.txt
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/cluster-membership-mgr-test.cc
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/scheduling/cluster-membership-mgr.h
A be/src/scheduling/executor-blacklist.cc
A be/src/scheduling/executor-blacklist.h
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler.cc
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
A tests/custom_cluster/test_blacklist.py
15 files changed, 735 insertions(+), 35 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 11
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/4071/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 8
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Jul 2019 04:00:18 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 10: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 10
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Jul 2019 10:38:01 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Lars Volker, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13868

to look at the new patch set (#7).

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................

IMPALA-8339: Add local executor blacklist to coordinators

This patch adds the concept of a blacklist of executors to the
coordinator, which removes executors from consideration for query
scheduling. Blacklisting decisions are local to a given coordinator
and are not included in statestore updates.

The intention is to allow coordinators to be more aggressive about
deciding that an exeutor is unhealthy or unavailable, to minimize
failed queries in environments where cluster membership may be more
variable, rather than having to wait on the statestore heartbeat
mechanism to decide that the executor is down.

For the first patch, executors will only be blacklisted if the KRPC
status for Exec() is an error. Followup work will add blacklisting of
executors in more complex scenarios, eg. if an executor appears to be
a straggler.

When a query is scheduled and there is currently some blacklisted
executors, a new line 'Blacklisted Executors:' is added to the profile
listing the hostnames of all such executors.

Testing:
- Added a case to the cluster mgr BE unit test that uses blacklisting.
- Added e2e test cases for killing and restarting an impalad.
- Manual randomized testing locally with iptables.
TODO
- Add an e2e test case where an impalad becomes briefly unreachable.
- Manual/stress tests on a real cluster.

Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/scheduling/CMakeLists.txt
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/cluster-membership-mgr-test.cc
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/scheduling/cluster-membership-mgr.h
A be/src/scheduling/executor-blacklist.cc
A be/src/scheduling/executor-blacklist.h
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler.cc
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
A tests/custom_cluster/test_blacklist.py
15 files changed, 733 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/13868/7
-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 7
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/4007/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 4
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 25 Jul 2019 23:16:11 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/13868/7/be/src/scheduling/executor-blacklist.h
File be/src/scheduling/executor-blacklist.h:

http://gerrit.cloudera.org:8080/#/c/13868/7/be/src/scheduling/executor-blacklist.h@139
PS7, Line 139: Percent
> nit: this is really a multiplier (e.g. multiply by 1.2, not add 1.2%)
Done


http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/executor-blacklist.cc
File be/src/scheduling/executor-blacklist.cc:

http://gerrit.cloudera.org:8080/#/c/13868/6/be/src/scheduling/executor-blacklist.cc@86
PS6, Line 86:   }
> This means there is a window for false positives here, right? If so, I thin
Added some more comments



-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 7
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Jul 2019 03:19:04 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Lars Volker, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13868

to look at the new patch set (#2).

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................

IMPALA-8339: Add local executor blacklist to coordinators

This patch adds the concept of a blacklist of executors to the
coordinator, which removes executors from consideration for query
scheduling. Blacklisting decisions are local to a given coordinator
and are not included in statestore updates.

The intention is to allow coordinators to be more aggressive about
deciding that an exeutor is unhealthy or unavailable, to minimize
failed queries in environments where cluster membership may be more
variable, rather than having to wait on the statestore heartbeat
mechanism to decide that the executor is down.

For the first patch, executors will only be blacklisted if the KRPC
status for Exec() is an error. Followup work will add blacklisting of
executors in more complex scenarios, eg. if an executor appears to be
a straggler.

Testing:
- Added a case to the cluster mgr BE unit test that uses blacklisting.
- Added an e2e test case that kills an impalad.
TODO
- Add an e2e test case where an impalad becomes briefly unreachable.
- Manual/stress tests on a real cluster.

Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/scheduling/CMakeLists.txt
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/cluster-membership-mgr-test.cc
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/scheduling/cluster-membership-mgr.h
A be/src/scheduling/executor-blacklist.cc
A be/src/scheduling/executor-blacklist.h
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler.cc
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
A tests/custom_cluster/test_blacklist.py
15 files changed, 661 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/13868/2
-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Lars Volker, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13868

to look at the new patch set (#6).

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................

IMPALA-8339: Add local executor blacklist to coordinators

This patch adds the concept of a blacklist of executors to the
coordinator, which removes executors from consideration for query
scheduling. Blacklisting decisions are local to a given coordinator
and are not included in statestore updates.

The intention is to allow coordinators to be more aggressive about
deciding that an exeutor is unhealthy or unavailable, to minimize
failed queries in environments where cluster membership may be more
variable, rather than having to wait on the statestore heartbeat
mechanism to decide that the executor is down.

For the first patch, executors will only be blacklisted if the KRPC
status for Exec() is an error. Followup work will add blacklisting of
executors in more complex scenarios, eg. if an executor appears to be
a straggler.

Testing:
- Added a case to the cluster mgr BE unit test that uses blacklisting.
- Added e2e test cases for killing and restarting an impalad.
TODO
- Add an e2e test case where an impalad becomes briefly unreachable.
- Manual/stress tests on a real cluster.

Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/scheduling/CMakeLists.txt
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/cluster-membership-mgr-test.cc
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/scheduling/cluster-membership-mgr.h
A be/src/scheduling/executor-blacklist.cc
A be/src/scheduling/executor-blacklist.h
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler.cc
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
A tests/custom_cluster/test_blacklist.py
15 files changed, 731 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/13868/6
-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Lars Volker, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13868

to look at the new patch set (#5).

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................

IMPALA-8339: Add local executor blacklist to coordinators

This patch adds the concept of a blacklist of executors to the
coordinator, which removes executors from consideration for query
scheduling. Blacklisting decisions are local to a given coordinator
and are not included in statestore updates.

The intention is to allow coordinators to be more aggressive about
deciding that an exeutor is unhealthy or unavailable, to minimize
failed queries in environments where cluster membership may be more
variable, rather than having to wait on the statestore heartbeat
mechanism to decide that the executor is down.

For the first patch, executors will only be blacklisted if the KRPC
status for Exec() is an error. Followup work will add blacklisting of
executors in more complex scenarios, eg. if an executor appears to be
a straggler.

Testing:
- Added a case to the cluster mgr BE unit test that uses blacklisting.
- Added e2e test cases for killing and restarting an impalad.
TODO
- Add an e2e test case where an impalad becomes briefly unreachable.
- Manual/stress tests on a real cluster.

Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/scheduling/CMakeLists.txt
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/cluster-membership-mgr-test.cc
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/scheduling/cluster-membership-mgr.h
A be/src/scheduling/executor-blacklist.cc
A be/src/scheduling/executor-blacklist.h
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler.cc
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
A tests/custom_cluster/test_blacklist.py
15 files changed, 733 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/13868/5
-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 5
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/4024/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Fri, 26 Jul 2019 18:31:41 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 2:

(31 comments)

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/runtime/coordinator-backend-state.h@284
PS1, Line 284: ExecQu
> ExecQueryFInstance()
Done


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/runtime/coordinator.cc@394
PS1, Line 394:       // The Exec() rpc failed, so blacklist the executor.
             :       LOG(INFO) << "Blacklisting "
             :                 << TNetworkAddressToString(backend_state->impalad_address()
> One interesting case is what if the coordinator cannot RPC to itself. I sup
Good point. Probably makes sense to have a special case for not blacklisting the local backend.


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.h
File be/src/scheduling/cluster-membership-mgr.h:

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.h@96
PS1, Line 96: Only
> typo
Done


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.h@168
PS1, Line 168:   void UpdateMembership(const StatestoreSubscriber::TopicDeltaMap& incoming_topic_deltas,
> nit: Adds, Updates.
Done


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.h@170
PS1, Line 170: 
> Should this method accept an IP address instead?

Discussed on another comment.


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.h@206
PS1, Line 206: cutorGroups& executor_gr
> Is there any chance this mutex may be held at the same time as various othe
Done


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.h@206
PS1, Line 206:       const ExecutorGroups& executor_groups, const ExecutorBlacklist& executor_blacklist);
> Does this also need the mutable modifier like the other two locks ?

No, 'mutable' is used to allow a field to be modified in a function that's marked 'const'. update_membership_lock_ isn't taken in any const functions.


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.cc
File be/src/scheduling/cluster-membership-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.cc@119
PS1, Line 119: 
             :   // Check if the local backend is up and needs updating.
> I suppose this relies on StateStore being alive and keeps sending periodic 
That's a good point. This approach (maintaining the blacklist on the statestore update thread) is what we had put in the design doc, but since I've gone with a design that requires the lock 'update_membership_lock_' anyways, it might be nice to just put this in its own thread.

That would also make it easier to change the frequency that we check the blacklist for updates - we receive statestore updates every 100ms by default (most of which of course are no-ops) which might be more frequent than we really need to do this check (by default, the shortest time a node is blacklisted for is 12s anyways, probably not a big deal if we only do this update once per second or so)

It could remove having to reason about how to handle nodes that were deleted/updated/added in the same update as we decide to unblacklist them, which you had questions about below.


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.cc@168
PS1, Line 168:     if (recovering_
> How would this work if the deleted item is in the list unblacklisted ?
I think this and your other similar concerns below are addressed by taking Lars's suggestion of only constructing 'unblacklisted' after the statestore update has been fully processed, since we call 'Remove' on the blacklist here.


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.cc@175
PS1, Line 175: rship_);
> Otherwise, won't it eventually get into the
 > probation state and get added back with the stale BE descriptor ?

NodeBlacklist::Remove() completely removes executors from the blacklist, it doesn't put them on probation.


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.cc@186
PS1, Line 186:   for (const TTopicItem& item : update.topic_entries) {
> line too long (91 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.cc@265
PS1, Line 265:     new_backend_map->insert(make_pair(item.key, be_desc))
> How would this work if the cluster membership in the past indicates that th
Done


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.cc@271
PS1, Line 271:       }
> I think the overall control flow could be simpler if we apply all SS update
Done


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.cc@330
PS1, Line 330: ClusterMembershipMgr::Bl
> Maybe spell out explicitly why it doesn't apply?
Done


http://gerrit.cloudera.org:8080/#/c/13868/2/be/src/scheduling/cluster-membership-mgr.cc
File be/src/scheduling/cluster-membership-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/13868/2/be/src/scheduling/cluster-membership-mgr.cc@468
PS2, Line 468:       const ExecutorGroups& executor_groups, const ExecutorBlacklist& executor_blacklist) {
> line too long (91 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.h
File be/src/scheduling/node-blacklist.h:

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.h@30
PS1, Line 30: 
> Can you please describe the thread safety of this class ? Is there any assu
Done


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.h@38
PS1, Line 38: 
> typo
Done


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.h@48
PS1, Line 48: 
> On the other hand the CMM is already pretty large
Yeah, I tend to agree with Lars that CMM would be a really big file, then. Don't feel strongly, though.


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.h@57
PS1, Line 57: 
> Maybe FindAndRemove would make the usage more clear when calling this?
Done


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.h@78
PS1, Line 78: 
            : 
            : 
            : 
> IMHO, this may be clearer if we use a enum instead to represent the state i
Done


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.h@92
PS1, Line 92: 
> Have you considered blacklisting whole hosts?
Of course, in a typical deployment its functionally equivalent since there's a 1 to 1 mapping, but it would be a little more efficient.

You could imagine users who want to have a setup with multiple impalads on each node who would want blacklisting to be on a per-impalad basis, so that if one impalad on a node crashes the others aren't also blacklisted. That's probably not an important or supported use case though.

One big reason to do it this way is it makes testing of blacklisting on the minicluster much easier.

I renamed it ExecutorBlacklist to make it clearer (though I'm willing to explore ways to do testing with per-node blacklisting, eg. with the dockerized minicluster, if people think its better)

One thing we might consider is adding a concept of a backend_id, eg, a TUniqueId, that gets generated on impalad startup. That gives us a couple things:
- something nice to key off of here and elsewhere, eg. ExecutorGroup
- A way to distinguish the case when an impalad goes down and a new one is started at the same host:port


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.cc
File be/src/scheduling/node-blacklist.cc:

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.cc@25
PS1, Line 25: 
> What's the reasoning behind having this timeout padding ? May be worth docu
Done


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.cc@42
PS1, Line 42: 
            : 
> Is it possible for multiple queries to attempt to blacklist the same backen
Done


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.cc@83
PS1, Line 83: 
> May be helpful to define this as a constant with meaningful name.
Done


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.cc@112
PS1, Line 112: 
> std::move() or return values through pointer
Done


http://gerrit.cloudera.org:8080/#/c/13868/2/tests/custom_cluster/test_blacklist.py
File tests/custom_cluster/test_blacklist.py:

http://gerrit.cloudera.org:8080/#/c/13868/2/tests/custom_cluster/test_blacklist.py@25
PS2, Line 25: class TestBlacklist(CustomClusterTestSuite):
> flake8: E302 expected 2 blank lines, found 1
Done


http://gerrit.cloudera.org:8080/#/c/13868/2/tests/custom_cluster/test_blacklist.py@36
PS2, Line 36:  
> flake8: E203 whitespace before ','
Done


http://gerrit.cloudera.org:8080/#/c/13868/2/tests/custom_cluster/test_blacklist.py@36
PS2, Line 36: =
> flake8: E711 comparison to None should be 'if cond is None:'
Done


http://gerrit.cloudera.org:8080/#/c/13868/2/tests/custom_cluster/test_blacklist.py@52
PS2, Line 52:  
> flake8: E203 whitespace before ','
Done


http://gerrit.cloudera.org:8080/#/c/13868/2/tests/custom_cluster/test_blacklist.py@61
PS2, Line 61:  
> flake8: E203 whitespace before ','
Done


http://gerrit.cloudera.org:8080/#/c/13868/2/tests/custom_cluster/test_blacklist.py@61
PS2, Line 61: =
> flake8: E711 comparison to None should be 'if cond is None:'
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 23 Jul 2019 05:20:10 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/4009/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 5
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 25 Jul 2019 23:10:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 4:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/admission-controller.cc
File be/src/scheduling/admission-controller.cc:

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/admission-controller.cc@737
PS3, Line 737: string
> nit: std:: not needed
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc
File be/src/scheduling/cluster-membership-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc@272
PS3, Line 272: already be on the blacklist or probation.
> This may be a case worth double checking. If a node is restarted and re-reg
Yes, this DCHECK is correct - if a node is restarted, when it re-registers with the statestore the old subscriber is always unregistered first, see: https://github.com/apache/impala/blob/master/be/src/statestore/statestore.cc#L623

This is also required by the loop above, which will fail if AddExecutor() is called with an existing executor.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc@273
PS3, Line 273: DCHECK
> DCHECK_EQ
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc@350
PS3, Line 350:  exists
> Given the for-loop above, can an executor belong to more than one executor 
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc@355
PS3, Line 355: exists = true;
> This is repeated at multiple places in this function. May make sense to fac
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h
File be/src/scheduling/executor-blacklist.h:

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@32
PS3, Line 32: 
> FindAndRemove()
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@38
PS3, Line 38: When a blacklisted executor has passed this timeout
            : /// and Maintenance() is called, the executor is put on 'probation'
> May help to also document what if an executor on blacklist is removed durin
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@42
PS3, Line 42: There
> nit: typo
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@79
PS3, Line 79: 
> May be clearer to call it "probation_list" or something so it can relates t
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@124
PS3, Line 124: 
> typo
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc
File be/src/scheduling/executor-blacklist.cc:

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@29
PS3, Line 29: 0 disables blacklisting
> This seems a bit weird. Setting it to 0 means the blacklist timeout == stat
My problem with setting it to 120 is that it makes it possible for users to set it to < 100, which doesn't seem like it would ever be the right thing to do (eg. for executors that go down, it would make it very likely they would be incorrectly unblacklisted).

Maybe it would be better to just add another flag, blacklisting_enabled or similar. I was trying to minimize the number of flags, but I guess it doesn't really matter. I don't think we'll be formally documenting these or expecting customers to commonly set them, they're just intended as emergency safety valves.

Or we could do something like disable blacklisting if this is negative.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@39
PS3, Line 39:  ExecutorBlacklist::Blacklist(const TBackendDescriptor& be_desc) {
> Should this check if blacklisting is disabled at the beginning ?
I put the check in ClusterMembershipMgr::Blacklist to avoid copying the Snapshot if blacklisting is disabled.

I'll at least add a DCHECK here.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@54
PS3, Line 54: nicMillis() 
> Why not MonotonicMillis() ? This field shouldn't need to be shared across m
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@83
PS3, Line 83:   if (remove_it == be_descs.end()) {
            :     // Executor wasn't on the blacklist.
            :     return NOT_BLACKLISTED;
            :   }
            :   St
> This seems to be used at more than one functions. May be worth factoring it
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@99
PS3, Line 99:    for (auto entry_it : e
> Is remove_it still valid after calling erase above ?
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@110
PS3, Line 110: 
> put on probation
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@138
PS3, Line 138: 
> remove it from the blacklist.
I think this comment is correct? This branch is handling things that were on probation


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@184
PS3, Line 184: 
> "put on probation"  ?
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/tests/custom_cluster/test_blacklist.py
File tests/custom_cluster/test_blacklist.py:

http://gerrit.cloudera.org:8080/#/c/13868/3/tests/custom_cluster/test_blacklist.py@62
PS3, Line 62:         (killed_impalad.hostname, killed_impalad.service.be_port), result.runtime_profile
> Does it make sense to also add a test case when the node is re-added back t
Added a test for restarting an impalad.

Agreed that we should have a test for transient failure. Its trickier since such a debug action doesn't exist yet. I was planning on doing that in a follow up patch, but I can include it in this patch if you prefer



-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 4
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 25 Jul 2019 22:35:33 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 5:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/13868/4/be/src/scheduling/executor-blacklist.cc
File be/src/scheduling/executor-blacklist.cc:

http://gerrit.cloudera.org:8080/#/c/13868/4/be/src/scheduling/executor-blacklist.cc@196
PS4, Line 196: bool ExecutorBlacklist::eqBePort(
> line too long (92 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/13868/4/tests/custom_cluster/test_blacklist.py
File tests/custom_cluster/test_blacklist.py:

http://gerrit.cloudera.org:8080/#/c/13868/4/tests/custom_cluster/test_blacklist.py@45
PS4, Line 45: 
> flake8: E501 line too long (107 > 90 characters)
Done


http://gerrit.cloudera.org:8080/#/c/13868/4/tests/custom_cluster/test_blacklist.py@70
PS4, Line 70: 
> flake8: E501 line too long (107 > 90 characters)
Done


http://gerrit.cloudera.org:8080/#/c/13868/4/tests/custom_cluster/test_blacklist.py@79
PS4, Line 79: 
> flake8: E501 line too long (107 > 90 characters)
Done


http://gerrit.cloudera.org:8080/#/c/13868/4/tests/custom_cluster/test_blacklist.py@110
PS4, Line 110: 
> flake8: E501 line too long (107 > 90 characters)
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 5
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 25 Jul 2019 22:44:59 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/13868/2/be/src/scheduling/cluster-membership-mgr.cc
File be/src/scheduling/cluster-membership-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/13868/2/be/src/scheduling/cluster-membership-mgr.cc@468
PS2, Line 468:       const ExecutorGroups& executor_groups, const ExecutorBlacklist& executor_blacklist) {
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/13868/2/tests/custom_cluster/test_blacklist.py
File tests/custom_cluster/test_blacklist.py:

http://gerrit.cloudera.org:8080/#/c/13868/2/tests/custom_cluster/test_blacklist.py@25
PS2, Line 25: class TestBlacklist(CustomClusterTestSuite):
flake8: E302 expected 2 blank lines, found 1


http://gerrit.cloudera.org:8080/#/c/13868/2/tests/custom_cluster/test_blacklist.py@36
PS2, Line 36:  
flake8: E203 whitespace before ','


http://gerrit.cloudera.org:8080/#/c/13868/2/tests/custom_cluster/test_blacklist.py@36
PS2, Line 36: =
flake8: E711 comparison to None should be 'if cond is None:'


http://gerrit.cloudera.org:8080/#/c/13868/2/tests/custom_cluster/test_blacklist.py@52
PS2, Line 52:  
flake8: E203 whitespace before ','


http://gerrit.cloudera.org:8080/#/c/13868/2/tests/custom_cluster/test_blacklist.py@61
PS2, Line 61:  
flake8: E203 whitespace before ','


http://gerrit.cloudera.org:8080/#/c/13868/2/tests/custom_cluster/test_blacklist.py@61
PS2, Line 61: =
flake8: E711 comparison to None should be 'if cond is None:'



-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Comment-Date: Tue, 23 Jul 2019 05:16:14 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] [WIP] IMPALA-8339: Add local node blacklist to coordinators

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: [WIP] IMPALA-8339: Add local node blacklist to coordinators
......................................................................


Patch Set 1:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.h
File be/src/scheduling/cluster-membership-mgr.h:

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.h@168
PS1, Line 168:   /// Add the given backend to the local blacklist. Update 'current_membership_' to remove
nit: Adds, Updates.


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.h@170
PS1, Line 170:   void Blacklist(const TBackendDescriptor& be_desc);
Maybe rename to BlacklistExecutor? Should this method accept an IP address instead?


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.cc
File be/src/scheduling/cluster-membership-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.cc@271
PS1, Line 271: 
I think the overall control flow could be simpler if we apply all SS updates first, irrespective of blacklisting, and then apply all blacklisting information in a subsequent step here. In other words, we first assemble the full SS view, then we modify it according to the local view.


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/cluster-membership-mgr.cc@330
PS1, Line 330: which doesn't apply here
Maybe spell out explicitly why it doesn't apply?


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.h
File be/src/scheduling/node-blacklist.h:

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.h@48
PS1, Line 48: class NodeBlacklist {
> Is this class called only from the ClusterMembershipMgr class ? If so, may 
On the other hand the CMM is already pretty large


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.h@57
PS1, Line 57: Remove
Maybe FindAndRemove would make the usage more clear when calling this?


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.h@81
PS1, Line 81:     bool blacklisted;
Could also be an enum?


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.h@92
PS1, Line 92:   std::unordered_map<IpAddr, std::vector<Entry>> node_list_;
Have you considered blacklisting whole hosts?


http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.cc
File be/src/scheduling/node-blacklist.cc:

http://gerrit.cloudera.org:8080/#/c/13868/1/be/src/scheduling/node-blacklist.cc@112
PS1, Line 112: unblacklisted
std::move() or return values through pointer



-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Jul 2019 20:12:54 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8339: Add local executor blacklist to coordinators

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 4:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/13868/4/be/src/scheduling/executor-blacklist.cc
File be/src/scheduling/executor-blacklist.cc:

http://gerrit.cloudera.org:8080/#/c/13868/4/be/src/scheduling/executor-blacklist.cc@196
PS4, Line 196: bool ExecutorBlacklist::eqBePort(const TBackendDescriptor& be_desc, const Entry& existing) {
line too long (92 > 90)


http://gerrit.cloudera.org:8080/#/c/13868/4/tests/custom_cluster/test_blacklist.py
File tests/custom_cluster/test_blacklist.py:

http://gerrit.cloudera.org:8080/#/c/13868/4/tests/custom_cluster/test_blacklist.py@45
PS4, Line 45: t
flake8: E501 line too long (107 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/13868/4/tests/custom_cluster/test_blacklist.py@70
PS4, Line 70: t
flake8: E501 line too long (107 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/13868/4/tests/custom_cluster/test_blacklist.py@79
PS4, Line 79: t
flake8: E501 line too long (107 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/13868/4/tests/custom_cluster/test_blacklist.py@110
PS4, Line 110: t
flake8: E501 line too long (107 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 4
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 25 Jul 2019 22:35:58 +0000
Gerrit-HasComments: Yes