You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Sahil Takiar (Code Review)" <ge...@cloudera.org> on 2019/11/09 01:02:39 UTC

[Impala-ASF-CR] IMPALA-9137, IMPALA-9138: Mark failed RPC as retryable, add dst node to blacklist

Sahil Takiar has uploaded this change for review. ( http://gerrit.cloudera.org:8080/14677


Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPC as retryable, add dst node to blacklist
......................................................................

IMPALA-9137, IMPALA-9138: Mark failed RPC as retryable, add dst node to blacklist

Introduces two optional fields to TStatus: TErrorType and
TRPCErrorMessage. TErrorType introduces a notion of "types" to TStatus
objects. For now there are only two types, GENERAL and RETRYABLE.
TRPCErrorMessage is set if the TStatus was created as the result of a
failed RPC call. It contains the TNetworkAddress of the destination node
of the failed RPC. When a Status object is created, SetIsRetryable() can
be used to mark the Status as retryable and SetRPCErrorMsg(RPCErrorMsg)
can be used to add a TRPCErrorMessage to a Status object.

When the Coordinator updates the states of
(Coordinator::UpdateBackendExecStatus), if it receives a Status where
Status::HasRPCErrorMsg() is true, it takes the RPCErrorMsg destination
node, and adds it to the blacklist.

Currently, if a Status is marked as retryable (Status::IsRetryable() ==
true), nothing happens. The change is simply meant to lay the groundwork
for future changes.

Only RPC failures in KrpcDataStreamSender are marked as retryable and
have a RPCErrorMsg set.

Re-factored the Thrift files a bit and added a Common.thrift file for
all commonly used Thrift structures.

Testing:
* Ran core tests
* Planning to add more tests after IMPALA-8138 is merged

Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
---
M be/generated-sources/gen-cpp/CMakeLists.txt
M be/src/common/status.cc
M be/src/common/status.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/krpc-data-stream-sender.cc
M be/src/util/container-util.h
M be/src/util/error-util.h
M common/protobuf/common.proto
M common/thrift/CMakeLists.txt
M common/thrift/CatalogObjects.thrift
M common/thrift/CatalogService.thrift
A common/thrift/Common.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/StatestoreService.thrift
M common/thrift/Status.thrift
M common/thrift/Types.thrift
18 files changed, 206 insertions(+), 28 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/14677/1
-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>

[Impala-ASF-CR] IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist
......................................................................


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5245/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 19 Nov 2019 03:30:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137, IMPALA-9138: Mark failed RPC as retryable, add dst node to blacklist

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14677

to look at the new patch set (#2).

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPC as retryable, add dst node to blacklist
......................................................................

IMPALA-9137, IMPALA-9138: Mark failed RPC as retryable, add dst node to blacklist

Introduces two optional fields to TStatus: TErrorType and
TRPCErrorMessage. TErrorType introduces a notion of "types" to TStatus
objects. For now there are only two types, GENERAL and RETRYABLE.
TRPCErrorMessage is set if the TStatus was created as the result of a
failed RPC call. It contains the TNetworkAddress of the destination node
of the failed RPC. When a Status object is created, SetIsRetryable() can
be used to mark the Status as retryable and SetRPCErrorMsg(RPCErrorMsg)
can be used to add a TRPCErrorMessage to a Status object.

When the Coordinator updates the states of
(Coordinator::UpdateBackendExecStatus), if it receives a Status where
Status::HasRPCErrorMsg() is true, it takes the RPCErrorMsg destination
node, and adds it to the blacklist.

Currently, if a Status is marked as retryable (Status::IsRetryable() ==
true), nothing happens. The change is simply meant to lay the groundwork
for future changes.

Only RPC failures in KrpcDataStreamSender are marked as retryable and
have a RPCErrorMsg set.

Re-factored the Thrift files a bit and added a Common.thrift file for
all commonly used Thrift structures.

Testing:
* Ran core tests
* Planning to add more tests after IMPALA-8138 is merged

Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
---
M be/generated-sources/gen-cpp/CMakeLists.txt
M be/src/common/status.cc
M be/src/common/status.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/krpc-data-stream-sender.cc
M be/src/util/container-util.h
M be/src/util/error-util.h
M common/protobuf/common.proto
M common/thrift/CMakeLists.txt
M common/thrift/CatalogObjects.thrift
M common/thrift/CatalogService.thrift
A common/thrift/Common.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/StatestoreService.thrift
M common/thrift/Status.thrift
M common/thrift/Types.thrift
M tests/custom_cluster/test_custom_statestore.py
M tests/statestore/test_statestore.py
20 files changed, 208 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/14677/2
-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 14: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 14
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Fri, 20 Dec 2019 02:50:44 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist
......................................................................


Patch Set 5: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 19 Nov 2019 20:00:57 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 11:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14677/11/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14677/11/be/src/runtime/coordinator.cc@826
PS11, Line 826:             LOG(WARNING) << "Query failed due to a failed RPC to an unknown target address "
line too long (92 > 90)


http://gerrit.cloudera.org:8080/#/c/14677/11/be/src/runtime/coordinator.cc@831
PS11, Line 831:           const BackendExecParams* be_exec_params = dest_node_and_be_state->second->exec_params();
line too long (98 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 11
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 12 Dec 2019 02:58:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Thomas Tauber-Marshall, Lars Volker, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14677

to look at the new patch set (#13).

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................

IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Introduces a new optional field to FragmentInstanceExecStatusPB:
AuxErrorInfoPB. AuxErrorInfoPB contains optional metadata associated
with a failed fragment instance. Currently, AuxErrorInfoPB only contains
one field: RPCErrorInfoPB, which is only set if the fragment failed
because a RPC to another impalad failed. The RPCErrorInfoPB contains
the destination node of the failed RPC and the posix error code of the
failed RPC.

Coordinator::UpdateBackendExecStatus(ReportExecStatusRequestPB, ...)
uses the information in RPCErrorInfoPB (if one is set) to blacklist
the target node. While RPCErrorInfoPB::dest_node can be set to the address
of the Coordinator, the Coordinator will not blacklist itself. The
Coordinator only blacklists the node if the RPC failed with a specific
error code (currently either ENOTCONN, ECONNREFUSED, ESHUTDOWN).

Testing:
* Ran core tests
* Added new test to test_blacklist.py

Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
---
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/krpc-data-stream-sender.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/util/network-util.cc
M be/src/util/network-util.h
M common/protobuf/common.proto
M common/protobuf/control_service.proto
M tests/custom_cluster/test_blacklist.py
11 files changed, 230 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/14677/13
-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 13
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 11:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5262/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 11
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 12 Dec 2019 03:28:18 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist
......................................................................


Patch Set 3:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/14677/3/be/src/common/status.h
File be/src/common/status.h:

http://gerrit.cloudera.org:8080/#/c/14677/3/be/src/common/status.h@254
PS3, Line 254:   bool IsRecoverableError() const {
Might want to add some documentation about the difference between a 'recoverable' error and a 'retryable' error.


http://gerrit.cloudera.org:8080/#/c/14677/3/be/src/common/status.h@271
PS3, Line 271:   void SetIsRetryable() {
Might be nice to have this and SetRPCErrorMsg return '*this' to allow for patterns like:

return Status("error").SetIsRetryable();


http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Common.thrift
File common/thrift/Common.thrift:

http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Common.thrift@1
PS3, Line 1: // Licensed to the Apache Software Foundation (ASF) under one
Why was it necessary to add this file, rather than just including Types.thrift in Status.thrift?


http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift
File common/thrift/Status.thrift:

http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift@33
PS3, Line 33: enum TErrorType {
Do you have other error types in mind for the future, i.e. why not just do a 'bool retryable' in TStatus instead?


http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift@48
PS3, Line 48: 4: optional TRPCErrorMessage rpc_msg
This feels a little too specific to this patch/not extensible. We probably don't want to add a bunch of extra fields to TStatus for various other retriable failure scenarios. Some other scenarios we might need to address in the future:
- Ask the coordinator to blacklist a node for a reason other than an rpc failure, eg. because it has a bad disk
- Ask the coordinator to perform an action other than blacklisting a node before retrying the query, eg. if the failure is something where the query needs to be re-planned and with or without reloading metadata

Some ideas for how to address these issues:
- Make this something more generic, like a string->string map, then for this patch have a "BLACKLIST"-><hostname> entry in the map
- Instead of putting this stuff in the TStatus, put it somewhere like in the ReportExecStatusRequestPB, eg. add a list<TNetworkAddress> nodes_to_blacklist. That's kind of nice because its not clear to me that it makes sense to think of this stuff as a property of the status anyways. The downside is that these error statuses can come from lots of different places and it may be hard to bubble up the relevant info if its not returned in the status, but maybe we could add functions for it to RuntimeState or FragmentInstanceState or whatever. That could also make this more flexible - you can have a query that succeeds but still tells the coordinator to blacklist a node because it was detected to be slow or something.
- Maybe just rename it to RetryInfo or something, then if it ends up with 5-10 fields for different scenarios at least its all clearly contained together

We can also of course just punt on this decision until we do the follow up work that would actually need it, eg. maybe we won't ever use this sort of mechanism for blacklisting nodes for stuff like bad disks because that will just be covered by the extended health check mechanism. Just stuff to think about



-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Fri, 15 Nov 2019 23:41:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist
......................................................................


Patch Set 5: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/5245/


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 19 Nov 2019 13:30:49 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist
......................................................................


Patch Set 4:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/14677/3/be/src/common/status.h
File be/src/common/status.h:

http://gerrit.cloudera.org:8080/#/c/14677/3/be/src/common/status.h@254
PS3, Line 254:   /// Mark the error as recoverable. Clients can recover from recoverable errors and a
> Might want to add some documentation about the difference between a 'recove
Done


http://gerrit.cloudera.org:8080/#/c/14677/3/be/src/common/status.h@271
PS3, Line 271:         && msg_->error() == TErrorCode::DISK_IO_ERROR;
> Might be nice to have this and SetRPCErrorMsg return '*this' to allow for p
Done


http://gerrit.cloudera.org:8080/#/c/14677/4/be/src/common/status.cc
File be/src/common/status.cc:

http://gerrit.cloudera.org:8080/#/c/14677/4/be/src/common/status.cc@315
PS4, Line 315:         msg_->SetRPCErrorMessage(RPCErrorMessage(status.retry_info.rpc_error_msg.dest_node));
> line too long (93 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/14677/4/be/src/service/client-request-state.cc
File be/src/service/client-request-state.cc:

http://gerrit.cloudera.org:8080/#/c/14677/4/be/src/service/client-request-state.cc@906
PS4, Line 906:         "Restarting of fetch requires enabling of query result caching.")).SetIsRecoverable();
> line too long (94 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Common.thrift
File common/thrift/Common.thrift:

http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Common.thrift@1
PS3, Line 1: // Licensed to the Apache Software Foundation (ASF) under one
> Why was it necessary to add this file, rather than just including Types.thr
Types.thrift contains a bunch of thrift structs that aren't relevant to Status.thrift and I didn't want to populate the file with unnecessary imports (I couldn't find a way to import a specific struct from another file). I think the re-factoring is cleaner as well.


http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift
File common/thrift/Status.thrift:

http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift@33
PS3, Line 33: struct TStatusProperties {
> Do you have other error types in mind for the future, i.e. why not just do 
Yeah, that is a good point. I was originally thinking this would be used for IMPALA-8714, but after thinking about it some more, I don't think it makes that much sense to make RETRYABLE a type.

I changed this and introduced a TStatusProperties which is just a generic struct that is used to describe properties of a TStatus object. For now, I put in 'is_retryable' and 'is_recoverable' as the only two entries.


http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift@48
PS3, Line 48: 
> This feels a little too specific to this patch/not extensible. We probably 
Agree those use cases are important. I think the current mechanism will work well for retrying queries that would otherwise fail - e.g. due to an RPC failure, or a bad disk. The relevant retry information can simply be propagated via the Status object, which is created on failure anyway.

I think one issue with a string->string map is that it is too free-form, all users of the map will need to be aware of special keywords for encoding information in the map. I think making things more strongly typed via Thrift objects would be cleaner, and make the code easier to understand.

I think using the information for ReportExecStatusRequestPB will be necessary when we start looking into "limping nodes" - e.g. detecting nodes that are slower than the others, and then blacklisting them. I think we would need to use ReportExecStatusRequestPB in combination with the changes to TStatus. I think it is okay to have the information come from multiple places: e.g. from exec status reports and from Status objects. They represent two distinct things anyway: blacklisting due to due actual query errors vs. blacklisting nodes because they are slow.

I like the idea of using RetryInfo to encapsulate everything so that the TStatus struct doesn't get polluted with a bunch of optional structs. So I did that.



-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 4
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 19 Nov 2019 03:19:40 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 12:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.h
File be/src/runtime/coordinator.h:

http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.h@152
PS12, Line 152:   void UpdateBlacklistWithAuxErrorInfo(const ReportExecStatusRequestPB& request);
> This can be private
Done


http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.h@263
PS12, Line 263:   boost::unordered_map<TNetworkAddress, BackendState*> addr_to_backend_state_;
> I think we might already have this mapping in QuerySchedule::per_backend_ex
Yeah, but that mapping uses Thrift address rather than KRPC addresses. Updated the docs to make this more clear.


http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc@862
PS12, Line 862:         LOG(WARNING) << "Query failed due to a failed RPC to an unknown target address "
> Doesn't seem like this case should be possible, since 'addr_to_backend_stat
I added a DCHECK(false) and changed the log level to ERROR.


http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc@879
PS12, Line 879:       static const set<int32_t> blacklistable_rpc_error_codes = {
> Do you examples of posix codes a NetworkError could generate that we wouldn
Discussed offline. There are a bunch of error codes (https://www-numi.fnal.gov/offline_software/srt_public_context/WebDocs/Errors/unix_system_errors.html) that don't seem useful to blacklist. I'm not sure what set of error codes KRPC is even possible of setting for network errors. Regardless, its probably safer to selectively add error codes to this list as we see fit.


http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc@882
PS12, Line 882: ECONNREFUSED
> I think you can specify these directly instead of having to use the actual 
Done


http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc@892
PS12, Line 892: break;
> Was it your intention to not continue through the loop if the first RPCErro
Done


http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/runtime-state.h
File be/src/runtime/runtime-state.h:

http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/runtime-state.h@439
PS12, Line 439:   struct AuxErrorInfo {
> Why not just use AuxErrorInfoPB directly?
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 12
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 19 Dec 2019 03:51:47 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 12:

> Depends on the particular RPC. When we blacklist for a failed
 > Exec() rpc, it will never have been retried. For TransmitData(), it
 > will only be retried if the error that came back is "server too
 > busy"

How about connection / connectivity errors? If the connection is broken, do we make any attempt to retry connection establishment or do we just fail?

 > If auto-scaling takes into account blacklisting, I would probably
 > say that's the wrong approach and we should think about changing it
 > to only consider the statestore view of cluster membership.

Well if a node is blacklisted, no queries get scheduled on that node right? So functionally, the node isn't doing any useful work, so it might make sense for the auto-scaler to think it is "dead" and no longer part of the executor group. Perhaps a better approach would be to only consider a node as "dead" if it has been blacklisted multiple times. I think this is a larger discussion though, we haven't spent enough time thinking through node blacklisting + auto-scaler integration. I think we should, but probably as a separate item.

 > Definitely agree that we need to put some work into realistic fault
 > testing, and not just rely on our intuition about blacklisting/retry
 > policy.

Filed IMPALA-9253 for this.

 > I'm definitely interested in approaches to retries that don't
 > require marking statuses retriable (obviously I've had a lot of
 > concerns with that approach). I could see some problems with this
 > approach too, such as that we may have situations in the future
 > where we want to retry queries without doing any blacklisting (eg.
 > if we think there was stale metadata and replanning the query will
 > let it succeed)

Yeah, its possible in the future that we might want to retry queries even if a node hasn't been blacklisted, but I don't see a strong use case for that type of pattern right now. In the interest of not over-engineering the initial solution, I'm fine with just making retries blacklist / cluster membership driven.

 > or probably worse - that you could have a query
 > that fails due to a failed rpc and reports that the node should be
 > blacklisted, and then also hits another error that isn't retry-able
 > but we retry it anyways. Not sure how common a situation like that
 > would be, though.

Definitely possible, but a bit of an edge case. A query would have to fail for two separate reasons at approximately the same time: an error that triggers a blacklist, and an error that doesn't. Whenever the Coordinator gets an error, it cancels the rest of the query, so the non-blacklist error has to occur before the cancellation is processed. Its something we should fix, but I think we can defer out of the initial implementation. Filed IMPALA-9124 as a follow up.

The fix would probably be to ensure that all failed fragments trigger a blacklist or failed because they were cancelled.


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 12
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Mon, 16 Dec 2019 21:30:07 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Thomas Tauber-Marshall, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14677

to look at the new patch set (#10).

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................

IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Introduces a new optional field to FragmentInstanceExecStatusPB:
AuxErrorInfoPB. AuxErrorInfoPB contains optional metadata associated
with a failed fragment instance. Currently, AuxErrorInfoPB only contains
one field: RPCErrorInfoPB, which is only set if the fragment failed
because a RPC to another impalad failed. The RPCErrorInfoPB contains
the destination node of the failed RPC.

Coordinator::UpdateBackendExecStatus(ReportExecStatusRequestPB, ...)
uses the information in RPCErrorInfoPB (if one is set) to blacklist
the target node.

Testing:
* Ran core tests
* Added new test to test_blacklist.py

Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
---
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/krpc-data-stream-sender.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/util/network-util.cc
M be/src/util/network-util.h
M common/protobuf/common.proto
M common/protobuf/control_service.proto
M tests/custom_cluster/test_blacklist.py
11 files changed, 154 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/14677/10
-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 10
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14677/9/tests/custom_cluster/test_blacklist.py
File tests/custom_cluster/test_blacklist.py:

http://gerrit.cloudera.org:8080/#/c/14677/9/tests/custom_cluster/test_blacklist.py@139
PS9, Line 139: e
flake8: E501 line too long (99 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 9
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Dec 2019 02:57:36 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Thomas Tauber-Marshall, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14677

to look at the new patch set (#6).

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................

IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Introduces a new optional field to ReportExecStatusRequestPB:
StatusAuxInfoPB. StatusAuxInfoPB contains optional metadata associated
with ReportExecStatusRequestPB::overall_status. Currently,
StatusAuxInfoPB only contains one field: RPCErrorMessagePB, which is only
set if the ReportExecStatusRequestPB::fragment_instance_id failed
because a RPC to another impalad failed. The RPCErrorMessagePB contains
the destination node of the failed RPC.

Coordinator::UpdateBackendExecStatus(ReportExecStatusRequestPB, ...)
uses the information in RPCErrorMessagePB (if one is set) to blacklist
the target node.

Testing:
* Ran core tests
* Planning to add more tests after IMPALA-8138 is merged

Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
---
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/krpc-data-stream-sender.cc
M be/src/runtime/query-state.cc
M be/src/runtime/runtime-state.h
A be/src/runtime/status-aux-info.h
M be/src/util/network-util.cc
M be/src/util/network-util.h
M common/protobuf/common.proto
M common/protobuf/control_service.proto
10 files changed, 168 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/14677/6
-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 3:

Basically re-wrote the patch.

* The RPCErrorMessage information is now encapsulated in the StatusAuxInfoPB struct
* The StatusAuxInfoPB is tracked in the RuntimeState of each fragment instance
* The StatusAuxInfoPB is propagated to the Coordinator via ReportExecStatusRequestPB
* Removed all the Status re-factoring / changes


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Nov 2019 20:04:51 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 9:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/14677/9/be/src/runtime/krpc-data-stream-sender.cc
File be/src/runtime/krpc-data-stream-sender.cc:

http://gerrit.cloudera.org:8080/#/c/14677/9/be/src/runtime/krpc-data-stream-sender.cc@406
PS9, Line 406: unique_ptr
> Not sure that it's necessary to use a unique_ptr and move() here - the Stat
Removed the unique_ptr usage. Removed the StatusAuxInfo class entirely.


http://gerrit.cloudera.org:8080/#/c/14677/9/be/src/runtime/runtime-state.h
File be/src/runtime/runtime-state.h:

http://gerrit.cloudera.org:8080/#/c/14677/9/be/src/runtime/runtime-state.h@306
PS9, Line 306:   StatusAuxInfo& status_aux_info() {
> I guess you're relying on the lock in StatusAuxInfo to make this thread-saf
Done. Removed the StatusAuxInfo class entirely.


http://gerrit.cloudera.org:8080/#/c/14677/9/be/src/runtime/status-aux-info.h
File be/src/runtime/status-aux-info.h:

http://gerrit.cloudera.org:8080/#/c/14677/9/be/src/runtime/status-aux-info.h@61
PS9, Line 61: StatusAuxInfoPB* CreateStatusAuxInfoPB()
> Seems sort of weird to create a 'new StatusAuxInfoPB()" and then use set_al
No longer need this since RuntimeState just uses AuxErrorInfoPB directly.


http://gerrit.cloudera.org:8080/#/c/14677/9/common/protobuf/control_service.proto
File common/protobuf/control_service.proto:

http://gerrit.cloudera.org:8080/#/c/14677/9/common/protobuf/control_service.proto@155
PS9, Line 155: // Metadata that can be associated with a StatusPB object.
> Might be useful to explain this a little more, eg. something like "Used to 
Done


http://gerrit.cloudera.org:8080/#/c/14677/9/common/protobuf/control_service.proto@184
PS9, Line 184: optional StatusAuxInfoPB status_aux_info = 7;
> Might make more sense to put this into FragmentInstanceExecStatusPB, since 
Done. Changed the name to AuxErrorInfoPB as well since FragmentInstanceExecStatusPB doesn't have a StatusPB object.


http://gerrit.cloudera.org:8080/#/c/14677/9/common/protobuf/control_service.proto@289
PS9, Line 289: }
> ?
Done


http://gerrit.cloudera.org:8080/#/c/14677/9/tests/custom_cluster/test_blacklist.py
File tests/custom_cluster/test_blacklist.py:

http://gerrit.cloudera.org:8080/#/c/14677/9/tests/custom_cluster/test_blacklist.py@139
PS9, Line 139: e
> flake8: E501 line too long (99 > 90 characters)
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 9
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Mon, 09 Dec 2019 19:54:31 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 9:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/14677/9/be/src/runtime/krpc-data-stream-sender.cc
File be/src/runtime/krpc-data-stream-sender.cc:

http://gerrit.cloudera.org:8080/#/c/14677/9/be/src/runtime/krpc-data-stream-sender.cc@406
PS9, Line 406: unique_ptr
Not sure that it's necessary to use a unique_ptr and move() here - the StatusAuxInfo object is pretty small, so its not a big deal if RuntimeState just copies it.


http://gerrit.cloudera.org:8080/#/c/14677/9/be/src/runtime/runtime-state.h
File be/src/runtime/runtime-state.h:

http://gerrit.cloudera.org:8080/#/c/14677/9/be/src/runtime/runtime-state.h@306
PS9, Line 306:   StatusAuxInfo& status_aux_info() {
I guess you're relying on the lock in StatusAuxInfo to make this thread-safe? I think we usually try to avoid exposing class members to be modified in this way, it's better encapsulation to use something like SetStatusAuxInfo(). Then you could add a 'status_aux_info_lock_', and it would be consistent with how we handle the error log, eg. with 'error_log_lock_'.

You might even then get rid of StatusAuxInfo entirely and just directly use the StatusAuxInfoPB objects, since StatusAuxInfo isn't really doing much.


http://gerrit.cloudera.org:8080/#/c/14677/9/be/src/runtime/status-aux-info.h
File be/src/runtime/status-aux-info.h:

http://gerrit.cloudera.org:8080/#/c/14677/9/be/src/runtime/status-aux-info.h@61
PS9, Line 61: StatusAuxInfoPB* CreateStatusAuxInfoPB()
Seems sort of weird to create a 'new StatusAuxInfoPB()" and then use set_allocated_status_aux_info().

I think the way we usually do this is to make this something like ToProtobuf(StatusAuxInfoPB* status_info), and then pass mutable_status_aux_info() in.


http://gerrit.cloudera.org:8080/#/c/14677/9/common/protobuf/control_service.proto
File common/protobuf/control_service.proto:

http://gerrit.cloudera.org:8080/#/c/14677/9/common/protobuf/control_service.proto@155
PS9, Line 155: // Metadata that can be associated with a StatusPB object.
Might be useful to explain this a little more, eg. something like "Used to store extra info about errors encountered during fragment execution that the coordinator may need to handle, such as indicating that a node may be bad and should be blacklisted"


http://gerrit.cloudera.org:8080/#/c/14677/9/common/protobuf/control_service.proto@184
PS9, Line 184: optional StatusAuxInfoPB status_aux_info = 7;
Might make more sense to put this into FragmentInstanceExecStatusPB, since at least in this patch the errors that we're handling are specific to particular fragment instances.

This also makes it easier to handle the case where multiple fragments may have failed. As is, if multiple fragments fail for different reasons, you'll just drop any StatusAuxInfoPBs that don't correspond to failed_finstance_id_.


http://gerrit.cloudera.org:8080/#/c/14677/9/common/protobuf/control_service.proto@289
PS9, Line 289: }
?



-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 9
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Fri, 06 Dec 2019 23:31:08 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist
......................................................................


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5247/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 19 Nov 2019 15:26:29 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 12:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5272/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 12
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 12 Dec 2019 20:12:17 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14677/7/tests/custom_cluster/test_blacklist.py
File tests/custom_cluster/test_blacklist.py:

http://gerrit.cloudera.org:8080/#/c/14677/7/tests/custom_cluster/test_blacklist.py@36
PS7, Line 36: #
flake8: E265 block comment should start with '# '


http://gerrit.cloudera.org:8080/#/c/14677/7/tests/custom_cluster/test_blacklist.py@139
PS7, Line 139: e
flake8: E501 line too long (99 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 7
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 03 Dec 2019 00:24:44 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14677/10/common/protobuf/control_service.proto
File common/protobuf/control_service.proto:

http://gerrit.cloudera.org:8080/#/c/14677/10/common/protobuf/control_service.proto@137
PS10, Line 137: RPC to another node failed
> Technically, RPCErrorInfoPB could contain the address of the Coordinator. I
Fwiw ClusterMembershipMgr::BlacklistExecutor already takes care of making sure not to let nodes blacklist themselves, though its reasonable to do the check that you've added too.



-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 11
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 12 Dec 2019 18:16:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift
File common/thrift/Status.thrift:

http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift@48
PS3, Line 48: 4: optional TRPCErrorMessage rpc_msg
> yeah, the point about constructing new Status objects from existing ones is
Discussed with Sahil offline about it. Summary below:

- StatusAuxInfo may be a slightly better fit for future extensibility
- Most control services are initiated from the coordinator so not much need for elaborate propagation of the aux info
- For the most part, the propagation of aux info only needs to happen for fragment instance's execution. In which case. we can consider dumping the aux info in RuntimeState or some per-fragment instance state and collect them when reporting status.



-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 26 Nov 2019 21:29:08 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 12:

Responding to comments from @Thomas.

 > Then there's blacklisting, which only takes a single failed rpc to
 > get a node blacklisted

Right, but that RPC has already been retried multiple times, right? I should probably validate that for these specific error codes, the RPCs are actually retried. 

 > So, I don't think that we need to be overly afraid about
 > blacklisting things aggressively, since if the node actually isn't
 > unhealthy blacklisting it once isn't going to have that big of an
 > impact on the cluster.

One caveat here is with executor groups. IIUC there is a concept of "executor group health". If enough nodes in an executor group are either blacklisted / removed via the statestore, Impala will mark that executor group as unhealthy, and then stop scheduling any queries on it.

Stolen from the commit message for IMPALA-8484:

    Executors can be configured with an executor group through the newly
    added '--executor_groups' flag. [...] Each executor group
    specification can optionally contain a minimum size, separated by a
    ':', e.g. --executor_groups default-pool-1:3. Only when the cluster
    membership contains at least that number of executors for the groups
    will it be considered for admission.

So depending on an external auto-scaling policy, its possible that blacklisting a node (even for 12 seconds) makes the executor group unhealthy, and causes the auto-scaler to create a new executor group (a potentially expensive operation).

 > That's particularly the case since this work is aimed at
 > facilitating transparent query retry. We definitely want to avoid
 > scenarios where a TransmitData rpc fails, we don't blacklist the
 > node because the failure doesn't match our whitelist of posix codes
 > but we go ahead and retry the query anyways and then it fails again
 > because we schedule it on the same bad node again.
 
Agree. I think the first milestone for transparent query retries is to just handle scenario where nodes crash while running a query. Based on some experiments I did, these were the only error codes that came up. I'm not confident enough in my understanding of Linux error codes to include any more than the current three, but am open to suggestions.

In general, I think we can be pretty test driven here. The best way to figure out what additional errors codes to add would be to run some network fault injection tests and see what error codes Impala fails with. I'll open a follow up JIRA for that.

 > Of course, for this patch and the followup patch, we can just only
 > mark the status as retryable if we do in fact decide to blacklist
 > the node by doing the check for posix codes in KrpcDataStreamSender

I generally wanted to avoid doing this. I think it would be better for only the Coordinator to make the ultimate decision about whether a node is blacklisted. KrpcDataStreamSender just provides enough information for the Coordinator to make the blacklisting decision. Otherwise the logic for determining if a node should be blacklisted will be spread across multiple classes, making it hard to reason about blacklisting decisions.

However, I think you do bring up a good. We don't want a failed RPC to trigger a query retry unless we are actually blacklisting the target node. After thinking through it some more, maybe marking queries as "retryable" isn't necessary at all. Instead, whenever a query fails + blacklists a node, it is retried. So query retries are blacklisting driven. This would mean abandoning the patch for 
IMPALA-9138 (Classify certain errors as retryable), which I'm fine with doing. Let me know if you feel differently.

 > but there's still the problem that this patch will only blacklist
 > one node per query.
 
Yeah, maybe this is too conservative. Assuming transparent query retries support "x" number of retries, Impala should be able to tolerate "x" faults across the duration of the query (and its retries). Maybe that is okay, maybe not. Don't have strong feelings here.

 > To be clear - I think for the first iteration of all of this its
 > probably fine to be fairly conservative about what we blacklist,
 > and if sometimes queries fail when we could have made them succeed
 > by being more aggressive that's okay (so long as they would've
 > failed even without the blacklisting work, such that its not a
 > regression).

Agree. It feels safer to add functionality to blacklisting incrementally.

 > Just wanted to make the points that 1) blacklisting
 > was designed in a way where its not that big of a deal and we don't
 > need to worry too much about screwing up a cluster by blacklisting
 > a node unnecessarily and 2) we probably really want to avoid the
 > situation where we retry a query and it fails again for the same
 > reason

Agree with both points.


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 12
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Fri, 13 Dec 2019 21:02:30 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 10:

(11 comments)

Addressed all comments except for the one on the commit message. That will require a few more changes I am still working through.

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.h
File be/src/runtime/coordinator.h:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.h@252
PS10, Line 252:   /// TNetworkAddress.
> mention ownership, nullptr (see above)
Done


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@209
PS10, Line 209:     if (UNLIKELY(!addr_to_backend_state_.insert(addr_to_backend_state).second)) {
> you can use emplace() to construct the element in place.
Done


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@815
PS10, Line 815:       for (auto status_iter = request.instance_exec_status().begin();
> Can you use a range-based loop?
Done


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@826
PS10, Line 826:               addr_to_backend_state_.at(NetworkAddressPBToTNetworkAddress(dest_node));
> This will throw an exception if the element does not exist (which is almost
Done


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@829
PS10, Line 829:           blacklisted_node = true;
> Could just break; the loop here. I can't see where blacklisted_node is used
Done


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.h
File be/src/runtime/runtime-state.h:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.h@322
PS10, Line 322:   AuxErrorInfoPB* aux_error_info() {
> Another option to improve encapsulation would be to do what we do with the 
Done.

Followed Thomas' suggestion. Changed to GetAuxErrorInfo(AuxErrorInfoPB).


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.h@444
PS10, Line 444:   /// query_status_ != Status::OK().
> Mention who owns this and who cleans it up? We should not rely on the rest 
Done


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.cc
File be/src/runtime/runtime-state.cc:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.cc@317
PS10, Line 317:   NetworkAddressPB* network_addr = new NetworkAddressPB();
> Yeah, strong preference for 'mutable_*' over the other options.
Done


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/util/network-util.h
File be/src/util/network-util.h:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/util/network-util.h@72
PS10, Line 72: TNetworkAddress NetworkAddressPBToTNetworkAddress(const NetworkAddressPB& address);
> I think FromNetworkAddressPB() would be descriptive enough since the callin
Done


http://gerrit.cloudera.org:8080/#/c/14677/10/common/protobuf/control_service.proto
File common/protobuf/control_service.proto:

http://gerrit.cloudera.org:8080/#/c/14677/10/common/protobuf/control_service.proto@137
PS10, Line 137: RPC to another node failed
> I think here it would be good to mention that this is between two executors
Technically, RPCErrorInfoPB could contain the address of the Coordinator. IIUC KRPC is used between the Coordinator fragment <-- Executor fragment RPCs.

I updated the code in coordinator.cc so that it doesn't blacklist itself even if the RPCErrorInfoPB::dest_node is set to the Coordinator address. I updated the commit message to document this as well.


http://gerrit.cloudera.org:8080/#/c/14677/10/tests/custom_cluster/test_blacklist.py
File tests/custom_cluster/test_blacklist.py:

http://gerrit.cloudera.org:8080/#/c/14677/10/tests/custom_cluster/test_blacklist.py@121
PS10, Line 121:     """Verifies that an RPC failure causes the target node to be blacklisted. The
> Can you add a test that checks that a failure between the two non-coordinat
Changed this test to use a single dedicated coordinator.



-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 10
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 12 Dec 2019 03:02:03 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Common.thrift
File common/thrift/Common.thrift:

http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Common.thrift@1
PS3, Line 1: // Licensed to the Apache Software Foundation (ASF) under one
> Not really sure I see any benefit - I guess it very slightly reduces recomp
My understanding of Types.thrift (and I could be wrong here) is that it it is meant to store Impala types (e.g. TPrimitiveType, TScalarType, etc.). However, it looks like a lot of unrelated stuff crept in over time, and I can't really figure out what Types.thrift is suppose to encapsulate.

Common.thrift is suppose to represent just commonly useful Thrift structs (e.g. TNetworkAddress) that are used across Thrift files. TUniqueId could probably live in here as well.

I guess the re-factoring is an attempt to stop polluting Types.thrift and start cleaning it up.

This isn't really part of the core patch, I just thought it was nice to have. So I don't have strong opinions about keeping this in, but just wanted to explain my thought process.


http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift
File common/thrift/Status.thrift:

http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift@48
PS3, Line 48: 4: optional TRPCErrorMessage rpc_msg
> Agreed that a string->string map was a bad suggestion, just throwing stuff 
yeah, the point about constructing new Status objects from existing ones is a valid concern.

I'll try to think through this some more.

Curious if @Michael Ho has any thoughts.



-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 26 Nov 2019 01:14:41 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5162/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Nov 2019 20:46:30 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 12:

(7 comments)

I think there a question here of how aggressive we want to be about blacklisting.

Impala has two mechanisms for updating the cluster membership: the statestore, which is fairly conservative (by default it takes 10 failed heartbeats in a row for a node to be removed by the statestore), is considered the ground truth, is communicated to all coordinators in the cluster. Once a node is removed by the statestore, it has to explicitly re-register itself to get re-included in the membership.

Then there's blacklisting, which only takes a single failed rpc to get a node blacklisted, but blacklisting is a local decision by a coordinator and never communicated to other machines, and a node that is blacklisted a single time will only stay on the blacklist for 12s by default before getting put back into the membership (nodes that are repeatedly blacklisted will stay on for longer)

So, I don't think that we need to be overly afraid about blacklisting things aggressively, since if the node actually isn't unhealthy blacklisting it once isn't going to have that big of an impact on the cluster.

That's particularly the case since this work is aimed at facilitating transparent query retry. We definitely want to avoid scenarios where a TransmitData rpc fails, we don't blacklist the node because the failure doesn't match our whitelist of posix codes but we go ahead and retry the query anyways and then it fails again because we schedule it on the same bad node again. 

Of course, for this patch and the followup patch, we can just only mark the status as retryable if we do in fact decide to blacklist the node by doing the check for posix codes in KrpcDataStreamSender, but there's still the problem that this patch will only blacklist one node per query.

To be clear - I think for the first iteration of all of this its probably fine to be fairly conservative about what we blacklist, and if sometimes queries fail when we could have made them succeed by being more aggressive that's okay (so long as they would've failed even without the blacklisting work, such that its not a regression). Just wanted to make the points that 1) blacklisting was designed in a way where its not that big of a deal and we don't need to worry too much about screwing up a cluster by blacklisting a node unnecessarily and 2) we probably really want to avoid the situation where we retry a query and it fails again for the same reason

http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.h
File be/src/runtime/coordinator.h:

http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.h@152
PS12, Line 152:   void UpdateBlacklistWithAuxErrorInfo(const ReportExecStatusRequestPB& request);
This can be private


http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.h@263
PS12, Line 263:   boost::unordered_map<TNetworkAddress, BackendState*> addr_to_backend_state_;
I think we might already have this mapping in QuerySchedule::per_backend_exec_params()?


http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc@862
PS12, Line 862:         LOG(WARNING) << "Query failed due to a failed RPC to an unknown target address "
Doesn't seem like this case should be possible, since 'addr_to_backend_state_' was built using the schedule for this query, and fragments can't possibly try to send rpcs to backends that the query wasn't scheduled on


http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc@879
PS12, Line 879:       static const set<int32_t> blacklistable_rpc_error_codes = {
Do you examples of posix codes a NetworkError could generate that we wouldn't want to blacklist?

We also may want to do this check on the other side and only send the AuxErrorInfoPB if the posix code matches (see my other comments)


http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc@882
PS12, Line 882: ECONNREFUSED
I think you can specify these directly instead of having to use the actual number value if you "#include <cerrno>"


http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc@892
PS12, Line 892: break;
Was it your intention to not continue through the loop if the first RPCErrorInfoPB that gets to this point doesn't have a matching posix_code? Maybe move this break into the 'if' above?


http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/runtime-state.h
File be/src/runtime/runtime-state.h:

http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/runtime-state.h@439
PS12, Line 439:   struct AuxErrorInfo {
Why not just use AuxErrorInfoPB directly?



-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 12
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Fri, 13 Dec 2019 18:56:59 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 10:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/14677/10//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14677/10//COMMIT_MSG@18
PS10, Line 18: the target node.
> Does this open up the possibility that a failure in S will
 > blacklist other healthy nodes? Are there errors in the S->T RPC
 > that we should ignore for the purpose of blacklisting?

That's a good point. I would say, for the purposes of this initial patch, we should differentiate between errors from the rpc layer, i.e. returned by the RpcController, vs. errors with running the rpc itself, i.e. returned in the TransmitDataResponsePB. As a first pass, probably we only want to blacklist stuff when there are errors from the rpc layer, we can think about other errors in later patches (fwiw this is also how the existing blacklisting for failed Exec() rpcs works)


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.h
File be/src/runtime/runtime-state.h:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.h@322
PS10, Line 322:   AuxErrorInfoPB* aux_error_info() {
> Does this ever need to be non-const or can we return a const pointer?
Another option to improve encapsulation would be to do what we do with the error map, i.e. make a SetAuxErrorInfo(AuxErrorInfoPB*)


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.cc
File be/src/runtime/runtime-state.cc:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.cc@317
PS10, Line 317:   NetworkAddressPB* network_addr = new NetworkAddressPB();
> Can you remove the unguarded allocations? You can either replace them with 
Yeah, strong preference for 'mutable_*' over the other options.



-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 10
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Wed, 11 Dec 2019 18:57:18 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 14: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 14
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 19 Dec 2019 22:23:25 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 8:

Added a new test and fixed a bug.


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 8
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 03 Dec 2019 00:25:05 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137, IMPALA-9138: Mark failed RPC as retryable, add dst node to blacklist

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPC as retryable, add dst node to blacklist
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/4996/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Sat, 09 Nov 2019 22:57:10 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Thomas Tauber-Marshall, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14677

to look at the new patch set (#8).

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................

IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Introduces a new optional field to ReportExecStatusRequestPB:
StatusAuxInfoPB. StatusAuxInfoPB contains optional metadata associated
with ReportExecStatusRequestPB::overall_status. Currently,
StatusAuxInfoPB only contains one field: RPCErrorMessagePB, which is only
set if the ReportExecStatusRequestPB::fragment_instance_id failed
because a RPC to another impalad failed. The RPCErrorMessagePB contains
the destination node of the failed RPC.

Coordinator::UpdateBackendExecStatus(ReportExecStatusRequestPB, ...)
uses the information in RPCErrorMessagePB (if one is set) to blacklist
the target node.

Testing:
* Ran core tests
* Added new test to test_blacklist.py

Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
---
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/krpc-data-stream-sender.cc
M be/src/runtime/query-state.cc
M be/src/runtime/runtime-state.h
A be/src/runtime/status-aux-info.h
M be/src/util/network-util.cc
M be/src/util/network-util.h
M common/protobuf/common.proto
M common/protobuf/control_service.proto
M tests/custom_cluster/test_blacklist.py
11 files changed, 189 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/14677/8
-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 8
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Thomas Tauber-Marshall, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14677

to look at the new patch set (#4).

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist
......................................................................

IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist

Introduces two optional fields to TStatus: TStatusProperties and
TRetryInfo. TStatusProperties is a struct of properties used to describe
the TStatus object. TRetryInfo is a struct containing all error
information related to query retries. It currently has only one field:
TRPCErrorMessage, which is set if the TStatus was created as the result
of a failed RPC call. It contains the TNetworkAddress of the destination
node of the failed RPC. When a Status object is created, SetIsRetryable()
can be used to mark the Status as retryable and
SetRPCErrorMessage(RPCErrorMessage) can be used to add a TRPCErrorMessage
to a Status object.

When the Coordinator updates the state of an Impala backend
(Coordinator::UpdateBackendExecStatus), if it receives a Status where
Status::HasRPCErrorMessage() is true, it takes the RPCErrorMessage
destination node, and adds it to the blacklist.

Currently, if a Status is marked as retryable (Status::SetIsRetryable()),
nothing happens. The change is simply meant to lay the groundwork
for future changes in IMPALA-9124.

Only RPC failures in KrpcDataStreamSender are marked as retryable and
have a RPCErrorMessage set.

Re-factored the Thrift files a bit and added a Common.thrift file for
all commonly used Thrift structures. Started doing some re-factoring of
Status to break up all the information encapsulated in TErrorCode.

Testing:
* Ran core tests
* Planning to add more tests after IMPALA-8138 is merged

Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
---
M be/generated-sources/gen-cpp/CMakeLists.txt
M be/src/common/status.cc
M be/src/common/status.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/krpc-data-stream-sender.cc
M be/src/service/client-request-state.cc
M be/src/util/container-util.h
M be/src/util/error-util.h
M common/protobuf/common.proto
M common/thrift/CMakeLists.txt
M common/thrift/CatalogObjects.thrift
M common/thrift/CatalogService.thrift
A common/thrift/Common.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/StatestoreService.thrift
M common/thrift/Status.thrift
M common/thrift/Types.thrift
M common/thrift/generate_error_codes.py
M tests/custom_cluster/test_custom_statestore.py
M tests/statestore/test_statestore.py
22 files changed, 318 insertions(+), 36 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/14677/4
-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 4
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Thomas Tauber-Marshall, Lars Volker, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14677

to look at the new patch set (#12).

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................

IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Introduces a new optional field to FragmentInstanceExecStatusPB:
AuxErrorInfoPB. AuxErrorInfoPB contains optional metadata associated
with a failed fragment instance. Currently, AuxErrorInfoPB only contains
one field: RPCErrorInfoPB, which is only set if the fragment failed
because a RPC to another impalad failed. The RPCErrorInfoPB contains
the destination node of the failed RPC and the posix error code of the
failed RPC.

Coordinator::UpdateBackendExecStatus(ReportExecStatusRequestPB, ...)
uses the information in RPCErrorInfoPB (if one is set) to blacklist
the target node. While RPCErrorInfoPB::dest_node can be set to the address
of the Coordinator, the Coordinator will not blacklist itself. The
Coordinator only blacklists the node if the RPC failed with a specific
error code (currently either 107, 108, 111).

Testing:
* Ran core tests
* Added new test to test_blacklist.py

Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
---
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/krpc-data-stream-sender.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/util/network-util.cc
M be/src/util/network-util.h
M common/protobuf/common.proto
M common/protobuf/control_service.proto
M tests/custom_cluster/test_blacklist.py
11 files changed, 237 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/14677/12
-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 12
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 12:

(4 comments)

Updated patch. Now Impala only blacklists a target node of a RPC if that node failed with a specific posix error code: 107, 108, or 111. I decided on these values purely based on what happens to a running query when an Impalad executor crashes in the middle of execution. We can consider adding more to the list later.

I had to re-write the test though. It looks like the debug action in IMPALA-8138 causes KRPC calls to fail with a "RemoteError", but in practice KRPC will actually fail with a "NetworkError" (the error types seem to be a Kudu Status feature). If you actually kill an Impala executor while a query is running, KRPC throws a "NetworkError" since the TCP connection probably got killed. I made blacklisting specific to "NetworkError"s. I think this makes more sense since KRPC actually returns a "NetworkError" when the connection dies, and you probably don't want to blacklist a node due to a "RemoteError" (at least I don't know of a failure scenario where you would want to).

http://gerrit.cloudera.org:8080/#/c/14677/11/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14677/11/be/src/runtime/coordinator.cc@826
PS11, Line 826:     backend_resource_state_->MarkBackendFinished(backend_state, &releasable_backends);
> line too long (92 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/14677/11/be/src/runtime/coordinator.cc@831
PS11, Line 831:         backend_released_barrier_.Notify();
> line too long (98 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/14677/11/be/src/runtime/krpc-data-stream-sender.cc
File be/src/runtime/krpc-data-stream-sender.cc:

http://gerrit.cloudera.org:8080/#/c/14677/11/be/src/runtime/krpc-data-stream-sender.cc@406
PS11, Line 406:     return rpc_status_;
note: this turned out to be redundant so removed it


http://gerrit.cloudera.org:8080/#/c/14677/10/common/protobuf/control_service.proto
File common/protobuf/control_service.proto:

http://gerrit.cloudera.org:8080/#/c/14677/10/common/protobuf/control_service.proto@137
PS10, Line 137: led fragment instance. Use
> Fwiw ClusterMembershipMgr::BlacklistExecutor already takes care of making s
Sounds good.



-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 12
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 12 Dec 2019 19:58:28 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 9:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5190/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 9
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Dec 2019 03:26:22 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Thomas Tauber-Marshall, Lars Volker, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14677

to look at the new patch set (#11).

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................

IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Introduces a new optional field to FragmentInstanceExecStatusPB:
AuxErrorInfoPB. AuxErrorInfoPB contains optional metadata associated
with a failed fragment instance. Currently, AuxErrorInfoPB only contains
one field: RPCErrorInfoPB, which is only set if the fragment failed
because a RPC to another impalad failed. The RPCErrorInfoPB contains
the destination node of the failed RPC.

Coordinator::UpdateBackendExecStatus(ReportExecStatusRequestPB, ...)
uses the information in RPCErrorInfoPB (if one is set) to blacklist
the target node. While RPCErrorInfoPB::dest_node can be set to the address
of the Coordinator, the Coordinator will not blacklist itself.

Testing:
* Ran core tests
* Added new test to test_blacklist.py

Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
---
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/krpc-data-stream-sender.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/util/network-util.cc
M be/src/util/network-util.h
M common/protobuf/common.proto
M common/protobuf/control_service.proto
M tests/custom_cluster/test_blacklist.py
11 files changed, 181 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/14677/11
-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 11
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5182/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 8
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 03 Dec 2019 00:53:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................

IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Introduces a new optional field to FragmentInstanceExecStatusPB:
AuxErrorInfoPB. AuxErrorInfoPB contains optional metadata associated
with a failed fragment instance. Currently, AuxErrorInfoPB only contains
one field: RPCErrorInfoPB, which is only set if the fragment failed
because a RPC to another impalad failed. The RPCErrorInfoPB contains
the destination node of the failed RPC and the posix error code of the
failed RPC.

Coordinator::UpdateBackendExecStatus(ReportExecStatusRequestPB, ...)
uses the information in RPCErrorInfoPB (if one is set) to blacklist
the target node. While RPCErrorInfoPB::dest_node can be set to the address
of the Coordinator, the Coordinator will not blacklist itself. The
Coordinator only blacklists the node if the RPC failed with a specific
error code (currently either ENOTCONN, ECONNREFUSED, ESHUTDOWN).

Testing:
* Ran core tests
* Added new test to test_blacklist.py

Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Reviewed-on: http://gerrit.cloudera.org:8080/14677
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/krpc-data-stream-sender.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/util/network-util.cc
M be/src/util/network-util.h
M common/protobuf/common.proto
M common/protobuf/control_service.proto
M tests/custom_cluster/test_blacklist.py
11 files changed, 230 insertions(+), 0 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 15
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14677

to look at the new patch set (#3).

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist
......................................................................

IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist

Introduces two optional fields to TStatus: TErrorType and
TRPCErrorMessage. TErrorType introduces a notion of "types" to TStatus
objects. For now there are only two types, GENERAL and RETRYABLE.
TRPCErrorMessage is set if the TStatus was created as the result of a
failed RPC call. It contains the TNetworkAddress of the destination node
of the failed RPC. When a Status object is created, SetIsRetryable() can
be used to mark the Status as retryable and SetRPCErrorMsg(RPCErrorMsg)
can be used to add a TRPCErrorMessage to a Status object.

When the Coordinator updates the state of an Impala backend
(Coordinator::UpdateBackendExecStatus), if it receives a Status where
Status::HasRPCErrorMsg() is true, it takes the RPCErrorMsg destination
node, and adds it to the blacklist.

Currently, if a Status is marked as retryable (Status::IsRetryable() ==
true), nothing happens. The change is simply meant to lay the groundwork
for future changes.

Only RPC failures in KrpcDataStreamSender are marked as retryable and
have a RPCErrorMsg set.

Re-factored the Thrift files a bit and added a Common.thrift file for
all commonly used Thrift structures.

Testing:
* Ran core tests
* Planning to add more tests after IMPALA-8138 is merged

Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
---
M be/generated-sources/gen-cpp/CMakeLists.txt
M be/src/common/status.cc
M be/src/common/status.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/krpc-data-stream-sender.cc
M be/src/util/container-util.h
M be/src/util/error-util.h
M common/protobuf/common.proto
M common/thrift/CMakeLists.txt
M common/thrift/CatalogObjects.thrift
M common/thrift/CatalogService.thrift
A common/thrift/Common.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/StatestoreService.thrift
M common/thrift/Status.thrift
M common/thrift/Types.thrift
M tests/custom_cluster/test_custom_statestore.py
M tests/statestore/test_statestore.py
20 files changed, 208 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/14677/3
-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 14:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5363/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 14
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 19 Dec 2019 22:23:26 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 13: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 13
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 19 Dec 2019 19:36:02 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5059/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 4
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 19 Nov 2019 03:54:12 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Thomas Tauber-Marshall, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14677

to look at the new patch set (#5).

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist
......................................................................

IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist

Introduces two optional fields to TStatus: TStatusProperties and
TRetryInfo. TStatusProperties is a struct of properties used to describe
the TStatus object. TRetryInfo is a struct containing all error
information related to query retries. It currently has only one field:
TRPCErrorMessage, which is set if the TStatus was created as the result
of a failed RPC call. It contains the TNetworkAddress of the destination
node of the failed RPC. When a Status object is created, SetIsRetryable()
can be used to mark the Status as retryable and
SetRPCErrorMessage(RPCErrorMessage) can be used to add a TRPCErrorMessage
to a Status object.

When the Coordinator updates the state of an Impala backend
(Coordinator::UpdateBackendExecStatus), if it receives a Status where
Status::HasRPCErrorMessage() is true, it takes the RPCErrorMessage
destination node, and adds it to the blacklist.

Currently, if a Status is marked as retryable (Status::SetIsRetryable()),
nothing happens. The change is simply meant to lay the groundwork
for future changes in IMPALA-9124.

Only RPC failures in KrpcDataStreamSender are marked as retryable and
have a RPCErrorMessage set.

Re-factored the Thrift files a bit and added a Common.thrift file for
all commonly used Thrift structures. Started doing some re-factoring of
Status to break up all the information encapsulated in TErrorCode.

Testing:
* Ran core tests
* Planning to add more tests after IMPALA-8138 is merged

Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
---
M be/generated-sources/gen-cpp/CMakeLists.txt
M be/src/common/status.cc
M be/src/common/status.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/krpc-data-stream-sender.cc
M be/src/service/client-request-state.cc
M be/src/util/container-util.h
M be/src/util/error-util.h
M common/protobuf/common.proto
M common/thrift/CMakeLists.txt
M common/thrift/CatalogObjects.thrift
M common/thrift/CatalogService.thrift
A common/thrift/Common.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/StatestoreService.thrift
M common/thrift/Status.thrift
M common/thrift/Types.thrift
M common/thrift/generate_error_codes.py
M tests/custom_cluster/test_custom_statestore.py
M tests/statestore/test_statestore.py
22 files changed, 320 insertions(+), 36 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/14677/5
-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5018/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 14 Nov 2019 01:30:28 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14677/10//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14677/10//COMMIT_MSG@18
PS10, Line 18: the target node.
> For coordinator C and executors S and T, this means that if S fails to send
S won't blacklist T since blacklists are only for Coordinators.
Yeah, I think there is a concern of where to assign "blame" when an RPC fails. If an RPC from S to T fails, this patch currently blacklists T, but it is possible that the actual issue is with S. This is potentially problematic, if an RPC from S to T fails and the issue is with S, then T will be blacklisted for no reason, and the unhealthy node S will still be part of the cluster.
I've done a few experiments where I kill T and errors from S are:

 ERROR: TransmitData() to 10.65.30.141:27000 failed: Network error: recv got EOF from 10.65.30.141:27000 (error 108)
 ERROR: TransmitData() to 10.65.29.251:27000 failed: Network error: recv error from 0.0.0.0:0: Transport endpoint is not connected (error 107)
 ERROR: TransmitData() to 10.65.26.254:27000 failed: Network error: Client connection negotiation failed: client connection to 10.65.26.254:27000: connect: Connection refused (error 111)
 ERROR: EndDataStream() to 127.0.0.1:27002 failed: Network error: recv error from 0.0.0.0:0: Transport endpoint is not connected (error 107)

Maybe there is a way to make the blacklisting more specific to these exact errors. Let me see what I can find.



-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 10
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Wed, 11 Dec 2019 19:01:53 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist
......................................................................


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Common.thrift
File common/thrift/Common.thrift:

http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Common.thrift@1
PS3, Line 1: // Licensed to the Apache Software Foundation (ASF) under one
> Types.thrift contains a bunch of thrift structs that aren't relevant to Sta
Not really sure I see any benefit - I guess it very slightly reduces recompilation time when people make changes to Types.thrift that don't affect Status.thrift, though how frequent is that really, and its not really any cleaner imo since now we have to answer the question of what belongs in Common vs. Types, which isn't clear - and its nice to avoid unnecessary code churn, but obviously not that big of a deal.

Imo, if anything the refactoring that we need is to make the thrift and protobuf definitions of this stuff more consistent, i.e. add a status.proto instead of defining it in common.proto, but again not necessarily worth the code churn.


http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift
File common/thrift/Status.thrift:

http://gerrit.cloudera.org:8080/#/c/14677/3/common/thrift/Status.thrift@48
PS3, Line 48: 
> Agree those use cases are important. I think the current mechanism will wor
Agreed that a string->string map was a bad suggestion, just throwing stuff out there.

Idk about this RetryInfo approach, though. I'm concerned that it doesn't really map to the existing semantics of what a Status is (a small object that indicates success/failure with a message about the cause of failure), and that as a result going this direction is awkward and potentially error-prone.

For example, you seem to be assuming that at any random point in the code we can construct a Status object with a RetryInfo as part of it, and that will be bubbled up to become the ultimate query status that is returned to the coordinator. 

However, its currently a common pattern for code to call some function, get an error status back, and then instead of actually returning that Status object it constructs a new Status object, usually with an error message that contains the error message of the original error Status. In a case like this, if the original Status has a RetryInfo it was probably lost when the new Status was created.

Of course, we can go through and make sure that all of the code paths where we generate RetryInfos do in fact bubble that Status up to become the overall query status, but again that seems error prone and means that the semantics of Status objects are sort of different in different places, i.e. sometimes a Status is just a potential indicator of some error that can be handled and discarded by the calling code, sometimes its an <Important_Query_Status> that must be preserved and returned to the coordinator.



-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Nov 2019 23:09:37 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Thomas Tauber-Marshall, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14677

to look at the new patch set (#7).

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................

IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Introduces a new optional field to ReportExecStatusRequestPB:
StatusAuxInfoPB. StatusAuxInfoPB contains optional metadata associated
with ReportExecStatusRequestPB::overall_status. Currently,
StatusAuxInfoPB only contains one field: RPCErrorMessagePB, which is only
set if the ReportExecStatusRequestPB::fragment_instance_id failed
because a RPC to another impalad failed. The RPCErrorMessagePB contains
the destination node of the failed RPC.

Coordinator::UpdateBackendExecStatus(ReportExecStatusRequestPB, ...)
uses the information in RPCErrorMessagePB (if one is set) to blacklist
the target node.

Testing:
* Ran core tests
* Added new test to test_blacklist.py

Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
---
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/krpc-data-stream-sender.cc
M be/src/runtime/query-state.cc
M be/src/runtime/runtime-state.h
A be/src/runtime/status-aux-info.h
M be/src/util/network-util.cc
M be/src/util/network-util.h
M common/protobuf/common.proto
M common/protobuf/control_service.proto
M tests/custom_cluster/test_blacklist.py
11 files changed, 191 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/14677/7
-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 7
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 12:

> Right, but that RPC has already been retried multiple times, right?
 > I should probably validate that for these specific error codes, the
 > RPCs are actually retried.

Depends on the particular RPC. When we blacklist for a failed Exec() rpc, it will never have been retried. For TransmitData(), it will only be retried if the error that came back is "server too busy"

 > So depending on an external auto-scaling policy, its possible that
 > blacklisting a node (even for 12 seconds) makes the executor group
 > unhealthy, and causes the auto-scaler to create a new executor
 > group (a potentially expensive operation).

If auto-scaling takes into account blacklisting, I would probably say that's the wrong approach and we should think about changing it to only consider the statestore view of cluster membership.

 > In general, I think we can be pretty test driven here. The best way
 > to figure out what additional errors codes to add would be to run
 > some network fault injection tests and see what error codes Impala
 > fails with. I'll open a follow up JIRA for that.

Definitely agree that we need to put some work into realistic fault testing, and not just rely on our intuition about blacklisting/retry policy.

 > I generally wanted to avoid doing this. I think it would be better
 > for only the Coordinator to make the ultimate decision about
 > whether a node is blacklisted. KrpcDataStreamSender just provides
 > enough information for the Coordinator to make the blacklisting
 > decision. Otherwise the logic for determining if a node should be
 > blacklisted will be spread across multiple classes, making it hard
 > to reason about blacklisting decisions.

I don't think it would be so bad to add a utility function like IsRetriableRpcError(kudu::Status) or whatever to keep the logic consistent, but I don't feel strongly about how exactly we arrange this, as long as we're getting the correct interaction between blacklisting and retrying.

 > However, I think you do bring up a good. We don't want a failed RPC
 > to trigger a query retry unless we are actually blacklisting the
 > target node. After thinking through it some more, maybe marking
 > queries as "retryable" isn't necessary at all. Instead, whenever a
 > query fails + blacklists a node, it is retried. So query retries
 > are blacklisting driven. This would mean abandoning the patch for
 > IMPALA-9138 (Classify certain errors as retryable), which I'm fine
 > with doing. Let me know if you feel differently.
 
I'm definitely interested in approaches to retries that don't require marking statuses retriable (obviously I've had a lot of concerns with that approach). I could see some problems with this approach too, such as that we may have situations in the future where we want to retry queries without doing any blacklisting (eg. if we think there was stale metadata and replanning the query will let it succeed), or probably worse - that you could have a query that fails due to a failed rpc and reports that the node should be blacklisted, and then also hits another error that isn't retry-able but we retry it anyways. Not sure how common a situation like that would be, though.

 > > but there's still the problem that this patch will only blacklist
 > > one node per query.
 > 
 > Yeah, maybe this is too conservative. Assuming transparent query
 > retries support "x" number of retries, Impala should be able to
 > tolerate "x" faults across the duration of the query (and its
 > retries). Maybe that is okay, maybe not. Don't have strong feelings
 > here.
 
I think its probably fine to leave as is for now, until we have more confidence that blacklisting is really working how we want it to.


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 12
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Mon, 16 Dec 2019 18:37:11 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5181/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 7
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 03 Dec 2019 00:53:12 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 13:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5319/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 13
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 19 Dec 2019 04:20:32 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14677/4/be/src/common/status.cc
File be/src/common/status.cc:

http://gerrit.cloudera.org:8080/#/c/14677/4/be/src/common/status.cc@315
PS4, Line 315:         msg_->SetRPCErrorMessage(RPCErrorMessage(status.retry_info.rpc_error_msg.dest_node));
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/14677/4/be/src/service/client-request-state.cc
File be/src/service/client-request-state.cc:

http://gerrit.cloudera.org:8080/#/c/14677/4/be/src/service/client-request-state.cc@906
PS4, Line 906:         "Restarting of fetch requires enabling of query result caching.")).SetIsRecoverable();
line too long (94 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 4
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 19 Nov 2019 03:09:06 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPCs as retryable and update blacklist
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5060/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 19 Nov 2019 04:04:03 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has uploaded a new patch set (#9). ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................

IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Introduces a new optional field to ReportExecStatusRequestPB:
StatusAuxInfoPB. StatusAuxInfoPB contains optional metadata associated
with ReportExecStatusRequestPB::overall_status. Currently,
StatusAuxInfoPB only contains one field: RPCErrorMessagePB, which is only
set if the ReportExecStatusRequestPB::fragment_instance_id failed
because a RPC to another impalad failed. The RPCErrorMessagePB contains
the destination node of the failed RPC.

Coordinator::UpdateBackendExecStatus(ReportExecStatusRequestPB, ...)
uses the information in RPCErrorMessagePB (if one is set) to blacklist
the target node.

Testing:
* Ran core tests
* Added new test to test_blacklist.py

Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
---
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/krpc-data-stream-sender.cc
M be/src/runtime/query-state.cc
M be/src/runtime/runtime-state.h
A be/src/runtime/status-aux-info.h
M be/src/util/network-util.cc
M be/src/util/network-util.h
M common/protobuf/common.proto
M common/protobuf/control_service.proto
M tests/custom_cluster/test_blacklist.py
11 files changed, 189 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/14677/9
-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 9
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 10:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/14677/10//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14677/10//COMMIT_MSG@18
PS10, Line 18: the target node.
For coordinator C and executors S and T, this means that if S fails to send data to T, it will report back to C, and C will blacklist T, right? In that case, is it guaranteed that S will also blacklist T? Does this open up the possibility that a failure in S will blacklist other healthy nodes? Are there errors in the S->T RPC that we should ignore for the purpose of blacklisting?


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.h
File be/src/runtime/coordinator.h:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.h@252
PS10, Line 252:   /// TNetworkAddress.
mention ownership, nullptr (see above)


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@209
PS10, Line 209:     if (UNLIKELY(!addr_to_backend_state_.insert(addr_to_backend_state).second)) {
you can use emplace() to construct the element in place.

insert()/emplace() return an pair<iterator,bool>. It's not clear to me how this gets evaluated in the conditional, I think it would be better to assign them explicitly.


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@815
PS10, Line 815:       for (auto status_iter = request.instance_exec_status().begin();
Can you use a range-based loop?


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@826
PS10, Line 826:               addr_to_backend_state_.at(NetworkAddressPBToTNetworkAddress(dest_node));
This will throw an exception if the element does not exist (which is almost certainly not what we want). Instead you could just []-access the element and check for nullptr, or use find() != end(), and issue a warning in either case.


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@829
PS10, Line 829:           blacklisted_node = true;
Could just break; the loop here. I can't see where blacklisted_node is used otherwise, too.


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.h
File be/src/runtime/runtime-state.h:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.h@322
PS10, Line 322:   AuxErrorInfoPB* aux_error_info() {
Does this ever need to be non-const or can we return a const pointer?


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.h@444
PS10, Line 444:   /// query_status_ != Status::OK().
Mention who owns this and who cleans it up? We should not rely on the rest of the code to call set_allocated_* eventually but instead we should make it a unique_ptr if we want to preserve the lazy allocation semantics.


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.cc
File be/src/runtime/runtime-state.cc:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.cc@317
PS10, Line 317:   NetworkAddressPB* network_addr = new NetworkAddressPB();
Can you remove the unguarded allocations? You can either replace them with unique_ptr, or assemble the objects top down using mutable_rpc_error_info() etc. Alternatively for this one and then one in L320 you can move the new into the set_allocate_* call (but then you might as well use mutable_*.

The one in 316 needs to go because the ownership leaves the scope of this function.


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/util/network-util.h
File be/src/util/network-util.h:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/util/network-util.h@72
PS10, Line 72: TNetworkAddress NetworkAddressPBToTNetworkAddress(const NetworkAddressPB& address);
I think FromNetworkAddressPB() would be descriptive enough since the calling code will show the target type (similar to MakeNetworkAddress above).


http://gerrit.cloudera.org:8080/#/c/14677/10/common/protobuf/control_service.proto
File common/protobuf/control_service.proto:

http://gerrit.cloudera.org:8080/#/c/14677/10/common/protobuf/control_service.proto@137
PS10, Line 137: RPC to another node failed
I think here it would be good to mention that this is between two executors, not the coordinator and an executor


http://gerrit.cloudera.org:8080/#/c/14677/10/tests/custom_cluster/test_blacklist.py
File tests/custom_cluster/test_blacklist.py:

http://gerrit.cloudera.org:8080/#/c/14677/10/tests/custom_cluster/test_blacklist.py@121
PS10, Line 121:     """Verifies that an RPC failure causes the target node to be blacklisted. The
Can you add a test that checks that a failure between the two non-coordinator executors in the cluster gets one of them blacklisted?



-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 10
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Wed, 11 Dec 2019 18:33:50 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9137, IMPALA-9138: Mark failed RPC as retryable, add dst node to blacklist

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137, IMPALA-9138: Mark failed RPC as retryable, add dst node to blacklist
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/4994/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Sat, 09 Nov 2019 01:47:49 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 10:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5232/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 10
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <mi...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Mon, 09 Dec 2019 20:23:20 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14677/8/tests/custom_cluster/test_blacklist.py
File tests/custom_cluster/test_blacklist.py:

http://gerrit.cloudera.org:8080/#/c/14677/8/tests/custom_cluster/test_blacklist.py@139
PS8, Line 139: e
flake8: E501 line too long (99 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 8
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Comment-Date: Tue, 03 Dec 2019 00:25:10 +0000
Gerrit-HasComments: Yes