Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2019/12/20 18:25:00 UTC

[jira] [Commented] (IMPALA-9137) Blacklist node if a DataStreamService RPC to the node fails

    [ https://issues.apache.org/jira/browse/IMPALA-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001071#comment-17001071 ] 

ASF subversion and git services commented on IMPALA-9137:
---------------------------------------------------------

Commit 8a4fececcf8e9599978cc1a532386b8e924838ed in impala's branch refs/heads/master from Sahil Takiar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=8a4fece ]

IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails

Introduces a new optional field to FragmentInstanceExecStatusPB:
AuxErrorInfoPB. AuxErrorInfoPB contains optional metadata associated
with a failed fragment instance. Currently, AuxErrorInfoPB only contains
one field: RPCErrorInfoPB, which is only set if the fragment failed
because an RPC to another impalad failed. The RPCErrorInfoPB contains
the destination node of the failed RPC and the POSIX error code of the
failed RPC.
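
The nesting described above can be pictured with a small Python model. This
is only an illustrative sketch, not the actual protobuf definitions: the
class names mirror the message names from the commit, but NetworkAddress,
posix_error_code, and the other field names (apart from dest_node) are
assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NetworkAddress:
    """Hypothetical stand-in for the address of an impalad."""
    hostname: str
    port: int


@dataclass(frozen=True)
class RPCErrorInfo:
    """Models RPCErrorInfoPB: where the failed RPC was going and why it failed."""
    dest_node: NetworkAddress
    posix_error_code: int  # assumed field name for the POSIX error code


@dataclass(frozen=True)
class AuxErrorInfo:
    """Models AuxErrorInfoPB: optional metadata about a failed fragment instance."""
    rpc_error_info: Optional[RPCErrorInfo] = None  # set only for RPC failures


@dataclass(frozen=True)
class FragmentInstanceExecStatus:
    """Models only the new optional field of FragmentInstanceExecStatusPB."""
    aux_error_info: Optional[AuxErrorInfo] = None


# A status report for a fragment that failed for a non-RPC reason
# carries no auxiliary error information.
plain_failure = FragmentInstanceExecStatus()

# A status report for a fragment whose RPC to another node was refused.
rpc_failure = FragmentInstanceExecStatus(
    aux_error_info=AuxErrorInfo(
        rpc_error_info=RPCErrorInfo(
            dest_node=NetworkAddress("10.65.26.254", 27000),
            posix_error_code=111,  # ECONNREFUSED on Linux
        )
    )
)
```

Keeping every level optional means that a report without RPC failure details
looks the same as it did before the new field was introduced.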

Coordinator::UpdateBackendExecStatus(ReportExecStatusRequestPB, ...)
uses the information in RPCErrorInfoPB (if one is set) to blacklist
the target node. While RPCErrorInfoPB::dest_node can be set to the address
of the Coordinator, the Coordinator will not blacklist itself. The
Coordinator only blacklists the node if the RPC failed with a specific
error code (currently one of ENOTCONN, ECONNREFUSED, or ESHUTDOWN).
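
The decision rule above can be sketched in a few lines of Python. This is a
simplified illustration under stated assumptions, not the code in
Coordinator::UpdateBackendExecStatus: the function name should_blacklist,
the (host, port) tuple representation, and the parameter names are all
hypothetical. Only the three error codes and the "never blacklist the
Coordinator itself" rule come from the commit message.

```python
import errno
from typing import Optional, Tuple

Address = Tuple[str, int]  # hypothetical (hostname, port) representation

# Error codes that currently trigger blacklisting, per the commit message.
BLACKLISTABLE_ERRORS = frozenset(
    {errno.ENOTCONN, errno.ECONNREFUSED, errno.ESHUTDOWN}
)


def should_blacklist(
    dest_node: Optional[Address],
    posix_error_code: Optional[int],
    coordinator: Address,
) -> bool:
    """Return True if the target of a failed RPC should be blacklisted."""
    # No RPC error info was attached to the status report.
    if dest_node is None or posix_error_code is None:
        return False
    # The destination may be the Coordinator, which never blacklists itself.
    if dest_node == coordinator:
        return False
    # Only specific connection-level failures count.
    return posix_error_code in BLACKLISTABLE_ERRORS


coordinator = ("10.65.30.1", 27000)
executor = ("10.65.26.254", 27000)

refused = should_blacklist(executor, errno.ECONNREFUSED, coordinator)
timed_out = should_blacklist(executor, errno.ETIMEDOUT, coordinator)
self_target = should_blacklist(coordinator, errno.ENOTCONN, coordinator)
```

Here refused is True, while timed_out and self_target are both False:
ETIMEDOUT is not in the listed set, and the Coordinator is skipped even when
the error code would otherwise qualify.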

Testing:
* Ran core tests
* Added new test to test_blacklist.py

Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Reviewed-on: http://gerrit.cloudera.org:8080/14677
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Blacklist node if a DataStreamService RPC to the node fails
> -----------------------------------------------------------
>
>                 Key: IMPALA-9137
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9137
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Backend
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>
> If a query fails because an RPC to a specific node failed, the query error message will be similar to one of the following:
> * {{ERROR: TransmitData() to 10.65.30.141:27000 failed: Network error: recv got EOF from 10.65.30.141:27000 (error 108)}}
> * {{ERROR: TransmitData() to 10.65.29.251:27000 failed: Network error: recv error from 0.0.0.0:0: Transport endpoint is not connected (error 107)}}
> * {{ERROR: TransmitData() to 10.65.26.254:27000 failed: Network error: Client connection negotiation failed: client connection to 10.65.26.254:27000: connect: Connection refused (error 111)}}
> * {{ERROR: EndDataStream() to 127.0.0.1:27002 failed: Network error: recv error from 0.0.0.0:0: Transport endpoint is not connected (error 107)}}
> RPCs are already retried, so it is likely that something is wrong with the target node. Perhaps it crashed or is so overloaded that it can't process RPC requests. In any case, the Impala Coordinator should blacklist the target of the failed RPC so that future queries don't fail with the same error.
> If the node crashed, the statestore will eventually remove the failed node from the cluster as well. However, the statestore can take a while to detect a failed node because it has a long timeout. The issue is that queries can still fail within the timeout window.
> Blacklisting is necessary for transparent query retries because if a node does crash, it will take too long for the statestore to remove the crashed node from the cluster. So any attempt at retrying a query will just fail.
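
For readers matching the error messages quoted above against the error codes
named in the commit, the following Python sketch pulls the RPC name, target
address, and error number out of such a message. It is a log-reading aid
only: Impala itself passes this information in RPCErrorInfoPB rather than
parsing strings. The regular expression, the function name parse_rpc_error,
and the lookup table are assumptions for the example, and the numeric values
are the Linux errno values.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "TransmitData() to 10.65.30.141:27000 failed: ... (error 108)".
_RPC_ERROR = re.compile(
    r"(?P<rpc>\w+)\(\) to (?P<host>[\d.]+):(?P<port>\d+) failed: "
    r".*\(error (?P<code>\d+)\)"
)

# Linux errno values seen in the example messages.
LINUX_ERRNO_NAMES = {
    107: "ENOTCONN",
    108: "ESHUTDOWN",
    111: "ECONNREFUSED",
}


def parse_rpc_error(message: str) -> Optional[Tuple[str, str, int, int]]:
    """Return (rpc_name, host, port, error_code), or None if nothing matches."""
    match = _RPC_ERROR.search(message)
    if match is None:
        return None
    return (
        match.group("rpc"),
        match.group("host"),
        int(match.group("port")),
        int(match.group("code")),
    )


example = (
    "ERROR: TransmitData() to 10.65.26.254:27000 failed: Network error: "
    "Client connection negotiation failed: client connection to "
    "10.65.26.254:27000: connect: Connection refused (error 111)"
)
parsed = parse_rpc_error(example)
error_name = LINUX_ERRNO_NAMES.get(parsed[3]) if parsed else None
```

For the example message, parsed is ("TransmitData", "10.65.26.254", 27000, 111)
and error_name is "ECONNREFUSED".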



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
