Posted to issues@ozone.apache.org by "Glen Geng (Jira)" <ji...@apache.org> on 2020/09/04 07:14:00 UTC
[jira] [Resolved] (HDDS-4186) Adjust RetryPolicy of SCMConnectionManager for SCM/Recon
[ https://issues.apache.org/jira/browse/HDDS-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Glen Geng resolved HDDS-4186.
-----------------------------
Resolution: Fixed
> Adjust RetryPolicy of SCMConnectionManager for SCM/Recon
> --------------------------------------------------------
>
> Key: HDDS-4186
> URL: https://issues.apache.org/jira/browse/HDDS-4186
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: Ozone Datanode
> Reporter: Glen Geng
> Assignee: Glen Geng
> Priority: Critical
> Labels: pull-request-available
>
> *The problem is:*
> If you set up one Recon and one SCM and then shut down the Recon server, all Datanodes soon become stale/dead on the SCM side.
>
> *The root cause is:*
> The Datanode's current RetryPolicy for SCM is retryForeverWithFixedSleep:
> {code:java}
> RetryPolicy retryPolicy =
>     RetryPolicies.retryForeverWithFixedSleep(
>         1000, TimeUnit.MILLISECONDS);
> StorageContainerDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
>     StorageContainerDatanodeProtocolPB.class, version,
>     address, UserGroupInformation.getCurrentUser(), hadoopConfig,
>     NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
>     retryPolicy).getProxy();
> {code}
> while that for Recon is retryUpToMaximumCountWithFixedSleep:
> {code:java}
> RetryPolicy retryPolicy =
>     RetryPolicies.retryUpToMaximumCountWithFixedSleep(10,
>         60000, TimeUnit.MILLISECONDS);
> ReconDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
>     ReconDatanodeProtocolPB.class, version,
>     address, UserGroupInformation.getCurrentUser(), hadoopConfig,
>     NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
>     retryPolicy).getProxy();
> {code}
>
> The executorService in DatanodeStateMachine is Executors.newFixedThreadPool(...), whose default pool size is 2: one thread for Recon, another for SCM.
>
> Upon an RPC failure, call() of RegisterEndpointTask, VersionEndpointTask and HeartbeatEndpointTask retries while holding rpcEndpoint.lock(). For example:
> {code:java}
> public EndpointStateMachine.EndPointStates call() throws Exception {
>   rpcEndpoint.lock();
>   try {
>     ....
>     SCMHeartbeatResponseProto response = rpcEndpoint.getEndPoint()
>         .sendHeartbeat(request);
>     ....
>   } finally {
>     rpcEndpoint.unlock();
>   }
>   return rpcEndpoint.getState();
> }
> {code}
>
> If Recon is down, the thread running the Recon task keeps retrying on RPC failure while holding the lock of the Recon EndpointStateMachine. When DatanodeStateMachine schedules the next round of SCM/Recon tasks, the only remaining thread is assigned the Recon task and blocks waiting for that same lock, so the SCM task never runs.
> {code:java}
> public EndpointStateMachine.EndPointStates call() throws Exception {
>   rpcEndpoint.lock();
>   ...
> {code}
>
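The starvation above can be reproduced with the JDK alone. This is a minimal sketch (all names hypothetical, no Ozone classes): a 2-thread pool, one task that "retries forever" while holding the endpoint lock, a second task for the same endpoint that consumes the last thread by blocking on that lock, and a third (SCM) task that then sits in the queue forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class EndpointStarvation {

    static boolean scmTaskRan() throws InterruptedException {
        ReentrantLock reconLock = new ReentrantLock();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CountDownLatch scmRan = new CountDownLatch(1);

        // Round 1: the Recon task acquires the lock and "retries forever".
        pool.submit(() -> {
            reconLock.lock();
            try {
                Thread.sleep(Long.MAX_VALUE); // stands in for retryForeverWithFixedSleep
            } catch (InterruptedException ignored) {
            } finally {
                reconLock.unlock();
            }
        });
        Thread.sleep(100); // let round 1 grab the lock

        // Round 2: the next Recon task blocks on the same lock,
        // occupying the second (and last) pool thread...
        pool.submit(() -> {
            reconLock.lock();
            reconLock.unlock();
        });
        Thread.sleep(100);

        // ...so the SCM task just sits in the queue and never starts.
        pool.submit(scmRan::countDown);

        boolean ran = scmRan.await(500, TimeUnit.MILLISECONDS);
        pool.shutdownNow();
        return ran;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("SCM task ran: " + scmTaskRan()); // prints "SCM task ran: false"
    }
}
```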
> *The solution is:*
> Since DatanodeStateMachine periodically schedules the SCM/Recon tasks anyway, we can adjust the RetryPolicy so that a task won't retry for longer than 1 minute.
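Because the state machine re-runs the task on the next round, any policy that caps total retry time bounds how long the endpoint lock is held. A minimal sketch of that idea using only the JDK (the helper name and signature are hypothetical, not the actual Hadoop RetryPolicies API):

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class BoundedRetry {
    // Retry op with a fixed sleep, but give up once maxMillis has elapsed,
    // so a caller never holds its endpoint lock for more than ~maxMillis.
    public static <T> T callWithDeadline(Supplier<T> op, long sleepMillis,
                                         long maxMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + maxMillis;
        while (true) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                if (System.currentTimeMillis() + sleepMillis >= deadline) {
                    throw e; // budget spent: fail now, release the lock, retry next round
                }
                TimeUnit.MILLISECONDS.sleep(sleepMillis);
            }
        }
    }
}
```

With sleepMillis=1000 and maxMillis=60000 this mirrors the intent of the fix: a Recon outage costs at most one minute per round instead of blocking a pool thread indefinitely.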
>
> *The change has no side effects:*
> 1) VersionEndpointTask.call() is fine.
> 2) RegisterEndpointTask.call() queries containerReport, nodeReport and pipelineReports from OzoneContainer, which is fine.
> 3) HeartbeatEndpointTask.call() will putBackReports(), which is fine.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org