Posted to issues@ozone.apache.org by "runzhiwang (Jira)" <ji...@apache.org> on 2021/02/25 07:22:00 UTC

[jira] [Resolved] (HDDS-4754) A restarted SCM quickly go OOM due to ContainerReport Storm from DN cluster.

     [ https://issues.apache.org/jira/browse/HDDS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runzhiwang resolved HDDS-4754.
------------------------------
    Resolution: Fixed

> A restarted SCM quickly go OOM due to ContainerReport Storm from DN cluster.
> ----------------------------------------------------------------------------
>
>                 Key: HDDS-4754
>                 URL: https://issues.apache.org/jira/browse/HDDS-4754
>             Project: Apache Ozone
>          Issue Type: Improvement
>            Reporter: runzhiwang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: 企业微信截图_1611734015772.png
>
>
> During Tencent's monthly upgrade, we restart all DNs first, then stop the SCM, wait for a while, and start it again. The SCM goes OOM shortly after it starts.
> The DN's current retry policy is to retry sending at a fixed 1s interval. When the SCM restarts, all DNs lose their connection to it at the same moment, so once it is back they all send their container reports at nearly the same time. The result is a ContainerReport storm.
> We propose changing the retry policy that datanodes use to connect to the SCM.
> {code:java}
> public void addSCMServer(InetSocketAddress address) throws IOException {
>   writeLock();
>   try {
>     if (scmMachines.containsKey(address)) {
>       LOG.warn("Trying to add an existing SCM Machine to Machines group. " +
>           "Ignoring the request.");
>       return;
>     }
>     Configuration hadoopConfig =
>         LegacyHadoopConfigurationSource.asHadoopConfiguration(this.conf);
>     RPC.setProtocolEngine(
>         hadoopConfig,
>         StorageContainerDatanodeProtocolPB.class,
>         ProtobufRpcEngine.class);
>     long version =
>         RPC.getProtocolVersion(StorageContainerDatanodeProtocolPB.class);
>     RetryPolicy retryPolicy =
>         RetryPolicies.retryUpToMaximumCountWithFixedSleep(
>             getScmRpcRetryCount(conf),
>             1000, TimeUnit.MILLISECONDS);
> {code}
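
One common way to keep clients from retrying in lockstep is exponential backoff with random jitter, in place of the fixed 1000 ms sleep shown above. The sketch below is plain Java and purely illustrative: the class BackoffSketch, its method names, and the 1 s base / 60 s cap are assumptions made for the example, not Ozone code and not necessarily the change that resolved this issue.

```java
import java.util.Random;

public class BackoffSketch {

  /** Upper bound of the sleep window for an attempt: min(cap, base * 2^attempt). */
  static long ceilingMillis(int attempt, long baseMillis, long capMillis) {
    // Clamp the exponent so the shift stays in range for any realistic base.
    int shift = Math.min(attempt, 20);
    return Math.min(capMillis, baseMillis << shift);
  }

  /** "Full jitter": a uniformly random sleep in [0, ceiling] milliseconds. */
  static long delayMillis(int attempt, long baseMillis, long capMillis, Random rng) {
    long ceiling = ceilingMillis(attempt, baseMillis, capMillis);
    // nextDouble() is in [0, 1), so the result is in [0, ceiling].
    return (long) (rng.nextDouble() * (ceiling + 1));
  }

  public static void main(String[] args) {
    Random rng = new Random();
    long baseMillis = 1000;   // assumed: same 1 s starting interval as today
    long capMillis = 60_000;  // assumed: never wait longer than 60 s
    for (int attempt = 0; attempt < 8; attempt++) {
      System.out.printf("attempt %d: sleep %d ms (window 0..%d ms)%n",
          attempt,
          delayMillis(attempt, baseMillis, capMillis, rng),
          ceilingMillis(attempt, baseMillis, capMillis));
    }
  }
}
```

Because each datanode draws its own random delay from a growing window, reconnect attempts spread out over time instead of arriving at the SCM together every second. Jitter only spreads the connection attempts; it does not by itself bound the size of the reports the SCM must hold in memory.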



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
