Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2021/02/19 07:33:21 UTC

[GitHub] [ozone] Xushaohong opened a new pull request #1942: HDDS-4754. Make scm heartbeat rpc retry interval configurable

Xushaohong opened a new pull request #1942:
URL: https://github.com/apache/ozone/pull/1942


   ## What changes were proposed in this pull request?
   Background:
   The DN's current retry policy is to resend with a fixed 1s interval. If all DNs lose their connection to the SCM at the same moment, for example because the SCM restarts, they will all send container reports to the SCM at nearly the same time, causing a container report storm.
   
   Solution:
   On a very large cluster, manually tuning the rpc-retry-interval together with the rpc-retry-count can mitigate extreme cases such as OOM.
   Make the rpc-retry-interval configurable. 
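
As a rough illustration of the idea, the sketch below reads the retry count and retry interval from configuration and applies them in a fixed-sleep retry loop. It is a self-contained stand-in, not the patch itself: the class name, the `example.*` key names, the map-based configuration, and the defaults are assumptions made for this example, and the loop only imitates the behaviour of `RetryPolicies.retryUpToMaximumCountWithFixedSleep`.

```java
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class RetryIntervalSketch {
  // Hypothetical key names and defaults; the real keys are not shown in this thread.
  static final String RETRY_COUNT_KEY = "example.scm.rpc.retry.count";
  static final String RETRY_INTERVAL_KEY = "example.scm.rpc.retry.interval.ms";
  static final int RETRY_COUNT_DEFAULT = 15;
  static final long RETRY_INTERVAL_DEFAULT_MS = 1000L;

  // Returns the configured retry count, or the default when the key is absent.
  static int getRetryCount(Map<String, String> conf) {
    String value = conf.get(RETRY_COUNT_KEY);
    return value == null ? RETRY_COUNT_DEFAULT : Integer.parseInt(value.trim());
  }

  // Returns the configured retry interval in milliseconds, or the 1s default.
  static long getRetryIntervalMs(Map<String, String> conf) {
    String value = conf.get(RETRY_INTERVAL_KEY);
    return value == null ? RETRY_INTERVAL_DEFAULT_MS : Long.parseLong(value.trim());
  }

  // Calls once, then retries up to maxRetries times, sleeping sleepMs between attempts.
  static <T> T callWithRetry(Supplier<T> call, int maxRetries, long sleepMs) {
    int retries = 0;
    while (true) {
      try {
        return call.get();
      } catch (RuntimeException e) {
        if (retries >= maxRetries) {
          throw e; // retries exhausted, surface the last failure
        }
        retries++;
        sleepQuietly(sleepMs);
      }
    }
  }

  private static void sleepQuietly(long sleepMs) {
    try {
      TimeUnit.MILLISECONDS.sleep(sleepMs);
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting to retry", ie);
    }
  }
}
```

With an empty configuration this keeps the existing 1s behaviour; an operator of a large cluster could raise the interval so that reconnecting DNs put less pressure on a restarting SCM.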
   
   
   ## What is the link to the Apache JIRA
   https://issues.apache.org/jira/browse/HDDS-4754
   
   
   ## How was this patch tested?
   CI
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] GlenGeng commented on pull request #1942: HDDS-4754. Make scm heartbeat rpc retry interval configurable

Posted by GitBox <gi...@apache.org>.
GlenGeng commented on pull request #1942:
URL: https://github.com/apache/ozone/pull/1942#issuecomment-783048419


   +1
   
   Thanks @Xushaohong for the work. Thanks @linyiqun for the review.
   
   This is preliminary work for the SCM OOM issue. A future proposal is to throttle ongoing reports on both the SCM side and the DN side, e.g., 1) the SCM drops reports if it has queued too many, and 2) the DN reduces the number of reports by recording a lease for its Container (recommended by @xiaoyuyao).
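
To make option 1) concrete, here is a minimal sketch of a bounded queue that rejects reports once it is full instead of growing without limit. Everything in it is assumed for illustration: the class name, the use of `String` as a stand-in for a report, and the drop counter are hypothetical, and this is not how the SCM queues reports today.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

final class BoundedReportQueueSketch {
  private final BlockingQueue<String> queue;
  private final AtomicLong dropped = new AtomicLong();

  BoundedReportQueueSketch(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  // Enqueues the report if there is room; otherwise drops it and counts the drop.
  boolean submit(String report) {
    boolean accepted = queue.offer(report); // offer() never blocks, it returns false when full
    if (!accepted) {
      dropped.incrementAndGet();
    }
    return accepted;
  }

  // Removes and returns the oldest queued report, or null when the queue is empty.
  String poll() {
    return queue.poll();
  }

  int size() {
    return queue.size();
  }

  long droppedCount() {
    return dropped.get();
  }
}
```

The trade-off in such a design is that a dropped report has to be resent later, so the DN side would need to tolerate a rejected report.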




[GitHub] [ozone] Xushaohong commented on pull request #1942: HDDS-4754. Make scm heartbeat rpc retry interval configurable

Posted by GitBox <gi...@apache.org>.
Xushaohong commented on pull request #1942:
URL: https://github.com/apache/ozone/pull/1942#issuecomment-783053620


   @runzhiwang pls take a look and help merge :)




[GitHub] [ozone] linyiqun commented on a change in pull request #1942: HDDS-4754. Make scm heartbeat rpc retry interval configurable

Posted by GitBox <gi...@apache.org>.
linyiqun commented on a change in pull request #1942:
URL: https://github.com/apache/ozone/pull/1942#discussion_r579617532



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/SCMConnectionManager.java
##########
@@ -151,8 +152,8 @@ public void addSCMServer(InetSocketAddress address) throws IOException {
 
       RetryPolicy retryPolicy =
           RetryPolicies.retryUpToMaximumCountWithFixedSleep(
-              getScmRpcRetryCount(conf),
-              1000, TimeUnit.MILLISECONDS);
+              getScmRpcRetryCount(conf), getScmRpcRetryInterval(conf),

Review comment:
       Can we just reuse the default DN heartbeat interval (HddsConfigKeys#HDDS_HEARTBEAT_INTERVAL_DEFAULT, 30s) rather than defining a new RPC retry interval here? Would that be a better way?
   






[GitHub] [ozone] Xushaohong commented on pull request #1942: HDDS-4754. Make scm heartbeat rpc retry interval configurable

Posted by GitBox <gi...@apache.org>.
Xushaohong commented on pull request #1942:
URL: https://github.com/apache/ozone/pull/1942#issuecomment-781925701


   Please take a look @GlenGeng 




[GitHub] [ozone] linyiqun commented on a change in pull request #1942: HDDS-4754. Make scm heartbeat rpc retry interval configurable

Posted by GitBox <gi...@apache.org>.
linyiqun commented on a change in pull request #1942:
URL: https://github.com/apache/ozone/pull/1942#discussion_r579625598



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/SCMConnectionManager.java
##########
@@ -151,8 +152,8 @@ public void addSCMServer(InetSocketAddress address) throws IOException {
 
       RetryPolicy retryPolicy =
           RetryPolicies.retryUpToMaximumCountWithFixedSleep(
-              getScmRpcRetryCount(conf),
-              1000, TimeUnit.MILLISECONDS);
+              getScmRpcRetryCount(conf), getScmRpcRetryInterval(conf),

Review comment:
       Okay, got it.
   There is another place in this class that can also be updated to use getScmRpcRetryInterval(conf). Can you update it ([SCMConnectionManager.java#L200](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/SCMConnectionManager.java#L200))?
   ```java
     /**
      * Adds a new Recon server to the set of endpoints.
      * @param address Recon address.
      * @throws IOException
      */
     public void addReconServer(InetSocketAddress address) throws IOException {
       LOG.info("Adding Recon Server : {}", address.toString());
       writeLock();
       try {
         if (scmMachines.containsKey(address)) {
           LOG.warn("Trying to add an existing SCM Machine to Machines group. " +
               "Ignoring the request.");
           return;
         }
         Configuration hadoopConfig =
             LegacyHadoopConfigurationSource.asHadoopConfiguration(this.conf);
         RPC.setProtocolEngine(hadoopConfig, ReconDatanodeProtocolPB.class,
             ProtobufRpcEngine.class);
         long version =
             RPC.getProtocolVersion(ReconDatanodeProtocolPB.class);
   
         RetryPolicy retryPolicy =
             RetryPolicies.retryUpToMaximumCountWithFixedSleep(
                 getScmRpcRetryCount(conf),
                 1000, TimeUnit.MILLISECONDS);  // <====== hard-coded interval
   ...
   }
   ```
   






[GitHub] [ozone] runzhiwang merged pull request #1942: HDDS-4754. Make scm heartbeat rpc retry interval configurable

Posted by GitBox <gi...@apache.org>.
runzhiwang merged pull request #1942:
URL: https://github.com/apache/ozone/pull/1942


   




[GitHub] [ozone] Xushaohong closed pull request #1942: HDDS-4754. Make scm heartbeat rpc retry interval configurable

Posted by GitBox <gi...@apache.org>.
Xushaohong closed pull request #1942:
URL: https://github.com/apache/ozone/pull/1942


   




[GitHub] [ozone] Xushaohong commented on a change in pull request #1942: HDDS-4754. Make scm heartbeat rpc retry interval configurable

Posted by GitBox <gi...@apache.org>.
Xushaohong commented on a change in pull request #1942:
URL: https://github.com/apache/ozone/pull/1942#discussion_r579621355



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/SCMConnectionManager.java
##########
@@ -151,8 +152,8 @@ public void addSCMServer(InetSocketAddress address) throws IOException {
 
       RetryPolicy retryPolicy =
           RetryPolicies.retryUpToMaximumCountWithFixedSleep(
-              getScmRpcRetryCount(conf),
-              1000, TimeUnit.MILLISECONDS);
+              getScmRpcRetryCount(conf), getScmRpcRetryInterval(conf),

Review comment:
       > Can we just reuse the default DN heartbeat interval (HddsConfigKeys#HDDS_HEARTBEAT_INTERVAL_DEFAULT, 30s) rather than defining a new RPC retry interval here? Would that be a better way?
   
   The retry interval is currently only 1 second, so that the DN can reconnect to the SCM quickly. The default heartbeat interval may be too long.
   Actually, the retry count has no effect, since the DatanodeStateMachine keeps retrying after the 15 retries are exhausted.
   The current retry policy still seems to need changing.
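
The point about the retry count can be illustrated with a simplified model. The sketch below is not the DatanodeStateMachine code; the class, method, and parameter names are hypothetical, and sleeps are left out. It only counts attempts when an outer loop starts a fresh bounded round each time the inner retry policy gives up.

```java
import java.util.function.BooleanSupplier;

final class OuterLoopRetrySketch {
  // Counts attempts made until rpc succeeds or maxRounds rounds have been used.
  // Each round is one initial attempt plus up to maxRetries retries.
  static int attemptsUntilSuccess(BooleanSupplier rpc, int maxRetries, int maxRounds) {
    int attempts = 0;
    for (int round = 0; round < maxRounds; round++) {
      for (int i = 0; i <= maxRetries; i++) {
        attempts++;
        if (rpc.getAsBoolean()) {
          return attempts;
        }
      }
      // The inner policy is exhausted here, but the outer loop simply starts again.
    }
    return attempts;
  }
}
```

In this model the retry count only decides where one round ends and the next begins, while the interval between attempts is what determines how often the SCM is contacted.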
   



