Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2020/12/04 19:04:12 UTC

[GitHub] [ozone] bharatviswa504 opened a new pull request #1659: HDDS-4329. Expose Ratis retry config cache in OM.

bharatviswa504 opened a new pull request #1659:
URL: https://github.com/apache/ozone/pull/1659


   ## What changes were proposed in this pull request?
   
   Expose the Ratis retry cache configuration in OM. This follows the new config approach and reuses the same config name as the RatisServer config.
   In addition, with this Jira, any config key that starts with the prefix "ozone.om.ha" followed by a RatisServer config name is passed through to the OM RaftServer. (This helps when a Ratis setting needs tuning but OM has no dedicated config for it.)
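
The prefix pass-through described above can be sketched as follows. This is an illustration only, not the code in this PR: plain maps stand in for the OM and Ratis configuration objects, the trailing dot on the prefix is an assumption, and the key names and values in `main` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: plain maps stand in for the OM and Ratis configuration objects. */
final class OmRatisPrefixSketch {
  // Prefix named in the PR description; the trailing dot is an assumption.
  static final String OM_HA_PREFIX = "ozone.om.ha.";

  /** Copies every key under the OM HA prefix, minus the prefix, into a new map. */
  static Map<String, String> toRatisProperties(Map<String, String> omConf) {
    Map<String, String> ratis = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : omConf.entrySet()) {
      if (e.getKey().startsWith(OM_HA_PREFIX)) {
        ratis.put(e.getKey().substring(OM_HA_PREFIX.length()), e.getValue());
      }
    }
    return ratis;
  }

  public static void main(String[] args) {
    Map<String, String> omConf = new LinkedHashMap<>();
    // Hypothetical keys and values, used only for illustration.
    omConf.put("ozone.om.ha.raft.server.retrycache.expirytime", "300s");
    omConf.put("ozone.om.address", "om1:9862");
    System.out.println(toRatisProperties(omConf));
  }
}
```

Keys outside the prefix are left untouched, so only settings a user deliberately prefixes would reach the RaftServer in this sketch.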
   
   
   The default value is 5 minutes. With the default leader election timeout and a retry count of 15, this should be more than enough to catch retried requests and answer them from the cache.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-4329
   
   ## How was this patch tested?
   
   Added a test
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] bharatviswa504 edited a comment on pull request #1659: HDDS-4329. Expose Ratis retry config cache in OM.

Posted by GitBox <gi...@apache.org>.

bharatviswa504 edited a comment on pull request #1659:
URL: https://github.com/apache/ozone/pull/1659#issuecomment-740095543

> @bharatviswa504 Thanks for working on this! The changes look good to me.
> What is the total retry duration for a om client request? I mean what is the retry policy used right now. The cache duration should be >= time for which a request can be retried.

We currently have 15 retries, and if the same OM is contacted again after a failure, the wait duration doubles.
The doubling does not apply to the first round of failovers, while we are still looking for the leader.

For any other errors, we do not sleep between retries.

So basically, the first 3 attempts to find the leader OM involve no wait. Once we have detected the leader, if the client fails over to a new OM, we also do not wait. So in the best case the total is less than a minute (for any network error we fail over immediately). If we continuously get LeaderNotReady, we keep contacting the same OM after the first round, and the total wait is (2 + 4 + 8 + 16 + ... + 2^12) = 8190 seconds. This is the worst case: the OM took that long to become ready after failover and only became ready in the last iteration.

So, in the worst case, we wait for roughly 136 minutes (8190 seconds). But setting the cache duration that high would take too much heap on a well-behaved OM. In most cases, when OMs go up or down and the same OM is not contacted again, the total time will be less than a minute.
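
As a quick check of the arithmetic, the doubling series can be summed directly. This is a sketch of the calculation only, not the OM client's retry code; the class and method names are made up for illustration, and the unit is assumed to be seconds.

```java
/** Sketch only: sums the wait series quoted in the comment, not the OM client's code. */
final class DoublingWaitSketch {
  /** Returns base + 2*base + 4*base + ... for the given number of waits. */
  static long totalWaitSeconds(long baseSeconds, int waits) {
    long total = 0;
    long wait = baseSeconds;
    for (int i = 0; i < waits; i++) {
      total += wait;
      wait *= 2;
    }
    return total;
  }

  public static void main(String[] args) {
    // 12 waits starting at 2 seconds: 2 + 4 + ... + 2^12.
    long seconds = totalWaitSeconds(2, 12);
    System.out.println(seconds + " s = " + (seconds / 60.0) + " min");
  }
}
```

Summing the series as quoted gives 8190 seconds, which is 136.5 minutes.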

If we want to change this, changing the wait between retries from 2 to 1 might help. The total time in the worst case would then come to (1 + 2 + 3 + ... + 12) = 78 seconds, which is under 2 minutes.

cc @hanishakoneru for her thoughts, as she originally implemented this wait between retries.


[GitHub] [ozone] bharatviswa504 edited a comment on pull request #1659: HDDS-4329. Expose Ratis retry config cache in OM.

Posted by GitBox <gi...@apache.org>.

bharatviswa504 edited a comment on pull request #1659:
URL: https://github.com/apache/ozone/pull/1659#issuecomment-740095543

We currently have 15 retries, and if the same OM is contacted again after a failure, the wait duration increases linearly.
The increase does not apply to the first round of failovers, while we are still looking for a leader.

For any other errors, we do not sleep between retries.

So basically, the first 3 attempts to find the leader OM involve no wait. Once we have detected the leader, if the client fails over to a new OM, we also do not wait. So in the best case the total is less than a minute (for any network error we fail over immediately). If we continuously get LeaderNotReady, we keep contacting the same OM after the first round, and the total wait is (0 + 2 + 4 + 6 + 8 + ... + 22) = 132 seconds. This is the worst case: the OM took that long to become ready after failover and only became ready in the last iteration.

So, in the worst case, we wait for 132 seconds, plus a few extra milliseconds or seconds for the first 3 retries.
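
The linear series, and the reviewer's condition that the cache duration be at least the retry window, can be checked with a short sketch. The class and method names are hypothetical, and the 2-second step and 12 waits are read off the series above rather than taken from the client code.

```java
import java.time.Duration;

/** Sketch only: models the series in the comment, not the OM client's retry policy. */
final class LinearWaitSketch {
  /** Sums 0 + step + 2*step + ... over the given number of waits. */
  static Duration worstCaseWait(Duration step, int waits) {
    Duration total = Duration.ZERO;
    for (int i = 0; i < waits; i++) {
      total = total.plus(step.multipliedBy(i));
    }
    return total;
  }

  /** True when the cache keeps entries at least as long as retries can last. */
  static boolean cacheCoversRetries(Duration cacheExpiry, Duration retryWindow) {
    return cacheExpiry.compareTo(retryWindow) >= 0;
  }

  public static void main(String[] args) {
    // 12 waits with a 2-second step: 0 + 2 + 4 + ... + 22.
    Duration worst = worstCaseWait(Duration.ofSeconds(2), 12);
    System.out.println(worst.getSeconds() + " s");
    // Compare against the 5-minute default proposed in the PR description.
    System.out.println(cacheCoversRetries(Duration.ofMinutes(5), worst));
  }
}
```

With these inputs the worst case is 132 seconds, so a 5-minute cache expiry covers it with margin.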

Thank you @hanishakoneru for correcting my calculation.


[GitHub] [ozone] lokeshj1703 commented on a change in pull request #1659: HDDS-4329. Expose Ratis retry config cache in OM.

Posted by GitBox <gi...@apache.org>.

lokeshj1703 commented on a change in pull request #1659:
URL: https://github.com/apache/ozone/pull/1659#discussion_r537476613



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOzoneManagerHAMetadataOnly.java
##########
@@ -422,6 +422,24 @@ public void testOMRetryCache() throws Exception {
     Assert.assertFalse(logCapturer.getOutput().contains("created volume:"
         + volumeName));
 
+    //Sleep for little above seconds to get cache clear.
+    Thread.sleep(65000);

Review comment:
       Can we use a smaller cache duration?





[GitHub] [ozone] lokeshj1703 commented on pull request #1659: HDDS-4329. Expose Ratis retry config cache in OM.

Posted by GitBox <gi...@apache.org>.

lokeshj1703 commented on pull request #1659:
URL: https://github.com/apache/ozone/pull/1659#issuecomment-740819937


   Sorry! Missed the update. I think we are good with 132 seconds of total retry duration.

