You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2021/05/14 05:08:16 UTC

[GitHub] [ozone] bharatviswa504 opened a new pull request #2247: Hdds 5216: Fix race condition causing failOverProxy which is causing failover wrongly.

bharatviswa504 opened a new pull request #2247:
URL: https://github.com/apache/ozone/pull/2247


   ## What changes were proposed in this pull request?
   
   In OM SCM client is shared across RpcHandler threads.
   Where we have observed that failOver across multiple threads causing failover to happen incorrectly on same SCM and is exhausting retry count.
   
   And also one thing I have observed is
   
   If we observe the error is no route to scm3, but retry happened on scm1/172.31.0.9:9863
   ```
   2021-05-11 05:59:53,202 [IPC Server handler 10 on default port 9862] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.NoRouteToHostException: No Route to Host from  om1/172.31.0.11 to scm3:9863 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see:  http://wiki.apache.org/hadoop/NoRouteToHost, while invoking $Proxy32.send over nodeId=scm1,nodeAddress=scm1/172.31.0.9:9863 after 9 failover attempts. Trying to failover after sleeping for 2000ms.
   If we observe the error is no route to scm3, but retry happened on scm1/172.31.0.9:9863
   ```
   If we observe the error is no route to scm3, but retry happened on scm1/172.31.0.9:9863
   ```
   2021-05-11 05:59:59,345 [IPC Server handler 10 on default port 9862] WARN ipc.Client: Address change detected. Old: scm3/172.31.0.5:9863 New: scm3:9863
   2021-05-11 05:59:59,347 [IPC Server handler 10 on default port 9862] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.NoRouteToHostException: No Route to Host from  om1/172.31.0.11 to scm3:9863 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see:  http://wiki.apache.org/hadoop/NoRouteToHost, while invoking $Proxy32.send over nodeId=scm2,nodeAddress=scm2/172.31.0.6:9863 after 10 failover attempts. Trying to failover after sleeping for 2000ms.
   ```
   This is because our performFailOver is a no-op and if failOver is needed we update currentSCMProxyNodeID in shouldRetry in RetryPolicy. 
   
   **For example**
   2 Threads contacted SCM3, and got NoRouteToHostException, so shouldRetry from first thread will move the currentSCMProxyNodeID to scm1 and other thread, after this move currentSCMProxyNodeID to scm2. 
   
   Hadoop Proxy RetryInvocationHandler already takes care of if there is another thread trying to perform failOver it will not call performFailOver again. We shall see below WARN message, and get the currentProxy and contact that node.
   
   **om3_1       | 2021-05-14 05:04:28,699 [IPC Server handler 34 on default port 9862] WARN retry.RetryInvocationHandler: A failover has occurred since the start of call #24329 $Proxy32.send over nodeId=scm3,nodeAddress=scm3/192.168.0.6:9863**
   
   Solution here is to use performFailOver to update scmNodeID instead of using shouldRetry to update currentSCMProxyNodeID.
   
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-5216
   
   ## How was this patch tested?
   Tested locally, and now observed that it will not perform failOver again and exhausting retry counts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] bharatviswa504 edited a comment on pull request #2247: Hdds 5216: Fix race condition causing SCM failOverProxy which is causing failover wrongly.

Posted by GitBox <gi...@apache.org>.
bharatviswa504 edited a comment on pull request #2247:
URL: https://github.com/apache/ozone/pull/2247#issuecomment-841990339


   > Overall LGTM, minor suggestions inline regarding the comments.
   > Thank you @bharatviswa504 for the improvement.
   
   Thank You @dineshchitlangia for the review. I have incorporated your comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] bharatviswa504 merged pull request #2247: HDDS-5216. Fix race condition causing SCM failOverProxy which is causing failover wrongly.

Posted by GitBox <gi...@apache.org>.
bharatviswa504 merged pull request #2247:
URL: https://github.com/apache/ozone/pull/2247


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] bharatviswa504 commented on pull request #2247: Hdds 5216: Fix race condition causing SCM failOverProxy which is causing failover wrongly.

Posted by GitBox <gi...@apache.org>.
bharatviswa504 commented on pull request #2247:
URL: https://github.com/apache/ozone/pull/2247#issuecomment-841973701






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] dineshchitlangia commented on a change in pull request #2247: Hdds 5216: Fix race condition causing SCM failOverProxy which is causing failover wrongly.

Posted by GitBox <gi...@apache.org>.
dineshchitlangia commented on a change in pull request #2247:
URL: https://github.com/apache/ozone/pull/2247#discussion_r633219987



##########
File path: hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/scm/proxy/SCMContainerLocationFailoverProxyProvider.java
##########
@@ -63,8 +63,15 @@
   private final Map<String, SCMProxyInfo> scmProxyInfoMap;
   private List<String> scmNodeIds;
 
-  private String currentProxySCMNodeId;
-  private int currentProxyIndex;
+  // As when SCM Client is shared across threads, performFailOver method
+  // updates the currentProxySCMNodeId based on updateLeaderNodeId which is
+  // updated in shouldRetry. So, when 2 threads run parallel for
+  // one of the thread performFailOver is not called which is taken care by
+  // RetryInvocationHandler by checking expectedFailOverCount. So other thread
+  // shall not call performFailOver it will call getProxy() where we use
+  // currentProxySCMNodeId and return proxy.

Review comment:
       ```suggestion
     // As SCM Client is shared across threads, performFailOver()
     // updates the currentProxySCMNodeId based on the updateLeaderNodeId which is
     // updated in shouldRetry(). When 2 or more threads run in parallel, the
     // RetryInvocationHandler will check the expectedFailOverCount
     // and not execute performFailOver() for one of them. So the other thread(s)
     // shall not call performFailOver(), it will call getProxy() which uses
     // currentProxySCMNodeId and returns the proxy.
   ```

##########
File path: hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/scm/proxy/SCMBlockLocationFailoverProxyProvider.java
##########
@@ -61,21 +61,28 @@
   private Map<String, SCMProxyInfo> scmProxyInfoMap;
   private List<String> scmNodeIds;
 
-  private String currentProxySCMNodeId;
-  private int currentProxyIndex;
+  // As when SCM Client is shared across threads, performFailOver method
+  // updates the currentProxySCMNodeId based on the updateLeaderNodeId which is
+  // updated in shouldRetry. So, when 2 threads run parallel for
+  // one of the thread performFailOver is not called which is taken care by
+  // RetryInvocationHandler by checking expectedFailOverCount. So other thread
+  // shall not call performFailOver it will call getProxy() where we use
+  // currentProxySCMNodeId and return proxy.

Review comment:
       ```suggestion
     // As SCM Client is shared across threads, performFailOver()
     // updates the currentProxySCMNodeId based on the updateLeaderNodeId which is
     // updated in shouldRetry(). When 2 or more threads run in parallel, the
     // RetryInvocationHandler will check the expectedFailOverCount
     // and not execute performFailOver() for one of them. So the other thread(s)
     // shall not call performFailOver(), it will call getProxy() which uses
     // currentProxySCMNodeId and returns the proxy.
   ```

##########
File path: hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/scm/proxy/SCMSecurityProtocolFailoverProxyProvider.java
##########
@@ -64,8 +64,16 @@
 
   private List<String> scmNodeIds;
 
-  private String currentProxySCMNodeId;
-  private int currentProxyIndex;
+  // As when SCM Client is shared across threads, performFailOver method
+  // updates the currentProxySCMNodeId based on updateLeaderNodeId which is
+  // updated in shouldRetry. So, when 2 threads run parallel for
+  // one of the thread performFailOver is not called which is taken care by
+  // RetryInvocationHandler by checking expectedFailOverCount. So other thread
+  // shall not call performFailOver it will call getProxy() where we use
+  // currentProxySCMNodeId and return proxy.

Review comment:
       ```suggestion
     // As SCM Client is shared across threads, performFailOver()
     // updates the currentProxySCMNodeId based on the updateLeaderNodeId which is
     // updated in shouldRetry(). When 2 or more threads run in parallel, the
     // RetryInvocationHandler will check the expectedFailOverCount
     // and not execute performFailOver() for one of them. So the other thread(s)
     // shall not call performFailOver(), it will call getProxy() which uses
     // currentProxySCMNodeId and returns the proxy.
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] bharatviswa504 commented on pull request #2247: HDDS-5216. Fix race condition causing SCM failOverProxy which is causing failover wrongly.

Posted by GitBox <gi...@apache.org>.
bharatviswa504 commented on pull request #2247:
URL: https://github.com/apache/ozone/pull/2247#issuecomment-842210346


   Thank You @bshashikant and @dineshchitlangia for the review.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] bharatviswa504 merged pull request #2247: HDDS-5216. Fix race condition causing SCM failOverProxy which is causing failover wrongly.

Posted by GitBox <gi...@apache.org>.
bharatviswa504 merged pull request #2247:
URL: https://github.com/apache/ozone/pull/2247


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] bharatviswa504 commented on pull request #2247: Hdds 5216: Fix race condition causing SCM failOverProxy which is causing failover wrongly.

Posted by GitBox <gi...@apache.org>.
bharatviswa504 commented on pull request #2247:
URL: https://github.com/apache/ozone/pull/2247#issuecomment-841973701


   Thank You @bshashikant for the review.
   I have updated code with code comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] bharatviswa504 edited a comment on pull request #2247: HDDS-5216. Fix race condition causing SCM failOverProxy which is causing failover wrongly.

Posted by GitBox <gi...@apache.org>.
bharatviswa504 edited a comment on pull request #2247:
URL: https://github.com/apache/ozone/pull/2247#issuecomment-842210346


   Thank You @bshashikant and @dineshchitlangia for the review.
   
   @dineshchitlangia I have incorporated your comments on **code comments** so went ahead and committed this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] bharatviswa504 commented on pull request #2247: Hdds 5216: Fix race condition causing SCM failOverProxy which is causing failover wrongly.

Posted by GitBox <gi...@apache.org>.
bharatviswa504 commented on pull request #2247:
URL: https://github.com/apache/ozone/pull/2247#issuecomment-841990339


   > Overall LGTM, minor suggestions inline regarding the comments.
   > Thank you @bharatviswa504 for the improvement.
   
   Thank You @dineshchitlangia incorporated your comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] dineshchitlangia commented on a change in pull request #2247: Hdds 5216: Fix race condition causing SCM failOverProxy which is causing failover wrongly.

Posted by GitBox <gi...@apache.org>.
dineshchitlangia commented on a change in pull request #2247:
URL: https://github.com/apache/ozone/pull/2247#discussion_r633219987



##########
File path: hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/scm/proxy/SCMContainerLocationFailoverProxyProvider.java
##########
@@ -63,8 +63,15 @@
   private final Map<String, SCMProxyInfo> scmProxyInfoMap;
   private List<String> scmNodeIds;
 
-  private String currentProxySCMNodeId;
-  private int currentProxyIndex;
+  // As when SCM Client is shared across threads, performFailOver method
+  // updates the currentProxySCMNodeId based on updateLeaderNodeId which is
+  // updated in shouldRetry. So, when 2 threads run parallel for
+  // one of the thread performFailOver is not called which is taken care by
+  // RetryInvocationHandler by checking expectedFailOverCount. So other thread
+  // shall not call performFailOver it will call getProxy() where we use
+  // currentProxySCMNodeId and return proxy.

Review comment:
       ```suggestion
     // As SCM Client is shared across threads, performFailOver()
     // updates the currentProxySCMNodeId based on the updateLeaderNodeId which is
     // updated in shouldRetry(). When 2 or more threads run in parallel, the
     // RetryInvocationHandler will check the expectedFailOverCount
     // and not execute performFailOver() for one of them. So the other thread(s)
     // shall not call performFailOver(), it will call getProxy() which uses
     // currentProxySCMNodeId and returns the proxy.
   ```

##########
File path: hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/scm/proxy/SCMBlockLocationFailoverProxyProvider.java
##########
@@ -61,21 +61,28 @@
   private Map<String, SCMProxyInfo> scmProxyInfoMap;
   private List<String> scmNodeIds;
 
-  private String currentProxySCMNodeId;
-  private int currentProxyIndex;
+  // As when SCM Client is shared across threads, performFailOver method
+  // updates the currentProxySCMNodeId based on the updateLeaderNodeId which is
+  // updated in shouldRetry. So, when 2 threads run parallel for
+  // one of the thread performFailOver is not called which is taken care by
+  // RetryInvocationHandler by checking expectedFailOverCount. So other thread
+  // shall not call performFailOver it will call getProxy() where we use
+  // currentProxySCMNodeId and return proxy.

Review comment:
       ```suggestion
     // As SCM Client is shared across threads, performFailOver()
     // updates the currentProxySCMNodeId based on the updateLeaderNodeId which is
     // updated in shouldRetry(). When 2 or more threads run in parallel, the
     // RetryInvocationHandler will check the expectedFailOverCount
     // and not execute performFailOver() for one of them. So the other thread(s)
     // shall not call performFailOver(), it will call getProxy() which uses
     // currentProxySCMNodeId and returns the proxy.
   ```

##########
File path: hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/scm/proxy/SCMSecurityProtocolFailoverProxyProvider.java
##########
@@ -64,8 +64,16 @@
 
   private List<String> scmNodeIds;
 
-  private String currentProxySCMNodeId;
-  private int currentProxyIndex;
+  // As when SCM Client is shared across threads, performFailOver method
+  // updates the currentProxySCMNodeId based on updateLeaderNodeId which is
+  // updated in shouldRetry. So, when 2 threads run parallel for
+  // one of the thread performFailOver is not called which is taken care by
+  // RetryInvocationHandler by checking expectedFailOverCount. So other thread
+  // shall not call performFailOver it will call getProxy() where we use
+  // currentProxySCMNodeId and return proxy.

Review comment:
       ```suggestion
     // As SCM Client is shared across threads, performFailOver()
     // updates the currentProxySCMNodeId based on the updateLeaderNodeId which is
     // updated in shouldRetry(). When 2 or more threads run in parallel, the
     // RetryInvocationHandler will check the expectedFailOverCount
     // and not execute performFailOver() for one of them. So the other thread(s)
     // shall not call performFailOver(), it will call getProxy() which uses
     // currentProxySCMNodeId and returns the proxy.
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] bharatviswa504 edited a comment on pull request #2247: Hdds 5216: Fix race condition causing SCM failOverProxy which is causing failover wrongly.

Posted by GitBox <gi...@apache.org>.
bharatviswa504 edited a comment on pull request #2247:
URL: https://github.com/apache/ozone/pull/2247#issuecomment-841990339






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org