You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2022/05/13 07:26:48 UTC

[GitHub] [ozone] symious opened a new pull request, #3409: HDDS-6743. Specify leader node for OM failover

symious opened a new pull request, #3409:
URL: https://github.com/apache/ozone/pull/3409

   ## What changes were proposed in this pull request?
   
   Currently if clients first connect to a follower OM, the response show the OM is not leader but didn't specify the real Leader node.
   
   This ticket is to let the reply to contains the Leader OM so that clients can connect to Leader node more conveniently.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-6743
   
   ## How was this patch tested?
   
   unit test
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] kerneltime commented on pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by GitBox <gi...@apache.org>.
kerneltime commented on PR #3409:
URL: https://github.com/apache/ozone/pull/3409#issuecomment-1357247110

   The overall change looks good and would help debug as well. Some minor nits that need addressing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] neils-dev commented on a diff in pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by "neils-dev (via GitHub)" <gi...@apache.org>.
neils-dev commented on code in PR #3409:
URL: https://github.com/apache/ozone/pull/3409#discussion_r1105024511


##########
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerRatisServer.java:
##########
@@ -807,6 +810,57 @@ public RaftPeerId getRaftPeerId() {
     return this.raftPeerId;
   }
 
+  @VisibleForTesting
+  public RaftPeerId getRaftLeaderId() {
+    return this.getRaftLeader() == null ? null : this.getRaftLeader().getId();
+  }
+
+  @VisibleForTesting
+  public String getRaftLeaderAddress() {
+    if (this.getRaftLeader() == null) {

Review Comment:
   Perhaps consider using a local variable to hold the `RaftPeer` with a single call to `getRaftLeader()` at beginning of method.  Then using this var through rest of method instead of repeated calls to `getRaftLeader()`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] symious commented on pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by GitBox <gi...@apache.org>.
symious commented on PR #3409:
URL: https://github.com/apache/ozone/pull/3409#issuecomment-1298098889

   @kerneltime Still active I think, could you help to review the PR?
   I will resolve the conflictions later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] symious commented on a diff in pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by GitBox <gi...@apache.org>.
symious commented on code in PR #3409:
URL: https://github.com/apache/ozone/pull/3409#discussion_r1015149541


##########
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOzoneManagerHAWithFailover.java:
##########
@@ -62,4 +70,48 @@ public void testIncrementalWaitTimeWithSameNodeFailover() throws Exception {
     Assert.assertEquals((numTimesTriedToSameNode + 1) * waitBetweenRetries,
         omFailoverProxyProvider.getWaitTime());
   }
+

Review Comment:
   The test has been moved to another place.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] symious commented on a diff in pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by GitBox <gi...@apache.org>.
symious commented on code in PR #3409:
URL: https://github.com/apache/ozone/pull/3409#discussion_r1014961182


##########
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/exceptions/OMNotLeaderException.java:
##########
@@ -34,24 +34,33 @@ public class OMNotLeaderException extends IOException {
 
   private final String currentPeerId;
   private final String leaderPeerId;
+  private final String leaderAddress;
   private static final Pattern CURRENT_PEER_ID_PATTERN =
       Pattern.compile("OM:(.*) is not the leader[.]+.*", Pattern.DOTALL);
   private static final Pattern SUGGESTED_LEADER_PATTERN =
-      Pattern.compile(".*Suggested leader is OM:([^.]*).*", Pattern.DOTALL);
+      Pattern.compile(".*Suggested leader is OM:([^.]*)\\/(.*)\\.",
+          Pattern.DOTALL);
 
   public OMNotLeaderException(RaftPeerId currentPeerId) {
     super("OM:" + currentPeerId + " is not the leader. Could not " +
         "determine the leader node.");
     this.currentPeerId = currentPeerId.toString();
     this.leaderPeerId = null;
+    this.leaderAddress = null;
   }
 
   public OMNotLeaderException(RaftPeerId currentPeerId,
       RaftPeerId suggestedLeaderPeerId) {
+    this(currentPeerId, suggestedLeaderPeerId, null);
+  }
+
+  public OMNotLeaderException(RaftPeerId currentPeerId,
+      RaftPeerId suggestedLeaderPeerId, String suggestedLeaderAddress) {
     super("OM:" + currentPeerId + " is not the leader. Suggested leader is" +
-        " OM:" + suggestedLeaderPeerId + ".");
+        " OM:" + suggestedLeaderPeerId + "/" + suggestedLeaderAddress + ".");

Review Comment:
   @neils-dev Thank you for the review.
   I think the `suggestedLeaderAddress` will return `null` if there are some errors on the server side about the Ratis Peer or hostname resolving, which should be resolved on server side according to the error logs. Once the server side error fix, the client should be able to realise the error config according to the exception message.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] neils-dev commented on a diff in pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by "neils-dev (via GitHub)" <gi...@apache.org>.
neils-dev commented on code in PR #3409:
URL: https://github.com/apache/ozone/pull/3409#discussion_r1098075995


##########
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/GrpcOMFailoverProxyProvider.java:
##########
@@ -85,14 +87,16 @@ protected void loadOMClientConfigs(ConfigurationSource config, String omSvcId)
         nodeId = OzoneConsts.OM_DEFAULT_NODE_ID;
       }
       if (hostaddr.isPresent()) {
+        int port = hostport.orElse(config
+            .getObject(GrpcOmTransport
+                .GrpcOmTransportConfig.class)
+            .getPort());
         ProxyInfo<T> proxyInfo =
             new ProxyInfo<>(createOMProxy(),
-                hostaddr.get() + ":"
-                    + hostport.orElse(config
-                    .getObject(GrpcOmTransport
-                        .GrpcOmTransportConfig.class)
-                    .getPort()));
+                hostaddr.get() + ":" + port);
         omProxies.put(nodeId, proxyInfo);
+        omNodeAddressMap.put(nodeId, NetUtils.createSocketAddr(hostaddr.get() +
+            ":" + port));

Review Comment:
   ```
   NetUtils.createSocketAddr(hostaddr.get() +
               ":" + port));
   ```
   can be rewritten using the `proxyInfo` we defined above such as,
   `NetUtils.createSocketAddr(proxyInfo.proxyInfo));
   `



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] symious commented on pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by GitBox <gi...@apache.org>.
symious commented on PR #3409:
URL: https://github.com/apache/ozone/pull/3409#issuecomment-1125956137

   @JacksonYao287 Sure, thanks for the review.
   
   In https://github.com/apache/ozone/pull/2765#issuecomment-952091699, the concern I think is the misconfig of client side might trigger some dead loops, so an address was prefered to add instead of only OMNodeId.
   
   In the latest commit of this PR, the `OMNotLeaderException` includes the following information:
   1. raftPeerId
   2. raftLeaderId
   3. raftLeaderAddress
   
   An example of this exception message would be 
   ```org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:omNode-3 is not the leader. Suggested leader is OM:omNode-1/127.0.0.1```
   , when client received this exception, he should try the address first, only if the address is empty should he try to check the raftLeaderId we suggested.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] neils-dev commented on a diff in pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by "neils-dev (via GitHub)" <gi...@apache.org>.
neils-dev commented on code in PR #3409:
URL: https://github.com/apache/ozone/pull/3409#discussion_r1103497908


##########
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerRatisServer.java:
##########
@@ -807,6 +810,57 @@ public RaftPeerId getRaftPeerId() {
     return this.raftPeerId;
   }
 
+  @VisibleForTesting
+  public RaftPeerId getRaftLeaderId() {
+    return this.getRaftLeader() == null ? null : this.getRaftLeader().getId();
+  }
+
+  @VisibleForTesting
+  public String getRaftLeaderAddress() {
+    if (this.getRaftLeader() == null) {
+      return null;
+    }
+    InetAddress leaderInetAddress = null;
+    try {
+      Optional<String> hostname =
+          HddsUtils.getHostName(getRaftLeader().getAddress());
+      if (hostname.isPresent()) {
+        leaderInetAddress = InetAddress.getByName(hostname.get());
+      }
+    } catch (IOException ex) {
+      LOG.error("OM Ratis LeaderInetAddress {} is unresolvable",
+          getRaftLeader().getAddress());
+    }
+    return leaderInetAddress == null ? null :
+        leaderInetAddress.toString();
+  }
+
+  @VisibleForTesting
+  public RaftPeer getRaftLeader() {
+    RaftPeerId leaderId = null;

Review Comment:
   Can we either reuse `OzoneManagerRatisServer.getLeader()` or replace `getLeader()` with `getRaftLeader()`?  Do we need both?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] kerneltime commented on pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by GitBox <gi...@apache.org>.
kerneltime commented on PR #3409:
URL: https://github.com/apache/ozone/pull/3409#issuecomment-1303091091

   thanks @symious will get this reviewed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] swamirishi commented on pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by "swamirishi (via GitHub)" <gi...@apache.org>.
swamirishi commented on PR #3409:
URL: https://github.com/apache/ozone/pull/3409#issuecomment-1407753547

   @kerneltime The review comments have been addressed. I guess this can be merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] neils-dev commented on a diff in pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by "neils-dev (via GitHub)" <gi...@apache.org>.
neils-dev commented on code in PR #3409:
URL: https://github.com/apache/ozone/pull/3409#discussion_r1105021586


##########
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerRatisServer.java:
##########
@@ -807,6 +810,57 @@ public RaftPeerId getRaftPeerId() {
     return this.raftPeerId;
   }
 
+  @VisibleForTesting
+  public RaftPeerId getRaftLeaderId() {
+    return this.getRaftLeader() == null ? null : this.getRaftLeader().getId();
+  }
+
+  @VisibleForTesting
+  public String getRaftLeaderAddress() {
+    if (this.getRaftLeader() == null) {
+      return null;
+    }
+    InetAddress leaderInetAddress = null;
+    try {
+      Optional<String> hostname =
+          HddsUtils.getHostName(getRaftLeader().getAddress());
+      if (hostname.isPresent()) {
+        leaderInetAddress = InetAddress.getByName(hostname.get());
+      }
+    } catch (IOException ex) {

Review Comment:
   Minor.  Can be more specific here, catching thrown `UnknownHostException` .



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] kerneltime commented on pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by GitBox <gi...@apache.org>.
kerneltime commented on PR #3409:
URL: https://github.com/apache/ozone/pull/3409#issuecomment-1297360800

   @symious is this PR still active? If not we can close it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] symious commented on pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by GitBox <gi...@apache.org>.
symious commented on PR #3409:
URL: https://github.com/apache/ozone/pull/3409#issuecomment-1125740202

   @adoroszlai @ChenSammi Could you help to review this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] neils-dev commented on a diff in pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by GitBox <gi...@apache.org>.
neils-dev commented on code in PR #3409:
URL: https://github.com/apache/ozone/pull/3409#discussion_r1014922846


##########
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOzoneManagerHAWithFailover.java:
##########
@@ -62,4 +70,48 @@ public void testIncrementalWaitTimeWithSameNodeFailover() throws Exception {
     Assert.assertEquals((numTimesTriedToSameNode + 1) * waitBetweenRetries,
         omFailoverProxyProvider.getWaitTime());
   }
+

Review Comment:
   The javadoc for the test class, `TestOzoneManagerHAWithFailover` comments that tests should not be added to this class as it has a cleanup problem that leaves the cluster in a unusable states (not reusable).  Are we affected by this?
   
   > * NOTE: Do not add new tests to this class since
   
   https://github.com/apache/ozone/blob/40203afe63d40fa1340b64fd67ddfd5a2a6b0c99/hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOzoneManagerHAWithFailover.java#L36



##########
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/exceptions/OMNotLeaderException.java:
##########
@@ -34,24 +34,33 @@ public class OMNotLeaderException extends IOException {
 
   private final String currentPeerId;
   private final String leaderPeerId;
+  private final String leaderAddress;
   private static final Pattern CURRENT_PEER_ID_PATTERN =
       Pattern.compile("OM:(.*) is not the leader[.]+.*", Pattern.DOTALL);
   private static final Pattern SUGGESTED_LEADER_PATTERN =
-      Pattern.compile(".*Suggested leader is OM:([^.]*).*", Pattern.DOTALL);
+      Pattern.compile(".*Suggested leader is OM:([^.]*)\\/(.*)\\.",
+          Pattern.DOTALL);
 
   public OMNotLeaderException(RaftPeerId currentPeerId) {
     super("OM:" + currentPeerId + " is not the leader. Could not " +
         "determine the leader node.");
     this.currentPeerId = currentPeerId.toString();
     this.leaderPeerId = null;
+    this.leaderAddress = null;
   }
 
   public OMNotLeaderException(RaftPeerId currentPeerId,
       RaftPeerId suggestedLeaderPeerId) {
+    this(currentPeerId, suggestedLeaderPeerId, null);
+  }
+
+  public OMNotLeaderException(RaftPeerId currentPeerId,
+      RaftPeerId suggestedLeaderPeerId, String suggestedLeaderAddress) {
     super("OM:" + currentPeerId + " is not the leader. Suggested leader is" +
-        " OM:" + suggestedLeaderPeerId + ".");
+        " OM:" + suggestedLeaderPeerId + "/" + suggestedLeaderAddress + ".");

Review Comment:
   Thanks @symious for this important patch to fix suggesting the leader om when the client submits a request to a follower.  
   Q. Will we also run into the client nodeid list misconfiguration problem brought up should the `OMNotALeaderException `return a `null` for the `suggestedLeaderAddress`?
   
   > In https://github.com/apache/ozone/pull/2765#issuecomment-952091699, the concern I think is the misconfig of client side might trigger some dead loops,



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] symious commented on a diff in pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by GitBox <gi...@apache.org>.
symious commented on code in PR #3409:
URL: https://github.com/apache/ozone/pull/3409#discussion_r1015149055


##########
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/exceptions/OMNotLeaderException.java:
##########
@@ -34,24 +34,33 @@ public class OMNotLeaderException extends IOException {
 
   private final String currentPeerId;
   private final String leaderPeerId;
+  private final String leaderAddress;
   private static final Pattern CURRENT_PEER_ID_PATTERN =
       Pattern.compile("OM:(.*) is not the leader[.]+.*", Pattern.DOTALL);
   private static final Pattern SUGGESTED_LEADER_PATTERN =
-      Pattern.compile(".*Suggested leader is OM:([^.]*).*", Pattern.DOTALL);
+      Pattern.compile(".*Suggested leader is OM:([^.]*)\\/(.*)\\.",
+          Pattern.DOTALL);
 
   public OMNotLeaderException(RaftPeerId currentPeerId) {
     super("OM:" + currentPeerId + " is not the leader. Could not " +
         "determine the leader node.");
     this.currentPeerId = currentPeerId.toString();
     this.leaderPeerId = null;
+    this.leaderAddress = null;
   }
 
   public OMNotLeaderException(RaftPeerId currentPeerId,
       RaftPeerId suggestedLeaderPeerId) {
+    this(currentPeerId, suggestedLeaderPeerId, null);
+  }
+
+  public OMNotLeaderException(RaftPeerId currentPeerId,
+      RaftPeerId suggestedLeaderPeerId, String suggestedLeaderAddress) {
     super("OM:" + currentPeerId + " is not the leader. Suggested leader is" +
-        " OM:" + suggestedLeaderPeerId + ".");
+        " OM:" + suggestedLeaderPeerId + "/" + suggestedLeaderAddress + ".");

Review Comment:
   Added a check if the Server and Client information of OM matches, if not, it will fallback to the initial implementation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] symious commented on a diff in pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by GitBox <gi...@apache.org>.
symious commented on code in PR #3409:
URL: https://github.com/apache/ozone/pull/3409#discussion_r1052222020


##########
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/exceptions/OMNotLeaderException.java:
##########
@@ -34,24 +34,33 @@ public class OMNotLeaderException extends IOException {
 
   private final String currentPeerId;
   private final String leaderPeerId;
+  private final String leaderAddress;
   private static final Pattern CURRENT_PEER_ID_PATTERN =
       Pattern.compile("OM:(.*) is not the leader[.]+.*", Pattern.DOTALL);
   private static final Pattern SUGGESTED_LEADER_PATTERN =
-      Pattern.compile(".*Suggested leader is OM:([^.]*).*", Pattern.DOTALL);
+      Pattern.compile(".*Suggested leader is OM:([^.]*)\\[(.*)\\]\\.",
+          Pattern.DOTALL);
 
   public OMNotLeaderException(RaftPeerId currentPeerId) {
     super("OM:" + currentPeerId + " is not the leader. Could not " +
         "determine the leader node.");
     this.currentPeerId = currentPeerId.toString();
     this.leaderPeerId = null;
+    this.leaderAddress = null;
   }
 
   public OMNotLeaderException(RaftPeerId currentPeerId,
       RaftPeerId suggestedLeaderPeerId) {
+    this(currentPeerId, suggestedLeaderPeerId, null);
+  }
+
+  public OMNotLeaderException(RaftPeerId currentPeerId,
+      RaftPeerId suggestedLeaderPeerId, String suggestedLeaderAddress) {
     super("OM:" + currentPeerId + " is not the leader. Suggested leader is" +
-        " OM:" + suggestedLeaderPeerId + ".");
+        " OM:" + suggestedLeaderPeerId + "[" + suggestedLeaderAddress + "].");
     this.currentPeerId = currentPeerId.toString();
     this.leaderPeerId = suggestedLeaderPeerId.toString();
+    this.leaderAddress = suggestedLeaderAddress;
   }
 
   public OMNotLeaderException(String message) {

Review Comment:
   Updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] neils-dev commented on pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by "neils-dev (via GitHub)" <gi...@apache.org>.
neils-dev commented on PR #3409:
URL: https://github.com/apache/ozone/pull/3409#issuecomment-1428986721

   Thanks @symious .  I tested failover on the docker ha dev cluster and noticed that when leader node goes down, the followers respond in one of two ways, either i.) initially providing a stale om leader or ii,) null address om suggestion.
   
   i.) (stale leader suggestion) `ozone-ha-om2-1  | 2023-02-14 01:22:51 DEBUG OzoneManagerProtocolServerSideTranslatorPB:251 - OM:om2 is not the leader. Suggested leader is OM:om3[om3/172.27.0.9].`
   
   ii.) `ozone-ha-om2-1  | 2023-02-14 01:24:07 DEBUG OzoneManagerProtocolServerSideTranslatorPB:251 - OM:om2 is not the leader. Suggested leader is OM:om3[null].`
   
   
   The null address in the suggestion seems to be effective causing the failover provider to choose the next om to try from its om node map and resolves the failover.
   
   
   Q. should the node give a stale om leader id, is it possible that we encounter a loop condition where we continue to try the suggestion, fail, then ask the same om for the suggestion that we already tried?
   https://github.com/symious/ozone/blob/98caa74e8870829f974e77b89da6d293b7db62cd/hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/OMFailoverProxyProviderBase.java#L189


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] JacksonYao287 commented on pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by GitBox <gi...@apache.org>.
JacksonYao287 commented on PR #3409:
URL: https://github.com/apache/ozone/pull/3409#issuecomment-1125883419

   thanks @symious for the work! 
   i have a patch for this issue earlier #2765 
   
   @hanishakoneru left a comment to explain why this should not be done. https://github.com/apache/ozone/pull/2765#issuecomment-952091699
   
   i suggest we should achieve agreement on this issue first , and then go ahead. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] DaveTeng0 commented on pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by GitBox <gi...@apache.org>.
DaveTeng0 commented on PR #3409:
URL: https://github.com/apache/ozone/pull/3409#issuecomment-1298037109

   Just saw this PR, recently I've also been researching some issue related to the out-of-sync mapping between client and server. just mark myself here in order to follow up the latest change of this PR! thanks all! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] symious commented on pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by "symious (via GitHub)" <gi...@apache.org>.
symious commented on PR #3409:
URL: https://github.com/apache/ozone/pull/3409#issuecomment-1420307781

   @neils-dev Updated the PR, please have a look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] DaveTeng0 commented on pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by GitBox <gi...@apache.org>.
DaveTeng0 commented on PR #3409:
URL: https://github.com/apache/ozone/pull/3409#issuecomment-1352447606

   Changes look good to me! 
   (hope this could be merged, I definitely could utilize them in one of my current task! thanks!)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] kerneltime commented on pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by GitBox <gi...@apache.org>.
kerneltime commented on PR #3409:
URL: https://github.com/apache/ozone/pull/3409#issuecomment-1303091377

   cc @duongkame @aswinshakil @tanvipenumudy 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] neils-dev merged pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by "neils-dev (via GitHub)" <gi...@apache.org>.
neils-dev merged PR #3409:
URL: https://github.com/apache/ozone/pull/3409


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] kerneltime commented on a diff in pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by GitBox <gi...@apache.org>.
kerneltime commented on code in PR #3409:
URL: https://github.com/apache/ozone/pull/3409#discussion_r1051909665


##########
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/exceptions/OMNotLeaderException.java:
##########
@@ -34,24 +34,33 @@ public class OMNotLeaderException extends IOException {
 
   private final String currentPeerId;
   private final String leaderPeerId;
+  private final String leaderAddress;
   private static final Pattern CURRENT_PEER_ID_PATTERN =
       Pattern.compile("OM:(.*) is not the leader[.]+.*", Pattern.DOTALL);
   private static final Pattern SUGGESTED_LEADER_PATTERN =
-      Pattern.compile(".*Suggested leader is OM:([^.]*).*", Pattern.DOTALL);
+      Pattern.compile(".*Suggested leader is OM:([^.]*)\\[(.*)\\]\\.",
+          Pattern.DOTALL);
 
   public OMNotLeaderException(RaftPeerId currentPeerId) {
     super("OM:" + currentPeerId + " is not the leader. Could not " +
         "determine the leader node.");
     this.currentPeerId = currentPeerId.toString();
     this.leaderPeerId = null;
+    this.leaderAddress = null;
   }
 
   public OMNotLeaderException(RaftPeerId currentPeerId,
       RaftPeerId suggestedLeaderPeerId) {
+    this(currentPeerId, suggestedLeaderPeerId, null);
+  }
+
+  public OMNotLeaderException(RaftPeerId currentPeerId,
+      RaftPeerId suggestedLeaderPeerId, String suggestedLeaderAddress) {
     super("OM:" + currentPeerId + " is not the leader. Suggested leader is" +
-        " OM:" + suggestedLeaderPeerId + ".");
+        " OM:" + suggestedLeaderPeerId + "[" + suggestedLeaderAddress + "].");
     this.currentPeerId = currentPeerId.toString();
     this.leaderPeerId = suggestedLeaderPeerId.toString();
+    this.leaderAddress = suggestedLeaderAddress;
   }
 
   public OMNotLeaderException(String message) {

Review Comment:
   We should not support this. The caller has to specify either the peer ID or the leader ID.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] kerneltime commented on pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by "kerneltime (via GitHub)" <gi...@apache.org>.
kerneltime commented on PR #3409:
URL: https://github.com/apache/ozone/pull/3409#issuecomment-1408010095

   @swamirishi thanks for the update. I will take a look.
   @neils-dev can you please revisit the PR as well?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] neils-dev commented on a diff in pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by "neils-dev (via GitHub)" <gi...@apache.org>.
neils-dev commented on code in PR #3409:
URL: https://github.com/apache/ozone/pull/3409#discussion_r1098041892


##########
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/OMFailoverProxyProviderBase.java:
##########
@@ -172,12 +175,23 @@ public RetryAction shouldRetry(Exception exception, int retries,
           OMNotLeaderException notLeaderException =
               getNotLeaderException(exception);
           if (notLeaderException != null) {
-            // TODO: NotLeaderException should include the host
-            //  address of the suggested leader along with the nodeID.
-            //  Failing over just based on nodeID is not very robust.
-
             // Prepare the next OM to be tried. This will help with calculation
             // of the wait times needed get creating the retryAction.
+            if (notLeaderException.getSuggestedLeaderAddress() != null &&
+                notLeaderException.getSuggestedLeaderNodeId() != null) {
+              String suggestedLeaderAddress =
+                  notLeaderException.getSuggestedLeaderAddress();
+              String suggestedNodeId =
+                  notLeaderException.getSuggestedLeaderNodeId();
+              if (omNodeAddressMap.containsKey(suggestedNodeId) &&

Review Comment:
   _Instead_ of calling `notLeaderException.getSuggestedLeaderAddress() `and `getSuggestedLeaderNodeId() ` twice, first checking if non-null then setting variables `suggestedLeaderAddress` and `suggestedNodeID`, _we can just_ call once setting the local variables and check if variables are non-null later before `setNextOmProxy()`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] symious commented on pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by "symious (via GitHub)" <gi...@apache.org>.
symious commented on PR #3409:
URL: https://github.com/apache/ozone/pull/3409#issuecomment-1429044177

   @neils-dev Thank you for the review. Updated the patch, could you have a check?
   
   I think the problem you mentioned happens when the leader is not generated, once the leader is confirmed, the client should be forwarded to the correct OM.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] neils-dev commented on pull request #3409: HDDS-6743. Specify leader node for OM failover

Posted by "neils-dev (via GitHub)" <gi...@apache.org>.
neils-dev commented on PR #3409:
URL: https://github.com/apache/ozone/pull/3409#issuecomment-1439394362

   Thanks @JacksonYao287 , @kerneltime , @DaveTeng0 , @swamirishi for your review and comments.  Thanks @symious for this.  Merge to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org