You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@helix.apache.org by GitBox <gi...@apache.org> on 2020/09/14 18:43:17 UTC

[GitHub] [helix] xyuanlu opened a new pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

xyuanlu opened a new pull request #1362:
URL: https://github.com/apache/helix/pull/1362


   ### Issues
   
   - [X] My PR addresses the following Helix issues and references them in the PR description:
   
   Resolve #1361
   
   ### Description
   
   - [X] Here are some details about my PR, including screenshots of any UI changes:
   
   Currently we do all the state validation (the from state should be same current state and to state should be different to current state) for state transition message when the message is being executed in HelixStateTransitionHandler.
   
   We should move the to state validation to the scheduling stage because of the following reasons:
   1. It is a no-op operation, thus we should not create a thread to execute it.
   2. The lag between scheduling and executing could add additional latency for controller and eventually application latency.
   
   ### Tests
   
   - [X] The following tests are written for this issue:
   
   TestStateTransitionHandlerFactory.testStaledMessage
   
   - [X] The following is the result of the "mvn test" command on the appropriate module:
   
   ```
   [WARNING] Tests run: 1201, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 4,077.491 s - in TestSuite
   [INFO]
   [INFO] Results:
   [INFO]
   [WARNING] Tests run: 1201, Failures: 0, Errors: 0, Skipped: 1
   [INFO]
   [INFO] ------------------------------------------------------------------------
   [INFO] BUILD SUCCESS
   [INFO] ------------------------------------------------------------------------
   [INFO] Total time:  01:08 h
   [INFO] Finished at: 2020-09-12T01:46:03-07:00
   [INFO] ------------------------------------------------------------------------
   ```
   
   ### Documentation (Optional)
   
   - In case of new functionality, my PR adds documentation in the following wiki page:
   
   (Link the GitHub wiki you added)
   
   ### Commits
   
   - My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Code Quality
   
   - My diff has been formatted using helix-style.xml 
   (helix-style-intellij.xml if IntelliJ IDE is used)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r490476885



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,36 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  private void validateStaleMessage (boolean isPreCheck) throws Exception {
+    String fromState = _message.getFromState();
+    String toState = _message.getToState();
+    String partitionName = _message.getPartitionName();
+
+    // state in _currentStateDelta uses current state from state model. It has the
+    // most up-to-date. current state. In case currentState in stateModel is null,
+    // partition is in initial state and we using it as current state.
+    // Defined in HelixStateMachineEngine.
+    String state = _currentStateDelta.getState(partitionName);
+
+    if (toState.equalsIgnoreCase(state)) {
+      // To state equals current state, we can just ignore the message
+      throw new HelixDuplicatedStateTransitionException(String

Review comment:
       Thanks for the follow up. 
   In my understanding, this message must be a new message generated by controller, maybe caused by reading the old currentState etc.
   Since this is a new message, doesn't this count as DuplicatedStateTransition?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r493192875



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,36 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  private void validateStaleMessage (boolean isPreCheck) throws Exception {
+    String fromState = _message.getFromState();
+    String toState = _message.getToState();
+    String partitionName = _message.getPartitionName();
+
+    // state in _currentStateDelta uses current state from state model. It has the
+    // most up-to-date. current state. In case currentState in stateModel is null,
+    // partition is in initial state and we using it as current state.
+    // Defined in HelixStateMachineEngine.
+    String state = _currentStateDelta.getState(partitionName);
+
+    if (toState.equalsIgnoreCase(state)) {
+      // To state equals current state, we can just ignore the message
+      throw new HelixDuplicatedStateTransitionException(String
+          .format("Partition %s current state is same as toState (%s->%s) from message.",
+              partitionName, fromState, toState));
+    } else if (!isPreCheck && fromState != null && !fromState.equals("*") && !fromState

Review comment:
       I had an offline sync with Ali. He has submit a PR to fix this issue (#1390). I changed my PR accordingly. 
   Currently my PR contains his patch.  Will rebase after his PR merged. 

##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,35 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  public Exception isMessageStaled(boolean inSchedulerCheck) {
+    String fromState = _message.getFromState();
+    String toState = _message.getToState();
+    String partitionName = _message.getPartitionName();
+
+    // state in _currentStateDelta uses current state from state model. It has the
+    // most up-to-date. current state. In case currentState in stateModel is null,
+    // partition is in initial state and we using it as current state.
+    // Defined in HelixStateMachineEngine.
+    String state = _currentStateDelta.getState(partitionName);
+
+    Exception err = null;
+    if (toState.equalsIgnoreCase(state)) {
+      // To state equals current state, we can just ignore the message
+      err = new HelixDuplicatedStateTransitionException(String
+          .format("Partition %s current state is same as toState (%s->%s) from message.",
+              partitionName, fromState, toState));
+    } else if (!inSchedulerCheck && fromState != null && !fromState.equals("*") && !fromState

Review comment:
       I had an offline sync with Ali. He has submit a PR to fix this issue (#1390). I changed my PR accordingly. We now do the same validation for scheduling stage and execution stage.  
   Currently my PR contains his patch. Will rebase after his PR merged.

##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -919,9 +912,25 @@ public void onMessage(String instanceName, List<Message> messages,
             // discard the message. Controller will resend if this is a valid message
             throw new HelixException(String.format(
                 "Another state transition for %s:%s is in progress with msg: %s, p2p: %s, read: %d, current:%d. Discarding %s->%s message",
-                message.getResourceName(), message.getPartitionName(), msg.getMsgId(), String.valueOf(msg.isRelayMessage()),
-                msg.getReadTimeStamp(), System.currentTimeMillis(), message.getFromState(),
-                message.getToState()));
+                message.getResourceName(), message.getPartitionName(), msg.getMsgId(),
+                String.valueOf(msg.isRelayMessage()), msg.getReadTimeStamp(),
+                System.currentTimeMillis(), message.getFromState(), message.getToState()));
+          }
+          if (createHandler instanceof HelixStateTransitionHandler) {
+            // We only check to state if there is no ST task scheduled/executing.
+            Exception err = ((HelixStateTransitionHandler) createHandler).validateStaleMessage(true /*inSchedulerCheck*/);
+            if (err != null) {
+              throw err;
+            }
+          }
+          if (stateTransitionHandlers.containsKey(messageTarget)) {
+            // If there are 2 messages in same batch about same partition's state transition,
+            // the later one is discarded
+            Message duplicatedMessage = stateTransitionHandlers.get(messageTarget)._message;
+            throw new HelixException(String.format(

Review comment:
       Update. 

##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,36 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  private void validateStaleMessage (boolean isPreCheck) throws Exception {
+    String fromState = _message.getFromState();
+    String toState = _message.getToState();
+    String partitionName = _message.getPartitionName();
+
+    // state in _currentStateDelta uses current state from state model. It has the
+    // most up-to-date. current state. In case currentState in stateModel is null,
+    // partition is in initial state and we using it as current state.
+    // Defined in HelixStateMachineEngine.
+    String state = _currentStateDelta.getState(partitionName);
+
+    if (toState.equalsIgnoreCase(state)) {
+      // To state equals current state, we can just ignore the message
+      throw new HelixDuplicatedStateTransitionException(String
+          .format("Partition %s current state is same as toState (%s->%s) from message.",
+              partitionName, fromState, toState));
+    } else if (!isPreCheck && fromState != null && !fromState.equals("*") && !fromState

Review comment:
       I had an offline sync with Ali. He has submit a PR to fix this issue (#1390). I changed my PR accordingly. We now do the same validation for scheduling stage and execution stage.  
   Currently my PR contains his patch. Will rebase after his PR merged.

##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -919,9 +912,25 @@ public void onMessage(String instanceName, List<Message> messages,
             // discard the message. Controller will resend if this is a valid message
             throw new HelixException(String.format(
                 "Another state transition for %s:%s is in progress with msg: %s, p2p: %s, read: %d, current:%d. Discarding %s->%s message",
-                message.getResourceName(), message.getPartitionName(), msg.getMsgId(), String.valueOf(msg.isRelayMessage()),
-                msg.getReadTimeStamp(), System.currentTimeMillis(), message.getFromState(),
-                message.getToState()));
+                message.getResourceName(), message.getPartitionName(), msg.getMsgId(),
+                String.valueOf(msg.isRelayMessage()), msg.getReadTimeStamp(),
+                System.currentTimeMillis(), message.getFromState(), message.getToState()));
+          }
+          if (createHandler instanceof HelixStateTransitionHandler) {
+            // We only check to state if there is no ST task scheduled/executing.
+            Exception err = ((HelixStateTransitionHandler) createHandler).validateStaleMessage(true /*inSchedulerCheck*/);

Review comment:
       I had an offline sync with Ali. He has submit a PR to fix this issue (#1390). I changed my PR accordingly. We now do the same validation for scheduling stage and execution stage.  
   Currently my PR contains his patch. Will rebase after his PR merged.

##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -919,9 +912,25 @@ public void onMessage(String instanceName, List<Message> messages,
             // discard the message. Controller will resend if this is a valid message
             throw new HelixException(String.format(
                 "Another state transition for %s:%s is in progress with msg: %s, p2p: %s, read: %d, current:%d. Discarding %s->%s message",
-                message.getResourceName(), message.getPartitionName(), msg.getMsgId(), String.valueOf(msg.isRelayMessage()),
-                msg.getReadTimeStamp(), System.currentTimeMillis(), message.getFromState(),
-                message.getToState()));
+                message.getResourceName(), message.getPartitionName(), msg.getMsgId(),
+                String.valueOf(msg.isRelayMessage()), msg.getReadTimeStamp(),
+                System.currentTimeMillis(), message.getFromState(), message.getToState()));
+          }
+          if (createHandler instanceof HelixStateTransitionHandler) {
+            // We only check to state if there is no ST task scheduled/executing.
+            Exception err = ((HelixStateTransitionHandler) createHandler).validateStaleMessage(true /*inSchedulerCheck*/);
+            if (err != null) {
+              throw err;
+            }
+          }
+          if (stateTransitionHandlers.containsKey(messageTarget)) {
+            // If there are 2 messages in same batch about same partition's state transition,
+            // the later one is discarded
+            Message duplicatedMessage = stateTransitionHandlers.get(messageTarget)._message;
+            throw new HelixException(String.format(

Review comment:
       Updated. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r490407050



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,36 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  private void validateStaleMessage (boolean isPreCheck) throws Exception {
+    String fromState = _message.getFromState();
+    String toState = _message.getToState();
+    String partitionName = _message.getPartitionName();
+
+    // state in _currentStateDelta uses current state from state model. It has the
+    // most up-to-date. current state. In case currentState in stateModel is null,
+    // partition is in initial state and we using it as current state.
+    // Defined in HelixStateMachineEngine.
+    String state = _currentStateDelta.getState(partitionName);
+
+    if (toState.equalsIgnoreCase(state)) {
+      // To state equals current state, we can just ignore the message
+      throw new HelixDuplicatedStateTransitionException(String

Review comment:
       If it already been processed (in READ or UNPROCESSABLE state) then we would ignore the ST message before this check. (line 856 in HelixTaskExecutor)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r499050284



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -460,7 +438,33 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
       StateTransitionError error = new StateTransitionError(type, code, e);
       _stateModel.rollbackOnError(_message, _notificationContext, error);
     }
+  }
+
+  // Verify the fromState and current state of the stateModel.
+  public Exception staleMessageValidator() {

Review comment:
       We are not using the boolean return value. May I ask is this for a cleaner interface purpose for further usage? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r488230565



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,35 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  public Exception isMessageStaled(boolean inSchedulerCheck) {

Review comment:
       TFTR. Updated.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r492320893



##########
File path: helix-core/src/test/java/org/apache/helix/messaging/handling/TestHelixTaskExecutor.java
##########
@@ -212,18 +213,33 @@ public TestStateTransitionHandlerFactory(String msgType, long delay) {
       _delay = delay;
     }
 
-    class TestStateTransitionMessageHandler extends MessageHandler {
+    class TestStateTransitionMessageHandler extends HelixStateTransitionHandler {
+      boolean _testIsMessageStaled;
+
       public TestStateTransitionMessageHandler(Message message, NotificationContext context) {
-        super(message, context);
+        super(null, null, message, context, null);
+        _testIsMessageStaled = false;
+      }
+
+      public TestStateTransitionMessageHandler(Message message, NotificationContext context,
+          CurrentState currentStateDelta) {
+        super(null, null, message, context, currentStateDelta);
+        _testIsMessageStaled = true;
+
+
       }
 
       @Override
-      public HelixTaskResult handleMessage() throws InterruptedException {
+      public HelixTaskResult handleMessage() {
         HelixTaskResult result = new HelixTaskResult();
         _processedMsgIds.put(_message.getMsgId(), _message.getMsgId());
         if (_delay > 0) {
           System.out.println("Sleeping..." + _delay);
-          Thread.sleep(_delay);
+          try{
+            Thread.sleep(_delay);
+          } catch (Exception e) {
+            assert (false);

Review comment:
       I think the reason is that the base handleMessage does not throw exception. The try-catch block here is only for handle exceptions in test. 

##########
File path: helix-core/src/test/java/org/apache/helix/messaging/handling/TestHelixTaskExecutor.java
##########
@@ -232,11 +248,29 @@ public HelixTaskResult handleMessage() throws InterruptedException {
       @Override
       public void onError(Exception e, ErrorCode code, ErrorType type) {
       }
+
+      @Override
+      public void precheckForStaleMessage() throws Exception {
+        if (_testIsMessageStaled) {
+          super.precheckForStaleMessage();
+        }
+      }
     }
 
     @Override
     public MessageHandler createHandler(Message message, NotificationContext context) {
-      return new TestStateTransitionMessageHandler(message, context);
+      if (message.getResourceName()!="testStaledMessageResource") {

Review comment:
       TFTR. Updated. 

##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -919,9 +912,25 @@ public void onMessage(String instanceName, List<Message> messages,
             // discard the message. Controller will resend if this is a valid message
             throw new HelixException(String.format(
                 "Another state transition for %s:%s is in progress with msg: %s, p2p: %s, read: %d, current:%d. Discarding %s->%s message",
-                message.getResourceName(), message.getPartitionName(), msg.getMsgId(), String.valueOf(msg.isRelayMessage()),
-                msg.getReadTimeStamp(), System.currentTimeMillis(), message.getFromState(),
-                message.getToState()));
+                message.getResourceName(), message.getPartitionName(), msg.getMsgId(),
+                String.valueOf(msg.isRelayMessage()), msg.getReadTimeStamp(),
+                System.currentTimeMillis(), message.getFromState(), message.getToState()));
+          }
+          if (createHandler instanceof HelixStateTransitionHandler) {
+            // We only check to state if there is no ST task scheduled/executing.
+            Exception err = ((HelixStateTransitionHandler) createHandler).validateStaleMessage(true /*inSchedulerCheck*/);
+            if (err != null) {
+              throw err;
+            }
+          }
+          if (stateTransitionHandlers.containsKey(messageTarget)) {

Review comment:
       Yes. We do a filtering first based on current state, and if the message is stale, we will directly delete it. (and we will throw exception as well, but not at the execution stage as before)
   Although there is no existing bug complaining 2 valid new messages at the same time, I think we should not assume there will be no duplicated messages so we should still keep the 3rd check. 
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] pkuwm commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
pkuwm commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r495429734



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -460,7 +438,33 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
       StateTransitionError error = new StateTransitionError(type, code, e);
       _stateModel.rollbackOnError(_message, _notificationContext, error);
     }
+  }
+
+  // Verify the fromState and current state of the stateModel.
+  public Exception staleMessageValidator() {

Review comment:
       `staleMessageValidator` sounds more like a class/variable name. How about changing the method name to a verb style: `validateStaleMessage()` (it does something)?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r492326134



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -919,9 +912,25 @@ public void onMessage(String instanceName, List<Message> messages,
             // discard the message. Controller will resend if this is a valid message
             throw new HelixException(String.format(
                 "Another state transition for %s:%s is in progress with msg: %s, p2p: %s, read: %d, current:%d. Discarding %s->%s message",
-                message.getResourceName(), message.getPartitionName(), msg.getMsgId(), String.valueOf(msg.isRelayMessage()),
-                msg.getReadTimeStamp(), System.currentTimeMillis(), message.getFromState(),
-                message.getToState()));
+                message.getResourceName(), message.getPartitionName(), msg.getMsgId(),
+                String.valueOf(msg.isRelayMessage()), msg.getReadTimeStamp(),
+                System.currentTimeMillis(), message.getFromState(), message.getToState()));
+          }
+          if (createHandler instanceof HelixStateTransitionHandler) {
+            // We only check to state if there is no ST task scheduled/executing.
+            Exception err = ((HelixStateTransitionHandler) createHandler).validateStaleMessage(true /*inSchedulerCheck*/);
+            if (err != null) {
+              throw err;
+            }
+          }
+          if (stateTransitionHandlers.containsKey(messageTarget)) {

Review comment:
       Yes. We do a filtering first based on current state, and if the message is stale, we will directly delete it. (and we will throw exception as well, but not at the execution stage as before)
   Although there is no existing bug complaining 2 valid new messages at the same time, I think we should not assume there will be no duplicated messages so we should still keep the 3rd check. 
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on pull request #1362:
URL: https://github.com/apache/helix/pull/1362#issuecomment-705258213


   This PR is ready to be merged. Approved by @dasahcc 
   
   Final commit message:
   Move ST message to state validation from executing phase to scheduling phase.
   
   Move ST message to state validation to scheduling phase.because of the following reasons:
   It is a no-op operation, thus we should not create a thread to execute it.
   The lag between scheduling and executing could add additional latency for controller and eventually application latency. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r488302199



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,35 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  public Exception validateStaleMessage(boolean inSchedulerCheck) {

Review comment:
       Usually, our validate method follows the same convention. It either returns boolean or void (in this case, throw an exception if invalid). I suggest we follow the same logic here.

##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -919,9 +912,25 @@ public void onMessage(String instanceName, List<Message> messages,
             // discard the message. Controller will resend if this is a valid message
             throw new HelixException(String.format(
                 "Another state transition for %s:%s is in progress with msg: %s, p2p: %s, read: %d, current:%d. Discarding %s->%s message",
-                message.getResourceName(), message.getPartitionName(), msg.getMsgId(), String.valueOf(msg.isRelayMessage()),
-                msg.getReadTimeStamp(), System.currentTimeMillis(), message.getFromState(),
-                message.getToState()));
+                message.getResourceName(), message.getPartitionName(), msg.getMsgId(),
+                String.valueOf(msg.isRelayMessage()), msg.getReadTimeStamp(),
+                System.currentTimeMillis(), message.getFromState(), message.getToState()));
+          }
+          if (createHandler instanceof HelixStateTransitionHandler) {
+            // We only check to state if there is no ST task scheduled/executing.
+            Exception err = ((HelixStateTransitionHandler) createHandler).validateStaleMessage(true /*inSchedulerCheck*/);
+            if (err != null) {
+              throw err;
+            }
+          }
+          if (stateTransitionHandlers.containsKey(messageTarget)) {
+            // If there are 2 messages in same batch about same partition's state transition,
+            // the later one is discarded
+            Message duplicatedMessage = stateTransitionHandlers.get(messageTarget)._message;
+            throw new HelixException(String.format(

Review comment:
       Same for this one. It is the existing logic, but this exception will be processed by the following catch, which records the error as "Failed to create message handler...". This seems to be inaccurate.

##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -919,9 +912,25 @@ public void onMessage(String instanceName, List<Message> messages,
             // discard the message. Controller will resend if this is a valid message
             throw new HelixException(String.format(
                 "Another state transition for %s:%s is in progress with msg: %s, p2p: %s, read: %d, current:%d. Discarding %s->%s message",
-                message.getResourceName(), message.getPartitionName(), msg.getMsgId(), String.valueOf(msg.isRelayMessage()),
-                msg.getReadTimeStamp(), System.currentTimeMillis(), message.getFromState(),
-                message.getToState()));
+                message.getResourceName(), message.getPartitionName(), msg.getMsgId(),
+                String.valueOf(msg.isRelayMessage()), msg.getReadTimeStamp(),
+                System.currentTimeMillis(), message.getFromState(), message.getToState()));
+          }
+          if (createHandler instanceof HelixStateTransitionHandler) {
+            // We only check to state if there is no ST task scheduled/executing.
+            Exception err = ((HelixStateTransitionHandler) createHandler).validateStaleMessage(true /*inSchedulerCheck*/);

Review comment:
       What will be wrong if we do validateStaleMessage(false) here? If it works, then we can have a common logic, right?

##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,35 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  public Exception validateStaleMessage(boolean inSchedulerCheck) {

Review comment:
       Passing an additional boolean parameter is hard for the user to call. I suggest doing the following,
   1. create a private method "private void validateStaleMessage(boolean checkFromState) {...}"
   2. create a public method "public void precheckForStaleMessage() {validateStaleMessage(true)}"

##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -919,9 +912,25 @@ public void onMessage(String instanceName, List<Message> messages,
             // discard the message. Controller will resend if this is a valid message
             throw new HelixException(String.format(
                 "Another state transition for %s:%s is in progress with msg: %s, p2p: %s, read: %d, current:%d. Discarding %s->%s message",
-                message.getResourceName(), message.getPartitionName(), msg.getMsgId(), String.valueOf(msg.isRelayMessage()),
-                msg.getReadTimeStamp(), System.currentTimeMillis(), message.getFromState(),
-                message.getToState()));
+                message.getResourceName(), message.getPartitionName(), msg.getMsgId(),
+                String.valueOf(msg.isRelayMessage()), msg.getReadTimeStamp(),
+                System.currentTimeMillis(), message.getFromState(), message.getToState()));
+          }
+          if (createHandler instanceof HelixStateTransitionHandler) {
+            // We only check to state if there is no ST task scheduled/executing.
+            Exception err = ((HelixStateTransitionHandler) createHandler).validateStaleMessage(true /*inSchedulerCheck*/);
+            if (err != null) {
+              throw err;

Review comment:
       I think we want to "continue" instead of throwing the Exception for more graceful handling.
   We can do this:
   "reportAndRemoveMessage(message, accessor, instanceName, ProcessedMessageState.DISCARDED);"

##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -919,9 +912,25 @@ public void onMessage(String instanceName, List<Message> messages,
             // discard the message. Controller will resend if this is a valid message
             throw new HelixException(String.format(
                 "Another state transition for %s:%s is in progress with msg: %s, p2p: %s, read: %d, current:%d. Discarding %s->%s message",
-                message.getResourceName(), message.getPartitionName(), msg.getMsgId(), String.valueOf(msg.isRelayMessage()),
-                msg.getReadTimeStamp(), System.currentTimeMillis(), message.getFromState(),
-                message.getToState()));
+                message.getResourceName(), message.getPartitionName(), msg.getMsgId(),
+                String.valueOf(msg.isRelayMessage()), msg.getReadTimeStamp(),
+                System.currentTimeMillis(), message.getFromState(), message.getToState()));
+          }
+          if (createHandler instanceof HelixStateTransitionHandler) {
+            // We only check to state if there is no ST task scheduled/executing.
+            Exception err = ((HelixStateTransitionHandler) createHandler).validateStaleMessage(true /*inSchedulerCheck*/);
+            if (err != null) {
+              throw err;
+            }
+          }
+          if (stateTransitionHandlers.containsKey(messageTarget)) {

Review comment:
       Just curious, why we want this to happen later than the other 2?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r496215289



##########
File path: helix-core/src/test/java/org/apache/helix/messaging/handling/TestHelixTaskExecutor.java
##########
@@ -212,18 +213,33 @@ public TestStateTransitionHandlerFactory(String msgType, long delay) {
       _delay = delay;
     }
 
-    class TestStateTransitionMessageHandler extends MessageHandler {
+    class TestStateTransitionMessageHandler extends HelixStateTransitionHandler {
+      boolean _testIsMessageStaled;
+
       public TestStateTransitionMessageHandler(Message message, NotificationContext context) {
-        super(message, context);
+        super(null, null, message, context, null);
+        _testIsMessageStaled = false;
+      }
+
+      public TestStateTransitionMessageHandler(Message message, NotificationContext context,
+          CurrentState currentStateDelta) {
+        super(null, null, message, context, currentStateDelta);
+        _testIsMessageStaled = true;
+
+
       }
 
       @Override
-      public HelixTaskResult handleMessage() throws InterruptedException {
+      public HelixTaskResult handleMessage() {
         HelixTaskResult result = new HelixTaskResult();
         _processedMsgIds.put(_message.getMsgId(), _message.getMsgId());
         if (_delay > 0) {
           System.out.println("Sleeping..." + _delay);
-          Thread.sleep(_delay);
+          try{
+            Thread.sleep(_delay);
+          } catch (Exception e) {
+            assert (false);

Review comment:
       Please correct me if I am wrong. 
   For the `InterruptedException`, we need to either catch it or add a declaration in method signature. Since the HelixStateTransitionHandler.handleMessage() does not throw exception, we could only do a try-catch here. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r488322440



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -919,9 +912,25 @@ public void onMessage(String instanceName, List<Message> messages,
             // discard the message. Controller will resend if this is a valid message
             throw new HelixException(String.format(
                 "Another state transition for %s:%s is in progress with msg: %s, p2p: %s, read: %d, current:%d. Discarding %s->%s message",
-                message.getResourceName(), message.getPartitionName(), msg.getMsgId(), String.valueOf(msg.isRelayMessage()),
-                msg.getReadTimeStamp(), System.currentTimeMillis(), message.getFromState(),
-                message.getToState()));
+                message.getResourceName(), message.getPartitionName(), msg.getMsgId(),
+                String.valueOf(msg.isRelayMessage()), msg.getReadTimeStamp(),
+                System.currentTimeMillis(), message.getFromState(), message.getToState()));
+          }
+          if (createHandler instanceof HelixStateTransitionHandler) {
+            // We only check to state if there is no ST task scheduled/executing.
+            Exception err = ((HelixStateTransitionHandler) createHandler).validateStaleMessage(true /*inSchedulerCheck*/);
+            if (err != null) {
+              throw err;

Review comment:
       We do continue when catch the exception (line 957). We could change the message if I think if the wording is not accurate in the catch block. 

##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -919,9 +912,25 @@ public void onMessage(String instanceName, List<Message> messages,
             // discard the message. Controller will resend if this is a valid message
             throw new HelixException(String.format(
                 "Another state transition for %s:%s is in progress with msg: %s, p2p: %s, read: %d, current:%d. Discarding %s->%s message",
-                message.getResourceName(), message.getPartitionName(), msg.getMsgId(), String.valueOf(msg.isRelayMessage()),
-                msg.getReadTimeStamp(), System.currentTimeMillis(), message.getFromState(),
-                message.getToState()));
+                message.getResourceName(), message.getPartitionName(), msg.getMsgId(),
+                String.valueOf(msg.isRelayMessage()), msg.getReadTimeStamp(),
+                System.currentTimeMillis(), message.getFromState(), message.getToState()));
+          }
+          if (createHandler instanceof HelixStateTransitionHandler) {
+            // We only check to state if there is no ST task scheduled/executing.
+            Exception err = ((HelixStateTransitionHandler) createHandler).validateStaleMessage(true /*inSchedulerCheck*/);
+            if (err != null) {
+              throw err;
+            }
+          }
+          if (stateTransitionHandlers.containsKey(messageTarget)) {

Review comment:
       Please correct me if I am wrong. I think if we have an old O->S and a new S->M with current state is S. We do want to ignore the O->S and continue with S->M. 
   

##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,35 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  public Exception validateStaleMessage(boolean inSchedulerCheck) {

Review comment:
       TFTR. Will update. 

##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -919,9 +912,25 @@ public void onMessage(String instanceName, List<Message> messages,
             // discard the message. Controller will resend if this is a valid message
             throw new HelixException(String.format(
                 "Another state transition for %s:%s is in progress with msg: %s, p2p: %s, read: %d, current:%d. Discarding %s->%s message",
-                message.getResourceName(), message.getPartitionName(), msg.getMsgId(), String.valueOf(msg.isRelayMessage()),
-                msg.getReadTimeStamp(), System.currentTimeMillis(), message.getFromState(),
-                message.getToState()));
+                message.getResourceName(), message.getPartitionName(), msg.getMsgId(),
+                String.valueOf(msg.isRelayMessage()), msg.getReadTimeStamp(),
+                System.currentTimeMillis(), message.getFromState(), message.getToState()));
+          }
+          if (createHandler instanceof HelixStateTransitionHandler) {
+            // We only check to state if there is no ST task scheduled/executing.
+            Exception err = ((HelixStateTransitionHandler) createHandler).validateStaleMessage(true /*inSchedulerCheck*/);

Review comment:
       The handling for these two cases are different.
   
   For the first case when toState == currentState, we just treat the task as succeeded and finish task. There is no difference between doing the no-op when executing and bypass the task at scheduling phase.
   
   In the second case when currentState != fromState, executor will set the task as failed and report state transition error. Ir will also update ZK.
   
   Having same logic causes a test fail. Controller keeps sending many 'INIT->RUNNING' message when currentState is TASK_ERROR if we don't report task error. Currently task error is reported in HelixStateTransitionHandler.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r490476885



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,36 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  private void validateStaleMessage (boolean isPreCheck) throws Exception {
+    String fromState = _message.getFromState();
+    String toState = _message.getToState();
+    String partitionName = _message.getPartitionName();
+
+    // state in _currentStateDelta uses current state from state model. It has the
+    // most up-to-date. current state. In case currentState in stateModel is null,
+    // partition is in initial state and we using it as current state.
+    // Defined in HelixStateMachineEngine.
+    String state = _currentStateDelta.getState(partitionName);
+
+    if (toState.equalsIgnoreCase(state)) {
+      // To state equals current state, we can just ignore the message
+      throw new HelixDuplicatedStateTransitionException(String

Review comment:
       Thanks for the follow up. 
   In my understanding, this message must be a new message generated by controller, maybe caused by reading the old currentState etc.Since this is a new message, doesn't this count as HelixDuplicatedStateTransition?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r501281121



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -460,7 +438,33 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
       StateTransitionError error = new StateTransitionError(type, code, e);
       _stateModel.rollbackOnError(_message, _notificationContext, error);
     }
+  }
+
+  // Verify the fromState and current state of the stateModel.
+  public Exception staleMessageValidator() {

Review comment:
       Updated.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r498549327



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -460,7 +438,33 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
       StateTransitionError error = new StateTransitionError(type, code, e);
       _stateModel.rollbackOnError(_message, _notificationContext, error);
     }
+  }
+
+  // Verify the fromState and current state of the stateModel.
+  public Exception staleMessageValidator() {

Review comment:
       nit, it would be more graceful to define a ValidationResult class and contain a boolean (pass or not) and an optional Exception. Especially as a public method. Is there a chance we can make it protected or package private?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r488221033



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,35 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  public Exception isMessageStaled(boolean inSchedulerCheck) {
+    String fromState = _message.getFromState();
+    String toState = _message.getToState();
+    String partitionName = _message.getPartitionName();
+
+    // state in _currentStateDelta uses current state from state model. It has the
+    // most up-to-date. current state. In case currentState in stateModel is null,
+    // partition is in initial state and we using it as current state.
+    // Defined in HelixStateMachineEngine.
+    String state = _currentStateDelta.getState(partitionName);
+
+    Exception err = null;
+    if (toState.equalsIgnoreCase(state)) {
+      // To state equals current state, we can just ignore the message
+      err = new HelixDuplicatedStateTransitionException(String
+          .format("Partition %s current state is same as toState (%s->%s) from message.",
+              partitionName, fromState, toState));
+    } else if (!inSchedulerCheck && fromState != null && !fromState.equals("*") && !fromState

Review comment:
       TFTR. This case is a bit different than the above one. The handling for these two cases are different. 
   
   For the first case when toState == currentState, we just treat the task as succeeded and finish task. There is no difference between doing the no-op when executing and bypass the task at scheduling phase.  
   
   In the second case when currentState != fromState, executor will set the task as failed and report state transition error. Ir will also update ZK.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] dasahcc commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
dasahcc commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r489759062



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,36 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  private void validateStaleMessage (boolean isPreCheck) throws Exception {
+    String fromState = _message.getFromState();
+    String toState = _message.getToState();
+    String partitionName = _message.getPartitionName();
+
+    // state in _currentStateDelta uses current state from state model. It has the
+    // most up-to-date. current state. In case currentState in stateModel is null,
+    // partition is in initial state and we using it as current state.
+    // Defined in HelixStateMachineEngine.
+    String state = _currentStateDelta.getState(partitionName);
+
+    if (toState.equalsIgnoreCase(state)) {
+      // To state equals current state, we can just ignore the message
+      throw new HelixDuplicatedStateTransitionException(String

Review comment:
       This is not duplicate state transition. Could be some STs are old in pending queue just not match what target state we have now.

##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,36 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  private void validateStaleMessage (boolean isPreCheck) throws Exception {
+    String fromState = _message.getFromState();
+    String toState = _message.getToState();
+    String partitionName = _message.getPartitionName();
+
+    // state in _currentStateDelta uses current state from state model. It has the
+    // most up-to-date. current state. In case currentState in stateModel is null,
+    // partition is in initial state and we using it as current state.
+    // Defined in HelixStateMachineEngine.
+    String state = _currentStateDelta.getState(partitionName);
+
+    if (toState.equalsIgnoreCase(state)) {
+      // To state equals current state, we can just ignore the message
+      throw new HelixDuplicatedStateTransitionException(String
+          .format("Partition %s current state is same as toState (%s->%s) from message.",
+              partitionName, fromState, toState));
+    } else if (!isPreCheck && fromState != null && !fromState.equals("*") && !fromState

Review comment:
       Why we need this precheck flag? Two checks in onMessage and HelixTask execution validation logic should be same




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] zhangmeng916 commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
zhangmeng916 commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r488206119



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,35 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  public Exception isMessageStaled(boolean inSchedulerCheck) {
+    String fromState = _message.getFromState();
+    String toState = _message.getToState();
+    String partitionName = _message.getPartitionName();
+
+    // state in _currentStateDelta uses current state from state model. It has the
+    // most up-to-date. current state. In case currentState in stateModel is null,
+    // partition is in initial state and we using it as current state.
+    // Defined in HelixStateMachineEngine.
+    String state = _currentStateDelta.getState(partitionName);
+
+    Exception err = null;
+    if (toState.equalsIgnoreCase(state)) {
+      // To state equals current state, we can just ignore the message
+      err = new HelixDuplicatedStateTransitionException(String
+          .format("Partition %s current state is same as toState (%s->%s) from message.",
+              partitionName, fromState, toState));
+    } else if (!inSchedulerCheck && fromState != null && !fromState.equals("*") && !fromState

Review comment:
       Besides the current_state equals to_state case, is this also a case that we can move to scheduling phase instead of execution phase? That could help reduce even more threads usage. 

##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,35 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  public Exception isMessageStaled(boolean inSchedulerCheck) {

Review comment:
       minor: change isMessageStaled to isMessageStale. We didn't on purpose stale the message. 

##########
File path: helix-core/src/test/java/org/apache/helix/messaging/handling/TestHelixTaskExecutor.java
##########
@@ -377,6 +414,55 @@ public void testDuplicatedMessage() throws InterruptedException {
     System.out.println("END TestHelixTaskExecutor.testDuplicatedMessage()");
   }
 
+  @Test()
+  public void testStaledMessage() throws InterruptedException {
+    System.out.println("START TestHelixTaskExecutor.testStaledMessage()");
+    HelixTaskExecutor executor = new HelixTaskExecutor();
+    HelixManager manager = new MockClusterManager();
+    HelixDataAccessor dataAccessor = manager.getHelixDataAccessor();
+    PropertyKey.Builder keyBuilder = dataAccessor.keyBuilder();
+
+    TestStateTransitionHandlerFactory stateTransitionFactory =
+        new TestStateTransitionHandlerFactory(Message.MessageType.STATE_TRANSITION.name(), 1000);
+    executor.registerMessageHandlerFactory(Message.MessageType.STATE_TRANSITION.name(),
+        stateTransitionFactory);
+
+    NotificationContext changeContext = new NotificationContext(manager);
+    List<Message> msgList = new ArrayList<Message>();
+
+    int nMsgs = 1;
+    String instanceName = manager.getInstanceName();
+    for (int i = 0; i < nMsgs; i++) {
+      Message msg =
+          new Message(Message.MessageType.STATE_TRANSITION.name(), UUID.randomUUID().toString());
+      msg.setTgtSessionId(manager.getSessionId());
+      msg.setCreateTimeStamp((long) i);
+      msg.setTgtName("Localhost_1123");
+      msg.setSrcName("127.101.1.23_2234");
+      msg.setPartitionName("Partition");
+      msg.setResourceName("testStaledMessageResource");
+      msg.setStateModelDef("DummyMasterSlave");
+      msg.setFromState("SLAVE");
+      msg.setToState("MASTER");
+      dataAccessor.setProperty(msg.getKey(keyBuilder, instanceName), msg);
+      msgList.add(msg);
+    }
+
+    Assert.assertEquals(dataAccessor.getChildValues(keyBuilder.messages(instanceName), true).size(),
+            nMsgs);
+
+    changeContext.setChangeType(HelixConstants.ChangeType.MESSAGE);
+    executor.onMessage(instanceName, msgList, changeContext);
+
+    Thread.sleep(200);
+
+    // The message should be ignored since toState is the same as current state.
+    Assert.assertEquals(dataAccessor.getChildValues(keyBuilder.messages(instanceName), true).size(),

Review comment:
       How soon is the message gets deleted from instance?

##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -919,9 +912,25 @@ public void onMessage(String instanceName, List<Message> messages,
             // discard the message. Controller will resend if this is a valid message
             throw new HelixException(String.format(
                 "Another state transition for %s:%s is in progress with msg: %s, p2p: %s, read: %d, current:%d. Discarding %s->%s message",
-                message.getResourceName(), message.getPartitionName(), msg.getMsgId(), String.valueOf(msg.isRelayMessage()),
-                msg.getReadTimeStamp(), System.currentTimeMillis(), message.getFromState(),
-                message.getToState()));
+                message.getResourceName(), message.getPartitionName(), msg.getMsgId(),
+                String.valueOf(msg.isRelayMessage()), msg.getReadTimeStamp(),
+                System.currentTimeMillis(), message.getFromState(), message.getToState()));
+          }
+          if (createHandler instanceof HelixStateTransitionHandler) {
+            // We only check to state if there is no ST task scheduled/executing.
+            Exception err = ((HelixStateTransitionHandler) createHandler).isMessageStaled(true /*inSchedulerCheck*/);

Review comment:
       The return type is confusing. `isMessageStaled` implies a boolean value as return. You can change the function name to something like validateStaleMessage, etc, or change the return type.

##########
File path: helix-core/src/main/java/org/apache/helix/task/TaskDriver.java
##########
@@ -980,9 +980,10 @@ public TaskState pollForJobState(String workflowName, String jobName, long timeo
         && System.currentTimeMillis() < st + timeout);
 
     if (ctx == null || !allowedStates.contains(ctx.getJobState(jobName))) {
-      throw new HelixException(
-          String.format("Workflow \"%s\" context is null or job \"%s\" is not in states: %s",
-              workflowName, jobName, allowedStates));
+      String cur = ctx == null ? "null" : ctx.getJobState(jobName).toString();
+      throw new HelixException(String.format(
+          "Workflow \"%s\" context is null or job \"%s\" is not in states: %s, cur state is: %s",
+          workflowName, jobName, allowedStates, cur));

Review comment:
       Is this just a log improvement? The logic is not impacted by this PR's change, right?

##########
File path: helix-core/src/test/java/org/apache/helix/messaging/handling/TestHelixTaskExecutor.java
##########
@@ -377,6 +414,55 @@ public void testDuplicatedMessage() throws InterruptedException {
     System.out.println("END TestHelixTaskExecutor.testDuplicatedMessage()");
   }
 
+  @Test()
+  public void testStaledMessage() throws InterruptedException {
+    System.out.println("START TestHelixTaskExecutor.testStaledMessage()");
+    HelixTaskExecutor executor = new HelixTaskExecutor();
+    HelixManager manager = new MockClusterManager();
+    HelixDataAccessor dataAccessor = manager.getHelixDataAccessor();
+    PropertyKey.Builder keyBuilder = dataAccessor.keyBuilder();
+
+    TestStateTransitionHandlerFactory stateTransitionFactory =
+        new TestStateTransitionHandlerFactory(Message.MessageType.STATE_TRANSITION.name(), 1000);
+    executor.registerMessageHandlerFactory(Message.MessageType.STATE_TRANSITION.name(),
+        stateTransitionFactory);
+
+    NotificationContext changeContext = new NotificationContext(manager);
+    List<Message> msgList = new ArrayList<Message>();
+
+    int nMsgs = 1;
+    String instanceName = manager.getInstanceName();
+    for (int i = 0; i < nMsgs; i++) {
+      Message msg =
+          new Message(Message.MessageType.STATE_TRANSITION.name(), UUID.randomUUID().toString());
+      msg.setTgtSessionId(manager.getSessionId());
+      msg.setCreateTimeStamp((long) i);
+      msg.setTgtName("Localhost_1123");
+      msg.setSrcName("127.101.1.23_2234");
+      msg.setPartitionName("Partition");
+      msg.setResourceName("testStaledMessageResource");
+      msg.setStateModelDef("DummyMasterSlave");
+      msg.setFromState("SLAVE");
+      msg.setToState("MASTER");
+      dataAccessor.setProperty(msg.getKey(keyBuilder, instanceName), msg);
+      msgList.add(msg);
+    }
+
+    Assert.assertEquals(dataAccessor.getChildValues(keyBuilder.messages(instanceName), true).size(),
+            nMsgs);
+
+    changeContext.setChangeType(HelixConstants.ChangeType.MESSAGE);
+    executor.onMessage(instanceName, msgList, changeContext);
+
+    Thread.sleep(200);
+
+    // The message should be ignored since toState is the same as current state.
+    Assert.assertEquals(dataAccessor.getChildValues(keyBuilder.messages(instanceName), true).size(),
+        0);
+
+    System.out.println("END TestHelixTaskExecutor.testDuplicatedMessage()");

Review comment:
       The test name is wrong here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] dasahcc commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
dasahcc commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r490470416



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,36 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  private void validateStaleMessage (boolean isPreCheck) throws Exception {
+    String fromState = _message.getFromState();
+    String toState = _message.getToState();
+    String partitionName = _message.getPartitionName();
+
+    // state in _currentStateDelta uses current state from state model. It has the
+    // most up-to-date. current state. In case currentState in stateModel is null,
+    // partition is in initial state and we using it as current state.
+    // Defined in HelixStateMachineEngine.
+    String state = _currentStateDelta.getState(partitionName);
+
+    if (toState.equalsIgnoreCase(state)) {
+      // To state equals current state, we can just ignore the message
+      throw new HelixDuplicatedStateTransitionException(String

Review comment:
       What I am trying to say is not use HelixDuplicatedStateTransitionException, because there could be other cases for this scenario. If we check log, we may get confused.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang merged pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
jiajunwang merged pull request #1362:
URL: https://github.com/apache/helix/pull/1362


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r488217503



##########
File path: helix-core/src/main/java/org/apache/helix/task/TaskDriver.java
##########
@@ -980,9 +980,10 @@ public TaskState pollForJobState(String workflowName, String jobName, long timeo
         && System.currentTimeMillis() < st + timeout);
 
     if (ctx == null || !allowedStates.contains(ctx.getJobState(jobName))) {
-      throw new HelixException(
-          String.format("Workflow \"%s\" context is null or job \"%s\" is not in states: %s",
-              workflowName, jobName, allowedStates));
+      String cur = ctx == null ? "null" : ctx.getJobState(jobName).toString();
+      throw new HelixException(String.format(
+          "Workflow \"%s\" context is null or job \"%s\" is not in states: %s, cur state is: %s",
+          workflowName, jobName, allowedStates, cur));

Review comment:
       Yes it is a log improvement. I personally find the new log more helpful when debugging. Since we are comparing two states, printing out the expected state and actually state could be more useful. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] dasahcc commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
dasahcc commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r492248965



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,36 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  private void validateStaleMessage (boolean isPreCheck) throws Exception {
+    String fromState = _message.getFromState();
+    String toState = _message.getToState();
+    String partitionName = _message.getPartitionName();
+
+    // state in _currentStateDelta uses current state from state model. It has the
+    // most up-to-date. current state. In case currentState in stateModel is null,
+    // partition is in initial state and we using it as current state.
+    // Defined in HelixStateMachineEngine.
+    String state = _currentStateDelta.getState(partitionName);
+
+    if (toState.equalsIgnoreCase(state)) {
+      // To state equals current state, we can just ignore the message
+      throw new HelixDuplicatedStateTransitionException(String
+          .format("Partition %s current state is same as toState (%s->%s) from message.",
+              partitionName, fromState, toState));
+    } else if (!isPreCheck && fromState != null && !fromState.equals("*") && !fromState

Review comment:
       For the test you talked about, I think we may need to understand why we keep sending message INIT -> RUNNING. Because once the message has been DROPPED, controller should refresh current state and get the TASK_ERROR state. Then we should have some message like TASK_ERROR -> INIT or something, if we miss it, we should fix the logic to make it more robust. And I am OK to not fix it in this PR. But if that's the case, let's propose another PR for that. Otherwise, this is not a complete scenario. For regular resource management, it is not a complete filtering logic.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] pkuwm commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
pkuwm commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r490754564



##########
File path: helix-core/src/test/java/org/apache/helix/messaging/handling/TestHelixTaskExecutor.java
##########
@@ -232,11 +248,29 @@ public HelixTaskResult handleMessage() throws InterruptedException {
       @Override
       public void onError(Exception e, ErrorCode code, ErrorType type) {
       }
+
+      @Override
+      public void precheckForStaleMessage() throws Exception {
+        if (_testIsMessageStaled) {
+          super.precheckForStaleMessage();
+        }
+      }
     }
 
     @Override
     public MessageHandler createHandler(Message message, NotificationContext context) {
-      return new TestStateTransitionMessageHandler(message, context);
+      if (message.getResourceName()!="testStaledMessageResource") {

Review comment:
       `equals()` should be used to check the string names are the same.
   `!=` checks if they are the **same** object: same memory address, etc..

##########
File path: helix-core/src/test/java/org/apache/helix/messaging/handling/TestHelixTaskExecutor.java
##########
@@ -212,18 +213,33 @@ public TestStateTransitionHandlerFactory(String msgType, long delay) {
       _delay = delay;
     }
 
-    class TestStateTransitionMessageHandler extends MessageHandler {
+    class TestStateTransitionMessageHandler extends HelixStateTransitionHandler {
+      boolean _testIsMessageStaled;
+
       public TestStateTransitionMessageHandler(Message message, NotificationContext context) {
-        super(message, context);
+        super(null, null, message, context, null);
+        _testIsMessageStaled = false;
+      }
+
+      public TestStateTransitionMessageHandler(Message message, NotificationContext context,
+          CurrentState currentStateDelta) {
+        super(null, null, message, context, currentStateDelta);
+        _testIsMessageStaled = true;
+
+
       }
 
       @Override
-      public HelixTaskResult handleMessage() throws InterruptedException {
+      public HelixTaskResult handleMessage() {
         HelixTaskResult result = new HelixTaskResult();
         _processedMsgIds.put(_message.getMsgId(), _message.getMsgId());
         if (_delay > 0) {
           System.out.println("Sleeping..." + _delay);
-          Thread.sleep(_delay);
+          try{
+            Thread.sleep(_delay);
+          } catch (Exception e) {
+            assert (false);

Review comment:
       Can you help understand why we need to assert instead of throwing an `InterruptedException`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r497086889



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -919,9 +912,25 @@ public void onMessage(String instanceName, List<Message> messages,
             // discard the message. Controller will resend if this is a valid message
             throw new HelixException(String.format(
                 "Another state transition for %s:%s is in progress with msg: %s, p2p: %s, read: %d, current:%d. Discarding %s->%s message",
-                message.getResourceName(), message.getPartitionName(), msg.getMsgId(), String.valueOf(msg.isRelayMessage()),
-                msg.getReadTimeStamp(), System.currentTimeMillis(), message.getFromState(),
-                message.getToState()));
+                message.getResourceName(), message.getPartitionName(), msg.getMsgId(),
+                String.valueOf(msg.isRelayMessage()), msg.getReadTimeStamp(),
+                System.currentTimeMillis(), message.getFromState(), message.getToState()));
+          }
+          if (createHandler instanceof HelixStateTransitionHandler) {
+            // We only check to state if there is no ST task scheduled/executing.
+            Exception err = ((HelixStateTransitionHandler) createHandler).validateStaleMessage(true /*inSchedulerCheck*/);
+            if (err != null) {
+              throw err;

Review comment:
       Updated.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] dasahcc commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
dasahcc commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r490470416



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,36 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  private void validateStaleMessage (boolean isPreCheck) throws Exception {
+    String fromState = _message.getFromState();
+    String toState = _message.getToState();
+    String partitionName = _message.getPartitionName();
+
+    // state in _currentStateDelta uses current state from state model. It has the
+    // most up-to-date. current state. In case currentState in stateModel is null,
+    // partition is in initial state and we using it as current state.
+    // Defined in HelixStateMachineEngine.
+    String state = _currentStateDelta.getState(partitionName);
+
+    if (toState.equalsIgnoreCase(state)) {
+      // To state equals current state, we can just ignore the message
+      throw new HelixDuplicatedStateTransitionException(String

Review comment:
       What I am trying to say is not use HelixDuplicatedStateTransitionException, because there could be other cases for this scenario. If we check log, we may get confused for new check logic.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r488225516



##########
File path: helix-core/src/test/java/org/apache/helix/messaging/handling/TestHelixTaskExecutor.java
##########
@@ -377,6 +414,55 @@ public void testDuplicatedMessage() throws InterruptedException {
     System.out.println("END TestHelixTaskExecutor.testDuplicatedMessage()");
   }
 
+  @Test()
+  public void testStaledMessage() throws InterruptedException {
+    System.out.println("START TestHelixTaskExecutor.testStaledMessage()");
+    HelixTaskExecutor executor = new HelixTaskExecutor();
+    HelixManager manager = new MockClusterManager();
+    HelixDataAccessor dataAccessor = manager.getHelixDataAccessor();
+    PropertyKey.Builder keyBuilder = dataAccessor.keyBuilder();
+
+    TestStateTransitionHandlerFactory stateTransitionFactory =
+        new TestStateTransitionHandlerFactory(Message.MessageType.STATE_TRANSITION.name(), 1000);
+    executor.registerMessageHandlerFactory(Message.MessageType.STATE_TRANSITION.name(),
+        stateTransitionFactory);
+
+    NotificationContext changeContext = new NotificationContext(manager);
+    List<Message> msgList = new ArrayList<Message>();
+
+    int nMsgs = 1;
+    String instanceName = manager.getInstanceName();
+    for (int i = 0; i < nMsgs; i++) {
+      Message msg =
+          new Message(Message.MessageType.STATE_TRANSITION.name(), UUID.randomUUID().toString());
+      msg.setTgtSessionId(manager.getSessionId());
+      msg.setCreateTimeStamp((long) i);
+      msg.setTgtName("Localhost_1123");
+      msg.setSrcName("127.101.1.23_2234");
+      msg.setPartitionName("Partition");
+      msg.setResourceName("testStaledMessageResource");
+      msg.setStateModelDef("DummyMasterSlave");
+      msg.setFromState("SLAVE");
+      msg.setToState("MASTER");
+      dataAccessor.setProperty(msg.getKey(keyBuilder, instanceName), msg);
+      msgList.add(msg);
+    }
+
+    Assert.assertEquals(dataAccessor.getChildValues(keyBuilder.messages(instanceName), true).size(),
+            nMsgs);
+
+    changeContext.setChangeType(HelixConstants.ChangeType.MESSAGE);
+    executor.onMessage(instanceName, msgList, changeContext);
+
+    Thread.sleep(200);
+
+    // The message should be ignored since toState is the same as current state.
+    Assert.assertEquals(dataAccessor.getChildValues(keyBuilder.messages(instanceName), true).size(),

Review comment:
       In scheduling phase, HelixUtil.removeMessageFromZK will be called as soon as we hit the HelixDuplicatedStateTransitionException exception. There is no thread queueing in between. So I think it should be the IO delay time.  




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] pkuwm commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
pkuwm commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r495429105



##########
File path: helix-core/src/test/java/org/apache/helix/messaging/handling/TestHelixTaskExecutor.java
##########
@@ -212,18 +213,33 @@ public TestStateTransitionHandlerFactory(String msgType, long delay) {
       _delay = delay;
     }
 
-    class TestStateTransitionMessageHandler extends MessageHandler {
+    class TestStateTransitionMessageHandler extends HelixStateTransitionHandler {
+      boolean _testIsMessageStaled;
+
       public TestStateTransitionMessageHandler(Message message, NotificationContext context) {
-        super(message, context);
+        super(null, null, message, context, null);
+        _testIsMessageStaled = false;
+      }
+
+      public TestStateTransitionMessageHandler(Message message, NotificationContext context,
+          CurrentState currentStateDelta) {
+        super(null, null, message, context, currentStateDelta);
+        _testIsMessageStaled = true;
+
+
       }
 
       @Override
-      public HelixTaskResult handleMessage() throws InterruptedException {
+      public HelixTaskResult handleMessage() {
         HelixTaskResult result = new HelixTaskResult();
         _processedMsgIds.put(_message.getMsgId(), _message.getMsgId());
         if (_delay > 0) {
           System.out.println("Sleeping..." + _delay);
-          Thread.sleep(_delay);
+          try{
+            Thread.sleep(_delay);
+          } catch (Exception e) {
+            assert (false);

Review comment:
       I don't think we need this change, as throwing an `InterruptedException` is good enough and cleaner to fail the test, unless the upstream caller catches and swallows the `InterruptedException`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] zhangmeng916 commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
zhangmeng916 commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r491806183



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -919,9 +912,25 @@ public void onMessage(String instanceName, List<Message> messages,
             // discard the message. Controller will resend if this is a valid message
             throw new HelixException(String.format(
                 "Another state transition for %s:%s is in progress with msg: %s, p2p: %s, read: %d, current:%d. Discarding %s->%s message",
-                message.getResourceName(), message.getPartitionName(), msg.getMsgId(), String.valueOf(msg.isRelayMessage()),
-                msg.getReadTimeStamp(), System.currentTimeMillis(), message.getFromState(),
-                message.getToState()));
+                message.getResourceName(), message.getPartitionName(), msg.getMsgId(),
+                String.valueOf(msg.isRelayMessage()), msg.getReadTimeStamp(),
+                System.currentTimeMillis(), message.getFromState(), message.getToState()));
+          }
+          if (createHandler instanceof HelixStateTransitionHandler) {
+            // We only check to state if there is no ST task scheduled/executing.
+            Exception err = ((HelixStateTransitionHandler) createHandler).validateStaleMessage(true /*inSchedulerCheck*/);
+            if (err != null) {
+              throw err;
+            }
+          }
+          if (stateTransitionHandlers.containsKey(messageTarget)) {

Review comment:
       So I think the logic changed here. Previously we always discard the later one (so it's possible we choose the stale one, which is the earlier one). But now we do a filtering first based on current state, and if the message is stale, we will directly delete it, and there won't be duplicated message exception at all. Is this the logic you think?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] zhangmeng916 commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
zhangmeng916 commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r488229740



##########
File path: helix-core/src/test/java/org/apache/helix/messaging/handling/TestHelixTaskExecutor.java
##########
@@ -377,6 +414,55 @@ public void testDuplicatedMessage() throws InterruptedException {
     System.out.println("END TestHelixTaskExecutor.testDuplicatedMessage()");
   }
 
+  @Test()
+  public void testStaledMessage() throws InterruptedException {
+    System.out.println("START TestHelixTaskExecutor.testStaledMessage()");
+    HelixTaskExecutor executor = new HelixTaskExecutor();
+    HelixManager manager = new MockClusterManager();
+    HelixDataAccessor dataAccessor = manager.getHelixDataAccessor();
+    PropertyKey.Builder keyBuilder = dataAccessor.keyBuilder();
+
+    TestStateTransitionHandlerFactory stateTransitionFactory =
+        new TestStateTransitionHandlerFactory(Message.MessageType.STATE_TRANSITION.name(), 1000);
+    executor.registerMessageHandlerFactory(Message.MessageType.STATE_TRANSITION.name(),
+        stateTransitionFactory);
+
+    NotificationContext changeContext = new NotificationContext(manager);
+    List<Message> msgList = new ArrayList<Message>();
+
+    int nMsgs = 1;
+    String instanceName = manager.getInstanceName();
+    for (int i = 0; i < nMsgs; i++) {
+      Message msg =
+          new Message(Message.MessageType.STATE_TRANSITION.name(), UUID.randomUUID().toString());
+      msg.setTgtSessionId(manager.getSessionId());
+      msg.setCreateTimeStamp((long) i);
+      msg.setTgtName("Localhost_1123");
+      msg.setSrcName("127.101.1.23_2234");
+      msg.setPartitionName("Partition");
+      msg.setResourceName("testStaledMessageResource");
+      msg.setStateModelDef("DummyMasterSlave");
+      msg.setFromState("SLAVE");
+      msg.setToState("MASTER");
+      dataAccessor.setProperty(msg.getKey(keyBuilder, instanceName), msg);
+      msgList.add(msg);
+    }
+
+    Assert.assertEquals(dataAccessor.getChildValues(keyBuilder.messages(instanceName), true).size(),
+            nMsgs);
+
+    changeContext.setChangeType(HelixConstants.ChangeType.MESSAGE);
+    executor.onMessage(instanceName, msgList, changeContext);
+
+    Thread.sleep(200);
+
+    // The message should be ignored since toState is the same as current state.
+    Assert.assertEquals(dataAccessor.getChildValues(keyBuilder.messages(instanceName), true).size(),

Review comment:
       ok, that's fine. Asked cause I saw the 200ms sleep, just to confirm it immediately happens.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r490476885



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,36 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  private void validateStaleMessage (boolean isPreCheck) throws Exception {
+    String fromState = _message.getFromState();
+    String toState = _message.getToState();
+    String partitionName = _message.getPartitionName();
+
+    // state in _currentStateDelta uses current state from state model. It has the
+    // most up-to-date. current state. In case currentState in stateModel is null,
+    // partition is in initial state and we using it as current state.
+    // Defined in HelixStateMachineEngine.
+    String state = _currentStateDelta.getState(partitionName);
+
+    if (toState.equalsIgnoreCase(state)) {
+      // To state equals current state, we can just ignore the message
+      throw new HelixDuplicatedStateTransitionException(String

Review comment:
       Thanks for the follow up. 
   In my understanding, this message must be a new message generated by controller, maybe caused by reading the old currentState etc.Since this is a new message, doesn't this count as DuplicatedStateTransition?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r492321938



##########
File path: helix-core/src/test/java/org/apache/helix/messaging/handling/TestHelixTaskExecutor.java
##########
@@ -232,11 +248,29 @@ public HelixTaskResult handleMessage() throws InterruptedException {
       @Override
       public void onError(Exception e, ErrorCode code, ErrorType type) {
       }
+
+      @Override
+      public void precheckForStaleMessage() throws Exception {
+        if (_testIsMessageStaled) {
+          super.precheckForStaleMessage();
+        }
+      }
     }
 
     @Override
     public MessageHandler createHandler(Message message, NotificationContext context) {
-      return new TestStateTransitionMessageHandler(message, context);
+      if (message.getResourceName()!="testStaledMessageResource") {

Review comment:
       TFTR. Updated. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r498412738



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -460,7 +438,33 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
       StateTransitionError error = new StateTransitionError(type, code, e);
       _stateModel.rollbackOnError(_message, _notificationContext, error);
     }
+  }
+
+  // Verify the fromState and current state of the stateModel.
+  public Exception staleMessageValidator() {

Review comment:
       In my understanding, validateXXX is a function throws exception and returns void or boolean. Since I have a function that returns an exception, I am not sure about the name validateXXX.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] dasahcc commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
dasahcc commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r492248965



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,36 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  private void validateStaleMessage (boolean isPreCheck) throws Exception {
+    String fromState = _message.getFromState();
+    String toState = _message.getToState();
+    String partitionName = _message.getPartitionName();
+
+    // state in _currentStateDelta uses current state from state model. It has the
+    // most up-to-date. current state. In case currentState in stateModel is null,
+    // partition is in initial state and we using it as current state.
+    // Defined in HelixStateMachineEngine.
+    String state = _currentStateDelta.getState(partitionName);
+
+    if (toState.equalsIgnoreCase(state)) {
+      // To state equals current state, we can just ignore the message
+      throw new HelixDuplicatedStateTransitionException(String
+          .format("Partition %s current state is same as toState (%s->%s) from message.",
+              partitionName, fromState, toState));
+    } else if (!isPreCheck && fromState != null && !fromState.equals("*") && !fromState

Review comment:
       For the test you talked about, I think we may need to understand why we keep sending message INIT -> RUNNING. Because once the message has been DROPPED, controller should refresh current state and get the TASK_ERROR state. Then we should have some message like TASK_ERROR -> INIT or something, if we miss it, we should fix the logic to make it more robust. And I am OK to not fix it in this PR. But if that's the case, let's propose another PR for that. Otherwise, this is not a complete scenario. For regular resource management, it is not a complete filtering logic.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] zhangmeng916 commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
zhangmeng916 commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r491806183



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -919,9 +912,25 @@ public void onMessage(String instanceName, List<Message> messages,
             // discard the message. Controller will resend if this is a valid message
             throw new HelixException(String.format(
                 "Another state transition for %s:%s is in progress with msg: %s, p2p: %s, read: %d, current:%d. Discarding %s->%s message",
-                message.getResourceName(), message.getPartitionName(), msg.getMsgId(), String.valueOf(msg.isRelayMessage()),
-                msg.getReadTimeStamp(), System.currentTimeMillis(), message.getFromState(),
-                message.getToState()));
+                message.getResourceName(), message.getPartitionName(), msg.getMsgId(),
+                String.valueOf(msg.isRelayMessage()), msg.getReadTimeStamp(),
+                System.currentTimeMillis(), message.getFromState(), message.getToState()));
+          }
+          if (createHandler instanceof HelixStateTransitionHandler) {
+            // We only check to state if there is no ST task scheduled/executing.
+            Exception err = ((HelixStateTransitionHandler) createHandler).validateStaleMessage(true /*inSchedulerCheck*/);
+            if (err != null) {
+              throw err;
+            }
+          }
+          if (stateTransitionHandlers.containsKey(messageTarget)) {

Review comment:
       So I think the logic changed here. Previously we always discard the later one (so it's possible we choose the stale one, which is the earlier one). But now we do a filtering first based on current state, and if the message is stale, we will directly delete it, and there won't be duplicated message exception at all. Is this the logic you think?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r492320893



##########
File path: helix-core/src/test/java/org/apache/helix/messaging/handling/TestHelixTaskExecutor.java
##########
@@ -212,18 +213,33 @@ public TestStateTransitionHandlerFactory(String msgType, long delay) {
       _delay = delay;
     }
 
-    class TestStateTransitionMessageHandler extends MessageHandler {
+    class TestStateTransitionMessageHandler extends HelixStateTransitionHandler {
+      boolean _testIsMessageStaled;
+
       public TestStateTransitionMessageHandler(Message message, NotificationContext context) {
-        super(message, context);
+        super(null, null, message, context, null);
+        _testIsMessageStaled = false;
+      }
+
+      public TestStateTransitionMessageHandler(Message message, NotificationContext context,
+          CurrentState currentStateDelta) {
+        super(null, null, message, context, currentStateDelta);
+        _testIsMessageStaled = true;
+
+
       }
 
       @Override
-      public HelixTaskResult handleMessage() throws InterruptedException {
+      public HelixTaskResult handleMessage() {
         HelixTaskResult result = new HelixTaskResult();
         _processedMsgIds.put(_message.getMsgId(), _message.getMsgId());
         if (_delay > 0) {
           System.out.println("Sleeping..." + _delay);
-          Thread.sleep(_delay);
+          try{
+            Thread.sleep(_delay);
+          } catch (Exception e) {
+            assert (false);

Review comment:
       I think the reason is that the base handleMessage does not throw exception. The try-catch block here is only for handle exceptions in test. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on pull request #1362:
URL: https://github.com/apache/helix/pull/1362#issuecomment-702375221


   This PR is ready to be merged.
   
   Final commit message:
   Move ST message to state validation from executing phase to scheduling phase.
   
   Move ST message to state validation to scheduling phase.because of the following reasons:
   It is a no-op operation, thus we should not create a thread to execute it.
   The lag between scheduling and executing could add additional latency for controller and eventually application latency.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r494043007



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -919,9 +912,25 @@ public void onMessage(String instanceName, List<Message> messages,
             // discard the message. Controller will resend if this is a valid message
             throw new HelixException(String.format(
                 "Another state transition for %s:%s is in progress with msg: %s, p2p: %s, read: %d, current:%d. Discarding %s->%s message",
-                message.getResourceName(), message.getPartitionName(), msg.getMsgId(), String.valueOf(msg.isRelayMessage()),
-                msg.getReadTimeStamp(), System.currentTimeMillis(), message.getFromState(),
-                message.getToState()));
+                message.getResourceName(), message.getPartitionName(), msg.getMsgId(),
+                String.valueOf(msg.isRelayMessage()), msg.getReadTimeStamp(),
+                System.currentTimeMillis(), message.getFromState(), message.getToState()));
+          }
+          if (createHandler instanceof HelixStateTransitionHandler) {
+            // We only check to state if there is no ST task scheduled/executing.
+            Exception err = ((HelixStateTransitionHandler) createHandler).validateStaleMessage(true /*inSchedulerCheck*/);
+            if (err != null) {
+              throw err;

Review comment:
       try-catch is relatively expensive. And it might be misused if the caught exception is too wildly defined. So my suggestion is to handle the error case with regular logic as long as it is possible.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r494043007



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -919,9 +912,25 @@ public void onMessage(String instanceName, List<Message> messages,
             // discard the message. Controller will resend if this is a valid message
             throw new HelixException(String.format(
                 "Another state transition for %s:%s is in progress with msg: %s, p2p: %s, read: %d, current:%d. Discarding %s->%s message",
-                message.getResourceName(), message.getPartitionName(), msg.getMsgId(), String.valueOf(msg.isRelayMessage()),
-                msg.getReadTimeStamp(), System.currentTimeMillis(), message.getFromState(),
-                message.getToState()));
+                message.getResourceName(), message.getPartitionName(), msg.getMsgId(),
+                String.valueOf(msg.isRelayMessage()), msg.getReadTimeStamp(),
+                System.currentTimeMillis(), message.getFromState(), message.getToState()));
+          }
+          if (createHandler instanceof HelixStateTransitionHandler) {
+            // We only check to state if there is no ST task scheduled/executing.
+            Exception err = ((HelixStateTransitionHandler) createHandler).validateStaleMessage(true /*inSchedulerCheck*/);
+            if (err != null) {
+              throw err;

Review comment:
       try-catch is relatively expensive. And it might be misused if the caught exception is too wildly defined. So my suggestion is to handle the error case with regular logic as long as it is possible.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a change in pull request #1362: Move ST message to state validation from executing phase to scheduling phase.

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on a change in pull request #1362:
URL: https://github.com/apache/helix/pull/1362#discussion_r490412279



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -463,6 +442,36 @@ public void onError(Exception e, ErrorCode code, ErrorType type) {
 
   }
 
+  // Verify the fromState and current state of the stateModel.
+  private void validateStaleMessage (boolean isPreCheck) throws Exception {
+    String fromState = _message.getFromState();
+    String toState = _message.getToState();
+    String partitionName = _message.getPartitionName();
+
+    // state in _currentStateDelta uses current state from state model. It has the
+    // most up-to-date. current state. In case currentState in stateModel is null,
+    // partition is in initial state and we using it as current state.
+    // Defined in HelixStateMachineEngine.
+    String state = _currentStateDelta.getState(partitionName);
+
+    if (toState.equalsIgnoreCase(state)) {
+      // To state equals current state, we can just ignore the message
+      throw new HelixDuplicatedStateTransitionException(String
+          .format("Partition %s current state is same as toState (%s->%s) from message.",
+              partitionName, fromState, toState));
+    } else if (!isPreCheck && fromState != null && !fromState.equals("*") && !fromState

Review comment:
       TFTR. 
   Originally in HelixStateTransitionHandler, handling for these two cases are different.
   For the first case when toState == currentState, we just treat the task as succeeded and finish task. There is no difference between doing the no-op when executing and bypass the task at scheduling phase.
   
   In the second case when currentState != fromState, executor will set the task as failed and report state transition error. Ir will also update ZK.
   
   Having same logic causes a test fail. Controller keeps sending many 'INIT->RUNNING' message when currentState is TASK_ERROR if we don't report task error. Currently task error is reported in HelixStateTransitionHandler.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org