You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@helix.apache.org by GitBox <gi...@apache.org> on 2021/04/30 19:47:47 UTC

[GitHub] [helix] jiajunwang opened a new pull request #1721: Remove the logic that disables the partitions which fails to be state-transited from the ERROR state.

jiajunwang opened a new pull request #1721:
URL: https://github.com/apache/helix/pull/1721


   ### Issues
   
   - [X] My PR addresses the following Helix issues and references them in the PR description:
   
   close #1720 
   
   ### Description
   
   - [X] Here are some details about my PR, including screenshots of any UI changes:
   
   This logic is causing confusion when the partitions are disabled unexpected.
   In reality, if the partition fails to finish a state-transition from the ERROR state, it will remain in the ERROR state. And disabling it, in addition, is not desired.
   
   ### Tests
   
   - [ ] The following tests are written for this issue:
   
   There is no additional logic added. I will double-check if there is any existing test that I need to remove.
   
   - The following is the result of the "mvn test" command on the appropriate module:
   
   (If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)
   
   ### Documentation (Optional)
   
   - In case of new functionality, my PR adds documentation in the following wiki page:
   
   (Link the GitHub wiki you added)
   
   ### Commits
   
   - My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Code Quality
   
   - My diff has been formatted using helix-style.xml 
   (helix-style-intellij.xml if IntelliJ IDE is used)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1721: Remove the logic that disables the partitions which fails to be state-transited from the ERROR state.

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1721:
URL: https://github.com/apache/helix/pull/1721#discussion_r625307742



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -263,11 +263,6 @@ void postHandleMessage() {
         _stateModel.rollbackOnError(_message, _notificationContext, error);
         _currentStateDelta.setState(partitionKey, HelixDefinedState.ERROR.toString());
         _stateModel.updateState(HelixDefinedState.ERROR.toString());
-
-        // if we have errors transit from ERROR state, disable the partition
-        if (_message.getFromState().equalsIgnoreCase(HelixDefinedState.ERROR.toString())) {
-          disablePartition();

Review comment:
       An ERROR partition that is reset without error won't be disabled. Only when the resetting fails, it is disabled.
   
   And I also think it is designed for extra cautious. However, if you think about this story, the only reason a partition becomes OFFLINE and disabled is that the admin tries to reset multiple times, and a latter attempt is successfully done. Meaning the Helix lib behaves differently according to how many times the reset is done to a partition. This is very confusing.
   
   Assuming Helix users are reasonable and they do the necessary checks in the ERROR -> OFFLINE ST to confirm the partition data is in good shape, then disabling the partition does not really help anyone.
   On the other hand, if we think ERROR -> OFFLINE ST is not safe and the application logic is not complete, why shall we trust the human validate logic?
   
   Overall, I think the validation should be the responsibility of ERROR -> OFFLINE ST. And if the check fails, the partition will remain in an ERROR state. So disabling it or not does not make difference.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang merged pull request #1721: Remove the logic that disables the partitions which fails to be state-transited from the ERROR state.

Posted by GitBox <gi...@apache.org>.
jiajunwang merged pull request #1721:
URL: https://github.com/apache/helix/pull/1721


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] narendly commented on a change in pull request #1721: Remove the logic that disables the partitions which fails to be state-transited from the ERROR state.

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1721:
URL: https://github.com/apache/helix/pull/1721#discussion_r624569519



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java
##########
@@ -263,11 +263,6 @@ void postHandleMessage() {
         _stateModel.rollbackOnError(_message, _notificationContext, error);
         _currentStateDelta.setState(partitionKey, HelixDefinedState.ERROR.toString());
         _stateModel.updateState(HelixDefinedState.ERROR.toString());
-
-        // if we have errors transit from ERROR state, disable the partition
-        if (_message.getFromState().equalsIgnoreCase(HelixDefinedState.ERROR.toString())) {
-          disablePartition();

Review comment:
       This is a general question related to the original intent/design of this feature. My understanding was that we disable partitions/replicas going from `ERROR` to other states because we want some control in how the cluster will be rebalanced/shuffled after that partition exits out of the ERROR state; in other words, we don't want it to "rejoin" the cluster as an enabled partition right away.
   
   This is because for partitions transitioning out of ERROR are usually being monitored by the application admin (they have to explicitly "reset" (or fix) the partition, and usually they want to wait until the application has properly finished resetting on their end. So to take extra caution, we transition it out of ERROR in a disabled state. Once the application admin deems it okay for this particular partition to be rebalanced into the cluster, they can manually enable it.
   
   What do you think? Isn't this the current usage pattern with Helix applications?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on pull request #1721: Remove the logic that disables the partitions which fails to be state-transited from the ERROR state.

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on pull request #1721:
URL: https://github.com/apache/helix/pull/1721#issuecomment-832282284


   Approved by @narendly , ready to be merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org