You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@helix.apache.org by GitBox <gi...@apache.org> on 2023/01/13 18:34:54 UTC

[GitHub] [helix] xyuanlu opened a new pull request, #2340: [WIP - waiting for CI] Add metrics for rebalance throttled because of error partition

xyuanlu opened a new pull request, #2340:
URL: https://github.com/apache/helix/pull/2340

   ### Issues
   
   - [ ] My PR addresses the following Helix issues and references them in the PR description:
   
   (#200 - Link your issue number here: You can write "Fixes #XXX". Please use the proper keyword so that the issue gets closed automatically. See https://docs.github.com/en/github/managing-your-work-on-github/linking-a-pull-request-to-an-issue
   Any of the following keywords can be used: close, closes, closed, fix, fixes, fixed, resolve, resolves, resolved)
   
   ### Description
   
   - [ ] Here are some details about my PR, including screenshots of any UI changes:
   
   (Write a concise description including what, why, how)
   
   ### Tests
   
   - [ ] The following tests are written for this issue:
   
   (List the names of added unit/integration tests)
   
   - The following is the result of the "mvn test" command on the appropriate module:
   
   (If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)
   
   ### Changes that Break Backward Compatibility (Optional)
   
   - My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:
   
   (Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)
   
   ### Documentation (Optional)
   
   - In case of new functionality, my PR adds documentation in the following wiki page:
   
   (Link the GitHub wiki you added)
   
   ### Commits
   
   - My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Code Quality
   
   - My diff has been formatted using helix-style.xml 
   (helix-style-intellij.xml if IntelliJ IDE is used)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2340: Add metrics for rebalance throttled because of error partition

Posted by GitBox <gi...@apache.org>.
rahulrane50 commented on code in PR #2340:
URL: https://github.com/apache/helix/pull/2340#discussion_r1073928319


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ResourceMonitor.java:
##########
@@ -128,6 +129,8 @@ public ResourceMonitor(String clusterName, String resourceName, ObjectName objec
     _numOfPartitionsInExternalView = new SimpleDynamicMetric("ExternalViewPartitionGauge", 0L);
     _numOfPartitions = new SimpleDynamicMetric("PartitionGauge", 0L);
     _numPendingStateTransitions = new SimpleDynamicMetric("PendingStateTransitionGauge", 0L);
+    _rebalanceThrottledByErrorPartitionGauge =

Review Comment:
   I think you missed to reset it's value in resetResourceStateGauges() method?



##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ResourceMonitor.java:
##########
@@ -371,11 +374,12 @@ public void updateStateHandoffStats(MonitorState monitorState, long totalDuratio
 
   public void updateRebalancerStats(long numPendingRecoveryRebalancePartitions,
       long numPendingLoadRebalancePartitions, long numRecoveryRebalanceThrottledPartitions,
-      long numLoadRebalanceThrottledPartitions) {
+      long numLoadRebalanceThrottledPartitions, boolean rebalanceThrottledByErrorPartitio) {

Review Comment:
   neat typo: rebalanceThrottledByErrorPartitions



##########
helix-core/src/test/java/org/apache/helix/controller/stages/TestIntermediateStateCalcStage.java:
##########
@@ -272,6 +273,128 @@ public void testWithClusterConfigChange() {
     }
   }
 
+  @Test
+  public void testThrottleByErrorPartition() {
+    String resourcePrefix = "resource";
+    int nResource = 3;
+    int nPartition = 3;
+    int nReplica = 3;
+
+    String[] resources = new String[nResource];
+    for (int i = 0; i < nResource; i++) {
+      resources[i] = resourcePrefix + "_" + i;
+    }
+
+    preSetup(resources, nReplica, nReplica);
+    event.addAttribute(AttributeName.RESOURCES.name(),
+        getResourceMap(resources, nPartition, "OnlineOffline"));
+    event.addAttribute(AttributeName.RESOURCES_TO_REBALANCE.name(),
+        getResourceMap(resources, nPartition, "OnlineOffline"));
+    ClusterStatusMonitor monitor = new ClusterStatusMonitor(_clusterName);
+    monitor.active();
+    event.addAttribute(AttributeName.clusterStatusMonitor.name(), monitor);
+
+    // Initialize best possible state and current state
+    BestPossibleStateOutput bestPossibleStateOutput = new BestPossibleStateOutput();
+    MessageOutput messageSelectOutput = new MessageOutput();
+    CurrentStateOutput currentStateOutput = new CurrentStateOutput();
+    IntermediateStateOutput expectedResult = new IntermediateStateOutput();
+
+    _clusterConfig.setErrorOrRecoveryPartitionThresholdForLoadBalance(0);

Review Comment:
   For my understanding, we have set this value 0 here mean there always be downward rebalance because all positive values values will be greater than 0. If that's the correct understanding then don't we want to test for positive threshold value?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2340: Add metrics for rebalance throttled because of error partition

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on code in PR #2340:
URL: https://github.com/apache/helix/pull/2340#discussion_r1073975830


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ResourceMonitor.java:
##########
@@ -128,6 +129,8 @@ public ResourceMonitor(String clusterName, String resourceName, ObjectName objec
     _numOfPartitionsInExternalView = new SimpleDynamicMetric("ExternalViewPartitionGauge", 0L);
     _numOfPartitions = new SimpleDynamicMetric("PartitionGauge", 0L);
     _numPendingStateTransitions = new SimpleDynamicMetric("PendingStateTransitionGauge", 0L);
+    _rebalanceThrottledByErrorPartitionGauge =

Review Comment:
   I don't think we want to reset this value. If there is no pipeline/rebalance, it keeps at the current value. This value is updated when intermediate state is triggered. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2340: Add metrics for rebalance throttled because of error partition

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on code in PR #2340:
URL: https://github.com/apache/helix/pull/2340#discussion_r1080678210


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ResourceMonitor.java:
##########
@@ -371,11 +374,12 @@ public void updateStateHandoffStats(MonitorState monitorState, long totalDuratio
 
   public void updateRebalancerStats(long numPendingRecoveryRebalancePartitions,
       long numPendingLoadRebalancePartitions, long numRecoveryRebalanceThrottledPartitions,
-      long numLoadRebalanceThrottledPartitions) {
+      long numLoadRebalanceThrottledPartitions, boolean rebalanceThrottledByErrorPartitio) {
     _numPendingRecoveryRebalanceReplicas.updateValue(numPendingRecoveryRebalancePartitions);
     _numPendingLoadRebalanceReplicas.updateValue(numPendingLoadRebalancePartitions);
     _numRecoveryRebalanceThrottledReplicas.updateValue(numRecoveryRebalanceThrottledPartitions);
     _numLoadRebalanceThrottledReplicas.updateValue(numLoadRebalanceThrottledPartitions);
+    _rebalanceThrottledByErrorPartitionGauge.updateValue(rebalanceThrottledByErrorPartitio? 1L : 0L);

Review Comment:
   It indicating rebalance is throttled because of there are error partition. Other throttled reasons can be self recovered (pending message etc), but error partition need manual handling. So it is listed as a separate metrics. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2340: Add metrics for rebalance throttled because of error partition

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on code in PR #2340:
URL: https://github.com/apache/helix/pull/2340#discussion_r1072868935


##########
helix-core/src/main/java/org/apache/helix/controller/stages/IntermediateStateCalcStage.java:
##########
@@ -421,7 +419,8 @@ private PartitionStateMap computeIntermediatePartitionState(ResourceControllerDa
     if (clusterStatusMonitor != null) {
       clusterStatusMonitor
           .updateRebalancerStats(resourceName, messagesForRecovery.size(), messagesForLoad.size(),
-              messagesThrottledForRecovery.size(), messagesThrottledForLoad.size());
+              messagesThrottledForRecovery.size(), messagesThrottledForLoad.size(),
+              onlyDownwardLoadBalance);

Review Comment:
   TFTR. `onlyDownwardLoadBalance` is a 0-1 value metric. 1 means the load babalance is throttled because of current ERROR partition count > defined threshold. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2340: Add metrics for rebalance throttled because of error partition

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on code in PR #2340:
URL: https://github.com/apache/helix/pull/2340#discussion_r1081928254


##########
helix-core/src/test/java/org/apache/helix/controller/stages/TestIntermediateStateCalcStage.java:
##########
@@ -264,6 +268,136 @@ public void testWithClusterConfigChange() {
 
     IntermediateStateOutput output = event.getAttribute(AttributeName.INTERMEDIATE_STATE.name());
 
+
+    // Validate that there are 2 resourced load balance been throttled

Review Comment:
   Right. Thanks for catching it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2340: Add metrics for rebalance throttled because of error partition

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on code in PR #2340:
URL: https://github.com/apache/helix/pull/2340#discussion_r1072893104


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -1121,4 +1120,13 @@ public long getPendingStateTransitionGuage() {
     }
     return total;
   }
+
+  @Override
+  public long getRebalanceThrottledByErrorPartition() {
+    long total = 0;
+    for (Map.Entry<String, ResourceMonitor> entry : _resourceMonitorMap.entrySet()) {

Review Comment:
   IMHO, lamda is slower. Probably we want better performance in functions being called rapidly. (1/min)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2340: Add metrics for rebalance throttled because of error partition

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on code in PR #2340:
URL: https://github.com/apache/helix/pull/2340#discussion_r1080617281


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -1121,4 +1120,13 @@ public long getPendingStateTransitionGuage() {
     }
     return total;
   }
+
+  @Override
+  public long getRebalanceThrottledByErrorPartition() {
+    long total = 0;
+    for (Map.Entry<String, ResourceMonitor> entry : _resourceMonitorMap.entrySet()) {

Review Comment:
   TFTR. I slightly would prefer keep it homogeneous to other functions and the I could run a small profiler later. Could do optimization for all these metrics aggregation functions later in a separate PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2340: Add metrics for rebalance throttled because of error partition

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on code in PR #2340:
URL: https://github.com/apache/helix/pull/2340#discussion_r1072918228


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -937,8 +937,7 @@ private void unregisterAllJobs() {
     }
   }
 
-  // For test only
-  protected ResourceMonitor getResourceMonitor(String resourceName) {
+  public ResourceMonitor getResourceMonitor(String resourceName) {

Review Comment:
   The function need to be accessed by tests in another package. So changed it to public. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] qqu0127 commented on a diff in pull request #2340: Add metrics for rebalance throttled because of error partition

Posted by GitBox <gi...@apache.org>.
qqu0127 commented on code in PR #2340:
URL: https://github.com/apache/helix/pull/2340#discussion_r1073739996


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -1121,4 +1120,13 @@ public long getPendingStateTransitionGuage() {
     }
     return total;
   }
+
+  @Override
+  public long getRebalanceThrottledByErrorPartition() {
+    long total = 0;
+    for (Map.Entry<String, ResourceMonitor> entry : _resourceMonitorMap.entrySet()) {

Review Comment:
   Performance wise, Lambda/stream is not necessarily slower than for-loop, we also have parallel stream. Besides, benchmarking and then optimizing is the usual way for performance optimization.
   
   I don't have strong opinion on either approach, but if you go with forloop, you can simply use `Map.values()` if you don't care about keys.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] desaikomal commented on a diff in pull request #2340: Add metrics for rebalance throttled because of error partition

Posted by GitBox <gi...@apache.org>.
desaikomal commented on code in PR #2340:
URL: https://github.com/apache/helix/pull/2340#discussion_r1073113630


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ResourceMonitor.java:
##########
@@ -371,11 +374,12 @@ public void updateStateHandoffStats(MonitorState monitorState, long totalDuratio
 
   public void updateRebalancerStats(long numPendingRecoveryRebalancePartitions,
       long numPendingLoadRebalancePartitions, long numRecoveryRebalanceThrottledPartitions,
-      long numLoadRebalanceThrottledPartitions) {
+      long numLoadRebalanceThrottledPartitions, boolean rebalanceThrottledByErrorPartitio) {
     _numPendingRecoveryRebalanceReplicas.updateValue(numPendingRecoveryRebalancePartitions);
     _numPendingLoadRebalanceReplicas.updateValue(numPendingLoadRebalancePartitions);
     _numRecoveryRebalanceThrottledReplicas.updateValue(numRecoveryRebalanceThrottledPartitions);
     _numLoadRebalanceThrottledReplicas.updateValue(numLoadRebalanceThrottledPartitions);
+    _rebalanceThrottledByErrorPartitionGauge.updateValue(rebalanceThrottledByErrorPartitio? 1L : 0L);

Review Comment:
   so this is just boolean indicating throttled or not due to error partition?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] desaikomal commented on a diff in pull request #2340: Add metrics for rebalance throttled because of error partition

Posted by GitBox <gi...@apache.org>.
desaikomal commented on code in PR #2340:
URL: https://github.com/apache/helix/pull/2340#discussion_r1072826719


##########
helix-core/src/main/java/org/apache/helix/controller/stages/IntermediateStateCalcStage.java:
##########
@@ -421,7 +419,8 @@ private PartitionStateMap computeIntermediatePartitionState(ResourceControllerDa
     if (clusterStatusMonitor != null) {
       clusterStatusMonitor
           .updateRebalancerStats(resourceName, messagesForRecovery.size(), messagesForLoad.size(),
-              messagesThrottledForRecovery.size(), messagesThrottledForLoad.size());
+              messagesThrottledForRecovery.size(), messagesThrottledForLoad.size(),
+              onlyDownwardLoadBalance);

Review Comment:
   Is  onlyDownwardLoadBalance same as ERROR partition count?



##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -937,8 +937,7 @@ private void unregisterAllJobs() {
     }
   }
 
-  // For test only
-  protected ResourceMonitor getResourceMonitor(String resourceName) {
+  public ResourceMonitor getResourceMonitor(String resourceName) {

Review Comment:
   any particular reason to make it public?



##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -1121,4 +1120,13 @@ public long getPendingStateTransitionGuage() {
     }
     return total;
   }
+
+  @Override
+  public long getRebalanceThrottledByErrorPartition() {
+    long total = 0;
+    for (Map.Entry<String, ResourceMonitor> entry : _resourceMonitorMap.entrySet()) {

Review Comment:
   i am not great with Java8 lambda, but can this be written using that??  Again, I am not expert.



##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ResourceMonitor.java:
##########
@@ -467,19 +475,13 @@ public void resetMaxTopStateHandoffGauge() {
         _numPendingLoadRebalanceReplicas,
         _numRecoveryRebalanceThrottledReplicas,
         _numLoadRebalanceThrottledReplicas,
-        _externalViewIdealStateDiff,
-        _successfulTopStateHandoffDurationCounter,
-        _successTopStateHandoffCounter,
-        _failedTopStateHandoffCounter,
-        _maxSinglePartitionTopStateHandoffDuration,
-        _partitionTopStateHandoffDurationGauge,
+        _externalViewIdealStateDiff, _successfulTopStateHandoffDurationCounter,
+        _successTopStateHandoffCounter, _failedTopStateHandoffCounter,
+        _maxSinglePartitionTopStateHandoffDuration, _partitionTopStateHandoffDurationGauge,
         _partitionTopStateHandoffHelixLatencyGauge,
-        _partitionTopStateNonGracefulHandoffDurationGauge,
-        _totalMessageReceived,
-        _totalMessageReceivedCounter,
-        _numPendingStateTransitions,
-        _rebalanceState
-    );
+        _partitionTopStateNonGracefulHandoffDurationGauge, _totalMessageReceived,
+        _totalMessageReceivedCounter, _numPendingStateTransitions, _rebalanceState,

Review Comment:
   earlier style was keeping metrics per-line, you are combining 2 per-line. any particular reason?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu merged pull request #2340: Add metrics for rebalance throttled because of error partition

Posted by "xyuanlu (via GitHub)" <gi...@apache.org>.
xyuanlu merged PR #2340:
URL: https://github.com/apache/helix/pull/2340


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] mgao0 commented on a diff in pull request #2340: Add metrics for rebalance throttled because of error partition

Posted by GitBox <gi...@apache.org>.
mgao0 commented on code in PR #2340:
URL: https://github.com/apache/helix/pull/2340#discussion_r1081708151


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -1121,4 +1120,13 @@ public long getPendingStateTransitionGuage() {
     }
     return total;
   }
+
+  @Override
+  public long getRebalanceThrottledByErrorPartitionGauge() {

Review Comment:
   The naming is confusing. It is called getRebalanceThrottledByErrorPartitionGauge both in ClusterStatusMonitor and in ResourceMonitor. You may consider something like numOfResourcesRebalanceThrottledGauge, or some other name you think is more clear.



##########
helix-core/src/test/java/org/apache/helix/controller/stages/TestIntermediateStateCalcStage.java:
##########
@@ -264,6 +268,136 @@ public void testWithClusterConfigChange() {
 
     IntermediateStateOutput output = event.getAttribute(AttributeName.INTERMEDIATE_STATE.name());
 
+
+    // Validate that there are 2 resourced load balance been throttled

Review Comment:
   Did you mean "0" here?



##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -1121,4 +1120,13 @@ public long getPendingStateTransitionGuage() {
     }
     return total;
   }
+
+  @Override
+  public long getRebalanceThrottledByErrorPartition() {
+    long total = 0;
+    for (Map.Entry<String, ResourceMonitor> entry : _resourceMonitorMap.entrySet()) {

Review Comment:
   This is a simple addition and no exceptions are thrown so lamda will be a good option. But also agree that we can keep the style consistent throughout the codebase and do a global replacement later once we know more about the performance impact. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2340: Add metrics for rebalance throttled because of error partition

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on code in PR #2340:
URL: https://github.com/apache/helix/pull/2340#discussion_r1073973004


##########
helix-core/src/test/java/org/apache/helix/controller/stages/TestIntermediateStateCalcStage.java:
##########
@@ -272,6 +273,128 @@ public void testWithClusterConfigChange() {
     }
   }
 
+  @Test
+  public void testThrottleByErrorPartition() {
+    String resourcePrefix = "resource";
+    int nResource = 3;
+    int nPartition = 3;
+    int nReplica = 3;
+
+    String[] resources = new String[nResource];
+    for (int i = 0; i < nResource; i++) {
+      resources[i] = resourcePrefix + "_" + i;
+    }
+
+    preSetup(resources, nReplica, nReplica);
+    event.addAttribute(AttributeName.RESOURCES.name(),
+        getResourceMap(resources, nPartition, "OnlineOffline"));
+    event.addAttribute(AttributeName.RESOURCES_TO_REBALANCE.name(),
+        getResourceMap(resources, nPartition, "OnlineOffline"));
+    ClusterStatusMonitor monitor = new ClusterStatusMonitor(_clusterName);
+    monitor.active();
+    event.addAttribute(AttributeName.clusterStatusMonitor.name(), monitor);
+
+    // Initialize best possible state and current state
+    BestPossibleStateOutput bestPossibleStateOutput = new BestPossibleStateOutput();
+    MessageOutput messageSelectOutput = new MessageOutput();
+    CurrentStateOutput currentStateOutput = new CurrentStateOutput();
+    IntermediateStateOutput expectedResult = new IntermediateStateOutput();
+
+    _clusterConfig.setErrorOrRecoveryPartitionThresholdForLoadBalance(0);

Review Comment:
   Throttle will be triggered if there are more than the configured TH. If there is no error partition, no throttling. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] desaikomal commented on a diff in pull request #2340: Add metrics for rebalance throttled because of error partition

Posted by GitBox <gi...@apache.org>.
desaikomal commented on code in PR #2340:
URL: https://github.com/apache/helix/pull/2340#discussion_r1072870646


##########
helix-core/src/main/java/org/apache/helix/controller/stages/IntermediateStateCalcStage.java:
##########
@@ -421,7 +419,8 @@ private PartitionStateMap computeIntermediatePartitionState(ResourceControllerDa
     if (clusterStatusMonitor != null) {
       clusterStatusMonitor
           .updateRebalancerStats(resourceName, messagesForRecovery.size(), messagesForLoad.size(),
-              messagesThrottledForRecovery.size(), messagesThrottledForLoad.size());
+              messagesThrottledForRecovery.size(), messagesThrottledForLoad.size(),
+              onlyDownwardLoadBalance);

Review Comment:
   makes sense. thanks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2340: Add metrics for rebalance throttled because of error partition

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on code in PR #2340:
URL: https://github.com/apache/helix/pull/2340#discussion_r1073974174


##########
helix-core/src/test/java/org/apache/helix/controller/stages/TestIntermediateStateCalcStage.java:
##########
@@ -272,6 +273,128 @@ public void testWithClusterConfigChange() {
     }
   }
 
+  @Test
+  public void testThrottleByErrorPartition() {
+    String resourcePrefix = "resource";
+    int nResource = 3;
+    int nPartition = 3;
+    int nReplica = 3;
+
+    String[] resources = new String[nResource];
+    for (int i = 0; i < nResource; i++) {
+      resources[i] = resourcePrefix + "_" + i;
+    }
+
+    preSetup(resources, nReplica, nReplica);
+    event.addAttribute(AttributeName.RESOURCES.name(),
+        getResourceMap(resources, nPartition, "OnlineOffline"));
+    event.addAttribute(AttributeName.RESOURCES_TO_REBALANCE.name(),
+        getResourceMap(resources, nPartition, "OnlineOffline"));
+    ClusterStatusMonitor monitor = new ClusterStatusMonitor(_clusterName);
+    monitor.active();
+    event.addAttribute(AttributeName.clusterStatusMonitor.name(), monitor);
+
+    // Initialize best possible state and current state
+    BestPossibleStateOutput bestPossibleStateOutput = new BestPossibleStateOutput();
+    MessageOutput messageSelectOutput = new MessageOutput();
+    CurrentStateOutput currentStateOutput = new CurrentStateOutput();
+    IntermediateStateOutput expectedResult = new IntermediateStateOutput();
+
+    _clusterConfig.setErrorOrRecoveryPartitionThresholdForLoadBalance(0);

Review Comment:
   Also we have a test for this config on line 231. That unit test does not test metrics though. 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] desaikomal commented on pull request #2340: Add metrics for rebalance throttled because of error partition

Posted by GitBox <gi...@apache.org>.
desaikomal commented on PR #2340:
URL: https://github.com/apache/helix/pull/2340#issuecomment-1386053026

   > ### Issues
   > * [x]  My PR addresses the following Helix issues and references them in the PR description:
   > 
   > #2341
   > 
   > ### Description
   > * [x]  Here are some details about my PR, including screenshots of any UI changes:
   > 
   > This change adds metrics for rebalance throttled because of error partition. Including aggregated metrics for the whole cluster and per resource metrics.
   > 
   > ### Tests
   > * [x]  The following tests are written for this issue:
   >   TestIntermediateStateCalcStage.testThrottleByErrorPartition
   > * The following is the result of the "mvn test" command on the appropriate module:
   > 
   > (If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)
   > 
   > ### Changes that Break Backward Compatibility (Optional)
   > * My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:
   > 
   > (Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)
   > 
   > ### Documentation (Optional)
   > * In case of new functionality, my PR adds documentation in the following wiki page:
   > 
   > (Link the GitHub wiki you added)
   > 
   > ### Commits
   > * My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
   >   
   >   1. Subject is separated from body by a blank line
   >   2. Subject is limited to 50 characters (not including Jira issue reference)
   >   3. Subject does not end with a period
   >   4. Subject uses the imperative mood ("add", not "adding")
   >   5. Body wraps at 72 characters
   >   6. Body explains "what" and "why", not "how"
   > 
   > ### Code Quality
   > * My diff has been formatted using helix-style.xml
   >   (helix-style-intellij.xml if IntelliJ IDE is used)
   
   Should you say, "Fixes #2340 " ??


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2340: Add metrics for rebalance throttled because of error partition

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on code in PR #2340:
URL: https://github.com/apache/helix/pull/2340#discussion_r1072869307


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ResourceMonitor.java:
##########
@@ -467,19 +475,13 @@ public void resetMaxTopStateHandoffGauge() {
         _numPendingLoadRebalanceReplicas,
         _numRecoveryRebalanceThrottledReplicas,
         _numLoadRebalanceThrottledReplicas,
-        _externalViewIdealStateDiff,
-        _successfulTopStateHandoffDurationCounter,
-        _successTopStateHandoffCounter,
-        _failedTopStateHandoffCounter,
-        _maxSinglePartitionTopStateHandoffDuration,
-        _partitionTopStateHandoffDurationGauge,
+        _externalViewIdealStateDiff, _successfulTopStateHandoffDurationCounter,
+        _successTopStateHandoffCounter, _failedTopStateHandoffCounter,
+        _maxSinglePartitionTopStateHandoffDuration, _partitionTopStateHandoffDurationGauge,
         _partitionTopStateHandoffHelixLatencyGauge,
-        _partitionTopStateNonGracefulHandoffDurationGauge,
-        _totalMessageReceived,
-        _totalMessageReceivedCounter,
-        _numPendingStateTransitions,
-        _rebalanceState
-    );
+        _partitionTopStateNonGracefulHandoffDurationGauge, _totalMessageReceived,
+        _totalMessageReceivedCounter, _numPendingStateTransitions, _rebalanceState,

Review Comment:
   Auto formatting did that. Let me change it back. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org