You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@helix.apache.org by GitBox <gi...@apache.org> on 2023/01/19 08:15:13 UTC

[GitHub] [helix] rahulrane50 opened a new pull request, #2344: Added new metric to report real time missing top state for partition

rahulrane50 opened a new pull request, #2344:
URL: https://github.com/apache/helix/pull/2344

   ### Issues
   
   - [X] My PR addresses the following Helix issues and references them in the PR description:
   Adds a new metric to report real time missing top state for any partition for each resource in a cluster.
   
   (#200 - Link your issue number here: You can write "Fixes #XXX". Please use the proper keyword so that the issue gets closed automatically. See https://docs.github.com/en/github/managing-your-work-on-github/linking-a-pull-request-to-an-issue
   Any of the following keywords can be used: close, closes, closed, fix, fixes, fixed, resolve, resolves, resolved)
   
   ### Description
   Issue :
   Currently, if waged rebalancer is used then there could be a situation where multiple leader replicas are residing on same node. If that node goes down for maintainence or for any other reason then there is no current metric which reports about missing top state until any rebalancer event is triggered. 
   
   Ask :
   Ask here is to add a metric which satisfies following conditions :
   1. The metric should be at resource level and should be reported irrespective of top state replica is recovered or not for any partition of that resource.
   2. This metric should be started as soon as "any" partition for a resource has top state missing and it should be "reset" only when that resource has no partition with missing top state.
   
   Solution :
   The solution proposed in this PR starts a async thread as soon as first partition of any resource has top state missing and it continuously increments a counter at resource level as long as that resource has "at least one" partition with missing top state. If there are no resources with any partitions with missing top state then this async thread will sleep.
   The counter reported by this thread is monotonically increasing and will be reset when all top state replicas for that resource are recovered. This number DOES NOT gives any insights about which partitions are missing top state or how many partitions are missing top state but non-zero value of this counter indicates that there are at least one partition for that resource has missing top state.
   
   
   - [ ] Here are some details about my PR, including screenshots of any UI changes:
   
   (Write a concise description including what, why, how)
   
   ### Tests
   
   - [X ] The following tests are written for this issue:
   
   CI pipeline and added new tests for verifying metrics.
   
   ```
   mvn test -Dtest=TestTopStateHandoffMetrics
   [WARNING] Tests run: 12, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 18.723 s - in org.apache.helix.monitoring.mbeans.TestTopStateHandoffMetrics
   [INFO]
   [INFO] Results:
   [INFO]
   [WARNING] Tests run: 12, Failures: 0, Errors: 0, Skipped: 1
   
   ```
   
   (List the names of added unit/integration tests)
   
   - The following is the result of the "mvn test" command on the appropriate module:
   
   (If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)
   
   ### Changes that Break Backward Compatibility (Optional)
   
   - My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:
   
   (Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)
   
   ### Documentation (Optional)
   
   - In case of new functionality, my PR adds documentation in the following wiki page:
   
   (Link the GitHub wiki you added)
   
   ### Commits
   
   - My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Code Quality
   
   - My diff has been formatted using helix-style.xml 
   (helix-style-intellij.xml if IntelliJ IDE is used)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "xyuanlu (via GitHub)" <gi...@apache.org>.
xyuanlu commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1094940522


##########
helix-core/src/main/java/org/apache/helix/controller/stages/TopStateHandoffReportStage.java:
##########
@@ -85,6 +85,9 @@ private void updateTopStateStatus(ResourceControllerDataProvider cache,
     if (cache.getClusterConfig() != null) {
       durationThreshold = cache.getClusterConfig().getMissTopStateDurationThreshold();
     }
+    if (clusterStatusMonitor != null) {
+      clusterStatusMonitor.updateMissingTopStateDurationThreshold(durationThreshold);

Review Comment:
   I am not sure if we can assume that.  Normally we have default value and user can set -1 if they want to disable this. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] desaikomal commented on pull request #2344: Added new metric to report real time missing top state for partition

Posted by GitBox <gi...@apache.org>.
desaikomal commented on PR #2344:
URL: https://github.com/apache/helix/pull/2344#issuecomment-1397114199

   Just a small nit: since we have an ask, you can create an Issue with all the details and then in Issues section, can say, 'Fixes #<issue>', 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "xyuanlu (via GitHub)" <gi...@apache.org>.
xyuanlu commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084598233


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,54 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final Map<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(Map<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                resourcePartitionIt.remove();
+              } else {
+                for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
+                  if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {
+                    resourceMonitor.updateOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                  }
+                }
+              }
+            }
+            // TODO: Check if this SLEEP_TIME is correct? Thread should keep on increasing the counter continuously until top
+            //  state is recovered but it can sleep for reasonable amount of time in between.
+            sleep(100);

Review Comment:
   +1 on conditional variable. Then no sleep is needed here. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "xyuanlu (via GitHub)" <gi...@apache.org>.
xyuanlu commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084637020


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -109,9 +157,21 @@ public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
 
   private final Map<String, JobMonitor> _perTypeJobMonitorMap = new ConcurrentHashMap<>();
 
+  /**
+   * Missing top state resource map: resourceName-><PartitionName->startTimeOfMissingTopState>
+   */
+  private final Map<String, Map<String, Long>> _missingTopStateResourceMap = new ConcurrentHashMap<>();
+  private final AsyncMissingTopStateMonitor _asyncMissingTopStateMonitor = new AsyncMissingTopStateMonitor(_missingTopStateResourceMap);
+
   public ClusterStatusMonitor(String clusterName) {
     _clusterName = clusterName;
     _beanServer = ManagementFactory.getPlatformMBeanServer();
+    /**
+     * Start a async thread for each cluster which keeps monitoring missing top states of any resource for a cluster.
+     * This thread will keep on iterating over resources and report missingTopStateDuration for them until there are at
+     * least one resource with missing top state for a cluster.
+     */
+    _asyncMissingTopStateMonitor.start();

Review Comment:
   Sorry I did not follow. When ClusterStatusMonitor terminates, we need to join this thread dont we? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] desaikomal commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "desaikomal (via GitHub)" <gi...@apache.org>.
desaikomal commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084445310


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,54 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final Map<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(Map<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                resourcePartitionIt.remove();
+              } else {
+                for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
+                  if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {
+                    resourceMonitor.updateOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                  }
+                }
+              }
+            }
+            // TODO: Check if this SLEEP_TIME is correct? Thread should keep on increasing the counter continuously until top
+            //  state is recovered but it can sleep for reasonable amount of time in between.
+            sleep(100);

Review Comment:
   Rahul, i am also not following why you need sleep. 
   I would keep the 'startTime' and always report 'currentTime - startTime as part of the reporting. so everytime this method gets called by our underlying system, you do that. why explicitly sleep?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on pull request #2344: Added new metric to report real time missing top state for partition

Posted by "xyuanlu (via GitHub)" <gi...@apache.org>.
xyuanlu commented on PR #2344:
URL: https://github.com/apache/helix/pull/2344#issuecomment-1401113057

   A general question here. To me it seems the new metrics is using some non 0 random number indicating at least one partition had no top state. (and 0 means all partition is good). I feel like we could have a more logistic meaningful number. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 closed pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 closed pull request #2344: Added new metric to report real time missing top state for partition
URL: https://github.com/apache/helix/pull/2344


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1087265251


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -109,9 +157,21 @@ public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
 
   private final Map<String, JobMonitor> _perTypeJobMonitorMap = new ConcurrentHashMap<>();
 
+  /**
+   * Missing top state resource map: resourceName-><PartitionName->startTimeOfMissingTopState>
+   */
+  private final Map<String, Map<String, Long>> _missingTopStateResourceMap = new ConcurrentHashMap<>();
+  private final AsyncMissingTopStateMonitor _asyncMissingTopStateMonitor = new AsyncMissingTopStateMonitor(_missingTopStateResourceMap);
+
   public ClusterStatusMonitor(String clusterName) {
     _clusterName = clusterName;
     _beanServer = ManagementFactory.getPlatformMBeanServer();
+    /**
+     * Start a async thread for each cluster which keeps monitoring missing top states of any resource for a cluster.
+     * This thread will keep on iterating over resources and report missingTopStateDuration for them until there are at
+     * least one resource with missing top state for a cluster.
+     */
+    _asyncMissingTopStateMonitor.start();

Review Comment:
   I checked the flow and async thread should be tied with ClusterStatusMonitor thread. That means currently, all beans are registered in active() method and unregistered or reset in reset() method. So whenever Helix controller will activate cluster status monitor it should start this async thread and whenever helix controller stops/resets cluster status monitor it should stop this async thread. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084613316


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,54 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final Map<String, Map<String, Long>> _missingTopStateResourceMap;

Review Comment:
   TFTR @qqu0127! Fixed it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] desaikomal commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "desaikomal (via GitHub)" <gi...@apache.org>.
desaikomal commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084464176


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ResourceMonitor.java:
##########
@@ -335,12 +340,21 @@ private void resetResourceStateGauges() {
     _numOfPartitionsInExternalView.updateValue(0L);
     _numLessMinActiveReplicaPartitions.updateValue(0L);
     _numLessReplicaPartitions.updateValue(0L);
+    _oneOrManyPartitionsMissingTopStateRealTimeGuage.updateValue(0L);

Review Comment:
   just a suggestion: missingTopStateDuration



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "xyuanlu (via GitHub)" <gi...@apache.org>.
xyuanlu commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1087266504


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -109,9 +157,21 @@ public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
 
   private final Map<String, JobMonitor> _perTypeJobMonitorMap = new ConcurrentHashMap<>();
 
+  /**
+   * Missing top state resource map: resourceName-><PartitionName->startTimeOfMissingTopState>
+   */
+  private final Map<String, Map<String, Long>> _missingTopStateResourceMap = new ConcurrentHashMap<>();
+  private final AsyncMissingTopStateMonitor _asyncMissingTopStateMonitor = new AsyncMissingTopStateMonitor(_missingTopStateResourceMap);
+
   public ClusterStatusMonitor(String clusterName) {
     _clusterName = clusterName;
     _beanServer = ManagementFactory.getPlatformMBeanServer();
+    /**
+     * Start a async thread for each cluster which keeps monitoring missing top states of any resource for a cluster.
+     * This thread will keep on iterating over resources and report missingTopStateDuration for them until there are at
+     * least one resource with missing top state for a cluster.
+     */
+    _asyncMissingTopStateMonitor.start();

Review Comment:
   Does this mean after `reset()` we should kill this metrics reporting thread? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] junkaixue commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "junkaixue (via GitHub)" <gi...@apache.org>.
junkaixue commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1090061039


##########
helix-core/src/main/java/org/apache/helix/controller/stages/TopStateHandoffReportStage.java:
##########
@@ -319,12 +322,16 @@ private void reportTopStateHandoffFailIfNecessary(ResourceControllerDataProvider
     String partitionName = partition.getPartitionName();
     MissingTopStateRecord record = missingTopStateMap.get(resourceName).get(partitionName);
     long startTime = record.getStartTimeStamp();
-    if (startTime > 0 && System.currentTimeMillis() - startTime > durationThreshold && !record
-        .isFailed()) {
-      record.setFailed();
-      missingTopStateMap.get(resourceName).put(partitionName, record);
+    if (startTime > 0 && !record.isFailed()) {
       if (clusterStatusMonitor != null) {
-        clusterStatusMonitor.updateMissingTopStateDurationStats(resourceName, 0L, 0L, false, false);
+        clusterStatusMonitor.updateMissingTopStateResourceMap(resourceName, partitionName, true, startTime);
+      }
+      if (System.currentTimeMillis() - startTime > durationThreshold) {
+        record.setFailed();
+        missingTopStateMap.get(resourceName).put(partitionName, record);
+        if (clusterStatusMonitor != null) {
+          clusterStatusMonitor.updateMissingTopStateDurationStats(resourceName, 0L, 0L, false, false);

Review Comment:
   Maybe we can deprecate this piece of logic in the future. Because anyway we already report the top state missing there is a one.
   
   I am not sure whether would be useful to keep the duration one for end-to-end.



##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,61 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final ConcurrentHashMap<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(ConcurrentHashMap<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetMissingTopStateDurationGuage();
+                resourcePartitionIt.remove();
+              } else {
+                for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
+                  if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {
+                    resourceMonitor.updateMissingTopStateDurationGuage(System.currentTimeMillis() - missingTopStateStartTime);
+                  }
+                }
+
+              }
+            }
+            sleep(50); // Instead of providing stream of durtion values to histogram thread can sleep in between to save some CPU cycles.

Review Comment:
   I am still concerning this number as hard coding here... It would not be a good thing. Could you explain why adding this number helps?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1088535288


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,60 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final ConcurrentHashMap<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(ConcurrentHashMap<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetMissingTopStateDurationGuage();
+                resourcePartitionIt.remove();
+              } else {
+                for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
+                  if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {
+                    resourceMonitor.updateMissingTopStateDurationGuage(System.currentTimeMillis() - missingTopStateStartTime);
+                  }
+                }
+
+              }
+            }
+          }
+        }
+      } catch (InterruptedException e) {
+        LOG.error("AsyncMissingTopStateMonitor has been interrupted.", e);
+      }
+    }
+
+    public void reset() {
+      for (String resource : _missingTopStateResourceMap.keySet()) {
+        ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resource);

Review Comment:
   From comments above method getResourceMonitor() it says it's only for tests. I think it's because it's not synchronized like getOrCreateResourceMonitor().  Anyways it doesn't create resource monitor everytime.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084576638


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,54 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final Map<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(Map<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                resourcePartitionIt.remove();
+              } else {
+                for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
+                  if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {

Review Comment:
   That's a good suggestion, but even if we parallelize this loop but still all those threads would be updating same guage value so it would be sequential only, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084629930


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -109,9 +157,21 @@ public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
 
   private final Map<String, JobMonitor> _perTypeJobMonitorMap = new ConcurrentHashMap<>();
 
+  /**
+   * Missing top state resource map: resourceName-><PartitionName->startTimeOfMissingTopState>
+   */
+  private final Map<String, Map<String, Long>> _missingTopStateResourceMap = new ConcurrentHashMap<>();
+  private final AsyncMissingTopStateMonitor _asyncMissingTopStateMonitor = new AsyncMissingTopStateMonitor(_missingTopStateResourceMap);
+
   public ClusterStatusMonitor(String clusterName) {
     _clusterName = clusterName;
     _beanServer = ManagementFactory.getPlatformMBeanServer();
+    /**
+     * Start a async thread for each cluster which keeps monitoring missing top states of any resource for a cluster.
+     * This thread will keep on iterating over resources and report missingTopStateDuration for them until there are at
+     * least one resource with missing top state for a cluster.
+     */
+    _asyncMissingTopStateMonitor.start();

Review Comment:
   Good question! I think it has to be stopped/interrupted for sure by ClusterStatusMonitor when it's cleaning up all state. But main process (ClusterStatusMonitor) do not have to wait until it's finished because it's async long running thread and metrics reporting is never ending job so it's lifecycle can be tied with main process. Let me know if that makes sense.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] qqu0127 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "qqu0127 (via GitHub)" <gi...@apache.org>.
qqu0127 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084678558


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,54 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final Map<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(Map<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                resourcePartitionIt.remove();
+              } else {

Review Comment:
   One potential way in my mind is to separate the metrics update logic and entry remove. We can call `map.entrySet().removeIf()` after the forloop. But essentially it's iterating again. 
   I don't have strong opinion on this, so you feel free to make the call.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "xyuanlu (via GitHub)" <gi...@apache.org>.
xyuanlu commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1089268079


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,60 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final ConcurrentHashMap<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(ConcurrentHashMap<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetMissingTopStateDurationGuage();
+                resourcePartitionIt.remove();
+              } else {
+                for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
+                  if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {
+                    resourceMonitor.updateMissingTopStateDurationGuage(System.currentTimeMillis() - missingTopStateStartTime);
+                  }
+                }
+
+              }
+            }
+          }
+        }
+      } catch (InterruptedException e) {
+        LOG.error("AsyncMissingTopStateMonitor has been interrupted.", e);
+      }
+    }
+
+    public void reset() {
+      for (String resource : _missingTopStateResourceMap.keySet()) {

Review Comment:
   May I ask why we need a new reset() method here? It is only being called in ClusterStatusMonitor reset() and that reset() is basically a close() function for ClusterStatusMonitor?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084612730


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -583,6 +643,33 @@ public void updateMissingTopStateDurationStats(String resourceName, long totalDu
     }
   }
 
+  public void updateMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+    _asyncMissingTopStateMonitor.setMissingTopStateDurationThreshold(missingTopStateDurationThreshold);
+  }
+
+  public void updateMissingTopStateResourceMap(String resourceName,String partitionName, boolean isTopStateMissing, long startTime) {
+    // Top state started missing
+    if (isTopStateMissing) {
+      // Wake up asyncMissingTopStateMonitor thread on first resource being added to map
+      if (_missingTopStateResourceMap.isEmpty()) {
+        synchronized (_asyncMissingTopStateMonitor) {
+          _asyncMissingTopStateMonitor.notify();
+        }
+      }
+      if (!_missingTopStateResourceMap.containsKey(resourceName)) {
+        _missingTopStateResourceMap.put(resourceName, new HashMap<String, Long>());
+      }
+      _missingTopStateResourceMap.get(resourceName).put(partitionName, startTime);
+    } else { // top state recovered
+      // remove partitions from resourceMap whose top state has been recovered, this will put
+      // asyncMissingTopStateMonitor thread to sleep when no resources left to monitor.
+      Map<String, Long> entry = _missingTopStateResourceMap.get(resourceName);
+      if (entry != null) {
+        entry.remove(partitionName);
+      }

Review Comment:
   let me know if above comment makes sense



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084629930


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -109,9 +157,21 @@ public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
 
   private final Map<String, JobMonitor> _perTypeJobMonitorMap = new ConcurrentHashMap<>();
 
+  /**
+   * Missing top state resource map: resourceName-><PartitionName->startTimeOfMissingTopState>
+   */
+  private final Map<String, Map<String, Long>> _missingTopStateResourceMap = new ConcurrentHashMap<>();
+  private final AsyncMissingTopStateMonitor _asyncMissingTopStateMonitor = new AsyncMissingTopStateMonitor(_missingTopStateResourceMap);
+
   public ClusterStatusMonitor(String clusterName) {
     _clusterName = clusterName;
     _beanServer = ManagementFactory.getPlatformMBeanServer();
+    /**
+     * Start a async thread for each cluster which keeps monitoring missing top states of any resource for a cluster.
+     * This thread will keep on iterating over resources and report missingTopStateDuration for them until there are at
+     * least one resource with missing top state for a cluster.
+     */
+    _asyncMissingTopStateMonitor.start();

Review Comment:
   Good question! I think it has to be stopped/interrupted for sure by ClusterStatusMonitor when it's cleaning up all state but main process (ClusterStatusMonitor) do not have to wait until it's finished because it's async long running thread and metrics reporting is never ending job so it's lifecycle can be tied with main process. Let me know if that makes sense.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on pull request #2344: Added new metric to report real time missing top state for partition

Posted by "xyuanlu (via GitHub)" <gi...@apache.org>.
xyuanlu commented on PR #2344:
URL: https://github.com/apache/helix/pull/2344#issuecomment-1401135982

   --> The reason of decoupling metric reporting with existing main process of clusterstatusmonitor is to report metric irrespective of any event is being handled or not.
   
   Please correct me if I am wrong. In your implementation, I think update happens at "TopStateHandoffReportStage" witch is trigged when event (update on current stage etc.) We didn't quite decouple these two...? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] junkaixue commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "junkaixue (via GitHub)" <gi...@apache.org>.
junkaixue commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1095335668


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,81 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {

Review Comment:
   This will be meaningful if our pipeline speed is very fast. Otherwise, it will not help as the cache updated state is refreshed per pipeline. I dont believe we need build this thread but just rely on pipeline call as we did before. 
   
   But we need to remove the constraint of doing only final reporting.
   
   We can discuss about it tomorrow f2f.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] desaikomal commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "desaikomal (via GitHub)" <gi...@apache.org>.
desaikomal commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1088396600


##########
helix-core/src/main/java/org/apache/helix/controller/stages/TopStateHandoffReportStage.java:
##########
@@ -85,6 +85,9 @@ private void updateTopStateStatus(ResourceControllerDataProvider cache,
     if (cache.getClusterConfig() != null) {
       durationThreshold = cache.getClusterConfig().getMissTopStateDurationThreshold();
     }
+    if (clusterStatusMonitor != null) {
+      clusterStatusMonitor.updateMissingTopStateDurationThreshold(durationThreshold);

Review Comment:
   if durationThreshold is Long.MAX-VALUE, our monitoring will never finish.. thread will remain spin with no-action.. 



##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,60 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final ConcurrentHashMap<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(ConcurrentHashMap<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {

Review Comment:
   i am not sure if i understand this. you have the whole object in synchronized block and the block has while (true).. constantly processing, so how can any other thread enter?



##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -583,6 +643,31 @@ public void updateMissingTopStateDurationStats(String resourceName, long totalDu
     }
   }
 
+  public void updateMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+    _asyncMissingTopStateMonitor.setMissingTopStateDurationThreshold(missingTopStateDurationThreshold);

Review Comment:
   now this is where i have question: since your _asyncMissingTopStateMonitor has synchronized and is in tight-loop, will this method go through? May be when the method is in "wait" state, it releases the lock and so this method can go through



##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ResourceMonitor.java:
##########
@@ -139,6 +142,10 @@ public ResourceMonitor(String clusterName, String resourceName, ObjectName objec
     _partitionTopStateNonGracefulHandoffDurationGauge =
         new HistogramDynamicMetric("PartitionTopStateNonGracefulHandoffGauge", new Histogram(
             new SlidingTimeWindowArrayReservoir(getResetIntervalInMs(), TimeUnit.MILLISECONDS)));
+    _missingTopStateDurationGuage =
+        new HistogramDynamicMetric("MissingTopStateDurationGauge", new Histogram(
+            new SlidingTimeWindowArrayReservoir(getResetIntervalInMs(), TimeUnit.MILLISECONDS)));

Review Comment:
   can this be your sleep duration?



##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,60 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final ConcurrentHashMap<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(ConcurrentHashMap<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetMissingTopStateDurationGuage();
+                resourcePartitionIt.remove();

Review Comment:
   when you remove the entry from the map on which you are already iterating, is it ok?
   in C++, i know that we can't have iterator and remove from iterator in same loop as "iterator" will need to reassign.
   Please check.



##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,60 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final ConcurrentHashMap<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(ConcurrentHashMap<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetMissingTopStateDurationGuage();
+                resourcePartitionIt.remove();
+              } else {
+                for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
+                  if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {
+                    resourceMonitor.updateMissingTopStateDurationGuage(System.currentTimeMillis() - missingTopStateStartTime);
+                  }
+                }
+
+              }
+            }
+          }
+        }
+      } catch (InterruptedException e) {
+        LOG.error("AsyncMissingTopStateMonitor has been interrupted.", e);
+      }
+    }
+
+    public void reset() {
+      for (String resource : _missingTopStateResourceMap.keySet()) {
+        ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resource);

Review Comment:
   can you not just use getResourceMonitor() as you don't want to create resource monitor if you are in reset path



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1085771190


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,55 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final ConcurrentHashMap<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(ConcurrentHashMap<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                resourcePartitionIt.remove();
+              } else {
+                for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
+                  if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {
+                    resourceMonitor.updateOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                  }
+                }
+
+              }
+            }
+            // TODO: Check if this SLEEP_TIME is correct? Thread should keep on increasing the counter continuously until top
+            //  state is recovered but it can sleep for reasonable amount of time in between.
+            sleep(100);
+          }
+        }
+      } catch (InterruptedException e) {
+        LOG.error("AsyncMissingTopStateMonitor has been interrupted.", e);

Review Comment:
   That's a very good point and catch. Handled this correctly now in active() and reset() flow.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084577199


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -109,9 +157,21 @@ public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
 
   private final Map<String, JobMonitor> _perTypeJobMonitorMap = new ConcurrentHashMap<>();
 
+  /**
+   * Missing top state resource map: resourceName-><PartitionName->startTimeOfMissingTopState>
+   */
+  private final Map<String, Map<String, Long>> _missingTopStateResourceMap = new ConcurrentHashMap<>();
+  private final AsyncMissingTopStateMonitor _asyncMissingTopStateMonitor = new AsyncMissingTopStateMonitor(_missingTopStateResourceMap);

Review Comment:
   That's good suggestion, thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] mgao0 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "mgao0 (via GitHub)" <gi...@apache.org>.
mgao0 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084406721


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -109,9 +157,21 @@ public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
 
   private final Map<String, JobMonitor> _perTypeJobMonitorMap = new ConcurrentHashMap<>();
 
+  /**
+   * Missing top state resource map: resourceName-><PartitionName->startTimeOfMissingTopState>
+   */
+  private final Map<String, Map<String, Long>> _missingTopStateResourceMap = new ConcurrentHashMap<>();

Review Comment:
   Can this line and next line can be part of ClusterStatusMonitor constructor? Just define the type of variable here, but do the instantiation in the constructor.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on PR #2344:
URL: https://github.com/apache/helix/pull/2344#issuecomment-1440700375

   After internal syncup, it makes sense to not have this real-time metric. Helix will emit a metric as a part of current default pipeline (which is triggered on any helix event), to report number of partitions with missing top state beyond set threshold value. I will close this PR and create a new one to address this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1088534315


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,60 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final ConcurrentHashMap<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(ConcurrentHashMap<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetMissingTopStateDurationGuage();
+                resourcePartitionIt.remove();

Review Comment:
   This is the only way i could find for removing entries while iterative over map in java. It works fine



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1089339988


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ResourceMonitor.java:
##########
@@ -139,6 +142,10 @@ public ResourceMonitor(String clusterName, String resourceName, ObjectName objec
     _partitionTopStateNonGracefulHandoffDurationGauge =
         new HistogramDynamicMetric("PartitionTopStateNonGracefulHandoffGauge", new Histogram(
             new SlidingTimeWindowArrayReservoir(getResetIntervalInMs(), TimeUnit.MILLISECONDS)));
+    _missingTopStateDurationGuage =
+        new HistogramDynamicMetric("MissingTopStateDurationGauge", new Histogram(
+            new SlidingTimeWindowArrayReservoir(getResetIntervalInMs(), TimeUnit.MILLISECONDS)));

Review Comment:
   That's an excellent suggestion. After digging deeper, so the way Histrogram metrics works is based on reservoir sampling meaning even if stream of samples are provided to this guage but it will pick only k random samples (or based on sliding window provided here) and discard other samples provided. But still i just add sleep for very less duration to save some CPU cycles. So to summarize, as long as metric reporting is concerned, no matter how many samples are provided to this guage only handlful of samples will be emitted by helix controller. 
   
   Code pointer for reference : https://www.tabnine.com/code/java/classes/com.codahale.metrics.SlidingTimeWindowArrayReservoir



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084406786


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,54 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final Map<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(Map<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                resourcePartitionIt.remove();
+              } else {
+                for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
+                  if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {
+                    resourceMonitor.updateOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                  }
+                }
+              }
+            }
+            // TODO: Check if this SLEEP_TIME is correct? Thread should keep on increasing the counter continuously until top
+            //  state is recovered but it can sleep for reasonable amount of time in between.
+            sleep(100);

Review Comment:
   Just replied on last comment. So this sleep() is not related to main conditional sleeping. This sleep is added because this for loop will continuously increment counters on every iteration which can overflow counter if topstate missing situation continues for long time and also it's not really needed to increment counter for every single ms/ns. As the TODO says async thread should continuously increment counter but with some reasonable(may be avg topstate missing duration/4 ms or something like this) sleep in between.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "xyuanlu (via GitHub)" <gi...@apache.org>.
xyuanlu commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1095220118


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,60 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final ConcurrentHashMap<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(ConcurrentHashMap<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {

Review Comment:
   You just need to lock the part where we check if condition is met. Not the whole while(true) block.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1089315825


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,60 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final ConcurrentHashMap<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(ConcurrentHashMap<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {

Review Comment:
   So other methods are not synchronized hence other thread (ClusterStatusMonitor) thread can still call these methods without need of acquiring thread objects's lock. (https://users.cs.fiu.edu/~weiss/cop4338_spr06/lectures/Threads/threads.pdf See slide 19). IMHO it's okay because setMissingTopStateDurationThreshold is not updating any shared resource. And reset method is like reseting the resource map which is again okay.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084398651


##########
helix-core/src/main/java/org/apache/helix/controller/stages/TopStateHandoffReportStage.java:
##########
@@ -319,12 +322,17 @@ private void reportTopStateHandoffFailIfNecessary(ResourceControllerDataProvider
     String partitionName = partition.getPartitionName();
     MissingTopStateRecord record = missingTopStateMap.get(resourceName).get(partitionName);
     long startTime = record.getStartTimeStamp();
-    if (startTime > 0 && System.currentTimeMillis() - startTime > durationThreshold && !record
+    if (startTime > 0 && !record
         .isFailed()) {

Review Comment:
   Sure let me fix indentation. But this is not refactoring and needed for this fix. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on PR #2344:
URL: https://github.com/apache/helix/pull/2344#issuecomment-1401049891

   TFTR @mgao0 please find answers inline!
   > Thanks for the PR. I have several general question:
   > 
   > 1. Is there a reason why we need a long-running thread for AsyncMissingTopStateMonitor?  Can we couple the update of ResourceMonitor with the update of updateMissingTopStateResourceMap in ClusterStatusMonitor, instead of checking the map periodically?
   --> The reason of decoupling metric reporting with existing main process of clusterstatusmonitor is to report metric irrespective of any event is being handled or not. Hence we couple of this with clusterstatusmonitor then thread will report metric only when resourcemap is updated ie., any event happens. About long-running thread, it sleeps whenever all resources have all partitions with top state recovered. To save some resources i had added a sleep in thread.
   > 2. Can we make this metric as optional, one that can be turned on and turned off by config?
   --> That's a good point! To be honest i'm not sure. In my mind this is very small features hence guarding it behind flag may not be that useful. In terms resource usage or performance it should not that much since it's single thread which should be running only when any resources have any partitions with missing top state. I'm open for suggestions though :)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "xyuanlu (via GitHub)" <gi...@apache.org>.
xyuanlu commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084582762


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -109,9 +157,21 @@ public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
 
   private final Map<String, JobMonitor> _perTypeJobMonitorMap = new ConcurrentHashMap<>();
 
+  /**
+   * Missing top state resource map: resourceName-><PartitionName->startTimeOfMissingTopState>
+   */
+  private final Map<String, Map<String, Long>> _missingTopStateResourceMap = new ConcurrentHashMap<>();
+  private final AsyncMissingTopStateMonitor _asyncMissingTopStateMonitor = new AsyncMissingTopStateMonitor(_missingTopStateResourceMap);
+
   public ClusterStatusMonitor(String clusterName) {
     _clusterName = clusterName;
     _beanServer = ManagementFactory.getPlatformMBeanServer();
+    /**
+     * Start a async thread for each cluster which keeps monitoring missing top states of any resource for a cluster.
+     * This thread will keep on iterating over resources and report missingTopStateDuration for them until there are at
+     * least one resource with missing top state for a cluster.
+     */
+    _asyncMissingTopStateMonitor.start();

Review Comment:
   Do we need to join this thread eventually? 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] qqu0127 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "qqu0127 (via GitHub)" <gi...@apache.org>.
qqu0127 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084318326


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,54 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final Map<String, Map<String, Long>> _missingTopStateResourceMap;

Review Comment:
   Does the map have to be a concurrent map? Any risk of concurrent modification?



##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,54 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final Map<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(Map<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                resourcePartitionIt.remove();
+              } else {

Review Comment:
   For me I'm not a fan of this iterate and modify pattern. It's not wrong, but any chance we can make it a pure function?



##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -583,6 +643,33 @@ public void updateMissingTopStateDurationStats(String resourceName, long totalDu
     }
   }
 
+  public void updateMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+    _asyncMissingTopStateMonitor.setMissingTopStateDurationThreshold(missingTopStateDurationThreshold);
+  }
+
+  public void updateMissingTopStateResourceMap(String resourceName,String partitionName, boolean isTopStateMissing, long startTime) {
+    // Top state started missing
+    if (isTopStateMissing) {
+      // Wake up asyncMissingTopStateMonitor thread on first resource being added to map
+      if (_missingTopStateResourceMap.isEmpty()) {
+        synchronized (_asyncMissingTopStateMonitor) {
+          _asyncMissingTopStateMonitor.notify();
+        }
+      }
+      if (!_missingTopStateResourceMap.containsKey(resourceName)) {
+        _missingTopStateResourceMap.put(resourceName, new HashMap<String, Long>());
+      }

Review Comment:
   This can be reduced to computeIfAbsent ?



##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -109,9 +157,21 @@ public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
 
   private final Map<String, JobMonitor> _perTypeJobMonitorMap = new ConcurrentHashMap<>();
 
+  /**
+   * Missing top state resource map: resourceName-><PartitionName->startTimeOfMissingTopState>
+   */
+  private final Map<String, Map<String, Long>> _missingTopStateResourceMap = new ConcurrentHashMap<>();
+  private final AsyncMissingTopStateMonitor _asyncMissingTopStateMonitor = new AsyncMissingTopStateMonitor(_missingTopStateResourceMap);

Review Comment:
   Now I see it's indeed a ConcurrentHashMap. 
   One nit, let's make this type explicit in the signature. 



##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -583,6 +643,33 @@ public void updateMissingTopStateDurationStats(String resourceName, long totalDu
     }
   }
 
+  public void updateMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+    _asyncMissingTopStateMonitor.setMissingTopStateDurationThreshold(missingTopStateDurationThreshold);
+  }
+
+  public void updateMissingTopStateResourceMap(String resourceName,String partitionName, boolean isTopStateMissing, long startTime) {
+    // Top state started missing
+    if (isTopStateMissing) {
+      // Wake up asyncMissingTopStateMonitor thread on first resource being added to map
+      if (_missingTopStateResourceMap.isEmpty()) {
+        synchronized (_asyncMissingTopStateMonitor) {
+          _asyncMissingTopStateMonitor.notify();
+        }
+      }
+      if (!_missingTopStateResourceMap.containsKey(resourceName)) {
+        _missingTopStateResourceMap.put(resourceName, new HashMap<String, Long>());
+      }
+      _missingTopStateResourceMap.get(resourceName).put(partitionName, startTime);
+    } else { // top state recovered
+      // remove partitions from resourceMap whose top state has been recovered, this will put
+      // asyncMissingTopStateMonitor thread to sleep when no resources left to monitor.
+      Map<String, Long> entry = _missingTopStateResourceMap.get(resourceName);
+      if (entry != null) {
+        entry.remove(partitionName);
+      }

Review Comment:
   Related to the above comment on iterator. Can we do all the GC-like map cleanup here? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "xyuanlu (via GitHub)" <gi...@apache.org>.
xyuanlu commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084648553


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,54 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final Map<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(Map<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                resourcePartitionIt.remove();
+              } else {
+                for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
+                  if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {
+                    resourceMonitor.updateOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                  }
+                }
+              }
+            }
+            // TODO: Check if this SLEEP_TIME is correct? Thread should keep on increasing the counter continuously until top
+            //  state is recovered but it can sleep for reasonable amount of time in between.
+            sleep(100);

Review Comment:
   Having a sleep() with some random number is not as preferable as using conditional variable...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1085778366


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -652,6 +738,8 @@ public void reset() {
       _rebalanceFailureCount.set(0L);
       _continuousResourceRebalanceFailureCount.set(0L);
       _continuousTaskRebalanceFailureCount.set(0L);
+      _asyncMissingTopStateMonitor.interrupt();
+      _missingTopStateResourceMap.clear();

Review Comment:
   yup handled it now. Thanks for catching this. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "xyuanlu (via GitHub)" <gi...@apache.org>.
xyuanlu commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084643347


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -652,6 +738,8 @@ public void reset() {
       _rebalanceFailureCount.set(0L);
       _continuousResourceRebalanceFailureCount.set(0L);
       _continuousTaskRebalanceFailureCount.set(0L);
+      _asyncMissingTopStateMonitor.interrupt();
+      _missingTopStateResourceMap.clear();

Review Comment:
   do we want to reset the metrics to 0 as well? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084590386


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,54 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final Map<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(Map<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                resourcePartitionIt.remove();
+              } else {

Review Comment:
   Sure umm only other option i can think of is to curate the list of keys which needs to be deleted and then at the end of for loop iterate over them and delete it. But i'm not sure if that would be any better in performance or code readability wise. Any suggestions?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "xyuanlu (via GitHub)" <gi...@apache.org>.
xyuanlu commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084647869


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,55 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final ConcurrentHashMap<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(ConcurrentHashMap<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                resourcePartitionIt.remove();
+              } else {
+                for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
+                  if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {
+                    resourceMonitor.updateOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                  }
+                }
+
+              }
+            }
+            // TODO: Check if this SLEEP_TIME is correct? Thread should keep on increasing the counter continuously until top
+            //  state is recovered but it can sleep for reasonable amount of time in between.
+            sleep(100);
+          }
+        }
+      } catch (InterruptedException e) {
+        LOG.error("AsyncMissingTopStateMonitor has been interrupted.", e);

Review Comment:
   Question about the interrupt handling here and in reset(). 
   When the thread receives an interrupt in reset() and we log error and returns..? Then no monitor thread is created and loop to continue monitoring...? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] junkaixue commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "junkaixue (via GitHub)" <gi...@apache.org>.
junkaixue commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1090060258


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,54 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final Map<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(Map<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                resourcePartitionIt.remove();
+              } else {
+                for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
+                  if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {

Review Comment:
   Yes. But it will save our sequential computational time as this update is synchronized in Helix regular pipelines.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1091224695


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,61 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final ConcurrentHashMap<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(ConcurrentHashMap<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetMissingTopStateDurationGuage();
+                resourcePartitionIt.remove();
+              } else {
+                for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
+                  if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {
+                    resourceMonitor.updateMissingTopStateDurationGuage(System.currentTimeMillis() - missingTopStateStartTime);
+                  }
+                }
+
+              }
+            }
+            sleep(50); // Instead of providing stream of durtion values to histogram thread can sleep in between to save some CPU cycles.

Review Comment:
   Had a offline discussion with Junkai. After giving it a thought i came with below solution please let me know if this looks okay. Summary :
   1. Original sleep here was to save few computational cycles in this tight for loop but it's difficult to justify correct sleep duration. 
   2. In new solution, thread will report a metric per partition only if it's not reported within last sliding window reset interval. Now this has few benefits as : one if sleeping stops thread to report metrics for all resources and all partitions. But that's may be wrong because what if during that sleep time resource has new partitions with missing top state. Hence ideally if thread has reported at least one duration for that partition then it can skip that partition until it's sliding window has started new window. 
   @desaikomal hence i didn't add sleep here for sliding window reset time but used that value to determine of duration should be reported for that partition or not. 



##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,61 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final ConcurrentHashMap<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(ConcurrentHashMap<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetMissingTopStateDurationGuage();
+                resourcePartitionIt.remove();
+              } else {
+                for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
+                  if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {
+                    resourceMonitor.updateMissingTopStateDurationGuage(System.currentTimeMillis() - missingTopStateStartTime);
+                  }
+                }
+
+              }
+            }
+            sleep(50); // Instead of providing stream of durtion values to histogram thread can sleep in between to save some CPU cycles.

Review Comment:
   Had a offline discussion with @junkaixue . After giving it a thought i came with below solution please let me know if this looks okay. Summary :
   1. Original sleep here was to save few computational cycles in this tight for loop but it's difficult to justify correct sleep duration. 
   2. In new solution, thread will report a metric per partition only if it's not reported within last sliding window reset interval. Now this has few benefits as : one if sleeping stops thread to report metrics for all resources and all partitions. But that's may be wrong because what if during that sleep time resource has new partitions with missing top state. Hence ideally if thread has reported at least one duration for that partition then it can skip that partition until it's sliding window has started new window. 
   @desaikomal hence i didn't add sleep here for sliding window reset time but used that value to determine of duration should be reported for that partition or not. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1095189058


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,60 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final ConcurrentHashMap<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(ConcurrentHashMap<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {

Review Comment:
   @xyuanlu because it's needed for wait() and notify() mechanism. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084616037


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,54 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final Map<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(Map<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                resourcePartitionIt.remove();
+              } else {
+                for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
+                  if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {
+                    resourceMonitor.updateOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                  }
+                }
+              }
+            }
+            // TODO: Check if this SLEEP_TIME is correct? Thread should keep on increasing the counter continuously until top
+            //  state is recovered but it can sleep for reasonable amount of time in between.
+            sleep(100);

Review Comment:
   TFTR @desaikomal @xyuanlu let me know if my comment makes sense!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on PR #2344:
URL: https://github.com/apache/helix/pull/2344#issuecomment-1402449100

   @xyuanlu and @desaikomal I have updated PR with duration now instead up just randomly incrementing counter. I hope it makes sense now. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084598339


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -109,9 +157,21 @@ public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
 
   private final Map<String, JobMonitor> _perTypeJobMonitorMap = new ConcurrentHashMap<>();
 
+  /**
+   * Missing top state resource map: resourceName-><PartitionName->startTimeOfMissingTopState>
+   */
+  private final Map<String, Map<String, Long>> _missingTopStateResourceMap = new ConcurrentHashMap<>();

Review Comment:
   That's a good point! Reason of defining (allocating memory) in main process is that main process (ClusterStatusMonitor) will be the one who is updating this map and async thread will be just reading from it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1088529228


##########
helix-core/src/main/java/org/apache/helix/controller/stages/TopStateHandoffReportStage.java:
##########
@@ -85,6 +85,9 @@ private void updateTopStateStatus(ResourceControllerDataProvider cache,
     if (cache.getClusterConfig() != null) {
       durationThreshold = cache.getClusterConfig().getMissTopStateDurationThreshold();
     }
+    if (clusterStatusMonitor != null) {
+      clusterStatusMonitor.updateMissingTopStateDurationThreshold(durationThreshold);

Review Comment:
   That's correct if threshold is not set then it's no-op thread. But in this case most of the metrics are not reported so i'm assuming that this config is used commonly.



##########
helix-core/src/main/java/org/apache/helix/controller/stages/TopStateHandoffReportStage.java:
##########
@@ -85,6 +85,9 @@ private void updateTopStateStatus(ResourceControllerDataProvider cache,
     if (cache.getClusterConfig() != null) {
       durationThreshold = cache.getClusterConfig().getMissTopStateDurationThreshold();
     }
+    if (clusterStatusMonitor != null) {
+      clusterStatusMonitor.updateMissingTopStateDurationThreshold(durationThreshold);

Review Comment:
   That's correct if threshold is not set then it's no-op thread. But in this case most of the metrics related to top state are not reported so i'm assuming that this config is used commonly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1089317936


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -583,6 +643,31 @@ public void updateMissingTopStateDurationStats(String resourceName, long totalDu
     }
   }
 
+  public void updateMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+    _asyncMissingTopStateMonitor.setMissingTopStateDurationThreshold(missingTopStateDurationThreshold);

Review Comment:
   Answered in above comments. It's just not about in wait() state but whenever it's timesliced and this thread get it's timeslice it can call any non-synchronized methods of _asyncMissingTopStateMonitor.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] junkaixue commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "junkaixue (via GitHub)" <gi...@apache.org>.
junkaixue commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1083067900


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,54 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final Map<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(Map<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();

Review Comment:
   this wait here and here is the wake up process? Would you like to use conditional variable instead?



##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ResourceMonitor.java:
##########
@@ -335,12 +340,21 @@ private void resetResourceStateGauges() {
     _numOfPartitionsInExternalView.updateValue(0L);
     _numLessMinActiveReplicaPartitions.updateValue(0L);
     _numLessReplicaPartitions.updateValue(0L);
+    _oneOrManyPartitionsMissingTopStateRealTimeGuage.updateValue(0L);

Review Comment:
   This is very long name. Also it already represents how many partitions missing as aggregation. Better be: _numPartitionsMissingTopStateRealTime



##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,54 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final Map<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(Map<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                resourcePartitionIt.remove();
+              } else {
+                for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
+                  if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {
+                    resourceMonitor.updateOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                  }
+                }
+              }
+            }
+            // TODO: Check if this SLEEP_TIME is correct? Thread should keep on increasing the counter continuously until top
+            //  state is recovered but it can sleep for reasonable amount of time in between.
+            sleep(100);

Review Comment:
   As I mentioned, instead of using sleep, maybe it would be better to use conditional variable.



##########
helix-core/src/main/java/org/apache/helix/controller/stages/TopStateHandoffReportStage.java:
##########
@@ -319,12 +322,17 @@ private void reportTopStateHandoffFailIfNecessary(ResourceControllerDataProvider
     String partitionName = partition.getPartitionName();
     MissingTopStateRecord record = missingTopStateMap.get(resourceName).get(partitionName);
     long startTime = record.getStartTimeStamp();
-    if (startTime > 0 && System.currentTimeMillis() - startTime > durationThreshold && !record
+    if (startTime > 0 && !record
         .isFailed()) {

Review Comment:
   +1. If this is refactoring. Better do it in a different PR without logic change. Would be easy for reviewing.



##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,54 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final Map<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(Map<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetOneOrManyPartitionsMissingTopStateRealTimeGuage();
+                resourcePartitionIt.remove();
+              } else {
+                for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
+                  if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {

Review Comment:
   You can use Java parallel compute lamda feature.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084400890


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ResourceMonitor.java:
##########
@@ -335,12 +340,21 @@ private void resetResourceStateGauges() {
     _numOfPartitionsInExternalView.updateValue(0L);
     _numLessMinActiveReplicaPartitions.updateValue(0L);
     _numLessReplicaPartitions.updateValue(0L);
+    _oneOrManyPartitionsMissingTopStateRealTimeGuage.updateValue(0L);

Review Comment:
   TFTR @junkaixue! I agree with long name but i just wanted it to be verbose. Actually counter value doesn't indicate number of partitions missing top state (not even at aggregated level) because this number is continuously increasing once started. Any positive value of this counter just indicates that there could be at least one partition (could be more) with missing top states.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1091224695


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,61 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final ConcurrentHashMap<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(ConcurrentHashMap<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetMissingTopStateDurationGuage();
+                resourcePartitionIt.remove();
+              } else {
+                for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
+                  if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {
+                    resourceMonitor.updateMissingTopStateDurationGuage(System.currentTimeMillis() - missingTopStateStartTime);
+                  }
+                }
+
+              }
+            }
+            sleep(50); // Instead of providing stream of durtion values to histogram thread can sleep in between to save some CPU cycles.

Review Comment:
   Had a offline discussion with @junkaixue . After giving it a thought i came with below solution please let me know if this looks okay. Summary :
   1. Original sleep here was to save few computational cycles in this tight for loop but it's difficult to justify correct sleep duration. 
   2. In new solution, thread will report a metric per partition only if it's not reported within last sliding window reset interval. Now this has few benefits as : one if sleeping stops thread to report metrics for all resources and all partitions. But that's may be wrong because what if during that sleep time resource has new partitions with missing top state. Hence ideally if thread has reported duration at least once for that partition then it can skip **that** partition until it's sliding window has finished. 
   @desaikomal hence i didn't add sleep here for sliding window reset time but used that value to determine of duration should be reported for that partition or not. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1087265251


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -109,9 +157,21 @@ public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
 
   private final Map<String, JobMonitor> _perTypeJobMonitorMap = new ConcurrentHashMap<>();
 
+  /**
+   * Missing top state resource map: resourceName-><PartitionName->startTimeOfMissingTopState>
+   */
+  private final Map<String, Map<String, Long>> _missingTopStateResourceMap = new ConcurrentHashMap<>();
+  private final AsyncMissingTopStateMonitor _asyncMissingTopStateMonitor = new AsyncMissingTopStateMonitor(_missingTopStateResourceMap);
+
   public ClusterStatusMonitor(String clusterName) {
     _clusterName = clusterName;
     _beanServer = ManagementFactory.getPlatformMBeanServer();
+    /**
+     * Start a async thread for each cluster which keeps monitoring missing top states of any resource for a cluster.
+     * This thread will keep on iterating over resources and report missingTopStateDuration for them until there are at
+     * least one resource with missing top state for a cluster.
+     */
+    _asyncMissingTopStateMonitor.start();

Review Comment:
   I checked the flow and async thread should be tied with ClusterStatusMonitor thread. That means currently, all beans are registered in active() method and unregistered or reset in reset() method. So whenever Helix controller will activate cluster status monitor it should start this async thread and whenever helix controller stops/resets cluster status monitor it should stop this async thread.  So just to re-iterate whenever cluster status monitor is reset() it has to be activated first by caller which will make sure that this async thread will be started.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1087308581


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -109,9 +157,21 @@ public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
 
   private final Map<String, JobMonitor> _perTypeJobMonitorMap = new ConcurrentHashMap<>();
 
+  /**
+   * Missing top state resource map: resourceName-><PartitionName->startTimeOfMissingTopState>
+   */
+  private final Map<String, Map<String, Long>> _missingTopStateResourceMap = new ConcurrentHashMap<>();
+  private final AsyncMissingTopStateMonitor _asyncMissingTopStateMonitor = new AsyncMissingTopStateMonitor(_missingTopStateResourceMap);
+
   public ClusterStatusMonitor(String clusterName) {
     _clusterName = clusterName;
     _beanServer = ManagementFactory.getPlatformMBeanServer();
+    /**
+     * Start a async thread for each cluster which keeps monitoring missing top states of any resource for a cluster.
+     * This thread will keep on iterating over resources and report missingTopStateDuration for them until there are at
+     * least one resource with missing top state for a cluster.
+     */
+    _asyncMissingTopStateMonitor.start();

Review Comment:
   Yup. that's what is done here. In reset() we unregister all things, clear map and then stop/interrupt thread. Async thread will be started again in active() call.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1088535288


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,60 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final ConcurrentHashMap<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(ConcurrentHashMap<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions and if any partition has missing top state greater than threshold then report
+              // it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of resource has top state recovered then reset the counter
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetMissingTopStateDurationGuage();
+                resourcePartitionIt.remove();
+              } else {
+                for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
+                  if (_missingTopStateDurationThreshold < Long.MAX_VALUE && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {
+                    resourceMonitor.updateMissingTopStateDurationGuage(System.currentTimeMillis() - missingTopStateStartTime);
+                  }
+                }
+
+              }
+            }
+          }
+        }
+      } catch (InterruptedException e) {
+        LOG.error("AsyncMissingTopStateMonitor has been interrupted.", e);
+      }
+    }
+
+    public void reset() {
+      for (String resource : _missingTopStateResourceMap.keySet()) {
+        ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resource);

Review Comment:
   From comments above method getResourceMonitor() it says it's only for tests. I think it's because it's not synchronized like getOrCreateResourceMonitor().  Anyways it doesn't create resource monitor everytime and hence used at all other places.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "xyuanlu (via GitHub)" <gi...@apache.org>.
xyuanlu commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1094966706


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,60 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final ConcurrentHashMap<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(ConcurrentHashMap<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {

Review Comment:
   Then why we want synchronized...? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by GitBox <gi...@apache.org>.
xyuanlu commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1082018501


##########
helix-core/src/main/java/org/apache/helix/controller/stages/TopStateHandoffReportStage.java:
##########
@@ -319,12 +322,17 @@ private void reportTopStateHandoffFailIfNecessary(ResourceControllerDataProvider
     String partitionName = partition.getPartitionName();
     MissingTopStateRecord record = missingTopStateMap.get(resourceName).get(partitionName);
     long startTime = record.getStartTimeStamp();
-    if (startTime > 0 && System.currentTimeMillis() - startTime > durationThreshold && !record
+    if (startTime > 0 && !record
         .isFailed()) {

Review Comment:
   nit: probably keep them in same line



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] helix-bot commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by GitBox <gi...@apache.org>.
helix-bot commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1082048214


##########
helix-core/src/main/java/org/apache/helix/controller/stages/TopStateHandoffReportStage.java:
##########
@@ -85,6 +85,9 @@ private void updateTopStateStatus(ResourceControllerDataProvider cache,
     if (cache.getClusterConfig() != null) {
       durationThreshold = cache.getClusterConfig().getMissTopStateDurationThreshold();
     }
+    if (clusterStatusMonitor != null) {
+      clusterStatusMonitor.updateMissingTopStateDurationThreshold(durationThreshold);

Review Comment:
   Curious, how do we calculate this `durationThreshold`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084404256


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,54 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final Map<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;;
+
+    public AsyncMissingTopStateMonitor(Map<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();

Review Comment:
   Hmm this wait is for when all resources has no partitions with missing top state then this sleeps this async thread. As soon as first partition for any resource goes top state missing, main ClusterStatusMonitor process will notify this async thread to wake up through updateMissingTopStateResourceMap method. Let me know if that makes sense or using condition variable here would be better in terms of performance.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] rahulrane50 commented on a diff in pull request #2344: Added new metric to report real time missing top state for partition

Posted by "rahulrane50 (via GitHub)" <gi...@apache.org>.
rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1084398651


##########
helix-core/src/main/java/org/apache/helix/controller/stages/TopStateHandoffReportStage.java:
##########
@@ -319,12 +322,17 @@ private void reportTopStateHandoffFailIfNecessary(ResourceControllerDataProvider
     String partitionName = partition.getPartitionName();
     MissingTopStateRecord record = missingTopStateMap.get(resourceName).get(partitionName);
     long startTime = record.getStartTimeStamp();
-    if (startTime > 0 && System.currentTimeMillis() - startTime > durationThreshold && !record
+    if (startTime > 0 && !record
         .isFailed()) {

Review Comment:
   Sure let me fix indentation. But this is not refactoring and needed for this fix (need to update map irrespective of threshold). 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] mgao0 commented on pull request #2344: Added new metric to report real time missing top state for partition

Posted by "mgao0 (via GitHub)" <gi...@apache.org>.
mgao0 commented on PR #2344:
URL: https://github.com/apache/helix/pull/2344#issuecomment-1402539475

   > @mgao0 synced up offline. Thanks a lot for your inputs. I think you are correct it won't make sense if we just have one counter with increasing value. @xyuanlu and @desaikomal I have updated PR with duration now instead up just randomly incrementing counter. I hope it makes sense now.
   
   Thanks @rahulrane50 for the update. To add more details, the conclusion is that if only for count of missing top state partition, we don't need an async thread, we can just couple it with ClusterStatusMonitor, but if we want to get a real time measurement for how long the missing top state has been lasting, then it makes sense to use an async thread. Thanks for making the change from counting the count to measuring the duration, and from gauge to histogram which shows the distribution of missing top state duration for different partitions, I think it makes sense. I'll take another look at your updated PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org