You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by GitBox <gi...@apache.org> on 2020/12/12 01:26:52 UTC

[GitHub] [hbase] d-c-manning commented on a change in pull request #2762: HBASE-25329 Dump region hashes in logs for the regions that are stuck in transition for more than a configured amount of time

d-c-manning commented on a change in pull request #2762:
URL: https://github.com/apache/hbase/pull/2762#discussion_r541445488



##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3549,11 +3554,24 @@ public void updateRegionsInTransitionMetrics() {
       if (oldestRITTime < ritTime) {
         oldestRITTime = ritTime;
       }
+      if (counter < 500) { // Record 500 oldest RITs

Review comment:
       500 seems high to me. Would 20 be enough? Should this be a constant in the file, even if not configurable?

##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3549,11 +3554,24 @@ public void updateRegionsInTransitionMetrics() {
       if (oldestRITTime < ritTime) {
         oldestRITTime = ritTime;
       }
+      if (counter < 500) { // Record 500 oldest RITs
+        oldestRITHashesAndStates.add(
+          state.getRegion().getRegionNameAsString() + ":" + state.getState().name()
+        );
+      }
+      counter += 1;
     }
     if (this.metricsAssignmentManager != null) {
       this.metricsAssignmentManager.updateRITOldestAge(oldestRITTime);
       this.metricsAssignmentManager.updateRITCount(totalRITs);
       this.metricsAssignmentManager.updateRITCountOverThreshold(totalRITsOverThreshold);
+
+      LOG.debug("Oldest RIT hashes and states: " + oldestRITHashesAndStates.toString());
+      long time = EnvironmentEdgeManager.currentTime();
+      if ((time - ritThreshold / 2) >= this.lastRITHashMetricUpdate) {

Review comment:
       Why do we update the metrics hashes only if it's been more than `ritThreshold / 2` time since last update? If we do the work to find the oldest, it seems like we should update the metrics always. Then any query to get metrics will always have the most recent results (~3 seconds old at max, with default `hbase.regionserver.msginterval`)
   
   Though I do think it's a good idea for limiting the logging. Should the `LOG.debug` statement be in here?
   
   Then this statement may deserve a comment.
   ```suggestion
         // Only log if it has been long enough since the last update (default 30 seconds)
         if ((time - ritThreshold / 2) >= this.lastRITHashMetricUpdate) {
   ```

##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3538,9 +3541,11 @@ public void updateRegionsInTransitionMetrics() {
     int totalRITs = 0;
     int totalRITsOverThreshold = 0;
     long oldestRITTime = 0;
+    Set<String> oldestRITHashesAndStates = Sets.newHashSet(); // set of <rit hash>:<rit state>
     int ritThreshold = this.server.getConfiguration().
       getInt(HConstants.METRICS_RIT_STUCK_WARNING_THRESHOLD, 60000);
-    for (RegionState state: regionStates.getRegionsInTransition()) {
+    int counter = 0;
+    for (RegionState state: regionStates.getRegionsInTransitionOrderedByDuration()) {

Review comment:
       since this method uses the `state.getStamp()` to determine whether the RIT is over the threshold or the oldest time... should we just use `getRegionsInTransitionOrderedByTimestamp` instead?
   
   This new method `getRegionsInTransitionOrderedByDuration` is tracking total duration instead of duration since last state transition, so the longest RITs may not necessarily be those that are reported over threshold. I think it's probably good to be consistent... I'm not sure which way is better to be consistent, but since these metrics in this method are reporting time since last change instead of total duration, I'd favor that approach.

##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3549,11 +3554,24 @@ public void updateRegionsInTransitionMetrics() {
       if (oldestRITTime < ritTime) {
         oldestRITTime = ritTime;
       }
+      if (counter < 500) { // Record 500 oldest RITs
+        oldestRITHashesAndStates.add(
+          state.getRegion().getRegionNameAsString() + ":" + state.getState().name()
+        );
+      }
+      counter += 1;
     }
     if (this.metricsAssignmentManager != null) {
       this.metricsAssignmentManager.updateRITOldestAge(oldestRITTime);
       this.metricsAssignmentManager.updateRITCount(totalRITs);
       this.metricsAssignmentManager.updateRITCountOverThreshold(totalRITsOverThreshold);
+
+      LOG.debug("Oldest RIT hashes and states: " + oldestRITHashesAndStates.toString());

Review comment:
       we should only log if `oldestRITHashesAndStates` is non-empty.

##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3549,11 +3554,24 @@ public void updateRegionsInTransitionMetrics() {
       if (oldestRITTime < ritTime) {
         oldestRITTime = ritTime;
       }
+      if (counter < 500) { // Record 500 oldest RITs
+        oldestRITHashesAndStates.add(
+          state.getRegion().getRegionNameAsString() + ":" + state.getState().name()
+        );
+      }
+      counter += 1;
     }
     if (this.metricsAssignmentManager != null) {
       this.metricsAssignmentManager.updateRITOldestAge(oldestRITTime);
       this.metricsAssignmentManager.updateRITCount(totalRITs);
       this.metricsAssignmentManager.updateRITCountOverThreshold(totalRITsOverThreshold);
+
+      LOG.debug("Oldest RIT hashes and states: " + oldestRITHashesAndStates.toString());
+      long time = EnvironmentEdgeManager.currentTime();

Review comment:
       can we just reuse `currentTime` here from the beginning of the method?

##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3549,11 +3554,24 @@ public void updateRegionsInTransitionMetrics() {
       if (oldestRITTime < ritTime) {
         oldestRITTime = ritTime;
       }
+      if (counter < 500) { // Record 500 oldest RITs
+        oldestRITHashesAndStates.add(
+          state.getRegion().getRegionNameAsString() + ":" + state.getState().name()

Review comment:
       Do we have to use `name()` as I see other references use `state.getState()` directly.
   
   region names could be quite long - when you refer to hash, did you want the md5 encoded name?
   ```suggestion
             state.getRegion().getEncodedName() + ":" + state.getState()
   ```
   Or should the entire region name be available?

##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -3549,11 +3554,24 @@ public void updateRegionsInTransitionMetrics() {
       if (oldestRITTime < ritTime) {
         oldestRITTime = ritTime;
       }
+      if (counter < 500) { // Record 500 oldest RITs
+        oldestRITHashesAndStates.add(

Review comment:
       should we also filter these to only those RITs that are over the threshold? Or do we want to display every RIT, even if it's only been transitioning for 100ms, if it's the longest/only RIT in the cluster?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org