You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@geode.apache.org by GitBox <gi...@apache.org> on 2022/06/01 16:40:32 UTC

[GitHub] [geode] albertogpz opened a new pull request, #7747: GEODE-10334: Send alert when thread stuck for long

albertogpz opened a new pull request, #7747:
URL: https://github.com/apache/geode/pull/7747

   This is the implementation of the feature
   described in RFC:
   https://cwiki.apache.org/confluence/display/GEODE/Management+of+threads+stuck+for+a+long+time+in+Geode
   
   <!-- Thank you for submitting a contribution to Apache Geode. -->
   
   <!-- In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken: 
   -->
   
   ### For all changes:
   - [ ] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
   
   - [ ] Has your PR been rebased against the latest commit within the target branch (typically `develop`)?
   
   - [ ] Is your initial contribution a single, squashed commit?
   
   - [ ] Does `gradlew build` run cleanly?
   
   - [ ] Have you written or updated unit tests to verify your changes?
   
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   
   <!-- Note:
   Please ensure that once the PR is submitted, check Concourse for build issues and
   submit an update to your PR as soon as possible. If you need help, please send an
   email to dev@geode.apache.org.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@geode.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [geode] albertogpz commented on a diff in pull request #7747: GEODE-10334: Send alert when thread stuck for long

Posted by GitBox <gi...@apache.org>.
albertogpz commented on code in PR #7747:
URL: https://github.com/apache/geode/pull/7747#discussion_r895878888


##########
geode-core/src/main/java/org/apache/geode/internal/monitoring/executor/AbstractExecutor.java:
##########
@@ -51,9 +56,25 @@ protected AbstractExecutor(String groupName, long threadID) {
 
   public void handleExpiry(long stuckTime, Map<Long, ThreadInfo> threadInfoMap) {
     incNumIterationsStuck();
+    sendAlertForThreadStuckForLong(stuckTime, threadInfoMap);
     logger.warn(createThreadReport(stuckTime, threadInfoMap));
   }
 
+  private void sendAlertForThreadStuckForLong(long stuckTime, Map<Long, ThreadInfo> threadInfoMap) {
+    if (maxThreadStuckTime <= 0) {
+      return;
+    }
+    if (threadInfoMap.get(threadID) == null) {
+      return;
+    }
+    if (stuckForGood) {
+      return;
+    }
+    if (stuckTime > maxThreadStuckTime) {
+      stuckForGood = true;
+      logger.fatal(createThreadReport(stuckTime, threadInfoMap));

Review Comment:
   @kirklund I think the alert will not be generated by default if `error` is used. In order to see the alert when using `error`, the `alertLevel` in the `DistributedMXBean` should be changed from `severe` to a lower level.
   
   See https://geode.apache.org/docs/guide/114/managing/management/notification_federation_and_alerts.html
   
   Anyhow, I agree with you that it makes sense to use `error` given that having a thread stuck is not necessarily a catastrophic failure that may result in data corruption.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@geode.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [geode] kirklund commented on a diff in pull request #7747: GEODE-10334: Send alert when thread stuck for long

Posted by GitBox <gi...@apache.org>.
kirklund commented on code in PR #7747:
URL: https://github.com/apache/geode/pull/7747#discussion_r894864266


##########
geode-core/src/main/java/org/apache/geode/internal/monitoring/executor/AbstractExecutor.java:
##########
@@ -51,9 +56,25 @@ protected AbstractExecutor(String groupName, long threadID) {
 
   public void handleExpiry(long stuckTime, Map<Long, ThreadInfo> threadInfoMap) {
     incNumIterationsStuck();
+    sendAlertForThreadStuckForLong(stuckTime, threadInfoMap);
     logger.warn(createThreadReport(stuckTime, threadInfoMap));
   }
 
+  private void sendAlertForThreadStuckForLong(long stuckTime, Map<Long, ThreadInfo> threadInfoMap) {
+    if (maxThreadStuckTime <= 0) {
+      return;
+    }
+    if (threadInfoMap.get(threadID) == null) {
+      return;
+    }
+    if (stuckForGood) {
+      return;
+    }
+    if (stuckTime > maxThreadStuckTime) {
+      stuckForGood = true;
+      logger.fatal(createThreadReport(stuckTime, threadInfoMap));

Review Comment:
   `fatal` should be reserved for logging about catastrophic failure that may result in data corruption. The server should probably be taken off-line or bounced.
   
   `error` should be used to indicate that something failed but there should be no data corruption and it should generally be ok to allow the server to continue running.
   
   I think you probably want this log statement to be at `error` level which will still generate an `alert`.
   
   I'd like to add @dschneider-pivotal as a reviewer to confirm the above. Darrel, if you want to give me good definitions, I'd be happy to add it to the Apache Geode Wiki.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@geode.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [geode] albertogpz commented on a diff in pull request #7747: GEODE-10334: Send alert when thread stuck for long

Posted by GitBox <gi...@apache.org>.
albertogpz commented on code in PR #7747:
URL: https://github.com/apache/geode/pull/7747#discussion_r896655954


##########
geode-core/src/main/java/org/apache/geode/internal/monitoring/executor/AbstractExecutor.java:
##########
@@ -51,9 +56,25 @@ protected AbstractExecutor(String groupName, long threadID) {
 
   public void handleExpiry(long stuckTime, Map<Long, ThreadInfo> threadInfoMap) {
     incNumIterationsStuck();
+    sendAlertForThreadStuckForLong(stuckTime, threadInfoMap);
     logger.warn(createThreadReport(stuckTime, threadInfoMap));
   }
 
+  private void sendAlertForThreadStuckForLong(long stuckTime, Map<Long, ThreadInfo> threadInfoMap) {
+    if (maxThreadStuckTime <= 0) {
+      return;
+    }
+    if (threadInfoMap.get(threadID) == null) {
+      return;
+    }
+    if (stuckForGood) {
+      return;
+    }
+    if (stuckTime > maxThreadStuckTime) {
+      stuckForGood = true;
+      logger.fatal(createThreadReport(stuckTime, threadInfoMap));

Review Comment:
   @kirklund The problem I see with using `error` in this case is that, as you need to change the `alertLevel` in the `DistributedMXBean` to a lower level, the number of notifications received will be much higher, something that you do not necessarily want, for example because more resources will be used.
   
   There are other alerts in Geode issued when a fatal error is sent that do not necessarily mean a catastrophic failure, for example if the `ack-severe-threshold` is surpassed and that feature is enabled.
   
   Given that this is also a configurable feature, whoever makes use of it should know what to expect and how to act (according to the documentation) when the alert is received so I still think it makes sense to use `severe` to make sure the alert is always received if the feature is enabled.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@geode.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [geode] albertogpz commented on pull request #7747: GEODE-10344: Send alert when thread stuck for long

Posted by GitBox <gi...@apache.org>.
albertogpz commented on PR #7747:
URL: https://github.com/apache/geode/pull/7747#issuecomment-1174113842

   > It used to be that the Alert listener default was WARN not SEVERE. I think it was changed simply because we had too many log messages at WARN and ERROR that caused unwanted alerts. It does seem silly to log a msg saying it was FATAL/SEVERE when the server is still running. I think what is needed is a way to explicitly say "I want this msg to generate an alert" and it does not matter what its log level is. But since this particular msg needs to be explicitly enabled with a sys prop, I'm okay with it being whatever is needed to generate an alert.
   
   @kirklund  Do you agree with leaving the fatal log to make sure that the alert is always generated?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@geode.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [geode] albertogpz merged pull request #7747: GEODE-10344: Send alert when thread stuck for long

Posted by GitBox <gi...@apache.org>.
albertogpz merged PR #7747:
URL: https://github.com/apache/geode/pull/7747


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@geode.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org