You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@geode.apache.org by GitBox <gi...@apache.org> on 2022/06/02 00:02:20 UTC

[GitHub] [geode] dschneider-pivotal opened a new pull request, #7751: See if old getThreadInfo does not do a stop the world safepoint

dschneider-pivotal opened a new pull request, #7751:
URL: https://github.com/apache/geode/pull/7751

   Now uses a cheaper getThreadInfo that does not get lock info
   To get lock info, set the system property "gemfire.threadmonitor.showLocks" to "true".
   
   <!-- Thank you for submitting a contribution to Apache Geode. -->
   
   <!-- In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken: 
   -->
   
   ### For all changes:
   - [ ] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
   
   - [ ] Has your PR been rebased against the latest commit within the target branch (typically `develop`)?
   
   - [ ] Is your initial contribution a single, squashed commit?
   
   - [ ] Does `gradlew build` run cleanly?
   
   - [ ] Have you written or updated unit tests to verify your changes?
   
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   
   <!-- Note:
   Please ensure that once the PR is submitted, check Concourse for build issues and
   submit an update to your PR as soon as possible. If you need help, please send an
   email to dev@geode.apache.org.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@geode.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [geode] dschneider-pivotal commented on a diff in pull request #7751: GEODE-8977: change ThreadMonitor to reduce how long it does a "stop the world" ThreadDump vm op

Posted by GitBox <gi...@apache.org>.
dschneider-pivotal commented on code in PR #7751:
URL: https://github.com/apache/geode/pull/7751#discussion_r890618589


##########
geode-core/src/main/java/org/apache/geode/internal/monitoring/ThreadsMonitoringProcess.java:
##########
@@ -119,30 +120,87 @@ private int checkForStuckThreads(Collection<AbstractExecutor> executors, long cu
     return result;
   }
 
-  @VisibleForTesting
-  public static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds) {
-    /*
-     * NOTE: at least some implementations of getThreadInfo(long[], boolean, boolean)
-     * will core dump if the long array contains a duplicate value.
-     * That is why stuckThreadIds is a Set instead of a List.
-     */
+  /**
+   * If set to true, then the JVM will be asked for what locks a thread holds.
+   * This is extra expensive to ask for on some JVMs so be careful setting this to true.
+   */
+  private static final boolean SHOW_LOCKS = Boolean.getBoolean("gemfire.threadmonitor.showLocks");
+  /**
+   * If set to true, then the JVM will be asked for all potential stuck threads with one call.
+   * Since getThreadInfo on many JVMs, stops ALL threads from running, and since getting info
+   * on multiple threads with one call is additional work, setting this can cause an extra long
+   * stop the world that can then cause other problems (like a forced disconnect).
+   * So be careful setting this to true.
+   */
+  private static final boolean BATCH_CALLS = Boolean.getBoolean("gemfire.threadmonitor.batchCalls");
+
+  private static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds) {
+    return createThreadInfoMap(stuckThreadIds, SHOW_LOCKS, BATCH_CALLS);
+  }
+
+  public static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds,
+      boolean showLocks, boolean batchCalls) {
+    return createThreadInfoMap(ManagementFactory.getThreadMXBean(), stuckThreadIds, showLocks,
+        batchCalls);
+  }
+
+  static Map<Long, ThreadInfo> createThreadInfoMap(ThreadMXBean threadMXBean,
+      Set<Long> stuckThreadIds,
+      final boolean showLocks, final boolean batchCalls) {
     if (stuckThreadIds.isEmpty()) {
       return Collections.emptyMap();
     }
+    logger.info(
+        "Obtaining ThreadInfo for {} threads. Configuration: showLocks={} batchCalls={}. This is an expensive operation for the JVM and on most JVMs causes all threads to be paused.",
+        stuckThreadIds.size(), showLocks, batchCalls);
+    Map<Long, ThreadInfo> result = new HashMap<>();
+    if (batchCalls) {
+      createThreadInfoMapUsingSingleCall(threadMXBean, stuckThreadIds, showLocks, result);
+    } else {
+      for (long id : stuckThreadIds) {
+        ThreadInfo threadInfo = createThreadInfoForSinglePid(threadMXBean, showLocks, id);
+        if (threadInfo != null) {
+          result.put(threadInfo.getThreadId(), threadInfo);
+        }
+      }
+    }
+    logger.info("finished obtaining ThreadInfo");
+    return result;
+  }
+
+  private static ThreadInfo createThreadInfoForSinglePid(

Review Comment:
   yes, done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@geode.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [geode] dschneider-pivotal commented on a diff in pull request #7751: GEODE-8977: change ThreadMonitor to reduce how long it does a "stop the world" ThreadDump vm op

Posted by GitBox <gi...@apache.org>.
dschneider-pivotal commented on code in PR #7751:
URL: https://github.com/apache/geode/pull/7751#discussion_r890413150


##########
geode-core/src/main/java/org/apache/geode/internal/monitoring/ThreadsMonitoringProcess.java:
##########
@@ -119,29 +119,69 @@ private int checkForStuckThreads(Collection<AbstractExecutor> executors, long cu
     return result;
   }
 
+  /**
+   * If set to true, then the JVM will be asked for what locks a thread holds.
+   * This is extra expensive to ask for on some JVMs so be careful setting this to true.
+   */
+  private static final boolean SHOW_LOCKS = Boolean.getBoolean("gemfire.threadmonitor.showLocks");
+  /**
+   * If set to true, then the JVM will be asked for all potential stuck threads with one call.
+   * Since getThreadInfo on many JVMs, stops ALL threads from running, and since getting info
+   * on multiple threads with one call is additional work, setting this can cause an extra long
+   * stop the world that can then cause other problems (like a forced disconnect).
+   * So be careful setting this to true.
+   */
+  private static final boolean BATCH_CALLS = Boolean.getBoolean("gemfire.threadmonitor.batchCalls");
+
+  private static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds) {
+    return createThreadInfoMap(stuckThreadIds, SHOW_LOCKS, BATCH_CALLS);
+  }
+
   @VisibleForTesting
-  public static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds) {
-    /*
-     * NOTE: at least some implementations of getThreadInfo(long[], boolean, boolean)
-     * will core dump if the long array contains a duplicate value.
-     * That is why stuckThreadIds is a Set instead of a List.
-     */
+  public static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds,
+      final boolean showLocks, final boolean batchCalls) {
     if (stuckThreadIds.isEmpty()) {
       return Collections.emptyMap();
     }
-    long[] ids = new long[stuckThreadIds.size()];
-    int idx = 0;
-    for (long id : stuckThreadIds) {
-      ids[idx] = id;
-      idx++;
-    }
-    ThreadInfo[] threadInfos = ManagementFactory.getThreadMXBean().getThreadInfo(ids, true, true);
+    logger.info("Obtaining ThreadInfo for " + stuckThreadIds.size()

Review Comment:
   fixed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@geode.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [geode] dschneider-pivotal commented on a diff in pull request #7751: GEODE-8977: change ThreadMonitor to reduce how long it does a "stop the world" ThreadDump vm op

Posted by GitBox <gi...@apache.org>.
dschneider-pivotal commented on code in PR #7751:
URL: https://github.com/apache/geode/pull/7751#discussion_r890569224


##########
geode-core/src/main/java/org/apache/geode/internal/monitoring/ThreadsMonitoringProcess.java:
##########
@@ -119,30 +120,87 @@ private int checkForStuckThreads(Collection<AbstractExecutor> executors, long cu
     return result;
   }
 
-  @VisibleForTesting
-  public static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds) {
-    /*
-     * NOTE: at least some implementations of getThreadInfo(long[], boolean, boolean)
-     * will core dump if the long array contains a duplicate value.
-     * That is why stuckThreadIds is a Set instead of a List.
-     */
+  /**
+   * If set to true, then the JVM will be asked for what locks a thread holds.
+   * This is extra expensive to ask for on some JVMs so be careful setting this to true.
+   */
+  private static final boolean SHOW_LOCKS = Boolean.getBoolean("gemfire.threadmonitor.showLocks");
+  /**
+   * If set to true, then the JVM will be asked for all potential stuck threads with one call.
+   * Since getThreadInfo on many JVMs, stops ALL threads from running, and since getting info
+   * on multiple threads with one call is additional work, setting this can cause an extra long
+   * stop the world that can then cause other problems (like a forced disconnect).
+   * So be careful setting this to true.
+   */
+  private static final boolean BATCH_CALLS = Boolean.getBoolean("gemfire.threadmonitor.batchCalls");
+
+  private static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds) {
+    return createThreadInfoMap(stuckThreadIds, SHOW_LOCKS, BATCH_CALLS);
+  }
+
+  public static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds,
+      boolean showLocks, boolean batchCalls) {
+    return createThreadInfoMap(ManagementFactory.getThreadMXBean(), stuckThreadIds, showLocks,
+        batchCalls);
+  }
+
+  static Map<Long, ThreadInfo> createThreadInfoMap(ThreadMXBean threadMXBean,
+      Set<Long> stuckThreadIds,
+      final boolean showLocks, final boolean batchCalls) {
     if (stuckThreadIds.isEmpty()) {
       return Collections.emptyMap();
     }
+    logger.info(
+        "Obtaining ThreadInfo for {} threads. Configuration: showLocks={} batchCalls={}. This is an expensive operation for the JVM and on most JVMs causes all threads to be paused.",

Review Comment:
   It has always been true. Between these two messages we are making calls that ask ALL threads to pause at a safe point. Once they are all paused we then get the ThreadInfo. If one of those threads has a high TTSP (time to safe point) then this can cause what looks like a stop the world gc but without any gc activity. Since what we do between these two log message could cause things like a forced disconnect I thought it was worth logging. Note that if we have no suspects of being stuck then we don't log the messages.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@geode.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [geode] dschneider-pivotal merged pull request #7751: GEODE-8977: change ThreadMonitor to reduce how long it does a "stop the world" ThreadDump vm op

Posted by GitBox <gi...@apache.org>.
dschneider-pivotal merged PR #7751:
URL: https://github.com/apache/geode/pull/7751


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@geode.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [geode] pivotal-jbarrett commented on a diff in pull request #7751: GEODE-8977: change ThreadMonitor to reduce how long it does a "stop the world" ThreadDump vm op

Posted by GitBox <gi...@apache.org>.
pivotal-jbarrett commented on code in PR #7751:
URL: https://github.com/apache/geode/pull/7751#discussion_r889254461


##########
geode-core/src/main/java/org/apache/geode/internal/monitoring/ThreadsMonitoringProcess.java:
##########
@@ -119,29 +119,69 @@ private int checkForStuckThreads(Collection<AbstractExecutor> executors, long cu
     return result;
   }
 
+  /**
+   * If set to true, then the JVM will be asked for what locks a thread holds.
+   * This is extra expensive to ask for on some JVMs so be careful setting this to true.
+   */
+  private static final boolean SHOW_LOCKS = Boolean.getBoolean("gemfire.threadmonitor.showLocks");
+  /**
+   * If set to true, then the JVM will be asked for all potential stuck threads with one call.
+   * Since getThreadInfo on many JVMs, stops ALL threads from running, and since getting info
+   * on multiple threads with one call is additional work, setting this can cause an extra long
+   * stop the world that can then cause other problems (like a forced disconnect).
+   * So be careful setting this to true.
+   */
+  private static final boolean BATCH_CALLS = Boolean.getBoolean("gemfire.threadmonitor.batchCalls");
+
+  private static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds) {
+    return createThreadInfoMap(stuckThreadIds, SHOW_LOCKS, BATCH_CALLS);
+  }
+
   @VisibleForTesting
-  public static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds) {
-    /*
-     * NOTE: at least some implementations of getThreadInfo(long[], boolean, boolean)
-     * will core dump if the long array contains a duplicate value.
-     * That is why stuckThreadIds is a Set instead of a List.
-     */
+  public static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds,
+      final boolean showLocks, final boolean batchCalls) {
     if (stuckThreadIds.isEmpty()) {
       return Collections.emptyMap();
     }
-    long[] ids = new long[stuckThreadIds.size()];
-    int idx = 0;
-    for (long id : stuckThreadIds) {
-      ids[idx] = id;
-      idx++;
-    }
-    ThreadInfo[] threadInfos = ManagementFactory.getThreadMXBean().getThreadInfo(ids, true, true);
+    logger.info("Obtaining ThreadInfo for " + stuckThreadIds.size()

Review Comment:
   Why are we using string concatenation here rather than the typical formatted arguments style?



##########
geode-core/src/main/java/org/apache/geode/internal/monitoring/ThreadsMonitoringProcess.java:
##########
@@ -119,29 +119,69 @@ private int checkForStuckThreads(Collection<AbstractExecutor> executors, long cu
     return result;
   }
 
+  /**
+   * If set to true, then the JVM will be asked for what locks a thread holds.
+   * This is extra expensive to ask for on some JVMs so be careful setting this to true.
+   */
+  private static final boolean SHOW_LOCKS = Boolean.getBoolean("gemfire.threadmonitor.showLocks");
+  /**
+   * If set to true, then the JVM will be asked for all potential stuck threads with one call.
+   * Since getThreadInfo on many JVMs, stops ALL threads from running, and since getting info
+   * on multiple threads with one call is additional work, setting this can cause an extra long
+   * stop the world that can then cause other problems (like a forced disconnect).
+   * So be careful setting this to true.
+   */
+  private static final boolean BATCH_CALLS = Boolean.getBoolean("gemfire.threadmonitor.batchCalls");
+
+  private static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds) {
+    return createThreadInfoMap(stuckThreadIds, SHOW_LOCKS, BATCH_CALLS);
+  }
+
   @VisibleForTesting
-  public static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds) {
-    /*
-     * NOTE: at least some implementations of getThreadInfo(long[], boolean, boolean)
-     * will core dump if the long array contains a duplicate value.
-     * That is why stuckThreadIds is a Set instead of a List.
-     */
+  public static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds,
+      final boolean showLocks, final boolean batchCalls) {
     if (stuckThreadIds.isEmpty()) {
       return Collections.emptyMap();
     }
-    long[] ids = new long[stuckThreadIds.size()];
-    int idx = 0;
-    for (long id : stuckThreadIds) {
-      ids[idx] = id;
-      idx++;
-    }
-    ThreadInfo[] threadInfos = ManagementFactory.getThreadMXBean().getThreadInfo(ids, true, true);
+    logger.info("Obtaining ThreadInfo for " + stuckThreadIds.size()
+        + " threads. Configuration: showLocks=" + showLocks + " batchCalls=" + batchCalls
+        + " This is an expensive operation for the JVM and on most JVMs causes all threads to be paused.");
     Map<Long, ThreadInfo> result = new HashMap<>();
-    for (ThreadInfo threadInfo : threadInfos) {
-      if (threadInfo != null) {
-        result.put(threadInfo.getThreadId(), threadInfo);
+    if (batchCalls) {

Review Comment:
   Rather than inlining the two modes can we split this into two separate methods for readability?



##########
geode-core/src/main/java/org/apache/geode/internal/monitoring/ThreadsMonitoringProcess.java:
##########
@@ -119,29 +119,69 @@ private int checkForStuckThreads(Collection<AbstractExecutor> executors, long cu
     return result;
   }
 
+  /**
+   * If set to true, then the JVM will be asked for what locks a thread holds.
+   * This is extra expensive to ask for on some JVMs so be careful setting this to true.
+   */
+  private static final boolean SHOW_LOCKS = Boolean.getBoolean("gemfire.threadmonitor.showLocks");
+  /**
+   * If set to true, then the JVM will be asked for all potential stuck threads with one call.
+   * Since getThreadInfo on many JVMs, stops ALL threads from running, and since getting info
+   * on multiple threads with one call is additional work, setting this can cause an extra long
+   * stop the world that can then cause other problems (like a forced disconnect).
+   * So be careful setting this to true.
+   */
+  private static final boolean BATCH_CALLS = Boolean.getBoolean("gemfire.threadmonitor.batchCalls");
+
+  private static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds) {
+    return createThreadInfoMap(stuckThreadIds, SHOW_LOCKS, BATCH_CALLS);
+  }
+
   @VisibleForTesting
-  public static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds) {
-    /*
-     * NOTE: at least some implementations of getThreadInfo(long[], boolean, boolean)
-     * will core dump if the long array contains a duplicate value.
-     * That is why stuckThreadIds is a Set instead of a List.
-     */
+  public static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds,

Review Comment:
   This method is visible for testing yet there does not appear to be any tests for all the branches of this code. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@geode.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [geode] dschneider-pivotal commented on pull request #7751: GEODE-8977: change ThreadMonitor to reduce how long it does a "stop the world" ThreadDump vm op

Posted by GitBox <gi...@apache.org>.
dschneider-pivotal commented on PR #7751:
URL: https://github.com/apache/geode/pull/7751#issuecomment-1147744983

   @pivotal-jbarrett I filed a new ticket to add perf testing: https://issues.apache.org/jira/browse/GEODE-10359


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@geode.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [geode] dschneider-pivotal commented on a diff in pull request #7751: GEODE-8977: change ThreadMonitor to reduce how long it does a "stop the world" ThreadDump vm op

Posted by GitBox <gi...@apache.org>.
dschneider-pivotal commented on code in PR #7751:
URL: https://github.com/apache/geode/pull/7751#discussion_r890412658


##########
geode-core/src/main/java/org/apache/geode/internal/monitoring/ThreadsMonitoringProcess.java:
##########
@@ -119,29 +119,69 @@ private int checkForStuckThreads(Collection<AbstractExecutor> executors, long cu
     return result;
   }
 
+  /**
+   * If set to true, then the JVM will be asked for what locks a thread holds.
+   * This is extra expensive to ask for on some JVMs so be careful setting this to true.
+   */
+  private static final boolean SHOW_LOCKS = Boolean.getBoolean("gemfire.threadmonitor.showLocks");
+  /**
+   * If set to true, then the JVM will be asked for all potential stuck threads with one call.
+   * Since getThreadInfo on many JVMs, stops ALL threads from running, and since getting info
+   * on multiple threads with one call is additional work, setting this can cause an extra long
+   * stop the world that can then cause other problems (like a forced disconnect).
+   * So be careful setting this to true.
+   */
+  private static final boolean BATCH_CALLS = Boolean.getBoolean("gemfire.threadmonitor.batchCalls");
+
+  private static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds) {
+    return createThreadInfoMap(stuckThreadIds, SHOW_LOCKS, BATCH_CALLS);
+  }
+
   @VisibleForTesting
-  public static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds) {
-    /*
-     * NOTE: at least some implementations of getThreadInfo(long[], boolean, boolean)
-     * will core dump if the long array contains a duplicate value.
-     * That is why stuckThreadIds is a Set instead of a List.
-     */
+  public static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds,

Review Comment:
   added test coverage



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@geode.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [geode] pivotal-jbarrett commented on a diff in pull request #7751: GEODE-8977: change ThreadMonitor to reduce how long it does a "stop the world" ThreadDump vm op

Posted by GitBox <gi...@apache.org>.
pivotal-jbarrett commented on code in PR #7751:
URL: https://github.com/apache/geode/pull/7751#discussion_r889259544


##########
geode-core/src/test/java/org/apache/geode/internal/monitoring/executor/AbstractExecutorGroupJUnitTest.java:
##########
@@ -104,7 +104,7 @@ public void handleExpiry(long stuckTime, Map<Long, ThreadInfo> map) {
         threadIds.add(blockedThread.getId());
         threadIds.add(blockingThread.getId());
         String threadReport = executor.createThreadReport(60000,
-            ThreadsMonitoringProcess.createThreadInfoMap(threadIds));
+            ThreadsMonitoringProcess.createThreadInfoMap(threadIds, true, true));

Review Comment:
   I find it confusing that the only unit test to change is one that isn't related to the changed unit. Please add tests  to the `ThreadsMonitoringProcessJUnitTest`. Please rename this test and `ThreadsMonitoringProcessJUnitTest` to conform to the current naming conventions. Please update both to use JUnit 5.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@geode.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [geode] dschneider-pivotal commented on a diff in pull request #7751: GEODE-8977: change ThreadMonitor to reduce how long it does a "stop the world" ThreadDump vm op

Posted by GitBox <gi...@apache.org>.
dschneider-pivotal commented on code in PR #7751:
URL: https://github.com/apache/geode/pull/7751#discussion_r890412818


##########
geode-core/src/main/java/org/apache/geode/internal/monitoring/ThreadsMonitoringProcess.java:
##########
@@ -119,29 +119,69 @@ private int checkForStuckThreads(Collection<AbstractExecutor> executors, long cu
     return result;
   }
 
+  /**
+   * If set to true, then the JVM will be asked for what locks a thread holds.
+   * This is extra expensive to ask for on some JVMs so be careful setting this to true.
+   */
+  private static final boolean SHOW_LOCKS = Boolean.getBoolean("gemfire.threadmonitor.showLocks");
+  /**
+   * If set to true, then the JVM will be asked for all potential stuck threads with one call.
+   * Since getThreadInfo on many JVMs, stops ALL threads from running, and since getting info
+   * on multiple threads with one call is additional work, setting this can cause an extra long
+   * stop the world that can then cause other problems (like a forced disconnect).
+   * So be careful setting this to true.
+   */
+  private static final boolean BATCH_CALLS = Boolean.getBoolean("gemfire.threadmonitor.batchCalls");
+
+  private static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds) {
+    return createThreadInfoMap(stuckThreadIds, SHOW_LOCKS, BATCH_CALLS);
+  }
+
   @VisibleForTesting
-  public static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds) {
-    /*
-     * NOTE: at least some implementations of getThreadInfo(long[], boolean, boolean)
-     * will core dump if the long array contains a duplicate value.
-     * That is why stuckThreadIds is a Set instead of a List.
-     */
+  public static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds,
+      final boolean showLocks, final boolean batchCalls) {
     if (stuckThreadIds.isEmpty()) {
       return Collections.emptyMap();
     }
-    long[] ids = new long[stuckThreadIds.size()];
-    int idx = 0;
-    for (long id : stuckThreadIds) {
-      ids[idx] = id;
-      idx++;
-    }
-    ThreadInfo[] threadInfos = ManagementFactory.getThreadMXBean().getThreadInfo(ids, true, true);
+    logger.info("Obtaining ThreadInfo for " + stuckThreadIds.size()
+        + " threads. Configuration: showLocks=" + showLocks + " batchCalls=" + batchCalls
+        + " This is an expensive operation for the JVM and on most JVMs causes all threads to be paused.");
     Map<Long, ThreadInfo> result = new HashMap<>();
-    for (ThreadInfo threadInfo : threadInfos) {
-      if (threadInfo != null) {
-        result.put(threadInfo.getThreadId(), threadInfo);
+    if (batchCalls) {

Review Comment:
   yes



##########
geode-core/src/test/java/org/apache/geode/internal/monitoring/executor/AbstractExecutorGroupJUnitTest.java:
##########
@@ -104,7 +104,7 @@ public void handleExpiry(long stuckTime, Map<Long, ThreadInfo> map) {
         threadIds.add(blockedThread.getId());
         threadIds.add(blockingThread.getId());
         String threadReport = executor.createThreadReport(60000,
-            ThreadsMonitoringProcess.createThreadInfoMap(threadIds));
+            ThreadsMonitoringProcess.createThreadInfoMap(threadIds, true, true));

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@geode.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [geode] pivotal-jbarrett commented on a diff in pull request #7751: GEODE-8977: change ThreadMonitor to reduce how long it does a "stop the world" ThreadDump vm op

Posted by GitBox <gi...@apache.org>.
pivotal-jbarrett commented on code in PR #7751:
URL: https://github.com/apache/geode/pull/7751#discussion_r890601277


##########
geode-core/src/main/java/org/apache/geode/internal/monitoring/ThreadsMonitoringProcess.java:
##########
@@ -119,30 +120,87 @@ private int checkForStuckThreads(Collection<AbstractExecutor> executors, long cu
     return result;
   }
 
-  @VisibleForTesting
-  public static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds) {
-    /*
-     * NOTE: at least some implementations of getThreadInfo(long[], boolean, boolean)
-     * will core dump if the long array contains a duplicate value.
-     * That is why stuckThreadIds is a Set instead of a List.
-     */
+  /**
+   * If set to true, then the JVM will be asked for what locks a thread holds.
+   * This is extra expensive to ask for on some JVMs so be careful setting this to true.
+   */
+  private static final boolean SHOW_LOCKS = Boolean.getBoolean("gemfire.threadmonitor.showLocks");
+  /**
+   * If set to true, then the JVM will be asked for all potential stuck threads with one call.
+   * Since getThreadInfo on many JVMs, stops ALL threads from running, and since getting info
+   * on multiple threads with one call is additional work, setting this can cause an extra long
+   * stop the world that can then cause other problems (like a forced disconnect).
+   * So be careful setting this to true.
+   */
+  private static final boolean BATCH_CALLS = Boolean.getBoolean("gemfire.threadmonitor.batchCalls");
+
+  private static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds) {
+    return createThreadInfoMap(stuckThreadIds, SHOW_LOCKS, BATCH_CALLS);
+  }
+
+  public static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds,
+      boolean showLocks, boolean batchCalls) {
+    return createThreadInfoMap(ManagementFactory.getThreadMXBean(), stuckThreadIds, showLocks,
+        batchCalls);
+  }
+
+  static Map<Long, ThreadInfo> createThreadInfoMap(ThreadMXBean threadMXBean,
+      Set<Long> stuckThreadIds,
+      final boolean showLocks, final boolean batchCalls) {
     if (stuckThreadIds.isEmpty()) {
       return Collections.emptyMap();
     }
+    logger.info(
+        "Obtaining ThreadInfo for {} threads. Configuration: showLocks={} batchCalls={}. This is an expensive operation for the JVM and on most JVMs causes all threads to be paused.",
+        stuckThreadIds.size(), showLocks, batchCalls);
+    Map<Long, ThreadInfo> result = new HashMap<>();
+    if (batchCalls) {
+      createThreadInfoMapUsingSingleCall(threadMXBean, stuckThreadIds, showLocks, result);
+    } else {
+      for (long id : stuckThreadIds) {
+        ThreadInfo threadInfo = createThreadInfoForSinglePid(threadMXBean, showLocks, id);
+        if (threadInfo != null) {
+          result.put(threadInfo.getThreadId(), threadInfo);
+        }
+      }
+    }
+    logger.info("finished obtaining ThreadInfo");
+    return result;
+  }
+
+  private static ThreadInfo createThreadInfoForSinglePid(

Review Comment:
   The `id` is for a thread not a process right? Shouldn't this be something like `createThreadInfoForSingleThread`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@geode.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [geode] boglesby commented on a diff in pull request #7751: GEODE-8977: change ThreadMonitor to reduce how long it does a "stop the world" ThreadDump vm op

Posted by GitBox <gi...@apache.org>.
boglesby commented on code in PR #7751:
URL: https://github.com/apache/geode/pull/7751#discussion_r890562414


##########
geode-core/src/main/java/org/apache/geode/internal/monitoring/ThreadsMonitoringProcess.java:
##########
@@ -119,30 +120,87 @@ private int checkForStuckThreads(Collection<AbstractExecutor> executors, long cu
     return result;
   }
 
-  @VisibleForTesting
-  public static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds) {
-    /*
-     * NOTE: at least some implementations of getThreadInfo(long[], boolean, boolean)
-     * will core dump if the long array contains a duplicate value.
-     * That is why stuckThreadIds is a Set instead of a List.
-     */
+  /**
+   * If set to true, then the JVM will be asked for what locks a thread holds.
+   * This is extra expensive to ask for on some JVMs so be careful setting this to true.
+   */
+  private static final boolean SHOW_LOCKS = Boolean.getBoolean("gemfire.threadmonitor.showLocks");
+  /**
+   * If set to true, then the JVM will be asked for all potential stuck threads with one call.
+   * Since getThreadInfo on many JVMs, stops ALL threads from running, and since getting info
+   * on multiple threads with one call is additional work, setting this can cause an extra long
+   * stop the world that can then cause other problems (like a forced disconnect).
+   * So be careful setting this to true.
+   */
+  private static final boolean BATCH_CALLS = Boolean.getBoolean("gemfire.threadmonitor.batchCalls");
+
+  private static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds) {
+    return createThreadInfoMap(stuckThreadIds, SHOW_LOCKS, BATCH_CALLS);
+  }
+
+  public static Map<Long, ThreadInfo> createThreadInfoMap(Set<Long> stuckThreadIds,
+      boolean showLocks, boolean batchCalls) {
+    return createThreadInfoMap(ManagementFactory.getThreadMXBean(), stuckThreadIds, showLocks,
+        batchCalls);
+  }
+
+  static Map<Long, ThreadInfo> createThreadInfoMap(ThreadMXBean threadMXBean,
+      Set<Long> stuckThreadIds,
+      final boolean showLocks, final boolean batchCalls) {
     if (stuckThreadIds.isEmpty()) {
       return Collections.emptyMap();
     }
+    logger.info(
+        "Obtaining ThreadInfo for {} threads. Configuration: showLocks={} batchCalls={}. This is an expensive operation for the JVM and on most JVMs causes all threads to be paused.",

Review Comment:
   A message like this is logged every time there are stuck threads now. Is this what we want? Is the `This is an expensive operation` part true now, or has it always been true of this monitor?
   ```
   [info 2022/06/06 13:53:47.175 PDT server-1 <ThreadsMonitor> tid=0x10] Obtaining ThreadInfo for 2 threads. Configuration: showLocks=false batchCalls=false. This is an expensive operation for the JVM and on most JVMs causes all threads to be paused.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@geode.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org