You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@helix.apache.org by GitBox <gi...@apache.org> on 2020/08/04 23:20:01 UTC

[GitHub] [helix] kaisun2000 opened a new pull request #1218: Fix ClusterManager thread leak

kaisun2000 opened a new pull request #1218:
URL: https://github.com/apache/helix/pull/1218


   
   ### Issues
   
   - [x] My PR addresses the following Helix issues and references them in the PR description:
   
   fix #1216 
   
   ### Description
   
   - [x] Here are some details about my PR, including screenshots of any UI changes:
   
   TestNoDoubleAssign relies on randomness of creating a lot of participant or
   ClusterManager. Due to race condition of stopping/starting participants, the
   Test leaks hundreds ClusterManager thread and ZkClient threads associated.
   
   ### Tests
   
   - [ ] The following tests are written for this issue:
   
   (List the names of added unit/integration tests)
   
   - [ ] The following is the result of the "mvn test" command on the appropriate module:
   
   (Copy & paste the result of "mvn test")
   
   ### Commits
   
   - [ ] My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation (Optional)
   
   - [ ] In case of new functionality, my PR adds documentation in the following wiki page:
   
   (Link the GitHub wiki you added)
   
   ### Code Quality
   
   - [x] My diff has been formatted using helix-style.xml 
   (helix-style-intellij.xml if IntelliJ IDE is used)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1218: Fix ClusterManager thread leak

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1218:
URL: https://github.com/apache/helix/pull/1218#discussion_r467527907



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestNoDoubleAssign.java
##########
@@ -42,7 +42,7 @@
 import org.testng.annotations.Test;
 
 public class TestNoDoubleAssign extends TaskTestBase {
-  private static final int THREAD_COUNT = 10;
+  private static final int THREAD_COUNT = 1;

Review comment:
       pollForDoubleAssign() also use this one.
   
   This test has a bad design. Aside from the race condition which cause leak,  you will see that breakConnection() and pollForDoubleAssign() does not really reliably reproduce the problem. 
   
   See the other discussion with JJ, the proposed fix is:
   
   ```
   _executorServiceConnection.scheduleAtFixedRate(() -> {
         // Randomly pick a Participant and cause a transient connection issue
        synchronized (_executorServiceConnection) {
         int participantIndex = RANDOM.nextInt(_numNodes);
         stopParticipant(participantIndex);
         startParticipant(participantIndex);
         }
     }, 0L, CONNECTION_DELAY, TimeUnit.MILLISECONDS);
   ```
   
   Let me know what is your take? Since the proposed fix from you and JJ are different. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] pkuwm commented on a change in pull request #1218: Fix ClusterManager thread leak

Posted by GitBox <gi...@apache.org>.
pkuwm commented on a change in pull request #1218:
URL: https://github.com/apache/helix/pull/1218#discussion_r465915399



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestNoDoubleAssign.java
##########
@@ -155,11 +155,16 @@ private void pollForDoubleAssign() {
    * Randomly causes Participants to lost connection temporarily.
    */
   private void breakConnection() {
+    // Note, send to THREAD_COUNT == 1 is a must to avoid leaking ClusterManager (participant).
+    // Otherwise, multiple thread has race of stopping and starting the same slot of index
+    // this would cause a vast amount hundreds to thousand leaking threads.
+    // The design of this test itself relys on randomness too, which is also not a good idea.
     _executorServiceConnection = Executors.newScheduledThreadPool(THREAD_COUNT);
     _executorServiceConnection.scheduleAtFixedRate(() -> {
       // Randomly pick a Participant and cause a transient connection issue
       int participantIndex = RANDOM.nextInt(_numNodes);
-      _participants[participantIndex].disconnect();
+      // _participants[participantIndex].disconnect();

Review comment:
       Remove this instead of commenting it out?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1218: Fix ClusterManager thread leak

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1218:
URL: https://github.com/apache/helix/pull/1218#discussion_r467232130



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestNoDoubleAssign.java
##########
@@ -155,11 +155,16 @@ private void pollForDoubleAssign() {
    * Randomly causes Participants to lost connection temporarily.
    */
   private void breakConnection() {
+    // Note, send to THREAD_COUNT == 1 is a must to avoid leaking ClusterManager (participant).

Review comment:
       Sure. Let me explain:
   
   Assume we have two thread A, B. Both of them choose slot 0. The sequence of action is 
   
   A stop 0
   B stop 0
   A start 0
   B start 0
   
   Now _participants[0] hold the reference of clutermanager started by B.  clustermanager started by A is not referenced by any one any more. They are leaked. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1218: Fix ClusterManager thread leak

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1218:
URL: https://github.com/apache/helix/pull/1218#discussion_r465948797



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestNoDoubleAssign.java
##########
@@ -42,7 +42,7 @@
 import org.testng.annotations.Test;
 
 public class TestNoDoubleAssign extends TaskTestBase {
-  private static final int THREAD_COUNT = 10;
+  private static final int THREAD_COUNT = 1;

Review comment:
       changed to use newSingleThreadExecutor for break connection




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1218: Fix ClusterManager thread leak

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1218:
URL: https://github.com/apache/helix/pull/1218#discussion_r467515679



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestNoDoubleAssign.java
##########
@@ -42,7 +42,7 @@
 import org.testng.annotations.Test;
 
 public class TestNoDoubleAssign extends TaskTestBase {
-  private static final int THREAD_COUNT = 10;
+  private static final int THREAD_COUNT = 1;

Review comment:
       Is THREAD_COUNT still needed? If not, why not delete it?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1218: Fix ClusterManager thread leak

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1218:
URL: https://github.com/apache/helix/pull/1218#discussion_r467515722



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestNoDoubleAssign.java
##########
@@ -155,11 +155,15 @@ private void pollForDoubleAssign() {
    * Randomly causes Participants to lost connection temporarily.
    */
   private void breakConnection() {
-    _executorServiceConnection = Executors.newScheduledThreadPool(THREAD_COUNT);
+    // Note, THREAD_COUNT == 1 is a must to avoid leaking ClusterManager (participant).

Review comment:
       Please update the comment. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] mgao0 commented on a change in pull request #1218: Fix ClusterManager thread leak

Posted by GitBox <gi...@apache.org>.
mgao0 commented on a change in pull request #1218:
URL: https://github.com/apache/helix/pull/1218#discussion_r467238416



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestNoDoubleAssign.java
##########
@@ -155,11 +155,16 @@ private void pollForDoubleAssign() {
    * Randomly causes Participants to lost connection temporarily.
    */
   private void breakConnection() {
+    // Note, send to THREAD_COUNT == 1 is a must to avoid leaking ClusterManager (participant).

Review comment:
       Thanks for explaining @kaisun2000 . It makes sense.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1218: Fix ClusterManager thread leak

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1218:
URL: https://github.com/apache/helix/pull/1218#discussion_r467300538



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestNoDoubleAssign.java
##########
@@ -155,11 +155,16 @@ private void pollForDoubleAssign() {
    * Randomly causes Participants to lost connection temporarily.
    */
   private void breakConnection() {
+    // Note, send to THREAD_COUNT == 1 is a must to avoid leaking ClusterManager (participant).

Review comment:
       > I believe startParticipant will auto close it if one object is already running. We can also add synchronize there.
   
   Yes, it has. But over there still boils down to single thread execution. It does not increase parallism in this case. The above code seems to be clear in terms that we wan to stress that cleaning the resource by stopping it. Or we need to think about object life cycle.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] pkuwm commented on a change in pull request #1218: Fix ClusterManager thread leak

Posted by GitBox <gi...@apache.org>.
pkuwm commented on a change in pull request #1218:
URL: https://github.com/apache/helix/pull/1218#discussion_r465915063



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestNoDoubleAssign.java
##########
@@ -42,7 +42,7 @@
 import org.testng.annotations.Test;
 
 public class TestNoDoubleAssign extends TaskTestBase {
-  private static final int THREAD_COUNT = 10;
+  private static final int THREAD_COUNT = 1;

Review comment:
       Since thread count is 1, I think `newSingleThreadExecutor` is what you need?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 closed pull request #1218: Fix ClusterManager thread leak

Posted by GitBox <gi...@apache.org>.
kaisun2000 closed pull request #1218:
URL: https://github.com/apache/helix/pull/1218


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1218: Fix ClusterManager thread leak

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1218:
URL: https://github.com/apache/helix/pull/1218#discussion_r472434008



##########
File path: helix-core/src/test/resources/log4j.properties
##########
@@ -16,7 +16,11 @@
 # specific language governing permissions and limitations
 # under the License.
 #
+
+# Set root logger level to DEBUG and its only appender to R.

Review comment:
       Remove the unnecessary comments?

##########
File path: helix-core/src/test/java/org/apache/helix/integration/rebalancer/DelayedAutoRebalancer/TestDelayedAutoRebalance.java
##########
@@ -49,7 +49,7 @@
   protected static final int PARTITIONS = 5;
   // TODO: remove this wait time once we have a better way to determine if the rebalance has been
   // TODO: done as a reaction of the test operations.
-  protected static final int DEFAULT_REBALANCE_PROCESSING_WAIT_TIME = 1000;
+  protected static final int DEFAULT_REBALANCE_PROCESSING_WAIT_TIME = 3000;

Review comment:
       Are we sure this is causing the trouble?
   Based on what I tested before, 1 second is already a very long time. If with 1 second, the test is still not stable, then it might be other problems.
   
   Let me be clear,
   1. Changing this time just prolong the test, not a big deal.
   2. I'm more concerned that this won't really make the tests more stable.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestNoDoubleAssign.java
##########
@@ -158,9 +158,11 @@ private void breakConnection() {
     _executorServiceConnection = Executors.newScheduledThreadPool(THREAD_COUNT);
     _executorServiceConnection.scheduleAtFixedRate(() -> {
       // Randomly pick a Participant and cause a transient connection issue
-      int participantIndex = RANDOM.nextInt(_numNodes);
-      _participants[participantIndex].disconnect();
-      startParticipant(participantIndex);
+      synchronized (this){

Review comment:
       In this case, shall we lock all the start/stop callers?
   I think a better way is to lock inside the start and stop methods. Then add a reset method in addition which requests the same lock.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1218: Fix ClusterManager thread leak

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1218:
URL: https://github.com/apache/helix/pull/1218#discussion_r467290435



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestNoDoubleAssign.java
##########
@@ -155,11 +155,16 @@ private void pollForDoubleAssign() {
    * Randomly causes Participants to lost connection temporarily.
    */
   private void breakConnection() {
+    // Note, send to THREAD_COUNT == 1 is a must to avoid leaking ClusterManager (participant).

Review comment:
       something like this? I can give it a try. But it basically boils down to single thread case anyway. 
   
   ```
   _executorServiceConnection.scheduleAtFixedRate(() -> {
         // Randomly pick a Participant and cause a transient connection issue
        synchronized (_executorServiceConnection) {
         int participantIndex = RANDOM.nextInt(_numNodes);
         stopParticipant(participantIndex);
         startParticipant(participantIndex);
         }
     }, 0L, CONNECTION_DELAY, TimeUnit.MILLISECONDS);
   
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1218: Fix ClusterManager thread leak

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1218:
URL: https://github.com/apache/helix/pull/1218#discussion_r467282073



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestNoDoubleAssign.java
##########
@@ -155,11 +155,16 @@ private void pollForDoubleAssign() {
    * Randomly causes Participants to lost connection temporarily.
    */
   private void breakConnection() {
+    // Note, send to THREAD_COUNT == 1 is a must to avoid leaking ClusterManager (participant).

Review comment:
       @kaisun2000 if that's the only reason, then we can just add a concurrent control on the stop, start block?
   
   Moreover, I believe startParticipant will auto close it if one object is already running. We can also add synchronize there.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1218: Fix ClusterManager thread leak

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1218:
URL: https://github.com/apache/helix/pull/1218#discussion_r465922072



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestNoDoubleAssign.java
##########
@@ -155,11 +155,16 @@ private void pollForDoubleAssign() {
    * Randomly causes Participants to lost connection temporarily.
    */
   private void breakConnection() {
+    // Note, send to THREAD_COUNT == 1 is a must to avoid leaking ClusterManager (participant).

Review comment:
       This test relies on randomness. Also, the do not really wait for enough time to do breakConnection() etc. However, if we wait, there will be even a lot more leaking. Let us make an agreement that the design of the test is out of scope of this pull request. First stabilize the test. Will use singleThreadExecutor for now,




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1218: Fix ClusterManager thread leak

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1218:
URL: https://github.com/apache/helix/pull/1218#discussion_r465915122



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestNoDoubleAssign.java
##########
@@ -155,11 +155,16 @@ private void pollForDoubleAssign() {
    * Randomly causes Participants to lost connection temporarily.
    */
   private void breakConnection() {
+    // Note, send to THREAD_COUNT == 1 is a must to avoid leaking ClusterManager (participant).

Review comment:
       I suspect THREAD_COUNT > 1 is the test requirement here. But I'm not sure. @narendly, could you please comment here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] mgao0 commented on a change in pull request #1218: Fix ClusterManager thread leak

Posted by GitBox <gi...@apache.org>.
mgao0 commented on a change in pull request #1218:
URL: https://github.com/apache/helix/pull/1218#discussion_r467202562



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestNoDoubleAssign.java
##########
@@ -155,11 +155,16 @@ private void pollForDoubleAssign() {
    * Randomly causes Participants to lost connection temporarily.
    */
   private void breakConnection() {
+    // Note, send to THREAD_COUNT == 1 is a must to avoid leaking ClusterManager (participant).

Review comment:
       @kaisun2000 Can you help me understand why THREAD_COUNT == 1 is a must to avoid leaking ClusterManager (participant)? Does it cause leakage even we shut down the executor after test?
   In my understanding, the thread count there doesn't necessarily mean that this executor will use this number of threads at most. This number is for the core thread number, so more like a minimum number concept. If there's a  need, the executor will allocate more threads to handle the tasks even though THREAD_COUNT is specified as 1.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org