You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@helix.apache.org by GitBox <gi...@apache.org> on 2020/08/07 04:32:30 UTC

[GitHub] [helix] kaisun2000 opened a new pull request #1227: Implement thread leakage check for each test class #1226

kaisun2000 opened a new pull request #1227:
URL: https://github.com/apache/helix/pull/1227


   Implement thread leakage check for each test class #1226 #1226 
   
   - [x] My PR addresses the following Helix issues and references them in the PR description:
   
   resolve #1226 
   
   
   ### Description
   
   - [x] Here are some details about my PR, including screenshots of any UI changes:
   
       Recently unit test can not complete due to threads leaking
       and ZK session leaking. We identified the leaks and fixed
       them. In order for future test not to leak resources, we
       need a way to enforce test class would release/clean up resource created during test.
       
       A check is implemented in @afterclass in ZkTestBase.java.
   
   ### Tests
   
   - [ ] The following tests are written for this issue:
   
   
   - [ ] The following is the result of the "mvn test" command on the appropriate module:
   
   (Before CI test pass, please copy & paste the result of "mvn test")
   
   running
   
   ### Documentation (Optional)
   
   - In case of new functionality, my PR adds documentation in the following wiki page:
   
   (Link the GitHub wiki you added)
   
   ### Commits
   
   - My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Code Quality
   
   - My diff has been formatted using helix-style.xml 
   (helix-style-intellij.xml if IntelliJ IDE is used)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499966188



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskRebalancer.java
##########
@@ -64,7 +78,10 @@ public void testExpiry() throws Exception {
     JobConfig.Builder jobBuilder = JobConfig.Builder.fromMap(WorkflowGenerator.DEFAULT_JOB_CONFIG);
     jobBuilder.setJobCommandConfigMap(commandConfig);
 
+    WorkflowConfig.Builder wfgBuilder = new WorkflowConfig.Builder().setJobPurgeInterval(-1);

Review comment:
       changed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499966770



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskRebalancerParallel.java
##########
@@ -109,16 +117,22 @@ public void testWhenAllowOverlapJobAssignment() throws Exception {
     for (int i = 0; i < PARALLEL_COUNT; i++) {
       List<TaskConfig> taskConfigs = new ArrayList<TaskConfig>();
       for (int j = 0; j < TASK_COUNT; j++) {
-        taskConfigs.add(new TaskConfig.Builder().setTaskId("job_" + (i + 1) + "_task_" + j)
-            .setCommand(MockTask.TASK_COMMAND).build());
+        taskConfigs.add(new TaskConfig.Builder()
+            .setTaskId("job_" + (i + 1) + "_task_" + j)
+            .setCommand(MockTask.TASK_COMMAND)
+            .addConfig(MockTask.JOB_DELAY, "2000")
+            .build());
       }
       jobConfigBuilders.add(new JobConfig.Builder().addTaskConfigs(taskConfigs));
     }
 
     _driver.stop(queueName);
+    List<String> jobNames = new ArrayList<>();
     for (int i = 0; i < jobConfigBuilders.size(); ++i) {
-      _driver.enqueueJob(queueName, "job_" + (i + 1), jobConfigBuilders.get(i));
+      jobNames.add("job_" + (i + 1));
+      //_driver.enqueueJob(queueName, "job_" + (i + 1), jobConfigBuilders.get(i));

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on pull request #1227:
URL: https://github.com/apache/helix/pull/1227#issuecomment-678000265


   > @kaisun2000 I strongly recommend splitting the logic changes from the pure testing fixes.
   > I mainly reviewed the logic change part. The test changes are too long to review. Could you please re-org the PR?
   
   Let us sync about it offline. There are conflict of different requirements.
   
   1/  If our goal is to have all the test pass in github, we have to run it consistently till it converge to no failing test with all the changes in. Otherwise, how can we verify? 
   2/ The huge diff would be difficult to review. Yes, that is another concern.
   3/ We have to make test converge quickly as people are checking in new code and potentially introducing new unstableness. We have to act quick, or it will not converge. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504345644



##########
File path: helix-core/src/test/java/org/apache/helix/model/TestResourceConfig.java
##########
@@ -34,6 +34,11 @@
   private static final ObjectMapper _objectMapper = new ObjectMapper();
 
   @Test
+  public void testHead() {

Review comment:
       This has been a while. I don't recall specific details. 
   
   It seems to be that the test case may be spread over other test classes, which is hard to track, say if thread leak or something. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499945174



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestStuckTaskQuota.java
##########
@@ -63,6 +63,9 @@ public void beforeClass() throws Exception {
 
   @AfterClass
   public void afterClass() throws Exception {
+    String testClassName = this.getShortClassName();
+    System.out.println("AfterClass: " + testClassName + " of TestStuckTaskQuota called.");

Review comment:
       changed to `    System.out.println("AfterClass: " + TestHelper.getTestClassName() + ":" + TestHelper.getTestMethodName()  + "invoked.");
   `




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504344664



##########
File path: helix-core/src/test/java/org/apache/helix/monitoring/TestClusterStatusMonitorLifecycle.java
##########
@@ -291,7 +292,14 @@ public void testClusterStatusMonitorLifecycle() throws Exception {
         Query.not(Query.match(Query.attr("SensorName"), Query.value("MessageQueueStatus.*"))),
         exp1);
 
-    Assert.assertTrue(TestHelper.verify(() -> ManagementFactory.getPlatformMBeanServer()
-        .queryMBeans(new ObjectName("ClusterStatus:*"), exp2).isEmpty(), 3000));
+    // Note, the _asyncTasksThreadPool shutting down logic in GenericHelixController is best effort
+    // there is not guarantee that all threads in the pool is gone. MOstly they will, but not always.
+    // see https://github.com/apache/helix/issues/1280
+    boolean result = TestHelper.verify(() -> ManagementFactory.getPlatformMBeanServer()
+        .queryMBeans(new ObjectName("ClusterStatus:*"), exp2).isEmpty(), TestHelper.WAIT_DURATION);
+
+    if (!result) {
+      System.out.println("controllers shutting down failed to tear down all _asyncTasksThreadPools");
+    }

Review comment:
       How can we fix it here? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r489849171



##########
File path: helix-core/src/test/conf/testng.xml
##########
@@ -18,7 +18,7 @@
   ~ under the License.
   -->
 <!DOCTYPE suite SYSTEM "http://testng.org/testng-1.0.dtd">
-<suite name="Suite" time-out="300000">
+<suite name="Suite" time-out="900000">

Review comment:
       At least several. ZK in github sometimes is slow.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r489847954



##########
File path: helix-core/src/main/java/org/apache/helix/tools/ClusterVerifiers/ZkHelixClusterVerifier.java
##########
@@ -213,6 +222,13 @@ public boolean verifyByPolling() {
     return verifyByPolling(DEFAULT_TIMEOUT, DEFAULT_PERIOD);
   }
 
+  protected boolean _isLogMore = false;

Review comment:
       this is removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r508976290



##########
File path: helix-core/src/test/java/org/apache/helix/tools/TestClusterSetup.java
##########
@@ -61,56 +61,162 @@
   protected static final String STATE_MODEL = "MasterSlave";
   protected static final String TEST_NODE = "testnode_1";
 
-  private ClusterSetup _clusterSetup;
+  private ClusterSetup _clusterSetup = null;
 
   private static String[] createArgs(String str) {
     String[] split = str.split("[ ]+");
     System.out.println(Arrays.toString(split));
     return split;
   }
 
-  @BeforeClass()
+  @BeforeClass
   public void beforeClass() throws Exception {
     System.out
         .println("START TestClusterSetup.beforeClass() " + new Date(System.currentTimeMillis()));
+    _clusterSetup = new ClusterSetup(ZK_ADDR);
   }
 
-  @AfterClass()
+  @AfterClass
   public void afterClass() {
+    String testClassName = this.getShortClassName();
+    System.out.println("AfterClass: " + testClassName + " of TestClusterSetup called.");
+
     deleteCluster(CLUSTER_NAME);
+    _clusterSetup.close();
     System.out.println("END TestClusterSetup.afterClass() " + new Date(System.currentTimeMillis()));
   }
 
-  @BeforeMethod()
+  @BeforeMethod
   public void setup() {
+    // System.out.println("@BeforeMethod TestClusterSetup beforeMethod called. ");

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r508903908



##########
File path: helix-core/src/test/java/org/apache/helix/integration/spectator/TestRoutingTableProviderFromTargetEV.java
##########
@@ -113,13 +113,17 @@ public void afterClass() throws Exception {
     }
     deleteCluster(CLUSTER_NAME);
   }
+  @Test
+  public void testHead() {
+    Assert.assertTrue(true);
+  }

Review comment:
       Assign an order, not use testHead




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r494728983



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -623,6 +623,13 @@ void reset() {
       }
     }
 
+    // some thing like STATE_TRANSITION.Key specific pool are not shut down.
+    for (String msgType : _executorMap.keySet()) {
+      ExecutorService pool = _executorMap.remove(msgType);

Review comment:
       removed. Instead noted the potential improvement for now.

##########
File path: helix-core/src/main/java/org/apache/helix/task/TaskDriver.java
##########
@@ -969,17 +969,29 @@ public TaskState pollForJobState(String workflowName, String jobName, long timeo
     Set<TaskState> allowedStates = new HashSet<>(Arrays.asList(states));
     // Wait for state
     long st = System.currentTimeMillis();
+    long ct = 0;
     do {
       Thread.sleep(timeToSleep);
       ctx = getWorkflowContext(workflowName);
+      ct =  System.currentTimeMillis();
     } while ((ctx == null || ctx.getJobState(jobName) == null
         || !allowedStates.contains(ctx.getJobState(jobName)))
-        && System.currentTimeMillis() < st + timeout);
+        && ct < st + timeout);
 
     if (ctx == null || !allowedStates.contains(ctx.getJobState(jobName))) {

Review comment:
       revert back.

##########
File path: helix-core/src/main/java/org/apache/helix/task/TaskStateModelFactory.java
##########
@@ -97,7 +97,7 @@ public void shutdown() {
   }
 
   @VisibleForTesting
-  void shutdownNow() {
+  public void shutdownNow() {

Review comment:
       helix-core/src/test/java/org/apache/helix/integration/multizk/TestMultiZkHelixJavaApis.java
   actually leaks hundreds of threads.
   We need to call this method to avoid the thread leak
   ```
   public void afterClass() throws Exception {
       String testClassName = getClass().getSimpleName();
       System.out.println("AfterClass: " + testClassName + " of TestMultiZkHelixJavaApis called.");
   
       try {
         // Kill all mock controllers and participants
         MOCK_CONTROLLERS.values().forEach(ClusterControllerManager::syncStop);
         MOCK_PARTICIPANTS.forEach(mockParticipantManager -> {
           mockParticipantManager.syncStop();
           StateMachineEngine stateMachine = mockParticipantManager.getStateMachineEngine();
           if (stateMachine != null) {
             StateModelFactory stateModelFactory = stateMachine.getStateModelFactory("Task");
             if (stateModelFactory != null && stateModelFactory instanceof TaskStateModelFactory) {
               ((TaskStateModelFactory) stateModelFactory).shutdownNow();
             }
           }
         });
   
   ```

##########
File path: helix-core/src/main/java/org/apache/helix/tools/ClusterStateVerifier.java
##########
@@ -181,6 +181,12 @@ private BestPossAndExtViewZkVerifier(HelixZkClient zkClient, String clusterName,
       this.resources = resources;
     }
 
+    public void close() {

Review comment:
       There are still many places using this one.  Change to use new verifier seems to worth a separate diff.

##########
File path: helix-core/src/test/java/org/apache/helix/TestHelper.java
##########
@@ -794,7 +797,12 @@ public static boolean verify(Verifier verifier, long timeout) throws Exception {
     long start = System.currentTimeMillis();
     do {
       boolean result = verifier.verify();
-      if (result || (System.currentTimeMillis() - start) > timeout) {
+      long now = System.currentTimeMillis();
+      if (result || (now - start) > timeout) {

Review comment:
       changed.

##########
File path: helix-core/src/test/java/org/apache/helix/TestListenerCallback.java
##########
@@ -119,6 +119,8 @@ public void beforeClass() throws Exception {
 
   @AfterClass
   public void afterClass() throws Exception {
+    String testClassName = this.getShortClassName();
+    System.out.println("AfterClass: " + testClassName + " of TestListenerCallback called.");

Review comment:
       It does not really make test output much longer. The reason is that sometime, the error message is emitted from afterClass(), sometimes, from last testing method. It is hard to distinguish. 
   
   With this log, it would be clear for people to examine




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] alirezazamani commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
alirezazamani commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499848087



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestDeleteJobFromJobQueue.java
##########
@@ -69,11 +70,16 @@ public void testForceDeleteJobFromJobQueue() throws InterruptedException {
     Assert
         .assertNotNull(_driver.getJobContext(TaskUtil.getNamespacedJobName(jobQueueName, "job2")));
 
-    // The following force delete for the job should go through without getting an exception
-    _driver.deleteJob(jobQueueName, "job2", true);
+    try {

Review comment:
       Another way can be to stop the controller and do force delete? Is there any reason you chose to use try-catch? I think stopping the controller and doing force delete can be a better approach as it actually checks the dele operation.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestEnqueueJobs.java
##########
@@ -231,12 +272,14 @@ public void testQueueJobsMaxCapacity() throws InterruptedException {
 
     _driver.resume(queueName);
 
+    System.out.println("before the polls");

Review comment:
       Please check the whole PR and remove these prints is possible.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskRebalancerStopResume.java
##########
@@ -102,29 +123,41 @@ public void stopAndResumeNamedQueue() throws Exception {
     String queueName = TestHelper.getTestMethodName();
 
     // Create a queue
-    LOG.info("Starting job-queue: " + queueName);
-    JobQueue queue = new JobQueue.Builder(queueName).build();
+    LOG.info("Starting job-queue : " + queueName);
+    WorkflowConfig.Builder wfgBuilder = new WorkflowConfig.Builder().setJobPurgeInterval(-1);
+    WorkflowConfig wfg = wfgBuilder.build();
+    JobQueue queue = new JobQueue.Builder(queueName)
+        .setWorkflowConfig(wfg)
+        .build();
     _driver.createQueue(queue);
 
     // Enqueue jobs
+    List<String> jobNames = new ArrayList<>();
+    List<JobConfig.Builder> jobBuilders = new ArrayList<>();
     Set<String> master = Sets.newHashSet("MASTER");
     JobConfig.Builder job1 = new JobConfig.Builder().setCommand(MockTask.TASK_COMMAND)
         .setTargetResource(WorkflowGenerator.DEFAULT_TGT_DB).setTargetPartitionStates(master)
         .setJobCommandConfigMap(Collections.singletonMap(MockTask.JOB_DELAY, "200"));
     String job1Name = "masterJob";
     LOG.info("Enqueuing job: " + job1Name);
-    _driver.enqueueJob(queueName, job1Name, job1);
+    jobNames.add(job1Name);
+    jobBuilders.add(job1);
+    //_driver.enqueueJob(queueName, job1Name, job1);

Review comment:
       remove?

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestEnqueueJobs.java
##########
@@ -102,7 +136,9 @@ public void testJobSubmitGenericWorkflows() throws InterruptedException {
     JobConfig.Builder jobBuilder =
         new JobConfig.Builder().setTargetResource(WorkflowGenerator.DEFAULT_TGT_DB)
             .setCommand(MockTask.TASK_COMMAND).setMaxAttemptsPerTask(2);
-    Workflow.Builder builder = new Workflow.Builder(workflowName);
+    WorkflowConfig.Builder wfgBuilder = new WorkflowConfig.Builder().setJobPurgeInterval(-1);

Review comment:
       nit wfBuilder.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestEnqueueJobs.java
##########
@@ -65,14 +97,16 @@ public void testJobQueueAddingJobsOneByOne() throws InterruptedException {
         TaskState.COMPLETED);
   }
 
-  @Test
+  @Test()
   public void testJobQueueAddingJobsAtSametime() throws InterruptedException {
     String queueName = TestHelper.getTestMethodName();
     JobQueue.Builder builder = TaskTestUtil.buildJobQueue(queueName);
     WorkflowConfig.Builder workflowCfgBuilder =
-        new WorkflowConfig.Builder().setWorkflowId(queueName).setParallelJobs(1);
+        new WorkflowConfig.Builder().setWorkflowId(queueName).setParallelJobs(1)
+        .setJobPurgeInterval(-1);
     _driver.start(builder.setWorkflowConfig(workflowCfgBuilder.build()).build());
 
+    System.out.println("before add jobs");

Review comment:
       You can delete this line?

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestEnqueueJobs.java
##########
@@ -188,6 +227,7 @@ public void testQueueParallelJobs() throws InterruptedException {
           TaskState.COMPLETED);
     }
 
+    System.out.println("after the poll");

Review comment:
       same.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestEnqueueJobs.java
##########
@@ -165,6 +202,7 @@ public void testQueueParallelJobs() throws InterruptedException {
           TaskState.COMPLETED);
     }
 
+    System.out.println("Before stop controller");

Review comment:
       I don't think these prints are necessary.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestEnqueueJobs.java
##########
@@ -41,13 +47,39 @@
   public void beforeClass() throws Exception {
     setSingleTestEnvironment();
     super.beforeClass();
+    //WorkflowConfig.disableJobPurge();
+  }
+
+  @AfterClass
+  @Override
+  public void afterClass() throws Exception {
+    //WorkflowConfig.enableJobPurge();
+    super.afterClass();
+  }
+
+  @BeforeMethod
+  public void beforeTest(Method testMethod, ITestContext testContext) {
+    long startTime = System.currentTimeMillis();
+    System.out.println("START " + testMethod.getName() + " at " + new Date(startTime));
+    testContext.setAttribute("StartTime", System.currentTimeMillis());
+  }
+
+  @AfterMethod
+  public void endTest(Method testMethod, ITestContext testContext) {
+    Long startTime = (Long) testContext.getAttribute("StartTime");
+    long endTime = System.currentTimeMillis();
+    System.out.println("END " + testMethod.getName() + " at " + new Date(endTime) + ", took: "
+        + (endTime - startTime) + "ms.");
   }
 
   @Test
   public void testJobQueueAddingJobsOneByOne() throws InterruptedException {
     String queueName = TestHelper.getTestMethodName();
     JobQueue.Builder builder = TaskTestUtil.buildJobQueue(queueName);
-    WorkflowConfig.Builder workflowCfgBuilder = new WorkflowConfig.Builder().setWorkflowId(queueName).setParallelJobs(1);
+    WorkflowConfig.Builder workflowCfgBuilder = new WorkflowConfig.Builder()
+        .setWorkflowId(queueName)
+        .setParallelJobs(1)
+        .setJobPurgeInterval(-1);

Review comment:
       As we discussed offline, if you are sure that this method does not cause overwriting the old configs, then this is fine with me.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestEnqueueJobs.java
##########
@@ -177,6 +215,7 @@ public void testQueueParallelJobs() throws InterruptedException {
     }
     _driver.enqueueJobs(queueName, jobNames, jobBuilders);
 
+    System.out.println("before start controller");

Review comment:
       same here.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestEnqueueJobs.java
##########
@@ -92,6 +125,7 @@ public void testJobQueueAddingJobsAtSametime() throws InterruptedException {
 
     _driver.resume(queueName);
 
+    System.out.println("before poll to job state");

Review comment:
       delete?

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestEnqueueJobs.java
##########
@@ -247,6 +290,7 @@ public void testQueueJobsMaxCapacity() throws InterruptedException {
     }
     Assert.assertTrue(exceptionHappenedWhileAddingNewJob);
 
+    System.out.println("after the exception");

Review comment:
       same.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestEnqueueJobs.java
##########
@@ -231,12 +272,14 @@ public void testQueueJobsMaxCapacity() throws InterruptedException {
 
     _driver.resume(queueName);
 
+    System.out.println("before the polls");
     // Wait until all of the enqueued jobs (Job0 to Job3) are finished
     for (int i = 0; i < numberOfJobsAddedInitially; i++) {
       _driver.pollForJobState(queueName, TaskUtil.getNamespacedJobName(queueName, "JOB" + i),
           TaskState.COMPLETED);
     }
 
+    System.out.println("before enqueue for exception");

Review comment:
       print?

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestStuckTaskQuota.java
##########
@@ -63,6 +63,9 @@ public void beforeClass() throws Exception {
 
   @AfterClass
   public void afterClass() throws Exception {
+    String testClassName = this.getShortClassName();
+    System.out.println("AfterClass: " + testClassName + " of TestStuckTaskQuota called.");

Review comment:
       Same here. You can use methods to get the names here.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestQuotaBasedScheduling.java
##########
@@ -543,11 +561,17 @@ public void testJobQueueScheduling() throws InterruptedException {
     jobConfigBulider = new JobConfig.Builder().setCommand(JOB_COMMAND).addTaskConfigs(taskConfigs)
         .setJobCommandConfigMap(_jobCommandMap).setJobType("B");
 
+    jobNames.clear();
+    jobBuilders.clear();
     for (int i = 5; i < 10; i++) {
       String jobName = "JOB_" + i;
       lastJobName = jobName;
-      _driver.enqueueJob(queueName, jobName, jobConfigBulider);
+      jobNames.add(jobName);
+      jobBuilders.add(jobConfigBulider);
+      //_driver.enqueueJob(queueName, jobName, jobConfigBulider);

Review comment:
       remove this line?

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestMaxNumberOfAttemptsMasterSwitch.java
##########
@@ -71,6 +71,9 @@ public void beforeClass() throws Exception {
 
   @AfterClass
   public void afterClass() throws Exception {
+    String testClassName = this.getShortClassName();
+    System.out.println("AfterClass: " + testClassName + " of TestMaxNumberOfAttempsMasterSwitch called.");

Review comment:
       Can you use TestHelper.getTestMethodName() and TestHelper.getTestClassName?

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskRebalancer.java
##########
@@ -217,22 +246,35 @@ public void testNamedQueue() throws Exception {
     String queueName = TestHelper.getTestMethodName();
 
     // Create a queue
-    JobQueue queue = new JobQueue.Builder(queueName).build();
+    WorkflowConfig.Builder wfgBuilder = new WorkflowConfig.Builder().setJobPurgeInterval(-1);
+    WorkflowConfig wfg = wfgBuilder.build();
+    JobQueue queue = new JobQueue.Builder(queueName)
+        .setWorkflowConfig(wfg)
+        .build();
     _driver.createQueue(queue);
 
     // Enqueue jobs
+    List<String> jobNames = new ArrayList<>();
+    List<JobConfig.Builder> jobBuilders = new ArrayList<>();
     Set<String> master = Sets.newHashSet("MASTER");
     Set<String> slave = Sets.newHashSet("SLAVE");
     JobConfig.Builder job1 = new JobConfig.Builder().setCommand(MockTask.TASK_COMMAND)
         .setTargetResource(WorkflowGenerator.DEFAULT_TGT_DB).setTargetPartitionStates(master);
     JobConfig.Builder job2 = new JobConfig.Builder().setCommand(MockTask.TASK_COMMAND)
         .setTargetResource(WorkflowGenerator.DEFAULT_TGT_DB).setTargetPartitionStates(slave);
-    _driver.enqueueJob(queueName, "masterJob", job1);
-    _driver.enqueueJob(queueName, "slaveJob", job2);
+    jobNames.add("masterJob");
+    jobNames.add("slaveJob");
+    jobBuilders.add(job1);
+    jobBuilders.add(job2);
+    //_driver.enqueueJob(queueName, "masterJob", job1);

Review comment:
       remove the comments if possible.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestQuotaBasedScheduling.java
##########
@@ -526,11 +538,17 @@ public void testJobQueueScheduling() throws InterruptedException {
     JobConfig.Builder jobConfigBulider = new JobConfig.Builder().setCommand(JOB_COMMAND)
         .addTaskConfigs(taskConfigs).setJobCommandConfigMap(_jobCommandMap).setJobType("A");
 
+    List<String> jobNames = new ArrayList<>();
+    List<JobConfig.Builder> jobBuilders = new ArrayList<>();
+
     for (int i = 0; i < 5; i++) {
       String jobName = "JOB_" + i;
       lastJobName = jobName;
-      _driver.enqueueJob(queueName, jobName, jobConfigBulider);
+      jobNames.add(jobName);
+      jobBuilders.add(jobConfigBulider);
+      // _driver.enqueueJob(queueName, jobName, jobConfigBulider);

Review comment:
       Remove this line?

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskRebalancer.java
##########
@@ -162,8 +183,12 @@ public void partitionSet() throws Exception {
   @Test
   public void testRepeatedWorkflow() throws Exception {
     String workflowName = "SomeWorkflow";
+    WorkflowConfig.Builder wfgBuilder = new WorkflowConfig.Builder().setJobPurgeInterval(-1);
+    WorkflowConfig wfg = wfgBuilder.build();

Review comment:
       nit, wgBuilder

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestJobFailureDependence.java
##########
@@ -161,15 +161,19 @@ public void testWorkflowFailureJobThreshold() throws Exception {
     _driver.updateWorkflow(queueName, configBuilder.build());
     _driver.stop(queueName);
 
+    List<String> jobNames = new ArrayList<>();
+    List<JobConfig.Builder> jobBuilders = new ArrayList<>();
     for (int i = 0; i < _numDbs; i++) {
       JobConfig.Builder jobConfig =
           new JobConfig.Builder().setCommand(MockTask.TASK_COMMAND).setTargetResource(_testDbs.get(i))
               .setTargetPartitionStates(Sets.newHashSet("SLAVE")).setIgnoreDependentJobFailure(true);
       String jobName = "job" + _testDbs.get(i);
       queueBuilder.enqueueJob(jobName, jobConfig);
-      _driver.enqueueJob(queueName, jobName, jobConfig);
+      jobNames.add(jobName);
+      jobBuilders.add(jobConfig);
+      //_driver.enqueueJob(queueName, jobName, jobConfig);

Review comment:
       You can delete  this commented line.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskRebalancer.java
##########
@@ -217,22 +246,35 @@ public void testNamedQueue() throws Exception {
     String queueName = TestHelper.getTestMethodName();
 
     // Create a queue
-    JobQueue queue = new JobQueue.Builder(queueName).build();
+    WorkflowConfig.Builder wfgBuilder = new WorkflowConfig.Builder().setJobPurgeInterval(-1);
+    WorkflowConfig wfg = wfgBuilder.build();
+    JobQueue queue = new JobQueue.Builder(queueName)
+        .setWorkflowConfig(wfg)
+        .build();
     _driver.createQueue(queue);
 
     // Enqueue jobs
+    List<String> jobNames = new ArrayList<>();
+    List<JobConfig.Builder> jobBuilders = new ArrayList<>();
     Set<String> master = Sets.newHashSet("MASTER");
     Set<String> slave = Sets.newHashSet("SLAVE");
     JobConfig.Builder job1 = new JobConfig.Builder().setCommand(MockTask.TASK_COMMAND)
         .setTargetResource(WorkflowGenerator.DEFAULT_TGT_DB).setTargetPartitionStates(master);
     JobConfig.Builder job2 = new JobConfig.Builder().setCommand(MockTask.TASK_COMMAND)
         .setTargetResource(WorkflowGenerator.DEFAULT_TGT_DB).setTargetPartitionStates(slave);
-    _driver.enqueueJob(queueName, "masterJob", job1);
-    _driver.enqueueJob(queueName, "slaveJob", job2);
+    jobNames.add("masterJob");
+    jobNames.add("slaveJob");
+    jobBuilders.add(job1);
+    jobBuilders.add(job2);
+    //_driver.enqueueJob(queueName, "masterJob", job1);
+    //_driver.enqueueJob(queueName, "slaveJob", job2);
+    _driver.enqueueJobs(queueName, jobNames, jobBuilders);
 
     // Ensure successful completion
     String namespacedJob1 = queueName + "_masterJob";
     String namespacedJob2 = queueName + "_slaveJob";
+
+    System.out.println("testNamedQueue before pollForJobState");

Review comment:
       remove prints if possible.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskRebalancer.java
##########
@@ -64,7 +78,10 @@ public void testExpiry() throws Exception {
     JobConfig.Builder jobBuilder = JobConfig.Builder.fromMap(WorkflowGenerator.DEFAULT_JOB_CONFIG);
     jobBuilder.setJobCommandConfigMap(commandConfig);
 
+    WorkflowConfig.Builder wfgBuilder = new WorkflowConfig.Builder().setJobPurgeInterval(-1);

Review comment:
       wfBuilder?

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskRebalancerParallel.java
##########
@@ -109,16 +117,22 @@ public void testWhenAllowOverlapJobAssignment() throws Exception {
     for (int i = 0; i < PARALLEL_COUNT; i++) {
       List<TaskConfig> taskConfigs = new ArrayList<TaskConfig>();
       for (int j = 0; j < TASK_COUNT; j++) {
-        taskConfigs.add(new TaskConfig.Builder().setTaskId("job_" + (i + 1) + "_task_" + j)
-            .setCommand(MockTask.TASK_COMMAND).build());
+        taskConfigs.add(new TaskConfig.Builder()
+            .setTaskId("job_" + (i + 1) + "_task_" + j)
+            .setCommand(MockTask.TASK_COMMAND)
+            .addConfig(MockTask.JOB_DELAY, "2000")
+            .build());
       }
       jobConfigBuilders.add(new JobConfig.Builder().addTaskConfigs(taskConfigs));
     }
 
     _driver.stop(queueName);
+    List<String> jobNames = new ArrayList<>();
     for (int i = 0; i < jobConfigBuilders.size(); ++i) {
-      _driver.enqueueJob(queueName, "job_" + (i + 1), jobConfigBuilders.get(i));
+      jobNames.add("job_" + (i + 1));
+      //_driver.enqueueJob(queueName, "job_" + (i + 1), jobConfigBuilders.get(i));

Review comment:
       remove?

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskSchedulingTwoCurrentStates.java
##########
@@ -93,6 +93,8 @@ public void beforeClass() throws Exception {
 
   @AfterClass()
   public void afterClass() throws Exception {
+    String testClassName = this.getShortClassName();
+    System.out.println("AfterClass: " + testClassName + " of TestTaskSchedulingTwoCurrentStates called.");

Review comment:
       Same here. You can use TestHelper methods.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskThrottling.java
##########
@@ -87,15 +97,17 @@ public void testTaskThrottle() throws Exception {
     _gSetupTool.getClusterManagementTool().setConfig(scope, properties);
 
     String jobName2 = "Job2";
-    Workflow flow2 = WorkflowGenerator.generateSingleJobWorkflowBuilder(jobName2, jobConfig).build();
+    Workflow flow2 = WorkflowGenerator.generateSingleJobWorkflowBuilder(jobName2, jobConfig)
+        .setWorkflowConfig(wfg)
+        .build();
     _driver.start(flow2);
     _driver.pollForJobState(flow2.getName(), TaskUtil.getNamespacedJobName(flow2.getName(), jobName2),
         TaskState.IN_PROGRESS);
 
     // Expect 10 tasks
     Assert.assertTrue(TestHelper.verify(
         () -> (countRunningPartition(flow2, jobName2) == (_numNodes * perNodeTaskLimitation)),
-        TestHelper.WAIT_DURATION));
+        2 * TestHelper.WAIT_DURATION));

Review comment:
       Now even when you change it to 60 seconds, is this 2 * necessary?

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskRebalancerStopResume.java
##########
@@ -102,29 +123,41 @@ public void stopAndResumeNamedQueue() throws Exception {
     String queueName = TestHelper.getTestMethodName();
 
     // Create a queue
-    LOG.info("Starting job-queue: " + queueName);
-    JobQueue queue = new JobQueue.Builder(queueName).build();
+    LOG.info("Starting job-queue : " + queueName);
+    WorkflowConfig.Builder wfgBuilder = new WorkflowConfig.Builder().setJobPurgeInterval(-1);
+    WorkflowConfig wfg = wfgBuilder.build();
+    JobQueue queue = new JobQueue.Builder(queueName)
+        .setWorkflowConfig(wfg)
+        .build();
     _driver.createQueue(queue);
 
     // Enqueue jobs
+    List<String> jobNames = new ArrayList<>();
+    List<JobConfig.Builder> jobBuilders = new ArrayList<>();
     Set<String> master = Sets.newHashSet("MASTER");
     JobConfig.Builder job1 = new JobConfig.Builder().setCommand(MockTask.TASK_COMMAND)
         .setTargetResource(WorkflowGenerator.DEFAULT_TGT_DB).setTargetPartitionStates(master)
         .setJobCommandConfigMap(Collections.singletonMap(MockTask.JOB_DELAY, "200"));
     String job1Name = "masterJob";
     LOG.info("Enqueuing job: " + job1Name);
-    _driver.enqueueJob(queueName, job1Name, job1);
+    jobNames.add(job1Name);
+    jobBuilders.add(job1);
+    //_driver.enqueueJob(queueName, job1Name, job1);
 
     Set<String> slave = Sets.newHashSet("SLAVE");
     JobConfig.Builder job2 = new JobConfig.Builder().setCommand(MockTask.TASK_COMMAND)
         .setTargetResource(WorkflowGenerator.DEFAULT_TGT_DB).setTargetPartitionStates(slave);
     String job2Name = "slaveJob";
     LOG.info("Enqueuing job: " + job2Name);
-    _driver.enqueueJob(queueName, job2Name, job2);
+    jobNames.add(job2Name);
+    jobBuilders.add(job2);
+    _driver.enqueueJobs(queueName, jobNames,jobBuilders);
+    //_driver.enqueueJob(queueName, job2Name, job2);

Review comment:
       remove?

##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskRebalancerStopResume.java
##########
@@ -82,7 +99,11 @@ public void stopAndResume() throws Exception {
   @Test
   public void stopAndResumeWorkflow() throws Exception {
     String workflow = "SomeWorkflow";
-    Workflow flow = WorkflowGenerator.generateDefaultRepeatedJobWorkflowBuilder(workflow).build();
+    WorkflowConfig.Builder wfgBuilder = new WorkflowConfig.Builder().setJobPurgeInterval(-1);
+    WorkflowConfig wfg = wfgBuilder.build();

Review comment:
       Same here.

##########
File path: helix-core/src/test/java/org/apache/helix/task/TaskSynchronizedTestBase.java
##########
@@ -76,11 +77,15 @@ public void beforeClass() throws Exception {
     startParticipants();
     createManagers();
     _clusterVerifier =
-        new BestPossibleExternalViewVerifier.Builder(CLUSTER_NAME).setZkClient(_gZkClient).build();
+        new BestPossibleExternalViewVerifier.Builder(CLUSTER_NAME).setZkClient(_gZkClient)
+            .setWaitTillVerify(TestHelper.DEFAULT_REBALANCE_PROCESSING_WAIT_TIME)
+            .build();
   }
 
   @AfterClass
   public void afterClass() throws Exception {
+    String testClassName = this.getShortClassName();
+    System.out.println("AfterClass: " + testClassName + " of TaskSynchronizedTestBase called.");

Review comment:
       TestHelper methods/




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r501451629



##########
File path: helix-core/src/test/java/org/apache/helix/controller/changedetector/TestResourceChangeDetector.java
##########
@@ -422,16 +425,23 @@ public void testIgnoreNonTopologyChanges() {
     }
   }
 
-  @Test(dependsOnMethods = "testIgnoreNonTopologyChanges")
+  static final int WAIT_TIME = 30 * 60 * 1000;

Review comment:
       fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1227: Implement thread leakage check for each test class

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r466827098



##########
File path: helix-core/src/test/java/org/apache/helix/common/ZkTestBase.java
##########
@@ -719,6 +720,12 @@ public void cleanupLiveInstanceOwners() {
       clientMap.clear();
     }
     _liveInstanceOwners.clear();
+
+    boolean status = TestHelper.afterClassCheck(name);
+    // Assert here does not work.
+    if (!status) {
+      System.out.println("---------- Test Class " + name + " thread leakage detected! ---------------");

Review comment:
       Shall we just fail the test? And why it is only applied to ZkTestBase? There are other test classes which are not based on this one.

##########
File path: helix-core/src/test/java/org/apache/helix/TestHelper.java
##########
@@ -847,4 +850,161 @@ public static void printZkListeners(HelixZkClient client) throws Exception {
     }
     System.out.println("}");
   }
+
+  // Thread dump related utility function
+  private static ThreadGroup getRootThreadGroup() {
+    ThreadGroup candidate = Thread.currentThread().getThreadGroup();
+    while (candidate.getParent() != null) {
+      candidate = candidate.getParent();
+    }
+    return candidate;
+  }
+
+  public static List<Thread> getAllThreads() {
+    ThreadGroup rootThreadGroup = getRootThreadGroup();
+    Thread[] threads = new Thread[32];
+    int count = rootThreadGroup.enumerate(threads);
+    while (count == threads.length) {
+      threads = new Thread[threads.length * 2];
+      count = rootThreadGroup.enumerate(threads);
+    }
+    return Arrays.asList(Arrays.copyOf(threads, count));
+  }
+
+  static public enum ThreadCategory {
+    ZkServer("zookeeper server threads", 4, 100,
+        new String[]{"SessionTracker", "NIOServerCxn", "SyncThread:", "ProcessThread"}),
+    ZkSession("zkclient/zooKeeper session threads", 12, 12,
+        new String[] {"ZkClient-EventThread", "ZkClient-AsyncCallback", "-EventThread", "-SendThread"}),
+    ForkJoin( "fork join pool threads", 2, 10, new String[]{"ForkJoinPool"}),
+    Timer( "timer threads", 0, 2, new String[]{"time"}),
+    TaskStateModel("TaskStateModel threads", 0, 0, new String[]{"TaskStateModel"}),
+    Other("Other threads", 0, 3, new String[]{""});
+
+    private String _description;
+    private List<String> _pattern;
+    private int _warningLimit;
+    private int _limit;
+
+    public String getDescription() {
+      return _description;
+    }
+
+    public Predicate<String> getMatchPred() {
+      if (this.name() != ThreadCategory.Other.name()) {
+        Predicate<String> pred = target -> {
+          for (String p : _pattern) {
+            if (target.toLowerCase().contains(p.toLowerCase())) {
+              return true;
+            }
+          }
+          return false;
+        };
+        return pred;
+      }
+
+      List<Predicate<String>> predicateList = new ArrayList<>();
+      for (ThreadCategory threadCategory : ThreadCategory.values()) {
+        if (threadCategory == ThreadCategory.Other) {
+          continue;
+        }
+        predicateList.add(threadCategory.getMatchPred());
+      }
+      Predicate<String> pred = target -> {
+        for (Predicate<String> p : predicateList) {
+          if (p.test(target)) {
+            return false;
+          }
+        }
+        return true;
+      };
+
+      return pred;
+    }
+
+    public int getWarningLimit() {
+      return _warningLimit;
+    }
+
+    public int getLimit() {
+      return _limit;
+    }
+
+    private ThreadCategory(String description, int warningLimit,  int limit, String[] patterns ) {
+      _description = description;
+      _pattern = Arrays.asList(patterns);
+      _warningLimit = warningLimit;
+      _limit = limit;
+    }
+  }
+
+  public static boolean afterClassCheck(String classname) {

Review comment:
       This method seems to be a required check instead of Test Helper.
   It would be cleaner if you can create a new class for it.

##########
File path: helix-core/src/test/java/org/apache/helix/TestHelper.java
##########
@@ -847,4 +850,161 @@ public static void printZkListeners(HelixZkClient client) throws Exception {
     }
     System.out.println("}");
   }
+
+  // Thread dump related utility function
+  private static ThreadGroup getRootThreadGroup() {
+    ThreadGroup candidate = Thread.currentThread().getThreadGroup();
+    while (candidate.getParent() != null) {
+      candidate = candidate.getParent();
+    }
+    return candidate;
+  }
+
+  public static List<Thread> getAllThreads() {
+    ThreadGroup rootThreadGroup = getRootThreadGroup();
+    Thread[] threads = new Thread[32];
+    int count = rootThreadGroup.enumerate(threads);
+    while (count == threads.length) {
+      threads = new Thread[threads.length * 2];
+      count = rootThreadGroup.enumerate(threads);
+    }
+    return Arrays.asList(Arrays.copyOf(threads, count));
+  }
+
+  static public enum ThreadCategory {
+    ZkServer("zookeeper server threads", 4, 100,

Review comment:
       The hard coded patterns seem hard to maintain. If any thread names are changed, then their leakage won't be detected, right? Can we make the pattern string defined in some constants and refer to them in here and also in the place where thread names are specified?

##########
File path: helix-core/src/test/java/org/apache/helix/common/ZkTestBase.java
##########
@@ -711,6 +711,7 @@ protected Message createMessage(Message.MessageType type, String msgId, String f
 
   @AfterClass

Review comment:
       nit, I believe you can define multiple AfterClass methods so we don't need to mix things up.

##########
File path: helix-core/src/test/java/org/apache/helix/TestHelper.java
##########
@@ -847,4 +850,161 @@ public static void printZkListeners(HelixZkClient client) throws Exception {
     }
     System.out.println("}");
   }
+
+  // Thread dump related utility function
+  private static ThreadGroup getRootThreadGroup() {
+    ThreadGroup candidate = Thread.currentThread().getThreadGroup();
+    while (candidate.getParent() != null) {
+      candidate = candidate.getParent();
+    }
+    return candidate;
+  }
+
+  public static List<Thread> getAllThreads() {
+    ThreadGroup rootThreadGroup = getRootThreadGroup();
+    Thread[] threads = new Thread[32];
+    int count = rootThreadGroup.enumerate(threads);
+    while (count == threads.length) {
+      threads = new Thread[threads.length * 2];
+      count = rootThreadGroup.enumerate(threads);
+    }
+    return Arrays.asList(Arrays.copyOf(threads, count));
+  }
+
+  static public enum ThreadCategory {
+    ZkServer("zookeeper server threads", 4, 100,
+        new String[]{"SessionTracker", "NIOServerCxn", "SyncThread:", "ProcessThread"}),
+    ZkSession("zkclient/zooKeeper session threads", 12, 12,
+        new String[] {"ZkClient-EventThread", "ZkClient-AsyncCallback", "-EventThread", "-SendThread"}),
+    ForkJoin( "fork join pool threads", 2, 10, new String[]{"ForkJoinPool"}),
+    Timer( "timer threads", 0, 2, new String[]{"time"}),
+    TaskStateModel("TaskStateModel threads", 0, 0, new String[]{"TaskStateModel"}),
+    Other("Other threads", 0, 3, new String[]{""});
+
+    private String _description;
+    private List<String> _pattern;
+    private int _warningLimit;
+    private int _limit;
+
+    public String getDescription() {
+      return _description;
+    }
+
+    public Predicate<String> getMatchPred() {
+      if (this.name() != ThreadCategory.Other.name()) {
+        Predicate<String> pred = target -> {
+          for (String p : _pattern) {
+            if (target.toLowerCase().contains(p.toLowerCase())) {
+              return true;
+            }
+          }
+          return false;
+        };
+        return pred;
+      }
+
+      List<Predicate<String>> predicateList = new ArrayList<>();
+      for (ThreadCategory threadCategory : ThreadCategory.values()) {
+        if (threadCategory == ThreadCategory.Other) {
+          continue;
+        }
+        predicateList.add(threadCategory.getMatchPred());
+      }
+      Predicate<String> pred = target -> {
+        for (Predicate<String> p : predicateList) {
+          if (p.test(target)) {
+            return false;
+          }
+        }
+        return true;
+      };
+
+      return pred;
+    }
+
+    public int getWarningLimit() {
+      return _warningLimit;
+    }
+
+    public int getLimit() {
+      return _limit;
+    }
+
+    private ThreadCategory(String description, int warningLimit,  int limit, String[] patterns ) {
+      _description = description;
+      _pattern = Arrays.asList(patterns);
+      _warningLimit = warningLimit;
+      _limit = limit;
+    }
+  }
+
+  public static boolean afterClassCheck(String classname) {
+    // step 1: get all active threads
+    List<Thread> threads = TestHelper.getAllThreads();
+    System.out.println(classname + " has active threads cnt:" + threads.size());
+
+
+    // step 2: caegorize threads

Review comment:
       typo?

##########
File path: helix-core/src/test/java/org/apache/helix/TestHelper.java
##########
@@ -847,4 +850,161 @@ public static void printZkListeners(HelixZkClient client) throws Exception {
     }
     System.out.println("}");
   }
+
+  // Thread dump related utility function
+  private static ThreadGroup getRootThreadGroup() {
+    ThreadGroup candidate = Thread.currentThread().getThreadGroup();
+    while (candidate.getParent() != null) {
+      candidate = candidate.getParent();
+    }
+    return candidate;
+  }
+
+  public static List<Thread> getAllThreads() {

Review comment:
       private?

##########
File path: helix-core/src/test/java/org/apache/helix/TestHelper.java
##########
@@ -847,4 +850,161 @@ public static void printZkListeners(HelixZkClient client) throws Exception {
     }
     System.out.println("}");
   }
+
+  // Thread dump related utility function
+  private static ThreadGroup getRootThreadGroup() {
+    ThreadGroup candidate = Thread.currentThread().getThreadGroup();
+    while (candidate.getParent() != null) {
+      candidate = candidate.getParent();
+    }
+    return candidate;
+  }
+
+  public static List<Thread> getAllThreads() {
+    ThreadGroup rootThreadGroup = getRootThreadGroup();
+    Thread[] threads = new Thread[32];
+    int count = rootThreadGroup.enumerate(threads);
+    while (count == threads.length) {
+      threads = new Thread[threads.length * 2];
+      count = rootThreadGroup.enumerate(threads);
+    }
+    return Arrays.asList(Arrays.copyOf(threads, count));
+  }
+
+  static public enum ThreadCategory {

Review comment:
       private?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r498960202



##########
File path: helix-core/src/main/java/org/apache/helix/task/TaskUtil.java
##########
@@ -952,6 +953,7 @@ private static boolean removeWorkflowJobContext(HelixPropertyStore<ZNRecord> pro
         return false;
       }
     }
+    LOG.info("removed job config {}.", path);

Review comment:
       In a separate diff.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r475810917



##########
File path: helix-core/src/main/java/org/apache/helix/tools/ClusterVerifiers/ZkHelixClusterVerifier.java
##########
@@ -213,6 +222,13 @@ public boolean verifyByPolling() {
     return verifyByPolling(DEFAULT_TIMEOUT, DEFAULT_PERIOD);
   }
 
+  protected boolean _isLogMore = false;

Review comment:
       Do you have to use this, can we leverage LOG levels here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check for each test class

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r467310984



##########
File path: helix-core/src/test/java/org/apache/helix/TestHelper.java
##########
@@ -847,4 +850,161 @@ public static void printZkListeners(HelixZkClient client) throws Exception {
     }
     System.out.println("}");
   }
+
+  // Thread dump related utility function
+  private static ThreadGroup getRootThreadGroup() {
+    ThreadGroup candidate = Thread.currentThread().getThreadGroup();
+    while (candidate.getParent() != null) {
+      candidate = candidate.getParent();
+    }
+    return candidate;
+  }
+
+  public static List<Thread> getAllThreads() {
+    ThreadGroup rootThreadGroup = getRootThreadGroup();
+    Thread[] threads = new Thread[32];
+    int count = rootThreadGroup.enumerate(threads);
+    while (count == threads.length) {
+      threads = new Thread[threads.length * 2];
+      count = rootThreadGroup.enumerate(threads);
+    }
+    return Arrays.asList(Arrays.copyOf(threads, count));
+  }
+
+  static public enum ThreadCategory {
+    ZkServer("zookeeper server threads", 4, 100,

Review comment:
       done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r501410348



##########
File path: helix-core/src/test/java/org/apache/helix/integration/TestPartitionMovementThrottle.java
##########
@@ -278,8 +278,8 @@ public void testThrottleOnlyClusterLevelAnyType() {
     ClusterLiveNodesVerifier liveNodesVerifier =
         new ClusterLiveNodesVerifier(_gZkClient, CLUSTER_NAME,
             Lists.transform(Arrays.asList(_participants), MockParticipantManager::getInstanceName));
-    Assert.assertTrue(liveNodesVerifier.verifyByZkCallback(1000));
-    Assert.assertTrue(_clusterVerifier.verifyByPolling());
+    Assert.assertTrue(liveNodesVerifier.verifyByZkCallback(5000));
+    // Assert.assertTrue(_clusterVerifier.verifyByPolling());

Review comment:
       remove.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r498508544



##########
File path: helix-core/src/test/conf/testng.xml
##########
@@ -18,7 +18,7 @@
   ~ under the License.
   -->
 <!DOCTYPE suite SYSTEM "http://testng.org/testng-1.0.dtd">
-<suite name="Suite" time-out="300000">
+<suite name="Suite" time-out="900000">

Review comment:
       TestEnableCompression for example.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504350026



##########
File path: helix-core/src/test/java/org/apache/helix/integration/messaging/TestP2PNoDuplicatedMessage.java
##########
@@ -174,7 +174,10 @@ public void testP2PStateTransitionEnabled() {
       verifyP2PEnabled(startTime);
     }
 
-    Assert.assertEquals(p2pTrigged, total);
+    // Discussed with Meng who originally created this one. The success rate really depends on how

Review comment:
       changed to     
   >// The success rate really depends on how quick participant act in relationship with controller.
      // For now, we set 90% threshold.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r508914742



##########
File path: helix-core/src/test/java/org/apache/helix/monitoring/TestClusterStatusMonitorLifecycle.java
##########
@@ -291,7 +292,14 @@ public void testClusterStatusMonitorLifecycle() throws Exception {
         Query.not(Query.match(Query.attr("SensorName"), Query.value("MessageQueueStatus.*"))),
         exp1);
 
-    Assert.assertTrue(TestHelper.verify(() -> ManagementFactory.getPlatformMBeanServer()
-        .queryMBeans(new ObjectName("ClusterStatus:*"), exp2).isEmpty(), 3000));
+    // Note, the _asyncTasksThreadPool shutting down logic in GenericHelixController is best effort
+    // there is not guarantee that all threads in the pool is gone. MOstly they will, but not always.
+    // see https://github.com/apache/helix/issues/1280
+    boolean result = TestHelper.verify(() -> ManagementFactory.getPlatformMBeanServer()
+        .queryMBeans(new ObjectName("ClusterStatus:*"), exp2).isEmpty(), TestHelper.WAIT_DURATION);
+
+    if (!result) {
+      System.out.println("controllers shutting down failed to tear down all _asyncTasksThreadPools");
+    }

Review comment:
       Assert the result, but in the error message we alert user this can happen due to ....




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504356152



##########
File path: helix-core/src/test/resources/log4j.properties
##########
@@ -37,4 +44,12 @@ log4j.logger.org.apache=ERROR
 log4j.logger.com.noelios=ERROR
 log4j.logger.org.restlet=ERROR
 
+
+#log4j.logger.org.apache.helix.controller.dataproviders.WorkflowControllerDataProvider=INFO

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r501409817



##########
File path: helix-core/src/test/java/org/apache/helix/controller/changedetector/TestResourceChangeDetector.java
##########
@@ -422,16 +425,23 @@ public void testIgnoreNonTopologyChanges() {
     }
   }
 
-  @Test(dependsOnMethods = "testIgnoreNonTopologyChanges")
+  static final int WAIT_TIME = 30 * 60 * 1000;

Review comment:
       No need to wait so long as the root cause is known.

##########
File path: helix-core/src/test/java/org/apache/helix/controller/changedetector/TestResourceChangeDetector.java
##########
@@ -422,16 +425,23 @@ public void testIgnoreNonTopologyChanges() {
     }
   }
 
-  @Test(dependsOnMethods = "testIgnoreNonTopologyChanges")
+  static final int WAIT_TIME = 30 * 60 * 1000;
+  @Test(dependsOnMethods = "testIgnoreNonTopologyChanges", timeOut = 30 * 60 * 1000)

Review comment:
       remove timeout.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/TestPartitionMovementThrottle.java
##########
@@ -278,8 +278,8 @@ public void testThrottleOnlyClusterLevelAnyType() {
     ClusterLiveNodesVerifier liveNodesVerifier =
         new ClusterLiveNodesVerifier(_gZkClient, CLUSTER_NAME,
             Lists.transform(Arrays.asList(_participants), MockParticipantManager::getInstanceName));
-    Assert.assertTrue(liveNodesVerifier.verifyByZkCallback(1000));
-    Assert.assertTrue(_clusterVerifier.verifyByPolling());
+    Assert.assertTrue(liveNodesVerifier.verifyByZkCallback(5000));
+    // Assert.assertTrue(_clusterVerifier.verifyByPolling());

Review comment:
       remove.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/stages/TaskGarbageCollectionStage.java
##########
@@ -78,6 +78,7 @@ public void process(ClusterEvent event) throws Exception {
         }
         long currentTime = System.currentTimeMillis();
         long nextPurgeTime = workflowContext.getLastJobPurgeTime() + purgeInterval;
+

Review comment:
       remove.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/TestZkCallbackHandlerLeak.java
##########
@@ -415,7 +415,7 @@ public void testDanglingCallbackHanlderFix() throws Exception {
     System.out.println("END " + clusterName + " at " + new Date(System.currentTimeMillis()));
   }
 
-  @Test
+  @Test(timeOut = 1800 * 1000)

Review comment:
       remove.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r501409817



##########
File path: helix-core/src/test/java/org/apache/helix/controller/changedetector/TestResourceChangeDetector.java
##########
@@ -422,16 +425,23 @@ public void testIgnoreNonTopologyChanges() {
     }
   }
 
-  @Test(dependsOnMethods = "testIgnoreNonTopologyChanges")
+  static final int WAIT_TIME = 30 * 60 * 1000;

Review comment:
       No need to wait so long as the root cause is known.

##########
File path: helix-core/src/test/java/org/apache/helix/controller/changedetector/TestResourceChangeDetector.java
##########
@@ -422,16 +425,23 @@ public void testIgnoreNonTopologyChanges() {
     }
   }
 
-  @Test(dependsOnMethods = "testIgnoreNonTopologyChanges")
+  static final int WAIT_TIME = 30 * 60 * 1000;
+  @Test(dependsOnMethods = "testIgnoreNonTopologyChanges", timeOut = 30 * 60 * 1000)

Review comment:
       remove timeout.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/TestPartitionMovementThrottle.java
##########
@@ -278,8 +278,8 @@ public void testThrottleOnlyClusterLevelAnyType() {
     ClusterLiveNodesVerifier liveNodesVerifier =
         new ClusterLiveNodesVerifier(_gZkClient, CLUSTER_NAME,
             Lists.transform(Arrays.asList(_participants), MockParticipantManager::getInstanceName));
-    Assert.assertTrue(liveNodesVerifier.verifyByZkCallback(1000));
-    Assert.assertTrue(_clusterVerifier.verifyByPolling());
+    Assert.assertTrue(liveNodesVerifier.verifyByZkCallback(5000));
+    // Assert.assertTrue(_clusterVerifier.verifyByPolling());

Review comment:
       remove.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/stages/TaskGarbageCollectionStage.java
##########
@@ -78,6 +78,7 @@ public void process(ClusterEvent event) throws Exception {
         }
         long currentTime = System.currentTimeMillis();
         long nextPurgeTime = workflowContext.getLastJobPurgeTime() + purgeInterval;
+

Review comment:
       remove.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/TestZkCallbackHandlerLeak.java
##########
@@ -415,7 +415,7 @@ public void testDanglingCallbackHanlderFix() throws Exception {
     System.out.println("END " + clusterName + " at " + new Date(System.currentTimeMillis()));
   }
 
-  @Test
+  @Test(timeOut = 1800 * 1000)

Review comment:
       remove.

##########
File path: helix-core/src/main/java/org/apache/helix/controller/stages/TaskGarbageCollectionStage.java
##########
@@ -78,6 +78,7 @@ public void process(ClusterEvent event) throws Exception {
         }
         long currentTime = System.currentTimeMillis();
         long nextPurgeTime = workflowContext.getLastJobPurgeTime() + purgeInterval;
+

Review comment:
       fixed.

##########
File path: helix-core/src/test/java/org/apache/helix/TestHelper.java
##########
@@ -37,6 +37,8 @@
 import java.util.concurrent.ConcurrentHashMap;
 import java.util.concurrent.CountDownLatch;
 import java.util.concurrent.TimeUnit;
+import java.util.function.Predicate;
+import java.util.stream.Collectors;

Review comment:
       not needed, remove

##########
File path: helix-core/src/test/java/org/apache/helix/TestZKCallback.java
##########
@@ -166,6 +166,7 @@ public void testInvocation() throws Exception {
 
       ExternalView extView = new ExternalView("db-12345");
       accessor.setProperty(keyBuilder.externalView("db-12345"), extView);
+

Review comment:
       removed.

##########
File path: helix-core/src/test/java/org/apache/helix/controller/changedetector/TestResourceChangeDetector.java
##########
@@ -422,16 +425,23 @@ public void testIgnoreNonTopologyChanges() {
     }
   }
 
-  @Test(dependsOnMethods = "testIgnoreNonTopologyChanges")
+  static final int WAIT_TIME = 30 * 60 * 1000;

Review comment:
       fixed.

##########
File path: helix-core/src/test/java/org/apache/helix/controller/changedetector/TestResourceChangeDetector.java
##########
@@ -422,16 +425,23 @@ public void testIgnoreNonTopologyChanges() {
     }
   }
 
-  @Test(dependsOnMethods = "testIgnoreNonTopologyChanges")
+  static final int WAIT_TIME = 30 * 60 * 1000;
+  @Test(dependsOnMethods = "testIgnoreNonTopologyChanges", timeOut = 30 * 60 * 1000)

Review comment:
       fixed.

##########
File path: helix-core/src/test/java/org/apache/helix/controller/stages/TestRebalancePipeline.java
##########
@@ -157,7 +157,7 @@ public void testDuplicateMsg() throws Exception {
 
     Thread.sleep(2 * MessageGenerationPhase.DEFAULT_OBSELETE_MSG_PURGE_DELAY);
     runPipeline(event, dataRefresh, false);
-    
+

Review comment:
       remove.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/TestPartitionMovementThrottle.java
##########
@@ -278,8 +278,8 @@ public void testThrottleOnlyClusterLevelAnyType() {
     ClusterLiveNodesVerifier liveNodesVerifier =
         new ClusterLiveNodesVerifier(_gZkClient, CLUSTER_NAME,
             Lists.transform(Arrays.asList(_participants), MockParticipantManager::getInstanceName));
-    Assert.assertTrue(liveNodesVerifier.verifyByZkCallback(1000));
-    Assert.assertTrue(_clusterVerifier.verifyByPolling());
+    Assert.assertTrue(liveNodesVerifier.verifyByZkCallback(5000));
+    // Assert.assertTrue(_clusterVerifier.verifyByPolling());

Review comment:
       remove.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/TestRebalancerPersistAssignments.java
##########
@@ -29,6 +29,7 @@
 import org.apache.helix.TestHelper;
 import org.apache.helix.controller.rebalancer.strategy.CrushEdRebalanceStrategy;
 import org.apache.helix.controller.rebalancer.strategy.RebalanceStrategy;
+import org.apache.helix.TestHelper;

Review comment:
       remove

##########
File path: helix-core/src/test/java/org/apache/helix/integration/TestZkCallbackHandlerLeak.java
##########
@@ -415,7 +415,7 @@ public void testDanglingCallbackHanlderFix() throws Exception {
     System.out.println("END " + clusterName + " at " + new Date(System.currentTimeMillis()));
   }
 
-  @Test
+  @Test(timeOut = 1800 * 1000)

Review comment:
       fixed

##########
File path: helix-core/src/test/java/org/apache/helix/integration/rebalancer/DelayedAutoRebalancer/TestDelayedAutoRebalance.java
##########
@@ -105,16 +105,20 @@ public void beforeClass() throws Exception {
    */
   @Test
   public void testDelayedPartitionMovement() throws Exception {
+    System.out.println("START TestDelayedAutoRebalance::testDelayedPartitionMovement");
     Map<String, ExternalView> externalViewsBefore = createTestDBs(1000000);
     validateDelayedMovements(externalViewsBefore);
+    System.out.println("START TestDelayedAutoRebalance::testDelayedPartitionMovement");

Review comment:
       Changed to END




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r475814065



##########
File path: helix-core/src/main/java/org/apache/helix/tools/ClusterVerifiers/BestPossibleExternalViewVerifier.java
##########
@@ -236,6 +239,9 @@ protected synchronized boolean verifyState() {
       idealStates.entrySet()
           .removeIf(pair -> pair.getValue().getStateModelDefRef().equals(TaskConstants.STATE_MODEL_NAME));
 
+      if (_isLogMore) {
+        System.out.println("---- before verify instances !");
+      }

Review comment:
       This is not production code, they are only in verifier which is test utility I believe. The usage is to nail down how long does it take to wait so that we can tune the timeout of test. In github, some operations are fast (say starting Zookeeper)   while some are slow (like refresh cache).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504353410



##########
File path: helix-core/src/test/java/org/apache/helix/integration/common/ZkStandAloneCMTestBase.java
##########
@@ -94,6 +95,8 @@ public void beforeClass() throws Exception {
 
   @AfterClass
   public void afterClass() throws Exception {
+    String testClassName = this.getShortClassName();

Review comment:
       Let me remove it as @afterclass of zktest base do print out something like"
   
   ```
   AfterClass: TestAutoRebalance called
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r489850802



##########
File path: helix-core/src/test/java/org/apache/helix/ThreadLeakageChecker.java
##########
@@ -0,0 +1,178 @@
+package org.apache.helix;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.function.Predicate;
+import java.util.stream.Collectors;
+
+
+public class ThreadLeakageChecker {
+  private static ThreadGroup getRootThreadGroup() {
+    ThreadGroup candidate = Thread.currentThread().getThreadGroup();
+    while (candidate.getParent() != null) {
+      candidate = candidate.getParent();
+    }
+    return candidate;
+  }
+
+  private static List<Thread> getAllThreads() {
+    ThreadGroup rootThreadGroup = getRootThreadGroup();
+    Thread[] threads = new Thread[32];
+    int count = rootThreadGroup.enumerate(threads);
+    while (count == threads.length) {
+      threads = new Thread[threads.length * 2];
+      count = rootThreadGroup.enumerate(threads);
+    }
+    return Arrays.asList(Arrays.copyOf(threads, count));
+  }
+
+  private static final String[] ZkServerThrdPattern =
+      {"SessionTracker", "NIOServerCxn", "SyncThread:", "ProcessThread"};
+  private static final String[] ZkSessionThrdPattern =
+      new String[]{"ZkClient-EventThread", "ZkClient-AsyncCallback", "-EventThread", "-SendThread"};
+  private static final String[] ForkJoinThrdPattern = new String[]{"ForkJoinPool"};
+  private static final String[] TimerThrdPattern = new String[]{"time"};
+  private static final String[] TaskStateModelThrdPattern = new String[]{"TaskStateModel"};
+
+  private static enum ThreadCategory {
+    ZkServer("zookeeper server threads", 4, 100, ZkServerThrdPattern),
+    ZkSession("zkclient/zooKeeper session threads", 12, 12, ZkSessionThrdPattern),
+    ForkJoin("fork join pool threads", 2, 10, ForkJoinThrdPattern),
+    Timer("timer threads", 0, 2, TimerThrdPattern),
+    TaskStateModel("TaskStateModel threads", 0, 0, TaskStateModelThrdPattern),
+    Other("Other threads", 0, 3, new String[]{""});
+
+    private String _description;
+    private List<String> _pattern;
+    private int _warningLimit;
+    private int _limit;
+
+    public String getDescription() {
+      return _description;
+    }
+
+    public Predicate<String> getMatchPred() {
+      if (this.name() != ThreadCategory.Other.name()) {
+        Predicate<String> pred = target -> {
+          for (String p : _pattern) {
+            if (target.toLowerCase().contains(p.toLowerCase())) {
+              return true;
+            }
+          }
+          return false;
+        };
+        return pred;
+      }
+
+      List<Predicate<String>> predicateList = new ArrayList<>();
+      for (ThreadCategory threadCategory : ThreadCategory.values()) {
+        if (threadCategory == ThreadCategory.Other) {
+          continue;
+        }
+        predicateList.add(threadCategory.getMatchPred());
+      }
+      Predicate<String> pred = target -> {
+        for (Predicate<String> p : predicateList) {
+          if (p.test(target)) {
+            return false;
+          }
+        }
+        return true;
+      };
+
+      return pred;
+    }
+
+    public int getWarningLimit() {
+      return _warningLimit;
+    }
+
+    public int getLimit() {
+      return _limit;
+    }
+
+    private ThreadCategory(String description, int warningLimit, int limit, String[] patterns) {
+      _description = description;
+      _pattern = Arrays.asList(patterns);
+      _warningLimit = warningLimit;
+      _limit = limit;
+    }
+  }
+
+  public static boolean afterClassCheck(String classname) {
+    // step 1: get all active threads
+    List<Thread> threads = getAllThreads();
+    System.out.println(classname + " has active threads cnt:" + threads.size());
+
+    // step 2: categorize threads
+    Map<String, List<Thread>> threadByName = null;
+    Map<ThreadCategory, Integer> threadByCnt = new HashMap<>();
+    Map<ThreadCategory, Set<Thread>> threadByCat = new HashMap<>();
+    try {
+      threadByName = threads.
+          stream().
+          filter(p -> p.getThreadGroup() != null && p.getThreadGroup().getName() != null
+              && p.getThreadGroup().getName() != "system").

Review comment:
       good point. fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499920954



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestEnqueueJobs.java
##########
@@ -65,14 +97,16 @@ public void testJobQueueAddingJobsOneByOne() throws InterruptedException {
         TaskState.COMPLETED);
   }
 
-  @Test
+  @Test()
   public void testJobQueueAddingJobsAtSametime() throws InterruptedException {
     String queueName = TestHelper.getTestMethodName();
     JobQueue.Builder builder = TaskTestUtil.buildJobQueue(queueName);
     WorkflowConfig.Builder workflowCfgBuilder =
-        new WorkflowConfig.Builder().setWorkflowId(queueName).setParallelJobs(1);
+        new WorkflowConfig.Builder().setWorkflowId(queueName).setParallelJobs(1)
+        .setJobPurgeInterval(-1);
     _driver.start(builder.setWorkflowConfig(workflowCfgBuilder.build()).build());
 
+    System.out.println("before add jobs");

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r475810452



##########
File path: helix-core/src/test/java/org/apache/helix/common/ZkTestBase.java
##########
@@ -719,6 +722,20 @@ public void cleanupLiveInstanceOwners() {
       clientMap.clear();
     }
     _liveInstanceOwners.clear();
+
+    // wait 2 seconds for threads to be teared down.
+    // this reduce false alert.
+    Thread.sleep(1000);

Review comment:
       This does not apply here. Tearing down thread is async to thread shutdown by the Java virtual machine. There is not way for use to check when the Java virtual machine really tear it down.
   
   that said, I found by practice, waiting 2 sec does not really help much either. I will remove it a little bit later.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r498953551



##########
File path: helix-core/src/main/java/org/apache/helix/task/TaskDriver.java
##########
@@ -941,9 +941,14 @@ public TaskState pollForJobState(String workflowName, String jobName, long timeo
         && System.currentTimeMillis() < st + timeout);
 
     if (ctx == null || !allowedStates.contains(ctx.getJobState(jobName))) {
+      WorkflowConfig wfcfg = getWorkflowConfig(workflowName);

Review comment:
       Revert it. Could be a separate pull




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r508912535



##########
File path: helix-core/src/test/java/org/apache/helix/model/TestResourceConfig.java
##########
@@ -34,6 +34,11 @@
   private static final ObjectMapper _objectMapper = new ObjectMapper();
 
   @Test
+  public void testHead() {

Review comment:
       assign order




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r494733531



##########
File path: helix-core/src/test/java/org/apache/helix/TestListenerCallback.java
##########
@@ -119,6 +119,8 @@ public void beforeClass() throws Exception {
 
   @AfterClass
   public void afterClass() throws Exception {
+    String testClassName = this.getShortClassName();
+    System.out.println("AfterClass: " + testClassName + " of TestListenerCallback called.");

Review comment:
       It does not really make test output much longer. The reason is that sometime, the error message is emitted from afterClass(), sometimes, from last testing method. It is hard to distinguish. 
   
   With this log, it would be clear for people to examine




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r501453398



##########
File path: helix-core/src/test/java/org/apache/helix/integration/TestZkCallbackHandlerLeak.java
##########
@@ -415,7 +415,7 @@ public void testDanglingCallbackHanlderFix() throws Exception {
     System.out.println("END " + clusterName + " at " + new Date(System.currentTimeMillis()));
   }
 
-  @Test
+  @Test(timeOut = 1800 * 1000)

Review comment:
       fixed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r502716727



##########
File path: helix-core/src/test/java/org/apache/helix/TestListenerCallback.java
##########
@@ -119,6 +119,8 @@ public void beforeClass() throws Exception {
 
   @AfterClass
   public void afterClass() throws Exception {
+    String testClassName = this.getShortClassName();

Review comment:
       changed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r489848420



##########
File path: helix-core/src/main/java/org/apache/helix/tools/ClusterVerifiers/ZkHelixClusterVerifier.java
##########
@@ -48,6 +50,10 @@
   private static Logger LOG = LoggerFactory.getLogger(ZkHelixClusterVerifier.class);
   protected static int DEFAULT_TIMEOUT = 300 * 1000;
   protected static int DEFAULT_PERIOD = 500;
+  // COOL_DOWN before starting vefiyByPool
+  // The goal is to make sure waiting for controller pipeline starts at least one cycle
+  // to update ideal state.
+  protected static int DEFAULT_COOLDOWN = 2 * 1000;

Review comment:
       Changed the way to what we discussed before.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504335428



##########
File path: helix-core/src/test/java/org/apache/helix/tools/TestHelixAdminCli.java
##########
@@ -568,7 +568,9 @@ public void testDeactivateCluster() throws Exception {
     String command =
         "-zkSvr " + ZK_ADDR + " -activateCluster " + clusterName + " " + grandClusterName + " true";
     ClusterSetup.processCommandLineArgs(command.split("\\s+"));
-    Thread.sleep(500);
+
+    // wait long enough enough so that grand_cluster converge
+    Thread.sleep(20000);

Review comment:
       good point. added it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504342326



##########
File path: helix-core/src/test/java/org/apache/helix/tools/TestClusterStateVerifier.java
##########
@@ -114,19 +118,30 @@ public void testResourceSubset() throws InterruptedException {
     Thread.sleep(1000);
     _admin.enableCluster(_clusterName, false);
     _admin.enableInstance(_clusterName, "localhost_12918", true);
-    boolean result =
-        ClusterStateVerifier.verifyByZkCallback(new BestPossAndExtViewZkVerifier(ZK_ADDR,
-            _clusterName, null, Sets.newHashSet(RESOURCES[1])));
-    Assert.assertTrue(result);
+    BestPossAndExtViewZkVerifier verifier = new BestPossAndExtViewZkVerifier(ZK_ADDR,
+        _clusterName, null, Sets.newHashSet(RESOURCES[1]));
+    boolean result = false;
+    try {
+      result = ClusterStateVerifier.verifyByZkCallback(verifier);
+      Assert.assertTrue(result);
+    } finally {
+      verifier.close();
+    }
+
     String[] args = {
         "--zkSvr", ZK_ADDR, "--cluster", _clusterName, "--resources", RESOURCES[1]
     };
     result = ClusterStateVerifier.verifyState(args);
     Assert.assertTrue(result);
 
     // But the full cluster verification should fail
-    boolean fullResult = new BestPossAndExtViewZkVerifier(ZK_ADDR, _clusterName).verify();
-    Assert.assertFalse(fullResult);
+    BestPossAndExtViewZkVerifier verifier1 = new BestPossAndExtViewZkVerifier(ZK_ADDR, _clusterName);

Review comment:
       changed to this:
   ```
       verifier = new BestPossAndExtViewZkVerifier(ZK_ADDR, _clusterName);
       try {
         boolean fullResult = verifier.verify();
         Assert.assertFalse(fullResult);
       } finally {
         verifier.close();
       }
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499946596



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestDeleteJobFromJobQueue.java
##########
@@ -69,11 +70,16 @@ public void testForceDeleteJobFromJobQueue() throws InterruptedException {
     Assert
         .assertNotNull(_driver.getJobContext(TaskUtil.getNamespacedJobName(jobQueueName, "job2")));
 
-    // The following force delete for the job should go through without getting an exception
-    _driver.deleteJob(jobQueueName, "job2", true);
+    try {

Review comment:
       Changed to pause pipeline with following code:
   
   ```
       // force deletion can have Exception thrown as controller is writing to propertystore path too.
       // https://github.com/apache/helix/issues/1406
       
       // Thus, we stop pipeline to make sure there is not such race condition.
       _gSetupTool.getClusterManagementTool().enableCluster(CLUSTER_NAME, false);
       
       _driver.deleteJob(jobQueueName, "job2", true);
       Thread.sleep(3000); // note this sleep is critical as it would take time for controller to stop
   
       // Check that the job has been force-deleted (fully gone from ZK)
       Assert.assertNull(_driver.getJobConfig(TaskUtil.getNamespacedJobName(jobQueueName, "job2")));
       Assert.assertNull(_driver.getJobContext(TaskUtil.getNamespacedJobName(jobQueueName, "job2")));
   
   
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r489854463



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -177,7 +177,7 @@ public HelixTaskExecutor(ParticipantStatusMonitor participantStatusMonitor,
     _lock = new Object();
     _statusUpdateUtil = new StatusUpdateUtil();
 
-    _timer = new Timer("HelixTaskExecutor_timer", true); // created as a daemon timer thread to handle task timeout
+    _timer = new Timer("HelixTaskExecutor_Timer", true); // created as a daemon timer thread to handle task timeout

Review comment:
       Capital "T" we can do a little better matching when finding leaked thread by name.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504350703



##########
File path: helix-core/src/test/java/org/apache/helix/integration/messaging/TestBatchMessage.java
##########
@@ -187,15 +194,18 @@ public void testChangeBatchMessageMode() throws Exception {
       participants[i] = new MockParticipantManager(ZK_ADDR, clusterName, instanceName);
       participants[i].syncStart();
     }
-
-    result =
-        ClusterStateVerifier.verifyByZkCallback(new BestPossAndExtViewZkVerifier(ZK_ADDR,
-            clusterName));
-    Assert.assertTrue(result);
-    // Change to three is because there is an extra factory registered
-    // So one extra NO_OP message send
-    Assert.assertTrue(listener._maxNumberOfChildren <= 3,
-        "Should get no more than 2 messages (O->S and S->M)");
+    BestPossAndExtViewZkVerifier verifier1 = new BestPossAndExtViewZkVerifier(ZK_ADDR, clusterName);

Review comment:
       removed 1




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r489852639



##########
File path: helix-core/src/test/java/org/apache/helix/common/ZkTestBase.java
##########
@@ -719,6 +722,20 @@ public void cleanupLiveInstanceOwners() {
       clientMap.clear();
     }
     _liveInstanceOwners.clear();
+
+    // wait 2 seconds for threads to be teared down.
+    // this reduce false alert.
+    Thread.sleep(1000);
+    boolean status = false;
+    try {
+       status = ThreadLeakageChecker.afterClassCheck(testClassName);
+    } catch (Exception e) {
+      System.out.println("ThreadLeakageChecker exception:" + e.getStackTrace());
+    }
+    // Assert here does not work.
+    if (!status) {
+      System.out.println("---------- Test Class " + testClassName + " thread leakage detected! ---------------");

Review comment:
       This is going to keep the test continue to execute. Killing thread is a little bit dangerous as out detection logic may not 100% right.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r475810574



##########
File path: helix-core/src/main/java/org/apache/helix/tools/ClusterVerifiers/ZkHelixClusterVerifier.java
##########
@@ -48,6 +50,10 @@
   private static Logger LOG = LoggerFactory.getLogger(ZkHelixClusterVerifier.class);
   protected static int DEFAULT_TIMEOUT = 300 * 1000;
   protected static int DEFAULT_PERIOD = 500;
+  // COOL_DOWN before starting vefiyByPool
+  // The goal is to make sure waiting for controller pipeline starts at least one cycle
+  // to update ideal state.
+  protected static int DEFAULT_COOLDOWN = 2 * 1000;

Review comment:
       Set a hardcoded time here is not a good idea, these tools have been used by many of customers, including Espresso, how do we know 2 second is good enough?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r474349029



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -623,6 +623,13 @@ void reset() {
       }
     }
 
+    // some thing like STATE_TRANSITION.Key specific pool are not shut down.
+    for (String msgType : _executorMap.keySet()) {
+      ExecutorService pool = _executorMap.remove(msgType);

Review comment:
       If the pool is created in user logic and passed in to construct the executor, then executor shall not shutdown it here. 

##########
File path: helix-core/src/main/java/org/apache/helix/task/TaskDriver.java
##########
@@ -969,17 +969,29 @@ public TaskState pollForJobState(String workflowName, String jobName, long timeo
     Set<TaskState> allowedStates = new HashSet<>(Arrays.asList(states));
     // Wait for state
     long st = System.currentTimeMillis();
+    long ct = 0;
     do {
       Thread.sleep(timeToSleep);
       ctx = getWorkflowContext(workflowName);
+      ct =  System.currentTimeMillis();
     } while ((ctx == null || ctx.getJobState(jobName) == null
         || !allowedStates.contains(ctx.getJobState(jobName)))
-        && System.currentTimeMillis() < st + timeout);
+        && ct < st + timeout);
 
     if (ctx == null || !allowedStates.contains(ctx.getJobState(jobName))) {

Review comment:
       I guess you can simplify this if-else. It is not quite readable. Please avoid the redundant check and add some comments here.

##########
File path: helix-core/src/test/java/org/apache/helix/TestHelper.java
##########
@@ -794,7 +797,12 @@ public static boolean verify(Verifier verifier, long timeout) throws Exception {
     long start = System.currentTimeMillis();
     do {
       boolean result = verifier.verify();
-      if (result || (System.currentTimeMillis() - start) > timeout) {
+      long now = System.currentTimeMillis();
+      if (result || (now - start) > timeout) {

Review comment:
       This code looks chaotic. Please simplify.
   
   boolean result = verifier.verify();
   boolean timeout = (System.currentTimeMillis() - start) > timeout;
   if (result || timeout) {
     if (timeout) {
       LOG.error()
     }
     return result;
   }

##########
File path: helix-core/src/main/java/org/apache/helix/tools/ClusterVerifiers/BestPossibleExternalViewVerifier.java
##########
@@ -224,6 +224,9 @@ public boolean verifyByZkCallback(long timeout) {
   @Override
   protected synchronized boolean verifyState() {
     try {
+      if (_isLogMore) {

Review comment:
       I guess this helps your debug, but are you suggesting we check this into the master branch? In general, I don't like optional code in master.

##########
File path: helix-core/src/main/java/org/apache/helix/tools/ClusterStateVerifier.java
##########
@@ -181,6 +181,12 @@ private BestPossAndExtViewZkVerifier(HelixZkClient zkClient, String clusterName,
       this.resources = resources;
     }
 
+    public void close() {

Review comment:
       Is this class deprecated? Instead of modifying it, can we just change the callers to use the newer verifier?

##########
File path: helix-core/src/main/java/org/apache/helix/task/TaskDriver.java
##########
@@ -72,7 +72,7 @@
   private static final Logger LOG = LoggerFactory.getLogger(TaskDriver.class);
 
   /** Default time out for monitoring workflow or job state */
-  private final static int DEFAULT_TIMEOUT = 5 * 60 * 1000; /* 5 mins */
+  private final static int DEFAULT_TIMEOUT = 15 * 60 * 1000; /* 15 mins, same as default in TestNg.xml */

Review comment:
       I think this impacts our features. ANd this number is not randomly set. Please don't change the default value. If it fails the test, I think we can just use the method in which you can specify the timeout explicitly.
   
   And, I do want to understand why it will help the test. Could you please clarify?

##########
File path: helix-core/src/test/conf/testng.xml
##########
@@ -18,7 +18,7 @@
   ~ under the License.
   -->
 <!DOCTYPE suite SYSTEM "http://testng.org/testng-1.0.dtd">
-<suite name="Suite" time-out="300000">
+<suite name="Suite" time-out="900000">

Review comment:
       Does any test definitely need more than 5 minutes?

##########
File path: helix-core/src/main/java/org/apache/helix/tools/ClusterVerifiers/ZkHelixClusterVerifier.java
##########
@@ -48,6 +50,10 @@
   private static Logger LOG = LoggerFactory.getLogger(ZkHelixClusterVerifier.class);
   protected static int DEFAULT_TIMEOUT = 300 * 1000;
   protected static int DEFAULT_PERIOD = 500;
+  // COOL_DOWN before starting vefiyByPool
+  // The goal is to make sure waiting for controller pipeline starts at least one cycle
+  // to update ideal state.
+  protected static int DEFAULT_COOLDOWN = 2 * 1000;

Review comment:
       I think this is another workaround to avoid issue https://github.com/apache/helix/issues/526. If this is the case, then it would be an overkill to add to the general verifier logic. Because not all verifying logic depends on the pipeline running.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499936199



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestEnqueueJobs.java
##########
@@ -231,12 +272,14 @@ public void testQueueJobsMaxCapacity() throws InterruptedException {
 
     _driver.resume(queueName);
 
+    System.out.println("before the polls");

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499936352



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestEnqueueJobs.java
##########
@@ -247,6 +290,7 @@ public void testQueueJobsMaxCapacity() throws InterruptedException {
     }
     Assert.assertTrue(exceptionHappenedWhileAddingNewJob);
 
+    System.out.println("after the exception");

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r498965043



##########
File path: helix-core/src/main/java/org/apache/helix/task/WorkflowConfig.java
##########
@@ -94,6 +95,16 @@ public WorkflowConfig(WorkflowConfig cfg, String workflowId) {
         cfg.isAllowOverlapJobAssignment(), cfg.getTimeout());
   }
 
+  @VisibleForTesting

Review comment:
       Let us try use reflection to change it. And later remove the change once fix is in.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r502718303



##########
File path: helix-core/src/test/java/org/apache/helix/TestListenerCallbackBatchMode.java
##########
@@ -119,6 +119,9 @@ public void beforeClass()
   @AfterClass
   public void afterClass()
       throws Exception {
+    String testClassName = this.getShortClassName();

Review comment:
       Let me move this log to ZkTestBase




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499920881



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestEnqueueJobs.java
##########
@@ -41,13 +47,39 @@
   public void beforeClass() throws Exception {
     setSingleTestEnvironment();
     super.beforeClass();
+    //WorkflowConfig.disableJobPurge();
+  }
+
+  @AfterClass
+  @Override
+  public void afterClass() throws Exception {
+    //WorkflowConfig.enableJobPurge();
+    super.afterClass();
+  }
+
+  @BeforeMethod
+  public void beforeTest(Method testMethod, ITestContext testContext) {
+    long startTime = System.currentTimeMillis();
+    System.out.println("START " + testMethod.getName() + " at " + new Date(startTime));
+    testContext.setAttribute("StartTime", System.currentTimeMillis());
+  }
+
+  @AfterMethod
+  public void endTest(Method testMethod, ITestContext testContext) {
+    Long startTime = (Long) testContext.getAttribute("StartTime");
+    long endTime = System.currentTimeMillis();
+    System.out.println("END " + testMethod.getName() + " at " + new Date(endTime) + ", took: "
+        + (endTime - startTime) + "ms.");
   }
 
   @Test
   public void testJobQueueAddingJobsOneByOne() throws InterruptedException {
     String queueName = TestHelper.getTestMethodName();
     JobQueue.Builder builder = TaskTestUtil.buildJobQueue(queueName);
-    WorkflowConfig.Builder workflowCfgBuilder = new WorkflowConfig.Builder().setWorkflowId(queueName).setParallelJobs(1);
+    WorkflowConfig.Builder workflowCfgBuilder = new WorkflowConfig.Builder()
+        .setWorkflowId(queueName)
+        .setParallelJobs(1)
+        .setJobPurgeInterval(-1);

Review comment:
       yes, verified with debugger.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499967970



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskRebalancerStopResume.java
##########
@@ -102,29 +123,41 @@ public void stopAndResumeNamedQueue() throws Exception {
     String queueName = TestHelper.getTestMethodName();
 
     // Create a queue
-    LOG.info("Starting job-queue: " + queueName);
-    JobQueue queue = new JobQueue.Builder(queueName).build();
+    LOG.info("Starting job-queue : " + queueName);
+    WorkflowConfig.Builder wfgBuilder = new WorkflowConfig.Builder().setJobPurgeInterval(-1);
+    WorkflowConfig wfg = wfgBuilder.build();
+    JobQueue queue = new JobQueue.Builder(queueName)
+        .setWorkflowConfig(wfg)
+        .build();
     _driver.createQueue(queue);
 
     // Enqueue jobs
+    List<String> jobNames = new ArrayList<>();
+    List<JobConfig.Builder> jobBuilders = new ArrayList<>();
     Set<String> master = Sets.newHashSet("MASTER");
     JobConfig.Builder job1 = new JobConfig.Builder().setCommand(MockTask.TASK_COMMAND)
         .setTargetResource(WorkflowGenerator.DEFAULT_TGT_DB).setTargetPartitionStates(master)
         .setJobCommandConfigMap(Collections.singletonMap(MockTask.JOB_DELAY, "200"));
     String job1Name = "masterJob";
     LOG.info("Enqueuing job: " + job1Name);
-    _driver.enqueueJob(queueName, job1Name, job1);
+    jobNames.add(job1Name);
+    jobBuilders.add(job1);
+    //_driver.enqueueJob(queueName, job1Name, job1);
 
     Set<String> slave = Sets.newHashSet("SLAVE");
     JobConfig.Builder job2 = new JobConfig.Builder().setCommand(MockTask.TASK_COMMAND)
         .setTargetResource(WorkflowGenerator.DEFAULT_TGT_DB).setTargetPartitionStates(slave);
     String job2Name = "slaveJob";
     LOG.info("Enqueuing job: " + job2Name);
-    _driver.enqueueJob(queueName, job2Name, job2);
+    jobNames.add(job2Name);
+    jobBuilders.add(job2);
+    _driver.enqueueJobs(queueName, jobNames,jobBuilders);
+    //_driver.enqueueJob(queueName, job2Name, job2);

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check for each test class

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r467242429



##########
File path: helix-core/src/test/java/org/apache/helix/common/ZkTestBase.java
##########
@@ -719,6 +720,12 @@ public void cleanupLiveInstanceOwners() {
       clientMap.clear();
     }
     _liveInstanceOwners.clear();
+
+    boolean status = TestHelper.afterClassCheck(name);
+    // Assert here does not work.
+    if (!status) {
+      System.out.println("---------- Test Class " + name + " thread leakage detected! ---------------");

Review comment:
       This is something I spent a lot of time yesterday  Can't make it work this way. See comment 725 "Assert does not work". 
   
   On the hand, after testing /running this diff many time. I found this may not be a good idea to fail it outright. The reasons are:
   
   - closing Zk Session, Cancel() timer thread or executors by our code and till the thread there are shutdown are async. There are some time delta before these threads tearing down from cancel() time. And it is out of our control. For example, closing Zk session via Zookeeper client library. But the sendThread and eventThread associated with this session is not teared down before Zookeeper return from close() and hand the execution to our code. 
   - Some threads created from stream() etc java functional framework, such as forkjoinwork, or pool- can happen. We don't have control of their life time. 
   
   Thus, sometimes some of these to-be-closed threads would appear here. So relying on human examination here may not be a bad idea for now. Let me know what is your take?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499967699



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskRebalancerStopResume.java
##########
@@ -82,7 +99,11 @@ public void stopAndResume() throws Exception {
   @Test
   public void stopAndResumeWorkflow() throws Exception {
     String workflow = "SomeWorkflow";
-    Workflow flow = WorkflowGenerator.generateDefaultRepeatedJobWorkflowBuilder(workflow).build();
+    WorkflowConfig.Builder wfgBuilder = new WorkflowConfig.Builder().setJobPurgeInterval(-1);
+    WorkflowConfig wfg = wfgBuilder.build();

Review comment:
       fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r501988944



##########
File path: helix-core/src/test/java/org/apache/helix/TestListenerCallback.java
##########
@@ -119,6 +119,8 @@ public void beforeClass() throws Exception {
 
   @AfterClass
   public void afterClass() throws Exception {
+    String testClassName = this.getShortClassName();

Review comment:
       Let us make this name consistent with line 102-103, i.e, either fix that or change this line to make them consistent.

##########
File path: helix-core/src/test/java/org/apache/helix/TestListenerCallbackBatchMode.java
##########
@@ -119,6 +119,9 @@ public void beforeClass()
   @AfterClass
   public void afterClass()
       throws Exception {
+    String testClassName = this.getShortClassName();

Review comment:
       Do we need one in BeforeClass?  If we need this being print out before and after class methods, should we put this into ZkTestBase (like what we do in for beforeTest and afterTest in ZkTestBase)?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499936076



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestEnqueueJobs.java
##########
@@ -177,6 +215,7 @@ public void testQueueParallelJobs() throws InterruptedException {
     }
     _driver.enqueueJobs(queueName, jobNames, jobBuilders);
 
+    System.out.println("before start controller");

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499966570



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskRebalancer.java
##########
@@ -217,22 +246,35 @@ public void testNamedQueue() throws Exception {
     String queueName = TestHelper.getTestMethodName();
 
     // Create a queue
-    JobQueue queue = new JobQueue.Builder(queueName).build();
+    WorkflowConfig.Builder wfgBuilder = new WorkflowConfig.Builder().setJobPurgeInterval(-1);
+    WorkflowConfig wfg = wfgBuilder.build();
+    JobQueue queue = new JobQueue.Builder(queueName)
+        .setWorkflowConfig(wfg)
+        .build();
     _driver.createQueue(queue);
 
     // Enqueue jobs
+    List<String> jobNames = new ArrayList<>();
+    List<JobConfig.Builder> jobBuilders = new ArrayList<>();
     Set<String> master = Sets.newHashSet("MASTER");
     Set<String> slave = Sets.newHashSet("SLAVE");
     JobConfig.Builder job1 = new JobConfig.Builder().setCommand(MockTask.TASK_COMMAND)
         .setTargetResource(WorkflowGenerator.DEFAULT_TGT_DB).setTargetPartitionStates(master);
     JobConfig.Builder job2 = new JobConfig.Builder().setCommand(MockTask.TASK_COMMAND)
         .setTargetResource(WorkflowGenerator.DEFAULT_TGT_DB).setTargetPartitionStates(slave);
-    _driver.enqueueJob(queueName, "masterJob", job1);
-    _driver.enqueueJob(queueName, "slaveJob", job2);
+    jobNames.add("masterJob");
+    jobNames.add("slaveJob");
+    jobBuilders.add(job1);
+    jobBuilders.add(job2);
+    //_driver.enqueueJob(queueName, "masterJob", job1);
+    //_driver.enqueueJob(queueName, "slaveJob", job2);
+    _driver.enqueueJobs(queueName, jobNames, jobBuilders);
 
     // Ensure successful completion
     String namespacedJob1 = queueName + "_masterJob";
     String namespacedJob2 = queueName + "_slaveJob";
+
+    System.out.println("testNamedQueue before pollForJobState");

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499936133



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestEnqueueJobs.java
##########
@@ -188,6 +227,7 @@ public void testQueueParallelJobs() throws InterruptedException {
           TaskState.COMPLETED);
     }
 
+    System.out.println("after the poll");

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r498959848



##########
File path: helix-core/src/main/java/org/apache/helix/task/TaskStateModelFactory.java
##########
@@ -97,7 +97,7 @@ public void shutdown() {
   }
 
   @VisibleForTesting
-  void shutdownNow() {
+  public void shutdownNow() {

Review comment:
       JK suggestion, revert this one and let it leak for now. 
   
   We may need another diff to let shutdown() have parameter to control if shutdown or shutdown now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r501452595



##########
File path: helix-core/src/test/java/org/apache/helix/controller/stages/TestRebalancePipeline.java
##########
@@ -157,7 +157,7 @@ public void testDuplicateMsg() throws Exception {
 
     Thread.sleep(2 * MessageGenerationPhase.DEFAULT_OBSELETE_MSG_PURGE_DELAY);
     runPipeline(event, dataRefresh, false);
-    
+

Review comment:
       remove.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504327940



##########
File path: helix-core/src/test/resources/log4j.properties
##########
@@ -27,6 +32,8 @@ log4j.appender.R=org.apache.log4j.RollingFileAppender
 log4j.appender.R.layout=org.apache.log4j.PatternLayout
 log4j.appender.R.layout.ConversionPattern=%5p [%C:%M] (%F:%L) - %m%n
 log4j.appender.R.File=target/ClusterManagerLogs/log.txt
+log4j.appender.R.MaxBackupIndex=200

Review comment:
       Let me remove it. 
   
   This thing is that if you really debug some test and enable some extensive logging. You probably still need to enable this one again. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r502718658



##########
File path: helix-core/src/test/java/org/apache/helix/TestListenerCallback.java
##########
@@ -119,6 +119,8 @@ public void beforeClass() throws Exception {
 
   @AfterClass
   public void afterClass() throws Exception {
+    String testClassName = this.getShortClassName();

Review comment:
       actually, remove this one as it is moved to zkTestBase.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r508972311



##########
File path: helix-core/src/test/java/org/apache/helix/integration/rebalancer/CrushRebalancers/TestCrushAutoRebalance.java
##########
@@ -298,7 +298,7 @@ private void validateZoneAndTagIsolation(IdealState is, ExternalView ev, int exp
       Assert.assertEquals(assignedZones.size(), expectedReplica);
     }
   }
-
+/* These blank test would cause @AfterClass not get invoked.

Review comment:
       This does not invoke parent class testcase




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r489853126



##########
File path: helix-core/src/test/java/org/apache/helix/common/ZkTestBase.java
##########
@@ -719,6 +722,20 @@ public void cleanupLiveInstanceOwners() {
       clientMap.clear();
     }
     _liveInstanceOwners.clear();
+
+    // wait 2 seconds for threads to be teared down.
+    // this reduce false alert.
+    Thread.sleep(1000);

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499967894



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskRebalancerStopResume.java
##########
@@ -102,29 +123,41 @@ public void stopAndResumeNamedQueue() throws Exception {
     String queueName = TestHelper.getTestMethodName();
 
     // Create a queue
-    LOG.info("Starting job-queue: " + queueName);
-    JobQueue queue = new JobQueue.Builder(queueName).build();
+    LOG.info("Starting job-queue : " + queueName);
+    WorkflowConfig.Builder wfgBuilder = new WorkflowConfig.Builder().setJobPurgeInterval(-1);
+    WorkflowConfig wfg = wfgBuilder.build();
+    JobQueue queue = new JobQueue.Builder(queueName)
+        .setWorkflowConfig(wfg)
+        .build();
     _driver.createQueue(queue);
 
     // Enqueue jobs
+    List<String> jobNames = new ArrayList<>();
+    List<JobConfig.Builder> jobBuilders = new ArrayList<>();
     Set<String> master = Sets.newHashSet("MASTER");
     JobConfig.Builder job1 = new JobConfig.Builder().setCommand(MockTask.TASK_COMMAND)
         .setTargetResource(WorkflowGenerator.DEFAULT_TGT_DB).setTargetPartitionStates(master)
         .setJobCommandConfigMap(Collections.singletonMap(MockTask.JOB_DELAY, "200"));
     String job1Name = "masterJob";
     LOG.info("Enqueuing job: " + job1Name);
-    _driver.enqueueJob(queueName, job1Name, job1);
+    jobNames.add(job1Name);
+    jobBuilders.add(job1);
+    //_driver.enqueueJob(queueName, job1Name, job1);

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r501988944



##########
File path: helix-core/src/test/java/org/apache/helix/TestListenerCallback.java
##########
@@ -119,6 +119,8 @@ public void beforeClass() throws Exception {
 
   @AfterClass
   public void afterClass() throws Exception {
+    String testClassName = this.getShortClassName();

Review comment:
       Let us make this name consistent with line 102-103, i.e, either fix that or change this line to make them consistent.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504344664



##########
File path: helix-core/src/test/java/org/apache/helix/monitoring/TestClusterStatusMonitorLifecycle.java
##########
@@ -291,7 +292,14 @@ public void testClusterStatusMonitorLifecycle() throws Exception {
         Query.not(Query.match(Query.attr("SensorName"), Query.value("MessageQueueStatus.*"))),
         exp1);
 
-    Assert.assertTrue(TestHelper.verify(() -> ManagementFactory.getPlatformMBeanServer()
-        .queryMBeans(new ObjectName("ClusterStatus:*"), exp2).isEmpty(), 3000));
+    // Note, the _asyncTasksThreadPool shutting down logic in GenericHelixController is best effort
+    // there is not guarantee that all threads in the pool is gone. MOstly they will, but not always.
+    // see https://github.com/apache/helix/issues/1280
+    boolean result = TestHelper.verify(() -> ManagementFactory.getPlatformMBeanServer()
+        .queryMBeans(new ObjectName("ClusterStatus:*"), exp2).isEmpty(), TestHelper.WAIT_DURATION);
+
+    if (!result) {
+      System.out.println("controllers shutting down failed to tear down all _asyncTasksThreadPools");
+    }

Review comment:
       Let us discuss how can we fix it here? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r508970545



##########
File path: helix-core/src/test/java/org/apache/helix/integration/TestPartitionMovementThrottle.java
##########
@@ -278,8 +278,7 @@ public void testThrottleOnlyClusterLevelAnyType() {
     ClusterLiveNodesVerifier liveNodesVerifier =
         new ClusterLiveNodesVerifier(_gZkClient, CLUSTER_NAME,
             Lists.transform(Arrays.asList(_participants), MockParticipantManager::getInstanceName));
-    Assert.assertTrue(liveNodesVerifier.verifyByZkCallback(1000));
-    Assert.assertTrue(_clusterVerifier.verifyByPolling());
+    Assert.assertTrue(liveNodesVerifier.verifyByZkCallback(5000));

Review comment:
       restored.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504353695



##########
File path: helix-core/src/test/java/org/apache/helix/integration/TestZkConnectionLost.java
##########
@@ -66,6 +66,8 @@
   private ClusterSetup _setupTool;
   private HelixZkClient _zkClient;
 
+  private static final int RESTART_CNT = 2;

Review comment:
       Let me change back to 4 to see how it goes.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r500462866



##########
File path: helix-core/src/main/java/org/apache/helix/tools/ClusterStateVerifier.java
##########
@@ -181,6 +181,12 @@ private BestPossAndExtViewZkVerifier(HelixZkClient zkClient, String clusterName,
       this.resources = resources;
     }
 
+    public void close() {

Review comment:
       Can we put this into a separate PR? i.e, add close method to ClusterStateVerifier.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504347427



##########
File path: helix-core/src/test/java/org/apache/helix/manager/zk/client/TestHelixZkClient.java
##########
@@ -37,7 +37,7 @@
   @Test
   public void testZkConnectionManager() {
     final String TEST_ROOT = "/testZkConnectionManager/IDEALSTATES";

Review comment:
       reverted,




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499969093



##########
File path: helix-core/src/test/java/org/apache/helix/task/TaskSynchronizedTestBase.java
##########
@@ -76,11 +77,15 @@ public void beforeClass() throws Exception {
     startParticipants();
     createManagers();
     _clusterVerifier =
-        new BestPossibleExternalViewVerifier.Builder(CLUSTER_NAME).setZkClient(_gZkClient).build();
+        new BestPossibleExternalViewVerifier.Builder(CLUSTER_NAME).setZkClient(_gZkClient)
+            .setWaitTillVerify(TestHelper.DEFAULT_REBALANCE_PROCESSING_WAIT_TIME)
+            .build();
   }
 
   @AfterClass
   public void afterClass() throws Exception {
+    String testClassName = this.getShortClassName();
+    System.out.println("AfterClass: " + testClassName + " of TaskSynchronizedTestBase called.");

Review comment:
       fixed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on pull request #1227:
URL: https://github.com/apache/helix/pull/1227#issuecomment-677977936


   @kaisun2000 Let's don't keep adding more changes to the same PR. I think you have put 2 PRs into one change. It should be at least, 1. thread leakage check. 2. fixing the tests (I would prefer fixing them separately, but if the fixes are similar then it's OK).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r475819672



##########
File path: helix-core/src/main/java/org/apache/helix/tools/ClusterVerifiers/ZkHelixClusterVerifier.java
##########
@@ -213,6 +222,13 @@ public boolean verifyByPolling() {
     return verifyByPolling(DEFAULT_TIMEOUT, DEFAULT_PERIOD);
   }
 
+  protected boolean _isLogMore = false;

Review comment:
       We can use LOG.info and enable org.apache.helix.tools.ClusterVerifiers info log in log4j. But this option would means by default it is not enabled. However, some test only times out occasionally. Then we can't catch it on the spot and it is hard to reproduce. 
   
   Once we are good with this test in github, we can consider to make it to use LOG?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check for each test class

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r467308783



##########
File path: helix-core/src/test/java/org/apache/helix/TestHelper.java
##########
@@ -847,4 +850,161 @@ public static void printZkListeners(HelixZkClient client) throws Exception {
     }
     System.out.println("}");
   }
+
+  // Thread dump related utility function
+  private static ThreadGroup getRootThreadGroup() {
+    ThreadGroup candidate = Thread.currentThread().getThreadGroup();
+    while (candidate.getParent() != null) {
+      candidate = candidate.getParent();
+    }
+    return candidate;
+  }
+
+  public static List<Thread> getAllThreads() {

Review comment:
       done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r498953286



##########
File path: helix-core/src/main/java/org/apache/helix/task/TaskUtil.java
##########
@@ -832,6 +832,7 @@ private static boolean isJobExpired(String jobName, JobConfig jobConfig, JobCont
       LOG.warn(
           "Job {} exists in JobDAG but JobConfig is missing! It's treated as expired and will be purged.",
           jobName);
+      System.out.println(String.format("Job %s exists in jobdag bug job config missing, expire the job", jobName));

Review comment:
       Remove




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499944813



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestQuotaBasedScheduling.java
##########
@@ -526,11 +538,17 @@ public void testJobQueueScheduling() throws InterruptedException {
     JobConfig.Builder jobConfigBulider = new JobConfig.Builder().setCommand(JOB_COMMAND)
         .addTaskConfigs(taskConfigs).setJobCommandConfigMap(_jobCommandMap).setJobType("A");
 
+    List<String> jobNames = new ArrayList<>();
+    List<JobConfig.Builder> jobBuilders = new ArrayList<>();
+
     for (int i = 0; i < 5; i++) {
       String jobName = "JOB_" + i;
       lastJobName = jobName;
-      _driver.enqueueJob(queueName, jobName, jobConfigBulider);
+      jobNames.add(jobName);
+      jobBuilders.add(jobConfigBulider);
+      // _driver.enqueueJob(queueName, jobName, jobConfigBulider);

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499966481



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskRebalancer.java
##########
@@ -217,22 +246,35 @@ public void testNamedQueue() throws Exception {
     String queueName = TestHelper.getTestMethodName();
 
     // Create a queue
-    JobQueue queue = new JobQueue.Builder(queueName).build();
+    WorkflowConfig.Builder wfgBuilder = new WorkflowConfig.Builder().setJobPurgeInterval(-1);
+    WorkflowConfig wfg = wfgBuilder.build();
+    JobQueue queue = new JobQueue.Builder(queueName)
+        .setWorkflowConfig(wfg)
+        .build();
     _driver.createQueue(queue);
 
     // Enqueue jobs
+    List<String> jobNames = new ArrayList<>();
+    List<JobConfig.Builder> jobBuilders = new ArrayList<>();
     Set<String> master = Sets.newHashSet("MASTER");
     Set<String> slave = Sets.newHashSet("SLAVE");
     JobConfig.Builder job1 = new JobConfig.Builder().setCommand(MockTask.TASK_COMMAND)
         .setTargetResource(WorkflowGenerator.DEFAULT_TGT_DB).setTargetPartitionStates(master);
     JobConfig.Builder job2 = new JobConfig.Builder().setCommand(MockTask.TASK_COMMAND)
         .setTargetResource(WorkflowGenerator.DEFAULT_TGT_DB).setTargetPartitionStates(slave);
-    _driver.enqueueJob(queueName, "masterJob", job1);
-    _driver.enqueueJob(queueName, "slaveJob", job2);
+    jobNames.add("masterJob");
+    jobNames.add("slaveJob");
+    jobBuilders.add(job1);
+    jobBuilders.add(job2);
+    //_driver.enqueueJob(queueName, "masterJob", job1);

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r475809892



##########
File path: helix-core/src/main/java/org/apache/helix/tools/ClusterVerifiers/BestPossibleExternalViewVerifier.java
##########
@@ -236,6 +239,9 @@ protected synchronized boolean verifyState() {
       idealStates.entrySet()
           .removeIf(pair -> pair.getValue().getStateModelDefRef().equals(TaskConstants.STATE_MODEL_NAME));
 
+      if (_isLogMore) {
+        System.out.println("---- before verify instances !");
+      }

Review comment:
       Can we get rid of all these println statement in prod code?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499944290



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestMaxNumberOfAttemptsMasterSwitch.java
##########
@@ -71,6 +71,9 @@ public void beforeClass() throws Exception {
 
   @AfterClass
   public void afterClass() throws Exception {
+    String testClassName = this.getShortClassName();
+    System.out.println("AfterClass: " + testClassName + " of TestMaxNumberOfAttempsMasterSwitch called.");

Review comment:
       changed to `    System.out.println("AfterClass: " + TestHelper.getTestClassName() + ":" + TestHelper.getTestMethodName()  + "invoked.");
   `




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504360777



##########
File path: helix-core/src/test/java/org/apache/helix/integration/multizk/TestMultiZkHelixJavaApis.java
##########
@@ -222,6 +232,17 @@ public void afterClass() throws Exception {
       } else {
         System.clearProperty(MetadataStoreRoutingConstants.MSDS_SERVER_ENDPOINT_KEY);
       }
+
+      boolean status = false;

Review comment:
       rebased.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r502718794



##########
File path: helix-core/src/test/java/org/apache/helix/TestListenerCallback.java
##########
@@ -119,6 +119,8 @@ public void beforeClass() throws Exception {
 
   @AfterClass
   public void afterClass() throws Exception {
+    String testClassName = this.getShortClassName();

Review comment:
       remove this one and as logging from ZkTestBase now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r494730407



##########
File path: helix-core/src/main/java/org/apache/helix/task/TaskDriver.java
##########
@@ -969,17 +969,29 @@ public TaskState pollForJobState(String workflowName, String jobName, long timeo
     Set<TaskState> allowedStates = new HashSet<>(Arrays.asList(states));
     // Wait for state
     long st = System.currentTimeMillis();
+    long ct = 0;
     do {
       Thread.sleep(timeToSleep);
       ctx = getWorkflowContext(workflowName);
+      ct =  System.currentTimeMillis();
     } while ((ctx == null || ctx.getJobState(jobName) == null
         || !allowedStates.contains(ctx.getJobState(jobName)))
-        && System.currentTimeMillis() < st + timeout);
+        && ct < st + timeout);
 
     if (ctx == null || !allowedStates.contains(ctx.getJobState(jobName))) {

Review comment:
       revert back.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r501451201



##########
File path: helix-core/src/test/java/org/apache/helix/TestZKCallback.java
##########
@@ -166,6 +166,7 @@ public void testInvocation() throws Exception {
 
       ExternalView extView = new ExternalView("db-12345");
       accessor.setProperty(keyBuilder.externalView("db-12345"), extView);
+

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r501450709



##########
File path: helix-core/src/main/java/org/apache/helix/controller/stages/TaskGarbageCollectionStage.java
##########
@@ -78,6 +78,7 @@ public void process(ClusterEvent event) throws Exception {
         }
         long currentTime = System.currentTimeMillis();
         long nextPurgeTime = workflowContext.getLastJobPurgeTime() + purgeInterval;
+

Review comment:
       fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499968692



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskThrottling.java
##########
@@ -87,15 +97,17 @@ public void testTaskThrottle() throws Exception {
     _gSetupTool.getClusterManagementTool().setConfig(scope, properties);
 
     String jobName2 = "Job2";
-    Workflow flow2 = WorkflowGenerator.generateSingleJobWorkflowBuilder(jobName2, jobConfig).build();
+    Workflow flow2 = WorkflowGenerator.generateSingleJobWorkflowBuilder(jobName2, jobConfig)
+        .setWorkflowConfig(wfg)
+        .build();
     _driver.start(flow2);
     _driver.pollForJobState(flow2.getName(), TaskUtil.getNamespacedJobName(flow2.getName(), jobName2),
         TaskState.IN_PROGRESS);
 
     // Expect 10 tasks
     Assert.assertTrue(TestHelper.verify(
         () -> (countRunningPartition(flow2, jobName2) == (_numNodes * perNodeTaskLimitation)),
-        TestHelper.WAIT_DURATION));
+        2 * TestHelper.WAIT_DURATION));

Review comment:
       fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504349178



##########
File path: helix-core/src/test/java/org/apache/helix/integration/rebalancer/WagedRebalancer/TestWagedRebalance.java
##########
@@ -665,7 +642,6 @@ public void testRebalancerReset() throws Exception {
     // After reset done, the rebalancer will try to rebalance all the partitions since it has
     // forgotten the previous state.
     // TODO remove this sleep after fix https://github.com/apache/helix/issues/526

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r508911772



##########
File path: helix-core/src/test/java/org/apache/helix/manager/zk/client/TestHelixZkClient.java
##########
@@ -110,33 +110,39 @@ public void testSharingZkClient() throws Exception {
       @Override
       public void handleDataChange(String s, Object o) {
         notificationCountA[0]++;
+        System.out.println("sharedZkClient A increased n0 with path: " + s);

Review comment:
       print session id if test fail




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r508977084



##########
File path: helix-core/src/test/java/org/apache/helix/tools/TestClusterSetup.java
##########
@@ -61,56 +61,162 @@
   protected static final String STATE_MODEL = "MasterSlave";
   protected static final String TEST_NODE = "testnode_1";
 
-  private ClusterSetup _clusterSetup;
+  private ClusterSetup _clusterSetup = null;
 
   private static String[] createArgs(String str) {
     String[] split = str.split("[ ]+");
     System.out.println(Arrays.toString(split));
     return split;
   }
 
-  @BeforeClass()
+  @BeforeClass
   public void beforeClass() throws Exception {
     System.out
         .println("START TestClusterSetup.beforeClass() " + new Date(System.currentTimeMillis()));
+    _clusterSetup = new ClusterSetup(ZK_ADDR);
   }
 
-  @AfterClass()
+  @AfterClass
   public void afterClass() {
+    String testClassName = this.getShortClassName();
+    System.out.println("AfterClass: " + testClassName + " of TestClusterSetup called.");
+
     deleteCluster(CLUSTER_NAME);
+    _clusterSetup.close();
     System.out.println("END TestClusterSetup.afterClass() " + new Date(System.currentTimeMillis()));
   }
 
-  @BeforeMethod()
+  @BeforeMethod
   public void setup() {
+    // System.out.println("@BeforeMethod TestClusterSetup beforeMethod called. ");
+    try {
+      _gZkClient.deleteRecursively("/" + CLUSTER_NAME);
+      _clusterSetup.addCluster(CLUSTER_NAME, true);
+    } catch (Exception e) {
+      System.out.println("@BeforeMethod TestClusterSetup exception:" + e);
+    }
+  }
 
-    _gZkClient.deleteRecursively("/" + CLUSTER_NAME);
-    _clusterSetup = new ClusterSetup(ZK_ADDR);
-    _clusterSetup.addCluster(CLUSTER_NAME, true);
+  @Test(expectedExceptions = HelixException.class)
+  public void testAddClusterWithInvalidCloudConfig() throws Exception {
+    String className = TestHelper.getTestClassName();
+    String methodName = TestHelper.getTestMethodName();
+    String clusterName = className + "_" + methodName;
+
+    CloudConfig.Builder cloudConfigInitBuilder = new CloudConfig.Builder();
+    cloudConfigInitBuilder.setCloudEnabled(true);
+    List<String> sourceList = new ArrayList<String>();
+    sourceList.add("TestURL");
+    cloudConfigInitBuilder.setCloudInfoSources(sourceList);
+    cloudConfigInitBuilder.setCloudProvider(CloudProvider.CUSTOMIZED);
+
+    CloudConfig cloudConfigInit = cloudConfigInitBuilder.build();
+
+    // Since setCloudInfoProcessorName is missing, this add cluster call will throw an exception
+    _clusterSetup.addCluster(clusterName, false, cloudConfigInit);
+  }
+
+  @Test(dependsOnMethods = "testAddClusterWithInvalidCloudConfig")
+  public void testAddClusterWithValidCloudConfig() throws Exception {
+    String className = TestHelper.getTestClassName();
+    String methodName = TestHelper.getTestMethodName();
+    String clusterName = className + "_" + methodName;
+
+    CloudConfig.Builder cloudConfigInitBuilder = new CloudConfig.Builder();
+    cloudConfigInitBuilder.setCloudEnabled(true);
+    cloudConfigInitBuilder.setCloudID("TestID");
+    List<String> sourceList = new ArrayList<String>();
+    sourceList.add("TestURL");
+    cloudConfigInitBuilder.setCloudInfoSources(sourceList);
+    cloudConfigInitBuilder.setCloudInfoProcessorName("TestProcessorName");
+    cloudConfigInitBuilder.setCloudProvider(CloudProvider.CUSTOMIZED);
+
+    CloudConfig cloudConfigInit = cloudConfigInitBuilder.build();
+
+    _clusterSetup.addCluster(clusterName, false, cloudConfigInit);
+
+    // Read CloudConfig from Zookeeper and check the content
+    ConfigAccessor _configAccessor = new ConfigAccessor(_gZkClient);
+    CloudConfig cloudConfigFromZk = _configAccessor.getCloudConfig(clusterName);
+    Assert.assertTrue(cloudConfigFromZk.isCloudEnabled());
+    Assert.assertEquals(cloudConfigFromZk.getCloudID(), "TestID");
+    List<String> listUrlFromZk = cloudConfigFromZk.getCloudInfoSources();
+    Assert.assertEquals(listUrlFromZk.get(0), "TestURL");
+    Assert.assertEquals(cloudConfigFromZk.getCloudInfoProcessorName(), "TestProcessorName");
+    Assert.assertEquals(cloudConfigFromZk.getCloudProvider(), CloudProvider.CUSTOMIZED.name());
+  }
+
+  @Test(dependsOnMethods = "testAddClusterWithValidCloudConfig")
+  public void testAddClusterAzureProvider() throws Exception {
+    String className = TestHelper.getTestClassName();
+    String methodName = TestHelper.getTestMethodName();
+    String clusterName = className + "_" + methodName;
+
+    CloudConfig.Builder cloudConfigInitBuilder = new CloudConfig.Builder();
+    cloudConfigInitBuilder.setCloudEnabled(true);
+    cloudConfigInitBuilder.setCloudID("TestID");
+    cloudConfigInitBuilder.setCloudProvider(CloudProvider.AZURE);
+
+    CloudConfig cloudConfigInit = cloudConfigInitBuilder.build();
+
+    _clusterSetup.addCluster(clusterName, false, cloudConfigInit);
+
+    // Read CloudConfig from Zookeeper and check the content
+    ConfigAccessor _configAccessor = new ConfigAccessor(ZK_ADDR);
+    CloudConfig cloudConfigFromZk = _configAccessor.getCloudConfig(clusterName);
+    Assert.assertTrue(cloudConfigFromZk.isCloudEnabled());
+    Assert.assertEquals(cloudConfigFromZk.getCloudID(), "TestID");
+    List<String> listUrlFromZk = cloudConfigFromZk.getCloudInfoSources();
+
+    // Since it is Azure, topology information should have been populated.
+    ClusterConfig clusterConfig = _configAccessor.getClusterConfig(clusterName);
+    Assert.assertEquals(clusterConfig.getTopology(), AzureConstants.AZURE_TOPOLOGY);
+    Assert.assertEquals(clusterConfig.getFaultZoneType(), AzureConstants.AZURE_FAULT_ZONE_TYPE);
+    Assert.assertTrue(clusterConfig.isTopologyAwareEnabled());
+
+    // Since provider is not customized, CloudInfoSources and CloudInfoProcessorName will be null.
+    Assert.assertNull(listUrlFromZk);
+    Assert.assertNull(cloudConfigFromZk.getCloudInfoProcessorName());
+    Assert.assertEquals(cloudConfigFromZk.getCloudProvider(), CloudProvider.AZURE.name());
   }
 
-  @Test
+  // Note, with mvn 3.6.1, we have a nasty bug that running "mvn test" under helix-core,
+  // all the bellow test will not be invoked. Also, @AfterClass cleanup of this class and

Review comment:
       you are right. changed the wording to
   
   >  // Note, with mvn 3.6.1, we have a nasty bug that running "mvn test" under helix-core,
     // all the bellow test will be invoked after other test including @AfterClass cleanup of this 
     // This bug does not happen of running command as "mvn test -Dtest=TestClusterSetup". Nor does it
     // happen in intellij. The workaround found is to add dependsOnMethods attribute to all the rest.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r501453131



##########
File path: helix-core/src/test/java/org/apache/helix/integration/TestRebalancerPersistAssignments.java
##########
@@ -29,6 +29,7 @@
 import org.apache.helix.TestHelper;
 import org.apache.helix.controller.rebalancer.strategy.CrushEdRebalanceStrategy;
 import org.apache.helix.controller.rebalancer.strategy.RebalanceStrategy;
+import org.apache.helix.TestHelper;

Review comment:
       remove




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504354110



##########
File path: helix-core/src/test/java/org/apache/helix/controller/changedetector/TestResourceChangeDetector.java
##########
@@ -425,13 +426,19 @@ public void testIgnoreNonTopologyChanges() {
   @Test(dependsOnMethods = "testIgnoreNonTopologyChanges")
   public void testResetSnapshots() {
     // ensure the cluster converged before the test to ensure IS is not modified unexpectedly
-    HelixClusterVerifier _clusterVerifier =
+
+    ZkHelixClusterVerifier _clusterVerifier =

Review comment:
       changed back.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r500624701



##########
File path: helix-core/src/main/java/org/apache/helix/tools/ClusterStateVerifier.java
##########
@@ -181,6 +181,12 @@ private BestPossAndExtViewZkVerifier(HelixZkClient zkClient, String clusterName,
       this.resources = resources;
     }
 
+    public void close() {

Review comment:
       #1445 is created for this one.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check for each test class

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r467312702



##########
File path: helix-core/src/test/java/org/apache/helix/TestHelper.java
##########
@@ -847,4 +850,161 @@ public static void printZkListeners(HelixZkClient client) throws Exception {
     }
     System.out.println("}");
   }
+
+  // Thread dump related utility function
+  private static ThreadGroup getRootThreadGroup() {
+    ThreadGroup candidate = Thread.currentThread().getThreadGroup();
+    while (candidate.getParent() != null) {
+      candidate = candidate.getParent();
+    }
+    return candidate;
+  }
+
+  public static List<Thread> getAllThreads() {
+    ThreadGroup rootThreadGroup = getRootThreadGroup();
+    Thread[] threads = new Thread[32];
+    int count = rootThreadGroup.enumerate(threads);
+    while (count == threads.length) {
+      threads = new Thread[threads.length * 2];
+      count = rootThreadGroup.enumerate(threads);
+    }
+    return Arrays.asList(Arrays.copyOf(threads, count));
+  }
+
+  static public enum ThreadCategory {
+    ZkServer("zookeeper server threads", 4, 100,
+        new String[]{"SessionTracker", "NIOServerCxn", "SyncThread:", "ProcessThread"}),
+    ZkSession("zkclient/zooKeeper session threads", 12, 12,
+        new String[] {"ZkClient-EventThread", "ZkClient-AsyncCallback", "-EventThread", "-SendThread"}),
+    ForkJoin( "fork join pool threads", 2, 10, new String[]{"ForkJoinPool"}),
+    Timer( "timer threads", 0, 2, new String[]{"time"}),
+    TaskStateModel("TaskStateModel threads", 0, 0, new String[]{"TaskStateModel"}),
+    Other("Other threads", 0, 3, new String[]{""});
+
+    private String _description;
+    private List<String> _pattern;
+    private int _warningLimit;
+    private int _limit;
+
+    public String getDescription() {
+      return _description;
+    }
+
+    public Predicate<String> getMatchPred() {
+      if (this.name() != ThreadCategory.Other.name()) {
+        Predicate<String> pred = target -> {
+          for (String p : _pattern) {
+            if (target.toLowerCase().contains(p.toLowerCase())) {
+              return true;
+            }
+          }
+          return false;
+        };
+        return pred;
+      }
+
+      List<Predicate<String>> predicateList = new ArrayList<>();
+      for (ThreadCategory threadCategory : ThreadCategory.values()) {
+        if (threadCategory == ThreadCategory.Other) {
+          continue;
+        }
+        predicateList.add(threadCategory.getMatchPred());
+      }
+      Predicate<String> pred = target -> {
+        for (Predicate<String> p : predicateList) {
+          if (p.test(target)) {
+            return false;
+          }
+        }
+        return true;
+      };
+
+      return pred;
+    }
+
+    public int getWarningLimit() {
+      return _warningLimit;
+    }
+
+    public int getLimit() {
+      return _limit;
+    }
+
+    private ThreadCategory(String description, int warningLimit,  int limit, String[] patterns ) {
+      _description = description;
+      _pattern = Arrays.asList(patterns);
+      _warningLimit = warningLimit;
+      _limit = limit;
+    }
+  }
+
+  public static boolean afterClassCheck(String classname) {

Review comment:
       done. refactored.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r501455087



##########
File path: helix-core/src/test/java/org/apache/helix/integration/rebalancer/DelayedAutoRebalancer/TestDelayedAutoRebalance.java
##########
@@ -105,16 +105,20 @@ public void beforeClass() throws Exception {
    */
   @Test
   public void testDelayedPartitionMovement() throws Exception {
+    System.out.println("START TestDelayedAutoRebalance::testDelayedPartitionMovement");
     Map<String, ExternalView> externalViewsBefore = createTestDBs(1000000);
     validateDelayedMovements(externalViewsBefore);
+    System.out.println("START TestDelayedAutoRebalance::testDelayedPartitionMovement");

Review comment:
       Changed to END




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang closed pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
jiajunwang closed pull request #1227:
URL: https://github.com/apache/helix/pull/1227


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499943585



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestJobFailureDependence.java
##########
@@ -161,15 +161,19 @@ public void testWorkflowFailureJobThreshold() throws Exception {
     _driver.updateWorkflow(queueName, configBuilder.build());
     _driver.stop(queueName);
 
+    List<String> jobNames = new ArrayList<>();
+    List<JobConfig.Builder> jobBuilders = new ArrayList<>();
     for (int i = 0; i < _numDbs; i++) {
       JobConfig.Builder jobConfig =
           new JobConfig.Builder().setCommand(MockTask.TASK_COMMAND).setTargetResource(_testDbs.get(i))
               .setTargetPartitionStates(Sets.newHashSet("SLAVE")).setIgnoreDependentJobFailure(true);
       String jobName = "job" + _testDbs.get(i);
       queueBuilder.enqueueJob(jobName, jobConfig);
-      _driver.enqueueJob(queueName, jobName, jobConfig);
+      jobNames.add(jobName);
+      jobBuilders.add(jobConfig);
+      //_driver.enqueueJob(queueName, jobName, jobConfig);

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r508903269



##########
File path: helix-core/src/test/java/org/apache/helix/integration/rebalancer/CrushRebalancers/TestCrushAutoRebalance.java
##########
@@ -298,7 +298,7 @@ private void validateZoneAndTagIsolation(IdealState is, ExternalView ev, int exp
       Assert.assertEquals(assignedZones.size(), expectedReplica);
     }
   }
-
+/* These blank test would cause @AfterClass not get invoked.

Review comment:
       confirm parent testZone not going to run this way.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r494733066



##########
File path: helix-core/src/test/java/org/apache/helix/TestHelper.java
##########
@@ -794,7 +797,12 @@ public static boolean verify(Verifier verifier, long timeout) throws Exception {
     long start = System.currentTimeMillis();
     do {
       boolean result = verifier.verify();
-      if (result || (System.currentTimeMillis() - start) > timeout) {
+      long now = System.currentTimeMillis();
+      if (result || (now - start) > timeout) {

Review comment:
       changed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504343921



##########
File path: helix-core/src/test/java/org/apache/helix/tools/TestClusterSetup.java
##########
@@ -221,18 +327,18 @@ public void testRebalanceCluster() throws Exception {
     verifyReplication(_gZkClient, CLUSTER_NAME, TEST_DB, 4);
   }
 
-  @Test()
+  @Test(dependsOnMethods = "testAddClusterWithValidCloudConfig")
   public void testParseCommandLinesArgs() throws Exception {
     // wipe ZK
-    _gZkClient.deleteRecursively("/" + CLUSTER_NAME);
-    _clusterSetup = new ClusterSetup(ZK_ADDR);
+    //_gZkClient.deleteRecursively("/" + CLUSTER_NAME);
+    //_clusterSetup = new ClusterSetup(ZK_ADDR);
 
     ClusterSetup
         .processCommandLineArgs(createArgs("-zkSvr " + ZK_ADDR + " --addCluster " + CLUSTER_NAME));
 
     // wipe again
     _gZkClient.deleteRecursively("/" + CLUSTER_NAME);
-    _clusterSetup = new ClusterSetup(ZK_ADDR);
+    //_clusterSetup = new ClusterSetup(ZK_ADDR);

Review comment:
       removed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504349597



##########
File path: helix-core/src/test/java/org/apache/helix/integration/rebalancer/CrushRebalancers/TestCrushAutoRebalance.java
##########
@@ -298,7 +298,7 @@ private void validateZoneAndTagIsolation(IdealState is, ExternalView ev, int exp
       Assert.assertEquals(assignedZones.size(), expectedReplica);
     }
   }
-
+/* These blank test would cause @AfterClass not get invoked.

Review comment:
       You are right. It is delayed after some other test class run. Still, this means hard to track leaking threads, as we want to have all test method in one test class to run at one batch. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r508972311



##########
File path: helix-core/src/test/java/org/apache/helix/integration/rebalancer/CrushRebalancers/TestCrushAutoRebalance.java
##########
@@ -298,7 +298,7 @@ private void validateZoneAndTagIsolation(IdealState is, ExternalView ev, int exp
       Assert.assertEquals(assignedZones.size(), expectedReplica);
     }
   }
-
+/* These blank test would cause @AfterClass not get invoked.

Review comment:
       This actually works.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r508978058



##########
File path: helix-core/src/test/java/org/apache/helix/tools/TestClusterSetup.java
##########
@@ -61,56 +61,162 @@
   protected static final String STATE_MODEL = "MasterSlave";
   protected static final String TEST_NODE = "testnode_1";
 
-  private ClusterSetup _clusterSetup;
+  private ClusterSetup _clusterSetup = null;
 
   private static String[] createArgs(String str) {
     String[] split = str.split("[ ]+");
     System.out.println(Arrays.toString(split));
     return split;
   }
 
-  @BeforeClass()
+  @BeforeClass
   public void beforeClass() throws Exception {
     System.out
         .println("START TestClusterSetup.beforeClass() " + new Date(System.currentTimeMillis()));
+    _clusterSetup = new ClusterSetup(ZK_ADDR);
   }
 
-  @AfterClass()
+  @AfterClass
   public void afterClass() {
+    String testClassName = this.getShortClassName();
+    System.out.println("AfterClass: " + testClassName + " of TestClusterSetup called.");
+
     deleteCluster(CLUSTER_NAME);
+    _clusterSetup.close();
     System.out.println("END TestClusterSetup.afterClass() " + new Date(System.currentTimeMillis()));
   }
 
-  @BeforeMethod()
+  @BeforeMethod
   public void setup() {
+    // System.out.println("@BeforeMethod TestClusterSetup beforeMethod called. ");
+    try {
+      _gZkClient.deleteRecursively("/" + CLUSTER_NAME);
+      _clusterSetup.addCluster(CLUSTER_NAME, true);
+    } catch (Exception e) {
+      System.out.println("@BeforeMethod TestClusterSetup exception:" + e);
+    }
+  }
 
-    _gZkClient.deleteRecursively("/" + CLUSTER_NAME);
-    _clusterSetup = new ClusterSetup(ZK_ADDR);
-    _clusterSetup.addCluster(CLUSTER_NAME, true);
+  @Test(expectedExceptions = HelixException.class)

Review comment:
       This may be due to rebase. Nothing changed here. Moved back for easy comparison




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
lei-xia commented on pull request #1227:
URL: https://github.com/apache/helix/pull/1227#issuecomment-704425985


   This PR is too long for effective reviewing.  I scanned the PR quickly, I would suggest at least break it to 3 separate PRs with each of them focusing on just 1 topic, that will make them much easier to review:  1) Fix test issues caused by StateVerifier (i.e, add the delay),  2) Replace all hard-code sleep() call with TestVerifier, 3) Add ThreadLeak/memory usage check into the test, and maybe 4) other minor fixes in the tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499968484



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskSchedulingTwoCurrentStates.java
##########
@@ -93,6 +93,8 @@ public void beforeClass() throws Exception {
 
   @AfterClass()
   public void afterClass() throws Exception {
+    String testClassName = this.getShortClassName();
+    System.out.println("AfterClass: " + testClassName + " of TestTaskSchedulingTwoCurrentStates called.");

Review comment:
       changed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r494728983



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -623,6 +623,13 @@ void reset() {
       }
     }
 
+    // some thing like STATE_TRANSITION.Key specific pool are not shut down.
+    for (String msgType : _executorMap.keySet()) {
+      ExecutorService pool = _executorMap.remove(msgType);

Review comment:
       removed. Instead noted the potential improvement for now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r475821949



##########
File path: helix-core/src/main/java/org/apache/helix/tools/ClusterVerifiers/ZkHelixClusterVerifier.java
##########
@@ -48,6 +50,10 @@
   private static Logger LOG = LoggerFactory.getLogger(ZkHelixClusterVerifier.class);
   protected static int DEFAULT_TIMEOUT = 300 * 1000;
   protected static int DEFAULT_PERIOD = 500;
+  // COOL_DOWN before starting vefiyByPool
+  // The goal is to make sure waiting for controller pipeline starts at least one cycle
+  // to update ideal state.
+  protected static int DEFAULT_COOLDOWN = 2 * 1000;

Review comment:
       This is just by experiment. Quite some flaky ones are gone. 
   
   In fact, longer cool_down aside from making test running a little bit longer, no negative impact, only positive impact. From the github run, this does not really significant (not noticable) make the test running longer,




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504350428



##########
File path: helix-core/src/test/java/org/apache/helix/integration/messaging/TestBatchMessage.java
##########
@@ -264,10 +274,16 @@ public void testSubMsgExecutionFail() throws Exception {
     Map<String, Map<String, String>> errStates = new HashMap<String, Map<String, String>>();
     errStates.put("TestDB0", new HashMap<String, String>());
     errStates.get("TestDB0").put(errPartition, masterOfPartition0);
-    boolean result =
-        ClusterStateVerifier.verifyByPolling(new ClusterStateVerifier.BestPossAndExtViewZkVerifier(
-            ZK_ADDR, clusterName, errStates));
-    Assert.assertTrue(result);
+
+    ClusterStateVerifier.BestPossAndExtViewZkVerifier verifier = new ClusterStateVerifier.BestPossAndExtViewZkVerifier(

Review comment:
       Removed ClusterStateVerifier from this one too.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504346734



##########
File path: helix-core/src/test/java/org/apache/helix/manager/zk/client/TestHelixZkClient.java
##########
@@ -110,33 +110,39 @@ public void testSharingZkClient() throws Exception {
       @Override
       public void handleDataChange(String s, Object o) {
         notificationCountA[0]++;
+        System.out.println("sharedZkClient A increased n0 with path: " + s);

Review comment:
       This one is actually worth discussion. Thisone failed rarely maybe twice in hundreds run. Not sure exactly why, and it is really very hard to reproduce, thus, wishing to keep some debug log. 
   Let us discuss what we can do with this one.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check for each test class

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r467313080



##########
File path: helix-core/src/test/java/org/apache/helix/common/ZkTestBase.java
##########
@@ -711,6 +711,7 @@ protected Message createMessage(Message.MessageType type, String msgId, String f
 
   @AfterClass

Review comment:
       We need this one to be called last after all test class level cleaning up. Defining another @afterClass in base class, we can't guarantee this order.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504356030



##########
File path: helix-core/src/test/resources/log4j.properties
##########
@@ -27,6 +32,8 @@ log4j.appender.R=org.apache.log4j.RollingFileAppender
 log4j.appender.R.layout=org.apache.log4j.PatternLayout
 log4j.appender.R.layout.ConversionPattern=%5p [%C:%M] (%F:%L) - %m%n
 log4j.appender.R.File=target/ClusterManagerLogs/log.txt
+log4j.appender.R.MaxBackupIndex=200

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r501451028



##########
File path: helix-core/src/test/java/org/apache/helix/TestHelper.java
##########
@@ -37,6 +37,8 @@
 import java.util.concurrent.ConcurrentHashMap;
 import java.util.concurrent.CountDownLatch;
 import java.util.concurrent.TimeUnit;
+import java.util.function.Predicate;
+import java.util.stream.Collectors;

Review comment:
       not needed, remove




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r501451685



##########
File path: helix-core/src/test/java/org/apache/helix/controller/changedetector/TestResourceChangeDetector.java
##########
@@ -422,16 +425,23 @@ public void testIgnoreNonTopologyChanges() {
     }
   }
 
-  @Test(dependsOnMethods = "testIgnoreNonTopologyChanges")
+  static final int WAIT_TIME = 30 * 60 * 1000;
+  @Test(dependsOnMethods = "testIgnoreNonTopologyChanges", timeOut = 30 * 60 * 1000)

Review comment:
       fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499936028



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestEnqueueJobs.java
##########
@@ -165,6 +202,7 @@ public void testQueueParallelJobs() throws InterruptedException {
           TaskState.COMPLETED);
     }
 
+    System.out.println("Before stop controller");

Review comment:
       fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504329307



##########
File path: helix-core/src/test/java/org/apache/helix/integration/TestZkConnectionLost.java
##########
@@ -66,6 +66,8 @@
   private ClusterSetup _setupTool;
   private HelixZkClient _zkClient;
 
+  private static final int RESTART_CNT = 2;

Review comment:
       This some times has timeout even with enlarging the default timeout of unit test. Zk operation is very slow in Github.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r475776419



##########
File path: helix-core/src/main/java/org/apache/helix/task/TaskDriver.java
##########
@@ -969,17 +969,29 @@ public TaskState pollForJobState(String workflowName, String jobName, long timeo
     Set<TaskState> allowedStates = new HashSet<>(Arrays.asList(states));
     // Wait for state
     long st = System.currentTimeMillis();
+    long ct = 0;
     do {
       Thread.sleep(timeToSleep);
       ctx = getWorkflowContext(workflowName);
+      ct =  System.currentTimeMillis();
     } while ((ctx == null || ctx.getJobState(jobName) == null
         || !allowedStates.contains(ctx.getJobState(jobName)))
-        && System.currentTimeMillis() < st + timeout);
+        && ct < st + timeout);
 
     if (ctx == null || !allowedStates.contains(ctx.getJobState(jobName))) {

Review comment:
       What is the purpose of this change?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r498960942



##########
File path: helix-core/src/main/java/org/apache/helix/task/TaskUtil.java
##########
@@ -952,6 +953,7 @@ private static boolean removeWorkflowJobContext(HelixPropertyStore<ZNRecord> pro
         return false;
       }
     }
+    LOG.info("removed job config {}.", path);

Review comment:
       Log message is remove job context, not config.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check for each test class

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r467313142



##########
File path: helix-core/src/test/java/org/apache/helix/TestHelper.java
##########
@@ -847,4 +850,161 @@ public static void printZkListeners(HelixZkClient client) throws Exception {
     }
     System.out.println("}");
   }
+
+  // Thread dump related utility function
+  private static ThreadGroup getRootThreadGroup() {
+    ThreadGroup candidate = Thread.currentThread().getThreadGroup();
+    while (candidate.getParent() != null) {
+      candidate = candidate.getParent();
+    }
+    return candidate;
+  }
+
+  public static List<Thread> getAllThreads() {
+    ThreadGroup rootThreadGroup = getRootThreadGroup();
+    Thread[] threads = new Thread[32];
+    int count = rootThreadGroup.enumerate(threads);
+    while (count == threads.length) {
+      threads = new Thread[threads.length * 2];
+      count = rootThreadGroup.enumerate(threads);
+    }
+    return Arrays.asList(Arrays.copyOf(threads, count));
+  }
+
+  static public enum ThreadCategory {
+    ZkServer("zookeeper server threads", 4, 100,
+        new String[]{"SessionTracker", "NIOServerCxn", "SyncThread:", "ProcessThread"}),
+    ZkSession("zkclient/zooKeeper session threads", 12, 12,
+        new String[] {"ZkClient-EventThread", "ZkClient-AsyncCallback", "-EventThread", "-SendThread"}),
+    ForkJoin( "fork join pool threads", 2, 10, new String[]{"ForkJoinPool"}),
+    Timer( "timer threads", 0, 2, new String[]{"time"}),
+    TaskStateModel("TaskStateModel threads", 0, 0, new String[]{"TaskStateModel"}),
+    Other("Other threads", 0, 3, new String[]{""});
+
+    private String _description;
+    private List<String> _pattern;
+    private int _warningLimit;
+    private int _limit;
+
+    public String getDescription() {
+      return _description;
+    }
+
+    public Predicate<String> getMatchPred() {
+      if (this.name() != ThreadCategory.Other.name()) {
+        Predicate<String> pred = target -> {
+          for (String p : _pattern) {
+            if (target.toLowerCase().contains(p.toLowerCase())) {
+              return true;
+            }
+          }
+          return false;
+        };
+        return pred;
+      }
+
+      List<Predicate<String>> predicateList = new ArrayList<>();
+      for (ThreadCategory threadCategory : ThreadCategory.values()) {
+        if (threadCategory == ThreadCategory.Other) {
+          continue;
+        }
+        predicateList.add(threadCategory.getMatchPred());
+      }
+      Predicate<String> pred = target -> {
+        for (Predicate<String> p : predicateList) {
+          if (p.test(target)) {
+            return false;
+          }
+        }
+        return true;
+      };
+
+      return pred;
+    }
+
+    public int getWarningLimit() {
+      return _warningLimit;
+    }
+
+    public int getLimit() {
+      return _limit;
+    }
+
+    private ThreadCategory(String description, int warningLimit,  int limit, String[] patterns ) {
+      _description = description;
+      _pattern = Arrays.asList(patterns);
+      _warningLimit = warningLimit;
+      _limit = limit;
+    }
+  }
+
+  public static boolean afterClassCheck(String classname) {
+    // step 1: get all active threads
+    List<Thread> threads = TestHelper.getAllThreads();
+    System.out.println(classname + " has active threads cnt:" + threads.size());
+
+
+    // step 2: caegorize threads

Review comment:
       fixed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504347586



##########
File path: helix-core/src/test/java/org/apache/helix/manager/zk/client/TestHelixZkClient.java
##########
@@ -82,8 +82,8 @@ public void testZkConnectionManager() {
 
   @Test(dependsOnMethods = "testZkConnectionManager")
   public void testSharingZkClient() throws Exception {
-    final String TEST_ROOT = "/testSharingZkClient/IDEALSTATES";
-    final String TEST_PATH = TEST_ROOT + TEST_NODE;
+    final String TEST_ROOT = "/testSharingZkClient/IDEALSTATES1";

Review comment:
       reverted.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504353605



##########
File path: helix-core/src/test/java/org/apache/helix/integration/common/ZkStandAloneCMTestBase.java
##########
@@ -85,6 +85,7 @@ public void beforeClass() throws Exception {
     _clusterVerifier = new BestPossibleExternalViewVerifier.Builder(CLUSTER_NAME).setZkClient(_gZkClient)
         .setWaitTillVerify(TestHelper.DEFAULT_REBALANCE_PROCESSING_WAIT_TIME)
         .build();    Assert.assertTrue(_clusterVerifier.verifyByPolling());
+    Assert.assertTrue(_clusterVerifier.verifyByPolling());

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r503667622



##########
File path: helix-core/src/test/java/org/apache/helix/controller/changedetector/TestResourceChangeDetector.java
##########
@@ -425,13 +426,19 @@ public void testIgnoreNonTopologyChanges() {
   @Test(dependsOnMethods = "testIgnoreNonTopologyChanges")
   public void testResetSnapshots() {
     // ensure the cluster converged before the test to ensure IS is not modified unexpectedly
-    HelixClusterVerifier _clusterVerifier =
+
+    ZkHelixClusterVerifier _clusterVerifier =

Review comment:
       Why? HelixClusterVerifier has close() method.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/TestDisableCustomCodeRunner.java
##########
@@ -228,7 +228,7 @@ public void test() throws Exception {
       }
     }
     Assert.assertNotNull(leader);
-    for (String instance : callbacks.keySet()) {
+/*    for (String instance : callbacks.keySet()) {

Review comment:
       Why? I think these are test logic.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/TestPartitionMovementThrottle.java
##########
@@ -278,8 +278,7 @@ public void testThrottleOnlyClusterLevelAnyType() {
     ClusterLiveNodesVerifier liveNodesVerifier =
         new ClusterLiveNodesVerifier(_gZkClient, CLUSTER_NAME,
             Lists.transform(Arrays.asList(_participants), MockParticipantManager::getInstanceName));
-    Assert.assertTrue(liveNodesVerifier.verifyByZkCallback(1000));
-    Assert.assertTrue(_clusterVerifier.verifyByPolling());
+    Assert.assertTrue(liveNodesVerifier.verifyByZkCallback(5000));

Review comment:
       Again, why removing test logic?

##########
File path: helix-core/src/test/java/org/apache/helix/integration/common/ZkStandAloneCMTestBase.java
##########
@@ -94,6 +95,8 @@ public void beforeClass() throws Exception {
 
   @AfterClass
   public void afterClass() throws Exception {
+    String testClassName = this.getShortClassName();

Review comment:
       Can we add it to ZkTestBase? The @AfterClass methods with different names will be executed one by one.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/rebalancer/CrushRebalancers/TestCrushAutoRebalance.java
##########
@@ -298,7 +298,7 @@ private void validateZoneAndTagIsolation(IdealState is, ExternalView ev, int exp
       Assert.assertEquals(assignedZones.size(), expectedReplica);
     }
   }
-
+/* These blank test would cause @AfterClass not get invoked.

Review comment:
       I don't think so. It is delayed but not ignored. Is it a concern if the after class is invoked with some delay?
   
   If so, then all the tests that are developed with a parent test class have this potential issue. Then we need to fix it with a different strategy.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/messaging/TestP2PNoDuplicatedMessage.java
##########
@@ -174,7 +174,10 @@ public void testP2PStateTransitionEnabled() {
       verifyP2PEnabled(startTime);
     }
 
-    Assert.assertEquals(p2pTrigged, total);
+    // Discussed with Meng who originally created this one. The success rate really depends on how

Review comment:
       Who is "Meng"? This code is for everyone to review. Please be aware of and follow the open-source code convention.

##########
File path: helix-core/src/test/java/org/apache/helix/manager/zk/client/TestHelixZkClient.java
##########
@@ -110,33 +110,39 @@ public void testSharingZkClient() throws Exception {
       @Override
       public void handleDataChange(String s, Object o) {
         notificationCountA[0]++;
+        System.out.println("sharedZkClient A increased n0 with path: " + s);

Review comment:
       These outputs are too verbose. Do we really need them?

##########
File path: helix-core/src/test/java/org/apache/helix/manager/zk/client/TestHelixZkClient.java
##########
@@ -37,7 +37,7 @@
   @Test
   public void testZkConnectionManager() {
     final String TEST_ROOT = "/testZkConnectionManager/IDEALSTATES";

Review comment:
       I think the TEST_ROOT already contains the test method name.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/spectator/TestRoutingTableProviderPeriodicRefresh.java
##########
@@ -56,7 +56,7 @@
 
   private static final String STATE_MODEL = BuiltInStateModelDefinitions.MasterSlave.name();
   private static final String TEST_DB = "TestDB";
-  private static final String CLASS_NAME = TestRoutingTableProvider.class.getSimpleName();
+  private static final String CLASS_NAME = TestRoutingTableProviderPeriodicRefresh.class.getSimpleName();

Review comment:
       getShortClassName();

##########
File path: helix-core/src/test/resources/log4j.properties
##########
@@ -37,4 +44,12 @@ log4j.logger.org.apache=ERROR
 log4j.logger.com.noelios=ERROR
 log4j.logger.org.restlet=ERROR
 
+
+#log4j.logger.org.apache.helix.controller.dataproviders.WorkflowControllerDataProvider=INFO

Review comment:
       remove

##########
File path: helix-core/src/test/java/org/apache/helix/integration/messaging/TestBatchMessage.java
##########
@@ -187,15 +194,18 @@ public void testChangeBatchMessageMode() throws Exception {
       participants[i] = new MockParticipantManager(ZK_ADDR, clusterName, instanceName);
       participants[i].syncStart();
     }
-
-    result =
-        ClusterStateVerifier.verifyByZkCallback(new BestPossAndExtViewZkVerifier(ZK_ADDR,
-            clusterName));
-    Assert.assertTrue(result);
-    // Change to three is because there is an extra factory registered
-    // So one extra NO_OP message send
-    Assert.assertTrue(listener._maxNumberOfChildren <= 3,
-        "Should get no more than 2 messages (O->S and S->M)");
+    BestPossAndExtViewZkVerifier verifier1 = new BestPossAndExtViewZkVerifier(ZK_ADDR, clusterName);

Review comment:
       Please mind the naming.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/common/ZkStandAloneCMTestBase.java
##########
@@ -85,6 +85,7 @@ public void beforeClass() throws Exception {
     _clusterVerifier = new BestPossibleExternalViewVerifier.Builder(CLUSTER_NAME).setZkClient(_gZkClient)
         .setWaitTillVerify(TestHelper.DEFAULT_REBALANCE_PROCESSING_WAIT_TIME)
         .build();    Assert.assertTrue(_clusterVerifier.verifyByPolling());
+    Assert.assertTrue(_clusterVerifier.verifyByPolling());

Review comment:
       Duplicate code.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/TestZkConnectionLost.java
##########
@@ -66,6 +66,8 @@
   private ClusterSetup _setupTool;
   private HelixZkClient _zkClient;
 
+  private static final int RESTART_CNT = 2;

Review comment:
       It was 4.

##########
File path: helix-core/src/test/java/org/apache/helix/integration/messaging/TestBatchMessage.java
##########
@@ -264,10 +274,16 @@ public void testSubMsgExecutionFail() throws Exception {
     Map<String, Map<String, String>> errStates = new HashMap<String, Map<String, String>>();
     errStates.put("TestDB0", new HashMap<String, String>());
     errStates.get("TestDB0").put(errPartition, masterOfPartition0);
-    boolean result =
-        ClusterStateVerifier.verifyByPolling(new ClusterStateVerifier.BestPossAndExtViewZkVerifier(
-            ZK_ADDR, clusterName, errStates));
-    Assert.assertTrue(result);
+
+    ClusterStateVerifier.BestPossAndExtViewZkVerifier verifier = new ClusterStateVerifier.BestPossAndExtViewZkVerifier(

Review comment:
       nit, why the other BestPossAndExtViewZkVerifier does not need the prefix "ClusterStateVerifier."?

##########
File path: helix-core/src/test/java/org/apache/helix/tools/TestClusterSetup.java
##########
@@ -61,56 +61,162 @@
   protected static final String STATE_MODEL = "MasterSlave";
   protected static final String TEST_NODE = "testnode_1";
 
-  private ClusterSetup _clusterSetup;
+  private ClusterSetup _clusterSetup = null;
 
   private static String[] createArgs(String str) {
     String[] split = str.split("[ ]+");
     System.out.println(Arrays.toString(split));
     return split;
   }
 
-  @BeforeClass()
+  @BeforeClass
   public void beforeClass() throws Exception {
     System.out
         .println("START TestClusterSetup.beforeClass() " + new Date(System.currentTimeMillis()));
+    _clusterSetup = new ClusterSetup(ZK_ADDR);
   }
 
-  @AfterClass()
+  @AfterClass
   public void afterClass() {
+    String testClassName = this.getShortClassName();
+    System.out.println("AfterClass: " + testClassName + " of TestClusterSetup called.");
+
     deleteCluster(CLUSTER_NAME);
+    _clusterSetup.close();
     System.out.println("END TestClusterSetup.afterClass() " + new Date(System.currentTimeMillis()));
   }
 
-  @BeforeMethod()
+  @BeforeMethod
   public void setup() {
+    // System.out.println("@BeforeMethod TestClusterSetup beforeMethod called. ");

Review comment:
       Remove

##########
File path: helix-core/src/test/java/org/apache/helix/monitoring/TestClusterStatusMonitorLifecycle.java
##########
@@ -291,7 +292,14 @@ public void testClusterStatusMonitorLifecycle() throws Exception {
         Query.not(Query.match(Query.attr("SensorName"), Query.value("MessageQueueStatus.*"))),
         exp1);
 
-    Assert.assertTrue(TestHelper.verify(() -> ManagementFactory.getPlatformMBeanServer()
-        .queryMBeans(new ObjectName("ClusterStatus:*"), exp2).isEmpty(), 3000));
+    // Note, the _asyncTasksThreadPool shutting down logic in GenericHelixController is best effort
+    // there is not guarantee that all threads in the pool is gone. MOstly they will, but not always.

Review comment:
       typo, MOstly -> Mostly

##########
File path: helix-core/src/test/java/org/apache/helix/integration/rebalancer/WagedRebalancer/TestWagedRebalance.java
##########
@@ -665,7 +642,6 @@ public void testRebalancerReset() throws Exception {
     // After reset done, the rebalancer will try to rebalance all the partitions since it has
     // forgotten the previous state.
     // TODO remove this sleep after fix https://github.com/apache/helix/issues/526

Review comment:
       Remove this too.

##########
File path: helix-core/src/test/java/org/apache/helix/manager/zk/client/TestHelixZkClient.java
##########
@@ -82,8 +82,8 @@ public void testZkConnectionManager() {
 
   @Test(dependsOnMethods = "testZkConnectionManager")
   public void testSharingZkClient() throws Exception {
-    final String TEST_ROOT = "/testSharingZkClient/IDEALSTATES";
-    final String TEST_PATH = TEST_ROOT + TEST_NODE;
+    final String TEST_ROOT = "/testSharingZkClient/IDEALSTATES1";

Review comment:
       Why?

##########
File path: helix-core/src/test/java/org/apache/helix/integration/multizk/TestMultiZkHelixJavaApis.java
##########
@@ -222,6 +232,17 @@ public void afterClass() throws Exception {
       } else {
         System.clearProperty(MetadataStoreRoutingConstants.MSDS_SERVER_ENDPOINT_KEY);
       }
+
+      boolean status = false;

Review comment:
       Please rebase, I think this change has been in.

##########
File path: helix-core/src/test/java/org/apache/helix/tools/TestClusterSetup.java
##########
@@ -221,18 +327,18 @@ public void testRebalanceCluster() throws Exception {
     verifyReplication(_gZkClient, CLUSTER_NAME, TEST_DB, 4);
   }
 
-  @Test()
+  @Test(dependsOnMethods = "testAddClusterWithValidCloudConfig")
   public void testParseCommandLinesArgs() throws Exception {
     // wipe ZK
-    _gZkClient.deleteRecursively("/" + CLUSTER_NAME);
-    _clusterSetup = new ClusterSetup(ZK_ADDR);
+    //_gZkClient.deleteRecursively("/" + CLUSTER_NAME);
+    //_clusterSetup = new ClusterSetup(ZK_ADDR);
 
     ClusterSetup
         .processCommandLineArgs(createArgs("-zkSvr " + ZK_ADDR + " --addCluster " + CLUSTER_NAME));
 
     // wipe again
     _gZkClient.deleteRecursively("/" + CLUSTER_NAME);
-    _clusterSetup = new ClusterSetup(ZK_ADDR);
+    //_clusterSetup = new ClusterSetup(ZK_ADDR);

Review comment:
       remvoe

##########
File path: helix-core/src/test/java/org/apache/helix/tools/TestClusterSetup.java
##########
@@ -221,18 +327,18 @@ public void testRebalanceCluster() throws Exception {
     verifyReplication(_gZkClient, CLUSTER_NAME, TEST_DB, 4);
   }
 
-  @Test()
+  @Test(dependsOnMethods = "testAddClusterWithValidCloudConfig")
   public void testParseCommandLinesArgs() throws Exception {
     // wipe ZK
-    _gZkClient.deleteRecursively("/" + CLUSTER_NAME);
-    _clusterSetup = new ClusterSetup(ZK_ADDR);
+    //_gZkClient.deleteRecursively("/" + CLUSTER_NAME);
+    //_clusterSetup = new ClusterSetup(ZK_ADDR);

Review comment:
       Remove

##########
File path: helix-core/src/test/java/org/apache/helix/integration/spectator/TestRoutingTableProviderFromTargetEV.java
##########
@@ -113,13 +113,17 @@ public void afterClass() throws Exception {
     }
     deleteCluster(CLUSTER_NAME);
   }
+  @Test
+  public void testHead() {
+    Assert.assertTrue(true);
+  }

Review comment:
       Why? I guess you are playing some trick. Could you please explain?

##########
File path: helix-core/src/test/java/org/apache/helix/monitoring/TestClusterStatusMonitorLifecycle.java
##########
@@ -291,7 +292,14 @@ public void testClusterStatusMonitorLifecycle() throws Exception {
         Query.not(Query.match(Query.attr("SensorName"), Query.value("MessageQueueStatus.*"))),
         exp1);
 
-    Assert.assertTrue(TestHelper.verify(() -> ManagementFactory.getPlatformMBeanServer()
-        .queryMBeans(new ObjectName("ClusterStatus:*"), exp2).isEmpty(), 3000));
+    // Note, the _asyncTasksThreadPool shutting down logic in GenericHelixController is best effort
+    // there is not guarantee that all threads in the pool is gone. MOstly they will, but not always.
+    // see https://github.com/apache/helix/issues/1280
+    boolean result = TestHelper.verify(() -> ManagementFactory.getPlatformMBeanServer()
+        .queryMBeans(new ObjectName("ClusterStatus:*"), exp2).isEmpty(), TestHelper.WAIT_DURATION);
+
+    if (!result) {
+      System.out.println("controllers shutting down failed to tear down all _asyncTasksThreadPools");
+    }

Review comment:
       Don't do it, please. If it causes the test to become unstable, we can fix it separately. You are removing this test case now.

##########
File path: helix-core/src/test/java/org/apache/helix/tools/TestClusterSetup.java
##########
@@ -61,56 +61,162 @@
   protected static final String STATE_MODEL = "MasterSlave";
   protected static final String TEST_NODE = "testnode_1";
 
-  private ClusterSetup _clusterSetup;
+  private ClusterSetup _clusterSetup = null;
 
   private static String[] createArgs(String str) {
     String[] split = str.split("[ ]+");
     System.out.println(Arrays.toString(split));
     return split;
   }
 
-  @BeforeClass()
+  @BeforeClass
   public void beforeClass() throws Exception {
     System.out
         .println("START TestClusterSetup.beforeClass() " + new Date(System.currentTimeMillis()));
+    _clusterSetup = new ClusterSetup(ZK_ADDR);
   }
 
-  @AfterClass()
+  @AfterClass
   public void afterClass() {
+    String testClassName = this.getShortClassName();
+    System.out.println("AfterClass: " + testClassName + " of TestClusterSetup called.");
+
     deleteCluster(CLUSTER_NAME);
+    _clusterSetup.close();
     System.out.println("END TestClusterSetup.afterClass() " + new Date(System.currentTimeMillis()));
   }
 
-  @BeforeMethod()
+  @BeforeMethod
   public void setup() {
+    // System.out.println("@BeforeMethod TestClusterSetup beforeMethod called. ");
+    try {
+      _gZkClient.deleteRecursively("/" + CLUSTER_NAME);
+      _clusterSetup.addCluster(CLUSTER_NAME, true);
+    } catch (Exception e) {
+      System.out.println("@BeforeMethod TestClusterSetup exception:" + e);
+    }
+  }
 
-    _gZkClient.deleteRecursively("/" + CLUSTER_NAME);
-    _clusterSetup = new ClusterSetup(ZK_ADDR);
-    _clusterSetup.addCluster(CLUSTER_NAME, true);
+  @Test(expectedExceptions = HelixException.class)
+  public void testAddClusterWithInvalidCloudConfig() throws Exception {
+    String className = TestHelper.getTestClassName();
+    String methodName = TestHelper.getTestMethodName();
+    String clusterName = className + "_" + methodName;
+
+    CloudConfig.Builder cloudConfigInitBuilder = new CloudConfig.Builder();
+    cloudConfigInitBuilder.setCloudEnabled(true);
+    List<String> sourceList = new ArrayList<String>();
+    sourceList.add("TestURL");
+    cloudConfigInitBuilder.setCloudInfoSources(sourceList);
+    cloudConfigInitBuilder.setCloudProvider(CloudProvider.CUSTOMIZED);
+
+    CloudConfig cloudConfigInit = cloudConfigInitBuilder.build();
+
+    // Since setCloudInfoProcessorName is missing, this add cluster call will throw an exception
+    _clusterSetup.addCluster(clusterName, false, cloudConfigInit);
+  }
+
+  @Test(dependsOnMethods = "testAddClusterWithInvalidCloudConfig")
+  public void testAddClusterWithValidCloudConfig() throws Exception {
+    String className = TestHelper.getTestClassName();
+    String methodName = TestHelper.getTestMethodName();
+    String clusterName = className + "_" + methodName;
+
+    CloudConfig.Builder cloudConfigInitBuilder = new CloudConfig.Builder();
+    cloudConfigInitBuilder.setCloudEnabled(true);
+    cloudConfigInitBuilder.setCloudID("TestID");
+    List<String> sourceList = new ArrayList<String>();
+    sourceList.add("TestURL");
+    cloudConfigInitBuilder.setCloudInfoSources(sourceList);
+    cloudConfigInitBuilder.setCloudInfoProcessorName("TestProcessorName");
+    cloudConfigInitBuilder.setCloudProvider(CloudProvider.CUSTOMIZED);
+
+    CloudConfig cloudConfigInit = cloudConfigInitBuilder.build();
+
+    _clusterSetup.addCluster(clusterName, false, cloudConfigInit);
+
+    // Read CloudConfig from Zookeeper and check the content
+    ConfigAccessor _configAccessor = new ConfigAccessor(_gZkClient);
+    CloudConfig cloudConfigFromZk = _configAccessor.getCloudConfig(clusterName);
+    Assert.assertTrue(cloudConfigFromZk.isCloudEnabled());
+    Assert.assertEquals(cloudConfigFromZk.getCloudID(), "TestID");
+    List<String> listUrlFromZk = cloudConfigFromZk.getCloudInfoSources();
+    Assert.assertEquals(listUrlFromZk.get(0), "TestURL");
+    Assert.assertEquals(cloudConfigFromZk.getCloudInfoProcessorName(), "TestProcessorName");
+    Assert.assertEquals(cloudConfigFromZk.getCloudProvider(), CloudProvider.CUSTOMIZED.name());
+  }
+
+  @Test(dependsOnMethods = "testAddClusterWithValidCloudConfig")
+  public void testAddClusterAzureProvider() throws Exception {
+    String className = TestHelper.getTestClassName();
+    String methodName = TestHelper.getTestMethodName();
+    String clusterName = className + "_" + methodName;
+
+    CloudConfig.Builder cloudConfigInitBuilder = new CloudConfig.Builder();
+    cloudConfigInitBuilder.setCloudEnabled(true);
+    cloudConfigInitBuilder.setCloudID("TestID");
+    cloudConfigInitBuilder.setCloudProvider(CloudProvider.AZURE);
+
+    CloudConfig cloudConfigInit = cloudConfigInitBuilder.build();
+
+    _clusterSetup.addCluster(clusterName, false, cloudConfigInit);
+
+    // Read CloudConfig from Zookeeper and check the content
+    ConfigAccessor _configAccessor = new ConfigAccessor(ZK_ADDR);
+    CloudConfig cloudConfigFromZk = _configAccessor.getCloudConfig(clusterName);
+    Assert.assertTrue(cloudConfigFromZk.isCloudEnabled());
+    Assert.assertEquals(cloudConfigFromZk.getCloudID(), "TestID");
+    List<String> listUrlFromZk = cloudConfigFromZk.getCloudInfoSources();
+
+    // Since it is Azure, topology information should have been populated.
+    ClusterConfig clusterConfig = _configAccessor.getClusterConfig(clusterName);
+    Assert.assertEquals(clusterConfig.getTopology(), AzureConstants.AZURE_TOPOLOGY);
+    Assert.assertEquals(clusterConfig.getFaultZoneType(), AzureConstants.AZURE_FAULT_ZONE_TYPE);
+    Assert.assertTrue(clusterConfig.isTopologyAwareEnabled());
+
+    // Since provider is not customized, CloudInfoSources and CloudInfoProcessorName will be null.
+    Assert.assertNull(listUrlFromZk);
+    Assert.assertNull(cloudConfigFromZk.getCloudInfoProcessorName());
+    Assert.assertEquals(cloudConfigFromZk.getCloudProvider(), CloudProvider.AZURE.name());
   }
 
-  @Test
+  // Note, with mvn 3.6.1, we have a nasty bug that running "mvn test" under helix-core,
+  // all the bellow test will not be invoked. Also, @AfterClass cleanup of this class and

Review comment:
       I mentioned once that I think the below tests are not "not invoked". They are just delayed. Have you verified that if they never been triggered during the whole test suite?

##########
File path: helix-core/src/test/java/org/apache/helix/model/TestResourceConfig.java
##########
@@ -34,6 +34,11 @@
   private static final ObjectMapper _objectMapper = new ObjectMapper();
 
   @Test
+  public void testHead() {

Review comment:
       Again, curious about this trick. Wait for the explanation.

##########
File path: helix-core/src/test/java/org/apache/helix/tools/TestClusterSetup.java
##########
@@ -61,56 +61,162 @@
   protected static final String STATE_MODEL = "MasterSlave";
   protected static final String TEST_NODE = "testnode_1";
 
-  private ClusterSetup _clusterSetup;
+  private ClusterSetup _clusterSetup = null;
 
   private static String[] createArgs(String str) {
     String[] split = str.split("[ ]+");
     System.out.println(Arrays.toString(split));
     return split;
   }
 
-  @BeforeClass()
+  @BeforeClass
   public void beforeClass() throws Exception {
     System.out
         .println("START TestClusterSetup.beforeClass() " + new Date(System.currentTimeMillis()));
+    _clusterSetup = new ClusterSetup(ZK_ADDR);
   }
 
-  @AfterClass()
+  @AfterClass
   public void afterClass() {
+    String testClassName = this.getShortClassName();
+    System.out.println("AfterClass: " + testClassName + " of TestClusterSetup called.");
+
     deleteCluster(CLUSTER_NAME);
+    _clusterSetup.close();
     System.out.println("END TestClusterSetup.afterClass() " + new Date(System.currentTimeMillis()));
   }
 
-  @BeforeMethod()
+  @BeforeMethod
   public void setup() {
+    // System.out.println("@BeforeMethod TestClusterSetup beforeMethod called. ");
+    try {
+      _gZkClient.deleteRecursively("/" + CLUSTER_NAME);
+      _clusterSetup.addCluster(CLUSTER_NAME, true);
+    } catch (Exception e) {
+      System.out.println("@BeforeMethod TestClusterSetup exception:" + e);
+    }
+  }
 
-    _gZkClient.deleteRecursively("/" + CLUSTER_NAME);
-    _clusterSetup = new ClusterSetup(ZK_ADDR);
-    _clusterSetup.addCluster(CLUSTER_NAME, true);
+  @Test(expectedExceptions = HelixException.class)

Review comment:
       The following method movements are not easy to review since I don't know what has been changed. Could you move them to the original place so we can compare side by side?

##########
File path: helix-core/src/test/resources/log4j.properties
##########
@@ -16,7 +16,12 @@
 # specific language governing permissions and limitations
 # under the License.
 #
+
+
+# Set root logger level to DEBUG and its only appender to R.
 log4j.rootLogger=ERROR, C
+# log4j.rootLogger=ERROR, C, R

Review comment:
       remove

##########
File path: helix-core/src/test/java/org/apache/helix/tools/TestHelixAdminCli.java
##########
@@ -568,7 +568,9 @@ public void testDeactivateCluster() throws Exception {
     String command =
         "-zkSvr " + ZK_ADDR + " -activateCluster " + clusterName + " " + grandClusterName + " true";
     ClusterSetup.processCommandLineArgs(command.split("\\s+"));
-    Thread.sleep(500);
+
+    // wait long enough enough so that grand_cluster converge
+    Thread.sleep(20000);

Review comment:
       Verifier polling the grand_cluster?

##########
File path: helix-core/src/test/java/org/apache/helix/tools/TestClusterStateVerifier.java
##########
@@ -114,19 +118,30 @@ public void testResourceSubset() throws InterruptedException {
     Thread.sleep(1000);
     _admin.enableCluster(_clusterName, false);
     _admin.enableInstance(_clusterName, "localhost_12918", true);
-    boolean result =
-        ClusterStateVerifier.verifyByZkCallback(new BestPossAndExtViewZkVerifier(ZK_ADDR,
-            _clusterName, null, Sets.newHashSet(RESOURCES[1])));
-    Assert.assertTrue(result);
+    BestPossAndExtViewZkVerifier verifier = new BestPossAndExtViewZkVerifier(ZK_ADDR,
+        _clusterName, null, Sets.newHashSet(RESOURCES[1]));
+    boolean result = false;
+    try {
+      result = ClusterStateVerifier.verifyByZkCallback(verifier);
+      Assert.assertTrue(result);
+    } finally {
+      verifier.close();
+    }
+
     String[] args = {
         "--zkSvr", ZK_ADDR, "--cluster", _clusterName, "--resources", RESOURCES[1]
     };
     result = ClusterStateVerifier.verifyState(args);
     Assert.assertTrue(result);
 
     // But the full cluster verification should fail
-    boolean fullResult = new BestPossAndExtViewZkVerifier(ZK_ADDR, _clusterName).verify();
-    Assert.assertFalse(fullResult);
+    BestPossAndExtViewZkVerifier verifier1 = new BestPossAndExtViewZkVerifier(ZK_ADDR, _clusterName);

Review comment:
       verifier = ...

##########
File path: helix-core/src/test/resources/log4j.properties
##########
@@ -27,6 +32,8 @@ log4j.appender.R=org.apache.log4j.RollingFileAppender
 log4j.appender.R.layout=org.apache.log4j.PatternLayout
 log4j.appender.R.layout.ConversionPattern=%5p [%C:%M] (%F:%L) - %m%n
 log4j.appender.R.File=target/ClusterManagerLogs/log.txt
+log4j.appender.R.MaxBackupIndex=200

Review comment:
       I think we discussed, there is no need to configure MaxBackupIndex, right?

##########
File path: helix-core/src/test/java/org/apache/helix/tools/TestHelixAdminCli.java
##########
@@ -600,6 +602,15 @@ public void testDeactivateCluster() throws Exception {
       }
     }
 
+    boolean leaderNotExists = TestHelper.verify(() -> {
+      boolean isLeaderExists = _gZkClient.exists(path);
+      if (isLeaderExists) {
+        System.out.println("mysteriou leader out");

Review comment:
       typo? "mysteriou"
   
   This print is too verbose. Please remove.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r476894932



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -623,6 +623,13 @@ void reset() {
       }
     }
 
+    // some thing like STATE_TRANSITION.Key specific pool are not shut down.
+    for (String msgType : _executorMap.keySet()) {
+      ExecutorService pool = _executorMap.remove(msgType);

Review comment:
       https://github.com/apache/helix/issues/1245. 
   HelixTaskExecutor does not shutdown STATE_TRANSITION.Key specific pool. That is the root cause. This has thread leaking in production environment if the msgtype-resourcekey feature is enabled when zk session expires.
   
   In sum, these STATE_TRANSITIONLKEY pool should also be closed. But the customer created ones should not be shutdown,




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] lei-xia commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
lei-xia commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r501990938



##########
File path: helix-core/src/test/java/org/apache/helix/TestListenerCallbackBatchMode.java
##########
@@ -119,6 +119,9 @@ public void beforeClass()
   @AfterClass
   public void afterClass()
       throws Exception {
+    String testClassName = this.getShortClassName();

Review comment:
       Do we need one in BeforeClass?  If we need this being print out before and after class methods, should we put this into ZkTestBase (like what we do in for beforeTest and afterTest in ZkTestBase)?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504355990



##########
File path: helix-core/src/test/resources/log4j.properties
##########
@@ -16,7 +16,12 @@
 # specific language governing permissions and limitations
 # under the License.
 #
+
+
+# Set root logger level to DEBUG and its only appender to R.
 log4j.rootLogger=ERROR, C
+# log4j.rootLogger=ERROR, C, R

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r489848583



##########
File path: helix-core/src/main/java/org/apache/helix/tools/ClusterVerifiers/BestPossibleExternalViewVerifier.java
##########
@@ -236,6 +239,9 @@ protected synchronized boolean verifyState() {
       idealStates.entrySet()
           .removeIf(pair -> pair.getValue().getStateModelDefRef().equals(TaskConstants.STATE_MODEL_NAME));
 
+      if (_isLogMore) {
+        System.out.println("---- before verify instances !");
+      }

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504332484



##########
File path: helix-core/src/test/java/org/apache/helix/tools/TestHelixAdminCli.java
##########
@@ -600,6 +602,15 @@ public void testDeactivateCluster() throws Exception {
       }
     }
 
+    boolean leaderNotExists = TestHelper.verify(() -> {
+      boolean isLeaderExists = _gZkClient.exists(path);
+      if (isLeaderExists) {
+        System.out.println("mysteriou leader out");

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504353863



##########
File path: helix-core/src/test/java/org/apache/helix/integration/TestDisableCustomCodeRunner.java
##########
@@ -228,7 +228,7 @@ public void test() throws Exception {
       }
     }
     Assert.assertNotNull(leader);
-    for (String instance : callbacks.keySet()) {
+/*    for (String instance : callbacks.keySet()) {

Review comment:
       This needs some disussion,




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504343843



##########
File path: helix-core/src/test/java/org/apache/helix/tools/TestClusterSetup.java
##########
@@ -221,18 +327,18 @@ public void testRebalanceCluster() throws Exception {
     verifyReplication(_gZkClient, CLUSTER_NAME, TEST_DB, 4);
   }
 
-  @Test()
+  @Test(dependsOnMethods = "testAddClusterWithValidCloudConfig")
   public void testParseCommandLinesArgs() throws Exception {
     // wipe ZK
-    _gZkClient.deleteRecursively("/" + CLUSTER_NAME);
-    _clusterSetup = new ClusterSetup(ZK_ADDR);
+    //_gZkClient.deleteRecursively("/" + CLUSTER_NAME);
+    //_clusterSetup = new ClusterSetup(ZK_ADDR);

Review comment:
       removed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499936281



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestEnqueueJobs.java
##########
@@ -231,12 +272,14 @@ public void testQueueJobsMaxCapacity() throws InterruptedException {
 
     _driver.resume(queueName);
 
+    System.out.println("before the polls");
     // Wait until all of the enqueued jobs (Job0 to Job3) are finished
     for (int i = 0; i < numberOfJobsAddedInitially; i++) {
       _driver.pollForJobState(queueName, TaskUtil.getNamespacedJobName(queueName, "JOB" + i),
           TaskState.COMPLETED);
     }
 
+    System.out.println("before enqueue for exception");

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499966987



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestTaskRebalancer.java
##########
@@ -162,8 +183,12 @@ public void partitionSet() throws Exception {
   @Test
   public void testRepeatedWorkflow() throws Exception {
     String workflowName = "SomeWorkflow";
+    WorkflowConfig.Builder wfgBuilder = new WorkflowConfig.Builder().setJobPurgeInterval(-1);
+    WorkflowConfig wfg = wfgBuilder.build();

Review comment:
       changed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r494731321



##########
File path: helix-core/src/main/java/org/apache/helix/task/TaskStateModelFactory.java
##########
@@ -97,7 +97,7 @@ public void shutdown() {
   }
 
   @VisibleForTesting
-  void shutdownNow() {
+  public void shutdownNow() {

Review comment:
       helix-core/src/test/java/org/apache/helix/integration/multizk/TestMultiZkHelixJavaApis.java
   actually leaks hundreds of threads.
   We need to call this method to avoid the thread leak
   ```
   public void afterClass() throws Exception {
       String testClassName = getClass().getSimpleName();
       System.out.println("AfterClass: " + testClassName + " of TestMultiZkHelixJavaApis called.");
   
       try {
         // Kill all mock controllers and participants
         MOCK_CONTROLLERS.values().forEach(ClusterControllerManager::syncStop);
         MOCK_PARTICIPANTS.forEach(mockParticipantManager -> {
           mockParticipantManager.syncStop();
           StateMachineEngine stateMachine = mockParticipantManager.getStateMachineEngine();
           if (stateMachine != null) {
             StateModelFactory stateModelFactory = stateMachine.getStateModelFactory("Task");
             if (stateModelFactory != null && stateModelFactory instanceof TaskStateModelFactory) {
               ((TaskStateModelFactory) stateModelFactory).shutdownNow();
             }
           }
         });
   
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504349025



##########
File path: helix-core/src/test/java/org/apache/helix/integration/spectator/TestRoutingTableProviderPeriodicRefresh.java
##########
@@ -56,7 +56,7 @@
 
   private static final String STATE_MODEL = BuiltInStateModelDefinitions.MasterSlave.name();
   private static final String TEST_DB = "TestDB";
-  private static final String CLASS_NAME = TestRoutingTableProvider.class.getSimpleName();
+  private static final String CLASS_NAME = TestRoutingTableProviderPeriodicRefresh.class.getSimpleName();

Review comment:
       changed to `TestHelper.getTestClassName()`
   
   getShortClassName() not working here due to this is static field and getShortClassName is not.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499935940



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestEnqueueJobs.java
##########
@@ -102,7 +136,9 @@ public void testJobSubmitGenericWorkflows() throws InterruptedException {
     JobConfig.Builder jobBuilder =
         new JobConfig.Builder().setTargetResource(WorkflowGenerator.DEFAULT_TGT_DB)
             .setCommand(MockTask.TASK_COMMAND).setMaxAttemptsPerTask(2);
-    Workflow.Builder builder = new Workflow.Builder(workflowName);
+    WorkflowConfig.Builder wfgBuilder = new WorkflowConfig.Builder().setJobPurgeInterval(-1);

Review comment:
       fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r475810452



##########
File path: helix-core/src/test/java/org/apache/helix/common/ZkTestBase.java
##########
@@ -719,6 +722,20 @@ public void cleanupLiveInstanceOwners() {
       clientMap.clear();
     }
     _liveInstanceOwners.clear();
+
+    // wait 2 seconds for threads to be teared down.
+    // this reduce false alert.
+    Thread.sleep(1000);

Review comment:
       This does not apply here. Tearing down thread is async to thread shutdown by the Java virtual machine. There is not way for us to check when the Java virtual machine really tear it down.
   
   that said, I found by practice, waiting 2 sec does not really help much either. I will remove it a little bit later.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r508915755



##########
File path: helix-core/src/test/java/org/apache/helix/integration/TestDisableCustomCodeRunner.java
##########
@@ -228,7 +228,7 @@ public void test() throws Exception {
       }
     }
     Assert.assertNotNull(leader);
-    for (String instance : callbacks.keySet()) {
+/*    for (String instance : callbacks.keySet()) {

Review comment:
       add todo:




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499921011



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestEnqueueJobs.java
##########
@@ -92,6 +125,7 @@ public void testJobQueueAddingJobsAtSametime() throws InterruptedException {
 
     _driver.resume(queueName);
 
+    System.out.println("before poll to job state");

Review comment:
       removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on pull request #1227:
URL: https://github.com/apache/helix/pull/1227#issuecomment-834666617


   Close due to inactive.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r508972255



##########
File path: helix-core/src/test/java/org/apache/helix/integration/rebalancer/CrushRebalancers/TestCrushAutoRebalance.java
##########
@@ -298,7 +298,7 @@ private void validateZoneAndTagIsolation(IdealState is, ExternalView ev, int exp
       Assert.assertEquals(assignedZones.size(), expectedReplica);
     }
   }
-
+/* These blank test would cause @AfterClass not get invoked.

Review comment:
       >Start zookeeper at localhost:2183 in thread main
   START TestCrushAutoRebalance at Tue Oct 20 20:43:11 PDT 2020
   START TestCrushAutoRebalance testZoneIsolationWithInstanceTag at Tue Oct 20 20:43:13 PDT 2020
   END TestCrushAutoRebalance testZoneIsolationWithInstanceTag at Tue Oct 20 20:43:15 PDT 2020, took: 1811ms.
   START TestCrushAutoRebalance testZoneIsolationWithInstanceTag at Tue Oct 20 20:43:15 PDT 2020
   END TestCrushAutoRebalance testZoneIsolationWithInstanceTag at Tue Oct 20 20:43:16 PDT 2020, took: 1773ms.
   START TestCrushAutoRebalance testZoneIsolationWithInstanceTag at Tue Oct 20 20:43:16 PDT 2020
   END TestCrushAutoRebalance testZoneIsolationWithInstanceTag at Tue Oct 20 20:43:18 PDT 2020, took: 1718ms.
   START TestCrushAutoRebalance testZoneIsolation at Tue Oct 20 20:43:18 PDT 2020
   END TestCrushAutoRebalance testZoneIsolation at Tue Oct 20 20:43:20 PDT 2020, took: 1663ms.
   START TestCrushAutoRebalance testZoneIsolation at Tue Oct 20 20:43:20 PDT 2020
   END TestCrushAutoRebalance testZoneIsolation at Tue Oct 20 20:43:21 PDT 2020, took: 1731ms.
   START TestCrushAutoRebalance testZoneIsolation at Tue Oct 20 20:43:21 PDT 2020
   END TestCrushAutoRebalance testZoneIsolation at Tue Oct 20 20:43:23 PDT 2020, took: 1672ms.
   START TestCrushAutoRebalance testLackEnoughLiveRacks at Tue Oct 20 20:43:23 PDT 2020
   END TestCrushAutoRebalance testLackEnoughLiveRacks at Tue Oct 20 20:43:25 PDT 2020, took: 1817ms.
   START TestCrushAutoRebalance testLackEnoughRacks at Tue Oct 20 20:43:25 PDT 2020
   END TestCrushAutoRebalance testLackEnoughRacks at Tue Oct 20 20:43:27 PDT 2020, took: 1819ms.
   END TestCrushAutoRebalance at Tue Oct 20 20:43:27 PDT 2020




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r508915459



##########
File path: helix-core/src/test/java/org/apache/helix/tools/TestClusterSetup.java
##########
@@ -61,56 +61,162 @@
   protected static final String STATE_MODEL = "MasterSlave";
   protected static final String TEST_NODE = "testnode_1";
 
-  private ClusterSetup _clusterSetup;
+  private ClusterSetup _clusterSetup = null;
 
   private static String[] createArgs(String str) {
     String[] split = str.split("[ ]+");
     System.out.println(Arrays.toString(split));
     return split;
   }
 
-  @BeforeClass()
+  @BeforeClass
   public void beforeClass() throws Exception {
     System.out
         .println("START TestClusterSetup.beforeClass() " + new Date(System.currentTimeMillis()));
+    _clusterSetup = new ClusterSetup(ZK_ADDR);
   }
 
-  @AfterClass()
+  @AfterClass
   public void afterClass() {
+    String testClassName = this.getShortClassName();
+    System.out.println("AfterClass: " + testClassName + " of TestClusterSetup called.");
+
     deleteCluster(CLUSTER_NAME);
+    _clusterSetup.close();
     System.out.println("END TestClusterSetup.afterClass() " + new Date(System.currentTimeMillis()));
   }
 
-  @BeforeMethod()
+  @BeforeMethod
   public void setup() {
+    // System.out.println("@BeforeMethod TestClusterSetup beforeMethod called. ");
+    try {
+      _gZkClient.deleteRecursively("/" + CLUSTER_NAME);
+      _clusterSetup.addCluster(CLUSTER_NAME, true);
+    } catch (Exception e) {
+      System.out.println("@BeforeMethod TestClusterSetup exception:" + e);
+    }
+  }
 
-    _gZkClient.deleteRecursively("/" + CLUSTER_NAME);
-    _clusterSetup = new ClusterSetup(ZK_ADDR);
-    _clusterSetup.addCluster(CLUSTER_NAME, true);
+  @Test(expectedExceptions = HelixException.class)

Review comment:
       I will look at it and confirm.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r494731812



##########
File path: helix-core/src/main/java/org/apache/helix/tools/ClusterStateVerifier.java
##########
@@ -181,6 +181,12 @@ private BestPossAndExtViewZkVerifier(HelixZkClient zkClient, String clusterName,
       this.resources = resources;
     }
 
+    public void close() {

Review comment:
       There are still many places using this one.  Change to use new verifier seems to worth a separate diff.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r504344907



##########
File path: helix-core/src/test/java/org/apache/helix/monitoring/TestClusterStatusMonitorLifecycle.java
##########
@@ -291,7 +292,14 @@ public void testClusterStatusMonitorLifecycle() throws Exception {
         Query.not(Query.match(Query.attr("SensorName"), Query.value("MessageQueueStatus.*"))),
         exp1);
 
-    Assert.assertTrue(TestHelper.verify(() -> ManagementFactory.getPlatformMBeanServer()
-        .queryMBeans(new ObjectName("ClusterStatus:*"), exp2).isEmpty(), 3000));
+    // Note, the _asyncTasksThreadPool shutting down logic in GenericHelixController is best effort
+    // there is not guarantee that all threads in the pool is gone. MOstly they will, but not always.

Review comment:
       fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] dasahcc commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
dasahcc commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r469588104



##########
File path: helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java
##########
@@ -177,7 +177,7 @@ public HelixTaskExecutor(ParticipantStatusMonitor participantStatusMonitor,
     _lock = new Object();
     _statusUpdateUtil = new StatusUpdateUtil();
 
-    _timer = new Timer("HelixTaskExecutor_timer", true); // created as a daemon timer thread to handle task timeout
+    _timer = new Timer("HelixTaskExecutor_Timer", true); // created as a daemon timer thread to handle task timeout

Review comment:
       What's the difference of this line? Let's try to minimize the diff for reviewing.

##########
File path: helix-core/src/test/java/org/apache/helix/TestHelper.java
##########
@@ -21,6 +21,7 @@
 
 import java.io.File;
 import java.io.IOException;
+import java.lang.reflect.Array;

Review comment:
       Let's minimize the diff.

##########
File path: helix-core/src/main/java/org/apache/helix/task/TaskStateModelFactory.java
##########
@@ -97,7 +97,7 @@ public void shutdown() {
   }
 
   @VisibleForTesting
-  void shutdownNow() {
+  public void shutdownNow() {

Review comment:
       Is the VisibleForTesting not enough? This is not a good idea to expose it .

##########
File path: helix-core/src/test/java/org/apache/helix/TestListenerCallback.java
##########
@@ -119,6 +119,8 @@ public void beforeClass() throws Exception {
 
   @AfterClass
   public void afterClass() throws Exception {
+    String testClassName = this.getShortClassName();
+    System.out.println("AfterClass: " + testClassName + " of TestListenerCallback called.");

Review comment:
       Will this bring lots of printout? Is this for debugging purpose?

##########
File path: helix-core/src/test/java/org/apache/helix/ThreadLeakageChecker.java
##########
@@ -0,0 +1,178 @@
+package org.apache.helix;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.function.Predicate;
+import java.util.stream.Collectors;
+
+
+public class ThreadLeakageChecker {
+  private static ThreadGroup getRootThreadGroup() {
+    ThreadGroup candidate = Thread.currentThread().getThreadGroup();
+    while (candidate.getParent() != null) {
+      candidate = candidate.getParent();
+    }
+    return candidate;
+  }
+
+  private static List<Thread> getAllThreads() {
+    ThreadGroup rootThreadGroup = getRootThreadGroup();
+    Thread[] threads = new Thread[32];
+    int count = rootThreadGroup.enumerate(threads);
+    while (count == threads.length) {
+      threads = new Thread[threads.length * 2];
+      count = rootThreadGroup.enumerate(threads);
+    }
+    return Arrays.asList(Arrays.copyOf(threads, count));
+  }
+
+  private static final String[] ZkServerThrdPattern =
+      {"SessionTracker", "NIOServerCxn", "SyncThread:", "ProcessThread"};
+  private static final String[] ZkSessionThrdPattern =
+      new String[]{"ZkClient-EventThread", "ZkClient-AsyncCallback", "-EventThread", "-SendThread"};
+  private static final String[] ForkJoinThrdPattern = new String[]{"ForkJoinPool"};
+  private static final String[] TimerThrdPattern = new String[]{"time"};
+  private static final String[] TaskStateModelThrdPattern = new String[]{"TaskStateModel"};
+
+  private static enum ThreadCategory {
+    ZkServer("zookeeper server threads", 4, 100, ZkServerThrdPattern),
+    ZkSession("zkclient/zooKeeper session threads", 12, 12, ZkSessionThrdPattern),
+    ForkJoin("fork join pool threads", 2, 10, ForkJoinThrdPattern),
+    Timer("timer threads", 0, 2, TimerThrdPattern),
+    TaskStateModel("TaskStateModel threads", 0, 0, TaskStateModelThrdPattern),
+    Other("Other threads", 0, 3, new String[]{""});
+
+    private String _description;
+    private List<String> _pattern;
+    private int _warningLimit;
+    private int _limit;
+
+    public String getDescription() {
+      return _description;
+    }
+
+    public Predicate<String> getMatchPred() {
+      if (this.name() != ThreadCategory.Other.name()) {
+        Predicate<String> pred = target -> {
+          for (String p : _pattern) {
+            if (target.toLowerCase().contains(p.toLowerCase())) {
+              return true;
+            }
+          }
+          return false;
+        };
+        return pred;
+      }
+
+      List<Predicate<String>> predicateList = new ArrayList<>();
+      for (ThreadCategory threadCategory : ThreadCategory.values()) {
+        if (threadCategory == ThreadCategory.Other) {
+          continue;
+        }
+        predicateList.add(threadCategory.getMatchPred());
+      }
+      Predicate<String> pred = target -> {
+        for (Predicate<String> p : predicateList) {
+          if (p.test(target)) {
+            return false;
+          }
+        }
+        return true;
+      };
+
+      return pred;
+    }
+
+    public int getWarningLimit() {
+      return _warningLimit;
+    }
+
+    public int getLimit() {
+      return _limit;
+    }
+
+    private ThreadCategory(String description, int warningLimit, int limit, String[] patterns) {
+      _description = description;
+      _pattern = Arrays.asList(patterns);
+      _warningLimit = warningLimit;
+      _limit = limit;
+    }
+  }
+
+  public static boolean afterClassCheck(String classname) {
+    // step 1: get all active threads
+    List<Thread> threads = getAllThreads();
+    System.out.println(classname + " has active threads cnt:" + threads.size());
+
+    // step 2: categorize threads
+    Map<String, List<Thread>> threadByName = null;
+    Map<ThreadCategory, Integer> threadByCnt = new HashMap<>();
+    Map<ThreadCategory, Set<Thread>> threadByCat = new HashMap<>();
+    try {
+      threadByName = threads.
+          stream().
+          filter(p -> p.getThreadGroup() != null && p.getThreadGroup().getName() != null
+              && p.getThreadGroup().getName() != "system").

Review comment:
       This is a bug. I dont think it will work since in Java, String compare uses .equal().

##########
File path: helix-core/src/test/java/org/apache/helix/common/ZkTestBase.java
##########
@@ -719,6 +722,20 @@ public void cleanupLiveInstanceOwners() {
       clientMap.clear();
     }
     _liveInstanceOwners.clear();
+
+    // wait 2 seconds for threads to be teared down.
+    // this reduce false alert.
+    Thread.sleep(1000);
+    boolean status = false;
+    try {
+       status = ThreadLeakageChecker.afterClassCheck(testClassName);
+    } catch (Exception e) {
+      System.out.println("ThreadLeakageChecker exception:" + e.getStackTrace());
+    }
+    // Assert here does not work.
+    if (!status) {
+      System.out.println("---------- Test Class " + testClassName + " thread leakage detected! ---------------");

Review comment:
       Can we kill the threads and print log? At least, we keep the test working.

##########
File path: helix-core/src/test/java/org/apache/helix/common/ZkTestBase.java
##########
@@ -719,6 +722,20 @@ public void cleanupLiveInstanceOwners() {
       clientMap.clear();
     }
     _liveInstanceOwners.clear();
+
+    // wait 2 seconds for threads to be teared down.
+    // this reduce false alert.
+    Thread.sleep(1000);

Review comment:
       Are we still using thread.sleep? At least, we should have periodical check, for example check every 10 milisecond.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r502716727



##########
File path: helix-core/src/test/java/org/apache/helix/TestListenerCallback.java
##########
@@ -119,6 +119,8 @@ public void beforeClass() throws Exception {
 
   @AfterClass
   public void afterClass() throws Exception {
+    String testClassName = this.getShortClassName();

Review comment:
       changed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] kaisun2000 commented on a change in pull request #1227: Implement thread leakage check, fixing major thread leakage test and some flaky test

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on a change in pull request #1227:
URL: https://github.com/apache/helix/pull/1227#discussion_r499944992



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestQuotaBasedScheduling.java
##########
@@ -543,11 +561,17 @@ public void testJobQueueScheduling() throws InterruptedException {
     jobConfigBulider = new JobConfig.Builder().setCommand(JOB_COMMAND).addTaskConfigs(taskConfigs)
         .setJobCommandConfigMap(_jobCommandMap).setJobType("B");
 
+    jobNames.clear();
+    jobBuilders.clear();
     for (int i = 5; i < 10; i++) {
       String jobName = "JOB_" + i;
       lastJobName = jobName;
-      _driver.enqueueJob(queueName, jobName, jobConfigBulider);
+      jobNames.add(jobName);
+      jobBuilders.add(jobConfigBulider);
+      //_driver.enqueueJob(queueName, jobName, jobConfigBulider);

Review comment:
       removed,.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org