You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@helix.apache.org by GitBox <gi...@apache.org> on 2020/11/03 13:52:28 UTC

[GitHub] [helix] NealSun96 opened a new pull request #1508: Fix ondemand rebalance flooding and log flooding caused by dangling jobs

NealSun96 opened a new pull request #1508:
URL: https://github.com/apache/helix/pull/1508


   ### Issues
   
   - [x] My PR addresses the following Helix issues and references them in the PR description:
   
   Fixes #1506, #1507
   
   ### Description
   
   - [x] Here are some details about my PR, including screenshots of any UI changes:
   
   This PR fixes 2 issues:
   1. Ondemand rebalance flooding after TF IS removal. This is caused by an old problem. A jobConfig can exist for the previous iterations of a workflow, but the job doesn't exist in the job DAG. Because of runtime DAG refresh logic, such a jobConfig will cause the runtime DAG to be refreshed every time the pipeline runs. A new runtime DAG causes all the jobs to be reprocessed, and during processing, the cleanup logic is run for every job that has already completed. Before IS removal, the cleanup logic attempts to delete the IS again, which has no effect; in the new code, an onDemand rebalance is triggered instead. 
   This "dangling job" will not be removed, so the job dag refresh keeps on happening, which causes the ondemand rebalances to keep on firing. 
   2. Log flooding for jobs that miss target resources after TF IS removal. Similarly, "dangling jobs" can have missing target resources once their target resources are deleted. Before IS removal, this is not a problem because this log is only triggered when the job is first processed; now, the processing logic happens every pipeline (since it's config based, instead of IS based), so the log could keep on firing. 
   
   ### Tests
   
   - [x] The following is the result of the "mvn test" command on the appropriate module:
   
   ```
   [ERROR] Tests run: 1236, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 4,922.413 s <<< FAILURE! - in TestSuite
   [ERROR] testDeactivateCluster(org.apache.helix.tools.TestHelixAdminCli)  Time elapsed: 2.383 s  <<< FAILURE!
   org.apache.helix.HelixException: There are still LEADER in the cluster, shut them down first.
           at org.apache.helix.tools.TestHelixAdminCli.testDeactivateCluster(TestHelixAdminCli.java:604)
   
   [INFO]
   [INFO] Results:
   [INFO]
   [ERROR] Failures:
   [ERROR]   TestHelixAdminCli.testDeactivateCluster:604 ยป Helix There are still LEADER in ...
   [INFO]
   [ERROR] Tests run: 1236, Failures: 1, Errors: 0, Skipped: 1
   [INFO]
   [INFO] ------------------------------------------------------------------------
   [INFO] BUILD FAILURE
   [INFO] ------------------------------------------------------------------------
   [INFO] Total time:  01:22 h
   [INFO] Finished at: 2020-11-02T18:52:09-08:00
   [INFO] ------------------------------------------------------------------------
   ```
   
   Rerun
   
   ```
   [INFO] Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 23.672 s - in org.apache.helix.tools.TestHelixAdminCli
   [INFO] 
   [INFO] Results:
   [INFO] 
   [INFO] Tests run: 10, Failures: 0, Errors: 0, Skipped: 0
   [INFO] 
   [INFO] ------------------------------------------------------------------------
   [INFO] BUILD SUCCESS
   [INFO] ------------------------------------------------------------------------
   [INFO] Total time:  28.917 s
   [INFO] Finished at: 2020-11-02T19:44:51-08:00
   [INFO] ------------------------------------------------------------------------
   ```
   
   ### Documentation (Optional)
   
   - In case of new functionality, my PR adds documentation in the following wiki page:
   
   (Link the GitHub wiki you added)
   
   ### Commits
   
   - My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Code Quality
   
   - My diff has been formatted using helix-style.xml 
   (helix-style-intellij.xml if IntelliJ IDE is used)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on pull request #1508: Fix ondemand rebalance flooding and log flooding caused by dangling jobs

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on pull request #1508:
URL: https://github.com/apache/helix/pull/1508#issuecomment-720973555


   I think the logic change is fine. But it's better to add the test coverage.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] alirezazamani commented on a change in pull request #1508: Fix ondemand rebalance flooding and log flooding caused by dangling jobs

Posted by GitBox <gi...@apache.org>.
alirezazamani commented on a change in pull request #1508:
URL: https://github.com/apache/helix/pull/1508#discussion_r516829905



##########
File path: helix-core/src/main/java/org/apache/helix/common/caches/TaskDataCache.java
##########
@@ -116,33 +116,6 @@ public synchronized boolean refresh(HelixDataAccessor accessor,
       }
     }
 
-    // The following 3 blocks is for finding a list of workflows whose JobDAGs have been changed
-    // because their RuntimeJobDags would need to be re-built
-    // newly added jobs
-    for (String jobName : newJobConfigs.keySet()) {
-      if (!_jobConfigMap.containsKey(jobName) && newJobConfigs.get(jobName).getWorkflow() != null) {
-        workflowsUpdated.add(newJobConfigs.get(jobName).getWorkflow());
-      }
-
-      // Only for JobQueues when a new job is enqueued, there exists a race condition where only
-      // JobConfig is updated and the RuntimeJobDag does not get updated because when the client
-      // (TaskDriver) submits, it creates JobConfig ZNode first and modifies its parent JobDag next.
-      // To ensure that they are both properly updated, check that workflow's DAG and existing
-      // JobConfigs are consistent for JobQueues
-      JobConfig jobConfig = newJobConfigs.get(jobName);
-      if (_workflowConfigMap.containsKey(jobConfig.getWorkflow())) {
-        WorkflowConfig workflowConfig = _workflowConfigMap.get(jobConfig.getWorkflow());
-        // Check that the job's parent workflow's DAG contains this job
-        if ((workflowConfig.isJobQueue() || !workflowConfig.isTerminable()) && !_runtimeJobDagMap
-            .get(workflowConfig.getWorkflowId()).getAllNodes().contains(jobName)) {
-          // Inconsistency between JobConfigs and DAGs found. Add the workflow to workflowsUpdated
-          // to rebuild the RuntimeJobDag
-          workflowsUpdated.add(jobConfig.getWorkflow());
-        }
-      }
-    }
-
-    // Removed jobs
     // This block makes sure that the workflow config has been changed.
     // This avoid the race condition where job config has been purged but job has not been deleted

Review comment:
       Maybe this comment needs to be changed. It was added for the purge race condition. Now it is serving as general-purpose workflow config modification.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on pull request #1508: Fix ondemand rebalance flooding and log flooding caused by dangling jobs

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on pull request #1508:
URL: https://github.com/apache/helix/pull/1508#issuecomment-721876923


   This PR is ready to be merged, approved by @alirezazamani
   Final commit message:
   ## Fix ondemand rebalance flooding and log flooding caused by dangling jobs  ##
   This PR changes runtime dag refresh logic to eliminate ondemand rebalance flooding caused by dangling jobs. This PR also changes log level to get rid of log flooding caused by missing target resources. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] alirezazamani merged pull request #1508: Fix ondemand rebalance flooding and log flooding caused by dangling jobs

Posted by GitBox <gi...@apache.org>.
alirezazamani merged pull request #1508:
URL: https://github.com/apache/helix/pull/1508


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] NealSun96 commented on a change in pull request #1508: Fix ondemand rebalance flooding and log flooding caused by dangling jobs

Posted by GitBox <gi...@apache.org>.
NealSun96 commented on a change in pull request #1508:
URL: https://github.com/apache/helix/pull/1508#discussion_r516415341



##########
File path: helix-core/src/main/java/org/apache/helix/common/caches/TaskDataCache.java
##########
@@ -116,33 +116,6 @@ public synchronized boolean refresh(HelixDataAccessor accessor,
       }
     }
 
-    // The following 3 blocks is for finding a list of workflows whose JobDAGs have been changed

Review comment:
       Why this section is removed: this section is refresh runtime dags based on JobConfigs. For every new job config, its workflow is added; for ever job config that doesn't exist in the jobqueue's dag (seems to be duplicate with the first part of logic), its workflow is added. 
   
   With dangling jobs, their jobConfigs are never deleted, so this section is activated all the time and the workflow, if exists, will have its DAG refreshed all the time. The problem is detailed in PR description. 
   
   I don't think the extra logic for new job config or for the race condition makes much sense, given that a new logic which compares WorkflowConfig version is in place - if a new job is added, eventually the DAG will be updated, which will cause a version change and a runtime dag refresh. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org