Posted to reviews@helix.apache.org by GitBox <gi...@apache.org> on 2020/07/07 20:15:30 UTC

[GitHub] [helix] alirezazamani opened a new pull request #1142: Respect Maximum Number Of Attempts for the tasks

alirezazamani opened a new pull request #1142:
URL: https://github.com/apache/helix/pull/1142


   ### Issues
   - [x] My PR addresses the following Helix issues and references them in the PR title:
   Fixes #957 
   Fixes #1141 
   
   ### Description
   - [x] Here are some details about my PR, including screenshots of any UI changes:
   
   In this PR, several parts of the scheduling logic have been changed so that the scheduler respects the maximum number of attempts configured for the tasks.
   
   It has also been observed that when a task is dropped and scheduled again, the maximum number of attempts is not respected. This PR adds further checks so that a task is not scheduled again once it has reached its maximum number of attempts.
   
   Note: Several of the tests have been changed, especially the ones with a strict MaxNumberOfAttempts. These tests needed to be changed because they mostly involve targeted jobs: once the participants are started and the targeted partition bounces between them, the tasks need to be reassigned and the number of attempts for each task increases. We had not noticed this before, because MaxNumberOfAttempts was not being respected for DROPPED tasks. In other words, the controller previously incremented a task's number of attempts without checking whether the task had already reached its maximum number of attempts.
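   As a rough illustration of the intended behavior (a simplified sketch, not the actual AbstractTaskDispatcher code; it assumes accessors along the lines of JobContext#getPartitionNumAttempts and JobConfig#getMaxAttemptsPerTask), the scheduler should refuse to hand out a task whose attempt count has already reached the configured limit, even if its last recorded state was DROPPED:
   
   ```java
   import org.apache.helix.task.JobConfig;
   import org.apache.helix.task.JobContext;
   
   // Simplified sketch of the guard: a partition whose attempt count has already
   // reached MaxAttemptsPerTask must not be scheduled again.
   final class MaxAttemptsGuard {
     static boolean canScheduleTask(JobConfig jobCfg, JobContext jobCtx, int partitionId) {
       int attempts = jobCtx.getPartitionNumAttempts(partitionId); // attempts recorded so far
       return attempts < jobCfg.getMaxAttemptsPerTask();           // configured per-task limit
     }
   }
   ```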
   
   
   ### Tests
   - [x] The following tests are written for this issue:
   TestMaxNumberOfAttemptsMasterSwitch
   
   - [x] The following is the result of the "mvn test" command on the appropriate modules:
   
   helix-core:
   ```
   [INFO] Results:
   [INFO] 
   [ERROR] Failures: 
   [ERROR]   TestJobQueueCleanUp.testJobQueueAutoCleanUp » ThreadTimeout Method org.testng....
   [ERROR]   TestClusterVerifier.testResourceSubset:225 expected:<false> but was:<true>
   [INFO] 
   [ERROR] Tests run: 1150, Failures: 2, Errors: 0, Skipped: 0
   [INFO] 
   [INFO] ------------------------------------------------------------------------
   [INFO] BUILD FAILURE
   [INFO] ------------------------------------------------------------------------
   [INFO] Total time:  01:25 h
   [INFO] Finished at: 2020-07-07T12:19:10-07:00
   [INFO] ------------------------------------------------------------------------
   ```
   The failed tests succeeded when run individually (TestJobQueueCleanUp has already been stabilized in the master branch):
   mvn test -Dtest="TestJobQueueCleanUp,TestClusterVerifier"
   ```
   [INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 43.827 s - in TestSuite
   [INFO] 
   [INFO] Results:
   [INFO] 
   [INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0
   [INFO] 
   [INFO] ------------------------------------------------------------------------
   [INFO] BUILD SUCCESS
   [INFO] ------------------------------------------------------------------------
   [INFO] Total time:  49.286 s
   [INFO] Finished at: 2020-07-07T13:03:07-07:00
   [INFO] ------------------------------------------------------------------------
   ```
   helix-rest:
   ```
   [INFO] Tests run: 159, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 41.184 s - in TestSuite
   [INFO] 
   [INFO] Results:
   [INFO] 
   [INFO] Tests run: 159, Failures: 0, Errors: 0, Skipped: 0
   [INFO] 
   [INFO] ------------------------------------------------------------------------
   [INFO] BUILD SUCCESS
   [INFO] ------------------------------------------------------------------------
   [INFO] Total time:  46.343 s
   [INFO] Finished at: 2020-07-07T13:14:33-07:00
   [INFO] ------------------------------------------------------------------------
   ```
   
   
   ### Commits
   
   - [x] My commits all reference appropriate Apache Helix GitHub issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Code Quality
   
   - [x] My diff has been formatted using helix-style.xml


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] narendly commented on a change in pull request #1142: (WIP) Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1142:
URL: https://github.com/apache/helix/pull/1142#discussion_r453339342



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestDeleteJobFromJobQueue.java
##########
@@ -44,10 +44,10 @@ public void testForceDeleteJobFromJobQueue() throws InterruptedException {
     // Create two jobs: job1 will complete fast, and job2 will be stuck in progress (taking a long
     // time to finish). The idea is to force-delete a stuck job (job2).
     JobConfig.Builder jobBuilder = JobConfig.Builder.fromMap(WorkflowGenerator.DEFAULT_JOB_CONFIG)
-        .setMaxAttemptsPerTask(1).setWorkflow(jobQueueName)
+        .setWorkflow(jobQueueName)

Review comment:
       I just realized that you did add a new test below, TestMaxNumberOfAttemptsMasterSwitch. Great job!
   
   Now, if we could reason through how the "rebalanceRunningTask" config should (or should not) apply here, we're set :)






[GitHub] [helix] alirezazamani commented on a change in pull request #1142: (WIP) Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
alirezazamani commented on a change in pull request #1142:
URL: https://github.com/apache/helix/pull/1142#discussion_r455406410



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestDeleteJobFromJobQueue.java
##########
@@ -44,10 +44,10 @@ public void testForceDeleteJobFromJobQueue() throws InterruptedException {
     // Create two jobs: job1 will complete fast, and job2 will be stuck in progress (taking a long
     // time to finish). The idea is to force-delete a stuck job (job2).
     JobConfig.Builder jobBuilder = JobConfig.Builder.fromMap(WorkflowGenerator.DEFAULT_JOB_CONFIG)
-        .setMaxAttemptsPerTask(1).setWorkflow(jobQueueName)
+        .setWorkflow(jobQueueName)

Review comment:
       I reverted this change for this test. However, we still need to increase the number of attempts for some tests in which we stop the participants and start them again.
   For example, in TestForceDeleteWorkflow.testDeleteCompletedWorkflowForcefully: in super.beforeClass() we start the participants, but once the test starts, we stop those participants and start them again. I have observed that if the workflow starts very close to the time the participants are started one by one (similar to a rolling upgrade), the Master might jump several times between the nodes. Hence, if we set maxNumberOfAttempts to 1, the test will fail, since we just started to respect maxNumberOfAttempts in this PR.
   
   About your second comment: yes, the newly added test checks maxNumberOfAttempts for a targeted-job change. For "rebalanceRunningTask", I checked the other tests and saw that this logic is already covered extensively in "TestRebalanceRunningTask" for different scenarios. If rebalanceRunningTask is set and we need to drop the task on one instance and run it on another (sending INIT->RUNNING), that is a new scheduling and we respect the maximum number of attempts. So rebalanceRunningTask keeps happening until we reach maxNumberOfAttempts. In other words, once a task reaches maxNumberOfAttempts it will not be scheduled again, so there is no new INIT->RUNNING after that point.
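   A minimal sketch of the kind of job configuration those tests now use (the queue name and the value 10 below are illustrative, not the exact values in the tests): give the tasks a larger attempt budget so that a few Master bounces during participant restarts do not exhaust it.
   
   ```java
   // Illustrative fragment in the style of the quoted test code; the queue name
   // and the attempt budget of 10 are arbitrary placeholders.
   JobConfig.Builder jobBuilder = JobConfig.Builder.fromMap(WorkflowGenerator.DEFAULT_JOB_CONFIG)
       .setWorkflow("myJobQueue")    // hypothetical queue name
       .setMaxAttemptsPerTask(10);   // > 1 so expected reassignments do not fail the job
   ```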






[GitHub] [helix] narendly commented on a change in pull request #1142: (WIP) Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1142:
URL: https://github.com/apache/helix/pull/1142#discussion_r453338650



##########
File path: helix-core/src/main/java/org/apache/helix/task/AbstractTaskDispatcher.java
##########
@@ -745,7 +743,7 @@ protected void scheduleForNextTask(String job, JobContext jobCtx, long now) {
     }
   }
 
-  // add all partitions that have been tried maxNumberAttempts
+  // add all partitions that are given up

Review comment:
       What is the distinction between "given up" and partitions that exceeded the # of maximum attempts?








[GitHub] [helix] narendly commented on a change in pull request #1142: (WIP) Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1142:
URL: https://github.com/apache/helix/pull/1142#discussion_r453338578



##########
File path: helix-core/src/main/java/org/apache/helix/task/AbstractTaskDispatcher.java
##########
@@ -560,7 +556,7 @@ protected void handleAdditionalTaskAssignment(
       excludeSet.addAll(assignedSet);
     }
     addCompletedTasks(excludeSet, jobCtx, allPartitions);
-    addGiveupPartitions(excludeSet, jobCtx, allPartitions, jobCfg);
+    addPartitionsReachedMaximumRetries(excludeSet, jobCtx, allPartitions, jobCfg);

Review comment:
       Thanks for renaming!






[GitHub] [helix] alirezazamani edited a comment on pull request #1142: Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
alirezazamani edited a comment on pull request #1142:
URL: https://github.com/apache/helix/pull/1142#issuecomment-661253597


   > Please address the comments. Approving because the pending issues are minor, style-related issues. Also, please rebase from master and run the tests before marking it as ready for being merged. Great work!
   
   @narendly Thank you for reviewing. I addressed your final comments as well. Let me do final checks and several rounds of testing. After that, I will post the final message for merging. Thanks




[GitHub] [helix] narendly commented on a change in pull request #1142: (WIP) Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1142:
URL: https://github.com/apache/helix/pull/1142#discussion_r453339581



##########
File path: helix-core/src/main/java/org/apache/helix/task/AbstractTaskDispatcher.java
##########
@@ -185,17 +185,13 @@ public void updatePreviousAssignedTasksStatus(
         switch (currState) {
         case RUNNING: {
           TaskPartitionState nextState = TaskPartitionState.RUNNING;
-          if (jobState == TaskState.TIMING_OUT) {
+          if (jobState == TaskState.TIMING_OUT || jobState == TaskState.TIMED_OUT
+              || jobState == TaskState.FAILING || jobState == TaskState.FAILED
+              || jobState == TaskState.ABORTED) {

Review comment:
       Also, just to confirm: this is to make sure that tasks stuck in the RUNNING state get updated to ABORTED?






[GitHub] [helix] alirezazamani commented on a change in pull request #1142: Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
alirezazamani commented on a change in pull request #1142:
URL: https://github.com/apache/helix/pull/1142#discussion_r457601880



##########
File path: helix-core/src/main/java/org/apache/helix/task/AbstractTaskDispatcher.java
##########
@@ -74,6 +75,12 @@ public void updatePreviousAssignedTasksStatus(
       Set<Integer> skippedPartitions, WorkflowControllerDataProvider cache,
       Map<String, Set<Integer>> tasksToDrop) {
 
+    // If a job is in the following states and contains running tasks, all of the running tasks
+    // should go to the ABORTED states.
+    Set<TaskState> jobStatesForRunningTaskToAbortedState =

Review comment:
       Yeah, I wasn't happy with the previous name either. I changed it as you suggested. Thanks.






[GitHub] [helix] alirezazamani commented on a change in pull request #1142: Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
alirezazamani commented on a change in pull request #1142:
URL: https://github.com/apache/helix/pull/1142#discussion_r457601516



##########
File path: helix-core/src/main/java/org/apache/helix/task/AbstractTaskDispatcher.java
##########
@@ -74,6 +75,12 @@ public void updatePreviousAssignedTasksStatus(
       Set<Integer> skippedPartitions, WorkflowControllerDataProvider cache,
       Map<String, Set<Integer>> tasksToDrop) {
 
+    // If a job is in the following states and contains running tasks, all of the running tasks
+    // should go to the ABORTED states.

Review comment:
       Fixed. Thanks






[GitHub] [helix] alirezazamani commented on a change in pull request #1142: (WIP) Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
alirezazamani commented on a change in pull request #1142:
URL: https://github.com/apache/helix/pull/1142#discussion_r455396910



##########
File path: helix-core/src/main/java/org/apache/helix/task/AbstractTaskDispatcher.java
##########
@@ -185,17 +185,13 @@ public void updatePreviousAssignedTasksStatus(
         switch (currState) {
         case RUNNING: {
           TaskPartitionState nextState = TaskPartitionState.RUNNING;
-          if (jobState == TaskState.TIMING_OUT) {
+          if (jobState == TaskState.TIMING_OUT || jobState == TaskState.TIMED_OUT
+              || jobState == TaskState.FAILING || jobState == TaskState.FAILED
+              || jobState == TaskState.ABORTED) {

Review comment:
       Done. I created a set for these states. Yes, and it also makes sure the task goes to the ABORTED state and its quota is released.






[GitHub] [helix] alirezazamani commented on pull request #1142: Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
alirezazamani commented on pull request #1142:
URL: https://github.com/apache/helix/pull/1142#issuecomment-661971760


   This PR is ready to be merged; it has been approved by @narendly.
   
   Final commit message:
   Respect maximum number of attempts for the tasks
   
   In this commit, several parts of the scheduling logic have been
   changed so that the scheduler respects the maximum number of
   attempts for the tasks.
   
   Also, it has been observed that when a task is dropped and
   scheduled again, the max number of attempts is not respected.
   In this commit, further checks are added to avoid scheduling a
   task again once it has reached its maximum number of attempts.




[GitHub] [helix] alirezazamani commented on a change in pull request #1142: Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
alirezazamani commented on a change in pull request #1142:
URL: https://github.com/apache/helix/pull/1142#discussion_r457601610



##########
File path: helix-core/src/main/java/org/apache/helix/task/AbstractTaskDispatcher.java
##########
@@ -745,8 +748,12 @@ protected void scheduleForNextTask(String job, JobContext jobCtx, long now) {
     }
   }
 
-  // add all partitions that have been tried maxNumberAttempts
-  protected static void addGiveupPartitions(Set<Integer> set, JobContext ctx,
+  // Add all partitions/tasks that are cannot be retried. These tasks are:
+  // 1- Task is in ABORTED or ERROR state.
+  // 2- Task has just gone to TIMED_OUT, ERROR or DROPPED states and has reached to its
+  // maxNumberAttempts
+  // These tasks determine whether the job needs to FAILED or not.
+  protected static void addGiveUpPartitions(Set<Integer> set, JobContext ctx,

Review comment:
       Done :-)






[GitHub] [helix] alirezazamani commented on a change in pull request #1142: Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
alirezazamani commented on a change in pull request #1142:
URL: https://github.com/apache/helix/pull/1142#discussion_r455407676



##########
File path: helix-core/src/main/java/org/apache/helix/task/AbstractTaskDispatcher.java
##########
@@ -560,7 +556,7 @@ protected void handleAdditionalTaskAssignment(
       excludeSet.addAll(assignedSet);
     }
     addCompletedTasks(excludeSet, jobCtx, allPartitions);
-    addGiveupPartitions(excludeSet, jobCtx, allPartitions, jobCfg);
+    addPartitionsReachedMaximumRetries(excludeSet, jobCtx, allPartitions, jobCfg);

Review comment:
       ;-)






[GitHub] [helix] narendly commented on a change in pull request #1142: Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1142:
URL: https://github.com/apache/helix/pull/1142#discussion_r457138461



##########
File path: helix-core/src/main/java/org/apache/helix/task/AbstractTaskDispatcher.java
##########
@@ -745,8 +748,12 @@ protected void scheduleForNextTask(String job, JobContext jobCtx, long now) {
     }
   }
 
-  // add all partitions that have been tried maxNumberAttempts
-  protected static void addGiveupPartitions(Set<Integer> set, JobContext ctx,
+  // Add all partitions/tasks that are cannot be retried. These tasks are:
+  // 1- Task is in ABORTED or ERROR state.
+  // 2- Task has just gone to TIMED_OUT, ERROR or DROPPED states and has reached to its
+  // maxNumberAttempts
+  // These tasks determine whether the job needs to FAILED or not.
+  protected static void addGiveUpPartitions(Set<Integer> set, JobContext ctx,

Review comment:
       Typo: `addGivenUpPartitions`






[GitHub] [helix] narendly merged pull request #1142: Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
narendly merged pull request #1142:
URL: https://github.com/apache/helix/pull/1142


   




[GitHub] [helix] alirezazamani commented on a change in pull request #1142: (WIP) Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
alirezazamani commented on a change in pull request #1142:
URL: https://github.com/apache/helix/pull/1142#discussion_r455399152



##########
File path: helix-core/src/main/java/org/apache/helix/task/AbstractTaskDispatcher.java
##########
@@ -745,7 +743,7 @@ protected void scheduleForNextTask(String job, JobContext jobCtx, long now) {
     }
   }
 
-  // add all partitions that have been tried maxNumberAttempts
+  // add all partitions that are given up

Review comment:
       Thanks for the comment. I added comments for more explanation.
   Also, I noticed the given-up tasks have already been added in the job dispatcher, so we do not need to add them again here. However, we want a sanity check that filters all of the tasks, regardless of their states, by whether they have reached their maximum number of attempts, because it has been observed that sometimes the state of a task might be INIT, DROPPED, or another state (if we have pending messages), and we would otherwise miss the max-number-of-attempts check for those tasks.
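   To make that concrete, here is a simplified sketch of such a sanity check (not the exact dispatcher code; it assumes accessors along the lines of JobContext#getPartitionNumAttempts and JobConfig#getMaxAttemptsPerTask):
   
   ```java
   import java.util.Set;
   
   import org.apache.helix.task.JobConfig;
   import org.apache.helix.task.JobContext;
   
   // Simplified sketch: regardless of a task's current state (INIT, DROPPED, ...),
   // a partition that has already used up its attempts is added to the exclude set
   // so that it is never handed out again.
   final class MaxAttemptsFilter {
     static void addPartitionsReachedMaximumRetries(Set<Integer> excludeSet, JobContext jobCtx,
         Iterable<Integer> allPartitions, JobConfig jobCfg) {
       for (int pId : allPartitions) {
         if (jobCtx.getPartitionNumAttempts(pId) >= jobCfg.getMaxAttemptsPerTask()) {
           excludeSet.add(pId);
         }
       }
     }
   }
   ```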






[GitHub] [helix] narendly commented on a change in pull request #1142: (WIP) Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1142:
URL: https://github.com/apache/helix/pull/1142#discussion_r453338543



##########
File path: helix-core/src/main/java/org/apache/helix/task/AbstractTaskDispatcher.java
##########
@@ -185,17 +185,13 @@ public void updatePreviousAssignedTasksStatus(
         switch (currState) {
         case RUNNING: {
           TaskPartitionState nextState = TaskPartitionState.RUNNING;
-          if (jobState == TaskState.TIMING_OUT) {
+          if (jobState == TaskState.TIMING_OUT || jobState == TaskState.TIMED_OUT
+              || jobState == TaskState.FAILING || jobState == TaskState.FAILED
+              || jobState == TaskState.ABORTED) {

Review comment:
       Would it be cleaner to define a set of enums (states) for which we should mark the tasks aborted? Then we could do a simple exists() check.
   
   Or, it seems that if we check for the negated (!=) condition, we won't have to list out so many states here?
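   A minimal sketch of that suggestion (the names below are illustrative, not the final code in the PR):
   
   ```java
   import java.util.EnumSet;
   import java.util.Set;
   
   import org.apache.helix.task.TaskState;
   
   // Collect the job states that should abort still-RUNNING tasks into one set,
   // then replace the chain of == comparisons with a single contains() check.
   final class AbortingJobStates {
     static final Set<TaskState> JOB_STATES_FOR_ABORTING_TASKS = EnumSet.of(
         TaskState.TIMING_OUT, TaskState.TIMED_OUT, TaskState.FAILING,
         TaskState.FAILED, TaskState.ABORTED);
   
     static boolean shouldAbortRunningTask(TaskState jobState) {
       return JOB_STATES_FOR_ABORTING_TASKS.contains(jobState);
     }
   }
   ```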






[GitHub] [helix] alirezazamani commented on pull request #1142: Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
alirezazamani commented on pull request #1142:
URL: https://github.com/apache/helix/pull/1142#issuecomment-661253597


   > Please address the comments. Approving because the pending issues are minor, style-related issues. Also, please rebase from master and run the tests before marking it as ready for being merged. Great work!
   
   @narendly Thank you for reviewing. I addressed your final comments as well. Let me do final checks and several rounds of testing. After that, I will post the final message for merging. Thanks




[GitHub] [helix] narendly commented on a change in pull request #1142: Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1142:
URL: https://github.com/apache/helix/pull/1142#discussion_r457136441



##########
File path: helix-core/src/main/java/org/apache/helix/task/AbstractTaskDispatcher.java
##########
@@ -74,6 +75,12 @@ public void updatePreviousAssignedTasksStatus(
       Set<Integer> skippedPartitions, WorkflowControllerDataProvider cache,
       Map<String, Set<Integer>> tasksToDrop) {
 
+    // If a job is in the following states and contains running tasks, all of the running tasks
+    // should go to the ABORTED states.

Review comment:
       Nit: 
   
   If a job is in one of the following states and its tasks are in RUNNING states, the tasks will be aborted.






[GitHub] [helix] narendly commented on a change in pull request #1142: Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1142:
URL: https://github.com/apache/helix/pull/1142#discussion_r457136829



##########
File path: helix-core/src/main/java/org/apache/helix/task/AbstractTaskDispatcher.java
##########
@@ -74,6 +75,12 @@ public void updatePreviousAssignedTasksStatus(
       Set<Integer> skippedPartitions, WorkflowControllerDataProvider cache,
       Map<String, Set<Integer>> tasksToDrop) {
 
+    // If a job is in the following states and contains running tasks, all of the running tasks
+    // should go to the ABORTED states.
+    Set<TaskState> jobStatesForRunningTaskToAbortedState =

Review comment:
       I think calling it `jobStatesForAbortingTasks` does the trick and is shorter. What do you think?






[GitHub] [helix] narendly commented on a change in pull request #1142: (WIP) Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1142:
URL: https://github.com/apache/helix/pull/1142#discussion_r453339180



##########
File path: helix-core/src/test/java/org/apache/helix/integration/task/TestDeleteJobFromJobQueue.java
##########
@@ -44,10 +44,10 @@ public void testForceDeleteJobFromJobQueue() throws InterruptedException {
     // Create two jobs: job1 will complete fast, and job2 will be stuck in progress (taking a long
     // time to finish). The idea is to force-delete a stuck job (job2).
     JobConfig.Builder jobBuilder = JobConfig.Builder.fromMap(WorkflowGenerator.DEFAULT_JOB_CONFIG)
-        .setMaxAttemptsPerTask(1).setWorkflow(jobQueueName)
+        .setWorkflow(jobQueueName)

Review comment:
       Could you explain further why this change is needed? If we add more nodes, the db partitions will get shifted. Are we adding nodes while the workflow is in progress and before its tasks complete? In that case, the shuffling of tasks (getting dropped and rescheduled onto another instance) makes sense and we should increment # of attempts.
   
   Since the focus of this test doesn't seem to be # of attempts, do you think we should add a new test that tests for this? We want to make sure that for targeted tasks, if we add more nodes and if that causes db partitions to be moved, then we should observe the tasks get dropped and retried with an increase in the # of attempts. One thing to watch out for here is "rebalanceRunningTask" - what role should that parameter play?
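   As a rough sketch of the kind of assertion such a test could make (the helper below is hypothetical and not the actual TestMaxNumberOfAttemptsMasterSwitch code): after the targeted partition moves and its task is dropped and rescheduled, the recorded attempt count should never exceed the configured MaxAttemptsPerTask.
   
   ```java
   import org.apache.helix.task.JobContext;
   import org.apache.helix.task.TaskDriver;
   import org.testng.Assert;
   
   // Hypothetical helper: verify that no task partition has been attempted more
   // times than the configured per-task limit, even after drops and reassignments.
   final class AttemptCountAssertions {
     static void assertAttemptsWithinLimit(TaskDriver driver, String namespacedJobName,
         int maxAttemptsPerTask, int numPartitions) {
       JobContext jobCtx = driver.getJobContext(namespacedJobName);
       for (int pId = 0; pId < numPartitions; pId++) {
         // getPartitionNumAttempts is assumed to report the attempts recorded so far
         Assert.assertTrue(jobCtx.getPartitionNumAttempts(pId) <= maxAttemptsPerTask);
       }
     }
   }
   ```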






[GitHub] [helix] narendly commented on a change in pull request #1142: (WIP) Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1142:
URL: https://github.com/apache/helix/pull/1142#discussion_r453338650



##########
File path: helix-core/src/main/java/org/apache/helix/task/AbstractTaskDispatcher.java
##########
@@ -745,7 +743,7 @@ protected void scheduleForNextTask(String job, JobContext jobCtx, long now) {
     }
   }
 
-  // add all partitions that have been tried maxNumberAttempts
+  // add all partitions that are given up

Review comment:
       What is the distinction between "given up" and partitions that exceeded the # of maximum attempts? Do you think we could clarify that here?
   
   I have a feeling that "given up" might be too broad or abstract.






[GitHub] [helix] alirezazamani edited a comment on pull request #1142: Respect Maximum Number Of Attempts for the tasks

Posted by GitBox <gi...@apache.org>.
alirezazamani edited a comment on pull request #1142:
URL: https://github.com/apache/helix/pull/1142#issuecomment-661253597


   > Please address the comments. Approving because the pending issues are minor, style-related issues. Also, please rebase from master and run the tests before marking it as ready for being merged. Great work!
   
   @narendly Thank you for reviewing. I addressed your final comments as well. Let me do final checks and run several rounds of tests. After that, I will post the final message for merging. Thanks

