Posted to reviews@aurora.apache.org by David McLaughlin <da...@dmclaughlin.com> on 2018/01/25 09:03:48 UTC

Review Request 65339: Fix infinite loop in Task State Machine due to TASK_UNKNOWN handling

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65339/
-----------------------------------------------------------

Review request for Aurora, Jordan Ly and Santhosh Kumar Shanmugham.


Bugs: AURORA-1966
    https://issues.apache.org/jira/browse/AURORA-1966


Repository: aurora


Description
-------

As reported in https://issues.apache.org/jira/browse/AURORA-1966, Mesos sends TASK_UNKNOWN when we try to kill (or reconcile) tasks that it does not know about. On master, this leads to an infinite loop. The sequence of events is:

1) We map TASK_UNKNOWN to PARTITIONED.
2) We react to the restarting or terminal -> PARTITIONED transition by telling Mesos "that is a bad state transition, that task should be dead".
3) Mesos replies that the task is TASK_UNKNOWN.
4) Go to 1.

AURORA-1966 describes just one case of this happening, but there are many other legitimate paths that lead to the same loop.
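
For illustration, the four steps above can be replayed with a toy model. This is a minimal sketch under stated assumptions, not the code in TaskStateMachine.java: the class, enum and method names (UnknownLoopSketch, translate, killsSent) are hypothetical, only TASK_UNKNOWN is modeled, and the loop is capped at maxRounds so the demonstration terminates.

```java
/** Toy replay of the TASK_UNKNOWN feedback loop; all names are hypothetical. */
final class UnknownLoopSketch {
  enum MesosState { TASK_UNKNOWN }
  enum Status { KILLED, PARTITIONED }

  /** Step 1: the scheduler maps TASK_UNKNOWN to PARTITIONED. */
  static Status translate(MesosState state) {
    return Status.PARTITIONED; // only TASK_UNKNOWN is modeled here
  }

  /**
   * Counts the kill requests sent for one task the scheduler believes is
   * terminal. Stops when no status update is pending or after maxRounds.
   */
  static int killsSent(boolean killOnBadTransition, int maxRounds) {
    Status stored = Status.KILLED;                 // scheduler's view of the task
    MesosState pending = MesosState.TASK_UNKNOWN;  // first update from Mesos
    int kills = 0;
    for (int round = 0; round < maxRounds && pending != null; round++) {
      Status reported = translate(pending);
      pending = null;
      // Step 2: terminal -> PARTITIONED is treated as a bad transition.
      boolean badTransition =
          stored == Status.KILLED && reported == Status.PARTITIONED;
      if (badTransition && killOnBadTransition) {
        kills++;
        // Step 3: Mesos does not know the task and answers TASK_UNKNOWN,
        // which feeds step 1 again (step 4).
        pending = MesosState.TASK_UNKNOWN;
      }
    }
    return kills;
  }

  public static void main(String[] args) {
    System.out.println(killsSent(true, 5));   // prints 5: one kill per round
    System.out.println(killsSent(false, 5));  // prints 0: the loop never starts
  }
}
```

With the kill response enabled, the number of kill requests is bounded only by the artificial cap; with it disabled, the exchange ends after the first update.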

This patch cleans up the logic. The two main changes:

1) Do not allow ASSIGNED -> PARTITIONED. This is not directly related to this bug, but I found this logic error while debugging. ASSIGNED is a transient state and is subject to the transient task timeout in the Scheduler, so we should not attempt to move to PARTITIONED during that window.
2) Do not try to kill tasks we think are terminal when Mesos tells us they are unknown. Originally we did this because "manageTerminalTasks" is also used for restarting tasks, but in both cases it never makes sense to respond to "I don't know about that task" with a request to kill it.
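
The two rules can be written down as a pair of predicates. This is a rough sketch only: PartitionRulesSketch, mayMoveToPartitioned and shouldSendKill are hypothetical names, the status set is deliberately reduced, and the authoritative behaviour for each state is whatever the linked diff implements.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical encoding of the two rules; not the real TaskStateMachine. */
final class PartitionRulesSketch {
  // Reduced status set for illustration; the scheduler has more states.
  enum Status { ASSIGNED, RUNNING, RESTARTING, KILLED, FINISHED }

  static final Set<Status> TERMINAL = EnumSet.of(Status.KILLED, Status.FINISHED);

  /**
   * Rule 1: ASSIGNED is transient and covered by the transient task timeout,
   * so it may not move to PARTITIONED. In this toy model only RUNNING may.
   */
  static boolean mayMoveToPartitioned(Status current) {
    return current == Status.RUNNING;
  }

  /**
   * Rule 2: a task we expect to be dead (terminal or restarting) is normally
   * killed on a bad transition, but never when Mesos reported it as unknown.
   */
  static boolean shouldSendKill(Status current, boolean mesosReportedUnknown) {
    boolean expectedDead =
        TERMINAL.contains(current) || current == Status.RESTARTING;
    return expectedDead && !mesosReportedUnknown;
  }

  public static void main(String[] args) {
    System.out.println(mayMoveToPartitioned(Status.ASSIGNED));      // false
    System.out.println(shouldSendKill(Status.KILLED, true));        // false
    System.out.println(shouldSendKill(Status.KILLED, false));       // true
  }
}
```

Keeping the "unknown" check separate from the "expected dead" check means the kill path still covers other unexpected updates for terminal or restarting tasks.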


Diffs
-----

  src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java b8ba5da729fcf5965b577c23e3062e5607bd07e7 
  src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 3d98fe651ad2b89a03044e8a06953a0cea876321 


Diff: https://reviews.apache.org/r/65339/diff/1/


Testing
-------

./gradlew test

Verified that this fixes the issue reported in AURORA-1966 by forcing a LaunchException in OfferManagerImpl in my Vagrant image and viewing the logs.


Thanks,

David McLaughlin


Re: Review Request 65339: Fix infinite loop in Task State Machine due to TASK_UNKNOWN handling

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65339/#review196227
-----------------------------------------------------------



Master (dbe7137) is red with this patch.
  ./build-support/jenkins/build.sh

:distZip
:assemble
:compileTestJava
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: /home/jenkins/jenkins-slave/workspace/AuroraBot/src/test/java/org/apache/aurora/scheduler/storage/durability/DurableStorageTest.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.

:processTestResources
:testClasses
:compileJmhJava
Note: /home/jenkins/jenkins-slave/workspace/AuroraBot/src/jmh/java/org/apache/aurora/benchmark/fakes/FakeSchedulerDriver.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

:processJmhResources NO-SOURCE
:jmhClasses
:checkstyleJmh
:checkstyleMain
:checkstyleTest
:licenseJmh UP-TO-DATE
:licenseMain UP-TO-DATE
:licenseTest UP-TO-DATE
:license UP-TO-DATE
:pmdJmh
:pmdMain
/home/jenkins/jenkins-slave/workspace/AuroraBot/src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java:182:	These nested if statements could be combined
/home/jenkins/jenkins-slave/workspace/AuroraBot/src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java:182:	These nested if statements could be combined
:pmdMain FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':pmdMain'.
> 2 PMD rule violations were found. See the report at: file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/pmd/main.html

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

* Get more help at https://help.gradle.org

BUILD FAILED in 4m 30s
38 actionable tasks: 29 executed, 9 up-to-date
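
The PMD message "These nested if statements could be combined" refers to an if whose body is only another if. The code at TaskStateMachine.java:182 is not shown in this thread, so the following is a generic illustration with hypothetical names (CollapsibleIfSketch, nested, combined), not the actual violation or its fix.

```java
/** Generic example of the nested-if pattern PMD reports; names are made up. */
final class CollapsibleIfSketch {
  /** Shape that triggers the warning: the outer if contains only an inner if. */
  static String nested(boolean a, boolean b) {
    if (a) {
      if (b) {
        return "both";
      }
    }
    return "not both";
  }

  /** Equivalent form with the two conditions joined by &&. */
  static String combined(boolean a, boolean b) {
    if (a && b) {
      return "both";
    }
    return "not both";
  }

  public static void main(String[] args) {
    boolean[] values = {false, true};
    for (boolean a : values) {
      for (boolean b : values) {
        // Both forms agree for every input combination.
        System.out.println(nested(a, b).equals(combined(a, b)));
      }
    }
  }
}
```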


I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot




Re: Review Request 65339: Fix infinite loop in Task State Machine due to TASK_UNKNOWN handling

Posted by Santhosh Kumar Shanmugham <sa...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65339/#review196451
-----------------------------------------------------------


Ship it!

- Santhosh Kumar Shanmugham




Re: Review Request 65339: Fix infinite loop in Task State Machine due to TASK_UNKNOWN handling

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65339/#review196230
-----------------------------------------------------------


Ship it!




Master (dbe7137) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot




Re: Review Request 65339: Fix infinite loop in Task State Machine due to TASK_UNKNOWN handling

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65339/
-----------------------------------------------------------

(Updated Jan. 25, 2018, 9:33 a.m.)


Review request for Aurora, Jordan Ly and Santhosh Kumar Shanmugham.


Bugs: AURORA-1966
    https://issues.apache.org/jira/browse/AURORA-1966


Repository: aurora


Diffs (updated)
-----

  src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java b8ba5da729fcf5965b577c23e3062e5607bd07e7 
  src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 3d98fe651ad2b89a03044e8a06953a0cea876321 


Diff: https://reviews.apache.org/r/65339/diff/2/

Changes: https://reviews.apache.org/r/65339/diff/1-2/


Testing
-------

./gradlew test

Verified that this fixes the issue reported in AURORA-1966 by forcing a LaunchException in OfferManagerImpl in my Vagrant image and viewing the logs.


Thanks,

David McLaughlin