You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@aurora.apache.org by David McLaughlin <da...@dmclaughlin.com> on 2018/01/25 08:43:03 UTC

Review Request 65338: Fix error handling logic for launch failures

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65338/
-----------------------------------------------------------

Review request for Aurora, Jordan Ly and Santhosh Kumar Shanmugham.


Repository: aurora


Description
-------

Discovered while debugging https://issues.apache.org/jira/browse/AURORA-1966. Before we attempt to launch a task, we  move the task to ASSIGNED state. However, the code to deal with launch failures expects the task to be in PENDING state. So the ASSIGNED -> LOST state change fails, and instead we rely on the transient task timeout for correctness. This means errors that can be recovered from in seconds instead take at least five minutes (by default).


Diffs
-----

  src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java 916908bbf635a261c01777cd3a357ca457dd9726 
  src/test/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImplTest.java 533bb44953163e2148fa18c394a4338938dae205 


Diff: https://reviews.apache.org/r/65338/diff/1/


Testing
-------

./gradlew test

Also tested in Vagrant.


Thanks,

David McLaughlin


Re: Review Request 65338: Fix error handling logic for launch failures

Posted by Jordan Ly <jo...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65338/#review196275
-----------------------------------------------------------


Ship it!




Ship It!

- Jordan Ly


On Jan. 25, 2018, 6:21 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65338/
> -----------------------------------------------------------
> 
> (Updated Jan. 25, 2018, 6:21 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Santhosh Kumar Shanmugham.
> 
> 
> Bugs: AURORA-1965
>     https://issues.apache.org/jira/browse/AURORA-1965
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Discovered while debugging https://issues.apache.org/jira/browse/AURORA-1966. Before we attempt to launch a task, we  move the task to ASSIGNED state. However, the code to deal with launch failures expects the task to be in PENDING state. So the ASSIGNED -> LOST state change fails, and instead we rely on the transient task timeout for correctness. This means errors that can be recovered from in seconds instead take at least five minutes (by default).
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java 916908bbf635a261c01777cd3a357ca457dd9726 
>   src/test/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImplTest.java 533bb44953163e2148fa18c394a4338938dae205 
> 
> 
> Diff: https://reviews.apache.org/r/65338/diff/1/
> 
> 
> Testing
> -------
> 
> ./gradlew test
> 
> Also tested in Vagrant.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 65338: Fix error handling logic for launch failures

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65338/
-----------------------------------------------------------

(Updated Jan. 25, 2018, 6:21 p.m.)


Review request for Aurora, Jordan Ly and Santhosh Kumar Shanmugham.


Bugs: AURORA-1965
    https://issues.apache.org/jira/browse/AURORA-1965


Repository: aurora


Description
-------

Discovered while debugging https://issues.apache.org/jira/browse/AURORA-1966. Before we attempt to launch a task, we  move the task to ASSIGNED state. However, the code to deal with launch failures expects the task to be in PENDING state. So the ASSIGNED -> LOST state change fails, and instead we rely on the transient task timeout for correctness. This means errors that can be recovered from in seconds instead take at least five minutes (by default).


Diffs
-----

  src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java 916908bbf635a261c01777cd3a357ca457dd9726 
  src/test/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImplTest.java 533bb44953163e2148fa18c394a4338938dae205 


Diff: https://reviews.apache.org/r/65338/diff/1/


Testing
-------

./gradlew test

Also tested in Vagrant.


Thanks,

David McLaughlin


Re: Review Request 65338: Fix error handling logic for launch failures

Posted by Santhosh Kumar Shanmugham <sa...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65338/#review196273
-----------------------------------------------------------


Ship it!




Please link to - https://issues.apache.org/jira/browse/AURORA-1965 and resolve.

- Santhosh Kumar Shanmugham


On Jan. 25, 2018, 12:43 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65338/
> -----------------------------------------------------------
> 
> (Updated Jan. 25, 2018, 12:43 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Santhosh Kumar Shanmugham.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Discovered while debugging https://issues.apache.org/jira/browse/AURORA-1966. Before we attempt to launch a task, we  move the task to ASSIGNED state. However, the code to deal with launch failures expects the task to be in PENDING state. So the ASSIGNED -> LOST state change fails, and instead we rely on the transient task timeout for correctness. This means errors that can be recovered from in seconds instead take at least five minutes (by default).
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java 916908bbf635a261c01777cd3a357ca457dd9726 
>   src/test/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImplTest.java 533bb44953163e2148fa18c394a4338938dae205 
> 
> 
> Diff: https://reviews.apache.org/r/65338/diff/1/
> 
> 
> Testing
> -------
> 
> ./gradlew test
> 
> Also tested in Vagrant.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 65338: Fix error handling logic for launch failures

Posted by David McLaughlin <da...@dmclaughlin.com>.

> On Jan. 25, 2018, 2 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java
> > Line 135 (original), 135 (patched)
> > <https://reviews.apache.org/r/65338/diff/1/?file=1946598#file1946598line136>
> >
> >     To prevent future mistakes of the same kind, should we remove the `casState` state here and just transition to `LOST` unconditionally?
> >     
> >     I cannot think of a scenario right now where launching would fail, but we would still like the task to live on.

The CAS is to avoid race conditions (e.g. executing this block twice and trying to move from LOST to LOST). So I think it's a good idea to keep it?


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65338/#review196248
-----------------------------------------------------------


On Jan. 25, 2018, 8:43 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65338/
> -----------------------------------------------------------
> 
> (Updated Jan. 25, 2018, 8:43 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Santhosh Kumar Shanmugham.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Discovered while debugging https://issues.apache.org/jira/browse/AURORA-1966. Before we attempt to launch a task, we  move the task to ASSIGNED state. However, the code to deal with launch failures expects the task to be in PENDING state. So the ASSIGNED -> LOST state change fails, and instead we rely on the transient task timeout for correctness. This means errors that can be recovered from in seconds instead take at least five minutes (by default).
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java 916908bbf635a261c01777cd3a357ca457dd9726 
>   src/test/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImplTest.java 533bb44953163e2148fa18c394a4338938dae205 
> 
> 
> Diff: https://reviews.apache.org/r/65338/diff/1/
> 
> 
> Testing
> -------
> 
> ./gradlew test
> 
> Also tested in Vagrant.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 65338: Fix error handling logic for launch failures

Posted by Stephan Erb <se...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65338/#review196248
-----------------------------------------------------------




src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java
Line 135 (original), 135 (patched)
<https://reviews.apache.org/r/65338/#comment275796>

    To prevent future mistakes of the same kind, should we remove the `casState` state here and just transition to `LOST` unconditionally?
    
    I cannot think of a scenario right now where launching would fail, but we would still like the task to live on.


- Stephan Erb


On Jan. 25, 2018, 9:43 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65338/
> -----------------------------------------------------------
> 
> (Updated Jan. 25, 2018, 9:43 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Santhosh Kumar Shanmugham.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Discovered while debugging https://issues.apache.org/jira/browse/AURORA-1966. Before we attempt to launch a task, we  move the task to ASSIGNED state. However, the code to deal with launch failures expects the task to be in PENDING state. So the ASSIGNED -> LOST state change fails, and instead we rely on the transient task timeout for correctness. This means errors that can be recovered from in seconds instead take at least five minutes (by default).
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java 916908bbf635a261c01777cd3a357ca457dd9726 
>   src/test/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImplTest.java 533bb44953163e2148fa18c394a4338938dae205 
> 
> 
> Diff: https://reviews.apache.org/r/65338/diff/1/
> 
> 
> Testing
> -------
> 
> ./gradlew test
> 
> Also tested in Vagrant.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 65338: Fix error handling logic for launch failures

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65338/#review196226
-----------------------------------------------------------


Ship it!




Master (dbe7137) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Jan. 25, 2018, 8:43 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65338/
> -----------------------------------------------------------
> 
> (Updated Jan. 25, 2018, 8:43 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Santhosh Kumar Shanmugham.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Discovered while debugging https://issues.apache.org/jira/browse/AURORA-1966. Before we attempt to launch a task, we  move the task to ASSIGNED state. However, the code to deal with launch failures expects the task to be in PENDING state. So the ASSIGNED -> LOST state change fails, and instead we rely on the transient task timeout for correctness. This means errors that can be recovered from in seconds instead take at least five minutes (by default).
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java 916908bbf635a261c01777cd3a357ca457dd9726 
>   src/test/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImplTest.java 533bb44953163e2148fa18c394a4338938dae205 
> 
> 
> Diff: https://reviews.apache.org/r/65338/diff/1/
> 
> 
> Testing
> -------
> 
> ./gradlew test
> 
> Also tested in Vagrant.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>