You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@aurora.apache.org by Santhosh Kumar Shanmugham <sa...@gmail.com> on 2018/07/18 20:27:14 UTC

Review Request 67967: Unhandled exception should not strand runner in STARTING state.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67967/
-----------------------------------------------------------

Review request for Aurora, David McLaughlin, Jordan Ly, Reza Motamedi, and Stephan Erb.


Repository: aurora


Description
-------

If the ThermoTaskRunner encounters an Exception when trying to
fork the process, it bubbles this up to the Executor which does
not handle execptions other than TaskError. This leads to the
executor leaving the task in STARTING state and we end up with
tasks that get stranded in this state.

Fix it so that any unknown expection that is thrown when starting
a runner leads to task failure and get marked as FAILED.


Diffs
-----

  src/main/python/apache/aurora/executor/aurora_executor.py 8a9958fffc2312686dccc7daf6d216631d4c956e 
  src/test/python/apache/aurora/executor/test_thermos_executor.py f6ae1be5d56bfd845bd09db67ef0000a92091136 


Diff: https://reviews.apache.org/r/67967/diff/1/


Testing
-------

./gradlew test
./pants test src/test/python/apache::


Thanks,

Santhosh Kumar Shanmugham


Re: Review Request 67967: Unhandled exception should not strand runner in STARTING state.

Posted by Stephan Erb <se...@apache.org>.

> On July 18, 2018, 11:21 p.m., Stephan Erb wrote:
> > src/main/python/apache/aurora/executor/aurora_executor.py
> > Lines 159 (patched)
> > <https://reviews.apache.org/r/67967/diff/1/?file=2061542#file2061542line159>
> >
> >     Should we use TASK_LOST here instead? Most users interpret TASK_FAILED as their responsibility whereas TASK_LOST is more of a misshap of Aurora/Mesos/Thermos. I would think an unknown exception in the runner is part of the latter category.
> 
> Santhosh Kumar Shanmugham wrote:
>     Hmm. Then we can argue that failure to create sandbox or fork the process etc also should be treated as TASK_LOST? At Twitter this is really not going to help us, since we have platform wrapper that cause TASK_FAILED and it is already hard to differentiate user configuration failures against platform dependency failures.
>     
>     I wanted to keep this consistent with the rest. If TASK_LOST makes more sense for you I can update it.

You make a good point. Let's keep it at FAILED. If really necessary, I could always come back later with a more complete proposal.


- Stephan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67967/#review206209
-----------------------------------------------------------


On July 18, 2018, 10:27 p.m., Santhosh Kumar Shanmugham wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67967/
> -----------------------------------------------------------
> 
> (Updated July 18, 2018, 10:27 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Jordan Ly, Reza Motamedi, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> If the ThermoTaskRunner encounters an Exception when trying to
> fork the process, it bubbles this up to the Executor which does
> not handle execptions other than TaskError. This leads to the
> executor leaving the task in STARTING state and we end up with
> tasks that get stranded in this state.
> 
> Fix it so that any unknown expection that is thrown when starting
> a runner leads to task failure and get marked as FAILED.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 8a9958fffc2312686dccc7daf6d216631d4c956e 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py f6ae1be5d56bfd845bd09db67ef0000a92091136 
> 
> 
> Diff: https://reviews.apache.org/r/67967/diff/1/
> 
> 
> Testing
> -------
> 
> ./gradlew test
> ./pants test src/test/python/apache::
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>


Re: Review Request 67967: Unhandled exception should not strand runner in STARTING state.

Posted by Santhosh Kumar Shanmugham <sa...@gmail.com>.

> On July 18, 2018, 2:21 p.m., Stephan Erb wrote:
> > src/main/python/apache/aurora/executor/aurora_executor.py
> > Lines 159 (patched)
> > <https://reviews.apache.org/r/67967/diff/1/?file=2061542#file2061542line159>
> >
> >     Should we use TASK_LOST here instead? Most users interpret TASK_FAILED as their responsibility whereas TASK_LOST is more of a misshap of Aurora/Mesos/Thermos. I would think an unknown exception in the runner is part of the latter category.

Hmm. Then we can argue that failure to create sandbox or fork the process etc also should be treated as TASK_LOST? At Twitter this is really not going to help us, since we have platform wrapper that cause TASK_FAILED and it is already hard to differentiate user configuration failures against platform dependency failures.

I wanted to keep this consistent with the rest. If TASK_LOST makes more sense for you I can update it.


- Santhosh Kumar


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67967/#review206209
-----------------------------------------------------------


On July 18, 2018, 1:27 p.m., Santhosh Kumar Shanmugham wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67967/
> -----------------------------------------------------------
> 
> (Updated July 18, 2018, 1:27 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Jordan Ly, Reza Motamedi, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> If the ThermoTaskRunner encounters an Exception when trying to
> fork the process, it bubbles this up to the Executor which does
> not handle execptions other than TaskError. This leads to the
> executor leaving the task in STARTING state and we end up with
> tasks that get stranded in this state.
> 
> Fix it so that any unknown expection that is thrown when starting
> a runner leads to task failure and get marked as FAILED.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 8a9958fffc2312686dccc7daf6d216631d4c956e 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py f6ae1be5d56bfd845bd09db67ef0000a92091136 
> 
> 
> Diff: https://reviews.apache.org/r/67967/diff/1/
> 
> 
> Testing
> -------
> 
> ./gradlew test
> ./pants test src/test/python/apache::
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>


Re: Review Request 67967: Unhandled exception should not strand runner in STARTING state.

Posted by Santhosh Kumar Shanmugham <sa...@gmail.com>.

> On July 18, 2018, 2:21 p.m., Stephan Erb wrote:
> > src/main/python/apache/aurora/executor/aurora_executor.py
> > Lines 159 (patched)
> > <https://reviews.apache.org/r/67967/diff/1/?file=2061542#file2061542line159>
> >
> >     Should we use TASK_LOST here instead? Most users interpret TASK_FAILED as their responsibility whereas TASK_LOST is more of a misshap of Aurora/Mesos/Thermos. I would think an unknown exception in the runner is part of the latter category.
> 
> Santhosh Kumar Shanmugham wrote:
>     Hmm. Then we can argue that failure to create sandbox or fork the process etc also should be treated as TASK_LOST? At Twitter this is really not going to help us, since we have platform wrapper that cause TASK_FAILED and it is already hard to differentiate user configuration failures against platform dependency failures.
>     
>     I wanted to keep this consistent with the rest. If TASK_LOST makes more sense for you I can update it.
> 
> Stephan Erb wrote:
>     You make a good point. Let's keep it at FAILED. If really necessary, I could always come back later with a more complete proposal.

I think the differentiation of user vs platform failure needs a whole lot of clean up in the executor codebase. We have been putting this work off for sometime but we are starting to realize that we need this data to make better decisions.


- Santhosh Kumar


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67967/#review206209
-----------------------------------------------------------


On July 18, 2018, 1:27 p.m., Santhosh Kumar Shanmugham wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67967/
> -----------------------------------------------------------
> 
> (Updated July 18, 2018, 1:27 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Jordan Ly, Reza Motamedi, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> If the ThermoTaskRunner encounters an Exception when trying to
> fork the process, it bubbles this up to the Executor which does
> not handle execptions other than TaskError. This leads to the
> executor leaving the task in STARTING state and we end up with
> tasks that get stranded in this state.
> 
> Fix it so that any unknown expection that is thrown when starting
> a runner leads to task failure and get marked as FAILED.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 8a9958fffc2312686dccc7daf6d216631d4c956e 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py f6ae1be5d56bfd845bd09db67ef0000a92091136 
> 
> 
> Diff: https://reviews.apache.org/r/67967/diff/1/
> 
> 
> Testing
> -------
> 
> ./gradlew test
> ./pants test src/test/python/apache::
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>


Re: Review Request 67967: Unhandled exception should not strand runner in STARTING state.

Posted by Stephan Erb <se...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67967/#review206209
-----------------------------------------------------------




src/main/python/apache/aurora/executor/aurora_executor.py
Lines 159 (patched)
<https://reviews.apache.org/r/67967/#comment289101>

    Should we use TASK_LOST here instead? Most users interpret TASK_FAILED as their responsibility whereas TASK_LOST is more of a misshap of Aurora/Mesos/Thermos. I would think an unknown exception in the runner is part of the latter category.


- Stephan Erb


On July 18, 2018, 10:27 p.m., Santhosh Kumar Shanmugham wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67967/
> -----------------------------------------------------------
> 
> (Updated July 18, 2018, 10:27 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Jordan Ly, Reza Motamedi, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> If the ThermoTaskRunner encounters an Exception when trying to
> fork the process, it bubbles this up to the Executor which does
> not handle execptions other than TaskError. This leads to the
> executor leaving the task in STARTING state and we end up with
> tasks that get stranded in this state.
> 
> Fix it so that any unknown expection that is thrown when starting
> a runner leads to task failure and get marked as FAILED.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 8a9958fffc2312686dccc7daf6d216631d4c956e 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py f6ae1be5d56bfd845bd09db67ef0000a92091136 
> 
> 
> Diff: https://reviews.apache.org/r/67967/diff/1/
> 
> 
> Testing
> -------
> 
> ./gradlew test
> ./pants test src/test/python/apache::
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>


Re: Review Request 67967: Unhandled exception should not strand runner in STARTING state.

Posted by Stephan Erb <se...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67967/#review206214
-----------------------------------------------------------


Ship it!




Ship It!

- Stephan Erb


On July 18, 2018, 10:27 p.m., Santhosh Kumar Shanmugham wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67967/
> -----------------------------------------------------------
> 
> (Updated July 18, 2018, 10:27 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Jordan Ly, Reza Motamedi, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> If the ThermoTaskRunner encounters an Exception when trying to
> fork the process, it bubbles this up to the Executor which does
> not handle execptions other than TaskError. This leads to the
> executor leaving the task in STARTING state and we end up with
> tasks that get stranded in this state.
> 
> Fix it so that any unknown expection that is thrown when starting
> a runner leads to task failure and get marked as FAILED.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 8a9958fffc2312686dccc7daf6d216631d4c956e 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py f6ae1be5d56bfd845bd09db67ef0000a92091136 
> 
> 
> Diff: https://reviews.apache.org/r/67967/diff/1/
> 
> 
> Testing
> -------
> 
> ./gradlew test
> ./pants test src/test/python/apache::
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>


Re: Review Request 67967: Unhandled exception should not strand runner in STARTING state.

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67967/#review206205
-----------------------------------------------------------


Ship it!




Ship It!

- David McLaughlin


On July 18, 2018, 8:27 p.m., Santhosh Kumar Shanmugham wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67967/
> -----------------------------------------------------------
> 
> (Updated July 18, 2018, 8:27 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Jordan Ly, Reza Motamedi, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> If the ThermoTaskRunner encounters an Exception when trying to
> fork the process, it bubbles this up to the Executor which does
> not handle execptions other than TaskError. This leads to the
> executor leaving the task in STARTING state and we end up with
> tasks that get stranded in this state.
> 
> Fix it so that any unknown expection that is thrown when starting
> a runner leads to task failure and get marked as FAILED.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 8a9958fffc2312686dccc7daf6d216631d4c956e 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py f6ae1be5d56bfd845bd09db67ef0000a92091136 
> 
> 
> Diff: https://reviews.apache.org/r/67967/diff/1/
> 
> 
> Testing
> -------
> 
> ./gradlew test
> ./pants test src/test/python/apache::
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>


Re: Review Request 67967: Unhandled exception should not strand runner in STARTING state.

Posted by Jordan Ly <jo...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67967/#review206206
-----------------------------------------------------------


Ship it!




Ship It!

- Jordan Ly


On July 18, 2018, 8:27 p.m., Santhosh Kumar Shanmugham wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67967/
> -----------------------------------------------------------
> 
> (Updated July 18, 2018, 8:27 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Jordan Ly, Reza Motamedi, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> If the ThermoTaskRunner encounters an Exception when trying to
> fork the process, it bubbles this up to the Executor which does
> not handle execptions other than TaskError. This leads to the
> executor leaving the task in STARTING state and we end up with
> tasks that get stranded in this state.
> 
> Fix it so that any unknown expection that is thrown when starting
> a runner leads to task failure and get marked as FAILED.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 8a9958fffc2312686dccc7daf6d216631d4c956e 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py f6ae1be5d56bfd845bd09db67ef0000a92091136 
> 
> 
> Diff: https://reviews.apache.org/r/67967/diff/1/
> 
> 
> Testing
> -------
> 
> ./gradlew test
> ./pants test src/test/python/apache::
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>


Re: Review Request 67967: Unhandled exception should not strand runner in STARTING state.

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67967/#review206216
-----------------------------------------------------------


Ship it!




Master (efe8656) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On July 18, 2018, 8:27 p.m., Santhosh Kumar Shanmugham wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67967/
> -----------------------------------------------------------
> 
> (Updated July 18, 2018, 8:27 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Jordan Ly, Reza Motamedi, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> If the ThermoTaskRunner encounters an Exception when trying to
> fork the process, it bubbles this up to the Executor which does
> not handle execptions other than TaskError. This leads to the
> executor leaving the task in STARTING state and we end up with
> tasks that get stranded in this state.
> 
> Fix it so that any unknown expection that is thrown when starting
> a runner leads to task failure and get marked as FAILED.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 8a9958fffc2312686dccc7daf6d216631d4c956e 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py f6ae1be5d56bfd845bd09db67ef0000a92091136 
> 
> 
> Diff: https://reviews.apache.org/r/67967/diff/1/
> 
> 
> Testing
> -------
> 
> ./gradlew test
> ./pants test src/test/python/apache::
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>