You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@aurora.apache.org by Stephan Erb <se...@apache.org> on 2017/06/21 21:35:09 UTC
Review Request 60306: Ensure Thermos is not overloaded by an unlimited
number of lost processes
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60306/
-----------------------------------------------------------
Review request for Aurora.
Repository: aurora
Description
-------
Included changes:
* Thermos may consider launched processes to be LOST. Instead of
restarting those immediately, the restarts will now be at least
`min_duration` seconds apart. Restarts will also be capped at the
TOTAL_RUN_LIMIT of 100 restarts. This ensures neither Thermos nor the
observer will be overloaded by checkpoints. The handling of the LOST
state is now consistent with the handling of both FAILED and FINISHED.
* Mark the success_transition and failure_transition as private. They
are only used within `TaskPlanner` itself.
* Fix documented default of `min_duration` (i.e 5s rather than 15s).
Diffs
-----
docs/reference/configuration.md 6a9a3ff988dd2102aa9d22e27f22487f18423894
src/main/python/apache/thermos/common/planner.py da5120f8f0c2489927a03e9d78ccb4f9b6eb275f
src/test/python/apache/thermos/common/test_task_planner.py 132c1ec4977143b79df8d13804370e76a553c3b9
Diff: https://reviews.apache.org/r/60306/diff/1/
Testing
-------
./pants test.pytest src/test/python::
Thanks,
Stephan Erb
Re: Review Request 60306: Ensure Thermos is not overloaded by an
unlimited number of lost processes
Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60306/#review178593
-----------------------------------------------------------
Ship it!
Master (aae39a8) is green with this patch.
./build-support/jenkins/build.sh
I will refresh this build result if you post a review containing "@ReviewBot retry"
- Aurora ReviewBot
On June 21, 2017, 10:48 p.m., Stephan Erb wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60306/
> -----------------------------------------------------------
>
> (Updated June 21, 2017, 10:48 p.m.)
>
>
> Review request for Aurora.
>
>
> Repository: aurora
>
>
> Description
> -------
>
> Included changes:
>
> * Thermos may consider launched processes to be LOST. Instead of
> restarting those immediately, the restarts will now be at least
> `min_duration` seconds apart. Restarts will also be capped at the
> TOTAL_RUN_LIMIT of 100 restarts. This ensures neither Thermos nor the
> observer will be overloaded by checkpoints. The handling of the LOST
> state is now consistent with the handling of both FAILED and FINISHED.
> * Mark the success_transition and failure_transition as private. They
> are only used within `TaskPlanner` itself.
> * Fix documented default of `min_duration` (i.e 5s rather than 15s).
>
>
> Diffs
> -----
>
> docs/reference/configuration.md 6a9a3ff988dd2102aa9d22e27f22487f18423894
> src/main/python/apache/thermos/common/planner.py da5120f8f0c2489927a03e9d78ccb4f9b6eb275f
> src/test/python/apache/thermos/common/test_task_planner.py 132c1ec4977143b79df8d13804370e76a553c3b9
>
>
> Diff: https://reviews.apache.org/r/60306/diff/2/
>
>
> Testing
> -------
>
> ./pants test.pytest src/test/python::
>
>
> File Attachments
> ----------------
>
> massive_cpu_spike.png
> https://reviews.apache.org/media/uploaded/files/2017/06/21/57cbc6e6-2cd5-4e92-995a-e0e05a57c359__massive_cpu_spike.png
> massive_restart_count.png
> https://reviews.apache.org/media/uploaded/files/2017/06/21/c4cbab7c-1a48-4cf0-92ab-5fa9444813c7__massive_restart_count.png
>
>
> Thanks,
>
> Stephan Erb
>
>
Re: Review Request 60306: Ensure Thermos is not overloaded by an
unlimited number of lost processes
Posted by Stephan Erb <se...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60306/
-----------------------------------------------------------
(Updated June 22, 2017, 12:48 a.m.)
Review request for Aurora.
Changes
-------
Review changes
Repository: aurora
Description
-------
Included changes:
* Thermos may consider launched processes to be LOST. Instead of
restarting those immediately, the restarts will now be at least
`min_duration` seconds apart. Restarts will also be capped at the
TOTAL_RUN_LIMIT of 100 restarts. This ensures neither Thermos nor the
observer will be overloaded by checkpoints. The handling of the LOST
state is now consistent with the handling of both FAILED and FINISHED.
* Mark the success_transition and failure_transition as private. They
are only used within `TaskPlanner` itself.
* Fix documented default of `min_duration` (i.e 5s rather than 15s).
Diffs (updated)
-----
docs/reference/configuration.md 6a9a3ff988dd2102aa9d22e27f22487f18423894
src/main/python/apache/thermos/common/planner.py da5120f8f0c2489927a03e9d78ccb4f9b6eb275f
src/test/python/apache/thermos/common/test_task_planner.py 132c1ec4977143b79df8d13804370e76a553c3b9
Diff: https://reviews.apache.org/r/60306/diff/2/
Changes: https://reviews.apache.org/r/60306/diff/1-2/
Testing
-------
./pants test.pytest src/test/python::
File Attachments
----------------
massive_cpu_spike.png
https://reviews.apache.org/media/uploaded/files/2017/06/21/57cbc6e6-2cd5-4e92-995a-e0e05a57c359__massive_cpu_spike.png
massive_restart_count.png
https://reviews.apache.org/media/uploaded/files/2017/06/21/c4cbab7c-1a48-4cf0-92ab-5fa9444813c7__massive_restart_count.png
Thanks,
Stephan Erb
Re: Review Request 60306: Ensure Thermos is not overloaded by an
unlimited number of lost processes
Posted by Zameer Manji <zm...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60306/#review178571
-----------------------------------------------------------
Ship it!
I once noticed this in production but I neglected to investigate further.
Thanks for this fix.
src/main/python/apache/thermos/common/planner.py
Lines 307 (patched)
<https://reviews.apache.org/r/60306/#comment252636>
There is duplication here with the failure case.
It might be worthwhile to extract this logic out to a private method instead of duplicating it here.
- Zameer Manji
On June 21, 2017, 2:37 p.m., Stephan Erb wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60306/
> -----------------------------------------------------------
>
> (Updated June 21, 2017, 2:37 p.m.)
>
>
> Review request for Aurora.
>
>
> Repository: aurora
>
>
> Description
> -------
>
> Included changes:
>
> * Thermos may consider launched processes to be LOST. Instead of
> restarting those immediately, the restarts will now be at least
> `min_duration` seconds apart. Restarts will also be capped at the
> TOTAL_RUN_LIMIT of 100 restarts. This ensures neither Thermos nor the
> observer will be overloaded by checkpoints. The handling of the LOST
> state is now consistent with the handling of both FAILED and FINISHED.
> * Mark the success_transition and failure_transition as private. They
> are only used within `TaskPlanner` itself.
> * Fix documented default of `min_duration` (i.e 5s rather than 15s).
>
>
> Diffs
> -----
>
> docs/reference/configuration.md 6a9a3ff988dd2102aa9d22e27f22487f18423894
> src/main/python/apache/thermos/common/planner.py da5120f8f0c2489927a03e9d78ccb4f9b6eb275f
> src/test/python/apache/thermos/common/test_task_planner.py 132c1ec4977143b79df8d13804370e76a553c3b9
>
>
> Diff: https://reviews.apache.org/r/60306/diff/1/
>
>
> Testing
> -------
>
> ./pants test.pytest src/test/python::
>
>
> File Attachments
> ----------------
>
> massive_cpu_spike.png
> https://reviews.apache.org/media/uploaded/files/2017/06/21/57cbc6e6-2cd5-4e92-995a-e0e05a57c359__massive_cpu_spike.png
> massive_restart_count.png
> https://reviews.apache.org/media/uploaded/files/2017/06/21/c4cbab7c-1a48-4cf0-92ab-5fa9444813c7__massive_restart_count.png
>
>
> Thanks,
>
> Stephan Erb
>
>
Re: Review Request 60306: Ensure Thermos is not overloaded by an
unlimited number of lost processes
Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60306/#review178578
-----------------------------------------------------------
Ship it!
Master (aae39a8) is green with this patch.
./build-support/jenkins/build.sh
I will refresh this build result if you post a review containing "@ReviewBot retry"
- Aurora ReviewBot
On June 21, 2017, 9:37 p.m., Stephan Erb wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60306/
> -----------------------------------------------------------
>
> (Updated June 21, 2017, 9:37 p.m.)
>
>
> Review request for Aurora.
>
>
> Repository: aurora
>
>
> Description
> -------
>
> Included changes:
>
> * Thermos may consider launched processes to be LOST. Instead of
> restarting those immediately, the restarts will now be at least
> `min_duration` seconds apart. Restarts will also be capped at the
> TOTAL_RUN_LIMIT of 100 restarts. This ensures neither Thermos nor the
> observer will be overloaded by checkpoints. The handling of the LOST
> state is now consistent with the handling of both FAILED and FINISHED.
> * Mark the success_transition and failure_transition as private. They
> are only used within `TaskPlanner` itself.
> * Fix documented default of `min_duration` (i.e 5s rather than 15s).
>
>
> Diffs
> -----
>
> docs/reference/configuration.md 6a9a3ff988dd2102aa9d22e27f22487f18423894
> src/main/python/apache/thermos/common/planner.py da5120f8f0c2489927a03e9d78ccb4f9b6eb275f
> src/test/python/apache/thermos/common/test_task_planner.py 132c1ec4977143b79df8d13804370e76a553c3b9
>
>
> Diff: https://reviews.apache.org/r/60306/diff/1/
>
>
> Testing
> -------
>
> ./pants test.pytest src/test/python::
>
>
> File Attachments
> ----------------
>
> massive_cpu_spike.png
> https://reviews.apache.org/media/uploaded/files/2017/06/21/57cbc6e6-2cd5-4e92-995a-e0e05a57c359__massive_cpu_spike.png
> massive_restart_count.png
> https://reviews.apache.org/media/uploaded/files/2017/06/21/c4cbab7c-1a48-4cf0-92ab-5fa9444813c7__massive_restart_count.png
>
>
> Thanks,
>
> Stephan Erb
>
>
Re: Review Request 60306: Ensure Thermos is not overloaded by an
unlimited number of lost processes
Posted by Joshua Cohen <jc...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60306/#review178556
-----------------------------------------------------------
Ship it!
Ship It!
- Joshua Cohen
On June 21, 2017, 9:37 p.m., Stephan Erb wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60306/
> -----------------------------------------------------------
>
> (Updated June 21, 2017, 9:37 p.m.)
>
>
> Review request for Aurora.
>
>
> Repository: aurora
>
>
> Description
> -------
>
> Included changes:
>
> * Thermos may consider launched processes to be LOST. Instead of
> restarting those immediately, the restarts will now be at least
> `min_duration` seconds apart. Restarts will also be capped at the
> TOTAL_RUN_LIMIT of 100 restarts. This ensures neither Thermos nor the
> observer will be overloaded by checkpoints. The handling of the LOST
> state is now consistent with the handling of both FAILED and FINISHED.
> * Mark the success_transition and failure_transition as private. They
> are only used within `TaskPlanner` itself.
> * Fix documented default of `min_duration` (i.e 5s rather than 15s).
>
>
> Diffs
> -----
>
> docs/reference/configuration.md 6a9a3ff988dd2102aa9d22e27f22487f18423894
> src/main/python/apache/thermos/common/planner.py da5120f8f0c2489927a03e9d78ccb4f9b6eb275f
> src/test/python/apache/thermos/common/test_task_planner.py 132c1ec4977143b79df8d13804370e76a553c3b9
>
>
> Diff: https://reviews.apache.org/r/60306/diff/1/
>
>
> Testing
> -------
>
> ./pants test.pytest src/test/python::
>
>
> File Attachments
> ----------------
>
> massive_cpu_spike.png
> https://reviews.apache.org/media/uploaded/files/2017/06/21/57cbc6e6-2cd5-4e92-995a-e0e05a57c359__massive_cpu_spike.png
> massive_restart_count.png
> https://reviews.apache.org/media/uploaded/files/2017/06/21/c4cbab7c-1a48-4cf0-92ab-5fa9444813c7__massive_restart_count.png
>
>
> Thanks,
>
> Stephan Erb
>
>
Re: Review Request 60306: Ensure Thermos is not overloaded by an
unlimited number of lost processes
Posted by Stephan Erb <se...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60306/
-----------------------------------------------------------
(Updated June 21, 2017, 11:37 p.m.)
Review request for Aurora.
Changes
-------
Add screenshots to illustrate the observed issue.
Repository: aurora
Description
-------
Included changes:
* Thermos may consider launched processes to be LOST. Instead of
restarting those immediately, the restarts will now be at least
`min_duration` seconds apart. Restarts will also be capped at the
TOTAL_RUN_LIMIT of 100 restarts. This ensures neither Thermos nor the
observer will be overloaded by checkpoints. The handling of the LOST
state is now consistent with the handling of both FAILED and FINISHED.
* Mark the success_transition and failure_transition as private. They
are only used within `TaskPlanner` itself.
* Fix documented default of `min_duration` (i.e 5s rather than 15s).
Diffs
-----
docs/reference/configuration.md 6a9a3ff988dd2102aa9d22e27f22487f18423894
src/main/python/apache/thermos/common/planner.py da5120f8f0c2489927a03e9d78ccb4f9b6eb275f
src/test/python/apache/thermos/common/test_task_planner.py 132c1ec4977143b79df8d13804370e76a553c3b9
Diff: https://reviews.apache.org/r/60306/diff/1/
Testing
-------
./pants test.pytest src/test/python::
File Attachments (updated)
----------------
massive_cpu_spike.png
https://reviews.apache.org/media/uploaded/files/2017/06/21/57cbc6e6-2cd5-4e92-995a-e0e05a57c359__massive_cpu_spike.png
massive_restart_count.png
https://reviews.apache.org/media/uploaded/files/2017/06/21/c4cbab7c-1a48-4cf0-92ab-5fa9444813c7__massive_restart_count.png
Thanks,
Stephan Erb