You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@aurora.apache.org by Zameer Manji <zm...@apache.org> on 2016/01/15 01:53:41 UTC

Review Request 42332: Trigger shutdown on task pruning failure.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/
-----------------------------------------------------------

Review request for Aurora, John Sirois and Maxim Khutornenko.


Bugs: AURORA-1582
    https://issues.apache.org/jira/browse/AURORA-1582


Repository: aurora


Description
-------

Task pruning is critical to the operation of a large cluster. Failure to prune tasks can lead to storage growing in unexpected ways leading to scheduling slowdown. This patch adds shutdown on task pruning failure to prevent the scheduler from entering an undefined state.


Diffs
-----

  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
  src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 

Diff: https://reviews.apache.org/r/42332/diff/


Testing
-------

./gradlew build -Pq


Thanks,

Zameer Manji


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by Zameer Manji <zm...@apache.org>.

> On Jan. 20, 2016, 3:26 p.m., John Sirois wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, line 154
> > <https://reviews.apache.org/r/42332/diff/3/?file=1203998#file1203998line154>
> >
> >     I think this same concept can be cleaned up usefully by:
> >     1. wrap the executor in an `ExecutorCompletionService` up in the constructor.
> >     2. submit void work to the `CompletionService` 2x in `registerInactiveTask`
> >     3. in this run loop `take` void futures from it and `get` their void result - the exceptions will propagate naturally, 1st come, 1st serve, no trampling.

`ExecutorCompletionService` doesn't allow for submission of delayed tasks and line 186 is submitting a task with a delay so I don't think this is possible.


- Zameer


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115517
-----------------------------------------------------------


On Jan. 20, 2016, 2:39 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 20, 2016, 2:39 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by John Sirois <js...@apache.org>.

> On Jan. 20, 2016, 4:26 p.m., John Sirois wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, line 154
> > <https://reviews.apache.org/r/42332/diff/3/?file=1203998#file1203998line154>
> >
> >     I think this same concept can be cleaned up usefully by:
> >     1. wrap the executor in an `ExecutorCompletionService` up in the constructor.
> >     2. submit void work to the `CompletionService` 2x in `registerInactiveTask`
> >     3. in this run loop `take` void futures from it and `get` their void result - the exceptions will propagate naturally, 1st come, 1st serve, no trampling.
> 
> Zameer Manji wrote:
>     `ExecutorCompletionService` doesn't allow for submission of delayed tasks and line 186 is submitting a task with a delay so I don't think this is possible.

Hrm, yup.  Could construct `new FutureTask<Void>(() -> {...}, null);` to use as the runnables and add them to your won queue for the same benefits, up to you.


- John


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115517
-----------------------------------------------------------


On Jan. 20, 2016, 3:39 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 20, 2016, 3:39 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by John Sirois <js...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115517
-----------------------------------------------------------


LGTM, but let me know what you think about the suggestion below.


src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java (line 154)
<https://reviews.apache.org/r/42332/#comment176497>

    I think this same concept can be cleaned up usefully by:
    1. wrap the executor in an `ExecutorCompletionService` up in the constructor.
    2. submit void work to the `CompletionService` 2x in `registerInactiveTask`
    3. in this run loop `take` void futures from it and `get` their void result - the exceptions will propagate naturally, 1st come, 1st serve, no trampling.


- John Sirois


On Jan. 20, 2016, 3:39 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 20, 2016, 3:39 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by John Sirois <js...@apache.org>.

> On Jan. 20, 2016, 4:39 p.m., John Sirois wrote:
> > Ship It!

As Maxim points out below, the current `run` will consume a thread, so ship-it retracted!


- John


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115524
-----------------------------------------------------------


On Jan. 20, 2016, 3:39 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 20, 2016, 3:39 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by John Sirois <js...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115524
-----------------------------------------------------------

Ship it!


Ship It!

- John Sirois


On Jan. 20, 2016, 3:39 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 20, 2016, 3:39 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by John Sirois <js...@apache.org>.

> On Jan. 21, 2016, 11:51 a.m., Maxim Khutornenko wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, lines 152-161
> > <https://reviews.apache.org/r/42332/diff/3/?file=1203998#file1203998line152>
> >
> >     Am I reading this as a busy loop consuming 100% thread CPU waiting for something that may never happen? I don't think this is an acceptable solution.
> >     
> >     Perhaps it's time to refactor task prunner into an AbstractScheduledService? I always felt task prunner approach of holding on to task IDs for 2 days just to act once on them isn't very efficient. What if instead of acting on every particular task ID we have a periodic (say every 30 seconds) run loop to prune job keys instead?
> >     
> >     Implementation-wise, it could be a Set of unique job keys that we populate on every TaskStateChange event. A runOneIteration() would poll that set and apply both latency and max_per_job thresholds for all related terminal tasks within the same iteration.
> >     
> >     The only downside for the above is a somewhat increased history count between the cleanup runs but given that our current thresholds are chosen mostly arbitrarily I think that should be acceptable.

I think my Future/Queue suggestion above solves the busy loop with no liveness penalty.  That might allow your batching change suggestion to happen in a seperate follow-up RB.


- John


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115663
-----------------------------------------------------------


On Jan. 20, 2016, 3:39 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 20, 2016, 3:39 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by Maxim Khutornenko <ma...@apache.org>.

> On Jan. 21, 2016, 6:51 p.m., Maxim Khutornenko wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, lines 152-161
> > <https://reviews.apache.org/r/42332/diff/3/?file=1203998#file1203998line152>
> >
> >     Am I reading this as a busy loop consuming 100% thread CPU waiting for something that may never happen? I don't think this is an acceptable solution.
> >     
> >     Perhaps it's time to refactor task prunner into an AbstractScheduledService? I always felt task prunner approach of holding on to task IDs for 2 days just to act once on them isn't very efficient. What if instead of acting on every particular task ID we have a periodic (say every 30 seconds) run loop to prune job keys instead?
> >     
> >     Implementation-wise, it could be a Set of unique job keys that we populate on every TaskStateChange event. A runOneIteration() would poll that set and apply both latency and max_per_job thresholds for all related terminal tasks within the same iteration.
> >     
> >     The only downside for the above is a somewhat increased history count between the cleanup runs but given that our current thresholds are chosen mostly arbitrarily I think that should be acceptable.
> 
> John Sirois wrote:
>     I think my Future/Queue suggestion above solves the busy loop with no liveness penalty.  That might allow your batching change suggestion to happen in a seperate follow-up RB.
> 
> Zameer Manji wrote:
>     +1 to John here. I think we are overdue for a less complex and heavy pruner but I would prefer to keep this RB focused on failure propagation. I am open to a follow up ticket and RB. Maxim, if you agree, I can create a ticket that tracks the work you just proposed.
>     
>     Right now, I think I will use the Future/Queue suggestion that John has to remove the busy loop.

I am fine with the follow up ticket if you feel it's too much to lift within this RB.


- Maxim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115663
-----------------------------------------------------------


On Jan. 20, 2016, 10:39 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 20, 2016, 10:39 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by Zameer Manji <zm...@apache.org>.

> On Jan. 21, 2016, 10:51 a.m., Maxim Khutornenko wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, lines 152-161
> > <https://reviews.apache.org/r/42332/diff/3/?file=1203998#file1203998line152>
> >
> >     Am I reading this as a busy loop consuming 100% thread CPU waiting for something that may never happen? I don't think this is an acceptable solution.
> >     
> >     Perhaps it's time to refactor task prunner into an AbstractScheduledService? I always felt task prunner approach of holding on to task IDs for 2 days just to act once on them isn't very efficient. What if instead of acting on every particular task ID we have a periodic (say every 30 seconds) run loop to prune job keys instead?
> >     
> >     Implementation-wise, it could be a Set of unique job keys that we populate on every TaskStateChange event. A runOneIteration() would poll that set and apply both latency and max_per_job thresholds for all related terminal tasks within the same iteration.
> >     
> >     The only downside for the above is a somewhat increased history count between the cleanup runs but given that our current thresholds are chosen mostly arbitrarily I think that should be acceptable.
> 
> John Sirois wrote:
>     I think my Future/Queue suggestion above solves the busy loop with no liveness penalty.  That might allow your batching change suggestion to happen in a seperate follow-up RB.

+1 to John here. I think we are overdue for a less complex and heavy pruner but I would prefer to keep this RB focused on failure propagation. I am open to a follow up ticket and RB. Maxim, if you agree, I can create a ticket that tracks the work you just proposed.

Right now, I think I will use the Future/Queue suggestion that John has to remove the busy loop.


- Zameer


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115663
-----------------------------------------------------------


On Jan. 20, 2016, 2:39 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 20, 2016, 2:39 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by Maxim Khutornenko <ma...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115663
-----------------------------------------------------------



src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java (lines 152 - 161)
<https://reviews.apache.org/r/42332/#comment176658>

    Am I reading this as a busy loop consuming 100% thread CPU waiting for something that may never happen? I don't think this is an acceptable solution.
    
    Perhaps it's time to refactor task prunner into an AbstractScheduledService? I always felt task prunner approach of holding on to task IDs for 2 days just to act once on them isn't very efficient. What if instead of acting on every particular task ID we have a periodic (say every 30 seconds) run loop to prune job keys instead?
    
    Implementation-wise, it could be a Set of unique job keys that we populate on every TaskStateChange event. A runOneIteration() would poll that set and apply both latency and max_per_job thresholds for all related terminal tasks within the same iteration.
    
    The only downside for the above is a somewhat increased history count between the cleanup runs but given that our current thresholds are chosen mostly arbitrarily I think that should be acceptable.


- Maxim Khutornenko


On Jan. 20, 2016, 10:39 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 20, 2016, 10:39 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115513
-----------------------------------------------------------

Ship it!


Master (8d3fb24) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Jan. 20, 2016, 10:39 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 20, 2016, 10:39 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by Zameer Manji <zm...@apache.org>.

> On Jan. 21, 2016, 3:20 p.m., John Sirois wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, line 164
> > <https://reviews.apache.org/r/42332/diff/3-5/?file=1203998#file1203998line164>
> >
> >     I was envisioning put and take, no schedules, no batching. The queue could even be a SynchronousQueue holding ~0 elements. Any reason not to simplify in that way?

The documentation for Guava's `AbstractExecutionThreadService` says: "You must override the run() method, and it must respond to stop requests.". If we just implemented `take`, the thread of the service would block for a long time if there was no TaskStateChanges, possibly causing this service to not respond to stop requests. I don't know what the implications are of not responding to stop requests, but I would like to avoid that case.

I decided that it would be easier to just use `AbstractScheduledService` instead and avoid this case.


- Zameer


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115730
-----------------------------------------------------------


On Jan. 21, 2016, 2:55 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 21, 2016, 2:55 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by John Sirois <js...@apache.org>.

> On Jan. 21, 2016, 4:20 p.m., John Sirois wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, line 164
> > <https://reviews.apache.org/r/42332/diff/3-5/?file=1203998#file1203998line164>
> >
> >     I was envisioning put and take, no schedules, no batching. The queue could even be a SynchronousQueue holding ~0 elements. Any reason not to simplify in that way?
> 
> Zameer Manji wrote:
>     The documentation for Guava's `AbstractExecutionThreadService` says: "You must override the run() method, and it must respond to stop requests.". If we just implemented `take`, the thread of the service would block for a long time if there was no TaskStateChanges, possibly causing this service to not respond to stop requests. I don't know what the implications are of not responding to stop requests, but I would like to avoid that case.
>     
>     I decided that it would be easier to just use `AbstractScheduledService` instead and avoid this case.

OK - I may follow up with a proposal RB on my own time - I think this can be simpler and clearer, but I need to sketch some to see if I'm lying.


- John


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115730
-----------------------------------------------------------


On Jan. 21, 2016, 4:43 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 21, 2016, 4:43 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by John Sirois <js...@apache.org>.

> On Jan. 21, 2016, 4:20 p.m., John Sirois wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, line 164
> > <https://reviews.apache.org/r/42332/diff/3-5/?file=1203998#file1203998line164>
> >
> >     I was envisioning put and take, no schedules, no batching. The queue could even be a SynchronousQueue holding ~0 elements. Any reason not to simplify in that way?
> 
> Zameer Manji wrote:
>     The documentation for Guava's `AbstractExecutionThreadService` says: "You must override the run() method, and it must respond to stop requests.". If we just implemented `take`, the thread of the service would block for a long time if there was no TaskStateChanges, possibly causing this service to not respond to stop requests. I don't know what the implications are of not responding to stop requests, but I would like to avoid that case.
>     
>     I decided that it would be easier to just use `AbstractScheduledService` instead and avoid this case.
> 
> John Sirois wrote:
>     OK - I may follow up with a proposal RB on my own time - I think this can be simpler and clearer, but I need to sketch some to see if I'm lying.

See what you think of this: https://reviews.apache.org/r/42639/


- John


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115730
-----------------------------------------------------------


On Jan. 21, 2016, 4:43 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 21, 2016, 4:43 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by John Sirois <js...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115730
-----------------------------------------------------------



src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java (line 145)
<https://reviews.apache.org/r/42332/#comment176774>

    I was envisioning put and take, no schedules, no batching. The queue could even be a SynchronousQueue holding ~0 elements. Any reason not to simplify in that way?


- John Sirois


On Jan. 21, 2016, 3:55 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 21, 2016, 3:55 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115729
-----------------------------------------------------------

Ship it!


Master (b2cc604) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Jan. 21, 2016, 10:55 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 21, 2016, 10:55 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by John Sirois <js...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115749
-----------------------------------------------------------

Ship it!


Ship It!

- John Sirois


On Jan. 21, 2016, 4:43 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 21, 2016, 4:43 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by Maxim Khutornenko <ma...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115739
-----------------------------------------------------------

Ship it!


Ship It!

- Maxim Khutornenko


On Jan. 21, 2016, 11:43 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 21, 2016, 11:43 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115738
-----------------------------------------------------------

Ship it!


Master (a2c7ccc) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Jan. 21, 2016, 11:43 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 21, 2016, 11:43 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by Zameer Manji <zm...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/
-----------------------------------------------------------

(Updated Jan. 21, 2016, 3:43 p.m.)


Review request for Aurora, John Sirois and Maxim Khutornenko.


Changes
-------

Maxim's feedback.


Bugs: AURORA-1582
    https://issues.apache.org/jira/browse/AURORA-1582


Repository: aurora


Description
-------

Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.


Diffs (updated)
-----

  src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
  src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 

Diff: https://reviews.apache.org/r/42332/diff/


Testing
-------

./gradlew build -Pq
e2e tests


Thanks,

Zameer Manji


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by Zameer Manji <zm...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/
-----------------------------------------------------------

(Updated Jan. 21, 2016, 2:55 p.m.)


Review request for Aurora, John Sirois and Maxim Khutornenko.


Changes
-------

Maxim's feedback.


Bugs: AURORA-1582
    https://issues.apache.org/jira/browse/AURORA-1582


Repository: aurora


Description
-------

Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.


Diffs (updated)
-----

  src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
  src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 

Diff: https://reviews.apache.org/r/42332/diff/


Testing
-------

./gradlew build -Pq
e2e tests


Thanks,

Zameer Manji


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115721
-----------------------------------------------------------

Ship it!


Master (8d3fb24) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Jan. 21, 2016, 10:22 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 21, 2016, 10:22 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by Zameer Manji <zm...@apache.org>.

> On Jan. 21, 2016, 2:50 p.m., Maxim Khutornenko wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, line 108
> > <https://reviews.apache.org/r/42332/diff/4/?file=1204701#file1204701line108>
> >
> >     Why BlockingQueue and not something like ConcurrentLinkedQueue? There is nothing to block here for.
> 
> Zameer Manji wrote:
>     `BlockingQueue` has the `drainTo` method which I think makes the iteration nicer.
> 
> Maxim Khutornenko wrote:
>     The price for that is excessive locking and lower perf. You can do the same with a faster ConcurrentLinkedQueue and avoid extra collection copying:
>     
>     ```
>           FutureTask<Void> task;
>           while((task = q.poll()) != null) {
>             task.get();
>           }
>     ```

Done.


- Zameer


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115720
-----------------------------------------------------------


On Jan. 21, 2016, 2:55 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 21, 2016, 2:55 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by Zameer Manji <zm...@apache.org>.

> On Jan. 21, 2016, 2:50 p.m., Maxim Khutornenko wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, line 108
> > <https://reviews.apache.org/r/42332/diff/4/?file=1204701#file1204701line108>
> >
> >     Why BlockingQueue and not something like ConcurrentLinkedQueue? There is nothing to block here for.

`BlockingQueue` has the `drainTo` method which I think makes the iteration nicer.


> On Jan. 21, 2016, 2:50 p.m., Maxim Khutornenko wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, line 140
> > <https://reviews.apache.org/r/42332/diff/4/?file=1204701#file1204701line140>
> >
> >     runOneIteration() implies periodicity already, consider simplifying this comment

Simplified.


> On Jan. 21, 2016, 2:50 p.m., Maxim Khutornenko wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, line 153
> > <https://reviews.apache.org/r/42332/diff/4/?file=1204701#file1204701line153>
> >
> >     Does it have to run every second? This just checks for errors so a higher limit should suffice (e.g. 5 seconds).

No, it does not have to run every second. Changed it to every 5 seconds.


- Zameer


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115720
-----------------------------------------------------------


On Jan. 21, 2016, 2:22 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 21, 2016, 2:22 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by Maxim Khutornenko <ma...@apache.org>.

> On Jan. 21, 2016, 10:50 p.m., Maxim Khutornenko wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, line 108
> > <https://reviews.apache.org/r/42332/diff/4/?file=1204701#file1204701line108>
> >
> >     Why BlockingQueue and not something like ConcurrentLinkedQueue? There is nothing to block here for.
> 
> Zameer Manji wrote:
>     `BlockingQueue` has the `drainTo` method which I think makes the iteration nicer.

The price for that is excessive locking and lower perf. You can do the same with a faster ConcurrentLinkedQueue and avoid extra collection copying:

```
      FutureTask<Void> task;
      while((task = q.poll()) != null) {
        task.get();
      }
```


- Maxim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115720
-----------------------------------------------------------


On Jan. 21, 2016, 10:55 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 21, 2016, 10:55 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by John Sirois <js...@apache.org>.

> On Jan. 21, 2016, 3:50 p.m., Maxim Khutornenko wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, line 108
> > <https://reviews.apache.org/r/42332/diff/4/?file=1204701#file1204701line108>
> >
> >     Why BlockingQueue and not something like ConcurrentLinkedQueue? There is nothing to block here for.
> 
> Zameer Manji wrote:
>     `BlockingQueue` has the `drainTo` method which I think makes the iteration nicer.
> 
> Maxim Khutornenko wrote:
>     The price for that is excessive locking and lower perf. You can do the same with a faster ConcurrentLinkedQueue and avoid extra collection copying:
>     
>     ```
>           FutureTask<Void> task;
>           while((task = q.poll()) != null) {
>             task.get();
>           }
>     ```
> 
> Zameer Manji wrote:
>     Done.

I just want to temper perf concerns with the fact that there are exactly 2 threads involved here - 1 from EventBus (no AllowConcurrentEvents used here), and this 1 polling thread in the main service lifecycle - perf should not actually matter here in any reasonable scenario I can think of.

That said - this code reads ~just as clearly so I see no problems with the switch.


- John


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115720
-----------------------------------------------------------


On Jan. 21, 2016, 4:43 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 21, 2016, 4:43 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by Maxim Khutornenko <ma...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review115720
-----------------------------------------------------------



src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java (line 108)
<https://reviews.apache.org/r/42332/#comment176768>

    Why BlockingQueue and not something like ConcurrentLinkedQueue? There is nothing to block here for.



src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java (line 140)
<https://reviews.apache.org/r/42332/#comment176761>

    runOneIteration() implies periodicity already, consider simplifying this comment



src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java (line 153)
<https://reviews.apache.org/r/42332/#comment176766>

    Does it have to run every second? This just checks for errors so a higher limit should suffice (e.g. 5 seconds).


- Maxim Khutornenko


On Jan. 21, 2016, 10:22 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 21, 2016, 10:22 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by Zameer Manji <zm...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/
-----------------------------------------------------------

(Updated Jan. 21, 2016, 2:22 p.m.)


Review request for Aurora, John Sirois and Maxim Khutornenko.


Changes
-------

Removed busy waiting for periodic checking.


Bugs: AURORA-1582
    https://issues.apache.org/jira/browse/AURORA-1582


Repository: aurora


Description
-------

Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.


Diffs (updated)
-----

  src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
  src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 

Diff: https://reviews.apache.org/r/42332/diff/


Testing
-------

./gradlew build -Pq
e2e tests


Thanks,

Zameer Manji


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by Zameer Manji <zm...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/
-----------------------------------------------------------

(Updated Jan. 20, 2016, 2:39 p.m.)


Review request for Aurora, John Sirois and Maxim Khutornenko.


Changes
-------

Remove TODO


Bugs: AURORA-1582
    https://issues.apache.org/jira/browse/AURORA-1582


Repository: aurora


Description
-------

Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.


Diffs (updated)
-----

  src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
  src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 

Diff: https://reviews.apache.org/r/42332/diff/


Testing
-------

./gradlew build -Pq
e2e tests


Thanks,

Zameer Manji


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by Zameer Manji <zm...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/
-----------------------------------------------------------

(Updated Jan. 20, 2016, 2:38 p.m.)


Review request for Aurora, John Sirois and Maxim Khutornenko.


Bugs: AURORA-1582
    https://issues.apache.org/jira/browse/AURORA-1582


Repository: aurora


Description (updated)
-------

Task pruning is key to operating a large cluster and failure to prune should trigger shutdown to prevent unbounded growth of storage. This patch turns `TaskHistoryPruner` into a service which propagates failure from failed pruning attempts towards the `ServiceManager`. Also completing a TODO which removes a test for behaviour that is very awkward to test for.


Diffs
-----

  src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
  src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 

Diff: https://reviews.apache.org/r/42332/diff/


Testing (updated)
-------

./gradlew build -Pq
e2e tests


Thanks,

Zameer Manji


Re: Review Request 42332: Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.

Posted by Zameer Manji <zm...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/
-----------------------------------------------------------

(Updated Jan. 20, 2016, 2:36 p.m.)


Review request for Aurora, John Sirois and Maxim Khutornenko.


Summary (updated)
-----------------

Turn TaskHistoryPruner into a service and trigger shutdown on pruning failure.


Bugs: AURORA-1582
    https://issues.apache.org/jira/browse/AURORA-1582


Repository: aurora


Description
-------

Task pruning is critical to the operation of a large cluster. Failure to prune tasks can lead to storage growing in unexpected ways leading to scheduling slowdown. This patch adds shutdown on task pruning failure to prevent the scheduler from entering an undefined state.


Diffs
-----

  src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
  src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 

Diff: https://reviews.apache.org/r/42332/diff/


Testing
-------

./gradlew build -Pq


Thanks,

Zameer Manji


Re: Review Request 42332: Trigger shutdown on task pruning failure.

Posted by Zameer Manji <zm...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/
-----------------------------------------------------------

(Updated Jan. 20, 2016, 2:36 p.m.)


Review request for Aurora, John Sirois and Maxim Khutornenko.


Changes
-------

Review Feedback.


Bugs: AURORA-1582
    https://issues.apache.org/jira/browse/AURORA-1582


Repository: aurora


Description
-------

Task pruning is critical to the operation of a large cluster. Failure to prune tasks can lead to storage growing in unexpected ways leading to scheduling slowdown. This patch adds shutdown on task pruning failure to prevent the scheduler from entering an undefined state.


Diffs (updated)
-----

  src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 735199ac1ccccab343c24471890aa330d6635c26 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
  src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 

Diff: https://reviews.apache.org/r/42332/diff/


Testing
-------

./gradlew build -Pq


Thanks,

Zameer Manji


Re: Review Request 42332: Trigger shutdown on task pruning failure.

Posted by Zameer Manji <zm...@apache.org>.

> On Jan. 15, 2016, 3:33 p.m., John Sirois wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, line 152
> > <https://reviews.apache.org/r/42332/diff/1/?file=1197802#file1197802line152>
> >
> >     You could centralize this handling in deleteTasks - it has all the ids to give as useful a message as you have here, and that localizes the otherwise overly broad looking RTE catch.  It'll clean up the temp you had to introduce below as well.  Further, it'll give you 1 nice spot to add a comment explaining why termination on 1 failed prune is sane; ie: you can point out the inevitable history growth issue and that rolling back is better than waiting for the failure and then rolling back.

The objective is to catch any unexpected exceptions in the Runnable given to the executor. For example the exception that triggered this investigation did not come from delete tasks but from sorting the set of tasks to delete:
````
FluentIterable
        .from(Tasks.LATEST_ACTIVITY.sortedCopy(inactiveTasks))
        .filter(safeToDelete)
        .limit(tasksToPrune)
        .transform(Tasks::id)
        .toSet();
````

I agree that this approach is not ideal. I'm not sure on how to change this without changing the executor to have some sort of ListenableFuture that does something `onFailure`.


- Zameer


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review114795
-----------------------------------------------------------


On Jan. 14, 2016, 4:53 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 14, 2016, 4:53 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is critical to the operation of a large cluster. Failure to prune tasks can lead to storage growing in unexpected ways leading to scheduling slowdown. This patch adds shutdown on task pruning failure to prevent the scheduler from entering an undefined state.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Trigger shutdown on task pruning failure.

Posted by John Sirois <js...@apache.org>.

> On Jan. 15, 2016, 4:33 p.m., John Sirois wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, line 152
> > <https://reviews.apache.org/r/42332/diff/1/?file=1197802#file1197802line152>
> >
> >     You could centralize this handling in deleteTasks - it has all the ids to give as useful a message as you have here, and that localizes the otherwise overly broad looking RTE catch.  It'll clean up the temp you had to introduce below as well.  Further, it'll give you 1 nice spot to add a comment explaining why termination on 1 failed prune is sane; ie: you can point out the inevitable history growth issue and that rolling back is better than waiting for the failure and then rolling back.
> 
> Zameer Manji wrote:
>     The objective is to catch any unexpected exceptions in the Runnable given to the executor. For example the exception that triggered this investigation did not come from delete tasks but from sorting the set of tasks to delete:
>     ````
>     FluentIterable
>             .from(Tasks.LATEST_ACTIVITY.sortedCopy(inactiveTasks))
>             .filter(safeToDelete)
>             .limit(tasksToPrune)
>             .transform(Tasks::id)
>             .toSet();
>     ````
>     
>     I agree that this approach is not ideal. I'm not sure on how to change this without changing the executor to have some sort of ListenableFuture that does something `onFailure`.

Right, and that sounds like a great idea to me, it would be nice to have a more centralized way to deal with this and a ListenableFutures provide it:
  http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/util/concurrent/ListeningExecutorService.html
  http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/util/concurrent/MoreExecutors.html#listeningDecorator(java.util.concurrent.ExecutorService)
  
Not sure if you want to pursue that idea and drop this or do this change, add the larger change then flip over to it later.


- John


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review114795
-----------------------------------------------------------


On Jan. 14, 2016, 5:53 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 14, 2016, 5:53 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is critical to the operation of a large cluster. Failure to prune tasks can lead to storage growing in unexpected ways leading to scheduling slowdown. This patch adds shutdown on task pruning failure to prevent the scheduler from entering an undefined state.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Trigger shutdown on task pruning failure.

Posted by John Sirois <js...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review114795
-----------------------------------------------------------



src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java (line 152)
<https://reviews.apache.org/r/42332/#comment175644>

    You could centralize this handling in deleteTasks - it has all the ids to give as useful a message as you have here, and that localizes the otherwise overly broad looking RTE catch.  It'll clean up the temp you had to introduce below as well.  Further, it'll give you 1 nice spot to add a comment explaining why termination on 1 failed prune is sane; ie: you can point out the inevitable history growth issue and that rolling back is better than waiting for the failure and then rolling back.


- John Sirois


On Jan. 14, 2016, 5:53 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 14, 2016, 5:53 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is critical to the operation of a large cluster. Failure to prune tasks can lead to storage growing in unexpected ways leading to scheduling slowdown. This patch adds shutdown on task pruning failure to prevent the scheduler from entering an undefined state.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Trigger shutdown on task pruning failure.

Posted by Zameer Manji <zm...@apache.org>.

> On Jan. 15, 2016, 9:57 a.m., Maxim Khutornenko wrote:
> > History prunner is regsistered as a service. Shouldn't its failure triger a shutdown according to this: https://reviews.apache.org/r/39631?

`TaskHistoryPruner` is not a service, only `JobUpdateHistoryPruner` is. Even if this was a service this error would not shut it down because the exception occurs on a separate thread. In this case the exception occurs on the executor threads provided by `AsyncModule`. Previously, a failure there would just be logged (and I recently added support for a metric). I talked with John about this and we agreed that clients of this executor should be responsible for determining if an exception in the Runnable is fatal or not.


- Zameer


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review114732
-----------------------------------------------------------


On Jan. 14, 2016, 4:53 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 14, 2016, 4:53 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is critical to the operation of a large cluster. Failure to prune tasks can lead to storage growing in unexpected ways leading to scheduling slowdown. This patch adds shutdown on task pruning failure to prevent the scheduler from entering an undefined state.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Trigger shutdown on task pruning failure.

Posted by Maxim Khutornenko <ma...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review114732
-----------------------------------------------------------


History prunner is regsistered as a service. Shouldn't its failure triger a shutdown according to this: https://reviews.apache.org/r/39631?

- Maxim Khutornenko


On Jan. 15, 2016, 12:53 a.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 15, 2016, 12:53 a.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is critical to the operation of a large cluster. Failure to prune tasks can lead to storage growing in unexpected ways leading to scheduling slowdown. This patch adds shutdown on task pruning failure to prevent the scheduler from entering an undefined state.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Trigger shutdown on task pruning failure.

Posted by Bill Farner <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review114830
-----------------------------------------------------------



src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java (line 94)
<https://reviews.apache.org/r/42332/#comment175674>

    Ultra drive-by review.  But have you explored making this class a service and letting general service failure handle the ripple to application teardown?  IMHO that's a nice abstraction if it work out.


- Bill Farner


On Jan. 14, 2016, 4:53 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 14, 2016, 4:53 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is critical to the operation of a large cluster. Failure to prune tasks can lead to storage growing in unexpected ways leading to scheduling slowdown. This patch adds shutdown on task pruning failure to prevent the scheduler from entering an undefined state.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 42332: Trigger shutdown on task pruning failure.

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42332/#review114623
-----------------------------------------------------------

Ship it!


Master (4dff5da) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Jan. 15, 2016, 12:53 a.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42332/
> -----------------------------------------------------------
> 
> (Updated Jan. 15, 2016, 12:53 a.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1582
>     https://issues.apache.org/jira/browse/AURORA-1582
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task pruning is critical to the operation of a large cluster. Failure to prune tasks can lead to storage growing in unexpected ways leading to scheduling slowdown. This patch adds shutdown on task pruning failure to prevent the scheduler from entering an undefined state.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 2064089937f5178b1413d386a312f4173a0e35fb 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 295960f13693c6ba0d7075a8ef7f9680a91ae69d 
> 
> Diff: https://reviews.apache.org/r/42332/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>