You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@aurora.apache.org by Bill Farner <wf...@apache.org> on 2014/09/02 18:43:14 UTC

Re: Review Request 24915: Adding initial GC task delay on scheduler restart.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24915/#review52037
-----------------------------------------------------------


I'm concerned that the problem we're solving is underspecified.  An immediate issue i have with this patch is that it introduces a memory leak (in practice, this is hidden with default settings due to failover failover hides this).

I'll reply on the ticket to try to reach consensus on what the problem is and what it means to fix it.

- Bill Farner


On Aug. 20, 2014, 11:35 p.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/24915/
> -----------------------------------------------------------
> 
> (Updated Aug. 20, 2014, 11:35 p.m.)
> 
> 
> Review request for Aurora and Bill Farner.
> 
> 
> Bugs: AURORA-608
>     https://issues.apache.org/jira/browse/AURORA-608
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Adding initial GC task delay on scheduler restart.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/async/AsyncModule.java 5a38a1f69ac5dbe68af3bfe175899ddee392880b 
>   src/main/java/org/apache/aurora/scheduler/async/GcExecutorLauncher.java 65f404915bc60ffe11a7a57d9861ac5b37fa646a 
>   src/test/java/org/apache/aurora/scheduler/async/GcExecutorLauncherTest.java 04137072891b2a1f0ad663182629dd469b09324f 
> 
> Diff: https://reviews.apache.org/r/24915/diff/
> 
> 
> Testing
> -------
> 
> gradle -Pq build
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>


Re: Review Request 24915: Adding initial GC task delay on scheduler restart.

Posted by Bill Farner <wf...@apache.org>.

> On Sept. 2, 2014, 4:43 p.m., Bill Farner wrote:
> > I'm concerned that the problem we're solving is underspecified.  An immediate issue i have with this patch is that it introduces a memory leak (in practice, this is hidden with default settings due to failover failover hides this).
> > 
> > I'll reply on the ticket to try to reach consensus on what the problem is and what it means to fix it.
> 
> Maxim Khutornenko wrote:
>     I don't think we are trying to fix a Mesos problem here. Regardless of the underlying Mesos resolution (MESOS-1646), I do think Aurora should be a good Mesos citizen here and avoid abnormal offer activity spikes driven by failovers. 
>     
>     In order for that memory leak to expose itself in any reasonable manner, there must be an intensive and constant churn of hosts in the cluster. I don't think it's a likely scenario but if you are concerned it should be easy to converge on a hybrid solution with expiring cache where the EXECUTOR_GC_RESTART_INTERVAL kicks in any time an item is not found in the cache.

> I do think Aurora should be a good Mesos citizen here and avoid abnormal offer activity spikes driven by failovers

Can you offer more detail here - why is this bad behavior?


- Bill


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24915/#review52037
-----------------------------------------------------------


On Aug. 20, 2014, 11:35 p.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/24915/
> -----------------------------------------------------------
> 
> (Updated Aug. 20, 2014, 11:35 p.m.)
> 
> 
> Review request for Aurora and Bill Farner.
> 
> 
> Bugs: AURORA-608
>     https://issues.apache.org/jira/browse/AURORA-608
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Adding initial GC task delay on scheduler restart.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/async/AsyncModule.java 5a38a1f69ac5dbe68af3bfe175899ddee392880b 
>   src/main/java/org/apache/aurora/scheduler/async/GcExecutorLauncher.java 65f404915bc60ffe11a7a57d9861ac5b37fa646a 
>   src/test/java/org/apache/aurora/scheduler/async/GcExecutorLauncherTest.java 04137072891b2a1f0ad663182629dd469b09324f 
> 
> Diff: https://reviews.apache.org/r/24915/diff/
> 
> 
> Testing
> -------
> 
> gradle -Pq build
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>


Re: Review Request 24915: Adding initial GC task delay on scheduler restart.

Posted by Maxim Khutornenko <ma...@apache.org>.

> On Sept. 2, 2014, 4:43 p.m., Bill Farner wrote:
> > I'm concerned that the problem we're solving is underspecified.  An immediate issue i have with this patch is that it introduces a memory leak (in practice, this is hidden with default settings due to failover failover hides this).
> > 
> > I'll reply on the ticket to try to reach consensus on what the problem is and what it means to fix it.
> 
> Maxim Khutornenko wrote:
>     I don't think we are trying to fix a Mesos problem here. Regardless of the underlying Mesos resolution (MESOS-1646), I do think Aurora should be a good Mesos citizen here and avoid abnormal offer activity spikes driven by failovers. 
>     
>     In order for that memory leak to expose itself in any reasonable manner, there must be an intensive and constant churn of hosts in the cluster. I don't think it's a likely scenario but if you are concerned it should be easy to converge on a hybrid solution with expiring cache where the EXECUTOR_GC_RESTART_INTERVAL kicks in any time an item is not found in the cache.
> 
> Bill Farner wrote:
>     > I do think Aurora should be a good Mesos citizen here and avoid abnormal offer activity spikes driven by failovers
>     
>     Can you offer more detail here - why is this bad behavior?

We spread GC activity randomly to avoid hourly spikes (https://reviews.apache.org/r/16247/). Failovers, however, break that pattern and manifest in "unusual" GC activity spikes. All Mesos matters aside, I don't think scheduler interruptions should result in what appears externally as a "catch-up" behavior.


- Maxim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24915/#review52037
-----------------------------------------------------------


On Aug. 20, 2014, 11:35 p.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/24915/
> -----------------------------------------------------------
> 
> (Updated Aug. 20, 2014, 11:35 p.m.)
> 
> 
> Review request for Aurora and Bill Farner.
> 
> 
> Bugs: AURORA-608
>     https://issues.apache.org/jira/browse/AURORA-608
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Adding initial GC task delay on scheduler restart.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/async/AsyncModule.java 5a38a1f69ac5dbe68af3bfe175899ddee392880b 
>   src/main/java/org/apache/aurora/scheduler/async/GcExecutorLauncher.java 65f404915bc60ffe11a7a57d9861ac5b37fa646a 
>   src/test/java/org/apache/aurora/scheduler/async/GcExecutorLauncherTest.java 04137072891b2a1f0ad663182629dd469b09324f 
> 
> Diff: https://reviews.apache.org/r/24915/diff/
> 
> 
> Testing
> -------
> 
> gradle -Pq build
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>


Re: Review Request 24915: Adding initial GC task delay on scheduler restart.

Posted by Maxim Khutornenko <ma...@apache.org>.

> On Sept. 2, 2014, 4:43 p.m., Bill Farner wrote:
> > I'm concerned that the problem we're solving is underspecified.  An immediate issue i have with this patch is that it introduces a memory leak (in practice, this is hidden with default settings due to failover failover hides this).
> > 
> > I'll reply on the ticket to try to reach consensus on what the problem is and what it means to fix it.

I don't think we are trying to fix a Mesos problem here. Regardless of the underlying Mesos resolution (MESOS-1646), I do think Aurora should be a good Mesos citizen here and avoid abnormal offer activity spikes driven by failovers. 

In order for that memory leak to expose itself in any reasonable manner, there must be an intensive and constant churn of hosts in the cluster. I don't think it's a likely scenario but if you are concerned it should be easy to converge on a hybrid solution with expiring cache where the EXECUTOR_GC_RESTART_INTERVAL kicks in any time an item is not found in the cache.


- Maxim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24915/#review52037
-----------------------------------------------------------


On Aug. 20, 2014, 11:35 p.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/24915/
> -----------------------------------------------------------
> 
> (Updated Aug. 20, 2014, 11:35 p.m.)
> 
> 
> Review request for Aurora and Bill Farner.
> 
> 
> Bugs: AURORA-608
>     https://issues.apache.org/jira/browse/AURORA-608
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Adding initial GC task delay on scheduler restart.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/async/AsyncModule.java 5a38a1f69ac5dbe68af3bfe175899ddee392880b 
>   src/main/java/org/apache/aurora/scheduler/async/GcExecutorLauncher.java 65f404915bc60ffe11a7a57d9861ac5b37fa646a 
>   src/test/java/org/apache/aurora/scheduler/async/GcExecutorLauncherTest.java 04137072891b2a1f0ad663182629dd469b09324f 
> 
> Diff: https://reviews.apache.org/r/24915/diff/
> 
> 
> Testing
> -------
> 
> gradle -Pq build
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>