Posted to dev@mesos.apache.org by Martin Weindel <ma...@gmail.com> on 2014/08/26 00:08:41 UTC

Review Request 25035: Fix for MESOS-1688

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/
-----------------------------------------------------------

Review request for mesos.


Bugs: MESOS-1688
    https://issues.apache.org/jira/browse/MESOS-1688


Repository: mesos-git


Description
-------

As already explained in JIRA MESOS-1688, some schedulers allocate memory only for the executor and not for tasks. In that case, tasks are allocated CPU resources only.
Such a scheduler is not offered any idle CPUs if nearly all of the slave's memory is already in use.
This can easily lead to a deadlock (in the application, not in Mesos).

Simple example:
1. Scheduler allocates all memory of a slave for an executor.
2. Scheduler launches a task for this executor (allocating 1 CPU).
3. Task finishes: 1 CPU, 0 MB memory allocatable.
4. No offers are made, as no memory is left. The scheduler will wait for offers forever: a deadlock in the application.

To fix this problem, offers must be made whenever CPU resources are allocatable, regardless of how much memory is allocatable.
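
As a rough illustration of the example above, here is a small C++ model of the offer decision. It is a sketch only and not the actual allocator code: `Free`, `offerableRequiringBoth`, `offerableCpuOnly`, and the threshold values are hypothetical names and numbers chosen for this illustration.

```cpp
// Hypothetical model of the scenario above; not the actual Mesos allocator.
struct Free {
  double cpus;   // unallocated CPUs on the slave
  double memMb;  // unallocated memory on the slave, in MB
};

// Assumed thresholds, for illustration only.
constexpr double kMinCpus = 0.1;
constexpr double kMinMemMb = 32.0;

// Behaviour described as the problem: an offer needs both CPU and memory.
bool offerableRequiringBoth(const Free& f) {
  return f.cpus >= kMinCpus && f.memMb >= kMinMemMb;
}

// Behaviour proposed in this request: free CPU alone is enough.
bool offerableCpuOnly(const Free& f) {
  return f.cpus >= kMinCpus;
}

// State after step 3: the task has finished and its CPU is free again,
// but the executor still holds all of the slave's memory.
constexpr Free kAfterTaskFinished{1.0, 0.0};
```

With `kAfterTaskFinished`, the first predicate rejects the slave (no offer, so the scheduler waits forever), while the second one lets the idle CPU be offered again.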


Diffs
-----

  src/master/hierarchical_allocator_process.hpp 34f8cd658920b36b1062bd3b7f6bfbd1bcb6bb52 

Diff: https://reviews.apache.org/r/25035/diff/


Testing
-------

`make` and `make check` executed.
Allocation tests succeeded.
On my machine the test `MasterTest.MetricsInStatsEndpoint` failed both with and without the patch. So I'm not sure if all tests were executed.


Thanks,

Martin Weindel


Re: Review Request 25035: Fix for MESOS-1688

Posted by Vinod Kone <vi...@gmail.com>.

> On Sept. 2, 2014, 5:53 p.m., Vinod Kone wrote:
> > src/master/hierarchical_allocator_process.hpp, lines 825-840
> > <https://reviews.apache.org/r/25035/diff/2/?file=672690#file672690line825>
> >
> >     I suggest deleting this comment altogether because frameworks can utilize offers with either no memory or no cpus based on how they allocate resources between executors and tasks. Also, change the code to:
> >     
> >     ```
> >     return (cpus.isSome() && cpus.get() >= MIN_CPUS) || 
> >            (mem.isSome() && mem.get() >= MIN_MEM);
> >     ```
> >     
> >     The important thing to note here is that executors should be launched with both cpus *and* memory. Mind adding a TODO in ResourceUsageChecker in master.cpp to that effect and log a warning? The reason we are doing a TODO and warning instead of fixing ResourceUsageChecker is to give frameworks (e.g., Spark) time to update their code to adhere to these new semantics. We will enforce this in the next release. Sounds good?
> 
> Martin Weindel wrote:
>     OK, I will take a look at allocator_tests and see how to extend it.
>     
>     Your suggested code change was actually my first try. But there were test cases in allocator_tests which failed with this code.
>     I do not have the time to investigate the allocation algorithm and its constraints well enough to really understand the cause.
>     So either somebody with a better understanding of the allocation algorithm takes a closer look at this, or we keep my suggested variant.
>     It would be good if we agreed on this before I write the test.
>     
>     BTW, can you explain why it is important that "executors should be launched with both cpus and memory"?
>     What's the difference between these two allocations?
>     a) executor: 0 cpu, its 4 parallel tasks: each 1 cpu
>     b) executor: 0.1 cpu, its 4 parallel tasks: each 1 cpu
>     
>     Is it correct that in case b) the framework can only run 3 parallel tasks if there are 4 cpu resources allocatable?
>     That seems to be a waste of resources just to make a conservative estimate of the cpu resources actually consumed by the executor itself.
>     Why is it so important to reserve cpu resources for the little overhead the executor may cause by calculating the next tasks and communicating with Mesos and its tasks?

```
Your suggested code change was actually my first try. But there were test cases in allocator_tests which failed with this code.
```

I see. If you can paste the logs of the tests that fail I'll be happy to help diagnose/fix. Alternatively, add a note on why you are only doing this for cpu and not for memory.


```
executors should be launched with both cpus and memory
``` 

This is because an executor is an actual unix process that is launched by the slave. If an executor doesn't specify cpus, what should the cpu limits be for that executor *when there are no tasks running* on it? If no cpu limits are set, it might starve other executors/tasks on the slave, violating isolation guarantees. The same goes for memory. Moreover, the current containerizer/isolator code will throw failures when using such an executor, e.g., when the last task on the executor finishes and Containerizer::update() is called with 0 cpus or 0 mem.
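
The rule described above could be checked roughly as follows. This is only a sketch under stated assumptions: `ExecutorResources` and `executorWarning` are hypothetical names, a value of 0 stands for "not specified", and none of this reflects the actual ResourceUsageChecker code in master.cpp.

```cpp
#include <string>

// Hypothetical stand-in for the resources attached to an executor.
struct ExecutorResources {
  double cpus;   // 0 means "not specified"
  double memMb;  // 0 means "not specified"
};

// Returns an empty string if the executor specifies both cpus and memory,
// otherwise a warning text naming what is missing.
std::string executorWarning(const ExecutorResources& r) {
  const bool hasCpus = r.cpus > 0.0;
  const bool hasMem = r.memMb > 0.0;
  if (!hasCpus && !hasMem) {
    return "executor specifies neither cpus nor mem";
  }
  if (!hasCpus) {
    return "executor does not specify cpus";
  }
  if (!hasMem) {
    return "executor does not specify mem";
  }
  return "";
}
```

Returning a message instead of rejecting the executor mirrors the "warn now, enforce later" approach discussed in this thread.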


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review52048
-----------------------------------------------------------


On Sept. 2, 2014, 5:52 p.m., Martin Weindel wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25035/
> -----------------------------------------------------------
> 
> (Updated Sept. 2, 2014, 5:52 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-1688
>     https://issues.apache.org/jira/browse/MESOS-1688
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> As already explained in JIRA MESOS-1688, some schedulers allocate memory only for the executor and not for tasks. In that case, tasks are allocated CPU resources only.
> Such a scheduler is not offered any idle CPUs if nearly all of the slave's memory is already in use.
> This can easily lead to a deadlock (in the application, not in Mesos).
> 
> Simple example:
> 1. Scheduler allocates all memory of a slave for an executor.
> 2. Scheduler launches a task for this executor (allocating 1 CPU).
> 3. Task finishes: 1 CPU, 0 MB memory allocatable.
> 4. No offers are made, as no memory is left. The scheduler will wait for offers forever: a deadlock in the application.
> 
> To fix this problem, offers must be made whenever CPU resources are allocatable, regardless of how much memory is allocatable.
> 
> 
> Diffs
> -----
> 
>   src/master/hierarchical_allocator_process.hpp 34f8cd658920b36b1062bd3b7f6bfbd1bcb6bb52 
> 
> Diff: https://reviews.apache.org/r/25035/diff/
> 
> 
> Testing
> -------
> 
> Deployed patched Mesos 0.19.1 on a small cluster with 3 slaves and tested running multiple parallel Spark jobs in "fine-grained" mode to saturate allocatable memory. The jobs run fine now. With the unpatched Mesos, this load always caused a deadlock in all Spark jobs within one minute.
> 
> 
> Thanks,
> 
> Martin Weindel
> 
>


Re: Review Request 25035: Fix for MESOS-1688

Posted by Martin Weindel <ma...@gmail.com>.

> On Sept. 2, 2014, 5:53 p.m., Vinod Kone wrote:
> > src/master/hierarchical_allocator_process.hpp, lines 825-840
> > <https://reviews.apache.org/r/25035/diff/2/?file=672690#file672690line825>
> >
> >     I suggest deleting this comment altogether because frameworks can utilize offers with either no memory or no cpus based on how they allocate resources between executors and tasks. Also, change the code to:
> >     
> >     ```
> >     return (cpus.isSome() && cpus.get() >= MIN_CPUS) || 
> >            (mem.isSome() && mem.get() >= MIN_MEM);
> >     ```
> >     
> >     The important thing to note here is that executors should be launched with both cpus *and* memory. Mind adding a TODO in ResourceUsageChecker in master.cpp to that effect and log a warning? The reason we are doing a TODO and warning instead of fixing ResourceUsageChecker is to give frameworks (e.g., Spark) time to update their code to adhere to these new semantics. We will enforce this in the next release. Sounds good?

OK, I will take a look at allocator_tests and see how to extend it.

Your suggested code change was actually my first try. But there were test cases in allocator_tests which failed with this code.
I do not have the time to investigate the allocation algorithm and its constraints well enough to really understand the cause.
So either somebody with a better understanding of the allocation algorithm takes a closer look at this, or we keep my suggested variant.
It would be good if we agreed on this before I write the test.

BTW, can you explain why it is important that "executors should be launched with both cpus and memory"?
What's the difference between these two allocations?
a) executor: 0 cpu, its 4 parallel tasks: each 1 cpu
b) executor: 0.1 cpu, its 4 parallel tasks: each 1 cpu

Is it correct that in case b) the framework can only run 3 parallel tasks if there are 4 cpu resources allocatable?
That seems to be a waste of resources just to make a conservative estimate of the cpu resources actually consumed by the executor itself.
Why is it so important to reserve cpu resources for the little overhead the executor may cause by calculating the next tasks and communicating with Mesos and its tasks?
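
The arithmetic behind cases a) and b) can be sketched as follows, assuming the simple accounting used in the question (the executor's share is subtracted first, and each task needs a whole `cpusPerTask`). `maxParallelTasks` and its parameters are hypothetical names for this illustration only.

```cpp
#include <cmath>

// Number of tasks that fit at the same time once the executor's own
// CPU share has been subtracted from the allocatable CPUs.
int maxParallelTasks(double totalCpus, double executorCpus, double cpusPerTask) {
  const double remaining = totalCpus - executorCpus;
  if (remaining <= 0.0 || cpusPerTask <= 0.0) {
    return 0;
  }
  return static_cast<int>(std::floor(remaining / cpusPerTask));
}
```

With 4 allocatable cpus and 1 cpu per task, `maxParallelTasks(4.0, 0.0, 1.0)` gives 4 for case a), while `maxParallelTasks(4.0, 0.1, 1.0)` gives 3 for case b), because only 3.9 cpus remain for tasks.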


- Martin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review52048
-----------------------------------------------------------




Re: Review Request 25035: Fix for MESOS-1688

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review52048
-----------------------------------------------------------


Mind writing a test for this in allocator_tests.cpp?


src/master/hierarchical_allocator_process.hpp
<https://reviews.apache.org/r/25035/#comment90782>

    I suggest deleting this comment altogether because frameworks can utilize offers with either no memory or no cpus based on how they allocate resources between executors and tasks. Also, change the code to:
    
    ```
    return (cpus.isSome() && cpus.get() >= MIN_CPUS) || 
           (mem.isSome() && mem.get() >= MIN_MEM);
    ```
    
    The important thing to note here is that executors should be launched with both cpus *and* memory. Mind adding a TODO in ResourceUsageChecker in master.cpp to that effect and log a warning? The reason we are doing a TODO and warning instead of fixing ResourceUsageChecker is to give frameworks (e.g., Spark) time to update their code to adhere to these new semantics. We will enforce this in the next release. Sounds good?
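
For readers who want to try the suggested expression in isolation, here is a self-contained sketch. The `Option` class below is a minimal stand-in written for this example (it is not the stout `Option` used in Mesos), memory is simplified to a plain number of megabytes, and the values of `MIN_CPUS` and `MIN_MEM` are assumptions.

```cpp
// Minimal stand-in offering isSome()/get(), only so that the suggested
// expression compiles on its own. Not the Option class used in Mesos.
template <typename T>
class Option {
 public:
  Option() : some_(false), value_() {}
  Option(const T& value) : some_(true), value_(value) {}
  bool isSome() const { return some_; }
  const T& get() const { return value_; }

 private:
  bool some_;
  T value_;
};

// Assumed thresholds, for illustration only.
const double MIN_CPUS = 0.1;
const double MIN_MEM = 32.0;  // in MB

// A slave counts as allocatable if it has enough free cpus OR enough
// free memory, as in the suggested code above.
bool allocatable(const Option<double>& cpus, const Option<double>& mem) {
  return (cpus.isSome() && cpus.get() >= MIN_CPUS) ||
         (mem.isSome() && mem.get() >= MIN_MEM);
}
```

Under this predicate, a slave with only free cpus and a slave with only free memory both produce offers; a slave where both are missing or below their thresholds does not.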


- Vinod Kone




Re: Review Request 25035: Fix for MESOS-1688

Posted by Mesos ReviewBot <de...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review52550
-----------------------------------------------------------


Patch looks great!

Reviews applied: [25035]

All tests passed.

- Mesos ReviewBot


On Sept. 6, 2014, 10:03 p.m., Martin Weindel wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25035/
> -----------------------------------------------------------
> 
> (Updated Sept. 6, 2014, 10:03 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-1688
>     https://issues.apache.org/jira/browse/MESOS-1688
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> As already explained in JIRA MESOS-1688, some schedulers allocate memory only for the executor and not for tasks. In that case, tasks are allocated CPU resources only.
> Such a scheduler is not offered any idle CPUs if nearly all of the slave's memory is already in use.
> This can easily lead to a deadlock (in the application, not in Mesos).
> 
> Simple example:
> 1. Scheduler allocates all memory of a slave for an executor.
> 2. Scheduler launches a task for this executor (allocating 1 CPU).
> 3. Task finishes: 1 CPU, 0 MB memory allocatable.
> 4. No offers are made, as no memory is left. The scheduler will wait for offers forever: a deadlock in the application.
> 
> To fix this problem, offers must be made whenever CPU resources are allocatable, regardless of how much memory is allocatable.
> 
> 
> Diffs
> -----
> 
>   src/common/resources.cpp edf36b1 
>   src/master/constants.hpp ce7995b 
>   src/master/constants.cpp faa1503 
>   src/master/hierarchical_allocator_process.hpp 34f8cd6 
>   src/master/master.cpp 18464ba 
>   src/tests/allocator_tests.cpp 774528a 
> 
> Diff: https://reviews.apache.org/r/25035/diff/
> 
> 
> Testing
> -------
> 
> Deployed patched Mesos 0.19.1 on a small cluster with 3 slaves and tested running multiple parallel Spark jobs in "fine-grained" mode to saturate allocatable memory. The jobs run fine now. With the unpatched Mesos, this load always caused a deadlock in all Spark jobs within one minute.
> 
> 
> Thanks,
> 
> Martin Weindel
> 
>


Re: Review Request 25035: Fix for MESOS-1688

Posted by Martin Weindel <ma...@gmail.com>.

> On Sept. 15, 2014, 3:23 p.m., Timothy St. Clair wrote:
> > src/master/hierarchical_allocator_process.hpp, line 837
> > <https://reviews.apache.org/r/25035/diff/7/?file=688721#file688721line837>
> >
> >     What happens in the case where all CPUs are taken but memory is available?  It looks like it will return (true), but this should not be possible. 
> >     
> >     I think you want to give an offer in the case where there are CPU resources available, but memory is consumed by the executor.
> 
> Vinod Kone wrote:
>     Giving memory only resources is ok as long as it is used for a task and not an executor. See my comments above.
> 
> Timothy St. Clair wrote:
>     Could you please add a detailed comment in the code above the mod, as on 1st inspection it leaves me still feeling unsettled.

I agree with Vinod. An executor may make use of additional offered memory, e.g., for expanding a cache.
In this scenario, the already allocated CPU resources are sufficient.


- Martin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review53343
-----------------------------------------------------------


On Sept. 16, 2014, 9:05 p.m., Martin Weindel wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25035/
> -----------------------------------------------------------
> 
> (Updated Sept. 16, 2014, 9:05 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-1688
>     https://issues.apache.org/jira/browse/MESOS-1688
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> As already explained in JIRA MESOS-1688, some schedulers allocate memory only for the executor and not for tasks. In that case, tasks are allocated CPU resources only.
> Such a scheduler is not offered any idle CPUs if nearly all of the slave's memory is already in use.
> This can easily lead to a deadlock (in the application, not in Mesos).
> 
> Simple example:
> 1. Scheduler allocates all memory of a slave for an executor.
> 2. Scheduler launches a task for this executor (allocating 1 CPU).
> 3. Task finishes: 1 CPU, 0 MB memory allocatable.
> 4. No offers are made, as no memory is left. The scheduler will wait for offers forever: a deadlock in the application.
> 
> To fix this problem, offers must be made whenever CPU resources are allocatable, regardless of how much memory is allocatable.
> 
> 
> Diffs
> -----
> 
>   CHANGELOG a822cc4 
>   src/common/resources.cpp edf36b1 
>   src/master/constants.cpp faa1503 
>   src/master/hierarchical_allocator_process.hpp 34f8cd6 
>   src/master/master.cpp 18464ba 
>   src/tests/allocator_tests.cpp 774528a 
> 
> Diff: https://reviews.apache.org/r/25035/diff/
> 
> 
> Testing
> -------
> 
> Deployed patched Mesos 0.19.1 on a small cluster with 3 slaves and tested running multiple parallel Spark jobs in "fine-grained" mode to saturate allocatable memory. The jobs run fine now. With the unpatched Mesos, this load always caused a deadlock in all Spark jobs within one minute.
> 
> 
> Thanks,
> 
> Martin Weindel
> 
>


Re: Review Request 25035: Fix for MESOS-1688

Posted by "Timothy St. Clair" <ts...@redhat.com>.

> On Sept. 15, 2014, 3:23 p.m., Timothy St. Clair wrote:
> > src/master/hierarchical_allocator_process.hpp, line 837
> > <https://reviews.apache.org/r/25035/diff/7/?file=688721#file688721line837>
> >
> >     What happens in the case where all CPUs are taken but memory is available?  It looks like it will return (true), but this should not be possible. 
> >     
> >     I think you want to give an offer in the case where there are CPU resources available, but memory is consumed by the executor.
> 
> Vinod Kone wrote:
>     Giving memory only resources is ok as long as it is used for a task and not an executor. See my comments above.

Could you please add a detailed comment in the code above the mod, as on 1st inspection it leaves me still feeling unsettled.


- Timothy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review53343
-----------------------------------------------------------




Re: Review Request 25035: Fix for MESOS-1688

Posted by Vinod Kone <vi...@gmail.com>.

> On Sept. 15, 2014, 3:23 p.m., Timothy St. Clair wrote:
> > src/master/hierarchical_allocator_process.hpp, line 837
> > <https://reviews.apache.org/r/25035/diff/7/?file=688721#file688721line837>
> >
> >     What happens in the case where all CPUs are taken but memory is available?  It looks like it will return (true), but this should not be possible. 
> >     
> >     I think you want to give an offer in the case where there are CPU resources available, but memory is consumed by the executor.

Offering memory-only resources is OK as long as they are used for a task and not an executor. See my comments above.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review53343
-----------------------------------------------------------


On Sept. 13, 2014, 7:10 p.m., Martin Weindel wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25035/
> -----------------------------------------------------------
> 
> (Updated Sept. 13, 2014, 7:10 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-1688
>     https://issues.apache.org/jira/browse/MESOS-1688
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> As already explained in JIRA MESOS-1688, some schedulers allocate memory only for the executor and not for tasks. In that case, tasks are allocated CPU resources only.
> Such a scheduler is not offered any idle CPUs if nearly all of the slave's memory is already in use.
> This can easily lead to a deadlock (in the application, not in Mesos).
> 
> Simple example:
> 1. Scheduler allocates all memory of a slave for an executor.
> 2. Scheduler launches a task for this executor (allocating 1 CPU).
> 3. Task finishes: 1 CPU, 0 MB memory allocatable.
> 4. No offers are made, as no memory is left. The scheduler will wait for offers forever: a deadlock in the application.
> 
> To fix this problem, offers must be made whenever CPU resources are allocatable, regardless of how much memory is allocatable.
> 
> 
> Diffs
> -----
> 
>   CHANGELOG a822cc4 
>   src/common/resources.cpp edf36b1 
>   src/master/constants.cpp faa1503 
>   src/master/hierarchical_allocator_process.hpp 34f8cd6 
>   src/master/master.cpp 18464ba 
>   src/tests/allocator_tests.cpp 774528a 
> 
> Diff: https://reviews.apache.org/r/25035/diff/
> 
> 
> Testing
> -------
> 
> Deployed patched Mesos 0.19.1 on a small cluster with 3 slaves and tested running multiple parallel Spark jobs in "fine-grained" mode to saturate allocatable memory. The jobs run fine now. With the unpatched Mesos, this load always caused a deadlock in all Spark jobs within one minute.
> 
> 
> Thanks,
> 
> Martin Weindel
> 
>


Re: Review Request 25035: Fix for MESOS-1688

Posted by Ben Mahler <be...@gmail.com>.

> On Sept. 15, 2014, 3:23 p.m., Timothy St. Clair wrote:
> > src/master/hierarchical_allocator_process.hpp, line 837
> > <https://reviews.apache.org/r/25035/diff/7/?file=688721#file688721line837>
> >
> >     What happens in the case where all CPUs are taken but memory is available?  It looks like it will return (true), but this should not be possible. 
> >     
> >     I think you want to give an offer in the case where there are CPU resources available, but memory is consumed by the executor.
> 
> Vinod Kone wrote:
>     Giving memory only resources is ok as long as it is used for a task and not an executor. See my comments above.
> 
> Timothy St. Clair wrote:
>     Could you please add a detailed comment in the code above the mod, as on 1st inspection it leaves me still feeling unsettled.
> 
> Martin Weindel wrote:
>     I agree with Vinod. An executor may make use of additional offered memory, e.g., for expanding a cache.
>     In this scenario, the already allocated CPU resources are sufficient.

More generally, I think resources for executors should be required, and resources for tasks should be optional.

If a task doesn't need to specify CPU, then as a corollary, it doesn't need to specify memory.


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review53343
-----------------------------------------------------------




Re: Review Request 25035: Fix for MESOS-1688

Posted by "Timothy St. Clair" <ts...@redhat.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review53343
-----------------------------------------------------------



src/master/hierarchical_allocator_process.hpp
<https://reviews.apache.org/r/25035/#comment92977>

    What happens in the case where all CPUs are taken but memory is available?  It looks like it will return (true), but this should not be possible. 
    
    I think you want to give an offer in the case where there are CPU resources available, but memory is consumed by the executor.


- Timothy St. Clair




Re: Review Request 25035: Fix for MESOS-1688

Posted by Martin Weindel <ma...@gmail.com>.

> On Sept. 15, 2014, 9:02 p.m., Vinod Kone wrote:
> > CHANGELOG, lines 1-9
> > <https://reviews.apache.org/r/25035/diff/7/?file=688718#file688718line1>
> >
> >     Thinking a bit more about this and talking to others. Adding deprecations in a bug fix release is a bit weird.
> >     
> >     2 options. 
> >     
> >     1) We can land this feature in 0.21.0 and not 0.20.1. That way we will do deprecation warning in 0.21.0 and disallow cpu/mem only executors in 0.22.0. This is the most straightforward.
> >     
> >     2) Land this in 0.20.1, but the deprecation warning, in changelog (and ResourceUsageChecker?), happens in 0.21.0. The disallowing happens in 0.22.0. This is a bit weird but not too bad if you absolutely need this in 0.20.1.
> >     
> >     Considering 0.21.0 would happen in a month or so, I prefer #1. Does that work for you?

What matters to me is only that the problem gets fixed in the near future.
So I adjusted the patch for integration with 0.21.0.


- Martin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review53362
-----------------------------------------------------------




Re: Review Request 25035: Fix for MESOS-1688

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review53362
-----------------------------------------------------------


Minor nits and we will get this committed. Thanks for your patience Martin.


CHANGELOG
<https://reviews.apache.org/r/25035/#comment93059>

    Thinking a bit more about this and talking to others. Adding deprecations in a bug fix release is a bit weird.
    
    2 options. 
    
    1) We can land this feature in 0.21.0 and not 0.20.1. That way we will do deprecation warning in 0.21.0 and disallow cpu/mem only executors in 0.22.0. This is the most straightforward.
    
    2) Land this in 0.20.1, but the deprecation warning, in changelog (and ResourceUsageChecker?), happens in 0.21.0. The disallowing happens in 0.22.0. This is a bit weird but not too bad if you absolutely need this in 0.20.1.
    
    Considering 0.21.0 would happen in a month or so, I prefer #1. Does that work for you?



src/master/master.cpp
<https://reviews.apache.org/r/25035/#comment92992>

    Also log the CPU resources used by the executor for easier debugging.
    
    e.g.,
    
    LOG(WARNING)
      << "Executor " << task.executor().executor_id()
      << " for task " << task.task_id()
      << " uses fewer CPUs ("
      << (cpus.isSome() ? stringify(cpus.get()) : "none")
      << ") than the minimum required (" << MIN_CPUS
      << "). Please update........"
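
    One detail worth checking in a message like this: `<<` binds more tightly than `?:`, so a conditional expression inside a stream chain needs its own parentheses, and both branches need a common type. Below is a minimal standalone sketch that uses `std::ostringstream` in place of the logging macro. The function `formatCpuWarning`, its parameters, and the `MIN_CPUS` value are assumptions for illustration, not part of the patch.

```cpp
#include <sstream>
#include <string>

constexpr double MIN_CPUS = 0.1;  // Hypothetical value for illustration.

// Builds the warning text. `cpus` is nullptr when the executor requested
// no CPU resources at all.
std::string formatCpuWarning(const std::string& executorId,
                             const std::string& taskId,
                             const double* cpus) {
  std::ostringstream out;
  out << "Executor " << executorId
      << " for task " << taskId
      << " uses fewer CPUs ("
      // Parenthesized so the whole conditional is streamed; both branches
      // are std::string, so the expression has a single type.
      << (cpus != nullptr ? std::to_string(*cpus) : std::string("none"))
      << ") than the minimum required (" << MIN_CPUS << ")";
  return out.str();
}
```

    Without the parentheses, the chain up to `cpus != nullptr` would be parsed as the condition of the `?:` operator, which is not what the message intends.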



src/master/master.cpp
<https://reviews.apache.org/r/25035/#comment92993>

    ditto. log requested memory.



src/tests/allocator_tests.cpp
<https://reviews.apache.org/r/25035/#comment92991>

    2 blank lines.



src/tests/allocator_tests.cpp
<https://reviews.apache.org/r/25035/#comment92990>

    2 blank lines.


- Vinod Kone




Re: Review Request 25035: Fix for MESOS-1688

Posted by Mesos ReviewBot <de...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review53280
-----------------------------------------------------------


Patch looks great!

Reviews applied: [25035]

All tests passed.

- Mesos ReviewBot




Re: Review Request 25035: Fix for MESOS-1688

Posted by Mesos ReviewBot <de...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review53621
-----------------------------------------------------------


Patch looks great!

Reviews applied: [25035]

All tests passed.

- Mesos ReviewBot




Re: Review Request 25035: Updated allocator to offer cpu only or memory only resources.

Posted by Martin Weindel <ma...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/
-----------------------------------------------------------

(Updated Sept. 17, 2014, 6:36 p.m.)


Review request for mesos and Vinod Kone.


Changes
-------

Updated the summary.

Also edited the CHANGELOG to point to a new ticket regarding deprecation.

I'll commit this now.


Summary (updated)
-----------------

Updated allocator to offer cpu only or memory only resources.


Bugs: MESOS-1688
    https://issues.apache.org/jira/browse/MESOS-1688


Repository: mesos-git


Description
-------

As already explained in JIRA MESOS-1688, some schedulers allocate memory only for the executor and not for tasks. In this case, only CPU resources are allocated for tasks.
Such a scheduler is not offered any idle CPUs once nearly all of the slave's memory is in use.
This can easily lead to a deadlock (in the application, not in Mesos).

Simple example:
1. Scheduler allocates all memory of a slave for an executor.
2. Scheduler launches a task for this executor (allocating 1 CPU).
3. Task finishes: 1 CPU, 0 MB memory allocatable.
4. No offers are made, as no memory is left. The scheduler waits for offers forever: a deadlock in the application.

To fix this problem, offers must be made whenever CPU resources are allocatable, regardless of allocatable memory.
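
The four steps above can be traced with a small model. This is a sketch under stated assumptions, not Mesos code: the `Slave` struct, the threshold values, and all function names are invented for the example, and the "before" and "after" checks only approximate the behaviour described here.

```cpp
#include <cassert>

// Hypothetical minimums below which a resource is not worth offering.
constexpr double MIN_CPUS = 0.1;
constexpr double MIN_MEM_MB = 32.0;

// Free (unallocated) resources of a single slave; illustrative only.
struct Slave {
  double freeCpus;
  double freeMemMb;

  void allocate(double cpus, double memMb) {
    assert(cpus <= freeCpus && memMb <= freeMemMb);
    freeCpus -= cpus;
    freeMemMb -= memMb;
  }

  void release(double cpus, double memMb) {
    freeCpus += cpus;
    freeMemMb += memMb;
  }
};

// Behaviour described as the problem: no memory left means no offer.
bool offersBeforeFix(const Slave& s) {
  return s.freeCpus >= MIN_CPUS && s.freeMemMb >= MIN_MEM_MB;
}

// Behaviour proposed as the fix: free CPUs alone are enough for an offer.
bool offersAfterFix(const Slave& s) {
  return s.freeCpus >= MIN_CPUS || s.freeMemMb >= MIN_MEM_MB;
}

// Replays steps 1-3 on a slave with 1 CPU and 1024 MB and returns the
// state in which the scheduler is waiting for its next offer.
Slave replayExample() {
  Slave s{1.0, 1024.0};
  s.allocate(0.0, 1024.0);  // 1. Executor takes all memory.
  s.allocate(1.0, 0.0);     // 2. Task takes 1 CPU.
  s.release(1.0, 0.0);      // 3. Task finishes; the executor keeps its memory.
  return s;
}
```

In the state returned by `replayExample()`, `offersBeforeFix` is false (step 4, the application waits forever) while `offersAfterFix` is true, so the idle CPU is offered again.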


Diffs
-----

  CHANGELOG a822cc4 
  src/common/resources.cpp edf36b1 
  src/master/constants.cpp faa1503 
  src/master/hierarchical_allocator_process.hpp 34f8cd6 
  src/master/master.cpp 18464ba 
  src/tests/allocator_tests.cpp 774528a 

Diff: https://reviews.apache.org/r/25035/diff/


Testing
-------

Deployed patched Mesos 0.19.1 on a small cluster with 3 slaves and tested running multiple parallel Spark jobs in "fine-grained" mode to saturate allocatable memory. The jobs run fine now. With unpatched Mesos, this load always caused a deadlock in all Spark jobs within one minute.


Thanks,

Martin Weindel


Re: Review Request 25035: Fix for MESOS-1688

Posted by Martin Weindel <ma...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/
-----------------------------------------------------------

(Updated Sept. 16, 2014, 9:05 p.m.)


Review request for mesos and Vinod Kone.


Changes
-------

Adjusted CHANGELOG and comments for integration with 0.21.0 instead of 0.20.1.
Improved warning log.


Bugs: MESOS-1688
    https://issues.apache.org/jira/browse/MESOS-1688


Repository: mesos-git


Description
-------

As already explained in JIRA MESOS-1688, some schedulers allocate memory only for the executor and not for tasks. In this case, only CPU resources are allocated for tasks.
Such a scheduler is not offered any idle CPUs once nearly all of the slave's memory is in use.
This can easily lead to a deadlock (in the application, not in Mesos).

Simple example:
1. Scheduler allocates all memory of a slave for an executor.
2. Scheduler launches a task for this executor (allocating 1 CPU).
3. Task finishes: 1 CPU, 0 MB memory allocatable.
4. No offers are made, as no memory is left. The scheduler waits for offers forever: a deadlock in the application.

To fix this problem, offers must be made whenever CPU resources are allocatable, regardless of allocatable memory.


Diffs (updated)
-----

  CHANGELOG a822cc4 
  src/common/resources.cpp edf36b1 
  src/master/constants.cpp faa1503 
  src/master/hierarchical_allocator_process.hpp 34f8cd6 
  src/master/master.cpp 18464ba 
  src/tests/allocator_tests.cpp 774528a 

Diff: https://reviews.apache.org/r/25035/diff/


Testing
-------

Deployed patched Mesos 0.19.1 on a small cluster with 3 slaves and tested running multiple parallel Spark jobs in "fine-grained" mode to saturate allocatable memory. The jobs run fine now. With unpatched Mesos, this load always caused a deadlock in all Spark jobs within one minute.


Thanks,

Martin Weindel


Re: Review Request 25035: Fix for MESOS-1688

Posted by Martin Weindel <ma...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/
-----------------------------------------------------------

(Updated Sept. 13, 2014, 7:10 p.m.)


Review request for mesos and Vinod Kone.


Changes
-------

Improved the readability of the patch in Resources::find().


Bugs: MESOS-1688
    https://issues.apache.org/jira/browse/MESOS-1688


Repository: mesos-git


Description
-------

As already explained in JIRA MESOS-1688, some schedulers allocate memory only for the executor and not for tasks. In this case, only CPU resources are allocated for tasks.
Such a scheduler is not offered any idle CPUs once nearly all of the slave's memory is in use.
This can easily lead to a deadlock (in the application, not in Mesos).

Simple example:
1. Scheduler allocates all memory of a slave for an executor.
2. Scheduler launches a task for this executor (allocating 1 CPU).
3. Task finishes: 1 CPU, 0 MB memory allocatable.
4. No offers are made, as no memory is left. The scheduler waits for offers forever: a deadlock in the application.

To fix this problem, offers must be made whenever CPU resources are allocatable, regardless of allocatable memory.


Diffs (updated)
-----

  CHANGELOG a822cc4 
  src/common/resources.cpp edf36b1 
  src/master/constants.cpp faa1503 
  src/master/hierarchical_allocator_process.hpp 34f8cd6 
  src/master/master.cpp 18464ba 
  src/tests/allocator_tests.cpp 774528a 

Diff: https://reviews.apache.org/r/25035/diff/


Testing
-------

Deployed patched Mesos 0.19.1 on a small cluster with 3 slaves and tested running multiple parallel Spark jobs in "fine-grained" mode to saturate allocatable memory. The jobs run fine now. With unpatched Mesos, this load always caused a deadlock in all Spark jobs within one minute.


Thanks,

Martin Weindel


Re: Review Request 25035: Fix for MESOS-1688

Posted by Martin Weindel <ma...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/
-----------------------------------------------------------

(Updated Sept. 13, 2014, 6:56 p.m.)


Review request for mesos and Vinod Kone.


Bugs: MESOS-1688
    https://issues.apache.org/jira/browse/MESOS-1688


Repository: mesos-git


Description
-------

As already explained in JIRA MESOS-1688, some schedulers allocate memory only for the executor and not for tasks. In this case, only CPU resources are allocated for tasks.
Such a scheduler is not offered any idle CPUs once nearly all of the slave's memory is in use.
This can easily lead to a deadlock (in the application, not in Mesos).

Simple example:
1. Scheduler allocates all memory of a slave for an executor.
2. Scheduler launches a task for this executor (allocating 1 CPU).
3. Task finishes: 1 CPU, 0 MB memory allocatable.
4. No offers are made, as no memory is left. The scheduler waits for offers forever: a deadlock in the application.

To fix this problem, offers must be made whenever CPU resources are allocatable, regardless of allocatable memory.


Diffs (updated)
-----

  CHANGELOG a822cc4 
  src/common/resources.cpp edf36b1 
  src/master/constants.cpp faa1503 
  src/master/hierarchical_allocator_process.hpp 34f8cd6 
  src/master/master.cpp 18464ba 
  src/tests/allocator_tests.cpp 774528a 

Diff: https://reviews.apache.org/r/25035/diff/


Testing
-------

Deployed patched Mesos 0.19.1 on a small cluster with 3 slaves and tested running multiple parallel Spark jobs in "fine-grained" mode to saturate allocatable memory. The jobs run fine now. With unpatched Mesos, this load always caused a deadlock in all Spark jobs within one minute.


Thanks,

Martin Weindel


Re: Review Request 25035: Fix for MESOS-1688

Posted by Mesos ReviewBot <de...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review53027
-----------------------------------------------------------


Patch looks great!

Reviews applied: [25035]

All tests passed.

- Mesos ReviewBot




Re: Review Request 25035: Fix for MESOS-1688

Posted by Vinod Kone <vi...@gmail.com>.

> On Sept. 11, 2014, 5:35 a.m., Vinod Kone wrote:
> >

Can you also update the summary of the review to something more meaningful? We typically use the summary to generate the commit message.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review53002
-----------------------------------------------------------




Re: Review Request 25035: Fix for MESOS-1688

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review53002
-----------------------------------------------------------



src/common/resources.cpp
<https://reviews.apache.org/r/25035/#comment92333>

    I'm not sure what's happening here. Can you add a comment?



src/master/master.cpp
<https://reviews.apache.org/r/25035/#comment92334>

    Add a TODO:
    
    TODO(martin): Return Error instead of logging a warning in 0.21.0.



src/tests/allocator_tests.cpp
<https://reviews.apache.org/r/25035/#comment92336>

    s/with cpus only/using only cpus/



src/tests/allocator_tests.cpp
<https://reviews.apache.org/r/25035/#comment92335>

    s/tasks/task/



src/tests/allocator_tests.cpp
<https://reviews.apache.org/r/25035/#comment92337>

    s/with memory only/using only memory/



src/tests/allocator_tests.cpp
<https://reviews.apache.org/r/25035/#comment92338>

    s/mem/memory/



src/tests/allocator_tests.cpp
<https://reviews.apache.org/r/25035/#comment92339>

    s/tasks/task/


- Vinod Kone




Re: Review Request 25035: Fix for MESOS-1688

Posted by Martin Weindel <ma...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/
-----------------------------------------------------------

(Updated Sept. 10, 2014, 10 p.m.)


Review request for mesos and Vinod Kone.


Changes
-------

Fixed review issues.


Bugs: MESOS-1688
    https://issues.apache.org/jira/browse/MESOS-1688


Repository: mesos-git


Description
-------

As already explained in JIRA MESOS-1688, some schedulers allocate memory only for the executor and not for tasks. In this case, only CPU resources are allocated for tasks.
Such a scheduler is not offered any idle CPUs once nearly all of the slave's memory is in use.
This can easily lead to a deadlock (in the application, not in Mesos).

Simple example:
1. Scheduler allocates all memory of a slave for an executor.
2. Scheduler launches a task for this executor (allocating 1 CPU).
3. Task finishes: 1 CPU, 0 MB memory allocatable.
4. No offers are made, as no memory is left. The scheduler waits for offers forever: a deadlock in the application.

To fix this problem, offers must be made whenever CPU resources are allocatable, regardless of allocatable memory.


Diffs (updated)
-----

  src/common/resources.cpp edf36b1 
  src/master/constants.cpp faa1503 
  src/master/hierarchical_allocator_process.hpp 34f8cd6 
  src/master/master.cpp 18464ba 
  src/tests/allocator_tests.cpp 774528a 

Diff: https://reviews.apache.org/r/25035/diff/


Testing
-------

Deployed patched Mesos 0.19.1 on a small cluster with 3 slaves and tested running multiple parallel Spark jobs in "fine-grained" mode to saturate allocatable memory. The jobs run fine now. With unpatched Mesos, this load always caused a deadlock in all Spark jobs within one minute.


Thanks,

Martin Weindel


Re: Review Request 25035: Fix for MESOS-1688

Posted by Vinod Kone <vi...@gmail.com>.

> On Sept. 9, 2014, 7:10 p.m., Vinod Kone wrote:
> > src/master/master.cpp, line 1901
> > <https://reviews.apache.org/r/25035/diff/4/?file=682182#file682182line1901>
> >
> >     I like these warnings.
> >     
> >     Are you planning to get this into 0.20.1 or 0.21.0? If the former, can you add this to the list of deprecations in CHANGELOG?
> 
> Martin Weindel wrote:
>     It would be nice to see this in 0.20.1.
>     But it is not clear to me how to update the CHANGELOG. There is no section for upcoming releases.

Just start one for 0.20.1 and just add the deprecation. See how we did it for 0.20.0 and 0.19.1 for inspiration. As we get close to releasing 0.20.1, the release manager will make sure to update the CHANGELOG with the tickets and other info.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review52763
-----------------------------------------------------------




Re: Review Request 25035: Fix for MESOS-1688

Posted by Martin Weindel <ma...@gmail.com>.

> On Sept. 9, 2014, 7:10 p.m., Vinod Kone wrote:
> > src/master/master.cpp, line 1901
> > <https://reviews.apache.org/r/25035/diff/4/?file=682182#file682182line1901>
> >
> >     I like these warnings.
> >     
> >     Are you planning to get this into 0.20.1 or 0.21.0? If the former, can you add this to the list of deprecations in CHANGELOG?

It would be nice to see this in 0.20.1.
But it is not clear to me how to update the CHANGELOG; there is no section for upcoming releases.


- Martin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review52763
-----------------------------------------------------------




Re: Review Request 25035: Fix for MESOS-1688

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review52763
-----------------------------------------------------------



src/master/hierarchical_allocator_process.hpp
<https://reviews.apache.org/r/25035/#comment91816>

    kill trailing white space.
    
    we should really have the mesos-style.py catch this. filed: https://issues.apache.org/jira/browse/MESOS-1779



src/master/master.cpp
<https://reviews.apache.org/r/25035/#comment91822>

    We use proper sentences (capitalization, periods etc) for comments.
    
    s/check/Check/
    
    s/set/set./



src/master/master.cpp
<https://reviews.apache.org/r/25035/#comment91839>

    How about:
    
    ```
    Resources executorResources = task.executor().resources();

    foreach (const Resource& resource, executorResources) {
      .....
    }

    Option<double> cpus = executorResources.cpus();
    if (cpus.isNone() || cpus.get() < MIN_CPUS) {
      LOG(WARNING) << ...
    }

    Option<Bytes> mem = executorResources.mem();
    if (mem.isNone() || mem.get() < MIN_MEM) {
      LOG(WARNING) << ...
    }
    ```
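
For reference, the check suggested above can be sketched in self-contained form. This is only an illustration under stated assumptions: `std::optional` (C++17) stands in for stout's `Option`, `ExecutorResources`, `executorWarnings` and `exampleUsage` are hypothetical names, and the threshold values are illustrative rather than taken from the patch.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Illustrative thresholds (assumptions, not the values from the patch).
constexpr double kMinCpus = 0.01;
constexpr double kMinMemMb = 32.0;

// Hypothetical stand-in for the executor's resources; an empty optional
// means the resource was not specified at all.
struct ExecutorResources {
  std::optional<double> cpus;
  std::optional<double> memMb;
};

// Returns one warning per resource that is missing or below its minimum.
std::vector<std::string> executorWarnings(const ExecutorResources& r) {
  std::vector<std::string> warnings;
  if (!r.cpus.has_value() || *r.cpus < kMinCpus) {
    warnings.push_back("Executor should set cpus to at least the minimum.");
  }
  if (!r.memMb.has_value() || *r.memMb < kMinMemMb) {
    warnings.push_back("Executor should set mem to at least the minimum.");
  }
  return warnings;
}

// Usage example: a memory-only executor triggers exactly one warning.
inline void exampleUsage() {
  ExecutorResources memOnly;
  memOnly.memMb = 512.0;
  assert(executorWarnings(memOnly).size() == 1);
}
```

Collecting the warnings in a vector instead of logging directly keeps the sketch testable; in the master the same conditions would feed `LOG(WARNING)`.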



src/master/master.cpp
<https://reviews.apache.org/r/25035/#comment91823>

    Why do we need a minimum cpus for the executor?
    
    I'm assuming you didn't want to use MIN_CPUS because 0.1 cpus is too much overhead for Spark?
    
    0.01 cpus equates to ~10 shares (1024 * 0.01), which is the minimum number of shares enforced by the cpu isolator; so, let's just change MIN_CPUS to 0.01 and get rid of MIN_CPUS_EXECUTOR.
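
The arithmetic in this comment can be checked with a small sketch. The constant and function names below are hypothetical, and the values 1024 (shares per CPU) and 10 (minimum shares) are simply the figures quoted in the comment, not read from the isolator source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Figures quoted in the review comment (assumed, not taken from the source).
constexpr std::uint64_t kCpuSharesPerCpu = 1024;
constexpr std::uint64_t kMinCpuShares = 10;

// Converts a fractional CPU count to shares, truncating toward zero and
// clamping to the assumed minimum.
std::uint64_t cpuShares(double cpus) {
  assert(cpus >= 0.0);
  const auto shares = static_cast<std::uint64_t>(kCpuSharesPerCpu * cpus);
  return std::max(shares, kMinCpuShares);
}
```

With these assumptions, `cpuShares(0.01)` gives 10 (1024 * 0.01 = 10.24, truncated), which is why a MIN_CPUS below 0.01 would no longer change the number of shares.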



src/master/master.cpp
<https://reviews.apache.org/r/25035/#comment91830>

    s/executor/Executor/
    s/should/ should/



src/master/master.cpp
<https://reviews.apache.org/r/25035/#comment91817>

    kill extra white space.



src/master/master.cpp
<https://reviews.apache.org/r/25035/#comment91818>

    kill extra white space.



src/master/master.cpp
<https://reviews.apache.org/r/25035/#comment91833>

    s/executor/Executor/
    s/should/ should/



src/master/master.cpp
<https://reviews.apache.org/r/25035/#comment91840>

    I like these warnings.
    
    Are you planning to get this into 0.20.1 or 0.21.0? If the former, can you add this to the list of deprecations in CHANGELOG?



src/tests/allocator_tests.cpp
<https://reviews.apache.org/r/25035/#comment91842>

    s/pure cpu/cpu only/ ?
    
    s/offered/offered./
    
    Also, this test is doing more than checking resources are offered. It is also testing that task(s) which use cpus only are launched. Can you please add that to the comment?
    
    Note that this test and the next one will break once we disallow such launches in the future, but that is good to have.



src/tests/allocator_tests.cpp
<https://reviews.apache.org/r/25035/#comment91843>

    Just say:
    
    // Start a slave with cpu only resources.



src/tests/allocator_tests.cpp
<https://reviews.apache.org/r/25035/#comment91847>

    Capitalize and period.
    
    Also, why launch two tasks instead of one? Makes the test a bit complicated and not sure you are gaining much.



src/tests/allocator_tests.cpp
<https://reviews.apache.org/r/25035/#comment91848>

    Capitalize and period.



src/tests/allocator_tests.cpp
<https://reviews.apache.org/r/25035/#comment91849>

    s/pure memory/memory only/ ?
    
    s/offered/offered./
    
    also, see comments on the previous test.



src/tests/allocator_tests.cpp
<https://reviews.apache.org/r/25035/#comment91850>

    // Start slave with memory only resources.



src/tests/allocator_tests.cpp
<https://reviews.apache.org/r/25035/#comment91851>

    Period at the end.
    
    ditto. see comments in the previous test.


- Vinod Kone




Re: Review Request 25035: Fix for MESOS-1688

Posted by Martin Weindel <ma...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/
-----------------------------------------------------------

(Updated Sept. 6, 2014, 10:03 p.m.)


Review request for mesos and Vinod Kone.


Bugs: MESOS-1688
    https://issues.apache.org/jira/browse/MESOS-1688


Repository: mesos-git


Description
-------

As already explained in JIRA MESOS-1688, some schedulers allocate memory only for the executor and not for tasks. In this case, only CPU resources are allocated for tasks.
Such a scheduler is not offered any idle CPUs if the slave has used up nearly all of its memory.
This can easily lead to a deadlock (in the application, not in Mesos).

Simple example:
1. Scheduler allocates all memory of a slave for an executor.
2. Scheduler launches a task for this executor (allocating 1 CPU).
3. Task finishes: 1 CPU, 0 MB memory allocatable.
4. No offers are made, as no memory is left. The scheduler will wait for offers forever: a deadlock in the application.

To fix this problem, offers must be made whenever CPU resources are allocatable, regardless of allocatable memory.


Diffs (updated)
-----

  src/common/resources.cpp edf36b1 
  src/master/constants.hpp ce7995b 
  src/master/constants.cpp faa1503 
  src/master/hierarchical_allocator_process.hpp 34f8cd6 
  src/master/master.cpp 18464ba 
  src/tests/allocator_tests.cpp 774528a 

Diff: https://reviews.apache.org/r/25035/diff/


Testing
-------

Deployed patched Mesos 0.19.1 on a small cluster with 3 slaves and ran multiple parallel Spark jobs in "fine-grained" mode to saturate allocatable memory. The jobs now run fine. With unpatched Mesos, this load always caused a deadlock in all Spark jobs within one minute.


Thanks,

Martin Weindel


Re: Review Request 25035: Fix for MESOS-1688

Posted by Mesos ReviewBot <de...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review52547
-----------------------------------------------------------


Bad patch!

Reviews applied: [25035]

Failed command: ./support/mesos-style.py

Error:
 Checking 504 files using filter --filter=-,+build/class,+build/deprecated,+build/endif_comment,+readability/todo,+readability/namespace,+runtime/vlog,+whitespace/blank_line,+whitespace/comma,+whitespace/ending_newline,+whitespace/forcolon,+whitespace/indent,+whitespace/line_length,+whitespace/tab,+whitespace/todo
src/common/resources.cpp:220:  Tab found; better to use spaces  [whitespace/tab] [1]
Total errors found: 1

- Mesos ReviewBot




Re: Review Request 25035: Fix for MESOS-1688

Posted by Martin Weindel <ma...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/
-----------------------------------------------------------

(Updated Sept. 6, 2014, 6:37 p.m.)


Review request for mesos and Vinod Kone.


Changes
-------

- allow pure cpus or mem offers
- added tests in allocator_tests
- added log warning in ResourceUsageChecker
- fixed Resources::operator <= and ::find to deal correctly with zero resources
- added constant MIN_CPUS_EXECUTOR


Bugs: MESOS-1688
    https://issues.apache.org/jira/browse/MESOS-1688


Repository: mesos-git


Description
-------

As already explained in JIRA MESOS-1688, some schedulers allocate memory only for the executor and not for tasks. In this case, only CPU resources are allocated for tasks.
Such a scheduler is not offered any idle CPUs if the slave has used up nearly all of its memory.
This can easily lead to a deadlock (in the application, not in Mesos).

Simple example:
1. Scheduler allocates all memory of a slave for an executor.
2. Scheduler launches a task for this executor (allocating 1 CPU).
3. Task finishes: 1 CPU, 0 MB memory allocatable.
4. No offers are made, as no memory is left. The scheduler will wait for offers forever: a deadlock in the application.

To fix this problem, offers must be made whenever CPU resources are allocatable, regardless of allocatable memory.
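
The difference between the old and new behavior can be illustrated with a minimal sketch. This is not the patch itself: `FreeResources`, `allocatableBoth`, `allocatableEither` and `exampleUsage` are hypothetical names, and the thresholds are illustrative assumptions. It only contrasts "both CPU and memory required" with "either one suffices", as described above.

```cpp
#include <cassert>

// Illustrative thresholds (assumptions, not the values from the patch).
constexpr double kMinCpus = 0.01;
constexpr double kMinMemMb = 32.0;

// Hypothetical view of the unallocated resources on one slave.
struct FreeResources {
  double cpus;
  double memMb;
};

// Behavior described as the problem: an offer needs CPU *and* memory.
bool allocatableBoth(const FreeResources& f) {
  return f.cpus >= kMinCpus && f.memMb >= kMinMemMb;
}

// Behavior described as the fix: CPU-only or memory-only is enough.
bool allocatableEither(const FreeResources& f) {
  return f.cpus >= kMinCpus || f.memMb >= kMinMemMb;
}

// Usage example: step 3 above leaves 1 CPU and 0 MB free.
inline void exampleUsage() {
  const FreeResources afterTask{1.0, 0.0};
  assert(!allocatableBoth(afterTask));   // no offer, scheduler waits forever
  assert(allocatableEither(afterTask));  // CPU-only offer is made
}
```

Under the "either" rule the scheduler in the example receives the idle CPU again and can launch its next task.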


Diffs (updated)
-----

  src/common/resources.cpp edf36b1 
  src/master/constants.hpp ce7995b 
  src/master/constants.cpp faa1503 
  src/master/hierarchical_allocator_process.hpp 34f8cd6 
  src/master/master.cpp 18464ba 
  src/tests/allocator_tests.cpp 774528a 

Diff: https://reviews.apache.org/r/25035/diff/


Testing
-------

Deployed patched Mesos 0.19.1 on a small cluster with 3 slaves and ran multiple parallel Spark jobs in "fine-grained" mode to saturate allocatable memory. The jobs now run fine. With unpatched Mesos, this load always caused a deadlock in all Spark jobs within one minute.


Thanks,

Martin Weindel


Re: Review Request 25035: Fix for MESOS-1688

Posted by Martin Weindel <ma...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/
-----------------------------------------------------------

(Updated Sept. 2, 2014, 5:52 p.m.)


Review request for mesos and Vinod Kone.


Changes
-------

I'll shepherd this -- @vinodkone.


Bugs: MESOS-1688
    https://issues.apache.org/jira/browse/MESOS-1688


Repository: mesos-git


Description
-------

As already explained in JIRA MESOS-1688, some schedulers allocate memory only for the executor and not for tasks. In this case, only CPU resources are allocated for tasks.
Such a scheduler is not offered any idle CPUs if the slave has used up nearly all of its memory.
This can easily lead to a deadlock (in the application, not in Mesos).

Simple example:
1. Scheduler allocates all memory of a slave for an executor.
2. Scheduler launches a task for this executor (allocating 1 CPU).
3. Task finishes: 1 CPU, 0 MB memory allocatable.
4. No offers are made, as no memory is left. The scheduler will wait for offers forever: a deadlock in the application.

To fix this problem, offers must be made whenever CPU resources are allocatable, regardless of allocatable memory.


Diffs
-----

  src/master/hierarchical_allocator_process.hpp 34f8cd658920b36b1062bd3b7f6bfbd1bcb6bb52 

Diff: https://reviews.apache.org/r/25035/diff/


Testing
-------

Deployed patched Mesos 0.19.1 on a small cluster with 3 slaves and ran multiple parallel Spark jobs in "fine-grained" mode to saturate allocatable memory. The jobs now run fine. With unpatched Mesos, this load always caused a deadlock in all Spark jobs within one minute.


Thanks,

Martin Weindel


Re: Review Request 25035: Fix for MESOS-1688

Posted by Mesos ReviewBot <de...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/#review52003
-----------------------------------------------------------


Patch looks great!

Reviews applied: [25035]

All tests passed.

- Mesos ReviewBot




Re: Review Request 25035: Fix for MESOS-1688

Posted by Martin Weindel <ma...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/
-----------------------------------------------------------

(Updated Aug. 30, 2014, 6:34 p.m.)


Review request for mesos.


Changes
-------

uploaded same diff once again


Bugs: MESOS-1688
    https://issues.apache.org/jira/browse/MESOS-1688


Repository: mesos-git


Description
-------

As already explained in JIRA MESOS-1688, some schedulers allocate memory only for the executor and not for tasks. In this case, only CPU resources are allocated for tasks.
Such a scheduler is not offered any idle CPUs if the slave has used up nearly all of its memory.
This can easily lead to a deadlock (in the application, not in Mesos).

Simple example:
1. Scheduler allocates all memory of a slave for an executor.
2. Scheduler launches a task for this executor (allocating 1 CPU).
3. Task finishes: 1 CPU, 0 MB memory allocatable.
4. No offers are made, as no memory is left. The scheduler will wait for offers forever: a deadlock in the application.

To fix this problem, offers must be made whenever CPU resources are allocatable, regardless of allocatable memory.


Diffs (updated)
-----

  src/master/hierarchical_allocator_process.hpp 34f8cd658920b36b1062bd3b7f6bfbd1bcb6bb52 

Diff: https://reviews.apache.org/r/25035/diff/


Testing
-------

Deployed patched Mesos 0.19.1 on a small cluster with 3 slaves and ran multiple parallel Spark jobs in "fine-grained" mode to saturate allocatable memory. The jobs now run fine. With unpatched Mesos, this load always caused a deadlock in all Spark jobs within one minute.


Thanks,

Martin Weindel


Re: Review Request 25035: Fix for MESOS-1688

Posted by Martin Weindel <ma...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25035/
-----------------------------------------------------------

(Updated Aug. 26, 2014, 7:53 a.m.)


Review request for mesos.


Changes
-------

added manual testing


Bugs: MESOS-1688
    https://issues.apache.org/jira/browse/MESOS-1688


Repository: mesos-git


Description
-------

As already explained in JIRA MESOS-1688, some schedulers allocate memory only for the executor and not for tasks. In this case, only CPU resources are allocated for tasks.
Such a scheduler is not offered any idle CPUs if the slave has used up nearly all of its memory.
This can easily lead to a deadlock (in the application, not in Mesos).

Simple example:
1. Scheduler allocates all memory of a slave for an executor.
2. Scheduler launches a task for this executor (allocating 1 CPU).
3. Task finishes: 1 CPU, 0 MB memory allocatable.
4. No offers are made, as no memory is left. The scheduler will wait for offers forever: a deadlock in the application.

To fix this problem, offers must be made whenever CPU resources are allocatable, regardless of allocatable memory.


Diffs
-----

  src/master/hierarchical_allocator_process.hpp 34f8cd658920b36b1062bd3b7f6bfbd1bcb6bb52 

Diff: https://reviews.apache.org/r/25035/diff/


Testing (updated)
-------

Deployed patched Mesos 0.19.1 on a small cluster with 3 slaves and ran multiple parallel Spark jobs in "fine-grained" mode to saturate allocatable memory. The jobs now run fine. With unpatched Mesos, this load always caused a deadlock in all Spark jobs within one minute.


Thanks,

Martin Weindel