Posted to user@mesos.apache.org by Claudiu Barbura <Cl...@atigeo.com> on 2014/06/02 19:01:20 UTC

Re: Framework Starvation

Hi Vinod,

I attached the master log snapshots during starvation and after starvation.

There are 4 slave nodes and 1 master, all of the same EC2 instance type (cc2.8xlarge, 32 cores, 60GB RAM).
I am running 4 shark-cli instances from the same master node, and running queries on all 4 of them ... then "starvation" kicks in (see attached log_during_starvation file).
After I terminate 2 of the shark-cli instances, the starved ones are receiving offers and are able to run queries again (see attached log_after_starvation file).

Let me know if you need the slave logs.

Thank you!
Claudiu

From: Vinod Kone <vi...@gmail.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Friday, May 30, 2014 at 10:13 AM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hey Claudiu,

Mind posting some master logs with the simple setup that you described (3 shark cli instances)? That would help us better diagnose the problem.


On Fri, May 30, 2014 at 1:59 AM, Claudiu Barbura <Cl...@atigeo.com>> wrote:
This is a critical issue for us: we have to shut down frameworks for various components in our platform to work, and this has created more contention than before we deployed Mesos, when everyone had to wait in line for their MR/Hive jobs to run.

Any guidance or ideas would be extremely helpful at this point.

Thank you,
Claudiu

From: Claudiu Barbura <cl...@atigeo.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Tuesday, May 27, 2014 at 11:57 PM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Framework Starvation

Hi,

Following Ben's suggestion at the Seattle Spark Meetup in April, I built and deployed the 0-18.1-rc1 branch hoping that this would solve the framework starvation problem we have been seeing for the past 2 months now. The hope was that https://issues.apache.org/jira/browse/MESOS-1086 would also help us. Unfortunately it did not.
This bug is preventing us from running multiple Spark and Shark servers (http, thrift), in load-balanced fashion, alongside Hadoop and Aurora in the same Mesos cluster.

For example, if we start at least 3 frameworks, one Hadoop, one SparkJobServer (one Spark context in fine-grained mode) and one Http SharkServer (one JavaSharkContext that inherits from Spark Contexts, again in fine-grained mode) and we run queries on all three of them, very soon we notice the following behavior:


  *   only the last two frameworks that we run queries against receive resource offers (master.cpp log entries in the log/mesos-master.INFO)
  *   the other frameworks are ignored and not allocated any resources until we kill one of the two privileged ones above
  *   As soon as one of the privileged frameworks is terminated, one of the starved frameworks takes its place
  *   Any new Spark context created in coarse-grained mode (fixed number of cores) will generally receive offers immediately (it rarely gets starved)
  *   Hadoop behaves slightly differently when starved: task trackers are started but never released, which means that if the first job (Hive query) is small in terms of number of input splits, only one task tracker with a small number of allocated cores is created, and then all subsequent queries, regardless of size, are only run in a very limited mode with this one "small" task tracker. Most of the time only the map phase of a big query is completed while the reduce phase hangs. Killing one of the registered Spark contexts above releases resources for Mesos to complete the query and gracefully shut down the task trackers (as noticed in the master log)

We are using the default settings in terms of isolation, weights etc ... the only standout configuration would be the memory allocation for the slave (export MESOS_resources=mem:35840 in mesos-slave-env.sh), but I am not sure if this is ever enforced, as each framework has its own executor process (JVM in our case) with its own memory allocation (we are not using cgroups yet)
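
For reference, with the default weights the built-in allocator ranks frameworks by their DRF dominant share: the largest fraction a framework holds of any single resource type, with the lowest share offered to first. Below is a minimal, self-contained sketch of that calculation (illustrative only, not the actual Mesos allocator code; the numbers are hypothetical and only loosely based on the cluster above):

    // Sketch of DRF dominant-share ranking (illustration, not Mesos code).
    #include <algorithm>
    #include <iostream>

    struct Usage {
      double cpus;
      double mem;  // MB
    };

    // Dominant share = the largest fraction held of any single resource.
    double dominantShare(const Usage& allocated, const Usage& total) {
      return std::max(allocated.cpus / total.cpus, allocated.mem / total.mem);
    }

    int main() {
      Usage total{4 * 32.0, 4 * 35840.0};  // hypothetical: 4 slaves as above
      Usage busyFramework{16.0, 8192.0};   // running tasks
      Usage idleFramework{0.0, 0.0};       // holds nothing yet

      std::cout << dominantShare(busyFramework, total) << "\n";  // 0.125 (cpus dominate)
      std::cout << dominantShare(idleFramework, total) << "\n";  // 0, so it is offered to first
      return 0;
    }

Frameworks with equal shares are further ordered by how many allocations they have received, which is where the behavior discussed later in this thread comes in.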

A very easy way to reproduce this bug is to start a minimum of 3 shark-cli instances in a mesos cluster and notice that only two of them are being offered resources and are running queries successfully.
I spent quite a bit of time in the mesos, spark and hadoop-mesos code in an attempt to find a possible workaround, but no luck so far.

Any guidance would be very much appreciated.

Thank you,
Claudiu




Re: Framework Starvation

Posted by Vinod Kone <vi...@gmail.com>.
In case you didn't receive my email from the @twitter domain.


> On Thu, Jun 12, 2014 at 8:20 AM, Claudiu Barbura <
> Claudiu.Barbura@atigeo.com> wrote:
>
>> We had to change the drf_sorter.cpp/hpp and
>> hierarchical_allocator_process.cpp files.
>>
>
> Hey Claudiu. Can you share the patch?
>
>
> @vinodkone
>

Re: Framework Starvation

Posted by Vinod Kone <vi...@twitter.com>.
On Thu, Jun 12, 2014 at 8:20 AM, Claudiu Barbura <Claudiu.Barbura@atigeo.com
> wrote:

> We had to change the drf_sorter.cpp/hpp and
> hierarchical_allocator_process.cpp files.
>

Hey Claudiu. Can you share the patch?


@vinodkone

Re: Framework Starvation

Posted by Vinod Kone <vi...@gmail.com>.
On Thu, Jun 19, 2014 at 10:46 AM, Vinod Kone <vi...@twitter.com> wrote:

> Waiting to see your blog post :)
>
> That said, what baffles me is that in the very beginning when only two
> frameworks are present and no tasks have been launched, one framework is
> getting more allocations than other (see the log lines I posted in the
> earlier email), which is unexpected.
>
>
> @vinodkone
>
>
> On Tue, Jun 17, 2014 at 9:41 PM, Claudiu Barbura <
> Claudiu.Barbura@atigeo.com> wrote:
>
>>  Hi Vinod,
>>
>>  Yo are looking at logs I had posted before we implemented our fix
>> (files attached in my last email).
>> I will write a detailed blog post on the issue … after the Spark Summit
>> at the end of this month.
>>
>>  What wold happen before is that frameworks with the same share (0)
>> would also have the smallest allocation in the beginning, and after sorting
>> the list they would be at the top, always offered all the resources before
>> other frameworks that had been already offered, running tasks with a share
>> and allocation > 0.
>>
>>  Thanks,
>> Claudiu
>>
>>   From: Vinod Kone <vi...@twitter.com>
>> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
>> Date: Wednesday, June 18, 2014 at 4:54 AM
>>
>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>> Subject: Re: Framework Starvation
>>
>>   Hey Claudiu,
>>
>>  I spent some time trying to understand the logs you posted. Whats
>> strange to me is that in the very beginning when framework's 1 and 2 are
>> registered, only one framework gets offers for a period of 9s. It's not
>> clear why this happens. I even wrote a test (
>> https://reviews.apache.org/r/22714/) to repro but wasn't able to.
>>
>>  It would probably be helpful to add more logging to the drf sorting
>> comparator function to understand why frameworks are sorted in such a way
>> when their share is same (0). My expectation is that after each allocation,
>> the 'allocations' for a framework should increase causing the sort function
>> to behave correctly. But that doesn't seem to be happening in your case.
>>
>>
>>  I0604 22:12:43.715530 22270 master.cpp:2282] Sending 4 offers to
>> framework 20140604-221214-302055434-5050-22260-0000
>>
>> I0604 22:12:44.276062 22273 master.cpp:2282] Sending 4 offers to
>> framework 20140604-221214-302055434-5050-22260-0001
>>
>> I0604 22:12:44.756918 22292 master.cpp:2282] Sending 4 offers to
>> framework 20140604-221214-302055434-5050-22260-0000
>>
>> I0604 22:12:45.794178 22276 master.cpp:2282] Sending 4 offers to
>> framework 20140604-221214-302055434-5050-22260-0001
>>
>> I0604 22:12:46.841629 22291 master.cpp:2282] Sending 4 offers to
>> framework 20140604-221214-302055434-5050-22260-0001
>>
>> I0604 22:12:47.884266 22262 master.cpp:2282] Sending 4 offers to
>> framework 20140604-221214-302055434-5050-22260-0001
>>
>> I0604 22:12:48.926856 22268 master.cpp:2282] Sending 4 offers to
>> framework 20140604-221214-302055434-5050-22260-0001
>>
>> I0604 22:12:49.966560 22280 master.cpp:2282] Sending 4 offers to
>> framework 20140604-221214-302055434-5050-22260-0001
>>
>> I0604 22:12:51.007143 22267 master.cpp:2282] Sending 4 offers to
>> framework 20140604-221214-302055434-5050-22260-0001
>>
>> I0604 22:12:52.047987 22280 master.cpp:2282] Sending 4 offers to
>> framework 20140604-221214-302055434-5050-22260-0001
>>
>> I0604 22:12:53.089340 22291 master.cpp:2282] Sending 4 offers to
>> framework 20140604-221214-302055434-5050-22260-0001
>>
>> I0604 22:12:54.130242 22263 master.cpp:2282] Sending 4 offers to
>> framework 20140604-221214-302055434-5050-22260-0000
>>
>>
>>  @vinodkone
>>
>>
>> On Fri, Jun 13, 2014 at 3:40 PM, Claudiu Barbura <
>> Claudiu.Barbura@atigeo.com> wrote:
>>
>>>  Hi Vinod,
>>>
>>>  Attached are the patch files.Hadoop has to be treated differently as
>>> it requires resources in order to shut down task trackers after a job is
>>> complete. Therefore we set the role name so that Mesos allocates resources
>>> for it first, ahead of the rest of the frameworks under the default role
>>> (*).
>>> This is not ideal, we are going to loo into the Hadoop Mesos framework
>>> code and fix if possible. Luckily, Hadoop is the only framework we use on
>>> top of Mesos that allows a configurable role name to be passed in when
>>> registering a framework (unlike, Spark, Aurora, Storm etc)
>>> For the non-Hadoop frameworks, we are making sure that once a framework
>>> is running its jobs, Mesos no longer offers resources to it. In the same
>>> time, once a framework completes its job, we make sure its “client
>>> allocations” value is updated so that when it completes the execution of
>>> its jobs, it is placed back in the sorting list with a real chance of being
>>> offered again immediately (not starved!).
>>> What is also key is that mem type resources are ignored during share
>>> computation as only cpus are a good indicator of which frameworks are
>>> actually running jobs in the cluster.
>>>
>>>  Thanks,
>>> Claudiu
>>>
>>>   From: Claudiu Barbura <cl...@atigeo.com>
>>> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>>  Date: Thursday, June 12, 2014 at 6:20 PM
>>>
>>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>> Subject: Re: Framework Starvation
>>>
>>>   Hi Vinod,
>>>
>>>  We have a fix (more like a hack) that works for us, but it requires us
>>> to run each Hadoop framework with a different role as we need to treat
>>> Hadoop differently than the rest of the frameworks (Spark, Shark, Aurora)
>>> which are running with the default role (*).
>>> We had to change the drf_sorter.cpp/hpp and
>>> hierarchical_allocator_process.cpp files.
>>>
>>>  Let me know if you need more info on this.
>>>
>>>  Thanks,
>>> Claudiu
>>>
>>>   From: Claudiu Barbura <cl...@atigeo.com>
>>> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>> Date: Thursday, June 5, 2014 at 2:41 AM
>>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>> Subject: Re: Framework Starvation
>>>
>>>   Hi Vinod,
>>>
>>>  I attached the master log after adding more logging to the sorter code.
>>> I believe the problem lies somewhere else however …
>>> in HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::allocate()
>>>
>>>  I will continue to investigate in the meantime.
>>>
>>>  Thanks,
>>> Claudiu
>>>
>>>   From: Vinod Kone <vi...@gmail.com>
>>> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>> Date: Tuesday, June 3, 2014 at 5:16 PM
>>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>> Subject: Re: Framework Starvation
>>>
>>>   Either should be fine. I don't think there are any changes in
>>> allocator since 0.18.0-rc1.
>>>
>>>
>>> On Tue, Jun 3, 2014 at 4:08 PM, Claudiu Barbura <
>>> Claudiu.Barbura@atigeo.com> wrote:
>>>
>>>>  Hi Vinod,
>>>>
>>>>  Should we use the same 0-18.1-rc1 branch or trunk code?
>>>>
>>>>  Thanks,
>>>> Claudiu
>>>>
>>>>   From: Vinod Kone <vi...@gmail.com>
>>>> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>>>  Date: Tuesday, June 3, 2014 at 3:55 PM
>>>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>>> Subject: Re: Framework Starvation
>>>>
>>>>   Hey Claudiu,
>>>>
>>>>  Is it possible for you to run the same test but logging more
>>>> information about the framework shares? For example, it would be really
>>>> insightful if you can log each framework's share in DRFSorter::sort()
>>>> (see: master/drf_sorter.hpp). This will help us diagnose the problem. I
>>>> suspect one of our open tickets around allocation (MESOS-1119
>>>> <https://issues.apache.org/jira/browse/MESOS-1119>, MESOS-1130
>>>> <https://issues.apache.org/jira/browse/MESOS-1130> and MESOS-1187
>>>> <https://issues.apache.org/jira/browse/MESOS-1187>) is the issue. But
>>>> it would be good to have that logging data regardless to confirm.
>>>>
>>>>
>>>> On Mon, Jun 2, 2014 at 10:46 AM, Claudiu Barbura <
>>>> Claudiu.Barbura@atigeo.com> wrote:
>>>>
>>>>>  Hi Vinod,
>>>>>
>>>>>  I tried to attach the logs (2MB) and the email (see below) did not
>>>>> go through. I emailed your gmail account separately.
>>>>>
>>>>>  Thanks,
>>>>> Claudiu
>>>>>
>>>>>   From: Claudiu Barbura <cl...@atigeo.com>
>>>>> Date: Monday, June 2, 2014 at 10:00 AM
>>>>>
>>>>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>>>> Subject: Re: Framework Starvation
>>>>>
>>>>>   Hi Vinod,
>>>>>
>>>>>  I attached the maser logs snapshots during starvation and after
>>>>> starvation.
>>>>>
>>>>>  There are 4 slave nodes and 1 master, all with of the same ec2
>>>>> instance type (cc2.8xlarge, 32 cores, 60GB RAM).
>>>>> I am running 4 shark-cli instances from the same master node, and
>>>>> running queries on all 4 of them … then “starvation” kicks in (see attached
>>>>> log_during_starvation file).
>>>>> After I terminate 2 of the shark-cli instances, the starved ones are
>>>>> receiving offers and are able to run queries again (see attached
>>>>> log_after_starvation file).
>>>>>
>>>>>  Let me know if you need the slave logs.
>>>>>
>>>>>  Thank you!
>>>>> Claudiu
>>>>>
>>>>>   From: Vinod Kone <vi...@gmail.com>
>>>>> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>>>> Date: Friday, May 30, 2014 at 10:13 AM
>>>>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>>>> Subject: Re: Framework Starvation
>>>>>
>>>>>   Hey Claudiu,
>>>>>
>>>>>  Mind posting some master logs with the simple setup that you
>>>>> described (3 shark cli instances)? That would help us better diagnose the
>>>>> problem.
>>>>>
>>>>>
>>>>> On Fri, May 30, 2014 at 1:59 AM, Claudiu Barbura <
>>>>> Claudiu.Barbura@atigeo.com> wrote:
>>>>>
>>>>>>  This is a critical issue for us as we have to shut down frameworks
>>>>>> for various components in our platform to work and this has created more
>>>>>> contention than before we deployed Mesos, when everyone had to wait in line
>>>>>> for their MR/Hive jobs to run.
>>>>>>
>>>>>>  Any guidance, ideas would be extremely helpful at this point.
>>>>>>
>>>>>>  Thank you,
>>>>>> Claudiu
>>>>>>
>>>>>>   From: Claudiu Barbura <cl...@atigeo.com>
>>>>>> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>>>>> Date: Tuesday, May 27, 2014 at 11:57 PM
>>>>>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>>>>> Subject: Framework Starvation
>>>>>>
>>>>>>   Hi,
>>>>>>
>>>>>>  Following Ben’s suggestion at the Seattle Spark Meetup in April, I
>>>>>> built and deployed  the 0-18.1-rc1 branch hoping that this wold solve the
>>>>>> framework starvation problem we have been seeing in the past 2 months now.
>>>>>> The hope was that https://issues.apache.org/jira/browse/MESOS-1086 would
>>>>>> also help us. Unfortunately it did not.
>>>>>> This bug is preventing us to run multiple spark and shark servers
>>>>>> (http, thrift), in load balanced fashion, Hadoop and Aurora in the same
>>>>>> mesos cluster.
>>>>>>
>>>>>>  For example, if we start at least 3 frameworks, one Hadoop, one
>>>>>> SparkJobServer (one Spark context in fine-grained mode) and one Http
>>>>>> SharkServer (one JavaSharkContext that inherits from Spark Contexts, again
>>>>>> in fine-grained mode) and we run queries on all three of them, very soon we
>>>>>> notice the following behavior:
>>>>>>
>>>>>>
>>>>>>    - only the last two frameworks that we run queries against
>>>>>>    receive resource offers (master.cpp log entries in the
>>>>>>    log/mesos-master.INFO)
>>>>>>    - the other frameworks are ignored and not allocated any
>>>>>>    resources until we kill one the two privileged ones above
>>>>>>    - As soon as one of the privileged framework is terminated, one
>>>>>>    of the starved framework takes its place
>>>>>>    - Any new Spark context created in coarse-grained mode (fixed
>>>>>>    number of cores) will generally receive offers immediately (rarely it gets
>>>>>>    starved)
>>>>>>    - Hadoop behaves slightly differently when starved: task trackers
>>>>>>    are started but never released, which means, if the first job (Hive query)
>>>>>>    is small in terms of number of input splits, only one task tracker with a
>>>>>>    small number of allocated ores is created, and then all subsequent queries,
>>>>>>    regardless of size are only run in very limited mode with this one “small”
>>>>>>    task tracker. Most of the time only the map phase of a big query is
>>>>>>    completed while the reduce phase is hanging. Killing one of the registered
>>>>>>    Spark context above releases resources for Mesos to complete the query and
>>>>>>    gracefully shut down the task trackers (as noticed in the master log)
>>>>>>
>>>>>> We are using the default settings in terms of isolation, weights etc
>>>>>> … the only stand out configuration would be the memory allocation for slave
>>>>>> (export MESOS_resources=mem:35840 in mesos-slave-env.sh) but I am not sure
>>>>>> if this is ever enforced, as each framework has its own executor process
>>>>>> (JVM in our case) with its own memory allocation (we are not using cgroups
>>>>>> yet)
>>>>>>
>>>>>>  A very easy to reproduce this bug is to start a minimum of 3
>>>>>> shark-cli instances in a mesos cluster and notice that only two of them are
>>>>>> being offered resources and are running queries successfully.
>>>>>> I spent quite a bit of time in mesos, spark and hadoop-mesos code in
>>>>>> an attempt to find a possible workaround  but no luck so far.
>>>>>>
>>>>>>  Any guidance would be very appreciated.
>>>>>>
>>>>>>  Thank you,
>>>>>> Claudiu
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Framework Starvation

Posted by Vinod Kone <vi...@twitter.com>.
Waiting to see your blog post :)

That said, what baffles me is that in the very beginning, when only two
frameworks are present and no tasks have been launched, one framework is
getting more allocations than the other (see the log lines I posted in the
earlier email), which is unexpected.


@vinodkone


On Tue, Jun 17, 2014 at 9:41 PM, Claudiu Barbura <Claudiu.Barbura@atigeo.com
> wrote:

>  Hi Vinod,
>
>  Yo are looking at logs I had posted before we implemented our fix (files
> attached in my last email).
> I will write a detailed blog post on the issue … after the Spark Summit at
> the end of this month.
>
>  What wold happen before is that frameworks with the same share (0) would
> also have the smallest allocation in the beginning, and after sorting the
> list they would be at the top, always offered all the resources before
> other frameworks that had been already offered, running tasks with a share
> and allocation > 0.
>
>  Thanks,
> Claudiu
>
>   From: Vinod Kone <vi...@twitter.com>
> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
> Date: Wednesday, June 18, 2014 at 4:54 AM
>
> To: "user@mesos.apache.org" <us...@mesos.apache.org>
> Subject: Re: Framework Starvation
>
>   Hey Claudiu,
>
>  I spent some time trying to understand the logs you posted. Whats
> strange to me is that in the very beginning when framework's 1 and 2 are
> registered, only one framework gets offers for a period of 9s. It's not
> clear why this happens. I even wrote a test (
> https://reviews.apache.org/r/22714/) to repro but wasn't able to.
>
>  It would probably be helpful to add more logging to the drf sorting
> comparator function to understand why frameworks are sorted in such a way
> when their share is same (0). My expectation is that after each allocation,
> the 'allocations' for a framework should increase causing the sort function
> to behave correctly. But that doesn't seem to be happening in your case.
>
>
>  I0604 22:12:43.715530 22270 master.cpp:2282] Sending 4 offers to
> framework 20140604-221214-302055434-5050-22260-0000
>
> I0604 22:12:44.276062 22273 master.cpp:2282] Sending 4 offers to framework
> 20140604-221214-302055434-5050-22260-0001
>
> I0604 22:12:44.756918 22292 master.cpp:2282] Sending 4 offers to framework
> 20140604-221214-302055434-5050-22260-0000
>
> I0604 22:12:45.794178 22276 master.cpp:2282] Sending 4 offers to framework
> 20140604-221214-302055434-5050-22260-0001
>
> I0604 22:12:46.841629 22291 master.cpp:2282] Sending 4 offers to framework
> 20140604-221214-302055434-5050-22260-0001
>
> I0604 22:12:47.884266 22262 master.cpp:2282] Sending 4 offers to framework
> 20140604-221214-302055434-5050-22260-0001
>
> I0604 22:12:48.926856 22268 master.cpp:2282] Sending 4 offers to framework
> 20140604-221214-302055434-5050-22260-0001
>
> I0604 22:12:49.966560 22280 master.cpp:2282] Sending 4 offers to framework
> 20140604-221214-302055434-5050-22260-0001
>
> I0604 22:12:51.007143 22267 master.cpp:2282] Sending 4 offers to framework
> 20140604-221214-302055434-5050-22260-0001
>
> I0604 22:12:52.047987 22280 master.cpp:2282] Sending 4 offers to framework
> 20140604-221214-302055434-5050-22260-0001
>
> I0604 22:12:53.089340 22291 master.cpp:2282] Sending 4 offers to framework
> 20140604-221214-302055434-5050-22260-0001
>
> I0604 22:12:54.130242 22263 master.cpp:2282] Sending 4 offers to
> framework 20140604-221214-302055434-5050-22260-0000
>
>
>  @vinodkone
>
>
> On Fri, Jun 13, 2014 at 3:40 PM, Claudiu Barbura <
> Claudiu.Barbura@atigeo.com> wrote:
>
>>  Hi Vinod,
>>
>>  Attached are the patch files.Hadoop has to be treated differently as it
>> requires resources in order to shut down task trackers after a job is
>> complete. Therefore we set the role name so that Mesos allocates resources
>> for it first, ahead of the rest of the frameworks under the default role
>> (*).
>> This is not ideal, we are going to loo into the Hadoop Mesos framework
>> code and fix if possible. Luckily, Hadoop is the only framework we use on
>> top of Mesos that allows a configurable role name to be passed in when
>> registering a framework (unlike, Spark, Aurora, Storm etc)
>> For the non-Hadoop frameworks, we are making sure that once a framework
>> is running its jobs, Mesos no longer offers resources to it. In the same
>> time, once a framework completes its job, we make sure its “client
>> allocations” value is updated so that when it completes the execution of
>> its jobs, it is placed back in the sorting list with a real chance of being
>> offered again immediately (not starved!).
>> What is also key is that mem type resources are ignored during share
>> computation as only cpus are a good indicator of which frameworks are
>> actually running jobs in the cluster.
>>
>>  Thanks,
>> Claudiu
>>
>>   From: Claudiu Barbura <cl...@atigeo.com>
>> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>  Date: Thursday, June 12, 2014 at 6:20 PM
>>
>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>> Subject: Re: Framework Starvation
>>
>>   Hi Vinod,
>>
>>  We have a fix (more like a hack) that works for us, but it requires us
>> to run each Hadoop framework with a different role as we need to treat
>> Hadoop differently than the rest of the frameworks (Spark, Shark, Aurora)
>> which are running with the default role (*).
>> We had to change the drf_sorter.cpp/hpp and
>> hierarchical_allocator_process.cpp files.
>>
>>  Let me know if you need more info on this.
>>
>>  Thanks,
>> Claudiu
>>
>>   From: Claudiu Barbura <cl...@atigeo.com>
>> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
>> Date: Thursday, June 5, 2014 at 2:41 AM
>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>> Subject: Re: Framework Starvation
>>
>>   Hi Vinod,
>>
>>  I attached the master log after adding more logging to the sorter code.
>> I believe the problem lies somewhere else however …
>> in HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::allocate()
>>
>>  I will continue to investigate in the meantime.
>>
>>  Thanks,
>> Claudiu
>>
>>   From: Vinod Kone <vi...@gmail.com>
>> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
>> Date: Tuesday, June 3, 2014 at 5:16 PM
>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>> Subject: Re: Framework Starvation
>>
>>   Either should be fine. I don't think there are any changes in
>> allocator since 0.18.0-rc1.
>>
>>
>> On Tue, Jun 3, 2014 at 4:08 PM, Claudiu Barbura <
>> Claudiu.Barbura@atigeo.com> wrote:
>>
>>>  Hi Vinod,
>>>
>>>  Should we use the same 0-18.1-rc1 branch or trunk code?
>>>
>>>  Thanks,
>>> Claudiu
>>>
>>>   From: Vinod Kone <vi...@gmail.com>
>>> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>>  Date: Tuesday, June 3, 2014 at 3:55 PM
>>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>> Subject: Re: Framework Starvation
>>>
>>>   Hey Claudiu,
>>>
>>>  Is it possible for you to run the same test but logging more
>>> information about the framework shares? For example, it would be really
>>> insightful if you can log each framework's share in DRFSorter::sort()
>>> (see: master/drf_sorter.hpp). This will help us diagnose the problem. I
>>> suspect one of our open tickets around allocation (MESOS-1119
>>> <https://issues.apache.org/jira/browse/MESOS-1119>, MESOS-1130
>>> <https://issues.apache.org/jira/browse/MESOS-1130> and MESOS-1187
>>> <https://issues.apache.org/jira/browse/MESOS-1187>) is the issue. But
>>> it would be good to have that logging data regardless to confirm.
>>>
>>>
>>> On Mon, Jun 2, 2014 at 10:46 AM, Claudiu Barbura <
>>> Claudiu.Barbura@atigeo.com> wrote:
>>>
>>>>  Hi Vinod,
>>>>
>>>>  I tried to attach the logs (2MB) and the email (see below) did not go
>>>> through. I emailed your gmail account separately.
>>>>
>>>>  Thanks,
>>>> Claudiu
>>>>
>>>>   From: Claudiu Barbura <cl...@atigeo.com>
>>>> Date: Monday, June 2, 2014 at 10:00 AM
>>>>
>>>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>>> Subject: Re: Framework Starvation
>>>>
>>>>   Hi Vinod,
>>>>
>>>>  I attached the maser logs snapshots during starvation and after
>>>> starvation.
>>>>
>>>>  There are 4 slave nodes and 1 master, all with of the same ec2
>>>> instance type (cc2.8xlarge, 32 cores, 60GB RAM).
>>>> I am running 4 shark-cli instances from the same master node, and
>>>> running queries on all 4 of them … then “starvation” kicks in (see attached
>>>> log_during_starvation file).
>>>> After I terminate 2 of the shark-cli instances, the starved ones are
>>>> receiving offers and are able to run queries again (see attached
>>>> log_after_starvation file).
>>>>
>>>>  Let me know if you need the slave logs.
>>>>
>>>>  Thank you!
>>>> Claudiu
>>>>
>>>>   From: Vinod Kone <vi...@gmail.com>
>>>> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>>> Date: Friday, May 30, 2014 at 10:13 AM
>>>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>>> Subject: Re: Framework Starvation
>>>>
>>>>   Hey Claudiu,
>>>>
>>>>  Mind posting some master logs with the simple setup that you
>>>> described (3 shark cli instances)? That would help us better diagnose the
>>>> problem.
>>>>
>>>>
>>>> On Fri, May 30, 2014 at 1:59 AM, Claudiu Barbura <
>>>> Claudiu.Barbura@atigeo.com> wrote:
>>>>
>>>>>  This is a critical issue for us as we have to shut down frameworks
>>>>> for various components in our platform to work and this has created more
>>>>> contention than before we deployed Mesos, when everyone had to wait in line
>>>>> for their MR/Hive jobs to run.
>>>>>
>>>>>  Any guidance, ideas would be extremely helpful at this point.
>>>>>
>>>>>  Thank you,
>>>>> Claudiu
>>>>>
>>>>>   From: Claudiu Barbura <cl...@atigeo.com>
>>>>> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>>>> Date: Tuesday, May 27, 2014 at 11:57 PM
>>>>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>>>> Subject: Framework Starvation
>>>>>
>>>>>   Hi,
>>>>>
>>>>>  Following Ben’s suggestion at the Seattle Spark Meetup in April, I
>>>>> built and deployed  the 0-18.1-rc1 branch hoping that this wold solve the
>>>>> framework starvation problem we have been seeing in the past 2 months now.
>>>>> The hope was that https://issues.apache.org/jira/browse/MESOS-1086 would
>>>>> also help us. Unfortunately it did not.
>>>>> This bug is preventing us to run multiple spark and shark servers
>>>>> (http, thrift), in load balanced fashion, Hadoop and Aurora in the same
>>>>> mesos cluster.
>>>>>
>>>>>  For example, if we start at least 3 frameworks, one Hadoop, one
>>>>> SparkJobServer (one Spark context in fine-grained mode) and one Http
>>>>> SharkServer (one JavaSharkContext that inherits from Spark Contexts, again
>>>>> in fine-grained mode) and we run queries on all three of them, very soon we
>>>>> notice the following behavior:
>>>>>
>>>>>
>>>>>    - only the last two frameworks that we run queries against receive
>>>>>    resource offers (master.cpp log entries in the log/mesos-master.INFO)
>>>>>    - the other frameworks are ignored and not allocated any resources
>>>>>    until we kill one the two privileged ones above
>>>>>    - As soon as one of the privileged framework is terminated, one of
>>>>>    the starved framework takes its place
>>>>>    - Any new Spark context created in coarse-grained mode (fixed
>>>>>    number of cores) will generally receive offers immediately (rarely it gets
>>>>>    starved)
>>>>>    - Hadoop behaves slightly differently when starved: task trackers
>>>>>    are started but never released, which means, if the first job (Hive query)
>>>>>    is small in terms of number of input splits, only one task tracker with a
>>>>>    small number of allocated ores is created, and then all subsequent queries,
>>>>>    regardless of size are only run in very limited mode with this one “small”
>>>>>    task tracker. Most of the time only the map phase of a big query is
>>>>>    completed while the reduce phase is hanging. Killing one of the registered
>>>>>    Spark context above releases resources for Mesos to complete the query and
>>>>>    gracefully shut down the task trackers (as noticed in the master log)
>>>>>
>>>>> We are using the default settings in terms of isolation, weights etc …
>>>>> the only stand out configuration would be the memory allocation for slave
>>>>> (export MESOS_resources=mem:35840 in mesos-slave-env.sh) but I am not sure
>>>>> if this is ever enforced, as each framework has its own executor process
>>>>> (JVM in our case) with its own memory allocation (we are not using cgroups
>>>>> yet)
>>>>>
>>>>>  A very easy to reproduce this bug is to start a minimum of 3
>>>>> shark-cli instances in a mesos cluster and notice that only two of them are
>>>>> being offered resources and are running queries successfully.
>>>>> I spent quite a bit of time in mesos, spark and hadoop-mesos code in
>>>>> an attempt to find a possible workaround  but no luck so far.
>>>>>
>>>>>  Any guidance would be very appreciated.
>>>>>
>>>>>  Thank you,
>>>>> Claudiu
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Framework Starvation

Posted by Claudiu Barbura <Cl...@atigeo.com>.
Hi Vinod,

You are looking at logs I had posted before we implemented our fix (files attached in my last email).
I will write a detailed blog post on the issue … after the Spark Summit at the end of this month.

What would happen before is that frameworks with the same share (0) would also have the smallest allocation in the beginning, and after sorting the list they would be at the top, always offered all the resources before other frameworks that had already been offered resources and were running tasks with a share and allocation > 0.
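
For illustration, a minimal sketch of the sort order being described (hypothetical and simplified, not the patched Mesos sorter): when shares are equal, the framework with fewer recorded allocations sorts first, so frameworks that keep returning to share 0 stay at the front of the list:

    // Hypothetical DRF-style comparator, simplified for illustration only.
    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    struct Client {
      std::string name;
      double share;          // dominant share
      unsigned allocations;  // allocations received so far
    };

    bool drfLess(const Client& a, const Client& b) {
      if (a.share != b.share) {
        return a.share < b.share;            // lower share is offered to first
      }
      return a.allocations < b.allocations;  // tiebreak on allocation count
    }

    int main() {
      std::vector<Client> clients = {
          {"framework-0002", 0.25, 40},  // busy, running tasks
          {"framework-0000", 0.0, 2},
          {"framework-0001", 0.0, 1},
      };

      std::sort(clients.begin(), clients.end(), drfLess);

      // The two share-0 frameworks sort ahead of the framework that is
      // actually running tasks, and keep doing so as long as their share
      // stays 0.
      for (const Client& c : clients) {
        std::cout << c.name << " share=" << c.share
                  << " allocations=" << c.allocations << "\n";
      }
      return 0;
    }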

Thanks,
Claudiu

From: Vinod Kone <vi...@twitter.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Wednesday, June 18, 2014 at 4:54 AM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hey Claudiu,

I spent some time trying to understand the logs you posted. What's strange to me is that in the very beginning, when frameworks 1 and 2 are registered, only one framework gets offers for a period of 9s. It's not clear why this happens. I even wrote a test (https://reviews.apache.org/r/22714/) to repro but wasn't able to.

It would probably be helpful to add more logging to the drf sorting comparator function to understand why frameworks are sorted in such a way when their share is the same (0). My expectation is that after each allocation, the 'allocations' for a framework should increase, causing the sort function to behave correctly. But that doesn't seem to be happening in your case.



I0604 22:12:43.715530 22270 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0000

I0604 22:12:44.276062 22273 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001

I0604 22:12:44.756918 22292 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0000

I0604 22:12:45.794178 22276 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001

I0604 22:12:46.841629 22291 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001

I0604 22:12:47.884266 22262 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001

I0604 22:12:48.926856 22268 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001

I0604 22:12:49.966560 22280 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001

I0604 22:12:51.007143 22267 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001

I0604 22:12:52.047987 22280 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001

I0604 22:12:53.089340 22291 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001

I0604 22:12:54.130242 22263 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0000


@vinodkone


On Fri, Jun 13, 2014 at 3:40 PM, Claudiu Barbura <Cl...@atigeo.com>> wrote:
Hi Vinod,

Attached are the patch files. Hadoop has to be treated differently as it requires resources in order to shut down task trackers after a job is complete. Therefore we set the role name so that Mesos allocates resources for it first, ahead of the rest of the frameworks under the default role (*).
This is not ideal; we are going to look into the Hadoop Mesos framework code and fix it if possible. Luckily, Hadoop is the only framework we use on top of Mesos that allows a configurable role name to be passed in when registering a framework (unlike Spark, Aurora, Storm, etc.)
For the non-Hadoop frameworks, we are making sure that once a framework is running its jobs, Mesos no longer offers resources to it. At the same time, once a framework completes its jobs, we make sure its “client allocations” value is updated so that it is placed back in the sorting list with a real chance of being offered resources again immediately (not starved!).
What is also key is that mem-type resources are ignored during share computation, as only cpus are a good indicator of which frameworks are actually running jobs in the cluster.

Thanks,
Claudiu

From: Claudiu Barbura <cl...@atigeo.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Thursday, June 12, 2014 at 6:20 PM

To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hi Vinod,

We have a fix (more like a hack) that works for us, but it requires us to run each Hadoop framework with a different role as we need to treat Hadoop differently than the rest of the frameworks (Spark, Shark, Aurora) which are running with the default role (*).
We had to change the drf_sorter.cpp/hpp and hierarchical_allocator_process.cpp files.

Let me know if you need more info on this.

Thanks,
Claudiu

From: Claudiu Barbura <cl...@atigeo.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Thursday, June 5, 2014 at 2:41 AM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hi Vinod,

I attached the master log after adding more logging to the sorter code.
I believe the problem lies somewhere else however … in HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::allocate()
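
For readers unfamiliar with that code path, the hierarchical allocator roughly performs a two-level sort on every allocation cycle: roles are ordered by their share, frameworks within each role are ordered by theirs, and the remaining resources are then offered down the sorted list. A highly simplified sketch of that loop (hypothetical, not the actual allocate() implementation):

    // Highly simplified, hypothetical sketch of a two-level (role, framework)
    // allocation pass. Not the actual HierarchicalAllocatorProcess code.
    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    struct Sorted {
      std::vector<std::string> order;  // names ordered by ascending share
    };

    int main() {
      Sorted roles{{"hadoop", "*"}};                // roles sorted by share
      std::map<std::string, Sorted> frameworks = {  // frameworks per role
          {"hadoop", {{"hadoop-framework"}}},
          {"*", {{"framework-0003", "framework-0000", "framework-0001"}}},
      };
      double availableCpus = 128.0;                 // unoffered resources

      // Walk roles in sorted order, then frameworks in sorted order, and hand
      // each one whatever is still available. Whoever sorts first is offered
      // first; a framework that never reaches the front is effectively starved.
      for (const std::string& role : roles.order) {
        for (const std::string& fw : frameworks[role].order) {
          if (availableCpus <= 0.0) break;
          double offered = std::min(availableCpus, 32.0);  // e.g. one slave's cpus
          availableCpus -= offered;
          std::cout << "offer " << offered << " cpus to " << fw
                    << " (role " << role << ")\n";
        }
      }
      return 0;
    }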

I will continue to investigate in the meantime.

Thanks,
Claudiu

From: Vinod Kone <vi...@gmail.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Tuesday, June 3, 2014 at 5:16 PM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Either should be fine. I don't think there are any changes in allocator since 0.18.0-rc1.


On Tue, Jun 3, 2014 at 4:08 PM, Claudiu Barbura <Cl...@atigeo.com>> wrote:
Hi Vinod,

Should we use the same 0-18.1-rc1 branch or trunk code?

Thanks,
Claudiu

From: Vinod Kone <vi...@gmail.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Tuesday, June 3, 2014 at 3:55 PM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hey Claudiu,

Is it possible for you to run the same test but logging more information about the framework shares? For example, it would be really insightful if you can log each framework's share in DRFSorter::sort() (see: master/drf_sorter.hpp). This will help us diagnose the problem. I suspect one of our open tickets around allocation (MESOS-1119<https://issues.apache.org/jira/browse/MESOS-1119>, MESOS-1130<https://issues.apache.org/jira/browse/MESOS-1130> and MESOS-1187<https://issues.apache.org/jira/browse/MESOS-1187>) is the issue. But it would be good to have that logging data regardless to confirm.


On Mon, Jun 2, 2014 at 10:46 AM, Claudiu Barbura <Cl...@atigeo.com>> wrote:
Hi Vinod,

I tried to attach the logs (2MB) and the email (see below) did not go through. I emailed your gmail account separately.

Thanks,
Claudiu

From: Claudiu Barbura <cl...@atigeo.com>>
Date: Monday, June 2, 2014 at 10:00 AM

To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hi Vinod,

I attached the master log snapshots during starvation and after starvation.

There are 4 slave nodes and 1 master, all of the same EC2 instance type (cc2.8xlarge, 32 cores, 60GB RAM).
I am running 4 shark-cli instances from the same master node, and running queries on all 4 of them … then “starvation” kicks in (see attached log_during_starvation file).
After I terminate 2 of the shark-cli instances, the starved ones are receiving offers and are able to run queries again (see attached log_after_starvation file).

Let me know if you need the slave logs.

Thank you!
Claudiu

From: Vinod Kone <vi...@gmail.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Friday, May 30, 2014 at 10:13 AM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hey Claudiu,

Mind posting some master logs with the simple setup that you described (3 shark cli instances)? That would help us better diagnose the problem.


On Fri, May 30, 2014 at 1:59 AM, Claudiu Barbura <Cl...@atigeo.com>> wrote:
This is a critical issue for us as we have to shut down frameworks for various components in our platform to work and this has created more contention than before we deployed Mesos, when everyone had to wait in line for their MR/Hive jobs to run.

Any guidance, ideas would be extremely helpful at this point.

Thank you,
Claudiu

From: Claudiu Barbura <cl...@atigeo.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Tuesday, May 27, 2014 at 11:57 PM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Framework Starvation

Hi,

Following Ben’s suggestion at the Seattle Spark Meetup in April, I built and deployed the 0-18.1-rc1 branch hoping that this would solve the framework starvation problem we have been seeing for the past 2 months now. The hope was that https://issues.apache.org/jira/browse/MESOS-1086 would also help us. Unfortunately it did not.
This bug is preventing us from running multiple Spark and Shark servers (http, thrift), in load-balanced fashion, alongside Hadoop and Aurora in the same Mesos cluster.

For example, if we start at least 3 frameworks, one Hadoop, one SparkJobServer (one Spark context in fine-grained mode) and one Http SharkServer (one JavaSharkContext that inherits from Spark Contexts, again in fine-grained mode) and we run queries on all three of them, very soon we notice the following behavior:


  *   only the last two frameworks that we run queries against receive resource offers (master.cpp log entries in the log/mesos-master.INFO)
  *   the other frameworks are ignored and not allocated any resources until we kill one of the two privileged ones above
  *   As soon as one of the privileged frameworks is terminated, one of the starved frameworks takes its place
  *   Any new Spark context created in coarse-grained mode (fixed number of cores) will generally receive offers immediately (rarely it gets starved)
  *   Hadoop behaves slightly differently when starved: task trackers are started but never released, which means that if the first job (Hive query) is small in terms of number of input splits, only one task tracker with a small number of allocated cores is created, and then all subsequent queries, regardless of size, are only run in a very limited mode with this one “small” task tracker. Most of the time only the map phase of a big query is completed while the reduce phase hangs. Killing one of the registered Spark contexts above releases resources for Mesos to complete the query and gracefully shut down the task trackers (as noticed in the master log)

We are using the default settings in terms of isolation, weights etc … the only standout configuration would be the memory allocation for the slave (export MESOS_resources=mem:35840 in mesos-slave-env.sh), but I am not sure if this is ever enforced, as each framework has its own executor process (JVM in our case) with its own memory allocation (we are not using cgroups yet)

A very easy way to reproduce this bug is to start a minimum of 3 shark-cli instances in a mesos cluster and notice that only two of them are being offered resources and are running queries successfully.
I spent quite a bit of time in the mesos, spark and hadoop-mesos code in an attempt to find a possible workaround, but no luck so far.

Any guidance would be very appreciated.

Thank you,
Claudiu







Re: Framework Starvation

Posted by Vinod Kone <vi...@twitter.com>.
Hey Claudiu,

I spent some time trying to understand the logs you posted. What's strange
to me is that in the very beginning, when frameworks 1 and 2 are
registered, only one framework gets offers for a period of 9s. It's not
clear why this happens. I even wrote a test (
https://reviews.apache.org/r/22714/) to repro but wasn't able to.

It would probably be helpful to add more logging to the drf sorting
comparator function to understand why frameworks are sorted in such a way
when their share is the same (0). My expectation is that after each allocation,
the 'allocations' for a framework should increase, causing the sort function
to behave correctly. But that doesn't seem to be happening in your case.
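
As a rough sketch of the kind of logging being suggested (hypothetical and simplified; the real comparator lives in master/drf_sorter.hpp, and Mesos would use its glog LOG(INFO) macros rather than std::cerr):

    // Hypothetical illustration: print each client's share and allocation
    // count from inside the sort comparator. Not the actual Mesos code.
    #include <iostream>
    #include <string>

    struct ClientInfo {
      std::string name;
      double share;
      unsigned allocations;
    };

    bool loggingCompare(const ClientInfo& a, const ClientInfo& b) {
      std::cerr << "DRF sort: " << a.name << " share=" << a.share
                << " allocations=" << a.allocations
                << " vs " << b.name << " share=" << b.share
                << " allocations=" << b.allocations << std::endl;

      if (a.share != b.share) {
        return a.share < b.share;
      }
      return a.allocations < b.allocations;
    }

    int main() {
      ClientInfo f0{"framework-0000", 0.0, 2};
      ClientInfo f1{"framework-0001", 0.0, 1};
      loggingCompare(f0, f1);  // prints both clients' shares and allocations
      return 0;
    }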


I0604 22:12:43.715530 22270 master.cpp:2282] Sending 4 offers to framework
20140604-221214-302055434-5050-22260-0000

I0604 22:12:44.276062 22273 master.cpp:2282] Sending 4 offers to framework
20140604-221214-302055434-5050-22260-0001

I0604 22:12:44.756918 22292 master.cpp:2282] Sending 4 offers to framework
20140604-221214-302055434-5050-22260-0000

I0604 22:12:45.794178 22276 master.cpp:2282] Sending 4 offers to framework
20140604-221214-302055434-5050-22260-0001

I0604 22:12:46.841629 22291 master.cpp:2282] Sending 4 offers to framework
20140604-221214-302055434-5050-22260-0001

I0604 22:12:47.884266 22262 master.cpp:2282] Sending 4 offers to framework
20140604-221214-302055434-5050-22260-0001

I0604 22:12:48.926856 22268 master.cpp:2282] Sending 4 offers to framework
20140604-221214-302055434-5050-22260-0001

I0604 22:12:49.966560 22280 master.cpp:2282] Sending 4 offers to framework
20140604-221214-302055434-5050-22260-0001

I0604 22:12:51.007143 22267 master.cpp:2282] Sending 4 offers to framework
20140604-221214-302055434-5050-22260-0001

I0604 22:12:52.047987 22280 master.cpp:2282] Sending 4 offers to framework
20140604-221214-302055434-5050-22260-0001

I0604 22:12:53.089340 22291 master.cpp:2282] Sending 4 offers to framework
20140604-221214-302055434-5050-22260-0001

I0604 22:12:54.130242 22263 master.cpp:2282] Sending 4 offers to framework
20140604-221214-302055434-5050-22260-0000


@vinodkone


On Fri, Jun 13, 2014 at 3:40 PM, Claudiu Barbura <Claudiu.Barbura@atigeo.com
> wrote:

>  Hi Vinod,
>
>  Attached are the patch files.Hadoop has to be treated differently as it
> requires resources in order to shut down task trackers after a job is
> complete. Therefore we set the role name so that Mesos allocates resources
> for it first, ahead of the rest of the frameworks under the default role
> (*).
> This is not ideal, we are going to loo into the Hadoop Mesos framework
> code and fix if possible. Luckily, Hadoop is the only framework we use on
> top of Mesos that allows a configurable role name to be passed in when
> registering a framework (unlike, Spark, Aurora, Storm etc)
> For the non-Hadoop frameworks, we are making sure that once a framework is
> running its jobs, Mesos no longer offers resources to it. In the same time,
> once a framework completes its job, we make sure its “client allocations”
> value is updated so that when it completes the execution of its jobs, it is
> placed back in the sorting list with a real chance of being offered again
> immediately (not starved!).
> What is also key is that mem type resources are ignored during share
> computation as only cpus are a good indicator of which frameworks are
> actually running jobs in the cluster.
>
>  Thanks,
> Claudiu
>
>   From: Claudiu Barbura <cl...@atigeo.com>
> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
> Date: Thursday, June 12, 2014 at 6:20 PM
>
> To: "user@mesos.apache.org" <us...@mesos.apache.org>
> Subject: Re: Framework Starvation
>
>   Hi Vinod,
>
>  We have a fix (more like a hack) that works for us, but it requires us
> to run each Hadoop framework with a different role as we need to treat
> Hadoop differently than the rest of the frameworks (Spark, Shark, Aurora)
> which are running with the default role (*).
> We had to change the drf_sorter.cpp/hpp and
> hierarchical_allocator_process.cpp files.
>
>  Let me know if you need more info on this.
>
>  Thanks,
> Claudiu
>
>   From: Claudiu Barbura <cl...@atigeo.com>
> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
> Date: Thursday, June 5, 2014 at 2:41 AM
> To: "user@mesos.apache.org" <us...@mesos.apache.org>
> Subject: Re: Framework Starvation
>
>   Hi Vinod,
>
>  I attached the master log after adding more logging to the sorter code.
> I believe the problem lies somewhere else however …
> in HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::allocate()
>
>  I will continue to investigate in the meantime.
>
>  Thanks,
> Claudiu
>
>   From: Vinod Kone <vi...@gmail.com>
> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
> Date: Tuesday, June 3, 2014 at 5:16 PM
> To: "user@mesos.apache.org" <us...@mesos.apache.org>
> Subject: Re: Framework Starvation
>
>   Either should be fine. I don't think there are any changes in allocator
> since 0.18.0-rc1.
>
>
> On Tue, Jun 3, 2014 at 4:08 PM, Claudiu Barbura <
> Claudiu.Barbura@atigeo.com> wrote:
>
>>  Hi Vinod,
>>
>>  Should we use the same 0-18.1-rc1 branch or trunk code?
>>
>>  Thanks,
>> Claudiu
>>
>>   From: Vinod Kone <vi...@gmail.com>
>> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>  Date: Tuesday, June 3, 2014 at 3:55 PM
>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>> Subject: Re: Framework Starvation
>>
>>   Hey Claudiu,
>>
>>  Is it possible for you to run the same test but logging more
>> information about the framework shares? For example, it would be really
>> insightful if you can log each framework's share in DRFSorter::sort()
>> (see: master/drf_sorter.hpp). This will help us diagnose the problem. I
>> suspect one of our open tickets around allocation (MESOS-1119
>> <https://issues.apache.org/jira/browse/MESOS-1119>, MESOS-1130
>> <https://issues.apache.org/jira/browse/MESOS-1130> and MESOS-1187
>> <https://issues.apache.org/jira/browse/MESOS-1187>) is the issue. But it
>> would be good to have that logging data regardless to confirm.
>>
>>
>> On Mon, Jun 2, 2014 at 10:46 AM, Claudiu Barbura <
>> Claudiu.Barbura@atigeo.com> wrote:
>>
>>>  Hi Vinod,
>>>
>>>  I tried to attach the logs (2MB) and the email (see below) did not go
>>> through. I emailed your gmail account separately.
>>>
>>>  Thanks,
>>> Claudiu
>>>
>>>   From: Claudiu Barbura <cl...@atigeo.com>
>>> Date: Monday, June 2, 2014 at 10:00 AM
>>>
>>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>> Subject: Re: Framework Starvation
>>>
>>>   Hi Vinod,
>>>
>>>  I attached the maser logs snapshots during starvation and after
>>> starvation.
>>>
>>>  There are 4 slave nodes and 1 master, all with of the same ec2
>>> instance type (cc2.8xlarge, 32 cores, 60GB RAM).
>>> I am running 4 shark-cli instances from the same master node, and
>>> running queries on all 4 of them … then “starvation” kicks in (see attached
>>> log_during_starvation file).
>>> After I terminate 2 of the shark-cli instances, the starved ones are
>>> receiving offers and are able to run queries again (see attached
>>> log_after_starvation file).
>>>
>>>  Let me know if you need the slave logs.
>>>
>>>  Thank you!
>>> Claudiu
>>>
>>>   From: Vinod Kone <vi...@gmail.com>
>>> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>> Date: Friday, May 30, 2014 at 10:13 AM
>>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>> Subject: Re: Framework Starvation
>>>
>>>   Hey Claudiu,
>>>
>>>  Mind posting some master logs with the simple setup that you described
>>> (3 shark cli instances)? That would help us better diagnose the problem.
>>>
>>>
>>> On Fri, May 30, 2014 at 1:59 AM, Claudiu Barbura <
>>> Claudiu.Barbura@atigeo.com> wrote:
>>>
>>>>  This is a critical issue for us as we have to shut down frameworks
>>>> for various components in our platform to work and this has created more
>>>> contention than before we deployed Mesos, when everyone had to wait in line
>>>> for their MR/Hive jobs to run.
>>>>
>>>>  Any guidance, ideas would be extremely helpful at this point.
>>>>
>>>>  Thank you,
>>>> Claudiu
>>>>
>>>>   From: Claudiu Barbura <cl...@atigeo.com>
>>>> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>>> Date: Tuesday, May 27, 2014 at 11:57 PM
>>>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>>> Subject: Framework Starvation
>>>>
>>>>   Hi,
>>>>
>>>>  Following Ben’s suggestion at the Seattle Spark Meetup in April, I
>>>> built and deployed  the 0-18.1-rc1 branch hoping that this wold solve the
>>>> framework starvation problem we have been seeing in the past 2 months now.
>>>> The hope was that https://issues.apache.org/jira/browse/MESOS-1086 would
>>>> also help us. Unfortunately it did not.
>>>> This bug is preventing us to run multiple spark and shark servers
>>>> (http, thrift), in load balanced fashion, Hadoop and Aurora in the same
>>>> mesos cluster.
>>>>
>>>>  For example, if we start at least 3 frameworks, one Hadoop, one
>>>> SparkJobServer (one Spark context in fine-grained mode) and one Http
>>>> SharkServer (one JavaSharkContext that inherits from Spark Contexts, again
>>>> in fine-grained mode) and we run queries on all three of them, very soon we
>>>> notice the following behavior:
>>>>
>>>>
>>>>    - only the last two frameworks that we run queries against receive
>>>>    resource offers (master.cpp log entries in the log/mesos-master.INFO)
>>>>    - the other frameworks are ignored and not allocated any resources
>>>>    until we kill one the two privileged ones above
>>>>    - As soon as one of the privileged framework is terminated, one of
>>>>    the starved framework takes its place
>>>>    - Any new Spark context created in coarse-grained mode (fixed
>>>>    number of cores) will generally receive offers immediately (rarely it gets
>>>>    starved)
>>>>    - Hadoop behaves slightly differently when starved: task trackers
>>>>    are started but never released, which means, if the first job (Hive query)
>>>>    is small in terms of number of input splits, only one task tracker with a
>>>>    small number of allocated ores is created, and then all subsequent queries,
>>>>    regardless of size are only run in very limited mode with this one “small”
>>>>    task tracker. Most of the time only the map phase of a big query is
>>>>    completed while the reduce phase is hanging. Killing one of the registered
>>>>    Spark context above releases resources for Mesos to complete the query and
>>>>    gracefully shut down the task trackers (as noticed in the master log)
>>>>
>>>> We are using the default settings in terms of isolation, weights etc …
>>>> the only stand out configuration would be the memory allocation for slave
>>>> (export MESOS_resources=mem:35840 in mesos-slave-env.sh) but I am not sure
>>>> if this is ever enforced, as each framework has its own executor process
>>>> (JVM in our case) with its own memory allocation (we are not using cgroups
>>>> yet)
>>>>
>>>>  A very easy to reproduce this bug is to start a minimum of 3
>>>> shark-cli instances in a mesos cluster and notice that only two of them are
>>>> being offered resources and are running queries successfully.
>>>> I spent quite a bit of time in mesos, spark and hadoop-mesos code in an
>>>> attempt to find a possible workaround  but no luck so far.
>>>>
>>>>  Any guidance would be very appreciated.
>>>>
>>>>  Thank you,
>>>> Claudiu
>>>>
>>>>
>>>>
>>>
>>
>

Re: Framework Starvation

Posted by Claudiu Barbura <Cl...@atigeo.com>.
Hi Vinod,

Attached are the patch files. Hadoop has to be treated differently as it requires resources in order to shut down task trackers after a job is complete. Therefore we set the role name so that Mesos allocates resources for it first, ahead of the rest of the frameworks under the default role (*).
This is not ideal; we are going to look into the Hadoop Mesos framework code and fix it if possible. Luckily, Hadoop is the only framework we use on top of Mesos that allows a configurable role name to be passed in when registering a framework (unlike Spark, Aurora, Storm, etc.)
For the non-Hadoop frameworks, we are making sure that once a framework is running its jobs, Mesos no longer offers resources to it. At the same time, once a framework completes its jobs, we make sure its “client allocations” value is updated so that it is placed back in the sorting list with a real chance of being offered resources again immediately (not starved!).
What is also key is that mem-type resources are ignored during share computation, as only cpus are a good indicator of which frameworks are actually running jobs in the cluster.
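
As a rough illustration of the cpu-only share idea (a hypothetical sketch of the approach described above, not the actual patch to drf_sorter.cpp):

    // Hypothetical sketch: compute a framework's share from cpus only,
    // ignoring mem, so that only frameworks actually running tasks accrue
    // a share. Illustration of the idea described above, not the real patch.
    #include <iostream>
    #include <map>
    #include <string>

    // Allocated resources per framework, by resource name ("cpus", "mem", ...).
    using Resources = std::map<std::string, double>;

    double cpuOnlyShare(const Resources& allocated, const Resources& total) {
      double totalCpus = total.at("cpus");
      auto it = allocated.find("cpus");
      double allocatedCpus = (it == allocated.end()) ? 0.0 : it->second;
      return totalCpus > 0.0 ? allocatedCpus / totalCpus : 0.0;
    }

    int main() {
      Resources total = {{"cpus", 128.0}, {"mem", 143360.0}};

      // A framework that holds a lot of memory but runs no tasks (no cpus)
      // gets share 0 and is eligible to be offered resources again right away.
      Resources idleButHoldingMem = {{"cpus", 0.0}, {"mem", 32768.0}};
      Resources runningTasks = {{"cpus", 64.0}, {"mem", 8192.0}};

      std::cout << cpuOnlyShare(idleButHoldingMem, total) << "\n";  // 0
      std::cout << cpuOnlyShare(runningTasks, total) << "\n";       // 0.5
      return 0;
    }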

Thanks,
Claudiu

From: Claudiu Barbura <cl...@atigeo.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Thursday, June 12, 2014 at 6:20 PM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hi Vinod,

We have a fix (more like a hack) that works for us, but it requires us to run each Hadoop framework with a different role as we need to treat Hadoop differently than the rest of the frameworks (Spark, Shark, Aurora) which are running with the default role (*).
We had to change the drf_sorter.cpp/hpp and hierarchical_allocator_process.cpp files.

Let me know if you need more info on this.

Thanks,
Claudiu

From: Claudiu Barbura <cl...@atigeo.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Thursday, June 5, 2014 at 2:41 AM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hi Vinod,

I attached the master log after adding more logging to the sorter code.
I believe the problem lies somewhere else however … in HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::allocate()

I will continue to investigate in the meantime.

Thanks,
Claudiu

From: Vinod Kone <vi...@gmail.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Tuesday, June 3, 2014 at 5:16 PM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Either should be fine. I don't think there are any changes in allocator since 0.18.0-rc1.


On Tue, Jun 3, 2014 at 4:08 PM, Claudiu Barbura <Cl...@atigeo.com>> wrote:
Hi Vinod,

Should we use the same 0-18.1-rc1 branch or trunk code?

Thanks,
Claudiu

From: Vinod Kone <vi...@gmail.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Tuesday, June 3, 2014 at 3:55 PM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hey Claudiu,

Is it possible for you to run the same test but logging more information about the framework shares? For example, it would be really insightful if you can log each framework's share in DRFSorter::sort() (see: master/drf_sorter.hpp). This will help us diagnose the problem. I suspect one of our open tickets around allocation (MESOS-1119<https://issues.apache.org/jira/browse/MESOS-1119>, MESOS-1130<https://issues.apache.org/jira/browse/MESOS-1130> and MESOS-1187<https://issues.apache.org/jira/browse/MESOS-1187>) is the issue. But it would be good to have that logging data regardless to confirm.


On Mon, Jun 2, 2014 at 10:46 AM, Claudiu Barbura <Cl...@atigeo.com>> wrote:
Hi Vinod,

I tried to attach the logs (2MB) and the email (see below) did not go through. I emailed your gmail account separately.

Thanks,
Claudiu

From: Claudiu Barbura <cl...@atigeo.com>>
Date: Monday, June 2, 2014 at 10:00 AM

To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hi Vinod,

I attached the master log snapshots during starvation and after starvation.

There are 4 slave nodes and 1 master, all of the same ec2 instance type (cc2.8xlarge, 32 cores, 60GB RAM).
I am running 4 shark-cli instances from the same master node, and running queries on all 4 of them … then “starvation” kicks in (see attached log_during_starvation file).
After I terminate 2 of the shark-cli instances, the starved ones are receiving offers and are able to run queries again (see attached log_after_starvation file).

Let me know if you need the slave logs.

Thank you!
Claudiu

From: Vinod Kone <vi...@gmail.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Friday, May 30, 2014 at 10:13 AM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hey Claudiu,

Mind posting some master logs with the simple setup that you described (3 shark cli instances)? That would help us better diagnose the problem.


On Fri, May 30, 2014 at 1:59 AM, Claudiu Barbura <Cl...@atigeo.com>> wrote:
This is a critical issue for us as we have to shut down frameworks for various components in our platform to work and this has created more contention than before we deployed Mesos, when everyone had to wait in line for their MR/Hive jobs to run.

Any guidance, ideas would be extremely helpful at this point.

Thank you,
Claudiu

From: Claudiu Barbura <cl...@atigeo.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Tuesday, May 27, 2014 at 11:57 PM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Framework Starvation

Hi,

Following Ben’s suggestion at the Seattle Spark Meetup in April, I built and deployed the 0-18.1-rc1 branch hoping that this would solve the framework starvation problem we have been seeing in the past 2 months now. The hope was that https://issues.apache.org/jira/browse/MESOS-1086 would also help us. Unfortunately it did not.
This bug is preventing us from running multiple spark and shark servers (http, thrift), in load balanced fashion, Hadoop and Aurora in the same mesos cluster.

For example, if we start at least 3 frameworks, one Hadoop, one SparkJobServer (one Spark context in fine-grained mode) and one Http SharkServer (one JavaSharkContext that inherits from Spark Contexts, again in fine-grained mode) and we run queries on all three of them, very soon we notice the following behavior:


  *   only the last two frameworks that we run queries against receive resource offers (master.cpp log entries in the log/mesos-master.INFO)
  *   the other frameworks are ignored and not allocated any resources until we kill one of the two privileged ones above
  *   As soon as one of the privileged frameworks is terminated, one of the starved frameworks takes its place
  *   Any new Spark context created in coarse-grained mode (fixed number of cores) will generally receive offers immediately (rarely it gets starved)
  *   Hadoop behaves slightly differently when starved: task trackers are started but never released, which means, if the first job (Hive query) is small in terms of number of input splits, only one task tracker with a small number of allocated cores is created, and then all subsequent queries, regardless of size are only run in very limited mode with this one “small” task tracker. Most of the time only the map phase of a big query is completed while the reduce phase is hanging. Killing one of the registered Spark contexts above releases resources for Mesos to complete the query and gracefully shut down the task trackers (as noticed in the master log)

We are using the default settings in terms of isolation, weights etc … the only stand out configuration would be the memory allocation for slave (export MESOS_resources=mem:35840 in mesos-slave-env.sh) but I am not sure if this is ever enforced, as each framework has its own executor process (JVM in our case) with its own memory allocation (we are not using cgroups yet)
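
For reference, a sketch of a fuller mesos-slave-env.sh that advertises both cpus and mem and turns on cgroups isolation is shown below; the isolation flag values differ between Mesos versions (older releases accepted only "cgroups"), so treat them as placeholders to check against mesos-slave --help for the version in use. Without a cgroups isolator the mem figure is only used for offer accounting and is not enforced on the executor JVMs.

# mesos-slave-env.sh (sketch; values are examples for one cc2.8xlarge slave)
export MESOS_master=zk://master:2181/mesos
export MESOS_resources="cpus:32;mem:35840"
# Version-dependent flag value; older releases used plain "cgroups".
export MESOS_isolation="cgroups/cpu,cgroups/mem"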

A very easy way to reproduce this bug is to start a minimum of 3 shark-cli instances in a mesos cluster and notice that only two of them are being offered resources and are running queries successfully.
I spent quite a bit of time in mesos, spark and hadoop-mesos code in an attempt to find a possible workaround  but no luck so far.
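
A rough sketch of that reproduction is below; the environment variable names are the usual Shark-on-Mesos settings of that era, and the paths, host names and instance count are placeholders, so double-check them against the Shark documentation for the version in use.

# conf/shark-env.sh on the node running the CLIs (placeholder values)
export MASTER=mesos://master:5050
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so

# Start at least three CLIs; each one registers as a separate framework.
# Run queries in all of them and watch the master log to see which
# frameworks keep receiving offers.
bin/shark   # repeat in separate terminals for instances 2, 3, ...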

Any guidance would be very appreciated.

Thank you,
Claudiu






Re: Framework Starvation

Posted by Claudiu Barbura <Cl...@atigeo.com>.
Hi Vinod,

We have a fix (more like a hack) that works for us, but it requires us to run each Hadoop framework with a different role as we need to treat Hadoop differently than the rest of the frameworks (Spark, Shark, Aurora) which are running with the default role (*).
We had to change the drf_sorter.cpp/hpp and hierarchical_allocator_process.cpp files.
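
To show what running a framework under its own role means at registration time, here is a hedged C++ sketch against the scheduler driver API of that era; the framework name, role name and master address are placeholders, and the real hadoop-mesos scheduler takes the role from its own configuration rather than hard-coding it like this:

#include <string>

#include <mesos/mesos.hpp>
#include <mesos/scheduler.hpp>

// Sketch: register an existing Scheduler implementation under a dedicated
// role instead of the default "*", so the hierarchical allocator tracks it
// in its own role bucket.
int runWithRole(mesos::Scheduler* scheduler, const std::string& master)
{
  mesos::FrameworkInfo framework;
  framework.set_user("");        // empty user: let Mesos fill in the current user
  framework.set_name("hadoop");  // placeholder name
  framework.set_role("hadoop");  // non-default role

  mesos::MesosSchedulerDriver driver(scheduler, framework, master);
  return driver.run() == mesos::DRIVER_STOPPED ? 0 : 1;
}

Depending on the Mesos version, the master may also need to be started with a --roles list (and optionally --weights) that includes the new role.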

Let me know if you need more info on this.

Thanks,
Claudiu

From: Claudiu Barbura <cl...@atigeo.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Thursday, June 5, 2014 at 2:41 AM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hi Vinod,

I attached the master log after adding more logging to the sorter code.
I believe the problem lies somewhere else however … in HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::allocate()

I will continue to investigate in the meantime.

Thanks,
Claudiu

From: Vinod Kone <vi...@gmail.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Tuesday, June 3, 2014 at 5:16 PM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Either should be fine. I don't think there are any changes in allocator since 0.18.0-rc1.


On Tue, Jun 3, 2014 at 4:08 PM, Claudiu Barbura <Cl...@atigeo.com>> wrote:
Hi Vinod,

Should we use the same 0-18.1-rc1 branch or trunk code?

Thanks,
Claudiu

From: Vinod Kone <vi...@gmail.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Tuesday, June 3, 2014 at 3:55 PM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hey Claudiu,

Is it possible for you to run the same test but logging more information about the framework shares? For example, it would be really insightful if you can log each framework's share in DRFSorter::sort() (see: master/drf_sorter.hpp). This will help us diagnose the problem. I suspect one of our open tickets around allocation (MESOS-1119<https://issues.apache.org/jira/browse/MESOS-1119>, MESOS-1130<https://issues.apache.org/jira/browse/MESOS-1130> and MESOS-1187<https://issues.apache.org/jira/browse/MESOS-1187>) is the issue. But it would be good to have that logging data regardless to confirm.


On Mon, Jun 2, 2014 at 10:46 AM, Claudiu Barbura <Cl...@atigeo.com>> wrote:
Hi Vinod,

I tried to attach the logs (2MB) and the email (see below) did not go through. I emailed your gmail account separately.

Thanks,
Claudiu

From: Claudiu Barbura <cl...@atigeo.com>>
Date: Monday, June 2, 2014 at 10:00 AM

To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hi Vinod,

I attached the master log snapshots during starvation and after starvation.

There are 4 slave nodes and 1 master, all of the same ec2 instance type (cc2.8xlarge, 32 cores, 60GB RAM).
I am running 4 shark-cli instances from the same master node, and running queries on all 4 of them … then “starvation” kicks in (see attached log_during_starvation file).
After I terminate 2 of the shark-cli instances, the starved ones are receiving offers and are able to run queries again (see attached log_after_starvation file).

Let me know if you need the slave logs.

Thank you!
Claudiu

From: Vinod Kone <vi...@gmail.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Friday, May 30, 2014 at 10:13 AM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hey Claudiu,

Mind posting some master logs with the simple setup that you described (3 shark cli instances)? That would help us better diagnose the problem.


On Fri, May 30, 2014 at 1:59 AM, Claudiu Barbura <Cl...@atigeo.com>> wrote:
This is a critical issue for us as we have to shut down frameworks for various components in our platform to work and this has created more contention than before we deployed Mesos, when everyone had to wait in line for their MR/Hive jobs to run.

Any guidance, ideas would be extremely helpful at this point.

Thank you,
Claudiu

From: Claudiu Barbura <cl...@atigeo.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Tuesday, May 27, 2014 at 11:57 PM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Framework Starvation

Hi,

Following Ben’s suggestion at the Seattle Spark Meetup in April, I built and deployed the 0-18.1-rc1 branch hoping that this would solve the framework starvation problem we have been seeing in the past 2 months now. The hope was that https://issues.apache.org/jira/browse/MESOS-1086 would also help us. Unfortunately it did not.
This bug is preventing us from running multiple spark and shark servers (http, thrift), in load balanced fashion, Hadoop and Aurora in the same mesos cluster.

For example, if we start at least 3 frameworks, one Hadoop, one SparkJobServer (one Spark context in fine-grained mode) and one Http SharkServer (one JavaSharkContext that inherits from Spark Contexts, again in fine-grained mode) and we run queries on all three of them, very soon we notice the following behavior:


  *   only the last two frameworks that we run queries against receive resource offers (master.cpp log entries in the log/mesos-master.INFO)
  *   the other frameworks are ignored and not allocated any resources until we kill one of the two privileged ones above
  *   As soon as one of the privileged frameworks is terminated, one of the starved frameworks takes its place
  *   Any new Spark context created in coarse-grained mode (fixed number of cores) will generally receive offers immediately (rarely it gets starved)
  *   Hadoop behaves slightly differently when starved: task trackers are started but never released, which means, if the first job (Hive query) is small in terms of number of input splits, only one task tracker with a small number of allocated cores is created, and then all subsequent queries, regardless of size are only run in very limited mode with this one “small” task tracker. Most of the time only the map phase of a big query is completed while the reduce phase is hanging. Killing one of the registered Spark contexts above releases resources for Mesos to complete the query and gracefully shut down the task trackers (as noticed in the master log)

We are using the default settings in terms of isolation, weights etc … the only stand out configuration would be the memory allocation for slave (export MESOS_resources=mem:35840 in mesos-slave-env.sh) but I am not sure if this is ever enforced, as each framework has its own executor process (JVM in our case) with its own memory allocation (we are not using cgroups yet)

A very easy way to reproduce this bug is to start a minimum of 3 shark-cli instances in a mesos cluster and notice that only two of them are being offered resources and are running queries successfully.
I spent quite a bit of time in mesos, spark and hadoop-mesos code in an attempt to find a possible workaround  but no luck so far.

Any guidance would be very appreciated.

Thank you,
Claudiu






Re: Framework Starvation

Posted by Claudiu Barbura <Cl...@atigeo.com>.
Hi Vinod,

I attached the master log after adding more logging to the sorter code.
I believe the problem lies somewhere else however … in HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::allocate()
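
For readers following along in the source, the control flow in question is a two-level sort: roles first, then the frameworks registered under each role. The stand-in below only mimics that walk with made-up names, roles and share values; it is not the real allocator code, it just makes it easier to see where the per-framework ordering decisions happen.

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for a sorter client: a name plus its current share.
struct Client {
  std::string name;
  double share;  // lower share is visited first under DRF
};

static std::vector<Client> sortByShare(std::vector<Client> clients) {
  std::sort(clients.begin(), clients.end(),
            [](const Client& a, const Client& b) { return a.share < b.share; });
  return clients;
}

int main() {
  // One default role plus a dedicated "hadoop" role, with made-up shares.
  std::vector<Client> roles = {{"*", 0.8}, {"hadoop", 0.1}};
  std::map<std::string, std::vector<Client>> frameworks = {
      {"*", {{"shark-1", 0.45}, {"shark-2", 0.35}, {"spark-jobserver", 0.0}}},
      {"hadoop", {{"hadoop-jobtracker", 0.1}}}};

  // Mimics the nested walk: sorted roles, then sorted frameworks per role.
  for (const Client& role : sortByShare(roles)) {
    for (const Client& framework : sortByShare(frameworks[role.name])) {
      std::cout << "next offer candidate: role " << role.name
                << ", framework " << framework.name
                << " (share " << framework.share << ")" << std::endl;
    }
  }
  return 0;
}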

I will continue to investigate in the meantime.

Thanks,
Claudiu

From: Vinod Kone <vi...@gmail.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Tuesday, June 3, 2014 at 5:16 PM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Either should be fine. I don't think there are any changes in allocator since 0.18.0-rc1.


On Tue, Jun 3, 2014 at 4:08 PM, Claudiu Barbura <Cl...@atigeo.com>> wrote:
Hi Vinod,

Should we use the same 0-18.1-rc1 branch or trunk code?

Thanks,
Claudiu

From: Vinod Kone <vi...@gmail.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Tuesday, June 3, 2014 at 3:55 PM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hey Claudiu,

Is it possible for you to run the same test but logging more information about the framework shares? For example, it would be really insightful if you can log each framework's share in DRFSorter::sort() (see: master/drf_sorter.hpp). This will help us diagnose the problem. I suspect one of our open tickets around allocation (MESOS-1119<https://issues.apache.org/jira/browse/MESOS-1119>, MESOS-1130<https://issues.apache.org/jira/browse/MESOS-1130> and MESOS-1187<https://issues.apache.org/jira/browse/MESOS-1187>) is the issue. But it would be good to have that logging data regardless to confirm.


On Mon, Jun 2, 2014 at 10:46 AM, Claudiu Barbura <Cl...@atigeo.com>> wrote:
Hi Vinod,

I tried to attach the logs (2MB) and the email (see below) did not go through. I emailed your gmail account separately.

Thanks,
Claudiu

From: Claudiu Barbura <cl...@atigeo.com>>
Date: Monday, June 2, 2014 at 10:00 AM

To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hi Vinod,

I attached the master log snapshots during starvation and after starvation.

There are 4 slave nodes and 1 master, all of the same ec2 instance type (cc2.8xlarge, 32 cores, 60GB RAM).
I am running 4 shark-cli instances from the same master node, and running queries on all 4 of them … then “starvation” kicks in (see attached log_during_starvation file).
After I terminate 2 of the shark-cli instances, the starved ones are receiving offers and are able to run queries again (see attached log_after_starvation file).

Let me know if you need the slave logs.

Thank you!
Claudiu

From: Vinod Kone <vi...@gmail.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Friday, May 30, 2014 at 10:13 AM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hey Claudiu,

Mind posting some master logs with the simple setup that you described (3 shark cli instances)? That would help us better diagnose the problem.


On Fri, May 30, 2014 at 1:59 AM, Claudiu Barbura <Cl...@atigeo.com>> wrote:
This is a critical issue for us as we have to shut down frameworks for various components in our platform to work and this has created more contention than before we deployed Mesos, when everyone had to wait in line for their MR/Hive jobs to run.

Any guidance, ideas would be extremely helpful at this point.

Thank you,
Claudiu

From: Claudiu Barbura <cl...@atigeo.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Tuesday, May 27, 2014 at 11:57 PM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Framework Starvation

Hi,

Following Ben’s suggestion at the Seattle Spark Meetup in April, I built and deployed the 0-18.1-rc1 branch hoping that this would solve the framework starvation problem we have been seeing in the past 2 months now. The hope was that https://issues.apache.org/jira/browse/MESOS-1086 would also help us. Unfortunately it did not.
This bug is preventing us from running multiple spark and shark servers (http, thrift), in load balanced fashion, Hadoop and Aurora in the same mesos cluster.

For example, if we start at least 3 frameworks, one Hadoop, one SparkJobServer (one Spark context in fine-grained mode) and one Http SharkServer (one JavaSharkContext that inherits from Spark Contexts, again in fine-grained mode) and we run queries on all three of them, very soon we notice the following behavior:


  *   only the last two frameworks that we run queries against receive resource offers (master.cpp log entries in the log/mesos-master.INFO)
  *   the other frameworks are ignored and not allocated any resources until we kill one of the two privileged ones above
  *   As soon as one of the privileged frameworks is terminated, one of the starved frameworks takes its place
  *   Any new Spark context created in coarse-grained mode (fixed number of cores) will generally receive offers immediately (rarely it gets starved)
  *   Hadoop behaves slightly differently when starved: task trackers are started but never released, which means, if the first job (Hive query) is small in terms of number of input splits, only one task tracker with a small number of allocated cores is created, and then all subsequent queries, regardless of size are only run in very limited mode with this one “small” task tracker. Most of the time only the map phase of a big query is completed while the reduce phase is hanging. Killing one of the registered Spark contexts above releases resources for Mesos to complete the query and gracefully shut down the task trackers (as noticed in the master log)

We are using the default settings in terms of isolation, weights etc … the only stand out configuration would be the memory allocation for slave (export MESOS_resources=mem:35840 in mesos-slave-env.sh) but I am not sure if this is ever enforced, as each framework has its own executor process (JVM in our case) with its own memory allocation (we are not using cgroups yet)

A very easy way to reproduce this bug is to start a minimum of 3 shark-cli instances in a mesos cluster and notice that only two of them are being offered resources and are running queries successfully.
I spent quite a bit of time in mesos, spark and hadoop-mesos code in an attempt to find a possible workaround  but no luck so far.

Any guidance would be very appreciated.

Thank you,
Claudiu






Re: Framework Starvation

Posted by Vinod Kone <vi...@gmail.com>.
Either should be fine. I don't think there are any changes in allocator
since 0.18.0-rc1.


On Tue, Jun 3, 2014 at 4:08 PM, Claudiu Barbura <Cl...@atigeo.com>
wrote:

>  Hi Vinod,
>
>  Should we use the same 0-18.1-rc1 branch or trunk code?
>
>  Thanks,
> Claudiu
>
>   From: Vinod Kone <vi...@gmail.com>
> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
> Date: Tuesday, June 3, 2014 at 3:55 PM
> To: "user@mesos.apache.org" <us...@mesos.apache.org>
> Subject: Re: Framework Starvation
>
>   Hey Claudiu,
>
>  Is it possible for you to run the same test but logging more information
> about the framework shares? For example, it would be really insightful if
> you can log each framework's share in DRFSorter::sort() (see:
> master/drf_sorter.hpp). This will help us diagnose the problem. I suspect
> one of our open tickets around allocation (MESOS-1119
> <https://issues.apache.org/jira/browse/MESOS-1119>, MESOS-1130
> <https://issues.apache.org/jira/browse/MESOS-1130> and MESOS-1187
> <https://issues.apache.org/jira/browse/MESOS-1187>) is the issue. But it
> would be good to have that logging data regardless to confirm.
>
>
> On Mon, Jun 2, 2014 at 10:46 AM, Claudiu Barbura <
> Claudiu.Barbura@atigeo.com> wrote:
>
>>  Hi Vinod,
>>
>>  I tried to attach the logs (2MB) and the email (see below) did not go
>> through. I emailed your gmail account separately.
>>
>>  Thanks,
>> Claudiu
>>
>>   From: Claudiu Barbura <cl...@atigeo.com>
>> Date: Monday, June 2, 2014 at 10:00 AM
>>
>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>> Subject: Re: Framework Starvation
>>
>>   Hi Vinod,
>>
>> I attached the master log snapshots during starvation and after
>> starvation.
>>
>> There are 4 slave nodes and 1 master, all of the same ec2 instance
>> type (cc2.8xlarge, 32 cores, 60GB RAM).
>> I am running 4 shark-cli instances from the same master node, and running
>> queries on all 4 of them … then “starvation” kicks in (see attached
>> log_during_starvation file).
>> After I terminate 2 of the shark-cli instances, the starved ones are
>> receiving offers and are able to run queries again (see attached
>> log_after_starvation file).
>>
>>  Let me know if you need the slave logs.
>>
>>  Thank you!
>> Claudiu
>>
>>   From: Vinod Kone <vi...@gmail.com>
>> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
>> Date: Friday, May 30, 2014 at 10:13 AM
>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>> Subject: Re: Framework Starvation
>>
>>   Hey Claudiu,
>>
>>  Mind posting some master logs with the simple setup that you described
>> (3 shark cli instances)? That would help us better diagnose the problem.
>>
>>
>> On Fri, May 30, 2014 at 1:59 AM, Claudiu Barbura <
>> Claudiu.Barbura@atigeo.com> wrote:
>>
>>>  This is a critical issue for us as we have to shut down frameworks for
>>> various components in our platform to work and this has created more
>>> contention than before we deployed Mesos, when everyone had to wait in line
>>> for their MR/Hive jobs to run.
>>>
>>>  Any guidance, ideas would be extremely helpful at this point.
>>>
>>>  Thank you,
>>> Claudiu
>>>
>>>   From: Claudiu Barbura <cl...@atigeo.com>
>>> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>> Date: Tuesday, May 27, 2014 at 11:57 PM
>>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>>> Subject: Framework Starvation
>>>
>>>   Hi,
>>>
>>>  Following Ben’s suggestion at the Seattle Spark Meetup in April, I
>>> built and deployed the 0-18.1-rc1 branch hoping that this would solve the
>>> framework starvation problem we have been seeing in the past 2 months now.
>>> The hope was that https://issues.apache.org/jira/browse/MESOS-1086 would
>>> also help us. Unfortunately it did not.
>>> This bug is preventing us from running multiple spark and shark servers (http,
>>> thrift), in load balanced fashion, Hadoop and Aurora in the same mesos
>>> cluster.
>>>
>>>  For example, if we start at least 3 frameworks, one Hadoop, one
>>> SparkJobServer (one Spark context in fine-grained mode) and one Http
>>> SharkServer (one JavaSharkContext that inherits from Spark Contexts, again
>>> in fine-grained mode) and we run queries on all three of them, very soon we
>>> notice the following behavior:
>>>
>>>
>>>    - only the last two frameworks that we run queries against receive
>>>    resource offers (master.cpp log entries in the log/mesos-master.INFO)
>>>    - the other frameworks are ignored and not allocated any resources
>>>    until we kill one of the two privileged ones above
>>>    - As soon as one of the privileged frameworks is terminated, one of
>>>    the starved frameworks takes its place
>>>    - Any new Spark context created in coarse-grained mode (fixed number
>>>    of cores) will generally receive offers immediately (rarely it gets
>>>    starved)
>>>    - Hadoop behaves slightly differently when starved: task trackers
>>>    are started but never released, which means, if the first job (Hive query)
>>>    is small in terms of number of input splits, only one task tracker with a
>>>    small number of allocated cores is created, and then all subsequent queries,
>>>    regardless of size are only run in very limited mode with this one “small”
>>>    task tracker. Most of the time only the map phase of a big query is
>>>    completed while the reduce phase is hanging. Killing one of the registered
>>>    Spark contexts above releases resources for Mesos to complete the query and
>>>    gracefully shut down the task trackers (as noticed in the master log)
>>>
>>> We are using the default settings in terms of isolation, weights etc …
>>> the only stand out configuration would be the memory allocation for slave
>>> (export MESOS_resources=mem:35840 in mesos-slave-env.sh) but I am not sure
>>> if this is ever enforced, as each framework has its own executor process
>>> (JVM in our case) with its own memory allocation (we are not using cgroups
>>> yet)
>>>
>>>  A very easy way to reproduce this bug is to start a minimum of 3 shark-cli
>>> instances in a mesos cluster and notice that only two of them are being
>>> offered resources and are running queries successfully.
>>> I spent quite a bit of time in mesos, spark and hadoop-mesos code in an
>>> attempt to find a possible workaround  but no luck so far.
>>>
>>>  Any guidance would be very appreciated.
>>>
>>>  Thank you,
>>> Claudiu
>>>
>>>
>>>
>>
>

Re: Framework Starvation

Posted by Claudiu Barbura <Cl...@atigeo.com>.
Hi Vinod,

Should we use the same 0-18.1-rc1 branch or trunk code?

Thanks,
Claudiu

From: Vinod Kone <vi...@gmail.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Tuesday, June 3, 2014 at 3:55 PM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hey Claudiu,

Is it possible for you to run the same test but logging more information about the framework shares? For example, it would be really insightful if you can log each framework's share in DRFSorter::sort() (see: master/drf_sorter.hpp). This will help us diagnose the problem. I suspect one of our open tickets around allocation (MESOS-1119<https://issues.apache.org/jira/browse/MESOS-1119>, MESOS-1130<https://issues.apache.org/jira/browse/MESOS-1130> and MESOS-1187<https://issues.apache.org/jira/browse/MESOS-1187>) is the issue. But it would be good to have that logging data regardless to confirm.


On Mon, Jun 2, 2014 at 10:46 AM, Claudiu Barbura <Cl...@atigeo.com>> wrote:
Hi Vinod,

I tried to attach the logs (2MB) and the email (see below) did not go through. I emailed your gmail account separately.

Thanks,
Claudiu

From: Claudiu Barbura <cl...@atigeo.com>>
Date: Monday, June 2, 2014 at 10:00 AM

To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hi Vinod,

I attached the master log snapshots during starvation and after starvation.

There are 4 slave nodes and 1 master, all of the same ec2 instance type (cc2.8xlarge, 32 cores, 60GB RAM).
I am running 4 shark-cli instances from the same master node, and running queries on all 4 of them ... then "starvation" kicks in (see attached log_during_starvation file).
After I terminate 2 of the shark-cli instances, the starved ones are receiving offers and are able to run queries again (see attached log_after_starvation file).

Let me know if you need the slave logs.

Thank you!
Claudiu

From: Vinod Kone <vi...@gmail.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Friday, May 30, 2014 at 10:13 AM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hey Claudiu,

Mind posting some master logs with the simple setup that you described (3 shark cli instances)? That would help us better diagnose the problem.


On Fri, May 30, 2014 at 1:59 AM, Claudiu Barbura <Cl...@atigeo.com>> wrote:
This is a critical issue for us as we have to shut down frameworks for various components in our platform to work and this has created more contention than before we deployed Mesos, when everyone had to wait in line for their MR/Hive jobs to run.

Any guidance, ideas would be extremely helpful at this point.

Thank you,
Claudiu

From: Claudiu Barbura <cl...@atigeo.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Tuesday, May 27, 2014 at 11:57 PM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Framework Starvation

Hi,

Following Ben's suggestion at the Seattle Spark Meetup in April, I built and deployed the 0-18.1-rc1 branch hoping that this would solve the framework starvation problem we have been seeing in the past 2 months now. The hope was that https://issues.apache.org/jira/browse/MESOS-1086 would also help us. Unfortunately it did not.
This bug is preventing us from running multiple spark and shark servers (http, thrift), in load balanced fashion, Hadoop and Aurora in the same mesos cluster.

For example, if we start at least 3 frameworks, one Hadoop, one SparkJobServer (one Spark context in fine-grained mode) and one Http SharkServer (one JavaSharkContext that inherits from Spark Contexts, again in fine-grained mode) and we run queries on all three of them, very soon we notice the following behavior:


  *   only the last two frameworks that we run queries against receive resource offers (master.cpp log entries in the log/mesos-master.INFO)
  *   the other frameworks are ignored and not allocated any resources until we kill one of the two privileged ones above
  *   As soon as one of the privileged frameworks is terminated, one of the starved frameworks takes its place
  *   Any new Spark context created in coarse-grained mode (fixed number of cores) will generally receive offers immediately (rarely it gets starved)
  *   Hadoop behaves slightly differently when starved: task trackers are started but never released, which means, if the first job (Hive query) is small in terms of number of input splits, only one task tracker with a small number of allocated cores is created, and then all subsequent queries, regardless of size are only run in very limited mode with this one "small" task tracker. Most of the time only the map phase of a big query is completed while the reduce phase is hanging. Killing one of the registered Spark contexts above releases resources for Mesos to complete the query and gracefully shut down the task trackers (as noticed in the master log)

We are using the default settings in terms of isolation, weights etc ... the only stand out configuration would be the memory allocation for slave (export MESOS_resources=mem:35840 in mesos-slave-env.sh) but I am not sure if this is ever enforced, as each framework has its own executor process (JVM in our case) with its own memory allocation (we are not using cgroups yet)

A very easy way to reproduce this bug is to start a minimum of 3 shark-cli instances in a mesos cluster and notice that only two of them are being offered resources and are running queries successfully.
I spent quite a bit of time in mesos, spark and hadoop-mesos code in an attempt to find a possible workaround  but no luck so far.

Any guidance would be very appreciated.

Thank you,
Claudiu





Re: Framework Starvation

Posted by Vinod Kone <vi...@gmail.com>.
Hey Claudiu,

Is it possible for you to run the same test but logging more information
about the framework shares? For example, it would be really insightful if
you can log each framework's share in DRFSorter::sort() (see:
master/drf_sorter.hpp). This will help us diagnose the problem. I suspect
one of our open tickets around allocation (MESOS-1119
<https://issues.apache.org/jira/browse/MESOS-1119>, MESOS-1130
<https://issues.apache.org/jira/browse/MESOS-1130> and MESOS-1187
<https://issues.apache.org/jira/browse/MESOS-1187>) is the issue. But it
would be good to have that logging data regardless to confirm.
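
For reference, the quantity being asked for here, a framework's DRF dominant share, is the maximum over resource types of allocated/total. The standalone sketch below shows that computation with illustrative structs and numbers; it is not the sorter's real internals, only the arithmetic the extra log lines would surface.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Toy per-framework usage; in the real cluster these come from the
// allocations tracked by the sorter.
struct Usage {
  std::string framework;
  double cpus;
  double mem;
};

int main() {
  const double totalCpus = 128.0;    // 4 slaves x 32 cores
  const double totalMem = 143360.0;  // 4 slaves x mem:35840

  const std::vector<Usage> usages = {
      {"shark-1", 64.0, 16384.0},
      {"shark-2", 48.0, 16384.0},
      {"shark-3", 0.0, 8192.0}};  // idle, but its executor still holds memory

  for (const Usage& u : usages) {
    // Dominant share: the largest fraction this framework holds of any
    // single resource type.
    const double share = std::max(u.cpus / totalCpus, u.mem / totalMem);
    std::cout << "framework " << u.framework
              << " dominant share " << share << std::endl;
  }
  return 0;
}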


On Mon, Jun 2, 2014 at 10:46 AM, Claudiu Barbura <Claudiu.Barbura@atigeo.com
> wrote:

>  Hi Vinod,
>
>  I tried to attach the logs (2MB) and the email (see below) did not go
> through. I emailed your gmail account separately.
>
>  Thanks,
> Claudiu
>
>   From: Claudiu Barbura <cl...@atigeo.com>
> Date: Monday, June 2, 2014 at 10:00 AM
>
> To: "user@mesos.apache.org" <us...@mesos.apache.org>
> Subject: Re: Framework Starvation
>
>   Hi Vinod,
>
> I attached the master log snapshots during starvation and after
> starvation.
>
> There are 4 slave nodes and 1 master, all of the same ec2 instance
> type (cc2.8xlarge, 32 cores, 60GB RAM).
> I am running 4 shark-cli instances from the same master node, and running
> queries on all 4 of them … then “starvation” kicks in (see attached
> log_during_starvation file).
> After I terminate 2 of the shark-cli instances, the starved ones are
> receiving offers and are able to run queries again (see attached
> log_after_starvation file).
>
>  Let me know if you need the slave logs.
>
>  Thank you!
> Claudiu
>
>   From: Vinod Kone <vi...@gmail.com>
> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
> Date: Friday, May 30, 2014 at 10:13 AM
> To: "user@mesos.apache.org" <us...@mesos.apache.org>
> Subject: Re: Framework Starvation
>
>   Hey Claudiu,
>
>  Mind posting some master logs with the simple setup that you described
> (3 shark cli instances)? That would help us better diagnose the problem.
>
>
> On Fri, May 30, 2014 at 1:59 AM, Claudiu Barbura <
> Claudiu.Barbura@atigeo.com> wrote:
>
>>  This is a critical issue for us as we have to shut down frameworks for
>> various components in our platform to work and this has created more
>> contention than before we deployed Mesos, when everyone had to wait in line
>> for their MR/Hive jobs to run.
>>
>>  Any guidance, ideas would be extremely helpful at this point.
>>
>>  Thank you,
>> Claudiu
>>
>>   From: Claudiu Barbura <cl...@atigeo.com>
>> Reply-To: "user@mesos.apache.org" <us...@mesos.apache.org>
>> Date: Tuesday, May 27, 2014 at 11:57 PM
>> To: "user@mesos.apache.org" <us...@mesos.apache.org>
>> Subject: Framework Starvation
>>
>>   Hi,
>>
>>  Following Ben’s suggestion at the Seattle Spark Meetup in April, I
>> built and deployed the 0-18.1-rc1 branch hoping that this would solve the
>> framework starvation problem we have been seeing in the past 2 months now.
>> The hope was that https://issues.apache.org/jira/browse/MESOS-1086 would
>> also help us. Unfortunately it did not.
>> This bug is preventing us from running multiple spark and shark servers (http,
>> thrift), in load balanced fashion, Hadoop and Aurora in the same mesos
>> cluster.
>>
>>  For example, if we start at least 3 frameworks, one Hadoop, one
>> SparkJobServer (one Spark context in fine-grained mode) and one Http
>> SharkServer (one JavaSharkContext that inherits from Spark Contexts, again
>> in fine-grained mode) and we run queries on all three of them, very soon we
>> notice the following behavior:
>>
>>
>>    - only the last two frameworks that we run queries against receive
>>    resource offers (master.cpp log entries in the log/mesos-master.INFO)
>>    - the other frameworks are ignored and not allocated any resources
>>    until we kill one of the two privileged ones above
>>    - As soon as one of the privileged frameworks is terminated, one of
>>    the starved frameworks takes its place
>>    - Any new Spark context created in coarse-grained mode (fixed number
>>    of cores) will generally receive offers immediately (rarely it gets
>>    starved)
>>    - Hadoop behaves slightly differently when starved: task trackers are
>>    started but never released, which means, if the first job (Hive query) is
>>    small in terms of number of input splits, only one task tracker with a
>>    small number of allocated cores is created, and then all subsequent queries,
>>    regardless of size are only run in very limited mode with this one “small”
>>    task tracker. Most of the time only the map phase of a big query is
>>    completed while the reduce phase is hanging. Killing one of the registered
>>    Spark contexts above releases resources for Mesos to complete the query and
>>    gracefully shut down the task trackers (as noticed in the master log)
>>
>> We are using the default settings in terms of isolation, weights etc …
>> the only stand out configuration would be the memory allocation for slave
>> (export MESOS_resources=mem:35840 in mesos-slave-env.sh) but I am not sure
>> if this is ever enforced, as each framework has its own executor process
>> (JVM in our case) with its own memory allocation (we are not using cgroups
>> yet)
>>
>>  A very easy way to reproduce this bug is to start a minimum of 3 shark-cli
>> instances in a mesos cluster and notice that only two of them are being
>> offered resources and are running queries successfully.
>> I spent quite a bit of time in mesos, spark and hadoop-mesos code in an
>> attempt to find a possible workaround  but no luck so far.
>>
>>  Any guidance would be very appreciated.
>>
>>  Thank you,
>> Claudiu
>>
>>
>>
>

Re: Framework Starvation

Posted by Claudiu Barbura <Cl...@atigeo.com>.
Hi Vinod,

I tried to attach the logs (2MB) and the email (see below) did not go through. I emailed your gmail account separately.

Thanks,
Claudiu

From: Claudiu Barbura <cl...@atigeo.com>>
Date: Monday, June 2, 2014 at 10:00 AM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hi Vinod,

I attached the master log snapshots during starvation and after starvation.

There are 4 slave nodes and 1 master, all of the same ec2 instance type (cc2.8xlarge, 32 cores, 60GB RAM).
I am running 4 shark-cli instances from the same master node, and running queries on all 4 of them … then “starvation” kicks in (see attached log_during_starvation file).
After I terminate 2 of the shark-cli instances, the starved ones are receiving offers and are able to run queries again (see attached log_after_starvation file).

Let me know if you need the slave logs.

Thank you!
Claudiu

From: Vinod Kone <vi...@gmail.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Friday, May 30, 2014 at 10:13 AM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Re: Framework Starvation

Hey Claudiu,

Mind posting some master logs with the simple setup that you described (3 shark cli instances)? That would help us better diagnose the problem.


On Fri, May 30, 2014 at 1:59 AM, Claudiu Barbura <Cl...@atigeo.com>> wrote:
This is a critical issue for us as we have to shut down frameworks for various components in our platform to work and this has created more contention than before we deployed Mesos, when everyone had to wait in line for their MR/Hive jobs to run.

Any guidance, ideas would be extremely helpful at this point.

Thank you,
Claudiu

From: Claudiu Barbura <cl...@atigeo.com>>
Reply-To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Date: Tuesday, May 27, 2014 at 11:57 PM
To: "user@mesos.apache.org<ma...@mesos.apache.org>" <us...@mesos.apache.org>>
Subject: Framework Starvation

Hi,

Following Ben’s suggestion at the Seattle Spark Meetup in April, I built and deployed the 0-18.1-rc1 branch hoping that this would solve the framework starvation problem we have been seeing in the past 2 months now. The hope was that https://issues.apache.org/jira/browse/MESOS-1086 would also help us. Unfortunately it did not.
This bug is preventing us from running multiple spark and shark servers (http, thrift), in load balanced fashion, Hadoop and Aurora in the same mesos cluster.

For example, if we start at least 3 frameworks, one Hadoop, one SparkJobServer (one Spark context in fine-grained mode) and one Http SharkServer (one JavaSharkContext that inherits from Spark Contexts, again in fine-grained mode) and we run queries on all three of them, very soon we notice the following behavior:


  *   only the last two frameworks that we run queries against receive resource offers (master.cpp log entries in the log/mesos-master.INFO)
  *   the other frameworks are ignored and not allocated any resources until we kill one of the two privileged ones above
  *   As soon as one of the privileged frameworks is terminated, one of the starved frameworks takes its place
  *   Any new Spark context created in coarse-grained mode (fixed number of cores) will generally receive offers immediately (rarely it gets starved)
  *   Hadoop behaves slightly differently when starved: task trackers are started but never released, which means, if the first job (Hive query) is small in terms of number of input splits, only one task tracker with a small number of allocated cores is created, and then all subsequent queries, regardless of size are only run in very limited mode with this one “small” task tracker. Most of the time only the map phase of a big query is completed while the reduce phase is hanging. Killing one of the registered Spark contexts above releases resources for Mesos to complete the query and gracefully shut down the task trackers (as noticed in the master log)

We are using the default settings in terms of isolation, weights etc … the only stand out configuration would be the memory allocation for slave (export MESOS_resources=mem:35840 in mesos-slave-env.sh) but I am not sure if this is ever enforced, as each framework has its own executor process (JVM in our case) with its own memory allocation (we are not using cgroups yet)

A very easy way to reproduce this bug is to start a minimum of 3 shark-cli instances in a mesos cluster and notice that only two of them are being offered resources and are running queries successfully.
I spent quite a bit of time in mesos, spark and hadoop-mesos code in an attempt to find a possible workaround  but no luck so far.

Any guidance would be very appreciated.

Thank you,
Claudiu