Posted to dev@mesos.apache.org by Dario Rexin <dr...@apple.com> on 2016/07/06 21:08:03 UTC

MESOS-4694

Hi all,

I would like to revive https://issues.apache.org/jira/browse/MESOS-4694, especially https://reviews.apache.org/r/43666/. We heavily depend on this patch and would love to see it merged. To show the value of this patch, I ran the benchmark from https://reviews.apache.org/r/49616/, first on HEAD and then with the aforementioned patch applied. I took some lines out to make it easier to see the changes over time in the patched version and to keep this email shorter ;). I would love to get some feedback and discuss any necessary changes to get this patch merged.

Here are the results:

Mesos HEAD:

Using 2000 agents and 200 frameworks
round 0 allocate took 3.064665secs to make 199 offers
round 1 allocate took 3.029418secs to make 198 offers
round 2 allocate took 3.091427secs to make 197 offers
round 3 allocate took 2.955457secs to make 196 offers
round 4 allocate took 3.133789secs to make 195 offers
[...]
round 50 allocate took 3.109859secs to make 149 offers
round 51 allocate took 3.062746secs to make 148 offers
round 52 allocate took 3.146043secs to make 147 offers
round 53 allocate took 3.042948secs to make 146 offers
round 54 allocate took 3.097835secs to make 145 offers
[...]
round 100 allocate took 3.027475secs to make 99 offers
round 101 allocate took 3.021641secs to make 98 offers
round 102 allocate took 2.9853secs to make 97 offers
round 103 allocate took 3.145925secs to make 96 offers
round 104 allocate took 2.99094secs to make 95 offers
[...]
round 150 allocate took 3.080406secs to make 49 offers
round 151 allocate took 3.109412secs to make 48 offers
round 152 allocate took 2.992129secs to make 47 offers
round 153 allocate took 3.405642secs to make 46 offers
round 154 allocate took 4.153354secs to make 45 offers
[...]
round 195 allocate took 3.10015secs to make 4 offers
round 196 allocate took 3.029347secs to make 3 offers
round 197 allocate took 2.982825secs to make 2 offers
round 198 allocate took 2.934595secs to make 1 offers
round 199 allocate took 313212us to make 0 offers

Mesos HEAD + allocator patch:

Using 2000 agents and 200 frameworks
round 0 allocate took 3.248205secs to make 199 offers
round 1 allocate took 3.170852secs to make 198 offers
round 2 allocate took 3.135146secs to make 197 offers
round 3 allocate took 3.143857secs to make 196 offers
round 4 allocate took 3.127641secs to make 195 offers
[...]
round 50 allocate took 2.492077secs to make 149 offers
round 51 allocate took 2.435054secs to make 148 offers
round 52 allocate took 2.472204secs to make 147 offers
round 53 allocate took 2.457228secs to make 146 offers
round 54 allocate took 2.413916secs to make 145 offers
[...]
round 100 allocate took 1.645015secs to make 99 offers
round 101 allocate took 1.647373secs to make 98 offers
round 102 allocate took 1.619147secs to make 97 offers
round 103 allocate took 1.625496secs to make 96 offers
round 104 allocate took 1.580513secs to make 95 offers
[...]
round 150 allocate took 1.064716secs to make 49 offers
round 151 allocate took 1.065604secs to make 48 offers
round 152 allocate took 1.053049secs to make 47 offers
round 153 allocate took 1.041333secs to make 46 offers
round 154 allocate took 1.0461secs to make 45 offers
[...]
round 195 allocate took 569640us to make 4 offers
round 196 allocate took 562107us to make 3 offers
round 197 allocate took 547632us to make 2 offers
round 198 allocate took 530765us to make 1 offers
round 199 allocate took 24426us to make 0 offers

--
 Dario

Re: MESOS-4694

Posted by Dario Rexin <dr...@apple.com>.

Hi Alex,

right, I didn't think about the other cases. Currently I would say we have 3 orders of magnitude more frameworks than roles.

> On Jul 18, 2016, at 2:31 AM, Alex Rukletsov <al...@mesosphere.com> wrote:
> 
> Dario,
> 
> but this is true only for framework sorters, right? The total kept in the
> role sorter is changed not on allocations, but when an agent joins or
> leaves the cluster. Maintaining a priority queue for roles can make sense,
> but may decrease the performance for framework sorters.
> 
> What is the ratio of frameworks to roles in your clusters?
> 
>> On Fri, Jul 8, 2016 at 6:37 PM, Dario Rexin <dr...@apple.com> wrote:
>> 
>> Hi Alex,
>> 
>> thanks for your input. We originally thought about that, too, but the
>> problem is that every time resources are allocated to a framework, this
>> method will be called:
>> 
>> void DRFSorter::add(const SlaveID& slaveId, const Resources& resources)
>> 
>> It will add the passed resources to the total resources of the sorter and
>> therefore invalidate the whole sorting (i.e. set dirty=true). So we would
>> still have to actually sort the frameworks almost every time. In fact,
>> frameworks are already kept sorted as long as possible, it’s just not
>> possible to keep them sorted for very long because of the call to said
>> function ;).
>> 
>> --
>>  Dario
>> 
>>> On Jul 8, 2016, at 6:50 AM, Alex Rukletsov <al...@mesosphere.com> wrote:
>>> 
>>> I was not involved in the conversations around this issue, so maybe you
>>> have discussed this already (in that case, is the outcome of the
>>> discussion documented somewhere?).
>>> 
>>> Though the patch seems good to me, it assumes that frameworks SUPPRESS
>>> when they don't need offers. This is not always the case. Since we now
>>> have a real-world use case with ~6k frameworks, the "right thing to do"
>>> seems to be to maintain a heap of roles, and of frameworks within each
>>> role, and avoid sorting.
>>> 
>>>> On Thu, Jul 7, 2016 at 7:20 PM, Dario Rexin <dr...@apple.com> wrote:
>>>> 
>>>> A bit more context:
>>>> 
>>>> We have a very high number of frameworks on our clusters, in some cases
>>>> ~6k. The biggest problem is the sort method, which has a complexity of
>>>> O(n log n) and is called n*m times, where n = number of agents and m =
>>>> number of roles. So in total we have a complexity of O(n^3 log n). I
>>>> think reducing n is the most promising optimization here. We have been
>>>> running this patch in production for quite a while now and have seen
>>>> huge improvements in general allocation time and also in failover times.
>>>> 
>>>> Also, if we were to add a parameterized version of SUPPRESS, what
>>>> problems do you see with just differentiating between the two cases?
>>>> 
>>>> Thanks,
>>>> --
>>>>  Dario
>>>> 
>>>>> On Jul 7, 2016, at 8:40 AM, Dario Rexin <dr...@apple.com> wrote:
>>>>> 
>>>>> Hi Joris,
>>>>> 
>>>>> I still don't really understand why we would parameterize SUPPRESS; to
>>>>> me that sounds like a case for filters. The idea of SUPPRESS was to
>>>>> completely stop getting offers.
>>>>> 
>>>>> Could you please explain why you think the patch is a hack? To me it
>>>>> just seems logical to not sort frameworks that don't need to be
>>>>> considered in the allocator.
>>>>> 
>>>>> Thanks,
>>>>> Dario
>>>>> 
>>>>>> On 07.07.2016, at 7:38 AM, Joris Van Remoortere <jo...@mesosphere.io> wrote:
>>>>>> 
>>>>>> The reason that SUPPRESS doesn't just deactivate is that the intent
>>>>>> was to be able to parameterize this call. At that point the change
>>>>>> wouldn't work without turning this into 2 cases.
>>>>>> 
>>>>>> I have asked to look at what a parameterized suppress would look like
>>>>>> and understand the performance impact of that before we do this.
>>>>>> Have we reached consensus that there's no way to implement a generic
>>>>>> parameterized suppress that is performant?
>>>>>> 
>>>>>> There are some refactorings that we had discussed with James, Jacob,
>>>>>> and Ian that seem like lower-hanging fruit. After those are made it
>>>>>> might be worth reconsidering whether we need to do this hack.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> —
>>>>>> *Joris Van Remoortere*
>>>>>> Mesosphere
>>>>>> 
>>>>>>> On Thu, Jul 7, 2016 at 10:15 AM, Guangya Liu <gy...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi Ben and Dario,
>>>>>>> 
>>>>>>> The reasons that we have the "SUPPRESS" call are as follows:
>>>>>>> 1) Act as the complement to the current REVIVE call.
>>>>>>> 2) The HTTP API does not have an API to "Deactivate" a framework; we
>>>>>>> want to use "SUPPRESS", "DECLINE" and "DECLINE_INVERSE_OFFERS" to
>>>>>>> implement the call for "DeactivateFrameworkMessage".
>>>>>>> 
>>>>>>> You can also refer to https://issues.apache.org/jira/browse/MESOS-3037
>>>>>>> for details.
>>>>>>> 
>>>>>>> So I think that Dario's patch is good: we should remove the framework
>>>>>>> clients on "SUPPRESS" and add them back on "REVIVE", so that the
>>>>>>> sorter ignores those frameworks.
>>>>>>> 
>>>>>>> @Vinod, any comments on this?
>>>>>>> 
>>>>>>> @Ben, regarding your concern that the benchmark test result is not easy
>>>>>>> to understand, I have filed a JIRA ticket here
>>>>>>> https://issues.apache.org/jira/browse/MESOS-5800 to track it.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Guangya
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Thu, Jul 7, 2016 at 6:01 AM, Dario Rexin <dr...@apple.com> wrote:
>>>>>>>> 
>>>>>>>> Hi Vinod,
>>>>>>>> 
>>>>>>>> thanks for your reply. The reason it’s so much faster is that the
>>>>>>>> sorting is a lot faster with fewer frameworks. Looping shouldn’t make
>>>>>>>> a huge difference, as it used to just skip over the deactivated
>>>>>>>> frameworks.
>>>>>>>> 
>>>>>>>> I don’t know what effects deactivating the framework in the master
>>>>>>>> would have. The framework is still active and listening for events /
>>>>>>>> sending calls. Could you please elaborate?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> --
>>>>>>>>  Dario
>>>>>>>> 
>>>>>>>> On Jul 6, 2016, at 2:56 PM, Benjamin Mahler <bm...@apache.org> wrote:
>>>>>>>> 
>>>>>>>> +implementer and shepherd of SUPPRESS
>>>>>>>> 
>>>>>>>> Is there any reason we didn't already just "deactivate" frameworks
>>>>>>>> that were suppressing offers? That seems to be the natural
>>>>>>>> implementation, performance aside, because the meaning of
>>>>>>>> "deactivated" is: not being sent any offers. The patch you posted
>>>>>>>> seems to only take this half-way: suppress = deactivation in the
>>>>>>>> allocator, but not in the master.
>>>>>>>> 
>>>>>>>> Also, Dario, it's a bit hard to interpret these numbers without
>>>>>>>> reading the benchmark code. My interpretation of these numbers is
>>>>>>>> that this change makes the allocation loop complete more quickly when
>>>>>>>> there are many frameworks that are in the suppressed state, because
>>>>>>>> we have to loop over fewer clients. Is this an accurate interpretation?
>>>>>>>> 

Re: MESOS-4694

Posted by Alex Rukletsov <al...@mesosphere.com>.

Dario,

but this is true only for framework sorters, right? The total kept in the
role sorter is changed not on allocations, but when an agent joins or
leaves the cluster. Maintaining a priority queue for roles can make sense,
but may decrease the performance for framework sorters.
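
A rough sketch of the priority-queue idea, under stated assumptions: this is not Mesos code, and the class name `OrderedClients`, its methods, and the use of a single precomputed share per client are all hypothetical. Clients are kept in an ordered set, so an update costs O(log n) and the lowest-share client can be read without a full sort. The caller is assumed to supply each client's share.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical sketch, not Mesos code: clients kept permanently in share
// order so that no full sort is needed to find the next client to serve.
class OrderedClients {
public:
  // Inserts a client or updates its share in O(log n).
  void update(const std::string& name, double share) {
    assert(share >= 0.0);  // Also rejects NaN, which would break ordering.
    auto it = shares.find(name);
    if (it != shares.end()) {
      // Remove the stale (share, name) key before inserting the new one.
      ordered.erase(std::make_pair(it->second, name));
      it->second = share;
    } else {
      shares.emplace(name, share);
    }
    ordered.emplace(share, name);
  }

  // Removes a client entirely; a no-op if the client is unknown.
  void remove(const std::string& name) {
    auto it = shares.find(name);
    if (it == shares.end()) return;
    ordered.erase(std::make_pair(it->second, name));
    shares.erase(it);
  }

  // Name of the client with the lowest share; requires !empty().
  const std::string& lowest() const {
    assert(!ordered.empty());
    return ordered.begin()->second;
  }

  bool empty() const { return ordered.empty(); }

private:
  std::set<std::pair<double, std::string>> ordered;  // (share, name)
  std::unordered_map<std::string, double> shares;    // name -> share
};
```

The trade-off in this sketch is that every share change pays for an erase and an insert, so a structure like this only helps when individual shares change far less often than the order is read.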

What is the ratio of frameworks to roles in your clusters?

On Fri, Jul 8, 2016 at 6:37 PM, Dario Rexin <dr...@apple.com> wrote:

> Hi Alex,
>
> thanks for your input. We originally thought about that, too, but the
> problem is that every time resources are allocated to a framework, this
> method will be called:
>
> void DRFSorter::add(const SlaveID& slaveId, const Resources& resources)
>
> It will add the passed resources to the total resources of the sorter and
> therefore invalidate the whole sorting (i.e. set dirty=true). So we would
> still have to actually sort the frameworks almost every time. In fact,
> frameworks are already kept sorted as long as possible, it’s just not
> possible to keep them sorted for very long because of the call to said
> function ;).
>
> --
>  Dario

Re: MESOS-4694

Posted by Dario Rexin <dr...@apple.com>.

Hi Alex,

thanks for your input. We originally thought about that, too, but the problem is that every time resources are allocated to a framework, this method will be called:

void DRFSorter::add(const SlaveID& slaveId, const Resources& resources)

It will add the passed resources to the total resources of the sorter and therefore invalidate the whole sorting (i.e. set dirty=true). So we would still have to actually sort the frameworks almost every time. In fact, frameworks are already kept sorted as long as possible, it’s just not possible to keep them sorted for very long because of the call to said function ;).
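
A minimal sketch of the behaviour described above, under simplifying assumptions. It is not the actual Mesos `DRFSorter`: the class name `SketchSorter`, its method names, the `sortCount` counter, and the reduction of resources to a single scalar are all hypothetical. It only shows how a cached order guarded by a dirty flag is thrown away whenever the total changes.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a DRF-style sorter; not Mesos code.
// Resources are reduced to a single scalar (e.g. CPUs).
class SketchSorter {
public:
  // Registers a client (framework) with no allocation yet.
  void addClient(const std::string& name) {
    allocations[name] = 0.0;
    dirty = true;
  }

  // Grows the total resource pool. Shares are allocation / total, so the
  // cached order is invalidated unconditionally. (With one scalar resource
  // the relative order would not actually change; the flag is set anyway
  // to mirror the behaviour described in the thread.)
  void add(double resources) {
    assert(resources >= 0.0);
    total += resources;
    dirty = true;
  }

  // Records resources allocated to a known client.
  void allocated(const std::string& name, double resources) {
    allocations.at(name) += resources;
    dirty = true;
  }

  // Returns clients by ascending share, re-sorting only when dirty.
  const std::vector<std::string>& sort() {
    if (dirty) {
      order.clear();
      for (const auto& entry : allocations) order.push_back(entry.first);
      std::stable_sort(order.begin(), order.end(),
          [this](const std::string& a, const std::string& b) {
            return share(a) < share(b);
          });
      dirty = false;
      ++sortCount;
    }
    return order;
  }

  int sortCount = 0;  // Number of real sorts performed, for illustration.

private:
  double share(const std::string& name) const {
    return total > 0.0 ? allocations.at(name) / total : 0.0;
  }

  std::map<std::string, double> allocations;  // name -> allocated amount
  std::vector<std::string> order;             // cached sort result
  double total = 0.0;
  bool dirty = false;
};
```

In this sketch, two consecutive `sort()` calls cost one real sort, but a single `add()` between them forces a second one, which is the pattern that makes the cached order short-lived.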

--
 Dario

> On Jul 8, 2016, at 6:50 AM, Alex Rukletsov <al...@mesosphere.com> wrote:
> 
> I was not involved in the conversations around this issue, so maybe you have
> discussed this already (in that case, is the outcome of the discussion
> documented somewhere?).
> 
> Though the patch seems good to me, it assumes that frameworks SUPPRESS when
> they don't need offers. This is not always the case. Since we now have a
> real-world use case with ~6k frameworks, the "right thing to do" seems to be
> to maintain a heap of roles, and of frameworks within each role, and avoid
> sorting.
> 
> On Thu, Jul 7, 2016 at 7:20 PM, Dario Rexin <dr...@apple.com> wrote:
> 
>> A bit more context:
>> 
>> We have a very high number of frameworks on our clusters. In some cases
>> ~6k. The biggest problem is the sort method, which has a complexity of O(n
>> log n) and is called n*m times, where n = number of agents and m = number
>> of roles. So in total we have a complexity of O(n^3 log n). I think
>> reducing n is the most promising optimization here. We have been running
>> this patch in production for quite a while now and have seen huge
>> improvements in general allocation time and also in failover times.
>> 
>> Also, if we were to add a parameterized version of SUPPRESS, what problems
>> do you see with just differentiating between the two cases?
>> 
>> Thanks,
>> --
>>  Dario
>> 
>>> On Jul 7, 2016, at 8:40 AM, Dario Rexin <dr...@apple.com> wrote:
>>> 
>>> Hi Joris,
>>> 
>>> I still don't really understand why we would parameterize SUPPRESS, to
>> me that sounds like a case for filters. The idea of SUPPRESS was to
>> completely stop getting offers.
>>> 
>>> Could you please explain why you think the patch is a hack? To me it
>> just seems logical to not sort frameworks that don't need to be considered
>> in the allocator.
>>> 
>>> Thanks,
>>> Dario
>>> 
>>>> On 07.07.2016, at 7:38 AM, Joris Van Remoortere <jo...@mesosphere.io>
>> wrote:
>>>> 
>>>> The reason that SUPPRESS doesn't just deactivate is because the intent
>> was
>>>> to be able to parameterize this call. At that point the change wouldn't
>>>> work without turning this in to 2 cases.
>>>> 
>>>> I have asked to look at what a parameterized suppress would like and
>>>> understand the performance impact of that before we do this.
>>>> Have we reached consensus that there's no way to implement a generic
>>>> parameterized suppress that is performant?
>>>> 
>>>> There are some refactorings that we had discussed with James, Jacob, and
>>>> Ian that seem like lower hanging fruit. After those are made it might be
>>>> worth reconsidering whether we need to do this hack.
>>>> 
>>>> 
>>>> 
>>>> —
>>>> *Joris Van Remoortere*
>>>> Mesosphere
>>>> 
>>>>> On Thu, Jul 7, 2016 at 10:15 AM, Guangya Liu <gy...@gmail.com>
>> wrote:
>>>>> 
>>>>> Hi Ben and Dario,
>>>>> 
>>>>> The reason that we have "SUPPRESS" call is as following:
>>>>> 1) Act as the complement to the current REVIVE call.
>>>>> 2) The HTTP API do not have an API to "Deactivate" a framework, we
>> want to
>>>>> use "SUPPRESS", "DECLINE" and "DECLINE_INVERSE_OFFERS" to implement the
>>>>> call for "DeactivateFrameworkMessage".
>>>>> 
>>>>> You can also refer to https://issues.apache.org/jira/browse/MESOS-3037
>> for
>>>>> detail.
>>>>> 
>>>>> So I think that Dario's patch is good, we should remove the framework
>>>>> clients when "SUPPRESS" and add the framework client back when
>> "REVIVE". to
>>>>> ignore those frameworks from sorter.
>>>>> 
>>>>> @Viond, any comments for this?
>>>>> 
>>>>> @Ben, for your concern of the benchmark test result is not easy to
>>>>> understand, I have filed a JIRA ticket here
>>>>> https://issues.apache.org/jira/browse/MESOS-5800 to trace.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Guangya
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Thu, Jul 7, 2016 at 6:01 AM, Dario Rexin <dr...@apple.com> wrote:
>>>>>> 
>>>>>> Hi Vinod,
>>>>>> 
>>>>>> thanks for your reply. The reason it’s so much faster is because the
>>>>>> sorting is a lot faster with fewer frameworks. Looping shouldn’t make
>> a
>>>>>> huge difference, as it used to just skip over the deactivated
>> frameworks.
>>>>>> 
>>>>>> I don’t know what effects deactivating the framework in the master
>> would
>>>>>> have. The framework is still active and listening for events / sending
>>>>>> calls. Could you please elaborate?
>>>>>> 
>>>>>> Thanks,
>>>>>> --
>>>>>>  Dario
>>>>>> 
>>>>>> On Jul 6, 2016, at 2:56 PM, Benjamin Mahler <bm...@apache.org> wrote:
>>>>>> 
>>>>>> +implementer and shepherd of SUPPRESS
>>>>>> 
>>>>>> Is there any reason we didn't already just "deactivate" frameworks that
>>>>>> were suppressing offers? That seems to be the natural implementation,
>>>>>> performance aside, because the meaning of "deactivated" is: not being sent
>>>>>> any offers. The patch you posted seems to only take this half-way: suppress
>>>>>> = deactivation in the allocator, but not in the master.
>>>>>> 
>>>>>> Also, Dario, it's a bit hard to interpret these numbers without reading the
>>>>>> benchmark code. My interpretation of these numbers is that this change
>>>>>> makes the allocation loop complete more quickly when there are many
>>>>>> frameworks that are in the suppressed state, because we have to loop over
>>>>>> fewer clients. Is this an accurate interpretation?
>>>>>> 

Re: MESOS-4694

Posted by Alex Rukletsov <al...@mesosphere.com>.

I was not involved in the conversations around this issue, so maybe you have
discussed this already (in that case, is the outcome of the discussion
documented somewhere?).

Though the patch seems good to me, it assumes that frameworks SUPPRESS when
they don't need offers. This is not always the case. Since we now have a
real-world use case with ~6k frameworks, the "right thing to do" seems to be
to maintain a heap of roles, and of frameworks within each role, and avoid
sorting.
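
To make the heap suggestion concrete, here is a minimal sketch in Python (not
Mesos code; the allocator itself is C++). It assumes each framework has a
single numeric share and that the framework with the lowest share should be
picked next. The names ShareHeap, update, remove and pop_lowest are
hypothetical. Instead of re-sorting all frameworks before every pick, it keeps
them in a binary heap and lazily discards stale entries when a share changes.

```python
import heapq
import itertools


class ShareHeap:
    """Min-heap of frameworks keyed by share, with lazy invalidation."""

    def __init__(self):
        self._heap = []                # entries: [share, seq, framework_id]
        self._live = {}                # framework_id -> its current entry
        self._seq = itertools.count()  # unique tie-breaker for equal shares

    def update(self, framework_id, share):
        """Insert a framework or change its share in O(log n)."""
        old = self._live.get(framework_id)
        if old is not None:
            old[2] = None              # mark the previous entry as stale
        entry = [share, next(self._seq), framework_id]
        self._live[framework_id] = entry
        heapq.heappush(self._heap, entry)

    def remove(self, framework_id):
        """Drop a framework, e.g. one that no longer wants offers."""
        entry = self._live.pop(framework_id, None)
        if entry is not None:
            entry[2] = None

    def pop_lowest(self):
        """Return (framework_id, share) with the smallest share, or None."""
        while self._heap:
            share, _, framework_id = heapq.heappop(self._heap)
            if framework_id is not None:   # skip stale entries
                del self._live[framework_id]
                return framework_id, share
        return None
```

In a real allocator the shares change as resources are handed out, so each
allocation would call update for the affected framework rather than trigger a
full sort.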

On Thu, Jul 7, 2016 at 7:20 PM, Dario Rexin <dr...@apple.com> wrote:

> A bit more context:
>
> We have a very high number of frameworks on our clusters. In some cases
> ~6k. The biggest problem is the sort method, which has a complexity of
> O(f log f) for f frameworks and is called a*r times, where a = number of
> agents and r = number of roles. So in total we have a complexity of
> O(a * r * f log f), i.e. O(n^3 log n) if all three grow together. I think
> reducing f is the most promising optimization here. We have been running
> this patch in production for quite a while now and have seen huge
> improvements in general allocation time and also in failover times.
>
> Also, if we were to add a parameterized version of SUPPRESS, what problems
> do you see with just differentiating between the two cases?
>
> Thanks,
> --
>  Dario
>
> > On Jul 7, 2016, at 8:40 AM, Dario Rexin <dr...@apple.com> wrote:
> >
> > Hi Joris,
> >
> > I still don't really understand why we would parameterize SUPPRESS; to
> > me that sounds like a case for filters. The idea of SUPPRESS was to
> > completely stop getting offers.
> >
> > Could you please explain why you think the patch is a hack? To me it
> > just seems logical to not sort frameworks that don't need to be considered
> > in the allocator.
> >
> > Thanks,
> > Dario
> >
> >> On 07.07.2016, at 7:38 AM, Joris Van Remoortere <jo...@mesosphere.io> wrote:
> >>
> >> The reason that SUPPRESS doesn't just deactivate is that the intent was
> >> to be able to parameterize this call. At that point the change wouldn't
> >> work without turning this into two cases.
> >>
> >> I have asked to look at what a parameterized suppress would look like and
> >> to understand the performance impact of that before we do this.
> >> Have we reached consensus that there's no way to implement a generic
> >> parameterized suppress that is performant?
> >>
> >> There are some refactorings that we had discussed with James, Jacob, and
> >> Ian that seem like lower-hanging fruit. After those are made it might be
> >> worth reconsidering whether we need to do this hack.
> >>
> >>
> >>
> >> —
> >> *Joris Van Remoortere*
> >> Mesosphere
> >>

Re: MESOS-4694

Posted by Dario Rexin <dr...@apple.com>.

Hi Joris,

that’s great news, thanks! I will add a comment and ping you later.

--
 Dario

> On Jul 7, 2016, at 10:57 AM, Joris Van Remoortere <jo...@mesosphere.io> wrote:
> 
> After syncing with Vinod, we're ok adding this change in the interim. We do want a clear comment in the implementation of suppress explaining that this is a special case and that we will need separate handling if this call becomes parameterized in the future.
> 
> Let me know (ping in Mesos Slack?) when you feel the patch has been updated with a sufficient explanation, and I'll schedule time to review it.
> 
> Joris
> 
> — 
> Joris Van Remoortere
> Mesosphere
> 

Re: MESOS-4694

Posted by Joris Van Remoortere <jo...@mesosphere.io>.

After syncing with Vinod, we're ok adding this change in the interim. We do
want a clear comment in the implementation of suppress explaining that this
is a special case and that we will need separate handling if this call
becomes parameterized in the future.
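
As an illustration of the special case described above, the following Python
sketch models an allocator that takes a suppressed framework out of the set it
sorts and puts it back on revive. It is not the Mesos implementation (which is
C++); ToyAllocator and its methods are hypothetical names, and the share
values in any usage are made up.

```python
class ToyAllocator:
    """Toy model: only non-suppressed frameworks take part in sorting."""

    def __init__(self):
        self.shares = {}     # framework_id -> share, for all known frameworks
        self.active = set()  # frameworks that are currently sorted

    def add_framework(self, framework_id, share=0.0):
        self.shares[framework_id] = share
        self.active.add(framework_id)

    def suppress(self, framework_id):
        # Special case: SUPPRESS currently means "no offers at all", so the
        # framework can be dropped from sorting entirely. A parameterized
        # SUPPRESS would need separate handling here.
        self.active.discard(framework_id)

    def revive(self, framework_id):
        # REVIVE puts a known framework back into the sorted set.
        if framework_id in self.shares:
            self.active.add(framework_id)

    def offer_order(self):
        # The sorting cost depends on the number of active frameworks only.
        return sorted(self.active, key=lambda f: (self.shares[f], f))
```

With three frameworks registered and one suppressed, offer_order sorts only
the remaining two; reviving the third restores it to the ordering.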

Let me know (ping in Mesos Slack?) when you feel the patch has been updated
with a sufficient explanation, and I'll schedule time to review it.

Joris

—
*Joris Van Remoortere*
Mesosphere

> >>>>
> >>>> --
> >>>>  Dario
> >>>
>
>

Re: MESOS-4694

Posted by Dario Rexin <dr...@apple.com>.

A bit more context:

We have a very high number of frameworks on our clusters, in some cases ~6k. The biggest problem is the sort method, which has a complexity of O(f log f) for f frameworks and is called n*m times, where n = number of agents and m = number of roles. So in total we have a complexity of O(n * m * f log f), which approaches O(n^3 log n) when all three counts are of similar size. I think reducing the number of frameworks that get sorted is the most promising optimization here. We have been running this patch in production for quite a while now and have seen huge improvements in general allocation time and also in failover times.
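
A rough way to see the effect of this cost model is sketched below. It is not taken from the patch or the benchmark: the function names are made up, the model simply charges one comparison sort of the frameworks per (agent, role) pair, and the single role and the 600 unsuppressed frameworks are assumed numbers chosen only for illustration.

```python
import math


def sort_cost(frameworks):
    """Approximate comparisons for one comparison sort of `frameworks` items."""
    return frameworks * math.log2(frameworks) if frameworks > 1 else 0.0


def allocation_cost(agents, roles, sorted_frameworks):
    """One framework sort per (agent, role) pair, as described above."""
    return agents * roles * sort_cost(sorted_frameworks)


# Assumed cluster: 2000 agents, 1 role, ~6k registered frameworks,
# of which only 600 (hypothetical) are not suppressed.
all_frameworks = allocation_cost(2000, 1, 6000)
active_only = allocation_cost(2000, 1, 600)

ratio = all_frameworks / active_only
print(f"estimated reduction in sort work: {ratio:.1f}x")
```

The model ignores everything else the allocator does per agent, so it only bounds the part of the allocation time that is spent sorting.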

Also, if we were to add a parameterized version of SUPPRESS, what problems do you see with just differentiating between the two cases?

Thanks,
--
 Dario

> On Jul 7, 2016, at 8:40 AM, Dario Rexin <dr...@apple.com> wrote:
> 
> Hi Joris,
> 
> I still don't really understand why we would parameterize SUPPRESS; to me that sounds like a case for filters. The idea of SUPPRESS was to completely stop getting offers.
> 
> Could you please explain why you think the patch is a hack? To me it just seems logical to not sort frameworks that don't need to be considered in the allocator.
> 
> Thanks,
> Dario
> 

Re: MESOS-4694

Posted by Dario Rexin <dr...@apple.com>.

Hi Joris,

I still don't really understand why we would parameterize SUPPRESS; to me that sounds like a case for filters. The idea of SUPPRESS was to completely stop getting offers.

Could you please explain why you think the patch is a hack? To me it just seems logical to not sort frameworks that don't need to be considered in the allocator.

Thanks,
Dario

> On 07.07.2016, at 7:38 AM, Joris Van Remoortere <jo...@mesosphere.io> wrote:
> 
> The reason that SUPPRESS doesn't just deactivate is that the intent was
> to be able to parameterize this call. At that point the change wouldn't
> work without turning this into two cases.
> 
> I have asked to look at what a parameterized suppress would look like and
> to understand the performance impact of that before we do this.
> Have we reached consensus that there's no way to implement a generic
> parameterized suppress that is performant?
> 
> There are some refactorings that we had discussed with James, Jacob, and
> Ian that seem like lower-hanging fruit. After those are made it might be
> worth reconsidering whether we need to do this hack.
> 
> 
> 
> —
> *Joris Van Remoortere*
> Mesosphere
> 

Re: MESOS-4694

Posted by Joris Van Remoortere <jo...@mesosphere.io>.

The reason that SUPPRESS doesn't just deactivate is that the intent was
to be able to parameterize this call. At that point the change wouldn't
work without turning this into two cases.

I have asked to look at what a parameterized suppress would look like and
to understand the performance impact of that before we do this.
Have we reached consensus that there's no way to implement a generic
parameterized suppress that is performant?

There are some refactorings that we had discussed with James, Jacob, and
Ian that seem like lower-hanging fruit. After those are made it might be
worth reconsidering whether we need to do this hack.



—
*Joris Van Remoortere*
Mesosphere

On Thu, Jul 7, 2016 at 10:15 AM, Guangya Liu <gy...@gmail.com> wrote:

> Hi Ben and Dario,
>
> The reasons that we have the "SUPPRESS" call are as follows:
> 1) To act as the complement to the current REVIVE call.
> 2) The HTTP API does not have a call to "Deactivate" a framework, so we want
> to use "SUPPRESS", "DECLINE" and "DECLINE_INVERSE_OFFERS" to implement the
> call for "DeactivateFrameworkMessage".
>
> You can also refer to https://issues.apache.org/jira/browse/MESOS-3037 for
> details.
>
> So I think that Dario's patch is good: we should remove the framework
> clients from the sorter on "SUPPRESS" and add them back on "REVIVE", so
> that the sorter ignores suppressed frameworks.
>
> @Vinod, any comments on this?
>
> @Ben, regarding your concern that the benchmark results are not easy to
> understand, I have filed a JIRA ticket,
> https://issues.apache.org/jira/browse/MESOS-5800, to track it.
>
> Thanks,
>
> Guangya
>
>
>

Re: MESOS-4694

Posted by Guangya Liu <gy...@gmail.com>.

Hi Ben and Dario,

The reasons that we have the "SUPPRESS" call are as follows:
1) To act as the complement to the current REVIVE call.
2) The HTTP API does not have a call to "Deactivate" a framework, so we want
to use "SUPPRESS", "DECLINE" and "DECLINE_INVERSE_OFFERS" to implement the
call for "DeactivateFrameworkMessage".

You can also refer to https://issues.apache.org/jira/browse/MESOS-3037 for
details.

So I think that Dario's patch is good: we should remove the framework
clients from the sorter on "SUPPRESS" and add them back on "REVIVE", so
that the sorter ignores suppressed frameworks.
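
The idea can be sketched roughly as follows. This is a toy model, not the Mesos sorter: the class, its method names and the plain share values are all hypothetical, and the only point is that a suppressed client stays known to the sorter but takes no part in sort().

```python
class FrameworkSorter:
    """Toy sorter (hypothetical): only active clients take part in sort()."""

    def __init__(self):
        self._shares = {}     # framework id -> share, for all known clients
        self._active = set()  # ids of clients that should be sorted

    def add(self, framework_id, share=0.0):
        self._shares[framework_id] = share
        self._active.add(framework_id)

    def deactivate(self, framework_id):
        """Called on SUPPRESS: keep the client, but stop sorting it."""
        self._active.discard(framework_id)

    def activate(self, framework_id):
        """Called on REVIVE: sort the client again if it is still known."""
        if framework_id in self._shares:
            self._active.add(framework_id)

    def sort(self):
        # Lowest share first; the id breaks ties so the order is stable.
        return sorted(self._active, key=lambda f: (self._shares[f], f))


sorter = FrameworkSorter()
sorter.add("a", 0.5)
sorter.add("b", 0.1)
sorter.add("c", 0.3)

sorter.deactivate("b")          # "b" sends SUPPRESS
after_suppress = sorter.sort()  # "b" is no longer sorted

sorter.activate("b")            # "b" sends REVIVE
after_revive = sorter.sort()
print(after_suppress, after_revive)
```

Keeping the share of a deactivated client around means REVIVE does not have to rebuild any state; it only puts the client back into the set that gets sorted.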

@Vinod, any comments on this?

@Ben, regarding your concern that the benchmark results are not easy to
understand, I have filed a JIRA ticket,
https://issues.apache.org/jira/browse/MESOS-5800, to track it.

Thanks,

Guangya



On Thu, Jul 7, 2016 at 6:01 AM, Dario Rexin <dr...@apple.com> wrote:

> Hi Vinod,
>
> Thanks for your reply. It’s so much faster because sorting is a lot
> faster with fewer frameworks. Looping shouldn’t make a huge difference,
> as it used to just skip over the deactivated frameworks.
>
> I don’t know what effects deactivating the framework in the master would
> have. The framework is still active and listening for events / sending
> calls. Could you please elaborate?
>
> Thanks,
> --
>  Dario
>

Re: MESOS-4694

Posted by Dario Rexin <dr...@apple.com>.

Hi Vinod,

Thanks for your reply. It’s so much faster because sorting is a lot faster with fewer frameworks. Looping shouldn’t make a huge difference, as it used to just skip over the deactivated frameworks.
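
As a simplified reading of the benchmark output (not the benchmark code itself), the sketch below counts how many frameworks the sorter has to order in each round. It assumes that the number of offers in a round equals the number of unsuppressed frameworks, which matches the "199 offers in round 0" to "0 offers in round 199" pattern; the function name and its parameters are made up for the example.

```python
def frameworks_sorted(total, round_no, remove_suppressed):
    """Frameworks the sorter has to order in a given round.

    Assumes round r leaves total - 1 - r frameworks unsuppressed.
    """
    active = max(total - 1 - round_no, 0)
    # Without the patch, suppressed frameworks are still sorted and only
    # skipped afterwards; with it, they are removed from the sorter.
    return active if remove_suppressed else total


for round_no in (0, 50, 100, 150, 199):
    head = frameworks_sorted(200, round_no, remove_suppressed=False)
    patched = frameworks_sorted(200, round_no, remove_suppressed=True)
    print(f"round {round_no}: HEAD sorts {head}, patched sorts {patched}")
```

Under this assumption the sorter input stays constant on HEAD and shrinks every round with the patch, which is consistent with the flat timings in the first run and the steadily falling timings in the second.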

I don’t know what effects deactivating the framework in the master would have. The framework is still active and listening for events / sending calls. Could you please elaborate?

Thanks,
--
 Dario

> On Jul 6, 2016, at 2:56 PM, Benjamin Mahler <bm...@apache.org> wrote:
> 
> +implementer and shepherd of SUPPRESS
> 
> Is there any reason we didn't already just "deactivate" frameworks that
> were suppressing offers? That seems to be the natural implementation,
> performance aside, because the meaning of "deactivated" is: not being sent
> any offers. The patch you posted seems to only take this half-way: suppress
> = deactivation in the allocator, but not in the master.
> 
> Also, Dario, it's a bit hard to interpret these numbers without reading the
> benchmark code. My interpretation of these numbers is that this change
> makes the allocation loop complete more quickly when there are many
> frameworks that are in the suppressed state, because we have to loop over
> fewer clients. Is this an accurate interpretation?

Re: MESOS-4694

Posted by Benjamin Mahler <bm...@apache.org>.

+implementer and shepherd of SUPPRESS

Is there any reason we didn't already just "deactivate" frameworks that
were suppressing offers? That seems to be the natural implementation,
performance aside, because the meaning of "deactivated" is: not being sent
any offers. The patch you posted seems to only take this half-way: suppress
= deactivation in the allocator, but not in the master.

Also, Dario, it's a bit hard to interpret these numbers without reading the
benchmark code. My interpretation of these numbers is that this change
makes the allocation loop complete more quickly when there are many
frameworks that are in the suppressed state, because we have to loop over
fewer clients. Is this an accurate interpretation?
