You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mesos.apache.org by Meng Zhu <mz...@mesosphere.com> on 2018/03/01 22:56:11 UTC

Tasks may be explicitly dropped by agent in Mesos 1.5

Hi all:

TLDR: In Mesos 1.5, tasks may be explicitly dropped by the agent
if all three conditions are met:
(1) Several `LAUNCH_TASK` or `LAUNCH_GROUP` calls
 use the same executor.
(2) The executor currently does not exist on the agent.
(3) Due to some race conditions, these tasks are trying to launch
on the agent in a different order from their original launch order.

In this case, tasks that are trying to launch on the agent
before the first task in the original order will be explicitly dropped by
the agent (TASK_DROPPED` or `TASK_LOST` will be sent)).

This bug will be fixed in 1.5.1. It is tracked in
https://issues.apache.org/jira/browse/MESOS-8624

----

In https://issues.apache.org/jira/browse/MESOS-1720, we introduced an
ordering dependency between two `LAUNCH`/`LAUNCH_GROUP`
calls to a new executor. The master would specify that the first call is the
one to launch a new executor through the `launch_executor` field in
`RunTaskMessage`/`RunTaskGroupMessage`, and the second one should
use the existing executor launched by the first one.

On the agent side, running a task/task group goes through a series of
continuations, one is `collect()` on the future that unschedule frameworks
from
being GC'ed:
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2158
another is `collect()` on task authorization:
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2333
Since these `collect()` calls run on individual actors, the futures of the
`collect()` calls for two `LAUNCH`/`LAUNCH_GROUP` calls may return
out-of-order, even if the futures these two `collect()` wait for are
satisfied in
order (which is true in these two cases).

As a result, under some race conditions (probably under some heavy load
conditions), tasks rely on the previous task to launch executor may
get processed before the task that is supposed to launch the executor
first, resulting in the tasks being explicitly dropped by the agent.

-Meng

Re: Tasks may be explicitly dropped by agent in Mesos 1.5

Posted by Meng Zhu <mz...@mesosphere.com>.

CORRECTION:

This is a new behavior that only appears in the current 1.5.x branch. In
1.5.0, Mesos
agent still has the old behavior, namely, any reordered tasks (to the same
executor)
are launched regardless.

On Fri, Mar 2, 2018 at 9:41 AM, Chun-Hung Hsiao <ch...@mesosphere.io>
wrote:

> Gilbert I think you're right. The code path doesn't exist in 1.5.0.
>
> On Mar 2, 2018 9:36 AM, "Chun-Hung Hsiao" <ch...@mesosphere.io> wrote:
>
> > This is a new behavior we have after solving MESOS-1720, and thus a new
> > problem only in 1.5.x. Prior to 1.5, reordered tasks (to the same
> executor)
> > will be launched because whoever comes first will launch the executor.
> > Since 1.5, one might be dropped.
> >
> > On Mar 1, 2018 4:36 PM, "Gilbert Song" <gi...@mesosphere.io> wrote:
> >
> >> Meng,
> >>
> >> Could you double check if this is really an issue in Mesos 1.5.0
> release?
> >>
> >> MESOS-1720 <https://issues.apache.org/jira/browse/MESOS-1720> was
> >> resolved
> >> after the 1.5 release (rc-2) and it seems like
> >> it is only at the master branch and 1.5.x branch (not 1.5.0).
> >>
> >> Did I miss anything?
> >>
> >> - Gilbert
> >>
> >> On Thu, Mar 1, 2018 at 4:22 PM, Benjamin Mahler <bm...@apache.org>
> >> wrote:
> >>
> >> > Put another way, we currently don't guarantee in-order task delivery
> to
> >> > the executor. Due to the changes for MESOS-1720, one special case of
> >> task
> >> > re-ordering now leads to the re-ordered task being dropped (rather
> than
> >> > delivered out-of-order as before). Technically, this is strictly
> better.
> >> >
> >> > However, we'd like to start guaranteeing in-order task delivery.
> >> >
> >> > On Thu, Mar 1, 2018 at 2:56 PM, Meng Zhu <mz...@mesosphere.com> wrote:
> >> >
> >> >> Hi all:
> >> >>
> >> >> TLDR: In Mesos 1.5, tasks may be explicitly dropped by the agent
> >> >> if all three conditions are met:
> >> >> (1) Several `LAUNCH_TASK` or `LAUNCH_GROUP` calls
> >> >>  use the same executor.
> >> >> (2) The executor currently does not exist on the agent.
> >> >> (3) Due to some race conditions, these tasks are trying to launch
> >> >> on the agent in a different order from their original launch order.
> >> >>
> >> >> In this case, tasks that are trying to launch on the agent
> >> >> before the first task in the original order will be explicitly
> dropped
> >> by
> >> >> the agent (TASK_DROPPED` or `TASK_LOST` will be sent)).
> >> >>
> >> >> This bug will be fixed in 1.5.1. It is tracked in
> >> >> https://issues.apache.org/jira/browse/MESOS-8624
> >> >>
> >> >> ----
> >> >>
> >> >> In https://issues.apache.org/jira/browse/MESOS-1720, we introduced
> an
> >> >> ordering dependency between two `LAUNCH`/`LAUNCH_GROUP`
> >> >> calls to a new executor. The master would specify that the first call
> >> is
> >> >> the
> >> >> one to launch a new executor through the `launch_executor` field in
> >> >> `RunTaskMessage`/`RunTaskGroupMessage`, and the second one should
> >> >> use the existing executor launched by the first one.
> >> >>
> >> >> On the agent side, running a task/task group goes through a series of
> >> >> continuations, one is `collect()` on the future that unschedule
> >> >> frameworks from
> >> >> being GC'ed:
> >> >> https://github.com/apache/mesos/blob/master/src/slave/slave.
> cpp#L2158
> >> >> another is `collect()` on task authorization:
> >> >> https://github.com/apache/mesos/blob/master/src/slave/slave.
> cpp#L2333
> >> >> Since these `collect()` calls run on individual actors, the futures
> of
> >> the
> >> >> `collect()` calls for two `LAUNCH`/`LAUNCH_GROUP` calls may return
> >> >> out-of-order, even if the futures these two `collect()` wait for are
> >> >> satisfied in
> >> >> order (which is true in these two cases).
> >> >>
> >> >> As a result, under some race conditions (probably under some heavy
> load
> >> >> conditions), tasks rely on the previous task to launch executor may
> >> >> get processed before the task that is supposed to launch the executor
> >> >> first, resulting in the tasks being explicitly dropped by the agent.
> >> >>
> >> >> -Meng
> >> >>
> >> >>
> >> >>
> >> >
> >>
> >
>

Re: Tasks may be explicitly dropped by agent in Mesos 1.5

Posted by Meng Zhu <mz...@mesosphere.com>.

CORRECTION:

This is a new behavior that only appears in the current 1.5.x branch. In
1.5.0, Mesos
agent still has the old behavior, namely, any reordered tasks (to the same
executor)
are launched regardless.

On Fri, Mar 2, 2018 at 9:41 AM, Chun-Hung Hsiao <ch...@mesosphere.io>
wrote:

> Gilbert I think you're right. The code path doesn't exist in 1.5.0.
>
> On Mar 2, 2018 9:36 AM, "Chun-Hung Hsiao" <ch...@mesosphere.io> wrote:
>
> > This is a new behavior we have after solving MESOS-1720, and thus a new
> > problem only in 1.5.x. Prior to 1.5, reordered tasks (to the same
> executor)
> > will be launched because whoever comes first will launch the executor.
> > Since 1.5, one might be dropped.
> >
> > On Mar 1, 2018 4:36 PM, "Gilbert Song" <gi...@mesosphere.io> wrote:
> >
> >> Meng,
> >>
> >> Could you double check if this is really an issue in Mesos 1.5.0
> release?
> >>
> >> MESOS-1720 <https://issues.apache.org/jira/browse/MESOS-1720> was
> >> resolved
> >> after the 1.5 release (rc-2) and it seems like
> >> it is only at the master branch and 1.5.x branch (not 1.5.0).
> >>
> >> Did I miss anything?
> >>
> >> - Gilbert
> >>
> >> On Thu, Mar 1, 2018 at 4:22 PM, Benjamin Mahler <bm...@apache.org>
> >> wrote:
> >>
> >> > Put another way, we currently don't guarantee in-order task delivery
> to
> >> > the executor. Due to the changes for MESOS-1720, one special case of
> >> task
> >> > re-ordering now leads to the re-ordered task being dropped (rather
> than
> >> > delivered out-of-order as before). Technically, this is strictly
> better.
> >> >
> >> > However, we'd like to start guaranteeing in-order task delivery.
> >> >
> >> > On Thu, Mar 1, 2018 at 2:56 PM, Meng Zhu <mz...@mesosphere.com> wrote:
> >> >
> >> >> Hi all:
> >> >>
> >> >> TLDR: In Mesos 1.5, tasks may be explicitly dropped by the agent
> >> >> if all three conditions are met:
> >> >> (1) Several `LAUNCH_TASK` or `LAUNCH_GROUP` calls
> >> >>  use the same executor.
> >> >> (2) The executor currently does not exist on the agent.
> >> >> (3) Due to some race conditions, these tasks are trying to launch
> >> >> on the agent in a different order from their original launch order.
> >> >>
> >> >> In this case, tasks that are trying to launch on the agent
> >> >> before the first task in the original order will be explicitly
> dropped
> >> by
> >> >> the agent (TASK_DROPPED` or `TASK_LOST` will be sent)).
> >> >>
> >> >> This bug will be fixed in 1.5.1. It is tracked in
> >> >> https://issues.apache.org/jira/browse/MESOS-8624
> >> >>
> >> >> ----
> >> >>
> >> >> In https://issues.apache.org/jira/browse/MESOS-1720, we introduced
> an
> >> >> ordering dependency between two `LAUNCH`/`LAUNCH_GROUP`
> >> >> calls to a new executor. The master would specify that the first call
> >> is
> >> >> the
> >> >> one to launch a new executor through the `launch_executor` field in
> >> >> `RunTaskMessage`/`RunTaskGroupMessage`, and the second one should
> >> >> use the existing executor launched by the first one.
> >> >>
> >> >> On the agent side, running a task/task group goes through a series of
> >> >> continuations, one is `collect()` on the future that unschedule
> >> >> frameworks from
> >> >> being GC'ed:
> >> >> https://github.com/apache/mesos/blob/master/src/slave/slave.
> cpp#L2158
> >> >> another is `collect()` on task authorization:
> >> >> https://github.com/apache/mesos/blob/master/src/slave/slave.
> cpp#L2333
> >> >> Since these `collect()` calls run on individual actors, the futures
> of
> >> the
> >> >> `collect()` calls for two `LAUNCH`/`LAUNCH_GROUP` calls may return
> >> >> out-of-order, even if the futures these two `collect()` wait for are
> >> >> satisfied in
> >> >> order (which is true in these two cases).
> >> >>
> >> >> As a result, under some race conditions (probably under some heavy
> load
> >> >> conditions), tasks rely on the previous task to launch executor may
> >> >> get processed before the task that is supposed to launch the executor
> >> >> first, resulting in the tasks being explicitly dropped by the agent.
> >> >>
> >> >> -Meng
> >> >>
> >> >>
> >> >>
> >> >
> >>
> >
>

Re: Tasks may be explicitly dropped by agent in Mesos 1.5

Posted by Chun-Hung Hsiao <ch...@mesosphere.io>.

Gilbert I think you're right. The code path doesn't exist in 1.5.0.

On Mar 2, 2018 9:36 AM, "Chun-Hung Hsiao" <ch...@mesosphere.io> wrote:

> This is a new behavior we have after solving MESOS-1720, and thus a new
> problem only in 1.5.x. Prior to 1.5, reordered tasks (to the same executor)
> will be launched because whoever comes first will launch the executor.
> Since 1.5, one might be dropped.
>
> On Mar 1, 2018 4:36 PM, "Gilbert Song" <gi...@mesosphere.io> wrote:
>
>> Meng,
>>
>> Could you double check if this is really an issue in Mesos 1.5.0 release?
>>
>> MESOS-1720 <https://issues.apache.org/jira/browse/MESOS-1720> was
>> resolved
>> after the 1.5 release (rc-2) and it seems like
>> it is only at the master branch and 1.5.x branch (not 1.5.0).
>>
>> Did I miss anything?
>>
>> - Gilbert
>>
>> On Thu, Mar 1, 2018 at 4:22 PM, Benjamin Mahler <bm...@apache.org>
>> wrote:
>>
>> > Put another way, we currently don't guarantee in-order task delivery to
>> > the executor. Due to the changes for MESOS-1720, one special case of
>> task
>> > re-ordering now leads to the re-ordered task being dropped (rather than
>> > delivered out-of-order as before). Technically, this is strictly better.
>> >
>> > However, we'd like to start guaranteeing in-order task delivery.
>> >
>> > On Thu, Mar 1, 2018 at 2:56 PM, Meng Zhu <mz...@mesosphere.com> wrote:
>> >
>> >> Hi all:
>> >>
>> >> TLDR: In Mesos 1.5, tasks may be explicitly dropped by the agent
>> >> if all three conditions are met:
>> >> (1) Several `LAUNCH_TASK` or `LAUNCH_GROUP` calls
>> >>  use the same executor.
>> >> (2) The executor currently does not exist on the agent.
>> >> (3) Due to some race conditions, these tasks are trying to launch
>> >> on the agent in a different order from their original launch order.
>> >>
>> >> In this case, tasks that are trying to launch on the agent
>> >> before the first task in the original order will be explicitly dropped
>> by
>> >> the agent (TASK_DROPPED` or `TASK_LOST` will be sent)).
>> >>
>> >> This bug will be fixed in 1.5.1. It is tracked in
>> >> https://issues.apache.org/jira/browse/MESOS-8624
>> >>
>> >> ----
>> >>
>> >> In https://issues.apache.org/jira/browse/MESOS-1720, we introduced an
>> >> ordering dependency between two `LAUNCH`/`LAUNCH_GROUP`
>> >> calls to a new executor. The master would specify that the first call
>> is
>> >> the
>> >> one to launch a new executor through the `launch_executor` field in
>> >> `RunTaskMessage`/`RunTaskGroupMessage`, and the second one should
>> >> use the existing executor launched by the first one.
>> >>
>> >> On the agent side, running a task/task group goes through a series of
>> >> continuations, one is `collect()` on the future that unschedule
>> >> frameworks from
>> >> being GC'ed:
>> >> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2158
>> >> another is `collect()` on task authorization:
>> >> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2333
>> >> Since these `collect()` calls run on individual actors, the futures of
>> the
>> >> `collect()` calls for two `LAUNCH`/`LAUNCH_GROUP` calls may return
>> >> out-of-order, even if the futures these two `collect()` wait for are
>> >> satisfied in
>> >> order (which is true in these two cases).
>> >>
>> >> As a result, under some race conditions (probably under some heavy load
>> >> conditions), tasks rely on the previous task to launch executor may
>> >> get processed before the task that is supposed to launch the executor
>> >> first, resulting in the tasks being explicitly dropped by the agent.
>> >>
>> >> -Meng
>> >>
>> >>
>> >>
>> >
>>
>

Re: Tasks may be explicitly dropped by agent in Mesos 1.5

Posted by Chun-Hung Hsiao <ch...@mesosphere.io>.

This is a new behavior we have after solving MESOS-1720, and thus a new
problem only in 1.5.x. Prior to 1.5, reordered tasks (to the same executor)
will be launched because whoever comes first will launch the executor.
Since 1.5, one might be dropped.

On Mar 1, 2018 4:36 PM, "Gilbert Song" <gi...@mesosphere.io> wrote:

> Meng,
>
> Could you double check if this is really an issue in Mesos 1.5.0 release?
>
> MESOS-1720 <https://issues.apache.org/jira/browse/MESOS-1720> was resolved
> after the 1.5 release (rc-2) and it seems like
> it is only at the master branch and 1.5.x branch (not 1.5.0).
>
> Did I miss anything?
>
> - Gilbert
>
> On Thu, Mar 1, 2018 at 4:22 PM, Benjamin Mahler <bm...@apache.org>
> wrote:
>
> > Put another way, we currently don't guarantee in-order task delivery to
> > the executor. Due to the changes for MESOS-1720, one special case of task
> > re-ordering now leads to the re-ordered task being dropped (rather than
> > delivered out-of-order as before). Technically, this is strictly better.
> >
> > However, we'd like to start guaranteeing in-order task delivery.
> >
> > On Thu, Mar 1, 2018 at 2:56 PM, Meng Zhu <mz...@mesosphere.com> wrote:
> >
> >> Hi all:
> >>
> >> TLDR: In Mesos 1.5, tasks may be explicitly dropped by the agent
> >> if all three conditions are met:
> >> (1) Several `LAUNCH_TASK` or `LAUNCH_GROUP` calls
> >>  use the same executor.
> >> (2) The executor currently does not exist on the agent.
> >> (3) Due to some race conditions, these tasks are trying to launch
> >> on the agent in a different order from their original launch order.
> >>
> >> In this case, tasks that are trying to launch on the agent
> >> before the first task in the original order will be explicitly dropped
> by
> >> the agent (TASK_DROPPED` or `TASK_LOST` will be sent)).
> >>
> >> This bug will be fixed in 1.5.1. It is tracked in
> >> https://issues.apache.org/jira/browse/MESOS-8624
> >>
> >> ----
> >>
> >> In https://issues.apache.org/jira/browse/MESOS-1720, we introduced an
> >> ordering dependency between two `LAUNCH`/`LAUNCH_GROUP`
> >> calls to a new executor. The master would specify that the first call is
> >> the
> >> one to launch a new executor through the `launch_executor` field in
> >> `RunTaskMessage`/`RunTaskGroupMessage`, and the second one should
> >> use the existing executor launched by the first one.
> >>
> >> On the agent side, running a task/task group goes through a series of
> >> continuations, one is `collect()` on the future that unschedule
> >> frameworks from
> >> being GC'ed:
> >> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2158
> >> another is `collect()` on task authorization:
> >> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2333
> >> Since these `collect()` calls run on individual actors, the futures of
> the
> >> `collect()` calls for two `LAUNCH`/`LAUNCH_GROUP` calls may return
> >> out-of-order, even if the futures these two `collect()` wait for are
> >> satisfied in
> >> order (which is true in these two cases).
> >>
> >> As a result, under some race conditions (probably under some heavy load
> >> conditions), tasks rely on the previous task to launch executor may
> >> get processed before the task that is supposed to launch the executor
> >> first, resulting in the tasks being explicitly dropped by the agent.
> >>
> >> -Meng
> >>
> >>
> >>
> >
>

Re: Tasks may be explicitly dropped by agent in Mesos 1.5

Posted by Gilbert Song <gi...@mesosphere.io>.

Meng,

Could you double check if this is really an issue in Mesos 1.5.0 release?

MESOS-1720 <https://issues.apache.org/jira/browse/MESOS-1720> was resolved
after the 1.5 release (rc-2) and it seems like
it is only at the master branch and 1.5.x branch (not 1.5.0).

Did I miss anything?

- Gilbert

On Thu, Mar 1, 2018 at 4:22 PM, Benjamin Mahler <bm...@apache.org> wrote:

> Put another way, we currently don't guarantee in-order task delivery to
> the executor. Due to the changes for MESOS-1720, one special case of task
> re-ordering now leads to the re-ordered task being dropped (rather than
> delivered out-of-order as before). Technically, this is strictly better.
>
> However, we'd like to start guaranteeing in-order task delivery.
>
> On Thu, Mar 1, 2018 at 2:56 PM, Meng Zhu <mz...@mesosphere.com> wrote:
>
>> Hi all:
>>
>> TLDR: In Mesos 1.5, tasks may be explicitly dropped by the agent
>> if all three conditions are met:
>> (1) Several `LAUNCH_TASK` or `LAUNCH_GROUP` calls
>>  use the same executor.
>> (2) The executor currently does not exist on the agent.
>> (3) Due to some race conditions, these tasks are trying to launch
>> on the agent in a different order from their original launch order.
>>
>> In this case, tasks that are trying to launch on the agent
>> before the first task in the original order will be explicitly dropped by
>> the agent (TASK_DROPPED` or `TASK_LOST` will be sent)).
>>
>> This bug will be fixed in 1.5.1. It is tracked in
>> https://issues.apache.org/jira/browse/MESOS-8624
>>
>> ----
>>
>> In https://issues.apache.org/jira/browse/MESOS-1720, we introduced an
>> ordering dependency between two `LAUNCH`/`LAUNCH_GROUP`
>> calls to a new executor. The master would specify that the first call is
>> the
>> one to launch a new executor through the `launch_executor` field in
>> `RunTaskMessage`/`RunTaskGroupMessage`, and the second one should
>> use the existing executor launched by the first one.
>>
>> On the agent side, running a task/task group goes through a series of
>> continuations, one is `collect()` on the future that unschedule
>> frameworks from
>> being GC'ed:
>> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2158
>> another is `collect()` on task authorization:
>> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2333
>> Since these `collect()` calls run on individual actors, the futures of the
>> `collect()` calls for two `LAUNCH`/`LAUNCH_GROUP` calls may return
>> out-of-order, even if the futures these two `collect()` wait for are
>> satisfied in
>> order (which is true in these two cases).
>>
>> As a result, under some race conditions (probably under some heavy load
>> conditions), tasks rely on the previous task to launch executor may
>> get processed before the task that is supposed to launch the executor
>> first, resulting in the tasks being explicitly dropped by the agent.
>>
>> -Meng
>>
>>
>>
>

Re: Tasks may be explicitly dropped by agent in Mesos 1.5

Posted by Gilbert Song <gi...@mesosphere.io>.

Meng,

Could you double check if this is really an issue in Mesos 1.5.0 release?

MESOS-1720 <https://issues.apache.org/jira/browse/MESOS-1720> was resolved
after the 1.5 release (rc-2) and it seems like
it is only at the master branch and 1.5.x branch (not 1.5.0).

Did I miss anything?

- Gilbert

On Thu, Mar 1, 2018 at 4:22 PM, Benjamin Mahler <bm...@apache.org> wrote:

> Put another way, we currently don't guarantee in-order task delivery to
> the executor. Due to the changes for MESOS-1720, one special case of task
> re-ordering now leads to the re-ordered task being dropped (rather than
> delivered out-of-order as before). Technically, this is strictly better.
>
> However, we'd like to start guaranteeing in-order task delivery.
>
> On Thu, Mar 1, 2018 at 2:56 PM, Meng Zhu <mz...@mesosphere.com> wrote:
>
>> Hi all:
>>
>> TLDR: In Mesos 1.5, tasks may be explicitly dropped by the agent
>> if all three conditions are met:
>> (1) Several `LAUNCH_TASK` or `LAUNCH_GROUP` calls
>>  use the same executor.
>> (2) The executor currently does not exist on the agent.
>> (3) Due to some race conditions, these tasks are trying to launch
>> on the agent in a different order from their original launch order.
>>
>> In this case, tasks that are trying to launch on the agent
>> before the first task in the original order will be explicitly dropped by
>> the agent (TASK_DROPPED` or `TASK_LOST` will be sent)).
>>
>> This bug will be fixed in 1.5.1. It is tracked in
>> https://issues.apache.org/jira/browse/MESOS-8624
>>
>> ----
>>
>> In https://issues.apache.org/jira/browse/MESOS-1720, we introduced an
>> ordering dependency between two `LAUNCH`/`LAUNCH_GROUP`
>> calls to a new executor. The master would specify that the first call is
>> the
>> one to launch a new executor through the `launch_executor` field in
>> `RunTaskMessage`/`RunTaskGroupMessage`, and the second one should
>> use the existing executor launched by the first one.
>>
>> On the agent side, running a task/task group goes through a series of
>> continuations, one is `collect()` on the future that unschedule
>> frameworks from
>> being GC'ed:
>> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2158
>> another is `collect()` on task authorization:
>> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2333
>> Since these `collect()` calls run on individual actors, the futures of the
>> `collect()` calls for two `LAUNCH`/`LAUNCH_GROUP` calls may return
>> out-of-order, even if the futures these two `collect()` wait for are
>> satisfied in
>> order (which is true in these two cases).
>>
>> As a result, under some race conditions (probably under some heavy load
>> conditions), tasks rely on the previous task to launch executor may
>> get processed before the task that is supposed to launch the executor
>> first, resulting in the tasks being explicitly dropped by the agent.
>>
>> -Meng
>>
>>
>>
>

Re: Tasks may be explicitly dropped by agent in Mesos 1.5

Posted by Benjamin Mahler <bm...@apache.org>.

Put another way, we currently don't guarantee in-order task delivery to the
executor. Due to the changes for MESOS-1720, one special case of task
re-ordering now leads to the re-ordered task being dropped (rather than
delivered out-of-order as before). Technically, this is strictly better.

However, we'd like to start guaranteeing in-order task delivery.

On Thu, Mar 1, 2018 at 2:56 PM, Meng Zhu <mz...@mesosphere.com> wrote:

> Hi all:
>
> TLDR: In Mesos 1.5, tasks may be explicitly dropped by the agent
> if all three conditions are met:
> (1) Several `LAUNCH_TASK` or `LAUNCH_GROUP` calls
>  use the same executor.
> (2) The executor currently does not exist on the agent.
> (3) Due to some race conditions, these tasks are trying to launch
> on the agent in a different order from their original launch order.
>
> In this case, tasks that are trying to launch on the agent
> before the first task in the original order will be explicitly dropped by
> the agent (TASK_DROPPED` or `TASK_LOST` will be sent)).
>
> This bug will be fixed in 1.5.1. It is tracked in
> https://issues.apache.org/jira/browse/MESOS-8624
>
> ----
>
> In https://issues.apache.org/jira/browse/MESOS-1720, we introduced an
> ordering dependency between two `LAUNCH`/`LAUNCH_GROUP`
> calls to a new executor. The master would specify that the first call is
> the
> one to launch a new executor through the `launch_executor` field in
> `RunTaskMessage`/`RunTaskGroupMessage`, and the second one should
> use the existing executor launched by the first one.
>
> On the agent side, running a task/task group goes through a series of
> continuations, one is `collect()` on the future that unschedule
> frameworks from
> being GC'ed:
> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2158
> another is `collect()` on task authorization:
> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2333
> Since these `collect()` calls run on individual actors, the futures of the
> `collect()` calls for two `LAUNCH`/`LAUNCH_GROUP` calls may return
> out-of-order, even if the futures these two `collect()` wait for are
> satisfied in
> order (which is true in these two cases).
>
> As a result, under some race conditions (probably under some heavy load
> conditions), tasks rely on the previous task to launch executor may
> get processed before the task that is supposed to launch the executor
> first, resulting in the tasks being explicitly dropped by the agent.
>
> -Meng
>
>
>

Re: Tasks may be explicitly dropped by agent in Mesos 1.5

Posted by Benjamin Mahler <bm...@apache.org>.

Put another way, we currently don't guarantee in-order task delivery to the
executor. Due to the changes for MESOS-1720, one special case of task
re-ordering now leads to the re-ordered task being dropped (rather than
delivered out-of-order as before). Technically, this is strictly better.

However, we'd like to start guaranteeing in-order task delivery.

On Thu, Mar 1, 2018 at 2:56 PM, Meng Zhu <mz...@mesosphere.com> wrote:

> Hi all:
>
> TLDR: In Mesos 1.5, tasks may be explicitly dropped by the agent
> if all three conditions are met:
> (1) Several `LAUNCH_TASK` or `LAUNCH_GROUP` calls
>  use the same executor.
> (2) The executor currently does not exist on the agent.
> (3) Due to some race conditions, these tasks are trying to launch
> on the agent in a different order from their original launch order.
>
> In this case, tasks that are trying to launch on the agent
> before the first task in the original order will be explicitly dropped by
> the agent (TASK_DROPPED` or `TASK_LOST` will be sent)).
>
> This bug will be fixed in 1.5.1. It is tracked in
> https://issues.apache.org/jira/browse/MESOS-8624
>
> ----
>
> In https://issues.apache.org/jira/browse/MESOS-1720, we introduced an
> ordering dependency between two `LAUNCH`/`LAUNCH_GROUP`
> calls to a new executor. The master would specify that the first call is
> the
> one to launch a new executor through the `launch_executor` field in
> `RunTaskMessage`/`RunTaskGroupMessage`, and the second one should
> use the existing executor launched by the first one.
>
> On the agent side, running a task/task group goes through a series of
> continuations, one is `collect()` on the future that unschedule
> frameworks from
> being GC'ed:
> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2158
> another is `collect()` on task authorization:
> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2333
> Since these `collect()` calls run on individual actors, the futures of the
> `collect()` calls for two `LAUNCH`/`LAUNCH_GROUP` calls may return
> out-of-order, even if the futures these two `collect()` wait for are
> satisfied in
> order (which is true in these two cases).
>
> As a result, under some race conditions (probably under some heavy load
> conditions), tasks rely on the previous task to launch executor may
> get processed before the task that is supposed to launch the executor
> first, resulting in the tasks being explicitly dropped by the agent.
>
> -Meng
>
>
>