Posted to dev@mesos.apache.org by Neil Conway <ne...@gmail.com> on 2016/10/14 22:11:59 UTC

Non-checkpointing frameworks

Hi folks,

I'd like input from individuals who currently use frameworks but do
not enable checkpointing.

Background: "checkpointing" is a parameter that can be enabled in
FrameworkInfo; if enabled, the agent will write the framework PID,
executor PIDs, and status updates to disk for any tasks started by
that framework. This checkpointed information means that these tasks
can survive an agent crash: if the agent exits (whether due to
crashing or as part of an upgrade procedure), a restarted agent can
use this information to reconnect to executors started by the previous
instance of the agent. The downside is that checkpointing requires
some additional disk I/O at the agent.
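
To make the recovery idea concrete, here is a minimal Python sketch of the
behavior described above. It is a toy model, not the Mesos implementation:
ToyAgent, launch(), recover(), and the one-JSON-file-per-executor layout are
all hypothetical names and choices made for illustration. It only shows that
a restarted agent can find executors whose state was written to disk, and
that this costs a write per executor.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ToyAgent:
    """Toy stand-in for an agent (hypothetical, not Mesos code)."""
    work_dir: Path
    executors: dict = field(default_factory=dict)  # executor id -> pid

    def launch(self, executor_id: str, pid: int, checkpoint: bool) -> None:
        # Always track the executor in memory; persist it only when the
        # framework asked for checkpointing (this is the extra disk I/O).
        self.executors[executor_id] = pid
        if checkpoint:
            path = self.work_dir / f"{executor_id}.json"
            path.write_text(json.dumps({"pid": pid}))

    @classmethod
    def recover(cls, work_dir: Path) -> "ToyAgent":
        # A restarted agent starts with empty memory, so it can only
        # reconnect to executors whose state was written to disk.
        agent = cls(work_dir)
        for path in sorted(work_dir.glob("*.json")):
            agent.executors[path.stem] = json.loads(path.read_text())["pid"]
        return agent


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        work_dir = Path(tmp)
        agent = ToyAgent(work_dir)
        agent.launch("exec-checkpointed", pid=101, checkpoint=True)
        agent.launch("exec-plain", pid=102, checkpoint=False)
        del agent  # simulate the agent process exiting
        return ToyAgent.recover(work_dir).executors


if __name__ == "__main__":
    print(demo())
```

In this toy model only the checkpointed executor is visible to the recovered
agent; the other one is unknown to it after the restart.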

Checkpointing is not currently the default, but in my experience it is
often enabled for production frameworks. As part of the work on
supporting partition-aware Mesos frameworks (see MESOS-4049), we are
considering:

(a) requiring that partition-aware frameworks must also enable
checkpointing, and/or
(b) enabling checkpointing by default
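
As a rough sketch of how the two options could interact, assuming a
simplified framework description: ToyFrameworkInfo, effective_checkpoint(),
and validate() below are hypothetical names for illustration and do not
reflect how Mesos validates frameworks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyFrameworkInfo:
    """Simplified, hypothetical stand-in for a framework description."""
    name: str
    partition_aware: bool = False
    checkpoint: Optional[bool] = None  # None means the framework did not say


def effective_checkpoint(info: ToyFrameworkInfo, default_on: bool) -> bool:
    # Option (b): when the framework does not set the field, fall back to
    # the default, which this sketch lets the caller choose.
    return default_on if info.checkpoint is None else info.checkpoint


def validate(info: ToyFrameworkInfo,
             default_on: bool = True,
             require_for_partition_aware: bool = True) -> Optional[str]:
    # Option (a): reject partition-aware frameworks that end up without
    # checkpointing. Returns an error message, or None if acceptable.
    if (require_for_partition_aware
            and info.partition_aware
            and not effective_checkpoint(info, default_on)):
        return "partition-aware frameworks must enable checkpointing"
    return None


if __name__ == "__main__":
    explicit_off = ToyFrameworkInfo("a", partition_aware=True, checkpoint=False)
    unset = ToyFrameworkInfo("b", partition_aware=True)
    print(validate(explicit_off))             # rejected under (a)
    print(validate(unset))                    # accepted when (b) is also on
    print(validate(unset, default_on=False))  # rejected under (a) alone
```

With both (a) and (b) in place, a partition-aware framework would only be
rejected in this sketch if it explicitly turned checkpointing off.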

If you have intentionally decided to disable checkpointing for your
Mesos framework, I'd be curious to hear more about your use-case and
why you haven't enabled it.

Thanks!

Neil

Re: Non-checkpointing frameworks

Posted by Neil Conway <ne...@gmail.com>.

Hi folks,

Thanks for the feedback!

On Mon, Oct 17, 2016 at 12:44 PM, Zhitao Li <zh...@gmail.com> wrote:
> +1 to both A and B.
>
> Do we plan to eventually drop non-checkpointed framework support (possibly
> in v2) and declare that all frameworks have to operate under this assumption?

I think that's worth considering for v2, *if* we don't find anyone
who has good reasons for disabling checkpointing. But I'd expect it
is more likely that we keep it around as an option that is disabled by
default.

Neil

Re: Non-checkpointing frameworks

Posted by Zhitao Li <zh...@gmail.com>.

+1 to both A and B.

Do we plan to eventually drop non-checkpointed framework support (possibly
in v2) and declare that all frameworks have to operate under this assumption?

On Mon, Oct 17, 2016 at 1:36 AM, Aaron Carey <ac...@ilm.com> wrote:

> +1 to A and B
>
> Aaron Carey
> Production Engineer - Cloud Pipeline
> Industrial Light & Magic
> London
> 020 3751 9150
>
>
> On 17 October 2016 at 00:38, Qian Zhang <zh...@gmail.com> wrote:
>
>> and requires operators to enable checkpointing on the slaves.
>>
>>
>> Just curious why operators need to enable checkpointing on the slaves (I
>> do not see an agent flag for that). I think checkpointing should be enabled
>> at the framework level rather than on the slave.
>>
>>
>> Thanks,
>> Qian Zhang
>>
>> On Sun, Oct 16, 2016 at 10:18 AM, Zameer Manji <zm...@apache.org> wrote:
>>
>>> +1 to A and B
>>>
>>> Aurora has enabled checkpointing for years and requires operators to
>>> enable checkpointing on the slaves.
>>>
>>> On Sat, Oct 15, 2016 at 11:57 AM, Joris Van Remoortere <
>>> joris@mesosphere.io>
>>> wrote:
>>>
>>> > I'm in favor of A & B. I find it provides a better "first experience"
>>> > to users.
>>> > From my experience you usually have to have an explicit reason to not
>>> > want to checkpoint. Most people assume the semantics provided by the
>>> > checkpoint behavior is default and it can be a frustrating experience
>>> > for them to find out that is not the case.
>>> >
>>> > —
>>> > *Joris Van Remoortere*
>>>
>>> > Mesosphere
>>> >
>>> > On Fri, Oct 14, 2016 at 3:11 PM, Neil Conway <ne...@gmail.com>
>>> > wrote:
>>> >
>>> >> Hi folks,
>>> >>
>>> >> I'd like input from individuals who currently use frameworks but do
>>> >> not enable checkpointing.
>>> >>
>>> >> Background: "checkpointing" is a parameter that can be enabled in
>>> >> FrameworkInfo; if enabled, the agent will write the framework pid,
>>> >> executor PIDs, and status updates to disk for any tasks started by
>>> >> that framework. This checkpointed information means that these tasks
>>> >> can survive an agent crash: if the agent exits (whether due to
>>> >> crashing or as part of an upgrade procedure), a restarted agent can
>>> >> use this information to reconnect to executors started by the previous
>>> >> instance of the agent. The downside is that checkpointing requires
>>> >> some additional disk I/O at the agent.
>>> >>
>>> >> Checkpointing is not currently the default, but in my experience it is
>>> >> often enabled for production frameworks. As part of the work on
>>> >> supporting partition-aware Mesos frameworks (see MESOS-4049), we are
>>> >> considering:
>>> >>
>>> >> (a) requiring that partition-aware frameworks must also enable
>>> >> checkpointing, and/or
>>> >> (b) enabling checkpointing by default
>>> >>
>>> >> If you have intentionally decided to disable checkpointing for your
>>> >> Mesos framework, I'd be curious to hear more about your use-case and
>>> >> why you haven't enabled it.
>>> >>
>>> >> Thanks!
>>> >>
>>> >> Neil
>>> >>
>>> >> --
>>> >> Zameer Manji
>>> >>
>>> >
>>>
>>
>>
>


-- 
Cheers,

Zhitao Li

Re: Non-checkpointing frameworks

Posted by Aaron Carey <ac...@ilm.com>.

+1 to A and B

Aaron Carey
Production Engineer - Cloud Pipeline
Industrial Light & Magic
London
020 3751 9150


On 17 October 2016 at 00:38, Qian Zhang <zh...@gmail.com> wrote:

> and requires operators to enable checkpointing on the slaves.
>
>
> Just curious why operators need to enable checkpointing on the slaves (I
> do not see an agent flag for that). I think checkpointing should be enabled
> at the framework level rather than on the slave.
>
>
> Thanks,
> Qian Zhang
>
> On Sun, Oct 16, 2016 at 10:18 AM, Zameer Manji <zm...@apache.org> wrote:
>
>> +1 to A and B
>>
>> Aurora has enabled checkpointing for years and requires operators to
>> enable checkpointing on the slaves.
>>
>> On Sat, Oct 15, 2016 at 11:57 AM, Joris Van Remoortere <
>> joris@mesosphere.io>
>> wrote:
>>
>> > I'm in favor of A & B. I find it provides a better "first experience" to
>> > users.
>> > From my experience you usually have to have an explicit reason to not
>> > want to checkpoint. Most people assume the semantics provided by the
>> > checkpoint behavior is default and it can be a frustrating experience
>> > for them to find out that is not the case.
>> >
>> > —
>> > *Joris Van Remoortere*
>>
>> > Mesosphere
>> >
>> > On Fri, Oct 14, 2016 at 3:11 PM, Neil Conway <ne...@gmail.com>
>> > wrote:
>> >
>> >> Hi folks,
>> >>
>> >> I'd like input from individuals who currently use frameworks but do
>> >> not enable checkpointing.
>> >>
>> >> Background: "checkpointing" is a parameter that can be enabled in
>> >> FrameworkInfo; if enabled, the agent will write the framework pid,
>> >> executor PIDs, and status updates to disk for any tasks started by
>> >> that framework. This checkpointed information means that these tasks
>> >> can survive an agent crash: if the agent exits (whether due to
>> >> crashing or as part of an upgrade procedure), a restarted agent can
>> >> use this information to reconnect to executors started by the previous
>> >> instance of the agent. The downside is that checkpointing requires
>> >> some additional disk I/O at the agent.
>> >>
>> >> Checkpointing is not currently the default, but in my experience it is
>> >> often enabled for production frameworks. As part of the work on
>> >> supporting partition-aware Mesos frameworks (see MESOS-4049), we are
>> >> considering:
>> >>
>> >> (a) requiring that partition-aware frameworks must also enable
>> >> checkpointing, and/or
>> >> (b) enabling checkpointing by default
>> >>
>> >> If you have intentionally decided to disable checkpointing for your
>> >> Mesos framework, I'd be curious to hear more about your use-case and
>> >> why you haven't enabled it.
>> >>
>> >> Thanks!
>> >>
>> >> Neil
>> >>
>> >> --
>> >> Zameer Manji
>> >>
>> >
>>
>
>

Re: Non-checkpointing frameworks

Posted by Qian Zhang <zh...@gmail.com>.

Got it, thanks Zameer!


Thanks,
Qian Zhang

On Tue, Oct 18, 2016 at 2:25 AM, Zameer Manji <zm...@apache.org> wrote:

> Qian,
>
> Turns out the --checkpoint flag was made default and removed in Mesos 0.22.
>
> On Sun, Oct 16, 2016 at 4:38 PM, Qian Zhang <zh...@gmail.com> wrote:
>
>> and requires operators to enable checkpointing on the slaves.
>>
>>
>> Just curious why operators need to enable checkpointing on the slaves (I
>> do not see an agent flag for that). I think checkpointing should be enabled
>> at the framework level rather than on the slave.
>>
>>
>> Thanks,
>> Qian Zhang
>>
>> On Sun, Oct 16, 2016 at 10:18 AM, Zameer Manji <zm...@apache.org> wrote:
>>
>>> +1 to A and B
>>>
>>> Aurora has enabled checkpointing for years and requires operators to
>>> enable checkpointing on the slaves.
>>>
>>> On Sat, Oct 15, 2016 at 11:57 AM, Joris Van Remoortere <
>>> joris@mesosphere.io>
>>> wrote:
>>>
>>> > I'm in favor of A & B. I find it provides a better "first experience"
>>> > to users.
>>> > From my experience you usually have to have an explicit reason to not
>>> > want to checkpoint. Most people assume the semantics provided by the
>>> > checkpoint behavior is default and it can be a frustrating experience
>>> > for them to find out that is not the case.
>>> >
>>> > —
>>> > *Joris Van Remoortere*
>>>
>>> > Mesosphere
>>> >
>>> > On Fri, Oct 14, 2016 at 3:11 PM, Neil Conway <ne...@gmail.com>
>>> > wrote:
>>> >
>>> >> Hi folks,
>>> >>
>>> >> I'd like input from individuals who currently use frameworks but do
>>> >> not enable checkpointing.
>>> >>
>>> >> Background: "checkpointing" is a parameter that can be enabled in
>>> >> FrameworkInfo; if enabled, the agent will write the framework pid,
>>> >> executor PIDs, and status updates to disk for any tasks started by
>>> >> that framework. This checkpointed information means that these tasks
>>> >> can survive an agent crash: if the agent exits (whether due to
>>> >> crashing or as part of an upgrade procedure), a restarted agent can
>>> >> use this information to reconnect to executors started by the previous
>>> >> instance of the agent. The downside is that checkpointing requires
>>> >> some additional disk I/O at the agent.
>>> >>
>>> >> Checkpointing is not currently the default, but in my experience it is
>>> >> often enabled for production frameworks. As part of the work on
>>> >> supporting partition-aware Mesos frameworks (see MESOS-4049), we are
>>> >> considering:
>>> >>
>>> >> (a) requiring that partition-aware frameworks must also enable
>>> >> checkpointing, and/or
>>> >> (b) enabling checkpointing by default
>>> >>
>>> >> If you have intentionally decided to disable checkpointing for your
>>> >> Mesos framework, I'd be curious to hear more about your use-case and
>>> >> why you haven't enabled it.
>>> >>
>>> >> Thanks!
>>> >>
>>> >> Neil
>>> >>
>>> >> --
>>> >> Zameer Manji
>>> >>
>>> >
>>>
>>> --
>>> Zameer Manji
>>>
>>

Re: Non-checkpointing frameworks

Posted by Zameer Manji <zm...@apache.org>.

Qian,

Turns out the --checkpoint flag was made default and removed in Mesos 0.22.

On Sun, Oct 16, 2016 at 4:38 PM, Qian Zhang <zh...@gmail.com> wrote:

> and requires operators to enable checkpointing on the slaves.
>
>
> Just curious why operators need to enable checkpointing on the slaves (I
> do not see an agent flag for that). I think checkpointing should be enabled
> at the framework level rather than on the slave.
>
>
> Thanks,
> Qian Zhang
>
> On Sun, Oct 16, 2016 at 10:18 AM, Zameer Manji <zm...@apache.org> wrote:
>
>> +1 to A and B
>>
>> Aurora has enabled checkpointing for years and requires operators to
>> enable checkpointing on the slaves.
>>
>> On Sat, Oct 15, 2016 at 11:57 AM, Joris Van Remoortere <
>> joris@mesosphere.io>
>> wrote:
>>
>> > I'm in favor of A & B. I find it provides a better "first experience" to
>> > users.
>> > From my experience you usually have to have an explicit reason to not
>> > want to checkpoint. Most people assume the semantics provided by the
>> > checkpoint behavior is default and it can be a frustrating experience
>> > for them to find out that is not the case.
>> >
>> > —
>> > *Joris Van Remoortere*
>>
>> > Mesosphere
>> >
>> > On Fri, Oct 14, 2016 at 3:11 PM, Neil Conway <ne...@gmail.com>
>> > wrote:
>> >
>> >> Hi folks,
>> >>
>> >> I'd like input from individuals who currently use frameworks but do
>> >> not enable checkpointing.
>> >>
>> >> Background: "checkpointing" is a parameter that can be enabled in
>> >> FrameworkInfo; if enabled, the agent will write the framework pid,
>> >> executor PIDs, and status updates to disk for any tasks started by
>> >> that framework. This checkpointed information means that these tasks
>> >> can survive an agent crash: if the agent exits (whether due to
>> >> crashing or as part of an upgrade procedure), a restarted agent can
>> >> use this information to reconnect to executors started by the previous
>> >> instance of the agent. The downside is that checkpointing requires
>> >> some additional disk I/O at the agent.
>> >>
>> >> Checkpointing is not currently the default, but in my experience it is
>> >> often enabled for production frameworks. As part of the work on
>> >> supporting partition-aware Mesos frameworks (see MESOS-4049), we are
>> >> considering:
>> >>
>> >> (a) requiring that partition-aware frameworks must also enable
>> >> checkpointing, and/or
>> >> (b) enabling checkpointing by default
>> >>
>> >> If you have intentionally decided to disable checkpointing for your
>> >> Mesos framework, I'd be curious to hear more about your use-case and
>> >> why you haven't enabled it.
>> >>
>> >> Thanks!
>> >>
>> >> Neil
>> >>
>> >> --
>> >> Zameer Manji
>> >>
>> >
>>
>> --
>> Zameer Manji
>>
>

Re: Non-checkpointing frameworks

Posted by Qian Zhang <zh...@gmail.com>.

>
> and requires operators to enable checkpointing on the slaves.


Just curious why operators need to enable checkpointing on the slaves (I do
not see an agent flag for that). I think checkpointing should be enabled at
the framework level rather than on the slave.


Thanks,
Qian Zhang

On Sun, Oct 16, 2016 at 10:18 AM, Zameer Manji <zm...@apache.org> wrote:

> +1 to A and B
>
> Aurora has enabled checkpointing for years and requires operators to enable
> checkpointing on the slaves.
>
> On Sat, Oct 15, 2016 at 11:57 AM, Joris Van Remoortere <
> joris@mesosphere.io>
> wrote:
>
> > I'm in favor of A & B. I find it provides a better "first experience" to
> > users.
> > From my experience you usually have to have an explicit reason to not
> > want to checkpoint. Most people assume the semantics provided by the
> > checkpoint behavior is default and it can be a frustrating experience
> > for them to find out that is not the case.
> >
> > —
> > *Joris Van Remoortere*
> > Mesosphere
> >
> > On Fri, Oct 14, 2016 at 3:11 PM, Neil Conway <ne...@gmail.com>
> > wrote:
> >
> >> Hi folks,
> >>
> >> I'd like input from individuals who currently use frameworks but do
> >> not enable checkpointing.
> >>
> >> Background: "checkpointing" is a parameter that can be enabled in
> >> FrameworkInfo; if enabled, the agent will write the framework pid,
> >> executor PIDs, and status updates to disk for any tasks started by
> >> that framework. This checkpointed information means that these tasks
> >> can survive an agent crash: if the agent exits (whether due to
> >> crashing or as part of an upgrade procedure), a restarted agent can
> >> use this information to reconnect to executors started by the previous
> >> instance of the agent. The downside is that checkpointing requires
> >> some additional disk I/O at the agent.
> >>
> >> Checkpointing is not currently the default, but in my experience it is
> >> often enabled for production frameworks. As part of the work on
> >> supporting partition-aware Mesos frameworks (see MESOS-4049), we are
> >> considering:
> >>
> >> (a) requiring that partition-aware frameworks must also enable
> >> checkpointing, and/or
> >> (b) enabling checkpointing by default
> >>
> >> If you have intentionally decided to disable checkpointing for your
> >> Mesos framework, I'd be curious to hear more about your use-case and
> >> why you haven't enabled it.
> >>
> >> Thanks!
> >>
> >> Neil
> >>
> >> --
> >> Zameer Manji
> >>
> >
>

Re: Non-checkpointing frameworks

Posted by Zameer Manji <zm...@apache.org>.

+1 to A and B

Aurora has enabled checkpointing for years and requires operators to enable
checkpointing on the slaves.

On Sat, Oct 15, 2016 at 11:57 AM, Joris Van Remoortere <jo...@mesosphere.io>
wrote:

> I'm in favor of A & B. I find it provides a better "first experience" to
> users.
> From my experience you usually have to have an explicit reason to not want
> to checkpoint. Most people assume the semantics provided by the checkpoint
> behavior is default and it can be a frustrating experience for them to find
> out that is not the case.
>
> —
> *Joris Van Remoortere*
> Mesosphere
>
> On Fri, Oct 14, 2016 at 3:11 PM, Neil Conway <ne...@gmail.com>
> wrote:
>
>> Hi folks,
>>
>> I'd like input from individuals who currently use frameworks but do
>> not enable checkpointing.
>>
>> Background: "checkpointing" is a parameter that can be enabled in
>> FrameworkInfo; if enabled, the agent will write the framework pid,
>> executor PIDs, and status updates to disk for any tasks started by
>> that framework. This checkpointed information means that these tasks
>> can survive an agent crash: if the agent exits (whether due to
>> crashing or as part of an upgrade procedure), a restarted agent can
>> use this information to reconnect to executors started by the previous
>> instance of the agent. The downside is that checkpointing requires
>> some additional disk I/O at the agent.
>>
>> Checkpointing is not currently the default, but in my experience it is
>> often enabled for production frameworks. As part of the work on
>> supporting partition-aware Mesos frameworks (see MESOS-4049), we are
>> considering:
>>
>> (a) requiring that partition-aware frameworks must also enable
>> checkpointing, and/or
>> (b) enabling checkpointing by default
>>
>> If you have intentionally decided to disable checkpointing for your
>> Mesos framework, I'd be curious to hear more about your use-case and
>> why you haven't enabled it.
>>
>> Thanks!
>>
>> Neil
>>
>> --
>> Zameer Manji
>>
>

Re: Non-checkpointing frameworks

Posted by Joris Van Remoortere <jo...@mesosphere.io>.

I'm in favor of A & B. I find it provides a better "first experience" to
users.
From my experience you usually have to have an explicit reason to not want
to checkpoint. Most people assume the semantics provided by the checkpoint
behavior are the default, and it can be a frustrating experience for them to
find out that is not the case.

—
*Joris Van Remoortere*
Mesosphere

On Fri, Oct 14, 2016 at 3:11 PM, Neil Conway <ne...@gmail.com> wrote:

> Hi folks,
>
> I'd like input from individuals who currently use frameworks but do
> not enable checkpointing.
>
> Background: "checkpointing" is a parameter that can be enabled in
> FrameworkInfo; if enabled, the agent will write the framework pid,
> executor PIDs, and status updates to disk for any tasks started by
> that framework. This checkpointed information means that these tasks
> can survive an agent crash: if the agent exits (whether due to
> crashing or as part of an upgrade procedure), a restarted agent can
> use this information to reconnect to executors started by the previous
> instance of the agent. The downside is that checkpointing requires
> some additional disk I/O at the agent.
>
> Checkpointing is not currently the default, but in my experience it is
> often enabled for production frameworks. As part of the work on
> supporting partition-aware Mesos frameworks (see MESOS-4049), we are
> considering:
>
> (a) requiring that partition-aware frameworks must also enable
> checkpointing, and/or
> (b) enabling checkpointing by default
>
> If you have intentionally decided to disable checkpointing for your
> Mesos framework, I'd be curious to hear more about your use-case and
> why you haven't enabled it.
>
> Thanks!
>
> Neil
>
