Posted to dev@mesos.apache.org by Benno Evers <be...@mesosphere.com> on 2017/11/05 01:26:28 UTC

Design doc: Relaxing Agent State Recovery on Restart

Hi all,

Most people are probably familiar with the mesos-agent's behaviour of
refusing to start when it detects incompatible slave information from a
previous run in its work directory. There is currently no way to suppress
or pre-emptively avoid this scenario without manual user intervention.

There has been some prior work towards changing this, e.g.
https://issues.apache.org/jira/browse/MESOS-1739 and in particular
https://docs.google.com/document/d/1PWv7YIdV3nN2l1oUW7Nybm4KdnxZ7Px2JGc5UM-PKoQ/edit#heading=h.ygnc486t6w2z
but it looks like it has not been under active development since parts of
it were implemented.

Since I would like to work on this as well, I wrote up another, new
proposal to relax this strict behaviour:


https://docs.google.com/document/d/1iOENs0JoXPc7sf1NDBCR2tPJ_KxwU4lLtr53SrE5U3Q/edit?usp=sharing

As always, any comments and suggestions are welcome and highly valued.

Best regards,
-- 
Benno Evers
Software Engineer, Mesosphere

Re: Design doc: Relaxing Agent State Recovery on Restart

Posted by Benno Evers <be...@mesosphere.com>.
Whoops, sorry, done.

On Thu, Nov 9, 2017 at 6:41 AM, Zhitao Li <zh...@gmail.com> wrote:

> Can you allow viewers to comment on the doc? Thanks



-- 
Benno Evers
Software Engineer, Mesosphere

Re: Design doc: Relaxing Agent State Recovery on Restart

Posted by Zhitao Li <zh...@gmail.com>.
Can you allow viewers to comment on the doc? Thanks

On Wed, Nov 8, 2017 at 5:06 PM, Benno Evers <be...@mesosphere.com> wrote:

> Thanks to everyone for the great comments.
>
> After thinking about the problem and the feedback for a few more days, I
> went back to the drawing board and created a second revision of the design
> doc.
>
> As before, all comments are very welcome:
>
>
> https://docs.google.com/document/d/1hu0Ufi6gdskNEd7kY1sDx8ST67PAh9rJELRTXJzJGX0/edit?usp=sharing
>
> Best regards,
> Benno



-- 
Cheers,

Zhitao Li

Re: Design doc: Relaxing Agent State Recovery on Restart

Posted by Benno Evers <be...@mesosphere.com>.
Thanks to everyone for the great comments.

After thinking about the problem and the feedback for a few more days, I
went back to the drawing board and created a second revision of the design
doc.

As before, all comments are very welcome:


https://docs.google.com/document/d/1hu0Ufi6gdskNEd7kY1sDx8ST67PAh9rJELRTXJzJGX0/edit?usp=sharing

Best regards,
Benno

On Sun, Nov 5, 2017 at 9:39 PM, Zhitao Li <zh...@gmail.com> wrote:

> Thanks! Really looking forward to the proposed change, as this is one of
> the most painful parts of agent operation we have observed.



-- 
Benno Evers
Software Engineer, Mesosphere

Re: Design doc: Relaxing Agent State Recovery on Restart

Posted by Zhitao Li <zh...@gmail.com>.
Thanks! Really looking forward to the proposed change, as this is one of
the most painful parts of agent operation we have observed.



-- 
Cheers,

Zhitao Li