Posted to dev@mesos.apache.org by Diogo Gomes <di...@gmail.com> on 2016/01/06 21:52:04 UTC

[MESOS-1865] Redirect to the leader master when current master is not a leader.

Hi, Adam and Haosdent


Resurrecting this issue, https://issues.apache.org/jira/browse/MESOS-1865: I would like to +1 this change. It seems to have gone cold, but I think it is very relevant, and we have had enough time to prepare for a change like this, right?


If necessary, can I help with something?


Diogo Gomes

Re: [MESOS-1865] Redirect to the leader master when current master is not a leader.

Posted by Shuai Lin <li...@gmail.com>.

Regarding this issue, I see that a non-active Marathon instance proxies HTTP
requests to the active Marathon instance. This way, no matter which Marathon
instance the client visits, it always gets the correct result.

Could we do the same with Mesos masters? The implementation would be more
complicated than the current patch, but it would be more convenient for
most users.
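
To make the proxying idea concrete, here is a minimal Python sketch of a
non-leading master forwarding a request to the leader. It is only an
illustration: `handle_request`, `serve_locally`, `forward`, and the 503
fallback are hypothetical names and choices, not Marathon or Mesos APIs.

```python
from typing import Callable, Optional, Tuple

Response = Tuple[int, str]  # hypothetical (status code, body)


def handle_request(
    path: str,
    is_leader: bool,
    leader_url: Optional[str],
    serve_locally: Callable[[str], Response],
    forward: Callable[[str], Response],
) -> Response:
    """Serve `path` if leading, otherwise proxy the request to the leader."""
    if is_leader:
        return serve_locally(path)
    if leader_url is None:
        # Assumed fallback: no leader is known, so say so explicitly.
        return 503, "no leading master is known"
    # The client never sees a redirect; it just receives the leader's answer.
    return forward(leader_url + path)
```

The client-facing difference from a redirect is that the extra hop happens
inside the cluster rather than in the client.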

On Thu, Jan 7, 2016 at 7:53 AM, Adam Bordelon <ad...@mesosphere.io> wrote:

> Alright, let's revive it. I think we previously had problems trying to
> write a multi-master unit test. Might have to do some test infrastructure
> work to make that possible.
>
> On Wed, Jan 6, 2016 at 1:14 PM, Neil Conway <ne...@gmail.com> wrote:
>
> > +1 -- I think we should make this change. The current behavior is
> > quite dangerous.
> >
> > Neil

Re: [MESOS-1865] Redirect to the leader master when current master is not a leader.

Posted by Adam Bordelon <ad...@mesosphere.io>.

Alright, let's revive it. I think we previously had problems trying to
write a multi-master unit test. Might have to do some test infrastructure
work to make that possible.

On Wed, Jan 6, 2016 at 1:14 PM, Neil Conway <ne...@gmail.com> wrote:

> +1 -- I think we should make this change. The current behavior is
> quite dangerous.
>
> Neil

Re: [MESOS-1865] Redirect to the leader master when current master is not a leader.

Posted by Neil Conway <ne...@gmail.com>.

+1 -- I think we should make this change. The current behavior is
quite dangerous.

Neil

Re: [MESOS-1865] Redirect to the leader master when current master is not a leader.

Posted by Marco Massenzio <m....@gmail.com>.

+1
(my two cents: the “correct” approach from an operations viewpoint is to first query for the leader, then ask the leader. The shortcoming Ben identified is obvious, but it is possibly the lesser of the two evils, and probably unavoidable in a distributed system without atomic transactions, which I don’t think anyone on this list would advocate for?)

Thanks to the Benjamin(s) for (finally) giving a name to something I have encountered often :)
(I used to informally call it “the A-B problem” - your naming is definitely more compelling!)

> On Jan 8, 2016, at 12:29 PM, Benjamin Mahler <bm...@apache.org> wrote:
> 
> Some feedback on this ticket: it focuses on the solution rather than the
> problem. We generally want to avoid this, I guess it's been coined 'The XY
> Problem' (thanks Benjamin Bannier). In this case it turns out that there
> are actually 2 distinct problems that the user is facing:
> 
> (1) Passive masters return information in some endpoints that can be
> interpreted as incorrect. A passive master does not know the list of tasks,
> for example, and so returning an empty list is less accurate than
> expressing that no response is possible.
> 
> (2) It is difficult to reliably obtain cluster state through the existing
> endpoints. This one is less clear to me than the first problem. Here we
> have to think through how we want users to be hitting state endpoints. Do
> they hit all the masters and take the first valid response? Do they first
> ask for the leader, then query the leader? Both of these have races (the
> first case has an issue that the requests are not atomic, you may receive
> two valid responses ; the second case the leader information may become
> stale before the second request). Do we add redirects? Even redirects have
> issues, there may be multiple redirects, there may be a redirect to a
> master that is unable to redirect further (and so we haven't really solved
> the race difficulties with redirects).
> 
> The point is, it looks like we can easily solve (1), but (2) warrants more
> thought and will be easier to assess with the problem well understood.

Re: [MESOS-1865] Redirect to the leader master when current master is not a leader.

Posted by Benjamin Mahler <bm...@apache.org>.

We should add the "who-is-the-current-leader" informational endpoint
regardless of whether we do redirection, no?

Will it be clear which endpoints should redirect? Seems the redirection
approach, if we were to do it, needs to be specified explicitly by the
user. Otherwise it may be confusing for users that some endpoints redirect
and some do not.

On Fri, Jan 8, 2016 at 12:47 PM, Neil Conway <ne...@gmail.com> wrote:

> On Fri, Jan 8, 2016 at 12:29 PM, Benjamin Mahler <bm...@apache.org> wrote:
> > (2) It is difficult to reliably obtain cluster state through the existing
> > endpoints. This one is less clear to me than the first problem. Here we
> > have to think through how we want users to be hitting state endpoints. Do
> > they hit all the masters and take the first valid response? Do they first
> > ask for the leader, then query the leader? Both of these have races (the
> > first case has an issue that the requests are not atomic, you may receive
> > two valid responses ; the second case the leader information may become
> > stale before the second request). Do we add redirects? Even redirects have
> > issues, there may be multiple redirects, there may be a redirect to a
> > master that is unable to redirect further (and so we haven't really solved
> > the race difficulties with redirects).
>
> I believe the proposed behavior is:
>
> * Clients can query any master
> * Endpoint queries against a non-leading master result in redirects to
> the current leader
>
> If the client follows a redirect to a different master, it may get
> redirected one or more times; it might also be unable to reach the
> current leader, or the queried master might be unable to determine the
> current leader. That seems like quite reasonable behavior to me,
> though (and technically I would argue that these situations aren't
> really "races" -- the client just needs to recognize that as in any
> distributed system, the information it observes might be stale).
>
> We could alternatively introduce a "who-is-the-current-leader"
> endpoint (which is something people have asked for [1]). As long as
> non-leading masters notify clients that they aren't talking to a
> leader (e.g., by returning a 403/503 error), that should also avoid
> races.
>
> Neil
>
> [1] https://issues.apache.org/jira/browse/MESOS-3841
>

Re: [MESOS-1865] Redirect to the leader master when current master is not a leader.

Posted by Neil Conway <ne...@gmail.com>.

On Fri, Jan 8, 2016 at 12:29 PM, Benjamin Mahler <bm...@apache.org> wrote:
> (2) It is difficult to reliably obtain cluster state through the existing
> endpoints. This one is less clear to me than the first problem. Here we
> have to think through how we want users to be hitting state endpoints. Do
> they hit all the masters and take the first valid response? Do they first
> ask for the leader, then query the leader? Both of these have races (the
> first case has an issue that the requests are not atomic, you may receive
> two valid responses ; the second case the leader information may become
> stale before the second request). Do we add redirects? Even redirects have
> issues, there may be multiple redirects, there may be a redirect to a
> master that is unable to redirect further (and so we haven't really solved
> the race difficulties with redirects).

I believe the proposed behavior is:

* Clients can query any master
* Endpoint queries against a non-leading master result in redirects to
the current leader

If the client follows a redirect to a different master, it may get
redirected one or more times; it might also be unable to reach the
current leader, or the queried master might be unable to determine the
current leader. That seems like quite reasonable behavior to me,
though (and technically I would argue that these situations aren't
really "races" -- the client just needs to recognize that as in any
distributed system, the information it observes might be stale).
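
To illustrate the client side of this proposal, here is a minimal Python
sketch of following redirects with a hop limit. It is not Mesos code: the
`fetch` callable, the `(status, location, body)` response shape, and the set
of redirect status codes are all assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple

# Hypothetical response shape: (status code, Location header or None, body).
Response = Tuple[int, Optional[str], str]

REDIRECT_CODES = (301, 302, 307, 308)


def query_following_redirects(
    fetch: Callable[[str], Response], url: str, max_hops: int = 3
) -> str:
    """Return the body from `url`, following at most `max_hops` redirects."""
    hops = 0
    while True:
        status, location, body = fetch(url)
        if status == 200:
            return body
        if status in REDIRECT_CODES and location:
            if hops == max_hops:
                raise RuntimeError(f"gave up after {max_hops} redirects")
            hops += 1
            url = location  # the queried master's view of the current leader
            continue
        # Neither a success nor a usable redirect: surface the failure.
        raise RuntimeError(f"{url} returned status {status}")
```

The hop limit is what turns "redirected one or more times" into a bounded
operation: the client either reaches a master that answers or fails
explicitly, and it still has to treat the answer as possibly stale.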

We could alternatively introduce a "who-is-the-current-leader"
endpoint (which is something people have asked for [1]). As long as
non-leading masters notify clients that they aren't talking to a
leader (e.g., by returning a 403/503 error), that should also avoid
races.
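
A rough sketch of this alternative from the client's point of view, again in
Python and again hypothetical: `ask_leader` stands in for the proposed
"who-is-the-current-leader" endpoint, `fetch_state` for any state endpoint,
and a non-200 status for the 403/503 a non-leading master would return. None
of these names are existing Mesos APIs.

```python
from typing import Callable, Optional, Sequence, Tuple


def query_via_leader_lookup(
    masters: Sequence[str],
    ask_leader: Callable[[str], Optional[str]],
    fetch_state: Callable[[str], Tuple[int, str]],
    attempts: int = 3,
) -> str:
    """Look up the leader, then query it; retry if the answer was stale."""
    for _ in range(attempts):
        # Ask each master in turn until one can name a leader.
        leader = next(
            (found for found in map(ask_leader, masters) if found is not None),
            None,
        )
        if leader is None:
            continue  # no master could determine the leader on this attempt
        status, body = fetch_state(leader)
        if status == 200:
            return body
        # Assumed 403/503 from a non-leader: our leader information was
        # stale, so start over with a fresh lookup.
    raise RuntimeError("could not obtain state from a leading master")
```

The error status is what makes the stale-leader case detectable: without it,
the client could not tell an old leader's answer from a current one.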

Neil

[1] https://issues.apache.org/jira/browse/MESOS-3841

Re: [MESOS-1865] Redirect to the leader master when current master is not a leader.

Posted by Benjamin Mahler <bm...@apache.org>.

Some feedback on this ticket: it focuses on the solution rather than the
problem. We generally want to avoid this, I guess it's been coined 'The XY
Problem' (thanks Benjamin Bannier). In this case it turns out that there
are actually 2 distinct problems that the user is facing:

(1) Passive masters return information in some endpoints that can be
interpreted as incorrect. A passive master does not know the list of tasks,
for example, and so returning an empty list is less accurate than
expressing that no response is possible.
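
As a sketch of what "expressing that no response is possible" could look
like, the hypothetical handler below returns an error status from a
non-leading master instead of an empty task list. The function name, the
response shape, and the 503 status are assumptions for illustration, not the
actual Mesos implementation.

```python
from typing import Any, Dict, List, Tuple


def tasks_endpoint(is_leader: bool, tasks: List[str]) -> Tuple[int, Dict[str, Any]]:
    """Answer a task-list request; only the leading master knows the tasks."""
    if not is_leader:
        # An empty list would look like "no tasks are running", which is wrong.
        return 503, {"error": "this master is not the leader"}
    return 200, {"tasks": tasks}
```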

(2) It is difficult to reliably obtain cluster state through the existing
endpoints. This one is less clear to me than the first problem. Here we
have to think through how we want users to be hitting state endpoints. Do
they hit all the masters and take the first valid response? Do they first
ask for the leader, then query the leader? Both of these have races: in the
first case the requests are not atomic, so you may receive two valid
responses; in the second case the leader information may become stale
before the second request. Do we add redirects? Even redirects have issues:
there may be multiple redirects, or there may be a redirect to a master
that is unable to redirect further (and so we haven't really solved the
race difficulties with redirects).

The point is, it looks like we can easily solve (1), but (2) warrants more
thought and will be easier to assess with the problem well understood.
