Posted to user@mesos.apache.org by Philippe Laflamme <ph...@hopper.com> on 2015/07/02 23:10:33 UTC

Mesos Slave Port Change Fails Recovery

Hi,

I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves
configured with checkpointing and with "reconnect" recovery.

I was investigating why the slaves would successfully re-register with the
master and recover, but would subsequently be asked to shut down ("health
check timeout").

It turns out that our slaves had been unintentionally configured to use
port 5050 in the previous configuration. We decided to fix that during the
upgrade and have them use the default 5051 port.

This change seems to make the health checks fail, and the slave is
eventually killed due to inactivity.

I've confirmed that leaving the port at what it was in the previous
configuration lets the slave re-register successfully; it is not asked to
shut down later on.

Is this a known issue? I haven't been able to find a JIRA ticket for this.
Maybe it's the expected behaviour? Should I create a ticket?

Thanks,
Philippe

Re: Mesos Slave Port Change Fails Recovery

Posted by Philippe Laflamme <ph...@hopper.com>.
Awesome!

We've reverted to the previous port and all our slaves have recovered
nicely.

Thanks for looking into this,
Philippe

On Fri, Jul 3, 2015 at 3:27 PM, Vinod Kone <vi...@gmail.com> wrote:

Re: Mesos Slave Port Change Fails Recovery

Posted by Vinod Kone <vi...@gmail.com>.
Looks like this is due to a bug in versions < 0.23.0, where slave recovery
didn't check for changes in 'port' when considering compatibility
<https://github.com/apache/mesos/blob/0.21.0/src/common/type_utils.cpp#L137>.
It has since been fixed in the upcoming 0.23.0 release.

On Thu, Jul 2, 2015 at 8:45 PM, Philippe Laflamme <ph...@hopper.com>
wrote:

Re: Mesos Slave Port Change Fails Recovery

Posted by Philippe Laflamme <ph...@hopper.com>.
Checkpointing has been enabled since 0.18 on these slaves. The only other
setting that changed during the upgrade was that we added --gc_delay=1days.
Otherwise, it's an in-place upgrade without any changes to the work
directory...

Philippe

On Thu, Jul 2, 2015 at 8:59 PM, Vinod Kone <vi...@gmail.com> wrote:

Re: Mesos Slave Port Change Fails Recovery

Posted by Vinod Kone <vi...@gmail.com>.
It is surprising that the slave didn't bail out during the initial phase of
recovery when the port changed. I'm assuming you enabled checkpointing in
0.20.0 and that you didn't wipe the metadata directory or anything when
upgrading to 0.21.0?

On Thu, Jul 2, 2015 at 3:06 PM, Philippe Laflamme <ph...@hopper.com>
wrote:

Re: Mesos Slave Port Change Fails Recovery

Posted by Philippe Laflamme <ph...@hopper.com>.
Here you are:

https://gist.github.com/plaflamme/9cd056dc959e0597fb1c

You can see in the mesos-master.INFO log that it re-registers the slave
using port :5050 (line 9) and fails the health checks on port :5051 (line
10). So it might be the slave that re-uses the old configuration?

Thanks,
Philippe

On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone <vi...@gmail.com> wrote:

Re: Mesos Slave Port Change Fails Recovery

Posted by Vinod Kone <vi...@gmail.com>.
Can you paste some logs?

On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme <ph...@hopper.com>
wrote:

Re: Mesos Slave Port Change Fails Recovery

Posted by Philippe Laflamme <ph...@hopper.com>.
Ok, that's reasonable, but I'm not sure why it would successfully
re-register with the master if it's not supposed to in the first place. I
think changing the resources (for example) will dump the old configuration
in the logs and tell you why recovery is bailing out. It's not doing that
in this case.

It looks as though this doesn't work only because the master can't ping the
slave on the old port; the whole recovery process was successful
otherwise.

I'm not sure if the slave could have picked up its configuration change and
failed the recovery early, but that would definitely be a better experience.

Philippe

On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone <vi...@gmail.com> wrote:

Re: Mesos Slave Port Change Fails Recovery

Posted by Vinod Kone <vi...@gmail.com>.
For slave recovery to work, the slave is expected not to change its config.

On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme <ph...@hopper.com>
wrote:
