Posted to user@mesos.apache.org by Chengwei Yang <ch...@gmail.com> on 2015/11/24 02:47:16 UTC

Is it safe to replace mesos-master in fly

Hi all,

We're running Mesos in production on CentOS 6 and plan to upgrade to CentOS 7.1.
To avoid affecting any tasks running on Mesos, we're going to replace all
mesos-masters on the fly.

The procedure is listed below:

0. Start with 3 mesos-masters running on CentOS 6.
1. Shut down 1 mesos-master (CentOS 6) and bring up 1 mesos-master (CentOS 7),
   then wait for the new master to sync (is there a simple way to know when it has?).
2. Repeat step 1.

NOTE: we plan to shut down the non-leaders first, and shut down the leader
(CentOS 6) last.

Can we do it this way? Or are there any better suggestions?
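
For reference, here is a small sketch of the arithmetic behind the plan above. It assumes the masters' `--quorum` is set to a strict majority of the master count (2 for 3 masters); the host names and helper functions are hypothetical and only illustrate the ordering "non-leaders first, leader last".

```python
def majority_quorum(num_masters):
    """Smallest strict majority of num_masters."""
    return num_masters // 2 + 1


def replacement_order(masters, leader):
    """Non-leaders first (in the given order), current leader last."""
    if leader not in masters:
        raise ValueError("leader must be one of the masters")
    return [m for m in masters if m != leader] + [leader]


def quorum_survives_one_down(num_masters):
    """True if a majority is still up while one master is being replaced."""
    return num_masters - 1 >= majority_quorum(num_masters)


masters = ["master-a", "master-b", "master-c"]  # hypothetical host names
print(majority_quorum(len(masters)))
print(quorum_survives_one_down(len(masters)))
print(replacement_order(masters, leader="master-b"))
```

With 3 masters only one can be missing at a time, which is presumably why each new master should finish syncing before the next old one is shut down.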

-- 
Thanks,
Chengwei

Re: Is it safe to replace mesos-master in fly

Posted by Neil Conway <ne...@gmail.com>.
On Tue, Nov 24, 2015 at 3:38 PM, Marco Massenzio <ma...@mesosphere.io> wrote:
> The closest I could find is [0], but granted, much more detail could be
> desirable :)

Agreed! See also https://issues.apache.org/jira/browse/MESOS-3995

Neil

Re: Is it safe to replace mesos-master in fly

Posted by Chengwei Yang <ch...@gmail.com>.
On Tue, Nov 24, 2015 at 03:38:56PM -0800, Marco Massenzio wrote:
> The closest I could find is [0], but granted, much more detail could be
> desirable :)
> FYI - you may also want to check out the Maintenance Primitives [1] and
> upgrades [2] (which is actually not directly applicable to your stated use
> case, but may be of interest for future reference).
> 
> In any event, you're doing it right.
> As for the "reasonable time to wait" - I'm afraid I don't really have a good
> feel for it: probably keeping a eye on the logs may help, but I'm sure other
> folks on this list will have a much more satisfying answer.
> 
> Let us know how you go along, and if you want to contribute back to documenting
> how you did it, contributions always welcome!
> 
> [0] http://mesos.apache.org/documentation/latest/operational-guide/
> [1] http://mesos.apache.org/documentation/latest/maintenance/
> [2] http://mesos.apache.org/documentation/latest/upgrades/

Thanks Marco, these documents are quite helpful!

-- 
Thanks,
Chengwei

Re: Is it safe to replace mesos-master in fly

Posted by Marco Massenzio <ma...@mesosphere.io>.
The closest I could find is [0], but granted, much more detail could be
desirable :)
FYI - you may also want to check out the Maintenance Primitives [1] and
upgrades [2] (which is actually not directly applicable to your stated use
case, but may be of interest for future reference).

In any event, you're doing it right.
As for the "reasonable time to wait" - I'm afraid I don't really have a
good feel for it: keeping an eye on the logs will probably help, but I'm sure
other folks on this list will have a much more satisfying answer.

Let us know how you get along, and if you want to contribute back by
documenting how you did it, contributions are always welcome!

[0] http://mesos.apache.org/documentation/latest/operational-guide/
[1] http://mesos.apache.org/documentation/latest/maintenance/
[2] http://mesos.apache.org/documentation/latest/upgrades/

--
*Marco Massenzio*
Distributed Systems Engineer
http://codetrips.com


Re: Is it safe to replace mesos-master in fly

Posted by Chengwei Yang <ch...@gmail.com>.
Thanks @Tommy,

I couldn't find any official documentation about migrating mesos-master or
resizing the mesos-master quorum, so I came here to confirm before anything
I missed surprises me. :-)

-- 
Thanks,
Chengwei


Re: Is it safe to replace mesos-master in fly

Posted by tommy xiao <xi...@gmail.com>.
This is the correct way to upgrade your Mesos cluster; for more details, see
the Mesos documentation and release notes.


-- 
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com

Re: Is it safe to replace mesos-master in fly

Posted by "Du, Fan" <fa...@intel.com>.

On 2015/11/25 13:57, Joe Smith wrote:
> In retrospect I should've (might still be able to one of these days)
> open sourced the tool we used to migrate mesos masters. That said,
> overall the process suggested so far is correct.
>
> To validate the new host joining, you can tail the master log file for
> "Successfully joined the Paxos group
> <https://github.com/apache/mesos/blob/3539b7a0e15b594148308319bf052d28b1429b98/src/log/recover.cpp#L578>"
> to confirm the replicated log recovery has completed for that machine.
> Once that happens, feel free to move onto adding the next host.

Why does the upgraded master have to recover after a reboot?
By my understanding, after a reboot the master will first *detect* who
the elected leader is;
if there is a valid one, it will then *contend* by grabbing a new ID and
"Joining the ZK group".

https://github.com/apache/mesos/blob/3539b7a0e15b594148308319bf052d28b1429b98/src/zookeeper/contender.cpp#L147



Re: Is it safe to replace mesos-master in fly

Posted by "Du, Fan" <fa...@intel.com>.

On 2015/11/25 13:57, Joe Smith wrote:
> In retrospect I should've (might still be able to one of these days)
> open sourced the tool we used to migrate mesos masters.
By the way, that would be great!
I will give this a try.

> overall the process suggested so far is correct.
>
> To validate the new host joining, you can tail the master log file for
> "Successfully joined the Paxos group
> <https://github.com/apache/mesos/blob/3539b7a0e15b594148308319bf052d28b1429b98/src/log/recover.cpp#L578>"
> to confirm the replicated log recovery has completed for that machine.
> Once that happens, feel free to move onto adding the next host.

Re: Is it safe to replace mesos-master in fly

Posted by Joe Smith <ya...@gmail.com>.
In retrospect I should've (might still be able to one of these days) open
sourced the tool we used to migrate mesos masters. That said, overall the
process suggested so far is correct.

To validate the new host joining, you can tail the master log file for
"Successfully joined the Paxos group"
<https://github.com/apache/mesos/blob/3539b7a0e15b594148308319bf052d28b1429b98/src/log/recover.cpp#L578>
to confirm the replicated log recovery has completed for that machine. Once
that happens, feel free to move on to adding the next host.
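
The log check described above could be scripted roughly as follows. This is only a sketch, not the migration tool mentioned earlier: the function name is made up for illustration, and the sample lines are the ones posted elsewhere in this thread.

```python
import io

# Message logged once replicated log recovery has completed on a master.
JOIN_MARKER = "Successfully joined the Paxos group"


def has_joined_paxos_group(log_lines):
    """Return True if any of the given log lines contains the join message."""
    return any(JOIN_MARKER in line for line in log_lines)


# Any iterable of lines works, e.g. an open log file; StringIO stands in here.
sample_log = io.StringIO(
    "I1125 18:27:33.743413  2490 replica.cpp:320] Persisted replica status to VOTING\n"
    "I1125 18:27:33.743522  2490 recover.cpp:568] Successfully joined the Paxos group\n"
)
print(has_joined_paxos_group(sample_log))
```

In practice one would pass the open master log file (its path depends on the `--log_dir` setting) and poll until the function returns True.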


Re: Is it safe to replace mesos-master in fly

Posted by "Du, Fan" <fa...@intel.com>.

On 2015/11/24 9:47, Chengwei Yang wrote:
> Hi all,
>
> We're using mesos in product on CentOS 6 and plan to upgrade CentOS to 7.1, to
> avoid affect any tasks running on mesos. We're about to replace all
> mesos-masters in fly.
>
> The procedure listed below:
>
> 0. 3 mesos-masters running on CentOS 6
> 1. shutdown 1 mesos-master(CentOS 6) and bring up 1 mesos-master(CentOS 7)
>     wait the new master synced for some time(is there any simple way to know when?)

Log in to the web UI of the upgraded master; it will redirect you to the
leading master if the upgraded master has joined the cluster successfully.

Or, for a script-friendly way, you can query the ZooKeeper server on the
upgraded master host for its role with:

echo stat | nc UPGRADED_MASTER_ZOOKEEPER_IP 2181

it will report something like this:

Latency min/avg/max: 0/0/6
Received: 105
Sent: 105
Connections: 2
Outstanding: 0
Zxid: 0x30000001c
Mode: follower
Node count: 13
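
The same query can be sketched in Python for use in a script. Note that `stat` reports the role of the ZooKeeper server itself, so this assumes ZooKeeper runs on the upgraded master host; newer ZooKeeper releases may also require the four-letter commands to be explicitly enabled. The function names are illustrative only.

```python
import socket


def query_zk_stat(host, port=2181, timeout=5.0):
    """Send the four-letter 'stat' command and return the raw reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stat")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_zk_mode(stat_output):
    """Extract the value of the 'Mode:' line, or None if it is absent."""
    for line in stat_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Mode":
            # Strip '*' in case the text was copied from a bold-formatted mail.
            return value.strip().strip("*")
    return None


sample = "Latency min/avg/max: 0/0/6\nZxid: 0x30000001c\nMode: follower\nNode count: 13\n"
print(parse_zk_mode(sample))
```

Against a live server this would be called as `parse_zk_mode(query_zk_stat(host))`, where `host` is the ZooKeeper address of the upgraded machine.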


> 2. repeat step 1
>
> NOTE: we plan to shutdown non-leader first, and shutdown the leader(CentOS 6)
> last.
>
> Can we do this in such way? Or any other better suggestions?
>

Re: Is it safe to replace mesos-master in fly

Posted by Chengwei Yang <ch...@gmail.com>.
JFYI.

I finished the mesos-master migration and everything works fine, as expected.

-- 
Thanks,
Chengwei



Re: Is it safe to replace mesos-master in fly

Posted by Chengwei Yang <ch...@gmail.com>.
OOPS,

We forgot to disable firewalld on the new CentOS 7 VM. Once firewalld was
disabled, replication finished in seconds.

See the log below.

```
I1125 18:27:33.737843  2490 replica.cpp:369] Replica ignoring promise request as it is in RECOVERING status
I1125 18:27:33.740927  2489 replica.cpp:655] Replica received learned notice for position 984
I1125 18:27:33.741539  2489 leveldb.cpp:343] Persisting action (20 bytes) to leveldb took 572913ns
I1125 18:27:33.741628  2489 replica.cpp:676] Persisted action at 984
I1125 18:27:33.741673  2489 replica.cpp:661] Replica learned TRUNCATE action at position 984
I1125 18:27:33.742463  2491 recover.cpp:554] Updating replica status to VOTING
I1125 18:27:33.743335  2490 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 739871ns
I1125 18:27:33.743413  2490 replica.cpp:320] Persisted replica status to VOTING
I1125 18:27:33.743522  2490 recover.cpp:568] Successfully joined the Paxos group
I1125 18:27:33.743727  2490 recover.cpp:452] Recover process terminated
```
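
A quick reachability check between the masters can catch a blocking firewall like this before waiting on the recover protocol. The sketch below only tests whether a TCP connection can be opened; it assumes the default mesos-master port 5050, and the host name in the usage comment is hypothetical.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


# Example (hypothetical host name), run from one of the existing masters:
#   can_connect("new-master.example.com", 5050)
```

Running the check in both directions (old master to new, and new to old) is worthwhile, since a host firewall typically filters inbound traffic only.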

-- 
Thanks,
Chengwei




Re: Is it safe to replace mesos-master in fly

Posted by Chengwei Yang <ch...@gmail.com>.
Meanwhile, the other 2 mesos-masters (one leader and one follower) both keep
repeating the log below.

I1125 18:06:33.315208 28401 replica.cpp:638] Replica in VOTING status received a broadcasted recover request
I1125 18:06:43.316341 28404 replica.cpp:638] Replica in VOTING status received a broadcasted recover request
I1125 18:06:53.318739 28399 replica.cpp:638] Replica in VOTING status received a broadcasted recover request
I1125 18:07:03.321287 28403 replica.cpp:638] Replica in VOTING status received a broadcasted recover request

It seems the new mesos-master cannot catch up and keeps retrying. Is this a
bug?

I'm using mesos-0.21.0 on CentOS 7, the vanilla RPM released by Mesosphere.

-- 
Thanks,
Chengwei





Re: Is it safe to replace mesos-master in fly

Posted by Chengwei Yang <ch...@gmail.com>.
Hi All,

I did step 1 below and checked the logs of the newly started mesos-master; it
keeps complaining as shown below.

```
I1125 17:42:59.066706  2330 recover.cpp:188] Received a recover response from a replica in EMPTY status
I1125 17:43:09.065188  2331 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
I1125 17:43:09.066992  2330 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
I1125 17:43:09.067425  2324 recover.cpp:188] Received a recover response from a replica in EMPTY status
I1125 17:43:19.067332  2331 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
I1125 17:43:19.069587  2323 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
I1125 17:43:19.069807  2323 recover.cpp:188] Received a recover response from a replica in EMPTY status
```

It seems it cannot catch up with the other replicas?
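
To spot this retry loop automatically, the log lines can be parsed with a short script. The sketch below assumes the usual glog line layout (severity letter and date, time, thread id, `file:line]`, message) seen in the excerpt above; the record type and function names are made up for illustration.

```python
import re
from collections import namedtuple

GlogRecord = namedtuple(
    "GlogRecord", "severity seconds_of_day thread_id source line message"
)

# Example: I1125 17:43:09.065188  2331 recover.cpp:111] message text
_GLOG_RE = re.compile(
    r"^([IWEF])\d{4} (\d{2}):(\d{2}):(\d{2})\.(\d{6})\s+(\d+) ([^\s:]+):(\d+)\] (.*)$"
)

RETRY_MARKER = "Unable to finish the recover protocol"


def parse_glog_line(text):
    """Parse one glog-style line; return None if it does not match."""
    m = _GLOG_RE.match(text.strip())
    if m is None:
        return None
    sev, hh, mm, ss, us, tid, src, lineno, msg = m.groups()
    seconds = int(hh) * 3600 + int(mm) * 60 + int(ss) + int(us) / 1e6
    return GlogRecord(sev, seconds, int(tid), src, int(lineno), msg)


def retry_intervals(log_lines):
    """Seconds between consecutive 'recover protocol' retry messages."""
    times = []
    for text in log_lines:
        rec = parse_glog_line(text)
        if rec is not None and RETRY_MARKER in rec.message:
            times.append(rec.seconds_of_day)
    return [later - earlier for earlier, later in zip(times, times[1:])]


sample = [
    "I1125 17:43:09.065188  2331 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying",
    "I1125 17:43:09.067425  2324 recover.cpp:188] Received a recover response from a replica in EMPTY status",
    "I1125 17:43:19.067332  2331 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying",
]
print(len(retry_intervals(sample)) + 1, "retries seen")
```

A steady stream of retries roughly 10 seconds apart, with responses only from replicas in EMPTY status, suggests the new master is hearing only from itself and not from the existing replicas.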

-- 
Thanks,
Chengwei

On Tue, Nov 24, 2015 at 09:47:16AM +0800, Chengwei Yang wrote:
> Hi all,
> 
> We're using mesos in product on CentOS 6 and plan to upgrade CentOS to 7.1, to
> avoid affect any tasks running on mesos. We're about to replace all
> mesos-masters in fly.
> 
> The procedure listed below:
> 
> 0. 3 mesos-masters running on CentOS 6
> 1. shutdown 1 mesos-master(CentOS 6) and bring up 1 mesos-master(CentOS 7)
>    wait the new master synced for some time(is there any simple way to know when?)
> 2. repeat step 1
> 
> NOTE: we plan to shutdown non-leader first, and shutdown the leader(CentOS 6)
> last.
> 
> Can we do this in such way? Or any other better suggestions?
> 
> -- 
> Thanks,
> Chengwei