Posted to issues@mesos.apache.org by "Stefano (JIRA)" <ji...@apache.org> on 2016/04/14 17:05:25 UTC

[jira] [Commented] (MESOS-5207) Mesos Masters Leader Keeps Fluctuating

    [ https://issues.apache.org/jira/browse/MESOS-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241325#comment-15241325 ] 

Stefano commented on MESOS-5207:
--------------------------------

Hi all,
Today I tried to set up two Mesos clusters where the single master on one network joins the group of the other two masters on the other network. Let me explain:
I'm working on OpenStack, where I have built some virtual machines and two different networks.
I have set up two Mesos clusters:

NetworkA:
2 Mesos masters
2 Mesos slaves

NetworkB:
1 Mesos master
1 Mesos slave

I am trying to make an interconnection between these two clusters.

I have set the ZooKeeper configuration so that all 3 masters compete for the leadership. Here are the main configurations:

NetworkA, on both masters:
/etc/zookeeper/conf/zoo.cfg, at the end of the file:

server.1=192.168.100.54:2888:3888 (master1 on network A)

server.2=192.168.100.55:2888:3888 (master2 on network A)

server.3=131.154.xxx.xxx:2888:3888 (master3 on network B, I have set a floating IP)

/etc/mesos/zk:

zk://192.168.100.54:2181,192.168.100.55:2181,131.154.xxx.xxx:2181/mesos
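
For completeness, the rest of my zoo.cfg is basically the stock packaged file; a rough sketch would be the following (the tickTime/initLimit/syncLimit/dataDir/clientPort values are the usual defaults and only indicative, the server.N lines are the ones I actually added):

# usual stock values, shown only as an indication
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# my added ensemble members
server.1=192.168.100.54:2888:3888
server.2=192.168.100.55:2888:3888
server.3=131.154.xxx.xxx:2888:3888
# each node also has its own id (1, 2 or 3) in its myid file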

NetworkB:

/etc/zookeeper/conf/zoo.cfg, at the end of the file:

server.1=131.154.96.27:2888:3888 (master1 on network A, I have set a floating IP)

server.2=131.154.96.32:2888:3888 (master2 on network A, I have set a floating IP)

server.3=192.168.10.11:2888:3888 (master3 on network B)



/etc/mesos/zk:

zk://131.154.zzz.zzz:2181,131.154.yyy.yyy:2181,192.168.10.11:2181/mesos
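
To check that the three ZooKeeper nodes really form one ensemble, a simple test is to ask each node for its mode with the 'stat' four-letter command (a generic check, not output from my machines):

echo stat | nc 192.168.10.11 2181 | grep Mode
# expected: "Mode: leader" on exactly one node, "Mode: follower" on the other two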


I notice this problem:
First of all, these 3 masters work as if they were in the same cluster, so if I take one of them down, with quorum 2, a re-election takes place.
The problem is that after a while, more or less 1 minute, the current leader disconnects and another master takes the leadership.
Here is the log of the master on network B:

Log file created at: 2016/04/14 15:02:18
Running on machine: master3.novalocal
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0414 15:02:18.447484 20410 logging.cpp:188] INFO level logging started!
I0414 15:02:18.447836 20410 main.cpp:230] Build: 2016-03-10 20:32:58 by root
I0414 15:02:18.447854 20410 main.cpp:232] Version: 0.27.2
I0414 15:02:18.447865 20410 main.cpp:235] Git tag: 0.27.2
I0414 15:02:18.447876 20410 main.cpp:239] Git SHA: 3c9ec4a0f34420b7803848af597de00fedefe0e2
I0414 15:02:18.447931 20410 main.cpp:253] Using 'HierarchicalDRF' allocator
I0414 15:02:18.483774 20410 leveldb.cpp:174] Opened db in 35.734219ms
I0414 15:02:18.505858 20410 leveldb.cpp:181] Compacted db in 22.032139ms
I0414 15:02:18.505903 20410 leveldb.cpp:196] Created db iterator in 7982ns
I0414 15:02:18.505930 20410 leveldb.cpp:202] Seeked to beginning of db in 668ns
I0414 15:02:18.505939 20410 leveldb.cpp:271] Iterated through 0 keys in the db in 470ns
I0414 15:02:18.505988 20410 replica.cpp:779] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I0414 15:02:18.506793 20410 main.cpp:464] Starting Mesos master
I0414 15:02:18.507874 20410 master.cpp:374] Master de75d47e-1791-4ab7-ac13-7c927873b035 (131.154.96.156) started on 192.168.10.11:5050
I0414 15:02:18.507890 20410 master.cpp:376] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_http="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname="131.154.96.156" --hostname_lookup="true" --http_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="2" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" --zk="zk://131.154.96.27:2181,131.154.96.32:2181,192.168.10.11:2181/mesos" --zk_session_timeout="10secs"
I0414 15:02:18.508060 20410 master.cpp:423] Master allowing unauthenticated frameworks to register
I0414 15:02:18.508070 20410 master.cpp:428] Master allowing unauthenticated slaves to register
I0414 15:02:18.508097 20410 master.cpp:466] Using default 'crammd5' authenticator
W0414 15:02:18.508111 20410 authenticator.cpp:511] No credentials provided, authentication requests will be refused
I0414 15:02:18.508291 20410 authenticator.cpp:518] Initializing server SASL
I0414 15:02:18.509346 20426 log.cpp:236] Attempting to join replica to ZooKeeper group
I0414 15:02:18.510659 20430 recover.cpp:447] Starting replica recovery
I0414 15:02:18.517371 20431 recover.cpp:473] Replica is in EMPTY status
I0414 15:02:18.518949 20429 master.cpp:1649] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I0414 15:02:18.518971 20429 contender.cpp:147] Joining the ZK group
I0414 15:02:18.541162 20429 group.cpp:349] Group process (group(3)@192.168.10.11:5050) connected to ZooKeeper
I0414 15:02:18.541213 20429 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0)
I0414 15:02:18.541229 20429 group.cpp:427] Trying to create path '/mesos' in ZooKeeper
I0414 15:02:18.543774 20425 group.cpp:349] Group process (group(1)@192.168.10.11:5050) connected to ZooKeeper
I0414 15:02:18.543800 20425 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0414 15:02:18.543810 20425 group.cpp:427] Trying to create path '/mesos/log_replicas' in ZooKeeper
I0414 15:02:18.545526 20426 group.cpp:349] Group process (group(4)@192.168.10.11:5050) connected to ZooKeeper
I0414 15:02:18.545588 20426 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0414 15:02:18.545627 20426 group.cpp:427] Trying to create path '/mesos' in ZooKeeper
I0414 15:02:18.551719 20424 group.cpp:349] Group process (group(2)@192.168.10.11:5050) connected to ZooKeeper
I0414 15:02:18.551811 20424 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0)
I0414 15:02:18.551832 20424 group.cpp:427] Trying to create path '/mesos/log_replicas' in ZooKeeper
I0414 15:02:18.553040 20426 detector.cpp:154] Detected a new leader: (id='69')
I0414 15:02:18.553306 20426 group.cpp:700] Trying to get '/mesos/json.info_0000000069' in ZooKeeper
I0414 15:02:18.553695 20425 network.hpp:413] ZooKeeper group memberships changed
I0414 15:02:18.553833 20425 group.cpp:700] Trying to get '/mesos/log_replicas/0000000066' in ZooKeeper
I0414 15:02:18.556457 20426 detector.cpp:479] A new leading master (UPID=master@192.168.100.54:5050) is detected
I0414 15:02:18.556591 20426 master.cpp:1710] The newly elected leader is master@192.168.100.54:5050 with id 32fd076d-e6cc-4fe0-acda-d5565bd98445
I0414 15:02:18.562369 20430 contender.cpp:263] New candidate (id='70') has entered the contest for leadership
I0414 15:02:18.563021 20425 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@192.168.100.54:5050 }
I0414 15:02:18.563916 20425 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (5)@192.168.10.11:5050
I0414 15:02:18.566625 20430 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0414 15:02:18.576733 20429 network.hpp:413] ZooKeeper group memberships changed
I0414 15:02:18.576817 20429 group.cpp:700] Trying to get '/mesos/log_replicas/0000000066' in ZooKeeper
I0414 15:02:18.578048 20429 group.cpp:700] Trying to get '/mesos/log_replicas/0000000067' in ZooKeeper
I0414 15:02:18.579957 20429 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@192.168.10.11:5050, log-replica(1)@192.168.100.54:5050 }
I0414 15:02:28.518209 20432 recover.cpp:109] Unable to finish the recover protocol in 10secs, retrying
I0414 15:02:28.518898 20429 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (10)@192.168.10.11:5050
I0414 15:02:28.518987 20429 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0414 15:02:38.519379 20432 recover.cpp:109] Unable to finish the recover protocol in 10secs, retrying
I0414 15:02:38.520006 20429 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (12)@192.168.10.11:5050
I0414 15:02:38.520128 20429 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0414 15:02:48.520406 20432 recover.cpp:109] Unable to finish the recover protocol in 10secs, retrying
I0414 15:02:48.521069 20429 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (14)@192.168.10.11:5050
I0414 15:02:48.521224 20429 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0414 15:02:50.335360 20429 http.cpp:501] HTTP GET for /master/state.json from 131.154.5.22:59543 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36 OPR/36.0.2130.46'
I0414 15:02:58.521517 20432 recover.cpp:109] Unable to finish the recover protocol in 10secs, retrying
I0414 15:02:58.522234 20429 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (20)@192.168.10.11:5050
I0414 15:02:58.522333 20429 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0414 15:03:08.522389 20432 recover.cpp:109] Unable to finish the recover protocol in 10secs, retrying
I0414 15:03:08.523116 20424 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (23)@192.168.10.11:5050
I0414 15:03:08.523236 20424 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0414 15:03:16.019850 20428 network.hpp:413] ZooKeeper group memberships changed
I0414 15:03:16.020007 20428 group.cpp:700] Trying to get '/mesos/log_replicas/0000000067' in ZooKeeper
I0414 15:03:16.024132 20427 detector.cpp:154] Detected a new leader: (id='70')
I0414 15:03:16.024277 20427 group.cpp:700] Trying to get '/mesos/json.info_0000000070' in ZooKeeper
I0414 15:03:16.024700 20428 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@192.168.10.11:5050 }
I0414 15:03:16.029292 20427 detector.cpp:479] A new leading master (UPID=master@192.168.10.11:5050) is detected
I0414 15:03:16.029399 20427 master.cpp:1710] The newly elected leader is master@192.168.10.11:5050 with id de75d47e-1791-4ab7-ac13-7c927873b035
I0414 15:03:16.029422 20427 master.cpp:1723] Elected as the leading master!
I0414 15:03:16.029444 20427 master.cpp:1468] Recovering from registrar
I0414 15:03:16.029558 20427 registrar.cpp:307] Recovering registrar
I0414 15:03:18.523638 20432 recover.cpp:109] Unable to finish the recover protocol in 10secs, retrying
I0414 15:03:26.609001 20428 network.hpp:413] ZooKeeper group memberships changed
I0414 15:03:26.609223 20428 group.cpp:700] Trying to get '/mesos/log_replicas/0000000067' in ZooKeeper
I0414 15:03:26.611070 20428 group.cpp:700] Trying to get '/mesos/log_replicas/0000000068' in ZooKeeper
I0414 15:03:26.612923 20428 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@192.168.10.11:5050, log-replica(1)@192.168.100.54:5050 }
I0414 15:03:26.613404 20428 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (28)@192.168.10.11:5050
I0414 15:03:26.613497 20428 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0414 15:03:28.524957 20432 recover.cpp:109] Unable to finish the recover protocol in 10secs, retrying
I0414 15:03:28.525674 20428 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (30)@192.168.10.11:5050
I0414 15:03:28.525764 20428 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0414 15:03:38.525599 20432 recover.cpp:109] Unable to finish the recover protocol in 10secs, retrying
I0414 15:03:38.526219 20428 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (32)@192.168.10.11:5050
I0414 15:03:38.526304 20428 recover.cpp:193] Received a recover response from a replica in EMPTY status
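
The line "Unable to finish the recover protocol in 10secs, retrying" keeps repeating, which I understand to mean that the replicated log on this master never completes recovery against a quorum of the other replicas. As a rough connectivity check across the two networks (just an illustrative test using the addresses above, not a real diagnosis), one can try:

# from master3 (NetworkB) toward the NetworkA masters via their floating IPs
nc -vz 131.154.96.27 5050
nc -vz 131.154.96.32 5050
# and from a NetworkA master toward master3's floating IP
nc -vz 131.154.96.156 5050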

I know that this is an unusual use of Mesos clusters, but this is exactly the aim of my thesis.

Thanks to all and best regards.

Stefano

> Mesos Masters Leader Keeps Fluctuating
> --------------------------------------
>
>                 Key: MESOS-5207
>                 URL: https://issues.apache.org/jira/browse/MESOS-5207
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: haosdent
>            Assignee: haosdent
>
> Report from user mailing list. [Mesos mail # user - Re: Mesos Masters Leader Keeps Fluctuating|http://search-hadoop.com/m/0Vlr69BZgz1NlAPP1]
> From suruchi:
> {quote}
> Hi,
>  
> I have set the quorum value as 2 as I have configured 3 master machines in my environment.
>  
> But I don’t know why my leader master keeps fluctuating.
> {quote}
> From Stefano Bianchi:
> {quote}
> I joined this discussion.
> I'm currently setting up a cluster again, but since I don't have many resources I need to set up 2 masters.
> In this case, is the quorum value set to 2 correct?
> The problem I notice is that when I connect my 2 Mesos masters, the leader is disconnected after a few seconds: Failed to connect to...
> Then the other master becomes the leader, but after a while the Failed to connect to... message appears again.
> I notice that I always used Mesos 0.27 and this problem happens with Mesos 0.28.
> ...
> However, in the previous configuration the switch between the two masters was fine; it is just that when a master was leading, after more or less 30 seconds, there was that Failed to connect message.
> {quote}


