Posted to user@mesos.apache.org by Qian Zhang <zh...@gmail.com> on 2016/06/04 15:42:58 UTC

Mesos HA does not work (Failed to recover registrar)

Hi Folks,

I am trying to set up a Mesos HA environment with 3 nodes, each of which has a
ZooKeeper server running, so together they form a ZooKeeper cluster. Then I
started the first Mesos master on one node with:
    sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
--work_dir=/var/lib/mesos/master

I found it hangs here for 60 seconds:
  I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master (UPID=
master@192.168.122.132:5050) is detected
  I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected leader is
master@192.168.122.132:5050 with id 40d387a6-4d61-49d6-af44-51dd41457390
  I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading
master!
  I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from registrar
  I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
  I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the writer

After 60s, the master fails:
F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed to
recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
    @     0x7f4b81372f4e  google::LogMessage::Fail()
    @     0x7f4b81372e9a  google::LogMessage::SendToLog()
    @     0x7f4b8137289c  google::LogMessage::Flush()
    @     0x7f4b813757b0  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f4b8040eea0  mesos::internal::master::fail()
    @     0x7f4b804dbeb3
 _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
    @     0x7f4b804ba453
 _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
    @     0x7f4b804898d7
 _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
    @     0x7f4b804dbf80
 _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
    @           0x49d257  std::function<>::operator()()
    @           0x49837f
 _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
    @           0x493024  process::Future<>::fail()
    @     0x7f4b8015ad20  process::Promise<>::fail()
    @     0x7f4b804d9295  process::internal::thenf<>()
    @     0x7f4b8051788f
 _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
    @     0x7f4b8050fa3b  std::_Bind<>::operator()<>()
    @     0x7f4b804f94e3  std::_Function_handler<>::_M_invoke()
    @     0x7f4b8050fc69  std::function<>::operator()()
    @     0x7f4b804f9609
 _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
    @     0x7f4b80517936
 _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
    @     0x7f4b8050fc69  std::function<>::operator()()
    @     0x7f4b8056b1b4  process::internal::run<>()
    @     0x7f4b80561672  process::Future<>::fail()
    @     0x7f4b8059bf5f  std::_Mem_fn<>::operator()<>()
    @     0x7f4b8059757f
 _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
    @     0x7f4b8058fad1
 _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
    @     0x7f4b80585a41
 _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
    @     0x7f4b80597605
 _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
    @           0x49d257  std::function<>::operator()()
    @           0x49837f
 _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
    @     0x7f4b8056164a  process::Future<>::fail()
    @     0x7f4b8055a378  process::Promise<>::fail()

I tried both ZooKeeper 3.4.8 and 3.4.6 with the latest Mesos code, but no
luck with either. Any ideas about what happened? Thanks.



Thanks,
Qian Zhang

Re: Mesos HA does not work (Failed to recover registrar)

Posted by Vinod Kone <vi...@apache.org>.
You need to start all 3 masters simultaneously so that they can reach a
quorum. Also, it looks like each master is talking only to its local ZK
server; are you sure the 3 ZK servers are forming a quorum?
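
For example (assuming nc is available on each node), you can ask each
ZooKeeper server for its status with the "stat" four-letter command; one
server should report "Mode: leader" and the other two "Mode: follower":

    echo stat | nc 127.0.0.1 2181 | grep Mode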

On Sat, Jun 4, 2016 at 9:42 AM, Qian Zhang <zh...@gmail.com> wrote:

> Hi Folks,
>
> I am trying to set up a Mesos HA env with 3 nodes, each of nodes has a
> Zookeeper running, so they form a Zookeeper cluster. And then when I
> started the first Mesos master in one node with:
>     sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
> --work_dir=/var/lib/mesos/master
>
> I found it will hang here for 60 seconds:
>   I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master
> (UPID=master@192.168.122.132:5050) is detected
>   I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected leader is
> master@192.168.122.132:5050 with id 40d387a6-4d61-49d6-af44-51dd41457390
>   I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading
> master!
>   I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from registrar
>   I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
>   I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the writer
>
> And after 60s, master will fail:
> F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed to
> recover registrar: Failed to perform fetch within 1mins
> *** Check failure stack trace: ***
>     @     0x7f4b81372f4e  google::LogMessage::Fail()
>     @     0x7f4b81372e9a  google::LogMessage::SendToLog()
>     @     0x7f4b8137289c  google::LogMessage::Flush()
>     @     0x7f4b813757b0  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f4b8040eea0  mesos::internal::master::fail()
>     @     0x7f4b804dbeb3
>  _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
>     @     0x7f4b804ba453
>  _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
>     @     0x7f4b804898d7
>  _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
>     @     0x7f4b804dbf80
>  _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>     @           0x49d257  std::function<>::operator()()
>     @           0x49837f
>  _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>     @           0x493024  process::Future<>::fail()
>     @     0x7f4b8015ad20  process::Promise<>::fail()
>     @     0x7f4b804d9295  process::internal::thenf<>()
>     @     0x7f4b8051788f
>  _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>     @     0x7f4b8050fa3b  std::_Bind<>::operator()<>()
>     @     0x7f4b804f94e3  std::_Function_handler<>::_M_invoke()
>     @     0x7f4b8050fc69  std::function<>::operator()()
>     @     0x7f4b804f9609
>  _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
>     @     0x7f4b80517936
>  _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
>     @     0x7f4b8050fc69  std::function<>::operator()()
>     @     0x7f4b8056b1b4  process::internal::run<>()
>     @     0x7f4b80561672  process::Future<>::fail()
>     @     0x7f4b8059bf5f  std::_Mem_fn<>::operator()<>()
>     @     0x7f4b8059757f
>  _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>     @     0x7f4b8058fad1
>  _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
>     @     0x7f4b80585a41
>  _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
>     @     0x7f4b80597605
>  _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>     @           0x49d257  std::function<>::operator()()
>     @           0x49837f
>  _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>     @     0x7f4b8056164a  process::Future<>::fail()
>     @     0x7f4b8055a378  process::Promise<>::fail()
>
> I tried both Zookeeper 3.4.8 and 3.4.6 with latest code of Mesos, but no
> luck for both. Any ideas about what happened? Thanks.
>
>
>
> Thanks,
> Qian Zhang
>

Re: Mesos HA does not work (Failed to recover registrar)

Posted by Dick Davies <di...@hellooperator.net>.
The extra ZooKeepers listed in the second argument will let your mesos
master process keep working if its local ZooKeeper goes down for maintenance.
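
For reference, that is the form you already tried later in this thread, with
all three ZooKeeper endpoints listed (same IPs as quoted below):

    sudo ./bin/mesos-master.sh \
      --zk=zk://192.168.122.132:2181,192.168.122.171:2181,192.168.122.225:2181/mesos \
      --quorum=2 --work_dir=/var/lib/mesos/master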

On 5 June 2016 at 13:55, Qian Zhang <zh...@gmail.com> wrote:
>> You need the 2nd command line (i.e. you have to specify all the zk
>> nodes on each master, it's
>> not like e.g. Cassandra where you can discover other nodes from the
>> first one you talk to).
>
>
> I have an Open DC/OS environment which is enabled master HA (there are 3
> master nodes) and works very well, and I see each Mesos master is started to
> only connect with local zk:
> $ cat /opt/mesosphere/etc/mesos-master | grep ZK
> MESOS_ZK=zk://127.0.0.1:2181/mesos
>
> So I think I do not have to specify all the zk on each master.
>
>
>
>
>
>
>
> Thanks,
> Qian Zhang
>
> On Sun, Jun 5, 2016 at 4:25 PM, Dick Davies <di...@hellooperator.net> wrote:
>>
>> OK, good - that part looks as expected, you've had a successful
>> election for a leader
>> (and yes that sounds like your zookeeper layer is ok).
>>
>> You need the 2nd command line (i.e. you have to specify all the zk
>> nodes on each master, it's
>> not like e.g. Cassandra where you can discover other nodes from the
>> first one you talk to).
>>
>> The error you were getting was about the internal registry /
>> replicated log, which is a mesos master level thing.
>> You could try what Sivaram suggested - stopping the mesos master
>> processes, wiping their
>> work_dirs and starting them back up.
>> Perhaps some wonky state got in there while you were trying various
>> options?
>>
>>
>> On 5 June 2016 at 00:34, Qian Zhang <zh...@gmail.com> wrote:
>> > Thanks Vinod and Dick.
>> >
>> > I think my 3 ZK servers have formed a quorum, each of them has the
>> > following
>> > config:
>> >     $ cat conf/zoo.cfg
>> >     server.1=192.168.122.132:2888:3888
>> >     server.2=192.168.122.225:2888:3888
>> >     server.3=192.168.122.171:2888:3888
>> >     autopurge.purgeInterval=6
>> >     autopurge.snapRetainCount=5
>> >     initLimit=10
>> >     syncLimit=5
>> >     maxClientCnxns=0
>> >     clientPort=2181
>> >     tickTime=2000
>> >     quorumListenOnAllIPs=true
>> >     dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
>> >     dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions
>> >
>> > And when I run "bin/zkServer.sh status" on each of them, I can see
>> > "Mode:
>> > leader" for one, and "Mode: follower" for the other two.
>> >
>> > I have already tried to manually start 3 masters simultaneously, and
>> > here is
>> > what I see in their log:
>> > In 192.168.122.171(this is the first master I started):
>> >     I0605 07:12:49.418721  1187 detector.cpp:152] Detected a new leader:
>> > (id='25')
>> >     I0605 07:12:49.419276  1186 group.cpp:698] Trying to get
>> > '/mesos/log_replicas/0000000024' in ZooKeeper
>> >     I0605 07:12:49.420013  1188 group.cpp:698] Trying to get
>> > '/mesos/json.info_0000000025' in ZooKeeper
>> >     I0605 07:12:49.423807  1188 zookeeper.cpp:259] A new leading master
>> > (UPID=master@192.168.122.171:5050) is detected
>> >     I0605 07:12:49.423841 1186 network.hpp:461] ZooKeeper group PIDs: {
>> > log-replica(1)@192.168.122.171:5050 }
>> >     I0605 07:12:49.424281 1187 master.cpp:1951] The newly elected leader
>> > is
>> > master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>> >     I0605 07:12:49.424895  1187 master.cpp:1964] Elected as the leading
>> > master!
>> >
>> > In 192.168.122.225 (second master I started):
>> >     I0605 07:12:51.918702  2246 detector.cpp:152] Detected a new leader:
>> > (id='25')
>> >     I0605 07:12:51.919983  2246 group.cpp:698] Trying to get
>> > '/mesos/json.info_0000000025' in ZooKeeper
>> >     I0605 07:12:51.921910  2249 network.hpp:461] ZooKeeper group PIDs: {
>> > log-replica(1)@192.168.122.171:5050 }
>> >     I0605 07:12:51.925721 2252 replica.cpp:673] Replica in EMPTY status
>> > received a broadcasted recover request from (6)@192.168.122.225:5050
>> >     I0605 07:12:51.927891  2246 zookeeper.cpp:259] A new leading master
>> > (UPID=master@192.168.122.171:5050) is detected
>> >     I0605 07:12:51.928444 2246 master.cpp:1951] The newly elected leader
>> > is
>> > master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>> >
>> > In 192.168.122.132 (last master I started):
>> > I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader:
>> > (id='25')
>> > I0605 07:12:53.555179 16429 group.cpp:698] Trying to get
>> > '/mesos/json.info_0000000025' in ZooKeeper
>> > I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master
>> > (UPID=master@192.168.122.171:5050) is detected
>> >
>> > So right after I started these 3 masters, the first one
>> > (192.168.122.171)
>> > was successfully elected as leader, but after 60s, 192.168.122.171
>> > failed
>> > with the error mentioned in my first mail, and then 192.168.122.225 was
>> > elected as leader, but it failed with the same error too after another
>> > 60s,
>> > and the same thing happened to the last one (192.168.122.132). So after
>> > about 180s, all my 3 master were down.
>> >
>> > I tried both:
>> >     sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
>> > --work_dir=/var/lib/mesos/master
>> > and
>> >     sudo ./bin/mesos-master.sh
>> >
>> > --zk=zk://192.168.122.132:2181,192.168.122.171:2181,192.168.122.225:2181/mesos
>> > --quorum=2 --work_dir=/var/lib/mesos/master
>> > And I see the same error for both.
>> >
>> > 192.168.122.132, 192.168.122.225 and 192.168.122.171 are 3 VMs which are
>> > running on a KVM hypervisor host.
>> >
>> >
>> >
>> >
>> > Thanks,
>> > Qian Zhang
>> >
>> > On Sun, Jun 5, 2016 at 3:47 AM, Dick Davies <di...@hellooperator.net>
>> > wrote:
>> >>
>> >> You told the master it needed a quorum of 2 and it's the only one
>> >> online, so it's bombing out.
>> >> That's the expected behaviour.
>> >>
>> >> You need to start at least 2 zookeepers before it will be a functional
>> >> group, same for the masters.
>> >>
>> >> You haven't mentioned how you setup your zookeeper cluster, so i'm
>> >> assuming that's working
>> >> correctly (3 nodes, all aware of the other 2 in their config). If not,
>> >> you need to sort that out first.
>> >>
>> >>
>> >> Also I think your zk URL is wrong - you want to list all 3 zookeeper
>> >> nodes like this:
>> >>
>> >> sudo ./bin/mesos-master.sh
>> >> --zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2
>> >> --work_dir=/var/lib/mesos/master
>> >>
>> >> when you've run that command on 2 hosts things should start working,
>> >> you'll want all 3 up for
>> >> redundancy.
>> >>
>> >> On 4 June 2016 at 16:42, Qian Zhang <zh...@gmail.com> wrote:
>> >> > Hi Folks,
>> >> >
>> >> > I am trying to set up a Mesos HA env with 3 nodes, each of nodes has
>> >> > a
>> >> > Zookeeper running, so they form a Zookeeper cluster. And then when I
>> >> > started
>> >> > the first Mesos master in one node with:
>> >> >     sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos
>> >> > --quorum=2
>> >> > --work_dir=/var/lib/mesos/master
>> >> >
>> >> > I found it will hang here for 60 seconds:
>> >> >   I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master
>> >> > (UPID=master@192.168.122.132:5050) is detected
>> >> >   I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected
>> >> > leader
>> >> > is
>> >> > master@192.168.122.132:5050 with id
>> >> > 40d387a6-4d61-49d6-af44-51dd41457390
>> >> >   I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading
>> >> > master!
>> >> >   I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from
>> >> > registrar
>> >> >   I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
>> >> >   I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the
>> >> > writer
>> >> >
>> >> > And after 60s, master will fail:
>> >> > F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed
>> >> > to
>> >> > recover registrar: Failed to perform fetch within 1mins
>> >> > *** Check failure stack trace: ***
>> >> >     @     0x7f4b81372f4e  google::LogMessage::Fail()
>> >> >     @     0x7f4b81372e9a  google::LogMessage::SendToLog()
>> >> >     @     0x7f4b8137289c  google::LogMessage::Flush()
>> >> >     @     0x7f4b813757b0  google::LogMessageFatal::~LogMessageFatal()
>> >> >     @     0x7f4b8040eea0  mesos::internal::master::fail()
>> >> >     @     0x7f4b804dbeb3
>> >> >
>> >> >
>> >> > _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
>> >> >     @     0x7f4b804ba453
>> >> >
>> >> > _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
>> >> >     @     0x7f4b804898d7
>> >> >
>> >> >
>> >> > _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
>> >> >     @     0x7f4b804dbf80
>> >> >
>> >> >
>> >> > _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>> >> >     @           0x49d257  std::function<>::operator()()
>> >> >     @           0x49837f
>> >> >
>> >> >
>> >> > _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>> >> >     @           0x493024  process::Future<>::fail()
>> >> >     @     0x7f4b8015ad20  process::Promise<>::fail()
>> >> >     @     0x7f4b804d9295  process::internal::thenf<>()
>> >> >     @     0x7f4b8051788f
>> >> >
>> >> >
>> >> > _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>> >> >     @     0x7f4b8050fa3b  std::_Bind<>::operator()<>()
>> >> >     @     0x7f4b804f94e3  std::_Function_handler<>::_M_invoke()
>> >> >     @     0x7f4b8050fc69  std::function<>::operator()()
>> >> >     @     0x7f4b804f9609
>> >> >
>> >> >
>> >> > _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
>> >> >     @     0x7f4b80517936
>> >> >
>> >> >
>> >> > _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
>> >> >     @     0x7f4b8050fc69  std::function<>::operator()()
>> >> >     @     0x7f4b8056b1b4  process::internal::run<>()
>> >> >     @     0x7f4b80561672  process::Future<>::fail()
>> >> >     @     0x7f4b8059bf5f  std::_Mem_fn<>::operator()<>()
>> >> >     @     0x7f4b8059757f
>> >> >
>> >> >
>> >> > _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>> >> >     @     0x7f4b8058fad1
>> >> >
>> >> >
>> >> > _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
>> >> >     @     0x7f4b80585a41
>> >> >
>> >> >
>> >> > _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
>> >> >     @     0x7f4b80597605
>> >> >
>> >> >
>> >> > _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>> >> >     @           0x49d257  std::function<>::operator()()
>> >> >     @           0x49837f
>> >> >
>> >> >
>> >> > _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>> >> >     @     0x7f4b8056164a  process::Future<>::fail()
>> >> >     @     0x7f4b8055a378  process::Promise<>::fail()
>> >> >
>> >> > I tried both Zookeeper 3.4.8 and 3.4.6 with latest code of Mesos, but
>> >> > no
>> >> > luck for both. Any ideas about what happened? Thanks.
>> >> >
>> >> >
>> >> >
>> >> > Thanks,
>> >> > Qian Zhang
>> >
>> >
>
>

Re: Mesos HA does not work (Failed to recover registrar)

Posted by Qian Zhang <zh...@gmail.com>.
>
> You need the 2nd command line (i.e. you have to specify all the zk
> nodes on each master, it's
> not like e.g. Cassandra where you can discover other nodes from the
> first one you talk to).


I have an Open DC/OS environment with master HA enabled (there are 3
master nodes) that works very well, and I see each Mesos master is started
to connect only to its local ZK:
$ cat /opt/mesosphere/etc/mesos-master | grep ZK
MESOS_ZK=zk://127.0.0.1:2181/mesos

So I think I do not have to specify all the ZK nodes on each master.







Thanks,
Qian Zhang

On Sun, Jun 5, 2016 at 4:25 PM, Dick Davies <di...@hellooperator.net> wrote:

> OK, good - that part looks as expected, you've had a successful
> election for a leader
> (and yes that sounds like your zookeeper layer is ok).
>
> You need the 2nd command line (i.e. you have to specify all the zk
> nodes on each master, it's
> not like e.g. Cassandra where you can discover other nodes from the
> first one you talk to).
>
> The error you were getting was about the internal registry /
> replicated log, which is a mesos master level thing.
> You could try what Sivaram suggested - stopping the mesos master
> processes, wiping their
> work_dirs and starting them back up.
> Perhaps some wonky state got in there while you were trying various
> options?
>
>
> On 5 June 2016 at 00:34, Qian Zhang <zh...@gmail.com> wrote:
> > Thanks Vinod and Dick.
> >
> > I think my 3 ZK servers have formed a quorum, each of them has the
> following
> > config:
> >     $ cat conf/zoo.cfg
> >     server.1=192.168.122.132:2888:3888
> >     server.2=192.168.122.225:2888:3888
> >     server.3=192.168.122.171:2888:3888
> >     autopurge.purgeInterval=6
> >     autopurge.snapRetainCount=5
> >     initLimit=10
> >     syncLimit=5
> >     maxClientCnxns=0
> >     clientPort=2181
> >     tickTime=2000
> >     quorumListenOnAllIPs=true
> >     dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
> >     dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions
> >
> > And when I run "bin/zkServer.sh status" on each of them, I can see "Mode:
> > leader" for one, and "Mode: follower" for the other two.
> >
> > I have already tried to manually start 3 masters simultaneously, and
> here is
> > what I see in their log:
> > In 192.168.122.171(this is the first master I started):
> >     I0605 07:12:49.418721  1187 detector.cpp:152] Detected a new leader:
> > (id='25')
> >     I0605 07:12:49.419276  1186 group.cpp:698] Trying to get
> > '/mesos/log_replicas/0000000024' in ZooKeeper
> >     I0605 07:12:49.420013  1188 group.cpp:698] Trying to get
> > '/mesos/json.info_0000000025' in ZooKeeper
> >     I0605 07:12:49.423807  1188 zookeeper.cpp:259] A new leading master
> > (UPID=master@192.168.122.171:5050) is detected
> >     I0605 07:12:49.423841 1186 network.hpp:461] ZooKeeper group PIDs: {
> > log-replica(1)@192.168.122.171:5050 }
> >     I0605 07:12:49.424281 1187 master.cpp:1951] The newly elected
> leader is
> > master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
> >     I0605 07:12:49.424895  1187 master.cpp:1964] Elected as the leading
> > master!
> >
> > In 192.168.122.225 (second master I started):
> >     I0605 07:12:51.918702  2246 detector.cpp:152] Detected a new leader:
> > (id='25')
> >     I0605 07:12:51.919983  2246 group.cpp:698] Trying to get
> > '/mesos/json.info_0000000025' in ZooKeeper
> >     I0605 07:12:51.921910  2249 network.hpp:461] ZooKeeper group PIDs: {
> > log-replica(1)@192.168.122.171:5050 }
> >     I0605 07:12:51.925721 2252 replica.cpp:673] Replica in EMPTY status
> > received a broadcasted recover request from (6)@192.168.122.225:5050
> >     I0605 07:12:51.927891  2246 zookeeper.cpp:259] A new leading master
> > (UPID=master@192.168.122.171:5050) is detected
> >     I0605 07:12:51.928444 2246 master.cpp:1951] The newly elected
> leader is
> > master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
> >
> > In 192.168.122.132 (last master I started):
> > I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader:
> > (id='25')
> > I0605 07:12:53.555179 16429 group.cpp:698] Trying to get
> > '/mesos/json.info_0000000025' in ZooKeeper
> > I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master
> > (UPID=master@192.168.122.171:5050) is detected
> >
> > So right after I started these 3 masters, the first one (192.168.122.171)
> > was successfully elected as leader, but after 60s, 192.168.122.171 failed
> > with the error mentioned in my first mail, and then 192.168.122.225 was
> > elected as leader, but it failed with the same error too after another
> 60s,
> > and the same thing happened to the last one (192.168.122.132). So after
> > about 180s, all my 3 master were down.
> >
> > I tried both:
> >     sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
> > --work_dir=/var/lib/mesos/master
> > and
> >     sudo ./bin/mesos-master.sh
> > --zk=zk://192.168.122.132:2181,192.168.122.171:2181,
> 192.168.122.225:2181/mesos
> > --quorum=2 --work_dir=/var/lib/mesos/master
> > And I see the same error for both.
> >
> > 192.168.122.132, 192.168.122.225 and 192.168.122.171 are 3 VMs which are
> > running on a KVM hypervisor host.
> >
> >
> >
> >
> > Thanks,
> > Qian Zhang
> >
> > On Sun, Jun 5, 2016 at 3:47 AM, Dick Davies <di...@hellooperator.net>
> wrote:
> >>
> >> You told the master it needed a quorum of 2 and it's the only one
> >> online, so it's bombing out.
> >> That's the expected behaviour.
> >>
> >> You need to start at least 2 zookeepers before it will be a functional
> >> group, same for the masters.
> >>
> >> You haven't mentioned how you setup your zookeeper cluster, so i'm
> >> assuming that's working
> >> correctly (3 nodes, all aware of the other 2 in their config). If not,
> >> you need to sort that out first.
> >>
> >>
> >> Also I think your zk URL is wrong - you want to list all 3 zookeeper
> >> nodes like this:
> >>
> >> sudo ./bin/mesos-master.sh
> >> --zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2
> >> --work_dir=/var/lib/mesos/master
> >>
> >> when you've run that command on 2 hosts things should start working,
> >> you'll want all 3 up for
> >> redundancy.
> >>
> >> On 4 June 2016 at 16:42, Qian Zhang <zh...@gmail.com> wrote:
> >> > Hi Folks,
> >> >
> >> > I am trying to set up a Mesos HA env with 3 nodes, each of nodes has a
> >> > Zookeeper running, so they form a Zookeeper cluster. And then when I
> >> > started
> >> > the first Mesos master in one node with:
> >> >     sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos
> --quorum=2
> >> > --work_dir=/var/lib/mesos/master
> >> >
> >> > I found it will hang here for 60 seconds:
> >> >   I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master
> >> > (UPID=master@192.168.122.132:5050) is detected
> >> >   I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected
> leader
> >> > is
> >> > master@192.168.122.132:5050 with id
> 40d387a6-4d61-49d6-af44-51dd41457390
> >> >   I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading
> >> > master!
> >> >   I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from
> registrar
> >> >   I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
> >> >   I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the
> >> > writer
> >> >
> >> > And after 60s, master will fail:
> >> > F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed
> to
> >> > recover registrar: Failed to perform fetch within 1mins
> >> > *** Check failure stack trace: ***
> >> >     @     0x7f4b81372f4e  google::LogMessage::Fail()
> >> >     @     0x7f4b81372e9a  google::LogMessage::SendToLog()
> >> >     @     0x7f4b8137289c  google::LogMessage::Flush()
> >> >     @     0x7f4b813757b0  google::LogMessageFatal::~LogMessageFatal()
> >> >     @     0x7f4b8040eea0  mesos::internal::master::fail()
> >> >     @     0x7f4b804dbeb3
> >> >
> >> >
> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
> >> >     @     0x7f4b804ba453
> >> > _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
> >> >     @     0x7f4b804898d7
> >> >
> >> >
> _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
> >> >     @     0x7f4b804dbf80
> >> >
> >> >
> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
> >> >     @           0x49d257  std::function<>::operator()()
> >> >     @           0x49837f
> >> >
> >> >
> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
> >> >     @           0x493024  process::Future<>::fail()
> >> >     @     0x7f4b8015ad20  process::Promise<>::fail()
> >> >     @     0x7f4b804d9295  process::internal::thenf<>()
> >> >     @     0x7f4b8051788f
> >> >
> >> >
> _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> >> >     @     0x7f4b8050fa3b  std::_Bind<>::operator()<>()
> >> >     @     0x7f4b804f94e3  std::_Function_handler<>::_M_invoke()
> >> >     @     0x7f4b8050fc69  std::function<>::operator()()
> >> >     @     0x7f4b804f9609
> >> >
> >> >
> _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
> >> >     @     0x7f4b80517936
> >> >
> >> >
> _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
> >> >     @     0x7f4b8050fc69  std::function<>::operator()()
> >> >     @     0x7f4b8056b1b4  process::internal::run<>()
> >> >     @     0x7f4b80561672  process::Future<>::fail()
> >> >     @     0x7f4b8059bf5f  std::_Mem_fn<>::operator()<>()
> >> >     @     0x7f4b8059757f
> >> >
> >> >
> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> >> >     @     0x7f4b8058fad1
> >> >
> >> >
> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
> >> >     @     0x7f4b80585a41
> >> >
> >> >
> _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
> >> >     @     0x7f4b80597605
> >> >
> >> >
> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
> >> >     @           0x49d257  std::function<>::operator()()
> >> >     @           0x49837f
> >> >
> >> >
> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
> >> >     @     0x7f4b8056164a  process::Future<>::fail()
> >> >     @     0x7f4b8055a378  process::Promise<>::fail()
> >> >
> >> > I tried both Zookeeper 3.4.8 and 3.4.6 with latest code of Mesos, but
> no
> >> > luck for both. Any ideas about what happened? Thanks.
> >> >
> >> >
> >> >
> >> > Thanks,
> >> > Qian Zhang
> >
> >
>

Re: Mesos HA does not work (Failed to recover registrar)

Posted by Dick Davies <di...@hellooperator.net>.
OK, good - that part looks as expected, you've had a successful
election for a leader
(and yes that sounds like your zookeeper layer is ok).

You need the 2nd command line (i.e. you have to specify all the zk
nodes on each master, it's
not like e.g. Cassandra where you can discover other nodes from the
first one you talk to).

The error you were getting was about the internal registry /
replicated log, which is a mesos master level thing.
You could try what Sivaram suggested - stopping the mesos master
processes, wiping their
work_dirs and starting them back up.
Perhaps some wonky state got in there while you were trying various options?
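
A minimal sketch of that reset, assuming the masters were started by hand
with the commands from this thread (adjust accordingly if you run them under
systemd or another supervisor):

    # on each of the 3 master nodes: stop mesos-master, then
    sudo rm -rf /var/lib/mesos/master    # wipe the work_dir (this also removes the replicated log)
    sudo ./bin/mesos-master.sh \
      --zk=zk://192.168.122.132:2181,192.168.122.171:2181,192.168.122.225:2181/mesos \
      --quorum=2 --work_dir=/var/lib/mesos/master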


On 5 June 2016 at 00:34, Qian Zhang <zh...@gmail.com> wrote:
> Thanks Vinod and Dick.
>
> I think my 3 ZK servers have formed a quorum, each of them has the following
> config:
>     $ cat conf/zoo.cfg
>     server.1=192.168.122.132:2888:3888
>     server.2=192.168.122.225:2888:3888
>     server.3=192.168.122.171:2888:3888
>     autopurge.purgeInterval=6
>     autopurge.snapRetainCount=5
>     initLimit=10
>     syncLimit=5
>     maxClientCnxns=0
>     clientPort=2181
>     tickTime=2000
>     quorumListenOnAllIPs=true
>     dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
>     dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions
>
> And when I run "bin/zkServer.sh status" on each of them, I can see "Mode:
> leader" for one, and "Mode: follower" for the other two.
>
> I have already tried to manually start 3 masters simultaneously, and here is
> what I see in their log:
> In 192.168.122.171(this is the first master I started):
>     I0605 07:12:49.418721  1187 detector.cpp:152] Detected a new leader:
> (id='25')
>     I0605 07:12:49.419276  1186 group.cpp:698] Trying to get
> '/mesos/log_replicas/0000000024' in ZooKeeper
>     I0605 07:12:49.420013  1188 group.cpp:698] Trying to get
> '/mesos/json.info_0000000025' in ZooKeeper
>     I0605 07:12:49.423807  1188 zookeeper.cpp:259] A new leading master
> (UPID=master@192.168.122.171:5050) is detected
>     I0605 07:12:49.423841  1186 network.hpp:461] ZooKeeper group PIDs: {
> log-replica(1)@192.168.122.171:5050 }
>     I0605 07:12:49.424281  1187 master.cpp:1951] The newly elected leader is
> master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>     I0605 07:12:49.424895  1187 master.cpp:1964] Elected as the leading
> master!
>
> In 192.168.122.225 (second master I started):
>     I0605 07:12:51.918702  2246 detector.cpp:152] Detected a new leader:
> (id='25')
>     I0605 07:12:51.919983  2246 group.cpp:698] Trying to get
> '/mesos/json.info_0000000025' in ZooKeeper
>     I0605 07:12:51.921910  2249 network.hpp:461] ZooKeeper group PIDs: {
> log-replica(1)@192.168.122.171:5050 }
>     I0605 07:12:51.925721  2252 replica.cpp:673] Replica in EMPTY status
> received a broadcasted recover request from (6)@192.168.122.225:5050
>     I0605 07:12:51.927891  2246 zookeeper.cpp:259] A new leading master
> (UPID=master@192.168.122.171:5050) is detected
>     I0605 07:12:51.928444  2246 master.cpp:1951] The newly elected leader is
> master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>
> In 192.168.122.132 (last master I started):
> I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader:
> (id='25')
> I0605 07:12:53.555179 16429 group.cpp:698] Trying to get
> '/mesos/json.info_0000000025' in ZooKeeper
> I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master
> (UPID=master@192.168.122.171:5050) is detected
>
> So right after I started these 3 masters, the first one (192.168.122.171)
> was successfully elected as leader, but after 60s, 192.168.122.171 failed
> with the error mentioned in my first mail, and then 192.168.122.225 was
> elected as leader, but it failed with the same error too after another 60s,
> and the same thing happened to the last one (192.168.122.132). So after
> about 180s, all my 3 master were down.
>
> I tried both:
>     sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
> --work_dir=/var/lib/mesos/master
> and
>     sudo ./bin/mesos-master.sh
> --zk=zk://192.168.122.132:2181,192.168.122.171:2181,192.168.122.225:2181/mesos
> --quorum=2 --work_dir=/var/lib/mesos/master
> And I see the same error for both.
>
> 192.168.122.132, 192.168.122.225 and 192.168.122.171 are 3 VMs which are
> running on a KVM hypervisor host.
>
>
>
>
> Thanks,
> Qian Zhang
>
> On Sun, Jun 5, 2016 at 3:47 AM, Dick Davies <di...@hellooperator.net> wrote:
>>
>> You told the master it needed a quorum of 2 and it's the only one
>> online, so it's bombing out.
>> That's the expected behaviour.
>>
>> You need to start at least 2 zookeepers before it will be a functional
>> group, same for the masters.
>>
>> You haven't mentioned how you setup your zookeeper cluster, so i'm
>> assuming that's working
>> correctly (3 nodes, all aware of the other 2 in their config). If not,
>> you need to sort that out first.
>>
>>
>> Also I think your zk URL is wrong - you want to list all 3 zookeeper
>> nodes like this:
>>
>> sudo ./bin/mesos-master.sh
>> --zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2
>> --work_dir=/var/lib/mesos/master
>>
>> when you've run that command on 2 hosts things should start working,
>> you'll want all 3 up for
>> redundancy.
>>
>> On 4 June 2016 at 16:42, Qian Zhang <zh...@gmail.com> wrote:
>> > Hi Folks,
>> >
>> > I am trying to set up a Mesos HA env with 3 nodes, each of nodes has a
>> > Zookeeper running, so they form a Zookeeper cluster. And then when I
>> > started
>> > the first Mesos master in one node with:
>> >     sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
>> > --work_dir=/var/lib/mesos/master
>> >
>> > I found it will hang here for 60 seconds:
>> >   I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master
>> > (UPID=master@192.168.122.132:5050) is detected
>> >   I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected leader
>> > is
>> > master@192.168.122.132:5050 with id 40d387a6-4d61-49d6-af44-51dd41457390
>> >   I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading
>> > master!
>> >   I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from registrar
>> >   I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
>> >   I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the
>> > writer
>> >
>> > And after 60s, master will fail:
>> > F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed to
>> > recover registrar: Failed to perform fetch within 1mins
>> > *** Check failure stack trace: ***
>> >     @     0x7f4b81372f4e  google::LogMessage::Fail()
>> >     @     0x7f4b81372e9a  google::LogMessage::SendToLog()
>> >     @     0x7f4b8137289c  google::LogMessage::Flush()
>> >     @     0x7f4b813757b0  google::LogMessageFatal::~LogMessageFatal()
>> >     @     0x7f4b8040eea0  mesos::internal::master::fail()
>> >     @     0x7f4b804dbeb3
>> >
>> > _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
>> >     @     0x7f4b804ba453
>> > _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
>> >     @     0x7f4b804898d7
>> >
>> > _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
>> >     @     0x7f4b804dbf80
>> >
>> > _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>> >     @           0x49d257  std::function<>::operator()()
>> >     @           0x49837f
>> >
>> > _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>> >     @           0x493024  process::Future<>::fail()
>> >     @     0x7f4b8015ad20  process::Promise<>::fail()
>> >     @     0x7f4b804d9295  process::internal::thenf<>()
>> >     @     0x7f4b8051788f
>> >
>> > _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>> >     @     0x7f4b8050fa3b  std::_Bind<>::operator()<>()
>> >     @     0x7f4b804f94e3  std::_Function_handler<>::_M_invoke()
>> >     @     0x7f4b8050fc69  std::function<>::operator()()
>> >     @     0x7f4b804f9609
>> >
>> > _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
>> >     @     0x7f4b80517936
>> >
>> > _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
>> >     @     0x7f4b8050fc69  std::function<>::operator()()
>> >     @     0x7f4b8056b1b4  process::internal::run<>()
>> >     @     0x7f4b80561672  process::Future<>::fail()
>> >     @     0x7f4b8059bf5f  std::_Mem_fn<>::operator()<>()
>> >     @     0x7f4b8059757f
>> >
>> > _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>> >     @     0x7f4b8058fad1
>> >
>> > _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
>> >     @     0x7f4b80585a41
>> >
>> > _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
>> >     @     0x7f4b80597605
>> >
>> > _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>> >     @           0x49d257  std::function<>::operator()()
>> >     @           0x49837f
>> >
>> > _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>> >     @     0x7f4b8056164a  process::Future<>::fail()
>> >     @     0x7f4b8055a378  process::Promise<>::fail()
>> >
>> > I tried both Zookeeper 3.4.8 and 3.4.6 with latest code of Mesos, but no
>> > luck for both. Any ideas about what happened? Thanks.
>> >
>> >
>> >
>> > Thanks,
>> > Qian Zhang
>
>

Re: Mesos HA does not work (Failed to recover registrar)

Posted by Sivaram Kannan <si...@gmail.com>.
My 2 cents - is there a possibility of old data in /var/lib/mesos? Can you
try deleting the folder /var/lib/mesos on all 3 systems and then try
bringing it up again?
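
If you want to check for stale state before deleting anything, listing the
work_dir should show it; the registrar's replicated log normally lives under
the work_dir (the path below assumes the --work_dir=/var/lib/mesos/master
used earlier in the thread):

    ls -l /var/lib/mesos/master
    ls -l /var/lib/mesos/master/replicated_log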

On Sat, Jun 4, 2016 at 9:04 PM, Qian Zhang <zh...@gmail.com> wrote:

> I am using the latest Mesos code in git (master branch). However, I also
> tried the official 0.28.1 release, but no luck either.
>
>
> Thanks,
> Qian Zhang
>
> On Sun, Jun 5, 2016 at 8:04 AM, Jie Yu <yu...@gmail.com> wrote:
>
>> Which version are you using?
>>
>> - Jie
>>
>> On Sat, Jun 4, 2016 at 4:34 PM, Qian Zhang <zh...@gmail.com> wrote:
>>
>> > Thanks Vinod and Dick.
>> >
>> > I think my 3 ZK servers have formed a quorum, each of them has the
>> > following config:
>> >     $ cat conf/zoo.cfg
>> >     server.1=192.168.122.132:2888:3888
>> >     server.2=192.168.122.225:2888:3888
>> >     server.3=192.168.122.171:2888:3888
>> >     autopurge.purgeInterval=6
>> >     autopurge.snapRetainCount=5
>> >     initLimit=10
>> >     syncLimit=5
>> >     maxClientCnxns=0
>> >     clientPort=2181
>> >     tickTime=2000
>> >     quorumListenOnAllIPs=true
>> >     dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
>> >     dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions
>> >
>> > And when I run "bin/zkServer.sh status" on each of them, I can see
>> "Mode:
>> > leader" for one, and "Mode: follower" for the other two.
>> >
>> > I have already tried to manually start 3 masters simultaneously, and
>> here
>> > is what I see in their log:
>> > In 192.168.122.171(this is the first master I started):
>> >     I0605 07:12:49.418721  1187 detector.cpp:152] Detected a new leader:
>> > (id='25')
>> >     I0605 07:12:49.419276  1186 group.cpp:698] Trying to get
>> > '/mesos/log_replicas/0000000024' in ZooKeeper
>> >     I0605 07:12:49.420013  1188 group.cpp:698] Trying to get
>> > '/mesos/json.info_0000000025' in ZooKeeper
>> >     I0605 07:12:49.423807  1188 zookeeper.cpp:259] A new leading master
>> > (UPID=master@192.168.122.171:5050) is detected
>> >     I0605 07:12:49.423841 1186 network.hpp:461] ZooKeeper group PIDs: {
>> > log-replica(1)@192.168.122.171:5050 }
>> >     I0605 07:12:49.424281 1187 master.cpp:1951] The newly elected
>> leader
>> > is master@192.168.122.171:5050 with id
>> > cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>> >     I0605 07:12:49.424895  1187 master.cpp:1964] Elected as the leading
>> > master!
>> >
>> > In 192.168.122.225 (second master I started):
>> >     I0605 07:12:51.918702  2246 detector.cpp:152] Detected a new leader:
>> > (id='25')
>> >     I0605 07:12:51.919983  2246 group.cpp:698] Trying to get
>> > '/mesos/json.info_0000000025' in ZooKeeper
>> >     I0605 07:12:51.921910  2249 network.hpp:461] ZooKeeper group PIDs: {
>> > log-replica(1)@192.168.122.171:5050 }
>> >     I0605 07:12:51.925721 2252 replica.cpp:673] Replica in EMPTY status
>> > received a broadcasted recover request from (6)@192.168.122.225:5050
>> >     I0605 07:12:51.927891  2246 zookeeper.cpp:259] A new leading master
>> > (UPID=master@192.168.122.171:5050) is detected
>> >     I0605 07:12:51.928444 2246 master.cpp:1951] The newly elected
>> leader
>> > is master@192.168.122.171:5050 with id
>> > cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>> >
>> > In 192.168.122.132 (last master I started):
>> > I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader:
>> > (id='25')
>> > I0605 07:12:53.555179 16429 group.cpp:698] Trying to get
>> > '/mesos/json.info_0000000025' in ZooKeeper
>> > I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master
>> (UPID=
>> > master@192.168.122.171:5050) is detected
>> >
>> > So right after I started these 3 masters, the first one
>> (192.168.122.171)
>> > was successfully elected as leader, but after 60s, 192.168.122.171
>> failed
>> > with the error mentioned in my first mail, and then 192.168.122.225 was
>> > elected as leader, but it failed with the same error too after another
>> 60s,
>> > and the same thing happened to the last one (192.168.122.132). So after
>> > about 180s, all my 3 master were down.
>> >
>> > I tried both:
>> >     sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos
>> --quorum=2
>> > --work_dir=/var/lib/mesos/master
>> > and
>> >     sudo ./bin/mesos-master.sh --zk=zk://192.168.122.132:2181,
>> > 192.168.122.171:2181,192.168.122.225:2181/mesos --quorum=2
>> > --work_dir=/var/lib/mesos/master
>> > And I see the same error for both.
>> >
>> > 192.168.122.132, 192.168.122.225 and 192.168.122.171 are 3 VMs which are
>> > running on a KVM hypervisor host.
>> >
>> >
>> >
>> >
>> > Thanks,
>> > Qian Zhang
>> >
>> > On Sun, Jun 5, 2016 at 3:47 AM, Dick Davies <di...@hellooperator.net>
>> > wrote:
>> >
>> >> You told the master it needed a quorum of 2 and it's the only one
>> >> online, so it's bombing out.
>> >> That's the expected behaviour.
>> >>
>> >> You need to start at least 2 zookeepers before it will be a functional
>> >> group, same for the masters.
>> >>
>> >> You haven't mentioned how you setup your zookeeper cluster, so i'm
>> >> assuming that's working
>> >> correctly (3 nodes, all aware of the other 2 in their config). If not,
>> >> you need to sort that out first.
>> >>
>> >>
>> >> Also I think your zk URL is wrong - you want to list all 3 zookeeper
>> >> nodes like this:
>> >>
>> >> sudo ./bin/mesos-master.sh
>> >> --zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2
>> >> --work_dir=/var/lib/mesos/master
>> >>
>> >> when you've run that command on 2 hosts things should start working,
>> >> you'll want all 3 up for
>> >> redundancy.
>> >>
>> >> On 4 June 2016 at 16:42, Qian Zhang <zh...@gmail.com> wrote:
>> >> > Hi Folks,
>> >> >
>> >> > I am trying to set up a Mesos HA env with 3 nodes, each of nodes has
>> a
>> >> > Zookeeper running, so they form a Zookeeper cluster. And then when I
>> >> started
>> >> > the first Mesos master in one node with:
>> >> >     sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos
>> >> --quorum=2
>> >> > --work_dir=/var/lib/mesos/master
>> >> >
>> >> > I found it will hang here for 60 seconds:
>> >> >   I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master
>> >> > (UPID=master@192.168.122.132:5050) is detected
>> >> >   I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected
>> leader
>> >> is
>> >> > master@192.168.122.132:5050 with id
>> >> 40d387a6-4d61-49d6-af44-51dd41457390
>> >> >   I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading
>> >> > master!
>> >> >   I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from
>> registrar
>> >> >   I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
>> >> >   I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the
>> >> writer
>> >> >
>> >> > And after 60s, master will fail:
>> >> > F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed
>> to
>> >> > recover registrar: Failed to perform fetch within 1mins
>> >> > *** Check failure stack trace: ***
>> >> >     @     0x7f4b81372f4e  google::LogMessage::Fail()
>> >> >     @     0x7f4b81372e9a  google::LogMessage::SendToLog()
>> >> >     @     0x7f4b8137289c  google::LogMessage::Flush()
>> >> >     @     0x7f4b813757b0  google::LogMessageFatal::~LogMessageFatal()
>> >> >     @     0x7f4b8040eea0  mesos::internal::master::fail()
>> >> >     @     0x7f4b804dbeb3
>> >> >
>> >>
>> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
>> >> >     @     0x7f4b804ba453
>> >> >
>> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
>> >> >     @     0x7f4b804898d7
>> >> >
>> >>
>> _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
>> >> >     @     0x7f4b804dbf80
>> >> >
>> >>
>> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>> >> >     @           0x49d257  std::function<>::operator()()
>> >> >     @           0x49837f
>> >> >
>> >>
>> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>> >> >     @           0x493024  process::Future<>::fail()
>> >> >     @     0x7f4b8015ad20  process::Promise<>::fail()
>> >> >     @     0x7f4b804d9295  process::internal::thenf<>()
>> >> >     @     0x7f4b8051788f
>> >> >
>> >>
>> _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>> >> >     @     0x7f4b8050fa3b  std::_Bind<>::operator()<>()
>> >> >     @     0x7f4b804f94e3  std::_Function_handler<>::_M_invoke()
>> >> >     @     0x7f4b8050fc69  std::function<>::operator()()
>> >> >     @     0x7f4b804f9609
>> >> >
>> >>
>> _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
>> >> >     @     0x7f4b80517936
>> >> >
>> >>
>> _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
>> >> >     @     0x7f4b8050fc69  std::function<>::operator()()
>> >> >     @     0x7f4b8056b1b4  process::internal::run<>()
>> >> >     @     0x7f4b80561672  process::Future<>::fail()
>> >> >     @     0x7f4b8059bf5f  std::_Mem_fn<>::operator()<>()
>> >> >     @     0x7f4b8059757f
>> >> >
>> >>
>> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>> >> >     @     0x7f4b8058fad1
>> >> >
>> >>
>> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
>> >> >     @     0x7f4b80585a41
>> >> >
>> >>
>> _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
>> >> >     @     0x7f4b80597605
>> >> >
>> >>
>> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>> >> >     @           0x49d257  std::function<>::operator()()
>> >> >     @           0x49837f
>> >> >
>> >>
>> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>> >> >     @     0x7f4b8056164a  process::Future<>::fail()
>> >> >     @     0x7f4b8055a378  process::Promise<>::fail()
>> >> >
>> >> > I tried both Zookeeper 3.4.8 and 3.4.6 with latest code of Mesos,
>> but no
>> >> > luck for both. Any ideas about what happened? Thanks.
>> >> >
>> >> >
>> >> >
>> >> > Thanks,
>> >> > Qian Zhang
>> >>
>> >
>> >
>>
>
>


-- 
ever tried. ever failed. no matter.
try again. fail again. fail better.
        -- Samuel Beckett

Re: Mesos HA does not work (Failed to recover registrar)

Posted by Sivaram Kannan <si...@gmail.com>.
My 2 cents - could there be old data left over in /var/lib/mesos? Can you
try deleting the /var/lib/mesos folder on all 3 systems and then bringing
the masters up again?
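
For reference, a minimal sketch of what that cleanup could look like,
assuming the work_dir is /var/lib/mesos/master as in the commands quoted
below (run on each of the 3 nodes with mesos-master stopped; this wipes
the registry / replicated log state):

    # on each of the three nodes, after stopping the mesos-master process
    sudo rm -rf /var/lib/mesos
    # recreate the work_dir so the next start begins from a clean directory
    sudo mkdir -p /var/lib/mesos/master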

On Sat, Jun 4, 2016 at 9:04 PM, Qian Zhang <zh...@gmail.com> wrote:

> I am using the latest Mesos code in git (master branch). However, I also
> tried the official 0.28.1 release, but no lock too.
>
>
> Thanks,
> Qian Zhang
>
> On Sun, Jun 5, 2016 at 8:04 AM, Jie Yu <yu...@gmail.com> wrote:
>
>> Which version are you using?
>>
>> - Jie
>>
>> On Sat, Jun 4, 2016 at 4:34 PM, Qian Zhang <zh...@gmail.com> wrote:
>>
>> > Thanks Vinod and Dick.
>> >
>> > I think my 3 ZK servers have formed a quorum, each of them has the
>> > following config:
>> >     $ cat conf/zoo.cfg
>> >     server.1=192.168.122.132:2888:3888
>> >     server.2=192.168.122.225:2888:3888
>> >     server.3=192.168.122.171:2888:3888
>> >     autopurge.purgeInterval=6
>> >     autopurge.snapRetainCount=5
>> >     initLimit=10
>> >     syncLimit=5
>> >     maxClientCnxns=0
>> >     clientPort=2181
>> >     tickTime=2000
>> >     quorumListenOnAllIPs=true
>> >     dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
>> >     dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions
>> >
>> > And when I run "bin/zkServer.sh status" on each of them, I can see
>> "Mode:
>> > leader" for one, and "Mode: follower" for the other two.
>> >
>> > I have already tried to manually start 3 masters simultaneously, and
>> here
>> > is what I see in their log:
>> > In 192.168.122.171(this is the first master I started):
>> >     I0605 07:12:49.418721  1187 detector.cpp:152] Detected a new leader:
>> > (id='25')
>> >     I0605 07:12:49.419276  1186 group.cpp:698] Trying to get
>> > '/mesos/log_replicas/0000000024' in ZooKeeper
>> >     I0605 07:12:49.420013  1188 group.cpp:698] Trying to get
>> > '/mesos/json.info_0000000025' in ZooKeeper
>> >     I0605 07:12:49.423807  1188 zookeeper.cpp:259] A new leading master
>> > (UPID=master@192.168.122.171:5050) is detected
>> >     I0605 07:12:49.423841 1186 network.hpp:461] ZooKeeper group PIDs: {
>> > log-replica(1)@192.168.122.171:5050 }
>> >     I0605 07:12:49.424281 1187 master.cpp:1951] The newly elected
>> leader
>> > is master@192.168.122.171:5050 with id
>> > cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>> >     I0605 07:12:49.424895  1187 master.cpp:1964] Elected as the leading
>> > master!
>> >
>> > In 192.168.122.225 (second master I started):
>> >     I0605 07:12:51.918702  2246 detector.cpp:152] Detected a new leader:
>> > (id='25')
>> >     I0605 07:12:51.919983  2246 group.cpp:698] Trying to get
>> > '/mesos/json.info_0000000025' in ZooKeeper
>> >     I0605 07:12:51.921910  2249 network.hpp:461] ZooKeeper group PIDs: {
>> > log-replica(1)@192.168.122.171:5050 }
>> >     I0605 07:12:51.925721 2252 replica.cpp:673] Replica in EMPTY status
>> > received a broadcasted recover request from (6)@192.168.122.225:5050
>> >     I0605 07:12:51.927891  2246 zookeeper.cpp:259] A new leading master
>> > (UPID=master@192.168.122.171:5050) is detected
>> >     I0605 07:12:51.928444 2246 master.cpp:1951] The newly elected
>> leader
>> > is master@192.168.122.171:5050 with id
>> > cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>> >
>> > In 192.168.122.132 (last master I started):
>> > I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader:
>> > (id='25')
>> > I0605 07:12:53.555179 16429 group.cpp:698] Trying to get
>> > '/mesos/json.info_0000000025' in ZooKeeper
>> > I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master
>> (UPID=
>> > master@192.168.122.171:5050) is detected
>> >
>> > So right after I started these 3 masters, the first one
>> (192.168.122.171)
>> > was successfully elected as leader, but after 60s, 192.168.122.171
>> failed
>> > with the error mentioned in my first mail, and then 192.168.122.225 was
>> > elected as leader, but it failed with the same error too after another
>> 60s,
>> > and the same thing happened to the last one (192.168.122.132). So after
>> > about 180s, all my 3 master were down.
>> >
>> > I tried both:
>> >     sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos
>> --quorum=2
>> > --work_dir=/var/lib/mesos/master
>> > and
>> >     sudo ./bin/mesos-master.sh --zk=zk://192.168.122.132:2181,
>> > 192.168.122.171:2181,192.168.122.225:2181/mesos --quorum=2
>> > --work_dir=/var/lib/mesos/master
>> > And I see the same error for both.
>> >
>> > 192.168.122.132, 192.168.122.225 and 192.168.122.171 are 3 VMs which are
>> > running on a KVM hypervisor host.
>> >
>> >
>> >
>> >
>> > Thanks,
>> > Qian Zhang
>> >
>> > On Sun, Jun 5, 2016 at 3:47 AM, Dick Davies <di...@hellooperator.net>
>> > wrote:
>> >
>> >> You told the master it needed a quorum of 2 and it's the only one
>> >> online, so it's bombing out.
>> >> That's the expected behaviour.
>> >>
>> >> You need to start at least 2 zookeepers before it will be a functional
>> >> group, same for the masters.
>> >>
>> >> You haven't mentioned how you setup your zookeeper cluster, so i'm
>> >> assuming that's working
>> >> correctly (3 nodes, all aware of the other 2 in their config). If not,
>> >> you need to sort that out first.
>> >>
>> >>
>> >> Also I think your zk URL is wrong - you want to list all 3 zookeeper
>> >> nodes like this:
>> >>
>> >> sudo ./bin/mesos-master.sh
>> >> --zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2
>> >> --work_dir=/var/lib/mesos/master
>> >>
>> >> when you've run that command on 2 hosts things should start working,
>> >> you'll want all 3 up for
>> >> redundancy.
>> >>
>> >> On 4 June 2016 at 16:42, Qian Zhang <zh...@gmail.com> wrote:
>> >> > Hi Folks,
>> >> >
>> >> > I am trying to set up a Mesos HA env with 3 nodes, each of nodes has
>> a
>> >> > Zookeeper running, so they form a Zookeeper cluster. And then when I
>> >> started
>> >> > the first Mesos master in one node with:
>> >> >     sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos
>> >> --quorum=2
>> >> > --work_dir=/var/lib/mesos/master
>> >> >
>> >> > I found it will hang here for 60 seconds:
>> >> >   I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master
>> >> > (UPID=master@192.168.122.132:5050) is detected
>> >> >   I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected
>> leader
>> >> is
>> >> > master@192.168.122.132:5050 with id
>> >> 40d387a6-4d61-49d6-af44-51dd41457390
>> >> >   I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading
>> >> > master!
>> >> >   I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from
>> registrar
>> >> >   I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
>> >> >   I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the
>> >> writer
>> >> >
>> >> > And after 60s, master will fail:
>> >> > F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed
>> to
>> >> > recover registrar: Failed to perform fetch within 1mins
>> >> > *** Check failure stack trace: ***
>> >> >     @     0x7f4b81372f4e  google::LogMessage::Fail()
>> >> >     @     0x7f4b81372e9a  google::LogMessage::SendToLog()
>> >> >     @     0x7f4b8137289c  google::LogMessage::Flush()
>> >> >     @     0x7f4b813757b0  google::LogMessageFatal::~LogMessageFatal()
>> >> >     @     0x7f4b8040eea0  mesos::internal::master::fail()
>> >> >     @     0x7f4b804dbeb3
>> >> >
>> >>
>> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
>> >> >     @     0x7f4b804ba453
>> >> >
>> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
>> >> >     @     0x7f4b804898d7
>> >> >
>> >>
>> _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
>> >> >     @     0x7f4b804dbf80
>> >> >
>> >>
>> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>> >> >     @           0x49d257  std::function<>::operator()()
>> >> >     @           0x49837f
>> >> >
>> >>
>> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>> >> >     @           0x493024  process::Future<>::fail()
>> >> >     @     0x7f4b8015ad20  process::Promise<>::fail()
>> >> >     @     0x7f4b804d9295  process::internal::thenf<>()
>> >> >     @     0x7f4b8051788f
>> >> >
>> >>
>> _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>> >> >     @     0x7f4b8050fa3b  std::_Bind<>::operator()<>()
>> >> >     @     0x7f4b804f94e3  std::_Function_handler<>::_M_invoke()
>> >> >     @     0x7f4b8050fc69  std::function<>::operator()()
>> >> >     @     0x7f4b804f9609
>> >> >
>> >>
>> _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
>> >> >     @     0x7f4b80517936
>> >> >
>> >>
>> _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
>> >> >     @     0x7f4b8050fc69  std::function<>::operator()()
>> >> >     @     0x7f4b8056b1b4  process::internal::run<>()
>> >> >     @     0x7f4b80561672  process::Future<>::fail()
>> >> >     @     0x7f4b8059bf5f  std::_Mem_fn<>::operator()<>()
>> >> >     @     0x7f4b8059757f
>> >> >
>> >>
>> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>> >> >     @     0x7f4b8058fad1
>> >> >
>> >>
>> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
>> >> >     @     0x7f4b80585a41
>> >> >
>> >>
>> _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
>> >> >     @     0x7f4b80597605
>> >> >
>> >>
>> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>> >> >     @           0x49d257  std::function<>::operator()()
>> >> >     @           0x49837f
>> >> >
>> >>
>> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>> >> >     @     0x7f4b8056164a  process::Future<>::fail()
>> >> >     @     0x7f4b8055a378  process::Promise<>::fail()
>> >> >
>> >> > I tried both Zookeeper 3.4.8 and 3.4.6 with latest code of Mesos,
>> but no
>> >> > luck for both. Any ideas about what happened? Thanks.
>> >> >
>> >> >
>> >> >
>> >> > Thanks,
>> >> > Qian Zhang
>> >>
>> >
>> >
>>
>
>


-- 
ever tried. ever failed. no matter.
try again. fail again. fail better.
        -- Samuel Beckett

Re: Mesos HA does not work (Failed to recover registrar)

Posted by Qian Zhang <zh...@gmail.com>.
I am using the latest Mesos code in git (master branch). However, I also
tried the official 0.28.1 release, and had no luck with that either.
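
For reference, one way to double-check which build each master is actually
running (a sketch; this assumes the standard --version flag, which the
wrapper script passes through to the mesos-master binary):

    ./bin/mesos-master.sh --version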


Thanks,
Qian Zhang

On Sun, Jun 5, 2016 at 8:04 AM, Jie Yu <yu...@gmail.com> wrote:

> Which version are you using?
>
> - Jie
>
> On Sat, Jun 4, 2016 at 4:34 PM, Qian Zhang <zh...@gmail.com> wrote:
>
> > Thanks Vinod and Dick.
> >
> > I think my 3 ZK servers have formed a quorum, each of them has the
> > following config:
> >     $ cat conf/zoo.cfg
> >     server.1=192.168.122.132:2888:3888
> >     server.2=192.168.122.225:2888:3888
> >     server.3=192.168.122.171:2888:3888
> >     autopurge.purgeInterval=6
> >     autopurge.snapRetainCount=5
> >     initLimit=10
> >     syncLimit=5
> >     maxClientCnxns=0
> >     clientPort=2181
> >     tickTime=2000
> >     quorumListenOnAllIPs=true
> >     dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
> >     dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions
> >
> > And when I run "bin/zkServer.sh status" on each of them, I can see "Mode:
> > leader" for one, and "Mode: follower" for the other two.
> >
> > I have already tried to manually start 3 masters simultaneously, and here
> > is what I see in their log:
> > In 192.168.122.171(this is the first master I started):
> >     I0605 07:12:49.418721  1187 detector.cpp:152] Detected a new leader:
> > (id='25')
> >     I0605 07:12:49.419276  1186 group.cpp:698] Trying to get
> > '/mesos/log_replicas/0000000024' in ZooKeeper
> >     I0605 07:12:49.420013  1188 group.cpp:698] Trying to get
> > '/mesos/json.info_0000000025' in ZooKeeper
> >     I0605 07:12:49.423807  1188 zookeeper.cpp:259] A new leading master
> > (UPID=master@192.168.122.171:5050) is detected
> >     I0605 07:12:49.423841 1186 network.hpp:461] ZooKeeper group PIDs: {
> > log-replica(1)@192.168.122.171:5050 }
> >     I0605 07:12:49.424281 1187 master.cpp:1951] The newly elected leader
> > is master@192.168.122.171:5050 with id
> > cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
> >     I0605 07:12:49.424895  1187 master.cpp:1964] Elected as the leading
> > master!
> >
> > In 192.168.122.225 (second master I started):
> >     I0605 07:12:51.918702  2246 detector.cpp:152] Detected a new leader:
> > (id='25')
> >     I0605 07:12:51.919983  2246 group.cpp:698] Trying to get
> > '/mesos/json.info_0000000025' in ZooKeeper
> >     I0605 07:12:51.921910  2249 network.hpp:461] ZooKeeper group PIDs: {
> > log-replica(1)@192.168.122.171:5050 }
> >     I0605 07:12:51.925721 2252 replica.cpp:673] Replica in EMPTY status
> > received a broadcasted recover request from (6)@192.168.122.225:5050
> >     I0605 07:12:51.927891  2246 zookeeper.cpp:259] A new leading master
> > (UPID=master@192.168.122.171:5050) is detected
> >     I0605 07:12:51.928444 2246 master.cpp:1951] The newly elected leader
> > is master@192.168.122.171:5050 with id
> > cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
> >
> > In 192.168.122.132 (last master I started):
> > I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader:
> > (id='25')
> > I0605 07:12:53.555179 16429 group.cpp:698] Trying to get
> > '/mesos/json.info_0000000025' in ZooKeeper
> > I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master
> (UPID=
> > master@192.168.122.171:5050) is detected
> >
> > So right after I started these 3 masters, the first one (192.168.122.171)
> > was successfully elected as leader, but after 60s, 192.168.122.171 failed
> > with the error mentioned in my first mail, and then 192.168.122.225 was
> > elected as leader, but it failed with the same error too after another
> 60s,
> > and the same thing happened to the last one (192.168.122.132). So after
> > about 180s, all my 3 master were down.
> >
> > I tried both:
> >     sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
> > --work_dir=/var/lib/mesos/master
> > and
> >     sudo ./bin/mesos-master.sh --zk=zk://192.168.122.132:2181,
> > 192.168.122.171:2181,192.168.122.225:2181/mesos --quorum=2
> > --work_dir=/var/lib/mesos/master
> > And I see the same error for both.
> >
> > 192.168.122.132, 192.168.122.225 and 192.168.122.171 are 3 VMs which are
> > running on a KVM hypervisor host.
> >
> >
> >
> >
> > Thanks,
> > Qian Zhang
> >
> > On Sun, Jun 5, 2016 at 3:47 AM, Dick Davies <di...@hellooperator.net>
> > wrote:
> >
> >> You told the master it needed a quorum of 2 and it's the only one
> >> online, so it's bombing out.
> >> That's the expected behaviour.
> >>
> >> You need to start at least 2 zookeepers before it will be a functional
> >> group, same for the masters.
> >>
> >> You haven't mentioned how you setup your zookeeper cluster, so i'm
> >> assuming that's working
> >> correctly (3 nodes, all aware of the other 2 in their config). If not,
> >> you need to sort that out first.
> >>
> >>
> >> Also I think your zk URL is wrong - you want to list all 3 zookeeper
> >> nodes like this:
> >>
> >> sudo ./bin/mesos-master.sh
> >> --zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2
> >> --work_dir=/var/lib/mesos/master
> >>
> >> when you've run that command on 2 hosts things should start working,
> >> you'll want all 3 up for
> >> redundancy.
> >>
> >> On 4 June 2016 at 16:42, Qian Zhang <zh...@gmail.com> wrote:
> >> > Hi Folks,
> >> >
> >> > I am trying to set up a Mesos HA env with 3 nodes, each of nodes has a
> >> > Zookeeper running, so they form a Zookeeper cluster. And then when I
> >> started
> >> > the first Mesos master in one node with:
> >> >     sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos
> >> --quorum=2
> >> > --work_dir=/var/lib/mesos/master
> >> >
> >> > I found it will hang here for 60 seconds:
> >> >   I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master
> >> > (UPID=master@192.168.122.132:5050) is detected
> >> >   I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected
> leader
> >> is
> >> > master@192.168.122.132:5050 with id
> >> 40d387a6-4d61-49d6-af44-51dd41457390
> >> >   I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading
> >> > master!
> >> >   I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from
> registrar
> >> >   I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
> >> >   I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the
> >> writer
> >> >
> >> > And after 60s, master will fail:
> >> > F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed
> to
> >> > recover registrar: Failed to perform fetch within 1mins
> >> > *** Check failure stack trace: ***
> >> >     @     0x7f4b81372f4e  google::LogMessage::Fail()
> >> >     @     0x7f4b81372e9a  google::LogMessage::SendToLog()
> >> >     @     0x7f4b8137289c  google::LogMessage::Flush()
> >> >     @     0x7f4b813757b0  google::LogMessageFatal::~LogMessageFatal()
> >> >     @     0x7f4b8040eea0  mesos::internal::master::fail()
> >> >     @     0x7f4b804dbeb3
> >> >
> >>
> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
> >> >     @     0x7f4b804ba453
> >> > _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
> >> >     @     0x7f4b804898d7
> >> >
> >>
> _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
> >> >     @     0x7f4b804dbf80
> >> >
> >>
> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
> >> >     @           0x49d257  std::function<>::operator()()
> >> >     @           0x49837f
> >> >
> >>
> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
> >> >     @           0x493024  process::Future<>::fail()
> >> >     @     0x7f4b8015ad20  process::Promise<>::fail()
> >> >     @     0x7f4b804d9295  process::internal::thenf<>()
> >> >     @     0x7f4b8051788f
> >> >
> >>
> _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> >> >     @     0x7f4b8050fa3b  std::_Bind<>::operator()<>()
> >> >     @     0x7f4b804f94e3  std::_Function_handler<>::_M_invoke()
> >> >     @     0x7f4b8050fc69  std::function<>::operator()()
> >> >     @     0x7f4b804f9609
> >> >
> >>
> _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
> >> >     @     0x7f4b80517936
> >> >
> >>
> _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
> >> >     @     0x7f4b8050fc69  std::function<>::operator()()
> >> >     @     0x7f4b8056b1b4  process::internal::run<>()
> >> >     @     0x7f4b80561672  process::Future<>::fail()
> >> >     @     0x7f4b8059bf5f  std::_Mem_fn<>::operator()<>()
> >> >     @     0x7f4b8059757f
> >> >
> >>
> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> >> >     @     0x7f4b8058fad1
> >> >
> >>
> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
> >> >     @     0x7f4b80585a41
> >> >
> >>
> _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
> >> >     @     0x7f4b80597605
> >> >
> >>
> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
> >> >     @           0x49d257  std::function<>::operator()()
> >> >     @           0x49837f
> >> >
> >>
> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
> >> >     @     0x7f4b8056164a  process::Future<>::fail()
> >> >     @     0x7f4b8055a378  process::Promise<>::fail()
> >> >
> >> > I tried both Zookeeper 3.4.8 and 3.4.6 with latest code of Mesos, but
> no
> >> > luck for both. Any ideas about what happened? Thanks.
> >> >
> >> >
> >> >
> >> > Thanks,
> >> > Qian Zhang
> >>
> >
> >
>

Re: Mesos HA does not work (Failed to recover registrar)

Posted by Jie Yu <yu...@gmail.com>.
Which version are you using?

- Jie

On Sat, Jun 4, 2016 at 4:34 PM, Qian Zhang <zh...@gmail.com> wrote:

> Thanks Vinod and Dick.
>
> I think my 3 ZK servers have formed a quorum, each of them has the
> following config:
>     $ cat conf/zoo.cfg
>     server.1=192.168.122.132:2888:3888
>     server.2=192.168.122.225:2888:3888
>     server.3=192.168.122.171:2888:3888
>     autopurge.purgeInterval=6
>     autopurge.snapRetainCount=5
>     initLimit=10
>     syncLimit=5
>     maxClientCnxns=0
>     clientPort=2181
>     tickTime=2000
>     quorumListenOnAllIPs=true
>     dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
>     dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions
>
> And when I run "bin/zkServer.sh status" on each of them, I can see "Mode:
> leader" for one, and "Mode: follower" for the other two.
>
> I have already tried to manually start 3 masters simultaneously, and here
> is what I see in their log:
> In 192.168.122.171(this is the first master I started):
>     I0605 07:12:49.418721  1187 detector.cpp:152] Detected a new leader:
> (id='25')
>     I0605 07:12:49.419276  1186 group.cpp:698] Trying to get
> '/mesos/log_replicas/0000000024' in ZooKeeper
>     I0605 07:12:49.420013  1188 group.cpp:698] Trying to get
> '/mesos/json.info_0000000025' in ZooKeeper
>     I0605 07:12:49.423807  1188 zookeeper.cpp:259] A new leading master
> (UPID=master@192.168.122.171:5050) is detected
>     I0605 07:12:49.423841  1186 network.hpp:461] ZooKeeper group PIDs: {
> log-replica(1)@192.168.122.171:5050 }
>     I0605 07:12:49.424281  1187 master.cpp:1951] The newly elected leader
> is master@192.168.122.171:5050 with id
> cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>     I0605 07:12:49.424895  1187 master.cpp:1964] Elected as the leading
> master!
>
> In 192.168.122.225 (second master I started):
>     I0605 07:12:51.918702  2246 detector.cpp:152] Detected a new leader:
> (id='25')
>     I0605 07:12:51.919983  2246 group.cpp:698] Trying to get
> '/mesos/json.info_0000000025' in ZooKeeper
>     I0605 07:12:51.921910  2249 network.hpp:461] ZooKeeper group PIDs: {
> log-replica(1)@192.168.122.171:5050 }
>     I0605 07:12:51.925721  2252 replica.cpp:673] Replica in EMPTY status
> received a broadcasted recover request from (6)@192.168.122.225:5050
>     I0605 07:12:51.927891  2246 zookeeper.cpp:259] A new leading master
> (UPID=master@192.168.122.171:5050) is detected
>     I0605 07:12:51.928444  2246 master.cpp:1951] The newly elected leader
> is master@192.168.122.171:5050 with id
> cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>
> In 192.168.122.132 (last master I started):
> I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader:
> (id='25')
> I0605 07:12:53.555179 16429 group.cpp:698] Trying to get
> '/mesos/json.info_0000000025' in ZooKeeper
> I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master (UPID=
> master@192.168.122.171:5050) is detected
>
> So right after I started these 3 masters, the first one (192.168.122.171)
> was successfully elected as leader, but after 60s, 192.168.122.171 failed
> with the error mentioned in my first mail, and then 192.168.122.225 was
> elected as leader, but it failed with the same error too after another 60s,
> and the same thing happened to the last one (192.168.122.132). So after
> about 180s, all my 3 master were down.
>
> I tried both:
>     sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
> --work_dir=/var/lib/mesos/master
> and
>     sudo ./bin/mesos-master.sh --zk=zk://192.168.122.132:2181,
> 192.168.122.171:2181,192.168.122.225:2181/mesos --quorum=2
> --work_dir=/var/lib/mesos/master
> And I see the same error for both.
>
> 192.168.122.132, 192.168.122.225 and 192.168.122.171 are 3 VMs which are
> running on a KVM hypervisor host.
>
>
>
>
> Thanks,
> Qian Zhang
>
> On Sun, Jun 5, 2016 at 3:47 AM, Dick Davies <di...@hellooperator.net>
> wrote:
>
>> You told the master it needed a quorum of 2 and it's the only one
>> online, so it's bombing out.
>> That's the expected behaviour.
>>
>> You need to start at least 2 zookeepers before it will be a functional
>> group, same for the masters.
>>
>> You haven't mentioned how you setup your zookeeper cluster, so i'm
>> assuming that's working
>> correctly (3 nodes, all aware of the other 2 in their config). If not,
>> you need to sort that out first.
>>
>>
>> Also I think your zk URL is wrong - you want to list all 3 zookeeper
>> nodes like this:
>>
>> sudo ./bin/mesos-master.sh
>> --zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2
>> --work_dir=/var/lib/mesos/master
>>
>> when you've run that command on 2 hosts things should start working,
>> you'll want all 3 up for
>> redundancy.
>>
>> On 4 June 2016 at 16:42, Qian Zhang <zh...@gmail.com> wrote:
>> > Hi Folks,
>> >
>> > I am trying to set up a Mesos HA env with 3 nodes, each of nodes has a
>> > Zookeeper running, so they form a Zookeeper cluster. And then when I
>> started
>> > the first Mesos master in one node with:
>> >     sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos
>> --quorum=2
>> > --work_dir=/var/lib/mesos/master
>> >
>> > I found it will hang here for 60 seconds:
>> >   I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master
>> > (UPID=master@192.168.122.132:5050) is detected
>> >   I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected leader
>> is
>> > master@192.168.122.132:5050 with id
>> 40d387a6-4d61-49d6-af44-51dd41457390
>> >   I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading
>> > master!
>> >   I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from registrar
>> >   I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
>> >   I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the
>> writer
>> >
>> > And after 60s, master will fail:
>> > F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed to
>> > recover registrar: Failed to perform fetch within 1mins
>> > *** Check failure stack trace: ***
>> >     @     0x7f4b81372f4e  google::LogMessage::Fail()
>> >     @     0x7f4b81372e9a  google::LogMessage::SendToLog()
>> >     @     0x7f4b8137289c  google::LogMessage::Flush()
>> >     @     0x7f4b813757b0  google::LogMessageFatal::~LogMessageFatal()
>> >     @     0x7f4b8040eea0  mesos::internal::master::fail()
>> >     @     0x7f4b804dbeb3
>> >
>> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
>> >     @     0x7f4b804ba453
>> > _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
>> >     @     0x7f4b804898d7
>> >
>> _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
>> >     @     0x7f4b804dbf80
>> >
>> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>> >     @           0x49d257  std::function<>::operator()()
>> >     @           0x49837f
>> >
>> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>> >     @           0x493024  process::Future<>::fail()
>> >     @     0x7f4b8015ad20  process::Promise<>::fail()
>> >     @     0x7f4b804d9295  process::internal::thenf<>()
>> >     @     0x7f4b8051788f
>> >
>> _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>> >     @     0x7f4b8050fa3b  std::_Bind<>::operator()<>()
>> >     @     0x7f4b804f94e3  std::_Function_handler<>::_M_invoke()
>> >     @     0x7f4b8050fc69  std::function<>::operator()()
>> >     @     0x7f4b804f9609
>> >
>> _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
>> >     @     0x7f4b80517936
>> >
>> _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
>> >     @     0x7f4b8050fc69  std::function<>::operator()()
>> >     @     0x7f4b8056b1b4  process::internal::run<>()
>> >     @     0x7f4b80561672  process::Future<>::fail()
>> >     @     0x7f4b8059bf5f  std::_Mem_fn<>::operator()<>()
>> >     @     0x7f4b8059757f
>> >
>> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>> >     @     0x7f4b8058fad1
>> >
>> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
>> >     @     0x7f4b80585a41
>> >
>> _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
>> >     @     0x7f4b80597605
>> >
>> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>> >     @           0x49d257  std::function<>::operator()()
>> >     @           0x49837f
>> >
>> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>> >     @     0x7f4b8056164a  process::Future<>::fail()
>> >     @     0x7f4b8055a378  process::Promise<>::fail()
>> >
>> > I tried both Zookeeper 3.4.8 and 3.4.6 with latest code of Mesos, but no
>> > luck for both. Any ideas about what happened? Thanks.
>> >
>> >
>> >
>> > Thanks,
>> > Qian Zhang
>>
>
>

Re: Mesos HA does not work (Failed to recover registrar)

Posted by haosdent <ha...@gmail.com>.
Hi @Qian Zhang, your issue reminds me of this thread:
http://search-hadoop.com/m/0Vlr69BZgz1NlAPP1&subj=Re+Mesos+Masters+Leader+Keeps+Fluctuating
which I could not reproduce in my env. I am not sure whether your case is
the same as Stefano's or not.

On Mon, Jun 6, 2016 at 9:06 PM, Qian Zhang <zh...@gmail.com> wrote:

> I deleted everything in the work dir (/var/lib/mesos/master), and tried
> again, the same error still happened :-(
>
>
> Thanks,
> Qian Zhang
>
> On Mon, Jun 6, 2016 at 3:03 AM, Jean Christophe “JC” Martin <
> jch.martin@gmail.com> wrote:
>
>> Qian,
>>
>> Zookeeper should be able to reach a quorum with 2, no need to start 3
>> simultaneously, but there is an issue with Zookeeper related to connection
>> timeouts.
>> https://issues.apache.org/jira/browse/ZOOKEEPER-2164
>> In some circumstances, the timeout is higher than the sync timeout, which
>> cause the leader election to fail.
>> Try setting the parameter cnxtimeout in zookeeper (by default it’s
>> 5000ms) to the value 500 (500ms). After doing this, leader election in ZK
>> will be super fast even if a node is disconnected.
>>
>> JC
>>
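
For illustration, a minimal sketch of how the cnxtimeout suggestion above
could be applied - assuming a ZooKeeper 3.4.x layout where zkServer.sh
sources conf/java.env, and that the setting is the Java system property
zookeeper.cnxTimeout with a value in milliseconds (JC's mail also reads as
if cnxTimeout=500 could go straight into zoo.cfg; check the admin guide for
the release in use):

    # conf/java.env on each of the 3 ZooKeeper nodes; restart ZooKeeper afterwards
    JVMFLAGS="$JVMFLAGS -Dzookeeper.cnxTimeout=500"   # 500ms instead of the 5000ms default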
>> > On Jun 4, 2016, at 4:34 PM, Qian Zhang <zh...@gmail.com> wrote:
>> >
>> > Thanks Vinod and Dick.
>> >
>> > I think my 3 ZK servers have formed a quorum, each of them has the
>> > following config:
>> >    $ cat conf/zoo.cfg
>> >    server.1=192.168.122.132:2888:3888
>> >    server.2=192.168.122.225:2888:3888
>> >    server.3=192.168.122.171:2888:3888
>> >    autopurge.purgeInterval=6
>> >    autopurge.snapRetainCount=5
>> >    initLimit=10
>> >    syncLimit=5
>> >    maxClientCnxns=0
>> >    clientPort=2181
>> >    tickTime=2000
>> >    quorumListenOnAllIPs=true
>> >    dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
>> >    dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions
>> >
>> > And when I run "bin/zkServer.sh status" on each of them, I can see
>> "Mode:
>> > leader" for one, and "Mode: follower" for the other two.
>> >
>> > I have already tried to manually start 3 masters simultaneously, and
>> here
>> > is what I see in their log:
>> > In 192.168.122.171(this is the first master I started):
>> >    I0605 07:12:49.418721  1187 detector.cpp:152] Detected a new leader:
>> > (id='25')
>> >    I0605 07:12:49.419276  1186 group.cpp:698] Trying to get
>> > '/mesos/log_replicas/0000000024' in ZooKeeper
>> >    I0605 07:12:49.420013  1188 group.cpp:698] Trying to get
>> > '/mesos/json.info_0000000025' in ZooKeeper
>> >    I0605 07:12:49.423807  1188 zookeeper.cpp:259] A new leading master
>> > (UPID=master@192.168.122.171:5050) is detected
>> >    I0605 07:12:49.423841 1186 network.hpp:461] ZooKeeper group PIDs: {
>> > log-replica(1)@192.168.122.171:5050 }
>> >    I0605 07:12:49.424281 1187 master.cpp:1951] The newly elected leader
>> > is master@192.168.122.171:5050 with id
>> cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>> >    I0605 07:12:49.424895  1187 master.cpp:1964] Elected as the leading
>> > master!
>> >
>> > In 192.168.122.225 (second master I started):
>> >    I0605 07:12:51.918702  2246 detector.cpp:152] Detected a new leader:
>> > (id='25')
>> >    I0605 07:12:51.919983  2246 group.cpp:698] Trying to get
>> > '/mesos/json.info_0000000025' in ZooKeeper
>> >    I0605 07:12:51.921910  2249 network.hpp:461] ZooKeeper group PIDs: {
>> > log-replica(1)@192.168.122.171:5050 }
>> >    I0605 07:12:51.925721 2252 replica.cpp:673] Replica in EMPTY status
>> > received a broadcasted recover request from (6)@192.168.122.225:5050
>> >    I0605 07:12:51.927891  2246 zookeeper.cpp:259] A new leading master
>> > (UPID=master@192.168.122.171:5050) is detected
>> >    I0605 07:12:51.928444 2246 master.cpp:1951] The newly elected leader
>> > is master@192.168.122.171:5050 with id
>> cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>> >
>> > In 192.168.122.132 (last master I started):
>> > I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader:
>> > (id='25')
>> > I0605 07:12:53.555179 16429 group.cpp:698] Trying to get
>> > '/mesos/json.info_0000000025' in ZooKeeper
>> > I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master
>> (UPID=
>> > master@192.168.122.171:5050) is detected
>> >
>> > So right after I started these 3 masters, the first one
>> (192.168.122.171)
>> > was successfully elected as leader, but after 60s, 192.168.122.171
>> failed
>> > with the error mentioned in my first mail, and then 192.168.122.225 was
>> > elected as leader, but it failed with the same error too after another
>> 60s,
>> > and the same thing happened to the last one (192.168.122.132). So after
>> > about 180s, all my 3 master were down.
>> >
>> > I tried both:
>> >    sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
>> > --work_dir=/var/lib/mesos/master
>> > and
>> >    sudo ./bin/mesos-master.sh --zk=zk://192.168.122.132:2181,
>> > 192.168.122.171:2181,192.168.122.225:2181/mesos --quorum=2
>> > --work_dir=/var/lib/mesos/master
>> > And I see the same error for both.
>> >
>> > 192.168.122.132, 192.168.122.225 and 192.168.122.171 are 3 VMs which are
>> > running on a KVM hypervisor host.
>> >
>> >
>> >
>> >
>> > Thanks,
>> > Qian Zhang
>> >
>> > On Sun, Jun 5, 2016 at 3:47 AM, Dick Davies <di...@hellooperator.net>
>> wrote:
>> >
>> >> You told the master it needed a quorum of 2 and it's the only one
>> >> online, so it's bombing out.
>> >> That's the expected behaviour.
>> >>
>> >> You need to start at least 2 zookeepers before it will be a functional
>> >> group, same for the masters.
>> >>
>> >> You haven't mentioned how you setup your zookeeper cluster, so i'm
>> >> assuming that's working
>> >> correctly (3 nodes, all aware of the other 2 in their config). If not,
>> >> you need to sort that out first.
>> >>
>> >>
>> >> Also I think your zk URL is wrong - you want to list all 3 zookeeper
>> >> nodes like this:
>> >>
>> >> sudo ./bin/mesos-master.sh
>> >> --zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2
>> >> --work_dir=/var/lib/mesos/master
>> >>
>> >> when you've run that command on 2 hosts things should start working,
>> >> you'll want all 3 up for
>> >> redundancy.
>> >>
>> >> On 4 June 2016 at 16:42, Qian Zhang <zh...@gmail.com> wrote:
>> >>> Hi Folks,
>> >>>
>> >>> I am trying to set up a Mesos HA env with 3 nodes, each of nodes has a
>> >>> Zookeeper running, so they form a Zookeeper cluster. And then when I
>> >> started
>> >>> the first Mesos master in one node with:
>> >>>    sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos
>> --quorum=2
>> >>> --work_dir=/var/lib/mesos/master
>> >>>
>> >>> I found it will hang here for 60 seconds:
>> >>>  I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master
>> >>> (UPID=master@192.168.122.132:5050) is detected
>> >>>  I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected leader
>> >> is
>> >>> master@192.168.122.132:5050 with id
>> 40d387a6-4d61-49d6-af44-51dd41457390
>> >>>  I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading
>> >>> master!
>> >>>  I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from
>> registrar
>> >>>  I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
>> >>>  I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the
>> writer
>> >>>
>> >>> And after 60s, master will fail:
>> >>> F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed
>> to
>> >>> recover registrar: Failed to perform fetch within 1mins
>> >>> *** Check failure stack trace: ***
>> >>>    @     0x7f4b81372f4e  google::LogMessage::Fail()
>> >>>    @     0x7f4b81372e9a  google::LogMessage::SendToLog()
>> >>>    @     0x7f4b8137289c  google::LogMessage::Flush()
>> >>>    @     0x7f4b813757b0  google::LogMessageFatal::~LogMessageFatal()
>> >>>    @     0x7f4b8040eea0  mesos::internal::master::fail()
>> >>>    @     0x7f4b804dbeb3
>> >>>
>> >>
>> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
>> >>>    @     0x7f4b804ba453
>> >>> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
>> >>>    @     0x7f4b804898d7
>> >>>
>> >>
>> _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
>> >>>    @     0x7f4b804dbf80
>> >>>
>> >>
>> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>> >>>    @           0x49d257  std::function<>::operator()()
>> >>>    @           0x49837f
>> >>>
>> >>
>> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>> >>>    @           0x493024  process::Future<>::fail()
>> >>>    @     0x7f4b8015ad20  process::Promise<>::fail()
>> >>>    @     0x7f4b804d9295  process::internal::thenf<>()
>> >>>    @     0x7f4b8051788f
>> >>>
>> >>
>> _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>> >>>    @     0x7f4b8050fa3b  std::_Bind<>::operator()<>()
>> >>>    @     0x7f4b804f94e3  std::_Function_handler<>::_M_invoke()
>> >>>    @     0x7f4b8050fc69  std::function<>::operator()()
>> >>>    @     0x7f4b804f9609
>> >>>
>> >>
>> _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
>> >>>    @     0x7f4b80517936
>> >>>
>> >>
>> _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
>> >>>    @     0x7f4b8050fc69  std::function<>::operator()()
>> >>>    @     0x7f4b8056b1b4  process::internal::run<>()
>> >>>    @     0x7f4b80561672  process::Future<>::fail()
>> >>>    @     0x7f4b8059bf5f  std::_Mem_fn<>::operator()<>()
>> >>>    @     0x7f4b8059757f
>> >>>
>> >>
>> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>> >>>    @     0x7f4b8058fad1
>> >>>
>> >>
>> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
>> >>>    @     0x7f4b80585a41
>> >>>
>> >>
>> _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
>> >>>    @     0x7f4b80597605
>> >>>
>> >>
>> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>> >>>    @           0x49d257  std::function<>::operator()()
>> >>>    @           0x49837f
>> >>>
>> >>
>> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>> >>>    @     0x7f4b8056164a  process::Future<>::fail()
>> >>>    @     0x7f4b8055a378  process::Promise<>::fail()
>> >>>
>> >>> I tried both Zookeeper 3.4.8 and 3.4.6 with latest code of Mesos, but
>> no
>> >>> luck for both. Any ideas about what happened? Thanks.
>> >>>
>> >>>
>> >>>
>> >>> Thanks,
>> >>> Qian Zhang
>> >>
>>
>>
>


-- 
Best Regards,
Haosdent Huang

Re: Mesos HA does not work (Failed to recover registrar)

Posted by Chengwei Yang <ch...@gmail.com>.
@Qian,

I think you're running into issues with a firewall; did you make sure your
masters can reach each other?

From master A:
$ telnet B 5050

I suspect that will fail to connect.

Please ensure any firewall is shut down.
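
For example, a quick way to verify this between the three masters (a minimal
sketch; the firewall commands are assumptions, adjust them for your distro):

    # from 192.168.122.171, check that the master port on 192.168.122.225 is open
    $ telnet 192.168.122.225 5050

    # temporarily disable common firewalls while testing
    $ sudo systemctl stop firewalld    # firewalld-based systems
    $ sudo iptables -F                 # or flush iptables rules

Also check 2181 (ZooKeeper client port) and 2888/3888 (ZooKeeper quorum and
leader election ports) between the nodes, since those appear in your zoo.cfg.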

-- 
Thanks,
Chengwei

On Mon, Jun 06, 2016 at 09:06:43PM +0800, Qian Zhang wrote:
> I deleted everything in the work dir (/var/lib/mesos/master), and tried again,
> the same error still happened :-(
> 
> 
> Thanks,
> Qian Zhang
> 
> On Mon, Jun 6, 2016 at 3:03 AM, Jean Christophe “JC” Martin <
> jch.martin@gmail.com> wrote:
> 
>     Qian,
> 
>     Zookeeper should be able to reach a quorum with 2, no need to start 3
>     simultaneously, but there is an issue with Zookeeper related to connection
>     timeouts.
>     https://issues.apache.org/jira/browse/ZOOKEEPER-2164
>     In some circumstances, the timeout is higher than the sync timeout, which
>     cause the leader election to fail.
>     Try setting the parameter cnxtimeout in zookeeper (by default it’s 5000ms)
>     to the value 500 (500ms). After doing this, leader election in ZK will be
>     super fast even if a node is disconnected.
>    
>     JC
>    
>     > On Jun 4, 2016, at 4:34 PM, Qian Zhang <zh...@gmail.com> wrote:
>     >
>     > Thanks Vinod and Dick.
>     >
>     > I think my 3 ZK servers have formed a quorum, each of them has the
>     > following config:
>     >    $ cat conf/zoo.cfg
>     >    server.1=192.168.122.132:2888:3888
>     >    server.2=192.168.122.225:2888:3888
>     >    server.3=192.168.122.171:2888:3888
>     >    autopurge.purgeInterval=6
>     >    autopurge.snapRetainCount=5
>     >    initLimit=10
>     >    syncLimit=5
>     >    maxClientCnxns=0
>     >    clientPort=2181
>     >    tickTime=2000
>     >    quorumListenOnAllIPs=true
>     >    dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
>     >    dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions
>     >
>     > And when I run "bin/zkServer.sh status" on each of them, I can see "Mode:
>     > leader" for one, and "Mode: follower" for the other two.
>     >
>     > I have already tried to manually start 3 masters simultaneously, and here
>     > is what I see in their log:
>     > In 192.168.122.171(this is the first master I started):
>     >    I0605 07:12:49.418721  1187 detector.cpp:152] Detected a new leader:
>     > (id='25')
>     >    I0605 07:12:49.419276  1186 group.cpp:698] Trying to get
>     > '/mesos/log_replicas/0000000024' in ZooKeeper
>     >    I0605 07:12:49.420013  1188 group.cpp:698] Trying to get
>     > '/mesos/json.info_0000000025' in ZooKeeper
>     >    I0605 07:12:49.423807  1188 zookeeper.cpp:259] A new leading master
>     > (UPID=master@192.168.122.171:5050) is detected
>     >    I0605 07:12:49.423841 1186 network.hpp:461] ZooKeeper group PIDs: {
>     > log-replica(1)@192.168.122.171:5050 }
>     >    I0605 07:12:49.424281 1187 master.cpp:1951] The newly elected leader
>     > is master@192.168.122.171:5050 with id
>     cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>     >    I0605 07:12:49.424895  1187 master.cpp:1964] Elected as the leading
>     > master!
>     >
>     > In 192.168.122.225 (second master I started):
>     >    I0605 07:12:51.918702  2246 detector.cpp:152] Detected a new leader:
>     > (id='25')
>     >    I0605 07:12:51.919983  2246 group.cpp:698] Trying to get
>     > '/mesos/json.info_0000000025' in ZooKeeper
>     >    I0605 07:12:51.921910  2249 network.hpp:461] ZooKeeper group PIDs: {
>     > log-replica(1)@192.168.122.171:5050 }
>     >    I0605 07:12:51.925721 2252 replica.cpp:673] Replica in EMPTY status
>     > received a broadcasted recover request from (6)@192.168.122.225:5050
>     >    I0605 07:12:51.927891  2246 zookeeper.cpp:259] A new leading master
>     > (UPID=master@192.168.122.171:5050) is detected
>     >    I0605 07:12:51.928444 2246 master.cpp:1951] The newly elected leader
>     > is master@192.168.122.171:5050 with id
>     cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>     >
>     > In 192.168.122.132 (last master I started):
>     > I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader:
>     > (id='25')
>     > I0605 07:12:53.555179 16429 group.cpp:698] Trying to get
>     > '/mesos/json.info_0000000025' in ZooKeeper
>     > I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master (UPID
>     =
>     > master@192.168.122.171:5050) is detected
>     >
>     > So right after I started these 3 masters, the first one (192.168.122.171)
>     > was successfully elected as leader, but after 60s, 192.168.122.171 failed
>     > with the error mentioned in my first mail, and then 192.168.122.225 was
>     > elected as leader, but it failed with the same error too after another
>     60s,
>     > and the same thing happened to the last one (192.168.122.132). So after
>     > about 180s, all my 3 master were down.
>     >
>     > I tried both:
>     >    sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
>     > --work_dir=/var/lib/mesos/master
>     > and
>     >    sudo ./bin/mesos-master.sh --zk=zk://192.168.122.132:2181,
>     > 192.168.122.171:2181,192.168.122.225:2181/mesos --quorum=2
>     > --work_dir=/var/lib/mesos/master
>     > And I see the same error for both.
>     >
>     > 192.168.122.132, 192.168.122.225 and 192.168.122.171 are 3 VMs which are
>     > running on a KVM hypervisor host.
>     >
>     >
>     >
>     >
>     > Thanks,
>     > Qian Zhang
>     >
>     > On Sun, Jun 5, 2016 at 3:47 AM, Dick Davies <di...@hellooperator.net>
>     wrote:
>     >
>     >> You told the master it needed a quorum of 2 and it's the only one
>     >> online, so it's bombing out.
>     >> That's the expected behaviour.
>     >>
>     >> You need to start at least 2 zookeepers before it will be a functional
>     >> group, same for the masters.
>     >>
>     >> You haven't mentioned how you setup your zookeeper cluster, so i'm
>     >> assuming that's working
>     >> correctly (3 nodes, all aware of the other 2 in their config). If not,
>     >> you need to sort that out first.
>     >>
>     >>
>     >> Also I think your zk URL is wrong - you want to list all 3 zookeeper
>     >> nodes like this:
>     >>
>     >> sudo ./bin/mesos-master.sh
>     >> --zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2
>     >> --work_dir=/var/lib/mesos/master
>     >>
>     >> when you've run that command on 2 hosts things should start working,
>     >> you'll want all 3 up for
>     >> redundancy.
>     >>
>     >> On 4 June 2016 at 16:42, Qian Zhang <zh...@gmail.com> wrote:
>     >>> Hi Folks,
>     >>>
>     >>> I am trying to set up a Mesos HA env with 3 nodes, each of nodes has a
>     >>> Zookeeper running, so they form a Zookeeper cluster. And then when I
>     >> started
>     >>> the first Mesos master in one node with:
>     >>>    sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
>     >>> --work_dir=/var/lib/mesos/master
>     >>>
>     >>> I found it will hang here for 60 seconds:
>     >>>  I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master
>     >>> (UPID=master@192.168.122.132:5050) is detected
>     >>>  I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected leader
>     >> is
>     >>> master@192.168.122.132:5050 with id
>     40d387a6-4d61-49d6-af44-51dd41457390
>     >>>  I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading
>     >>> master!
>     >>>  I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from registrar
>     >>>  I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
>     >>>  I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the
>     writer
>     >>>
>     >>> And after 60s, master will fail:
>     >>> F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed to
>     >>> recover registrar: Failed to perform fetch within 1mins
>     >>> *** Check failure stack trace: ***
>     >>>    @     0x7f4b81372f4e  google::LogMessage::Fail()
>     >>>    @     0x7f4b81372e9a  google::LogMessage::SendToLog()
>     >>>    @     0x7f4b8137289c  google::LogMessage::Flush()
>     >>>    @     0x7f4b813757b0  google::LogMessageFatal::~LogMessageFatal()
>     >>>    @     0x7f4b8040eea0  mesos::internal::master::fail()
>     >>>    @     0x7f4b804dbeb3
>     >>>
>     >>
>     _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
>     >>>    @     0x7f4b804ba453
>     >>> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
>     >>>    @     0x7f4b804898d7
>     >>>
>     >>
>     _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
>     >>>    @     0x7f4b804dbf80
>     >>>
>     >>
>     _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>     >>>    @           0x49d257  std::function<>::operator()()
>     >>>    @           0x49837f
>     >>>
>     >>
>     _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>     >>>    @           0x493024  process::Future<>::fail()
>     >>>    @     0x7f4b8015ad20  process::Promise<>::fail()
>     >>>    @     0x7f4b804d9295  process::internal::thenf<>()
>     >>>    @     0x7f4b8051788f
>     >>>
>     >>
>     _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>     >>>    @     0x7f4b8050fa3b  std::_Bind<>::operator()<>()
>     >>>    @     0x7f4b804f94e3  std::_Function_handler<>::_M_invoke()
>     >>>    @     0x7f4b8050fc69  std::function<>::operator()()
>     >>>    @     0x7f4b804f9609
>     >>>
>     >>
>     _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
>     >>>    @     0x7f4b80517936
>     >>>
>     >>
>     _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
>     >>>    @     0x7f4b8050fc69  std::function<>::operator()()
>     >>>    @     0x7f4b8056b1b4  process::internal::run<>()
>     >>>    @     0x7f4b80561672  process::Future<>::fail()
>     >>>    @     0x7f4b8059bf5f  std::_Mem_fn<>::operator()<>()
>     >>>    @     0x7f4b8059757f
>     >>>
>     >>
>     _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>     >>>    @     0x7f4b8058fad1
>     >>>
>     >>
>     _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
>     >>>    @     0x7f4b80585a41
>     >>>
>     >>
>     _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
>     >>>    @     0x7f4b80597605
>     >>>
>     >>
>     _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>     >>>    @           0x49d257  std::function<>::operator()()
>     >>>    @           0x49837f
>     >>>
>     >>
>     _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>     >>>    @     0x7f4b8056164a  process::Future<>::fail()
>     >>>    @     0x7f4b8055a378  process::Promise<>::fail()
>     >>>
>     >>> I tried both Zookeeper 3.4.8 and 3.4.6 with latest code of Mesos, but
>     no
>     >>> luck for both. Any ideas about what happened? Thanks.
>     >>>
>     >>>
>     >>>
>     >>> Thanks,
>     >>> Qian Zhang
>     >>
> 
> 
> 
> SECURITY NOTE: file ~/.netrc must not be accessible by others

Re: Mesos HA does not work (Failed to recover registrar)

Posted by Qian Zhang <zh...@gmail.com>.
I deleted everything in the work dir (/var/lib/mesos/master) and tried
again, but the same error still happened :-(
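
(For reference, this is roughly what I ran on each of the three masters, with
mesos-master.sh stopped first; a sketch only, the path is the same one passed
to --work_dir:)

    $ sudo rm -rf /var/lib/mesos/master/*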


Thanks,
Qian Zhang

On Mon, Jun 6, 2016 at 3:03 AM, Jean Christophe “JC” Martin <
jch.martin@gmail.com> wrote:

> Qian,
>
> Zookeeper should be able to reach a quorum with 2, no need to start 3
> simultaneously, but there is an issue with Zookeeper related to connection
> timeouts.
> https://issues.apache.org/jira/browse/ZOOKEEPER-2164
> In some circumstances, the timeout is higher than the sync timeout, which
> cause the leader election to fail.
> Try setting the parameter cnxtimeout in zookeeper (by default it’s 5000ms)
> to the value 500 (500ms). After doing this, leader election in ZK will be
> super fast even if a node is disconnected.
>
> JC
>
> > On Jun 4, 2016, at 4:34 PM, Qian Zhang <zh...@gmail.com> wrote:
> >
> > Thanks Vinod and Dick.
> >
> > I think my 3 ZK servers have formed a quorum, each of them has the
> > following config:
> >    $ cat conf/zoo.cfg
> >    server.1=192.168.122.132:2888:3888
> >    server.2=192.168.122.225:2888:3888
> >    server.3=192.168.122.171:2888:3888
> >    autopurge.purgeInterval=6
> >    autopurge.snapRetainCount=5
> >    initLimit=10
> >    syncLimit=5
> >    maxClientCnxns=0
> >    clientPort=2181
> >    tickTime=2000
> >    quorumListenOnAllIPs=true
> >    dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
> >    dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions
> >
> > And when I run "bin/zkServer.sh status" on each of them, I can see "Mode:
> > leader" for one, and "Mode: follower" for the other two.
> >
> > I have already tried to manually start 3 masters simultaneously, and here
> > is what I see in their log:
> > In 192.168.122.171(this is the first master I started):
> >    I0605 07:12:49.418721  1187 detector.cpp:152] Detected a new leader:
> > (id='25')
> >    I0605 07:12:49.419276  1186 group.cpp:698] Trying to get
> > '/mesos/log_replicas/0000000024' in ZooKeeper
> >    I0605 07:12:49.420013  1188 group.cpp:698] Trying to get
> > '/mesos/json.info_0000000025' in ZooKeeper
> >    I0605 07:12:49.423807  1188 zookeeper.cpp:259] A new leading master
> > (UPID=master@192.168.122.171:5050) is detected
> >    I0605 07:12:49.423841 1186 network.hpp:461] ZooKeeper group PIDs: {
> > log-replica(1)@192.168.122.171:5050 }
> >    I0605 07:12:49.424281 1187 master.cpp:1951] The newly elected leader
> > is master@192.168.122.171:5050 with id
> cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
> >    I0605 07:12:49.424895  1187 master.cpp:1964] Elected as the leading
> > master!
> >
> > In 192.168.122.225 (second master I started):
> >    I0605 07:12:51.918702  2246 detector.cpp:152] Detected a new leader:
> > (id='25')
> >    I0605 07:12:51.919983  2246 group.cpp:698] Trying to get
> > '/mesos/json.info_0000000025' in ZooKeeper
> >    I0605 07:12:51.921910  2249 network.hpp:461] ZooKeeper group PIDs: {
> > log-replica(1)@192.168.122.171:5050 }
> >    I0605 07:12:51.925721 2252 replica.cpp:673] Replica in EMPTY status
> > received a broadcasted recover request from (6)@192.168.122.225:5050
> >    I0605 07:12:51.927891  2246 zookeeper.cpp:259] A new leading master
> > (UPID=master@192.168.122.171:5050) is detected
> >    I0605 07:12:51.928444 2246 master.cpp:1951] The newly elected leader
> > is master@192.168.122.171:5050 with id
> cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
> >
> > In 192.168.122.132 (last master I started):
> > I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader:
> > (id='25')
> > I0605 07:12:53.555179 16429 group.cpp:698] Trying to get
> > '/mesos/json.info_0000000025' in ZooKeeper
> > I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master
> (UPID=
> > master@192.168.122.171:5050) is detected
> >
> > So right after I started these 3 masters, the first one (192.168.122.171)
> > was successfully elected as leader, but after 60s, 192.168.122.171 failed
> > with the error mentioned in my first mail, and then 192.168.122.225 was
> > elected as leader, but it failed with the same error too after another
> 60s,
> > and the same thing happened to the last one (192.168.122.132). So after
> > about 180s, all my 3 master were down.
> >
> > I tried both:
> >    sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
> > --work_dir=/var/lib/mesos/master
> > and
> >    sudo ./bin/mesos-master.sh --zk=zk://192.168.122.132:2181,
> > 192.168.122.171:2181,192.168.122.225:2181/mesos --quorum=2
> > --work_dir=/var/lib/mesos/master
> > And I see the same error for both.
> >
> > 192.168.122.132, 192.168.122.225 and 192.168.122.171 are 3 VMs which are
> > running on a KVM hypervisor host.
> >
> >
> >
> >
> > Thanks,
> > Qian Zhang
> >
> > On Sun, Jun 5, 2016 at 3:47 AM, Dick Davies <di...@hellooperator.net>
> wrote:
> >
> >> You told the master it needed a quorum of 2 and it's the only one
> >> online, so it's bombing out.
> >> That's the expected behaviour.
> >>
> >> You need to start at least 2 zookeepers before it will be a functional
> >> group, same for the masters.
> >>
> >> You haven't mentioned how you setup your zookeeper cluster, so i'm
> >> assuming that's working
> >> correctly (3 nodes, all aware of the other 2 in their config). If not,
> >> you need to sort that out first.
> >>
> >>
> >> Also I think your zk URL is wrong - you want to list all 3 zookeeper
> >> nodes like this:
> >>
> >> sudo ./bin/mesos-master.sh
> >> --zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2
> >> --work_dir=/var/lib/mesos/master
> >>
> >> when you've run that command on 2 hosts things should start working,
> >> you'll want all 3 up for
> >> redundancy.
> >>
> >> On 4 June 2016 at 16:42, Qian Zhang <zh...@gmail.com> wrote:
> >>> Hi Folks,
> >>>
> >>> I am trying to set up a Mesos HA env with 3 nodes, each of nodes has a
> >>> Zookeeper running, so they form a Zookeeper cluster. And then when I
> >> started
> >>> the first Mesos master in one node with:
> >>>    sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos
> --quorum=2
> >>> --work_dir=/var/lib/mesos/master
> >>>
> >>> I found it will hang here for 60 seconds:
> >>>  I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master
> >>> (UPID=master@192.168.122.132:5050) is detected
> >>>  I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected leader
> >> is
> >>> master@192.168.122.132:5050 with id
> 40d387a6-4d61-49d6-af44-51dd41457390
> >>>  I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading
> >>> master!
> >>>  I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from registrar
> >>>  I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
> >>>  I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the
> writer
> >>>
> >>> And after 60s, master will fail:
> >>> F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed to
> >>> recover registrar: Failed to perform fetch within 1mins
> >>> *** Check failure stack trace: ***
> >>>    @     0x7f4b81372f4e  google::LogMessage::Fail()
> >>>    @     0x7f4b81372e9a  google::LogMessage::SendToLog()
> >>>    @     0x7f4b8137289c  google::LogMessage::Flush()
> >>>    @     0x7f4b813757b0  google::LogMessageFatal::~LogMessageFatal()
> >>>    @     0x7f4b8040eea0  mesos::internal::master::fail()
> >>>    @     0x7f4b804dbeb3
> >>>
> >>
> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
> >>>    @     0x7f4b804ba453
> >>> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
> >>>    @     0x7f4b804898d7
> >>>
> >>
> _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
> >>>    @     0x7f4b804dbf80
> >>>
> >>
> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
> >>>    @           0x49d257  std::function<>::operator()()
> >>>    @           0x49837f
> >>>
> >>
> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
> >>>    @           0x493024  process::Future<>::fail()
> >>>    @     0x7f4b8015ad20  process::Promise<>::fail()
> >>>    @     0x7f4b804d9295  process::internal::thenf<>()
> >>>    @     0x7f4b8051788f
> >>>
> >>
> _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> >>>    @     0x7f4b8050fa3b  std::_Bind<>::operator()<>()
> >>>    @     0x7f4b804f94e3  std::_Function_handler<>::_M_invoke()
> >>>    @     0x7f4b8050fc69  std::function<>::operator()()
> >>>    @     0x7f4b804f9609
> >>>
> >>
> _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
> >>>    @     0x7f4b80517936
> >>>
> >>
> _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
> >>>    @     0x7f4b8050fc69  std::function<>::operator()()
> >>>    @     0x7f4b8056b1b4  process::internal::run<>()
> >>>    @     0x7f4b80561672  process::Future<>::fail()
> >>>    @     0x7f4b8059bf5f  std::_Mem_fn<>::operator()<>()
> >>>    @     0x7f4b8059757f
> >>>
> >>
> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> >>>    @     0x7f4b8058fad1
> >>>
> >>
> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
> >>>    @     0x7f4b80585a41
> >>>
> >>
> _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
> >>>    @     0x7f4b80597605
> >>>
> >>
> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
> >>>    @           0x49d257  std::function<>::operator()()
> >>>    @           0x49837f
> >>>
> >>
> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
> >>>    @     0x7f4b8056164a  process::Future<>::fail()
> >>>    @     0x7f4b8055a378  process::Promise<>::fail()
> >>>
> >>> I tried both Zookeeper 3.4.8 and 3.4.6 with latest code of Mesos, but
> no
> >>> luck for both. Any ideas about what happened? Thanks.
> >>>
> >>>
> >>>
> >>> Thanks,
> >>> Qian Zhang
> >>
>
>

Re: Mesos HA does not work (Failed to recover registrar)

Posted by Qian Zhang <zh...@gmail.com>.
I deleted everything in the work dir (/var/lib/mesos/master), and tried
again, the same error still happened :-(


Thanks,
Qian Zhang

On Mon, Jun 6, 2016 at 3:03 AM, Jean Christophe “JC” Martin <
jch.martin@gmail.com> wrote:

> Qian,
>
> Zookeeper should be able to reach a quorum with 2, no need to start 3
> simultaneously, but there is an issue with Zookeeper related to connection
> timeouts.
> https://issues.apache.org/jira/browse/ZOOKEEPER-2164
> In some circumstances, the timeout is higher than the sync timeout, which
> cause the leader election to fail.
> Try setting the parameter cnxtimeout in zookeeper (by default it’s 5000ms)
> to the value 500 (500ms). After doing this, leader election in ZK will be
> super fast even if a node is disconnected.
>
> JC
>
> > On Jun 4, 2016, at 4:34 PM, Qian Zhang <zh...@gmail.com> wrote:
> >
> > Thanks Vinod and Dick.
> >
> > I think my 3 ZK servers have formed a quorum, each of them has the
> > following config:
> >    $ cat conf/zoo.cfg
> >    server.1=192.168.122.132:2888:3888
> >    server.2=192.168.122.225:2888:3888
> >    server.3=192.168.122.171:2888:3888
> >    autopurge.purgeInterval=6
> >    autopurge.snapRetainCount=5
> >    initLimit=10
> >    syncLimit=5
> >    maxClientCnxns=0
> >    clientPort=2181
> >    tickTime=2000
> >    quorumListenOnAllIPs=true
> >    dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
> >    dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions
> >
> > And when I run "bin/zkServer.sh status" on each of them, I can see "Mode:
> > leader" for one, and "Mode: follower" for the other two.
> >
> > I have already tried to manually start 3 masters simultaneously, and here
> > is what I see in their log:
> > In 192.168.122.171(this is the first master I started):
> >    I0605 07:12:49.418721  1187 detector.cpp:152] Detected a new leader:
> > (id='25')
> >    I0605 07:12:49.419276  1186 group.cpp:698] Trying to get
> > '/mesos/log_replicas/0000000024' in ZooKeeper
> >    I0605 07:12:49.420013  1188 group.cpp:698] Trying to get
> > '/mesos/json.info_0000000025' in ZooKeeper
> >    I0605 07:12:49.423807  1188 zookeeper.cpp:259] A new leading master
> > (UPID=master@192.168.122.171:5050) is detected
> >    I0605 07:12:49.423841 1186 network.hpp:461] ZooKeeper group PIDs: {
> > log-replica(1)@192.168.122.171:5050 }
> >    I0605 07:12:49.424281 1187 master.cpp:1951] The newly elected leader
> > is master@192.168.122.171:5050 with id
> cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
> >    I0605 07:12:49.424895  1187 master.cpp:1964] Elected as the leading
> > master!
> >
> > In 192.168.122.225 (second master I started):
> >    I0605 07:12:51.918702  2246 detector.cpp:152] Detected a new leader:
> > (id='25')
> >    I0605 07:12:51.919983  2246 group.cpp:698] Trying to get
> > '/mesos/json.info_0000000025' in ZooKeeper
> >    I0605 07:12:51.921910  2249 network.hpp:461] ZooKeeper group PIDs: {
> > log-replica(1)@192.168.122.171:5050 }
> >    I0605 07:12:51.925721 2252 replica.cpp:673] Replica in EMPTY status
> > received a broadcasted recover request from (6)@192.168.122.225:5050
> >    I0605 07:12:51.927891  2246 zookeeper.cpp:259] A new leading master
> > (UPID=master@192.168.122.171:5050) is detected
> >    I0605 07:12:51.928444 2246 master.cpp:1951] The newly elected leader
> > is master@192.168.122.171:5050 with id
> cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
> >
> > In 192.168.122.132 (last master I started):
> > I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader:
> > (id='25')
> > I0605 07:12:53.555179 16429 group.cpp:698] Trying to get
> > '/mesos/json.info_0000000025' in ZooKeeper
> > I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master
> (UPID=
> > master@192.168.122.171:5050) is detected
> >
> > So right after I started these 3 masters, the first one (192.168.122.171)
> > was successfully elected as leader, but after 60s, 192.168.122.171 failed
> > with the error mentioned in my first mail, and then 192.168.122.225 was
> > elected as leader, but it failed with the same error too after another
> 60s,
> > and the same thing happened to the last one (192.168.122.132). So after
> > about 180s, all my 3 master were down.
> >
> > I tried both:
> >    sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
> > --work_dir=/var/lib/mesos/master
> > and
> >    sudo ./bin/mesos-master.sh --zk=zk://192.168.122.132:2181,
> > 192.168.122.171:2181,192.168.122.225:2181/mesos --quorum=2
> > --work_dir=/var/lib/mesos/master
> > And I see the same error for both.
> >
> > 192.168.122.132, 192.168.122.225 and 192.168.122.171 are 3 VMs which are
> > running on a KVM hypervisor host.
> >
> >
> >
> >
> > Thanks,
> > Qian Zhang
> >
> > On Sun, Jun 5, 2016 at 3:47 AM, Dick Davies <di...@hellooperator.net>
> wrote:
> >
> >> You told the master it needed a quorum of 2 and it's the only one
> >> online, so it's bombing out.
> >> That's the expected behaviour.
> >>
> >> You need to start at least 2 zookeepers before it will be a functional
> >> group, same for the masters.
> >>
> >> You haven't mentioned how you setup your zookeeper cluster, so i'm
> >> assuming that's working
> >> correctly (3 nodes, all aware of the other 2 in their config). If not,
> >> you need to sort that out first.
> >>
> >>
> >> Also I think your zk URL is wrong - you want to list all 3 zookeeper
> >> nodes like this:
> >>
> >> sudo ./bin/mesos-master.sh
> >> --zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2
> >> --work_dir=/var/lib/mesos/master
> >>
> >> when you've run that command on 2 hosts things should start working,
> >> you'll want all 3 up for
> >> redundancy.
> >>
> >> On 4 June 2016 at 16:42, Qian Zhang <zh...@gmail.com> wrote:
> >>> Hi Folks,
> >>>
> >>> I am trying to set up a Mesos HA env with 3 nodes, each of nodes has a
> >>> Zookeeper running, so they form a Zookeeper cluster. And then when I
> >> started
> >>> the first Mesos master in one node with:
> >>>    sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos
> --quorum=2
> >>> --work_dir=/var/lib/mesos/master
> >>>
> >>> I found it will hang here for 60 seconds:
> >>>  I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master
> >>> (UPID=master@192.168.122.132:5050) is detected
> >>>  I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected leader
> >> is
> >>> master@192.168.122.132:5050 with id
> 40d387a6-4d61-49d6-af44-51dd41457390
> >>>  I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading
> >>> master!
> >>>  I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from registrar
> >>>  I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
> >>>  I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the
> writer
> >>>
> >>> And after 60s, master will fail:
> >>> F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed to
> >>> recover registrar: Failed to perform fetch within 1mins
> >>> *** Check failure stack trace: ***
> >>>    @     0x7f4b81372f4e  google::LogMessage::Fail()
> >>>    @     0x7f4b81372e9a  google::LogMessage::SendToLog()
> >>>    @     0x7f4b8137289c  google::LogMessage::Flush()
> >>>    @     0x7f4b813757b0  google::LogMessageFatal::~LogMessageFatal()
> >>>    @     0x7f4b8040eea0  mesos::internal::master::fail()
> >>>    @     0x7f4b804dbeb3
> >>>
> >>
> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
> >>>    @     0x7f4b804ba453
> >>> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
> >>>    @     0x7f4b804898d7
> >>>
> >>
> _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
> >>>    @     0x7f4b804dbf80
> >>>
> >>
> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
> >>>    @           0x49d257  std::function<>::operator()()
> >>>    @           0x49837f
> >>>
> >>
> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
> >>>    @           0x493024  process::Future<>::fail()
> >>>    @     0x7f4b8015ad20  process::Promise<>::fail()
> >>>    @     0x7f4b804d9295  process::internal::thenf<>()
> >>>    @     0x7f4b8051788f
> >>>
> >>
> _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> >>>    @     0x7f4b8050fa3b  std::_Bind<>::operator()<>()
> >>>    @     0x7f4b804f94e3  std::_Function_handler<>::_M_invoke()
> >>>    @     0x7f4b8050fc69  std::function<>::operator()()
> >>>    @     0x7f4b804f9609
> >>>
> >>
> _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
> >>>    @     0x7f4b80517936
> >>>
> >>
> _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
> >>>    @     0x7f4b8050fc69  std::function<>::operator()()
> >>>    @     0x7f4b8056b1b4  process::internal::run<>()
> >>>    @     0x7f4b80561672  process::Future<>::fail()
> >>>    @     0x7f4b8059bf5f  std::_Mem_fn<>::operator()<>()
> >>>    @     0x7f4b8059757f
> >>>
> >>
> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> >>>    @     0x7f4b8058fad1
> >>>
> >>
> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
> >>>    @     0x7f4b80585a41
> >>>
> >>
> _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
> >>>    @     0x7f4b80597605
> >>>
> >>
> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
> >>>    @           0x49d257  std::function<>::operator()()
> >>>    @           0x49837f
> >>>
> >>
> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
> >>>    @     0x7f4b8056164a  process::Future<>::fail()
> >>>    @     0x7f4b8055a378  process::Promise<>::fail()
> >>>
> >>> I tried both Zookeeper 3.4.8 and 3.4.6 with latest code of Mesos, but
> no
> >>> luck for both. Any ideas about what happened? Thanks.
> >>>
> >>>
> >>>
> >>> Thanks,
> >>> Qian Zhang
> >>
>
>

Re: Mesos HA does not work (Failed to recover registrar)

Posted by Jean Christophe “JC” Martin <jc...@gmail.com>.
Qian,

ZooKeeper should be able to reach a quorum with 2 servers, so there is no need to start all 3 simultaneously, but there is a known ZooKeeper issue related to connection timeouts:
https://issues.apache.org/jira/browse/ZOOKEEPER-2164
In some circumstances the connection timeout is higher than the sync timeout, which causes leader election to fail.
Try setting the ZooKeeper parameter cnxTimeout (5000 ms by default) to 500 (500 ms). After doing this, leader election in ZK will be very fast even if a node is disconnected.
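
For anyone who wants to try that, a minimal sketch of one way to set it, assuming ZooKeeper 3.4.x (where the value is read from the zookeeper.cnxTimeout Java system property, in milliseconds) and that zkEnv.sh picks up conf/java.env; adjust for however you launch ZooKeeper:

    # conf/java.env on each ZooKeeper server, then restart the servers
    export JVMFLAGS="$JVMFLAGS -Dzookeeper.cnxTimeout=500"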

JC

> On Jun 4, 2016, at 4:34 PM, Qian Zhang <zh...@gmail.com> wrote:
> 
> Thanks Vinod and Dick.
> 
> I think my 3 ZK servers have formed a quorum, each of them has the
> following config:
>    $ cat conf/zoo.cfg
>    server.1=192.168.122.132:2888:3888
>    server.2=192.168.122.225:2888:3888
>    server.3=192.168.122.171:2888:3888
>    autopurge.purgeInterval=6
>    autopurge.snapRetainCount=5
>    initLimit=10
>    syncLimit=5
>    maxClientCnxns=0
>    clientPort=2181
>    tickTime=2000
>    quorumListenOnAllIPs=true
>    dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
>    dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions
> 
> And when I run "bin/zkServer.sh status" on each of them, I can see "Mode:
> leader" for one, and "Mode: follower" for the other two.
> 
> I have already tried to manually start 3 masters simultaneously, and here
> is what I see in their log:
> In 192.168.122.171(this is the first master I started):
>    I0605 07:12:49.418721  1187 detector.cpp:152] Detected a new leader:
> (id='25')
>    I0605 07:12:49.419276  1186 group.cpp:698] Trying to get
> '/mesos/log_replicas/0000000024' in ZooKeeper
>    I0605 07:12:49.420013  1188 group.cpp:698] Trying to get
> '/mesos/json.info_0000000025' in ZooKeeper
>    I0605 07:12:49.423807  1188 zookeeper.cpp:259] A new leading master
> (UPID=master@192.168.122.171:5050) is detected
>    I0605 07:12:49.423841  1186 network.hpp:461] ZooKeeper group PIDs: {
> log-replica(1)@192.168.122.171:5050 }
>    I0605 07:12:49.424281  1187 master.cpp:1951] The newly elected leader
> is master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>    I0605 07:12:49.424895  1187 master.cpp:1964] Elected as the leading
> master!
> 
> In 192.168.122.225 (second master I started):
>    I0605 07:12:51.918702  2246 detector.cpp:152] Detected a new leader:
> (id='25')
>    I0605 07:12:51.919983  2246 group.cpp:698] Trying to get
> '/mesos/json.info_0000000025' in ZooKeeper
>    I0605 07:12:51.921910  2249 network.hpp:461] ZooKeeper group PIDs: {
> log-replica(1)@192.168.122.171:5050 }
>    I0605 07:12:51.925721  2252 replica.cpp:673] Replica in EMPTY status
> received a broadcasted recover request from (6)@192.168.122.225:5050
>    I0605 07:12:51.927891  2246 zookeeper.cpp:259] A new leading master
> (UPID=master@192.168.122.171:5050) is detected
>    I0605 07:12:51.928444  2246 master.cpp:1951] The newly elected leader
> is master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
> 
> In 192.168.122.132 (last master I started):
> I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader:
> (id='25')
> I0605 07:12:53.555179 16429 group.cpp:698] Trying to get
> '/mesos/json.info_0000000025' in ZooKeeper
> I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master (UPID=
> master@192.168.122.171:5050) is detected
> 
> So right after I started these 3 masters, the first one (192.168.122.171)
> was successfully elected as leader, but after 60s, 192.168.122.171 failed
> with the error mentioned in my first mail, and then 192.168.122.225 was
> elected as leader, but it failed with the same error too after another 60s,
> and the same thing happened to the last one (192.168.122.132). So after
> about 180s, all my 3 master were down.
> 
> I tried both:
>    sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
> --work_dir=/var/lib/mesos/master
> and
>    sudo ./bin/mesos-master.sh --zk=zk://192.168.122.132:2181,
> 192.168.122.171:2181,192.168.122.225:2181/mesos --quorum=2
> --work_dir=/var/lib/mesos/master
> And I see the same error for both.
> 
> 192.168.122.132, 192.168.122.225 and 192.168.122.171 are 3 VMs which are
> running on a KVM hypervisor host.
> 
> 
> 
> 
> Thanks,
> Qian Zhang
> 
> On Sun, Jun 5, 2016 at 3:47 AM, Dick Davies <di...@hellooperator.net> wrote:
> 
>> You told the master it needed a quorum of 2 and it's the only one
>> online, so it's bombing out.
>> That's the expected behaviour.
>> 
>> You need to start at least 2 zookeepers before it will be a functional
>> group, same for the masters.
>> 
>> You haven't mentioned how you setup your zookeeper cluster, so i'm
>> assuming that's working
>> correctly (3 nodes, all aware of the other 2 in their config). If not,
>> you need to sort that out first.
>> 
>> 
>> Also I think your zk URL is wrong - you want to list all 3 zookeeper
>> nodes like this:
>> 
>> sudo ./bin/mesos-master.sh
>> --zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2
>> --work_dir=/var/lib/mesos/master
>> 
>> when you've run that command on 2 hosts things should start working,
>> you'll want all 3 up for
>> redundancy.
>> 
>> On 4 June 2016 at 16:42, Qian Zhang <zh...@gmail.com> wrote:
>>> Hi Folks,
>>> 
>>> I am trying to set up a Mesos HA env with 3 nodes, each of nodes has a
>>> Zookeeper running, so they form a Zookeeper cluster. And then when I
>> started
>>> the first Mesos master in one node with:
>>>    sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
>>> --work_dir=/var/lib/mesos/master
>>> 
>>> I found it will hang here for 60 seconds:
>>>  I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master
>>> (UPID=master@192.168.122.132:5050) is detected
>>>  I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected leader
>> is
>>> master@192.168.122.132:5050 with id 40d387a6-4d61-49d6-af44-51dd41457390
>>>  I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading
>>> master!
>>>  I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from registrar
>>>  I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
>>>  I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the writer
>>> 
>>> And after 60s, master will fail:
>>> F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed to
>>> recover registrar: Failed to perform fetch within 1mins
>>> *** Check failure stack trace: ***
>>>    @     0x7f4b81372f4e  google::LogMessage::Fail()
>>>    @     0x7f4b81372e9a  google::LogMessage::SendToLog()
>>>    @     0x7f4b8137289c  google::LogMessage::Flush()
>>>    @     0x7f4b813757b0  google::LogMessageFatal::~LogMessageFatal()
>>>    @     0x7f4b8040eea0  mesos::internal::master::fail()
>>>    @     0x7f4b804dbeb3
>>> 
>> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
>>>    @     0x7f4b804ba453
>>> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
>>>    @     0x7f4b804898d7
>>> 
>> _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
>>>    @     0x7f4b804dbf80
>>> 
>> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>>>    @           0x49d257  std::function<>::operator()()
>>>    @           0x49837f
>>> 
>> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>>>    @           0x493024  process::Future<>::fail()
>>>    @     0x7f4b8015ad20  process::Promise<>::fail()
>>>    @     0x7f4b804d9295  process::internal::thenf<>()
>>>    @     0x7f4b8051788f
>>> 
>> _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>>>    @     0x7f4b8050fa3b  std::_Bind<>::operator()<>()
>>>    @     0x7f4b804f94e3  std::_Function_handler<>::_M_invoke()
>>>    @     0x7f4b8050fc69  std::function<>::operator()()
>>>    @     0x7f4b804f9609
>>> 
>> _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
>>>    @     0x7f4b80517936
>>> 
>> _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
>>>    @     0x7f4b8050fc69  std::function<>::operator()()
>>>    @     0x7f4b8056b1b4  process::internal::run<>()
>>>    @     0x7f4b80561672  process::Future<>::fail()
>>>    @     0x7f4b8059bf5f  std::_Mem_fn<>::operator()<>()
>>>    @     0x7f4b8059757f
>>> 
>> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>>>    @     0x7f4b8058fad1
>>> 
>> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
>>>    @     0x7f4b80585a41
>>> 
>> _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
>>>    @     0x7f4b80597605
>>> 
>> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>>>    @           0x49d257  std::function<>::operator()()
>>>    @           0x49837f
>>> 
>> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>>>    @     0x7f4b8056164a  process::Future<>::fail()
>>>    @     0x7f4b8055a378  process::Promise<>::fail()
>>> 
>>> I tried both Zookeeper 3.4.8 and 3.4.6 with latest code of Mesos, but no
>>> luck for both. Any ideas about what happened? Thanks.
>>> 
>>> 
>>> 
>>> Thanks,
>>> Qian Zhang
>> 


Re: Mesos HA does not work (Failed to recover registrar)

Posted by Qian Zhang <zh...@gmail.com>.
Thanks Vinod and Dick.

I think my 3 ZK servers have formed a quorum; each of them has the
following config:
    $ cat conf/zoo.cfg
    server.1=192.168.122.132:2888:3888
    server.2=192.168.122.225:2888:3888
    server.3=192.168.122.171:2888:3888
    autopurge.purgeInterval=6
    autopurge.snapRetainCount=5
    initLimit=10
    syncLimit=5
    maxClientCnxns=0
    clientPort=2181
    tickTime=2000
    quorumListenOnAllIPs=true
    dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
    dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions

And when I run "bin/zkServer.sh status" on each of them, I can see "Mode:
leader" for one, and "Mode: follower" for the other two.

I have already tried to manually start 3 masters simultaneously, and here
is what I see in their logs:
In 192.168.122.171 (this is the first master I started):
    I0605 07:12:49.418721  1187 detector.cpp:152] Detected a new leader:
(id='25')
    I0605 07:12:49.419276  1186 group.cpp:698] Trying to get
'/mesos/log_replicas/0000000024' in ZooKeeper
    I0605 07:12:49.420013  1188 group.cpp:698] Trying to get
'/mesos/json.info_0000000025' in ZooKeeper
    I0605 07:12:49.423807  1188 zookeeper.cpp:259] A new leading master
(UPID=master@192.168.122.171:5050) is detected
    I0605 07:12:49.423841  1186 network.hpp:461] ZooKeeper group PIDs: {
log-replica(1)@192.168.122.171:5050 }
    I0605 07:12:49.424281  1187 master.cpp:1951] The newly elected leader
is master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
    I0605 07:12:49.424895  1187 master.cpp:1964] Elected as the leading
master!

In 192.168.122.225 (second master I started):
    I0605 07:12:51.918702  2246 detector.cpp:152] Detected a new leader:
(id='25')
    I0605 07:12:51.919983  2246 group.cpp:698] Trying to get
'/mesos/json.info_0000000025' in ZooKeeper
    I0605 07:12:51.921910  2249 network.hpp:461] ZooKeeper group PIDs: {
log-replica(1)@192.168.122.171:5050 }
    I0605 07:12:51.925721  2252 replica.cpp:673] Replica in EMPTY status
received a broadcasted recover request from (6)@192.168.122.225:5050
    I0605 07:12:51.927891  2246 zookeeper.cpp:259] A new leading master
(UPID=master@192.168.122.171:5050) is detected
    I0605 07:12:51.928444  2246 master.cpp:1951] The newly elected leader
is master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b

In 192.168.122.132 (last master I started):
I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader:
(id='25')
I0605 07:12:53.555179 16429 group.cpp:698] Trying to get
'/mesos/json.info_0000000025' in ZooKeeper
I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master (UPID=
master@192.168.122.171:5050) is detected

So right after I started these 3 masters, the first one (192.168.122.171)
was successfully elected as leader, but after 60s it failed with the error
mentioned in my first mail. Then 192.168.122.225 was elected as leader, but
it failed with the same error too after another 60s, and the same thing
happened to the last one (192.168.122.132). So after about 180s, all 3 of
my masters were down.
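
A quick thing to check at this point (a diagnostic sketch, not something suggested in the thread; it reuses the /mesos/log_replicas path visible in the master logs above) is how many replicated-log members have actually registered in ZooKeeper, since registrar recovery needs a quorum of them:

    # from any node, using the zkCli.sh shipped with ZooKeeper
    bin/zkCli.sh -server 127.0.0.1:2181 ls /mesos/log_replicas

If fewer than 2 entries show up while the masters are running, the replicated log cannot reach the --quorum=2 needed for the registrar fetch to complete.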

I tried both:
    sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
--work_dir=/var/lib/mesos/master
and
    sudo ./bin/mesos-master.sh --zk=zk://192.168.122.132:2181,
192.168.122.171:2181,192.168.122.225:2181/mesos --quorum=2
--work_dir=/var/lib/mesos/master
And I see the same error for both.

192.168.122.132, 192.168.122.225 and 192.168.122.171 are 3 VMs which are
running on a KVM hypervisor host.




Thanks,
Qian Zhang

On Sun, Jun 5, 2016 at 3:47 AM, Dick Davies <di...@hellooperator.net> wrote:

> You told the master it needed a quorum of 2 and it's the only one
> online, so it's bombing out.
> That's the expected behaviour.
>
> You need to start at least 2 zookeepers before it will be a functional
> group, same for the masters.
>
> You haven't mentioned how you setup your zookeeper cluster, so i'm
> assuming that's working
> correctly (3 nodes, all aware of the other 2 in their config). If not,
> you need to sort that out first.
>
>
> Also I think your zk URL is wrong - you want to list all 3 zookeeper
> nodes like this:
>
> sudo ./bin/mesos-master.sh
> --zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2
> --work_dir=/var/lib/mesos/master
>
> when you've run that command on 2 hosts things should start working,
> you'll want all 3 up for
> redundancy.
>
> On 4 June 2016 at 16:42, Qian Zhang <zh...@gmail.com> wrote:
> > Hi Folks,
> >
> > I am trying to set up a Mesos HA env with 3 nodes, each of nodes has a
> > Zookeeper running, so they form a Zookeeper cluster. And then when I
> started
> > the first Mesos master in one node with:
> >     sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
> > --work_dir=/var/lib/mesos/master
> >
> > I found it will hang here for 60 seconds:
> >   I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master
> > (UPID=master@192.168.122.132:5050) is detected
> >   I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected leader
> is
> > master@192.168.122.132:5050 with id 40d387a6-4d61-49d6-af44-51dd41457390
> >   I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading
> > master!
> >   I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from registrar
> >   I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
> >   I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the writer
> >
> > And after 60s, master will fail:
> > F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed to
> > recover registrar: Failed to perform fetch within 1mins
> > *** Check failure stack trace: ***
> >     @     0x7f4b81372f4e  google::LogMessage::Fail()
> >     @     0x7f4b81372e9a  google::LogMessage::SendToLog()
> >     @     0x7f4b8137289c  google::LogMessage::Flush()
> >     @     0x7f4b813757b0  google::LogMessageFatal::~LogMessageFatal()
> >     @     0x7f4b8040eea0  mesos::internal::master::fail()
> >     @     0x7f4b804dbeb3
> >
> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
> >     @     0x7f4b804ba453
> > _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
> >     @     0x7f4b804898d7
> >
> _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
> >     @     0x7f4b804dbf80
> >
> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
> >     @           0x49d257  std::function<>::operator()()
> >     @           0x49837f
> >
> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
> >     @           0x493024  process::Future<>::fail()
> >     @     0x7f4b8015ad20  process::Promise<>::fail()
> >     @     0x7f4b804d9295  process::internal::thenf<>()
> >     @     0x7f4b8051788f
> >
> _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> >     @     0x7f4b8050fa3b  std::_Bind<>::operator()<>()
> >     @     0x7f4b804f94e3  std::_Function_handler<>::_M_invoke()
> >     @     0x7f4b8050fc69  std::function<>::operator()()
> >     @     0x7f4b804f9609
> >
> _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
> >     @     0x7f4b80517936
> >
> _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
> >     @     0x7f4b8050fc69  std::function<>::operator()()
> >     @     0x7f4b8056b1b4  process::internal::run<>()
> >     @     0x7f4b80561672  process::Future<>::fail()
> >     @     0x7f4b8059bf5f  std::_Mem_fn<>::operator()<>()
> >     @     0x7f4b8059757f
> >
> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> >     @     0x7f4b8058fad1
> >
> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
> >     @     0x7f4b80585a41
> >
> _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
> >     @     0x7f4b80597605
> >
> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
> >     @           0x49d257  std::function<>::operator()()
> >     @           0x49837f
> >
> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
> >     @     0x7f4b8056164a  process::Future<>::fail()
> >     @     0x7f4b8055a378  process::Promise<>::fail()
> >
> > I tried both Zookeeper 3.4.8 and 3.4.6 with latest code of Mesos, but no
> > luck for both. Any ideas about what happened? Thanks.
> >
> >
> >
> > Thanks,
> > Qian Zhang
>

Re: Mesos HA does not work (Failed to recover registrar)

Posted by Dick Davies <di...@hellooperator.net>.
You told the master it needed a quorum of 2 and it's the only one
online, so it's bombing out.
That's the expected behaviour.

You need to start at least 2 ZooKeepers before they will form a functional
group, and the same goes for the masters.
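(For reference: the quorum is a strict majority of the masters, i.e. floor(N/2) + 1, so --quorum=2 is right for 3 masters once at least 2 of them and their replicated logs are reachable.)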

You haven't mentioned how you set up your ZooKeeper cluster, so I'm
assuming that's working correctly (3 nodes, all aware of the other 2 in
their config). If not, you need to sort that out first.


Also I think your zk URL is wrong - you want to list all 3 zookeeper
nodes like this:

sudo ./bin/mesos-master.sh
--zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2
--work_dir=/var/lib/mesos/master

When you've run that command on 2 hosts, things should start working;
you'll want all 3 up for redundancy.
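
Concretely, for the 3 VMs in this thread that would be (a sketch; run the same command on each of the 3 masters):

    sudo ./bin/mesos-master.sh \
      --zk=zk://192.168.122.132:2181,192.168.122.225:2181,192.168.122.171:2181/mesos \
      --quorum=2 \
      --work_dir=/var/lib/mesos/master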

On 4 June 2016 at 16:42, Qian Zhang <zh...@gmail.com> wrote:
> Hi Folks,
>
> I am trying to set up a Mesos HA env with 3 nodes, each of nodes has a
> Zookeeper running, so they form a Zookeeper cluster. And then when I started
> the first Mesos master in one node with:
>     sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
> --work_dir=/var/lib/mesos/master
>
> I found it will hang here for 60 seconds:
>   I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master
> (UPID=master@192.168.122.132:5050) is detected
>   I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected leader is
> master@192.168.122.132:5050 with id 40d387a6-4d61-49d6-af44-51dd41457390
>   I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading
> master!
>   I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from registrar
>   I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
>   I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the writer
>
> And after 60s, master will fail:
> F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed to
> recover registrar: Failed to perform fetch within 1mins
> *** Check failure stack trace: ***
>     @     0x7f4b81372f4e  google::LogMessage::Fail()
>     @     0x7f4b81372e9a  google::LogMessage::SendToLog()
>     @     0x7f4b8137289c  google::LogMessage::Flush()
>     @     0x7f4b813757b0  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f4b8040eea0  mesos::internal::master::fail()
>     @     0x7f4b804dbeb3
> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
>     @     0x7f4b804ba453
> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
>     @     0x7f4b804898d7
> _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
>     @     0x7f4b804dbf80
> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>     @           0x49d257  std::function<>::operator()()
>     @           0x49837f
> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>     @           0x493024  process::Future<>::fail()
>     @     0x7f4b8015ad20  process::Promise<>::fail()
>     @     0x7f4b804d9295  process::internal::thenf<>()
>     @     0x7f4b8051788f
> _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>     @     0x7f4b8050fa3b  std::_Bind<>::operator()<>()
>     @     0x7f4b804f94e3  std::_Function_handler<>::_M_invoke()
>     @     0x7f4b8050fc69  std::function<>::operator()()
>     @     0x7f4b804f9609
> _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
>     @     0x7f4b80517936
> _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
>     @     0x7f4b8050fc69  std::function<>::operator()()
>     @     0x7f4b8056b1b4  process::internal::run<>()
>     @     0x7f4b80561672  process::Future<>::fail()
>     @     0x7f4b8059bf5f  std::_Mem_fn<>::operator()<>()
>     @     0x7f4b8059757f
> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>     @     0x7f4b8058fad1
> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
>     @     0x7f4b80585a41
> _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
>     @     0x7f4b80597605
> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>     @           0x49d257  std::function<>::operator()()
>     @           0x49837f
> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>     @     0x7f4b8056164a  process::Future<>::fail()
>     @     0x7f4b8055a378  process::Promise<>::fail()
>
> I tried both Zookeeper 3.4.8 and 3.4.6 with latest code of Mesos, but no
> luck for both. Any ideas about what happened? Thanks.
>
>
>
> Thanks,
> Qian Zhang