You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@aurora.apache.org by Dobromir Montauk <do...@tellapart.com> on 2014/10/15 23:22:08 UTC

Aurora stuck in LEADER_AWAITING_REGISTRATION

Hi,

I've brought up Aurora on my Mesos master node with the following command:

ubuntu@ec2-54-82-17-37:~/$

GLOG_v=2
LIBPROCESS_PORT=5050
LIBPROCESS_IP=127.0.0.1
AURORA_HOME=/usr/local/aurora-scheduler
DIST_DIR=/home/ubuntu/aurora-scheduler/dist
AURORA_HOME=/usr/local/aurora-scheduler

sudo /usr/local/aurora-scheduler/bin/aurora-scheduler \
  -cluster_name=tellapart \
  -http_port=8081 \
  -native_log_quorum_size=1 \
  -zk_endpoints=localhost:2181 \
  -mesos_master_address=54.166.50.69:5050,54.160.61.169:5050,localhost:5050
\
  -serverset_path=/aurora/scheduler \
  -native_log_zk_group_path=/aurora/replicated-log \
  -native_log_file_path=$AURORA_HOME/scheduler/db \
  -backup_dir=$AURORA_HOME/scheduler/backups \
  -thermos_executor_path=/dev/null \
  -gc_executor_path=$DIST_DIR/gc_executor.pex \
  -enable_beta_updater=true \
  -vlog=INFO \
  -logtostderr

Attached is the entire log, but basically I'm seeing this:

I1015 21:18:05.315263 27634 group.cpp:313] Group process (group(1)@
10.88.26.227:40393) connected to ZooKeeper
I1015 21:18:05.315322 27634 group.cpp:787] Syncing group operations: queue
size (joins, cancels, datas) = (0, 0, 0)
I1015 21:18:05.315348 27634 group.cpp:385] Trying to create path
'/aurora/replicated-log' in ZooKeeper
I1015 21:18:05.316 THREAD1
com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: Candidate
/aurora/scheduler/singleton_candidate_0000000008 is now leader of group:
[singleton_candidate_0000000008]
I1015 21:18:05.317 THREAD1
com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
I1015 21:18:05.317 THREAD1
org.apache.aurora.scheduler.SchedulerLifecycle$6.execute: Elected as
leading scheduler!
I1015 21:18:05.330394 27634 network.hpp:423] ZooKeeper group memberships
changed
I1015 21:18:05.330660 27639 group.cpp:658] Trying to get
'/aurora/replicated-log/0000000008' in ZooKeeper
I1015 21:18:05.331550 27635 network.hpp:461] ZooKeeper group PIDs: {
log-replica(1)@10.88.26.227:40393 }
I1015 21:18:06.027016 27634 replica.cpp:638] Replica in EMPTY status
received a broadcasted recover request
I1015 21:18:06.027216 27634 recover.cpp:188] Received a recover response
from a replica in EMPTY status
<repeat last 2 message ad nauseum>

How can I debug what's going on?

Thanks,
Dobromir

Re: Aurora stuck in LEADER_AWAITING_REGISTRATION

Posted by Bill Farner <wf...@apache.org>.

FYI - i created AURORA-840 [1] to document common situations like these,
hopefully to be completed in our documentation sprint tomorrow.  Dobromir -
since setting up a new cluster is fresh for you, you might have some
insight we can add to this.

[1] https://issues.apache.org/jira/browse/AURORA-840

-=Bill

On Wed, Oct 15, 2014 at 5:02 PM, Kevin Sweeney <ke...@apache.org> wrote:

> The scheduler uses that zookeeper path (or whatever url you pass to
> -mesos_msater_address) to find the mesos master is should register with.
>
> Is there a mesos master running with a command-line that looks something
> like
>
> exec /usr/local/sbin/mesos-master \
>     --zk=zk://$ZK_HOST:2181/mesos/master \
> ...?
>
> On Wed, Oct 15, 2014 at 4:53 PM, Dobromir Montauk <do...@tellapart.com>
> wrote:
>
> > Next problem: framework is failing to register.
> >
> > 2014-10-15 22:15:48,024:28619(0x7f852d259700):ZOO_INFO@check_events
> @1750:
> > session establishment complete on server [127.0.0.1:2181],
> > sessionId=0x349110433970e3b, negotiated timeout=10000
> > I1015 22:15:48.024787 28763 group.cpp:313] Group process (group(3)@
> > 10.88.26.227:34348) connected to ZooKeeper
> > I1015 22:15:48.024837 28763 group.cpp:787] Syncing group operations:
> queue
> > size (joins, cancels, datas) = (0, 0, 0)
> > I1015 22:15:48.024868 28763 group.cpp:385] Trying to create path
> > '/mesos/master' in ZooKeeper
> > E1015 22:16:48.013 THREAD132
> > org.apache.aurora.scheduler.SchedulerLifecycle$6$3.run: Framework has not
> > been registered within the tolerated delay.
> > I1015 22:16:48.014 THREAD132
> > com.twitter.common.util.StateMachine$Builder$1.execute:
> SchedulerLifecycle
> > state machine transition LEADER_AWAITING_REGISTRATION -> DEAD
> >
> > My Zookeeper logs have this:
> >
> > 2014-10-15 22:15:47,831 - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@839] - Client attempting to
> establish
> > new session at /127.0.0.1:41116
> > 2014-10-15 22:15:47,836 - INFO  [CommitProcessor:3:ZooKeeperServer@595]
> -
> > Established session 0x349110433970e3a with negotiated timeout 4000 for
> > client /127.0.0.1:41116
> > 2014-10-15 22:15:48,017 - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket
> > connection
> > from /127.0.0.1:41117
> > 2014-10-15 22:15:48,017 - WARN  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@793] - Connection request from old
> > client /127.0.0.1:41117; will be dropped if server is in r-o mode
> > 2014-10-15 22:15:48,018 - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@839] - Client attempting to
> establish
> > new session at /127.0.0.1:41117
> > 2014-10-15 22:15:48,024 - INFO  [CommitProcessor:3:ZooKeeperServer@595]
> -
> > Established session 0x349110433970e3b with negotiated timeout 10000 for
> > client /127.0.0.1:41117
> > 2014-10-15 22:16:48,435 - WARN  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception
> > EndOfStreamException: Unable to read additional data from client
> sessionid
> > 0x349110433970e39, likely client has closed socket
> > at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
> > at
> >
> >
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
> > at java.lang.Thread.run(Thread.java:745)
> > 2014-10-15 22:16:48,436 - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
> > client /127.0.0.1:41114 which had sessionid 0x349110433970e39
> >
> > I've removed /mesos/master and verified that it gets recreated:
> >
> > [zk: localhost:2181(CONNECTED) 4] rmr /mesos/master
> > [zk: localhost:2181(CONNECTED) 5] ls /mesos/master
> > Node does not exist: /mesos/master
> > [zk: localhost:2181(CONNECTED) 6] get /mesos/master
> > cZxid = 0x10000b717
> > ctime = Wed Oct 15 23:34:17 UTC 2014
> > mZxid = 0x10000b717
> > mtime = Wed Oct 15 23:34:17 UTC 2014
> > pZxid = 0x10000b717
> > cversion = 0
> > dataVersion = 0
> > aclVersion = 0
> > ephemeralOwner = 0x0
> > dataLength = 0
> > numChildren = 0
> >
> > Anyone know what's going wrong?
> >
> >
> > On Wed, Oct 15, 2014 at 2:47 PM, Dobromir Montauk <
> dobromir@tellapart.com>
> > wrote:
> >
> > > I found instructions that fixed the current issue here:
> > >
> > > http://wilderness.apache.org/channels/?f=aurora/2014-05-27
> > >
> > > Would be nice to have a more friendly error message :)
> > >
> > > On Wed, Oct 15, 2014 at 2:22 PM, Dobromir Montauk <
> > dobromir@tellapart.com>
> > > wrote:
> > >
> > >> Hi,
> > >>
> > >> I've brought up Aurora on my Mesos master node with the following
> > command:
> > >>
> > >> ubuntu@ec2-54-82-17-37:~/$
> > >>
> > >> GLOG_v=2
> > >> LIBPROCESS_PORT=5050
> > >> LIBPROCESS_IP=127.0.0.1
> > >> AURORA_HOME=/usr/local/aurora-scheduler
> > >> DIST_DIR=/home/ubuntu/aurora-scheduler/dist
> > >> AURORA_HOME=/usr/local/aurora-scheduler
> > >>
> > >> sudo /usr/local/aurora-scheduler/bin/aurora-scheduler \
> > >>   -cluster_name=tellapart \
> > >>   -http_port=8081 \
> > >>   -native_log_quorum_size=1 \
> > >>   -zk_endpoints=localhost:2181 \
> > >>   -mesos_master_address=54.166.50.69:5050,54.160.61.169:5050
> > ,localhost:5050
> > >> \
> > >>   -serverset_path=/aurora/scheduler \
> > >>   -native_log_zk_group_path=/aurora/replicated-log \
> > >>   -native_log_file_path=$AURORA_HOME/scheduler/db \
> > >>   -backup_dir=$AURORA_HOME/scheduler/backups \
> > >>   -thermos_executor_path=/dev/null \
> > >>   -gc_executor_path=$DIST_DIR/gc_executor.pex \
> > >>   -enable_beta_updater=true \
> > >>   -vlog=INFO \
> > >>   -logtostderr
> > >>
> > >> Attached is the entire log, but basically I'm seeing this:
> > >>
> > >> I1015 21:18:05.315263 27634 group.cpp:313] Group process (group(1)@
> > >> 10.88.26.227:40393) connected to ZooKeeper
> > >> I1015 21:18:05.315322 27634 group.cpp:787] Syncing group operations:
> > >> queue size (joins, cancels, datas) = (0, 0, 0)
> > >> I1015 21:18:05.315348 27634 group.cpp:385] Trying to create path
> > >> '/aurora/replicated-log' in ZooKeeper
> > >> I1015 21:18:05.316 THREAD1
> > >> com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: Candidate
> > >> /aurora/scheduler/singleton_candidate_0000000008 is now leader of
> group:
> > >> [singleton_candidate_0000000008]
> > >> I1015 21:18:05.317 THREAD1
> > >> com.twitter.common.util.StateMachine$Builder$1.execute:
> > SchedulerLifecycle
> > >> state machine transition STORAGE_PREPARED ->
> > LEADER_AWAITING_REGISTRATION
> > >> I1015 21:18:05.317 THREAD1
> > >> org.apache.aurora.scheduler.SchedulerLifecycle$6.execute: Elected as
> > >> leading scheduler!
> > >> I1015 21:18:05.330394 27634 network.hpp:423] ZooKeeper group
> memberships
> > >> changed
> > >> I1015 21:18:05.330660 27639 group.cpp:658] Trying to get
> > >> '/aurora/replicated-log/0000000008' in ZooKeeper
> > >> I1015 21:18:05.331550 27635 network.hpp:461] ZooKeeper group PIDs: {
> > >> log-replica(1)@10.88.26.227:40393 }
> > >> I1015 21:18:06.027016 27634 replica.cpp:638] Replica in EMPTY status
> > >> received a broadcasted recover request
> > >> I1015 21:18:06.027216 27634 recover.cpp:188] Received a recover
> response
> > >> from a replica in EMPTY status
> > >> <repeat last 2 message ad nauseum>
> > >>
> > >> How can I debug what's going on?
> > >>
> > >> Thanks,
> > >> Dobromir
> > >>
> > >
> > >
> >
>

Re: Aurora stuck in LEADER_AWAITING_REGISTRATION

Posted by Kevin Sweeney <ke...@apache.org>.

The scheduler uses that zookeeper path (or whatever url you pass to
-mesos_msater_address) to find the mesos master is should register with.

Is there a mesos master running with a command-line that looks something
like

exec /usr/local/sbin/mesos-master \
    --zk=zk://$ZK_HOST:2181/mesos/master \
...?

On Wed, Oct 15, 2014 at 4:53 PM, Dobromir Montauk <do...@tellapart.com>
wrote:

> Next problem: framework is failing to register.
>
> 2014-10-15 22:15:48,024:28619(0x7f852d259700):ZOO_INFO@check_events@1750:
> session establishment complete on server [127.0.0.1:2181],
> sessionId=0x349110433970e3b, negotiated timeout=10000
> I1015 22:15:48.024787 28763 group.cpp:313] Group process (group(3)@
> 10.88.26.227:34348) connected to ZooKeeper
> I1015 22:15:48.024837 28763 group.cpp:787] Syncing group operations: queue
> size (joins, cancels, datas) = (0, 0, 0)
> I1015 22:15:48.024868 28763 group.cpp:385] Trying to create path
> '/mesos/master' in ZooKeeper
> E1015 22:16:48.013 THREAD132
> org.apache.aurora.scheduler.SchedulerLifecycle$6$3.run: Framework has not
> been registered within the tolerated delay.
> I1015 22:16:48.014 THREAD132
> com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
> state machine transition LEADER_AWAITING_REGISTRATION -> DEAD
>
> My Zookeeper logs have this:
>
> 2014-10-15 22:15:47,831 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@839] - Client attempting to establish
> new session at /127.0.0.1:41116
> 2014-10-15 22:15:47,836 - INFO  [CommitProcessor:3:ZooKeeperServer@595] -
> Established session 0x349110433970e3a with negotiated timeout 4000 for
> client /127.0.0.1:41116
> 2014-10-15 22:15:48,017 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket
> connection
> from /127.0.0.1:41117
> 2014-10-15 22:15:48,017 - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@793] - Connection request from old
> client /127.0.0.1:41117; will be dropped if server is in r-o mode
> 2014-10-15 22:15:48,018 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@839] - Client attempting to establish
> new session at /127.0.0.1:41117
> 2014-10-15 22:15:48,024 - INFO  [CommitProcessor:3:ZooKeeperServer@595] -
> Established session 0x349110433970e3b with negotiated timeout 10000 for
> client /127.0.0.1:41117
> 2014-10-15 22:16:48,435 - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x349110433970e39, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
> at
>
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
> at java.lang.Thread.run(Thread.java:745)
> 2014-10-15 22:16:48,436 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
> client /127.0.0.1:41114 which had sessionid 0x349110433970e39
>
> I've removed /mesos/master and verified that it gets recreated:
>
> [zk: localhost:2181(CONNECTED) 4] rmr /mesos/master
> [zk: localhost:2181(CONNECTED) 5] ls /mesos/master
> Node does not exist: /mesos/master
> [zk: localhost:2181(CONNECTED) 6] get /mesos/master
> cZxid = 0x10000b717
> ctime = Wed Oct 15 23:34:17 UTC 2014
> mZxid = 0x10000b717
> mtime = Wed Oct 15 23:34:17 UTC 2014
> pZxid = 0x10000b717
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x0
> dataLength = 0
> numChildren = 0
>
> Anyone know what's going wrong?
>
>
> On Wed, Oct 15, 2014 at 2:47 PM, Dobromir Montauk <do...@tellapart.com>
> wrote:
>
> > I found instructions that fixed the current issue here:
> >
> > http://wilderness.apache.org/channels/?f=aurora/2014-05-27
> >
> > Would be nice to have a more friendly error message :)
> >
> > On Wed, Oct 15, 2014 at 2:22 PM, Dobromir Montauk <
> dobromir@tellapart.com>
> > wrote:
> >
> >> Hi,
> >>
> >> I've brought up Aurora on my Mesos master node with the following
> command:
> >>
> >> ubuntu@ec2-54-82-17-37:~/$
> >>
> >> GLOG_v=2
> >> LIBPROCESS_PORT=5050
> >> LIBPROCESS_IP=127.0.0.1
> >> AURORA_HOME=/usr/local/aurora-scheduler
> >> DIST_DIR=/home/ubuntu/aurora-scheduler/dist
> >> AURORA_HOME=/usr/local/aurora-scheduler
> >>
> >> sudo /usr/local/aurora-scheduler/bin/aurora-scheduler \
> >>   -cluster_name=tellapart \
> >>   -http_port=8081 \
> >>   -native_log_quorum_size=1 \
> >>   -zk_endpoints=localhost:2181 \
> >>   -mesos_master_address=54.166.50.69:5050,54.160.61.169:5050
> ,localhost:5050
> >> \
> >>   -serverset_path=/aurora/scheduler \
> >>   -native_log_zk_group_path=/aurora/replicated-log \
> >>   -native_log_file_path=$AURORA_HOME/scheduler/db \
> >>   -backup_dir=$AURORA_HOME/scheduler/backups \
> >>   -thermos_executor_path=/dev/null \
> >>   -gc_executor_path=$DIST_DIR/gc_executor.pex \
> >>   -enable_beta_updater=true \
> >>   -vlog=INFO \
> >>   -logtostderr
> >>
> >> Attached is the entire log, but basically I'm seeing this:
> >>
> >> I1015 21:18:05.315263 27634 group.cpp:313] Group process (group(1)@
> >> 10.88.26.227:40393) connected to ZooKeeper
> >> I1015 21:18:05.315322 27634 group.cpp:787] Syncing group operations:
> >> queue size (joins, cancels, datas) = (0, 0, 0)
> >> I1015 21:18:05.315348 27634 group.cpp:385] Trying to create path
> >> '/aurora/replicated-log' in ZooKeeper
> >> I1015 21:18:05.316 THREAD1
> >> com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: Candidate
> >> /aurora/scheduler/singleton_candidate_0000000008 is now leader of group:
> >> [singleton_candidate_0000000008]
> >> I1015 21:18:05.317 THREAD1
> >> com.twitter.common.util.StateMachine$Builder$1.execute:
> SchedulerLifecycle
> >> state machine transition STORAGE_PREPARED ->
> LEADER_AWAITING_REGISTRATION
> >> I1015 21:18:05.317 THREAD1
> >> org.apache.aurora.scheduler.SchedulerLifecycle$6.execute: Elected as
> >> leading scheduler!
> >> I1015 21:18:05.330394 27634 network.hpp:423] ZooKeeper group memberships
> >> changed
> >> I1015 21:18:05.330660 27639 group.cpp:658] Trying to get
> >> '/aurora/replicated-log/0000000008' in ZooKeeper
> >> I1015 21:18:05.331550 27635 network.hpp:461] ZooKeeper group PIDs: {
> >> log-replica(1)@10.88.26.227:40393 }
> >> I1015 21:18:06.027016 27634 replica.cpp:638] Replica in EMPTY status
> >> received a broadcasted recover request
> >> I1015 21:18:06.027216 27634 recover.cpp:188] Received a recover response
> >> from a replica in EMPTY status
> >> <repeat last 2 message ad nauseum>
> >>
> >> How can I debug what's going on?
> >>
> >> Thanks,
> >> Dobromir
> >>
> >
> >
>

Re: Aurora stuck in LEADER_AWAITING_REGISTRATION

Posted by Dobromir Montauk <do...@tellapart.com>.

Next problem: framework is failing to register.

2014-10-15 22:15:48,024:28619(0x7f852d259700):ZOO_INFO@check_events@1750:
session establishment complete on server [127.0.0.1:2181],
sessionId=0x349110433970e3b, negotiated timeout=10000
I1015 22:15:48.024787 28763 group.cpp:313] Group process (group(3)@
10.88.26.227:34348) connected to ZooKeeper
I1015 22:15:48.024837 28763 group.cpp:787] Syncing group operations: queue
size (joins, cancels, datas) = (0, 0, 0)
I1015 22:15:48.024868 28763 group.cpp:385] Trying to create path
'/mesos/master' in ZooKeeper
E1015 22:16:48.013 THREAD132
org.apache.aurora.scheduler.SchedulerLifecycle$6$3.run: Framework has not
been registered within the tolerated delay.
I1015 22:16:48.014 THREAD132
com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
state machine transition LEADER_AWAITING_REGISTRATION -> DEAD

My Zookeeper logs have this:

2014-10-15 22:15:47,831 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@839] - Client attempting to establish
new session at /127.0.0.1:41116
2014-10-15 22:15:47,836 - INFO  [CommitProcessor:3:ZooKeeperServer@595] -
Established session 0x349110433970e3a with negotiated timeout 4000 for
client /127.0.0.1:41116
2014-10-15 22:15:48,017 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection
from /127.0.0.1:41117
2014-10-15 22:15:48,017 - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@793] - Connection request from old
client /127.0.0.1:41117; will be dropped if server is in r-o mode
2014-10-15 22:15:48,018 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@839] - Client attempting to establish
new session at /127.0.0.1:41117
2014-10-15 22:15:48,024 - INFO  [CommitProcessor:3:ZooKeeperServer@595] -
Established session 0x349110433970e3b with negotiated timeout 10000 for
client /127.0.0.1:41117
2014-10-15 22:16:48,435 - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x349110433970e39, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)
2014-10-15 22:16:48,436 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
client /127.0.0.1:41114 which had sessionid 0x349110433970e39

I've removed /mesos/master and verified that it gets recreated:

[zk: localhost:2181(CONNECTED) 4] rmr /mesos/master
[zk: localhost:2181(CONNECTED) 5] ls /mesos/master
Node does not exist: /mesos/master
[zk: localhost:2181(CONNECTED) 6] get /mesos/master
cZxid = 0x10000b717
ctime = Wed Oct 15 23:34:17 UTC 2014
mZxid = 0x10000b717
mtime = Wed Oct 15 23:34:17 UTC 2014
pZxid = 0x10000b717
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 0

Anyone know what's going wrong?


On Wed, Oct 15, 2014 at 2:47 PM, Dobromir Montauk <do...@tellapart.com>
wrote:

> I found instructions that fixed the current issue here:
>
> http://wilderness.apache.org/channels/?f=aurora/2014-05-27
>
> Would be nice to have a more friendly error message :)
>
> On Wed, Oct 15, 2014 at 2:22 PM, Dobromir Montauk <do...@tellapart.com>
> wrote:
>
>> Hi,
>>
>> I've brought up Aurora on my Mesos master node with the following command:
>>
>> ubuntu@ec2-54-82-17-37:~/$
>>
>> GLOG_v=2
>> LIBPROCESS_PORT=5050
>> LIBPROCESS_IP=127.0.0.1
>> AURORA_HOME=/usr/local/aurora-scheduler
>> DIST_DIR=/home/ubuntu/aurora-scheduler/dist
>> AURORA_HOME=/usr/local/aurora-scheduler
>>
>> sudo /usr/local/aurora-scheduler/bin/aurora-scheduler \
>>   -cluster_name=tellapart \
>>   -http_port=8081 \
>>   -native_log_quorum_size=1 \
>>   -zk_endpoints=localhost:2181 \
>>   -mesos_master_address=54.166.50.69:5050,54.160.61.169:5050,localhost:5050
>> \
>>   -serverset_path=/aurora/scheduler \
>>   -native_log_zk_group_path=/aurora/replicated-log \
>>   -native_log_file_path=$AURORA_HOME/scheduler/db \
>>   -backup_dir=$AURORA_HOME/scheduler/backups \
>>   -thermos_executor_path=/dev/null \
>>   -gc_executor_path=$DIST_DIR/gc_executor.pex \
>>   -enable_beta_updater=true \
>>   -vlog=INFO \
>>   -logtostderr
>>
>> Attached is the entire log, but basically I'm seeing this:
>>
>> I1015 21:18:05.315263 27634 group.cpp:313] Group process (group(1)@
>> 10.88.26.227:40393) connected to ZooKeeper
>> I1015 21:18:05.315322 27634 group.cpp:787] Syncing group operations:
>> queue size (joins, cancels, datas) = (0, 0, 0)
>> I1015 21:18:05.315348 27634 group.cpp:385] Trying to create path
>> '/aurora/replicated-log' in ZooKeeper
>> I1015 21:18:05.316 THREAD1
>> com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: Candidate
>> /aurora/scheduler/singleton_candidate_0000000008 is now leader of group:
>> [singleton_candidate_0000000008]
>> I1015 21:18:05.317 THREAD1
>> com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
>> state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
>> I1015 21:18:05.317 THREAD1
>> org.apache.aurora.scheduler.SchedulerLifecycle$6.execute: Elected as
>> leading scheduler!
>> I1015 21:18:05.330394 27634 network.hpp:423] ZooKeeper group memberships
>> changed
>> I1015 21:18:05.330660 27639 group.cpp:658] Trying to get
>> '/aurora/replicated-log/0000000008' in ZooKeeper
>> I1015 21:18:05.331550 27635 network.hpp:461] ZooKeeper group PIDs: {
>> log-replica(1)@10.88.26.227:40393 }
>> I1015 21:18:06.027016 27634 replica.cpp:638] Replica in EMPTY status
>> received a broadcasted recover request
>> I1015 21:18:06.027216 27634 recover.cpp:188] Received a recover response
>> from a replica in EMPTY status
>> <repeat last 2 message ad nauseum>
>>
>> How can I debug what's going on?
>>
>> Thanks,
>> Dobromir
>>
>
>

Re: Aurora stuck in LEADER_AWAITING_REGISTRATION

Posted by Bill Farner <wf...@apache.org>.

Glad you got that sorted out!  This is where the abstraction between the
mesos library and aurora begin to show.  Unfortunately the scheduler has no
way to know about this due to how the replicated log (provided by mesos)
works.  I've filed MESOS-1933 [1] to improve this message on the mesos side.

[1] https://issues.apache.org/jira/browse/MESOS-1933

-=Bill

On Wed, Oct 15, 2014 at 2:47 PM, Dobromir Montauk <do...@tellapart.com>
wrote:

> I found instructions that fixed the current issue here:
>
> http://wilderness.apache.org/channels/?f=aurora/2014-05-27
>
> Would be nice to have a more friendly error message :)
>
> On Wed, Oct 15, 2014 at 2:22 PM, Dobromir Montauk <do...@tellapart.com>
> wrote:
>
> > Hi,
> >
> > I've brought up Aurora on my Mesos master node with the following
> command:
> >
> > ubuntu@ec2-54-82-17-37:~/$
> >
> > GLOG_v=2
> > LIBPROCESS_PORT=5050
> > LIBPROCESS_IP=127.0.0.1
> > AURORA_HOME=/usr/local/aurora-scheduler
> > DIST_DIR=/home/ubuntu/aurora-scheduler/dist
> > AURORA_HOME=/usr/local/aurora-scheduler
> >
> > sudo /usr/local/aurora-scheduler/bin/aurora-scheduler \
> >   -cluster_name=tellapart \
> >   -http_port=8081 \
> >   -native_log_quorum_size=1 \
> >   -zk_endpoints=localhost:2181 \
> >   -mesos_master_address=54.166.50.69:5050,54.160.61.169:5050
> ,localhost:5050
> > \
> >   -serverset_path=/aurora/scheduler \
> >   -native_log_zk_group_path=/aurora/replicated-log \
> >   -native_log_file_path=$AURORA_HOME/scheduler/db \
> >   -backup_dir=$AURORA_HOME/scheduler/backups \
> >   -thermos_executor_path=/dev/null \
> >   -gc_executor_path=$DIST_DIR/gc_executor.pex \
> >   -enable_beta_updater=true \
> >   -vlog=INFO \
> >   -logtostderr
> >
> > Attached is the entire log, but basically I'm seeing this:
> >
> > I1015 21:18:05.315263 27634 group.cpp:313] Group process (group(1)@
> > 10.88.26.227:40393) connected to ZooKeeper
> > I1015 21:18:05.315322 27634 group.cpp:787] Syncing group operations:
> queue
> > size (joins, cancels, datas) = (0, 0, 0)
> > I1015 21:18:05.315348 27634 group.cpp:385] Trying to create path
> > '/aurora/replicated-log' in ZooKeeper
> > I1015 21:18:05.316 THREAD1
> > com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: Candidate
> > /aurora/scheduler/singleton_candidate_0000000008 is now leader of group:
> > [singleton_candidate_0000000008]
> > I1015 21:18:05.317 THREAD1
> > com.twitter.common.util.StateMachine$Builder$1.execute:
> SchedulerLifecycle
> > state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
> > I1015 21:18:05.317 THREAD1
> > org.apache.aurora.scheduler.SchedulerLifecycle$6.execute: Elected as
> > leading scheduler!
> > I1015 21:18:05.330394 27634 network.hpp:423] ZooKeeper group memberships
> > changed
> > I1015 21:18:05.330660 27639 group.cpp:658] Trying to get
> > '/aurora/replicated-log/0000000008' in ZooKeeper
> > I1015 21:18:05.331550 27635 network.hpp:461] ZooKeeper group PIDs: {
> > log-replica(1)@10.88.26.227:40393 }
> > I1015 21:18:06.027016 27634 replica.cpp:638] Replica in EMPTY status
> > received a broadcasted recover request
> > I1015 21:18:06.027216 27634 recover.cpp:188] Received a recover response
> > from a replica in EMPTY status
> > <repeat last 2 message ad nauseum>
> >
> > How can I debug what's going on?
> >
> > Thanks,
> > Dobromir
> >
>

Re: Aurora stuck in LEADER_AWAITING_REGISTRATION

Posted by Dobromir Montauk <do...@tellapart.com>.

I found instructions that fixed the current issue here:

http://wilderness.apache.org/channels/?f=aurora/2014-05-27

Would be nice to have a more friendly error message :)

On Wed, Oct 15, 2014 at 2:22 PM, Dobromir Montauk <do...@tellapart.com>
wrote:

> Hi,
>
> I've brought up Aurora on my Mesos master node with the following command:
>
> ubuntu@ec2-54-82-17-37:~/$
>
> GLOG_v=2
> LIBPROCESS_PORT=5050
> LIBPROCESS_IP=127.0.0.1
> AURORA_HOME=/usr/local/aurora-scheduler
> DIST_DIR=/home/ubuntu/aurora-scheduler/dist
> AURORA_HOME=/usr/local/aurora-scheduler
>
> sudo /usr/local/aurora-scheduler/bin/aurora-scheduler \
>   -cluster_name=tellapart \
>   -http_port=8081 \
>   -native_log_quorum_size=1 \
>   -zk_endpoints=localhost:2181 \
>   -mesos_master_address=54.166.50.69:5050,54.160.61.169:5050,localhost:5050
> \
>   -serverset_path=/aurora/scheduler \
>   -native_log_zk_group_path=/aurora/replicated-log \
>   -native_log_file_path=$AURORA_HOME/scheduler/db \
>   -backup_dir=$AURORA_HOME/scheduler/backups \
>   -thermos_executor_path=/dev/null \
>   -gc_executor_path=$DIST_DIR/gc_executor.pex \
>   -enable_beta_updater=true \
>   -vlog=INFO \
>   -logtostderr
>
> Attached is the entire log, but basically I'm seeing this:
>
> I1015 21:18:05.315263 27634 group.cpp:313] Group process (group(1)@
> 10.88.26.227:40393) connected to ZooKeeper
> I1015 21:18:05.315322 27634 group.cpp:787] Syncing group operations: queue
> size (joins, cancels, datas) = (0, 0, 0)
> I1015 21:18:05.315348 27634 group.cpp:385] Trying to create path
> '/aurora/replicated-log' in ZooKeeper
> I1015 21:18:05.316 THREAD1
> com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: Candidate
> /aurora/scheduler/singleton_candidate_0000000008 is now leader of group:
> [singleton_candidate_0000000008]
> I1015 21:18:05.317 THREAD1
> com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
> state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
> I1015 21:18:05.317 THREAD1
> org.apache.aurora.scheduler.SchedulerLifecycle$6.execute: Elected as
> leading scheduler!
> I1015 21:18:05.330394 27634 network.hpp:423] ZooKeeper group memberships
> changed
> I1015 21:18:05.330660 27639 group.cpp:658] Trying to get
> '/aurora/replicated-log/0000000008' in ZooKeeper
> I1015 21:18:05.331550 27635 network.hpp:461] ZooKeeper group PIDs: {
> log-replica(1)@10.88.26.227:40393 }
> I1015 21:18:06.027016 27634 replica.cpp:638] Replica in EMPTY status
> received a broadcasted recover request
> I1015 21:18:06.027216 27634 recover.cpp:188] Received a recover response
> from a replica in EMPTY status
> <repeat last 2 message ad nauseum>
>
> How can I debug what's going on?
>
> Thanks,
> Dobromir
>

Re: Aurora stuck in LEADER_AWAITING_REGISTRATION

Posted by Bill Farner <wf...@apache.org>.

Have you initialized the replicated log on this host?  This can be a
surprise, but it's a gate for a good reason.  Instructions here:

https://github.com/apache/incubator-aurora/blob/master/docs/deploying-aurora-scheduler.md#initializing-the-replicated-log

-=Bill

On Wed, Oct 15, 2014 at 2:22 PM, Dobromir Montauk <do...@tellapart.com>
wrote:

> Hi,
>
> I've brought up Aurora on my Mesos master node with the following command:
>
> ubuntu@ec2-54-82-17-37:~/$
>
> GLOG_v=2
> LIBPROCESS_PORT=5050
> LIBPROCESS_IP=127.0.0.1
> AURORA_HOME=/usr/local/aurora-scheduler
> DIST_DIR=/home/ubuntu/aurora-scheduler/dist
> AURORA_HOME=/usr/local/aurora-scheduler
>
> sudo /usr/local/aurora-scheduler/bin/aurora-scheduler \
>   -cluster_name=tellapart \
>   -http_port=8081 \
>   -native_log_quorum_size=1 \
>   -zk_endpoints=localhost:2181 \
>   -mesos_master_address=54.166.50.69:5050,54.160.61.169:5050,localhost:5050
> \
>   -serverset_path=/aurora/scheduler \
>   -native_log_zk_group_path=/aurora/replicated-log \
>   -native_log_file_path=$AURORA_HOME/scheduler/db \
>   -backup_dir=$AURORA_HOME/scheduler/backups \
>   -thermos_executor_path=/dev/null \
>   -gc_executor_path=$DIST_DIR/gc_executor.pex \
>   -enable_beta_updater=true \
>   -vlog=INFO \
>   -logtostderr
>
> Attached is the entire log, but basically I'm seeing this:
>
> I1015 21:18:05.315263 27634 group.cpp:313] Group process (group(1)@
> 10.88.26.227:40393) connected to ZooKeeper
> I1015 21:18:05.315322 27634 group.cpp:787] Syncing group operations: queue
> size (joins, cancels, datas) = (0, 0, 0)
> I1015 21:18:05.315348 27634 group.cpp:385] Trying to create path
> '/aurora/replicated-log' in ZooKeeper
> I1015 21:18:05.316 THREAD1
> com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: Candidate
> /aurora/scheduler/singleton_candidate_0000000008 is now leader of group:
> [singleton_candidate_0000000008]
> I1015 21:18:05.317 THREAD1
> com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
> state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
> I1015 21:18:05.317 THREAD1
> org.apache.aurora.scheduler.SchedulerLifecycle$6.execute: Elected as
> leading scheduler!
> I1015 21:18:05.330394 27634 network.hpp:423] ZooKeeper group memberships
> changed
> I1015 21:18:05.330660 27639 group.cpp:658] Trying to get
> '/aurora/replicated-log/0000000008' in ZooKeeper
> I1015 21:18:05.331550 27635 network.hpp:461] ZooKeeper group PIDs: {
> log-replica(1)@10.88.26.227:40393 }
> I1015 21:18:06.027016 27634 replica.cpp:638] Replica in EMPTY status
> received a broadcasted recover request
> I1015 21:18:06.027216 27634 recover.cpp:188] Received a recover response
> from a replica in EMPTY status
> <repeat last 2 message ad nauseum>
>
> How can I debug what's going on?
>
> Thanks,
> Dobromir
>