Posted to user@aurora.apache.org by Mohit Jaggi <mo...@uber.com> on 2017/09/27 06:51:05 UTC

Lost framework registered event [Was Re: leader election issues]

Updating the subject... as it looks like leader election was fine, but the
registration ack did not make it to the SchedulerLifecycle code. Weird that
an event would get lost like that.

On Tue, Sep 26, 2017 at 4:21 PM, John Sirois <jo...@gmail.com> wrote:

>
>
> On Tue, Sep 26, 2017 at 4:33 PM, Mohit Jaggi <mo...@uber.com> wrote:
>
>> John,
>> I was referring to the following log message...isn't that the right one?
>>
>
> Aha - it is, apologies.
>
>
>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795
>>  [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180]
>> Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>>
>> On Tue, Sep 26, 2017 at 3:30 PM, John Sirois <jo...@gmail.com>
>> wrote:
>>
>>> > ... but it succeeded(see logs at the end) ...
>>>
>>> The underlying C++ code in the scheduler driver successfully connected
>>> to the leading master socket:
>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926 18:11:37.231549 24868
>>> scheduler.cpp:361] Connected with the master at
>>> http://10.163.25.45:5050/master/api/v1/scheduler
>>>
>>> This is not the same as a framework registration call being successfully
>>> executed against the newly connected master.
>>> You need to be careful about what you derive from the logs based only on
>>> a reading of the words. Generally you'll need to read carefully / grep the
>>> source code to be sure you are mentally modelling the code flows correctly.
>>> It certainly gets tricky.
>>>
>>> On Tue, Sep 26, 2017 at 4:14 PM, Mohit Jaggi <mo...@uber.com>
>>> wrote:
>>>
>>>> Hmm... so it is indeed the Mesos registration, but it succeeded (see
>>>> logs at the end). Looking at the code, I see that
>>>> com.google.common.eventbus is used as the pub-sub mechanism to link the
>>>> registration (first log message) to the registrationAcked flag... and
>>>> this flag must not have been set for 10 minutes (?), otherwise the
>>>> registration timeout handler would not have printed the second log message.
>>>>
>>>> delayedActions.onRegistrationTimeout(
>>>>     () -> {
>>>>       if (!registrationAcked.get()) {
>>>>         LOG.error(
>>>>             "Framework has not been registered within the tolerated delay.");
>>>>         stateMachine.transition(State.DEAD);
>>>>       }
>>>>     });
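For illustration, a minimal sketch of the Guava EventBus wiring described
above; the class and event names (RegistrationAckSketch, DriverRegistered)
are hypothetical stand-ins, not Aurora's actual types:

import com.google.common.eventbus.EventBus;
import com.google.common.eventbus.Subscribe;
import java.util.concurrent.atomic.AtomicBoolean;

public final class RegistrationAckSketch {
  // Hypothetical stand-in for the pubsub event posted when Mesos acks registration.
  static final class DriverRegistered {
    final String frameworkId;
    DriverRegistered(String frameworkId) { this.frameworkId = frameworkId; }
  }

  private final AtomicBoolean registrationAcked = new AtomicBoolean(false);

  // EventBus only delivers to objects register()-ed on the same bus instance
  // before the event is posted; otherwise the flag never flips, even though
  // the "Registered with ID ..." line was logged by the callback handler.
  @Subscribe
  public void onRegistered(DriverRegistered event) {
    registrationAcked.set(true);
  }

  public static void main(String[] args) {
    EventBus bus = new EventBus();
    RegistrationAckSketch handler = new RegistrationAckSketch();
    bus.register(handler);                       // subscribe before posting
    bus.post(new DriverRegistered("4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"));
    System.out.println("acked=" + handler.registrationAcked.get());  // true
  }
}

Note that Guava's EventBus silently wraps posts that reach no subscriber in a
DeadEvent, which is one way an event like this could be "lost".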
>>>>
>>>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795
>>>> [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180]
>>>> Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-0000"
>>>>
>>>> ...
>>>>
>>>> ...
>>>>
>>>> Sep 26 18:21:37 machine62 aurora-scheduler[24743]: E0926 18:21:37.205
>>>> [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered
>>>> within the tolerated delay.
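The ten-minute gap between "Giving up on registration in (10, mins)" at
18:11:37 (see the SchedulerLifecycle grep further down) and the error at
18:21:37 is consistent with a one-shot delayed action armed when leadership
is won. A rough sketch of that shape, using a plain ScheduledExecutorService
rather than Aurora's actual DelayedActions implementation:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public final class RegistrationTimeoutSketch {
  public static void main(String[] args) {
    AtomicBoolean registrationAcked = new AtomicBoolean(false);
    ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    // Armed once, when leadership is won (18:11:37 in the logs); fires ten
    // minutes later (18:21:37) and only then inspects the flag, so a
    // registration ack that never reaches the flag still ends in DEAD.
    timer.schedule(
        () -> {
          if (!registrationAcked.get()) {
            System.err.println(
                "Framework has not been registered within the tolerated delay.");
            // the real scheduler then transitions its state machine to DEAD
          }
        },
        10,
        TimeUnit.MINUTES);
  }
}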
>>>>
>>>> On Tue, Sep 26, 2017 at 2:34 PM, John Sirois <jo...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, Sep 26, 2017 at 3:31 PM, Mohit Jaggi <mo...@uber.com>
>>>>> wrote:
>>>>>
>>>>>> Actually I wonder if the "register" here is for Aurora to register
>>>>>> with Mesos as a framework... I was assuming it referred to registering
>>>>>> with ZK as a leader.
>>>>>>
>>>>>
>>>>> Bingo.
>>>>>
>>>>>
>>>>>>
>>>>>> On Tue, Sep 26, 2017 at 2:27 PM, Renan DelValle Rueda <
>>>>>> renanidelvalle@gmail.com> wrote:
>>>>>>
>>>>>>> I've had this, or something very similar, happen before. It's an
>>>>>>> issue with Aurora and ZK. Election is based upon ZK, so if writing down who
>>>>>>> the leader is to the ZK server path fails, or if ZK is unable to reach
>>>>>>> quorum on the write, the election will fail. Sometimes this might manifest
>>>>>>> itself in weird ways, such as two Aurora schedulers believing they are
>>>>>>> leaders. If you could tell us a little bit about your ZK setup, we might be
>>>>>>> able to narrow down the issue. The Aurora version, and whether you are
>>>>>>> using Curator or the commons library, will also help.
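For reference, leader election of the kind described here is typically built
on a Curator LeaderLatch. A minimal sketch, assuming Curator and using a
placeholder connect string and election path rather than this cluster's real
values:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public final class LeaderElectionSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the ZK ensemble (the connect string is a placeholder).
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();

    // Every scheduler candidate creates a latch on the same election path.
    // Leadership is only as durable as the underlying ZK session: if the
    // znode write fails or quorum is lost, leadership can be lost or flap.
    try (LeaderLatch latch = new LeaderLatch(client, "/aurora/scheduler/leader")) {
      latch.start();
      latch.await();                      // blocks until this candidate leads
      System.out.println("leading=" + latch.hasLeadership());
      // ... act as leader; close() relinquishes leadership ...
    }
    client.close();
  }
}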
>>>>>>>
>>>>>>> On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <mo...@uber.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hmm... it seems machine62 became a leader but could not "register" as
>>>>>>>> leader. Not sure what that means. My naive assumption was that "becoming
>>>>>>>> leader" and "registering as leader" are "atomic".
>>>>>>>>
>>>>>>>> ------- grep on SchedulerLifecycle -----
>>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62
>>>>>>>> aurora-scheduler[24743]: I0926 18:11:33.158 [LeaderSelector-0,
>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine transition
>>>>>>>> STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
>>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62
>>>>>>>> aurora-scheduler[24743]: I0926 18:11:33.159 [LeaderSelector-0,
>>>>>>>> SchedulerLifecycle$4:224] Elected as leading scheduler!
>>>>>>>> aurora-scheduler.log:Sep 26 18:11:37 machine62
>>>>>>>> aurora-scheduler[24743]: I0926 18:11:37.204 [LeaderSelector-0,
>>>>>>>> SchedulerLifecycle$DefaultDelayedActions:163] Giving up on
>>>>>>>> registration in (10, mins)
>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62
>>>>>>>> aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0,
>>>>>>>> SchedulerLifecycle$4:235] Framework has not been registered within the
>>>>>>>> tolerated delay.
>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62
>>>>>>>> aurora-scheduler[24743]: I0926 18:21:37.205 [Lifecycle-0,
>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine transition
>>>>>>>> LEADER_AWAITING_REGISTRATION -> DEAD
>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62
>>>>>>>> aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0,
>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine transition DEAD
>>>>>>>> -> DEAD
>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62
>>>>>>>> aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0,
>>>>>>>> SchedulerLifecycle$6:275] Shutdown already invoked, ignoring extra call.
>>>>>>>> aurora-scheduler.log:Sep 26 18:22:05 machine62
>>>>>>>> aurora-scheduler[54502]: I0926 18:22:05.681 [main,
>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine transition IDLE
>>>>>>>> -> PREPARING_STORAGE
>>>>>>>> aurora-scheduler.log:Sep 26 18:22:06 machine62
>>>>>>>> aurora-scheduler[54502]: I0926 18:22:06.396 [main,
>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine transition
>>>>>>>> PREPARING_STORAGE -> STORAGE_PREPARED
>>>>>>>>
>>>>>>>>
>>>>>>>> ------ connecting to mesos -----
>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>> 18:11:37.211750 24871 group.cpp:757] Found non-sequence node 'log_replicas'
>>>>>>>> at '/mesos' in ZooKeeper
>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>> 18:11:37.211817 24871 detector.cpp:152] Detected a new leader: (id='1506')
>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>> 18:11:37.211917 24871 group.cpp:699] Trying to get
>>>>>>>> '/mesos/json.info_0000001506' in ZooKeeper
>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>> 18:11:37.216063 24871 zookeeper.cpp:262] A new leading master (UPID=
>>>>>>>> master@10.163.25.45:5050) is detected
>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>> 18:11:37.216162 24871 scheduler.cpp:470] New master detected at
>>>>>>>> master@10.163.25.45:5050
>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>> 18:11:37.217772 24871 scheduler.cpp:479] Waiting for 12.81503ms before
>>>>>>>> initiating a re-(connection) attempt with the master
>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>> 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at
>>>>>>>> http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <wf...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> No, there is not a legitimate reason. Did this occur for an
>>>>>>>>> extended period of time? Do you have logs from the scheduler indicating
>>>>>>>>> that it lost ZK leadership and subsequently interacted with Mesos?
>>>>>>>>>
>>>>>>>>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <mohit.jaggi@uber.com
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> Fellows,
>>>>>>>>>> While examining Aurora log files, I noticed a condition where a
>>>>>>>>>> scheduler was talking to Mesos but was not showing up as the leader in
>>>>>>>>>> ZooKeeper. It ultimately restarted itself and another scheduler became the
>>>>>>>>>> leader.
>>>>>>>>>> Is there a reason a non-leading scheduler would talk to Mesos?
>>>>>>>>>>
>>>>>>>>>> Mohit.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Lost framework registered event [Was Re: leader election issues]

Posted by Mohit Jaggi <mo...@uber.com>.
Thanks!

On Thu, Oct 26, 2017 at 9:51 PM, Bill Farner <wf...@apache.org> wrote:

> FYI - i believe we stumbled on this issue, which is greatly exacerbated by
> f2755e1
> <https://github.com/apache/aurora/commit/f2755e1cdd67f3c1516726c21d6e8f13059a5a01>.
> The good news is that we now have a good handle on the culprit!  More
> details at https://issues.apache.org/jira/browse/AURORA-1953

Re: Lost framework registered event [Was Re: leader election issues]

Posted by Bill Farner <wf...@apache.org>.
FYI - I believe we stumbled on this issue, which is greatly exacerbated by
f2755e1
<https://github.com/apache/aurora/commit/f2755e1cdd67f3c1516726c21d6e8f13059a5a01>.
The good news is that we now have a good handle on the culprit! More
details at https://issues.apache.org/jira/browse/AURORA-1953


Re: Lost framework registered event [Was Re: leader election issues]

Posted by Mohit Jaggi <mo...@uber.com>.
Hmm... it is a very busy cluster and 10 minutes of logs will be voluminous.
They also contain some internal details that I cannot share publicly. If you
suspect specific areas, I can try to pull those logs and remove the internal
info.

Re: code, we run a fork that is very close to master.


Re: Lost framework registered event [Was Re: leader election issues]

Posted by Bill Farner <wf...@apache.org>.
What commit/release was this with?  From the looks of the log contents,
it's not master.  I'd like to make sure I'm looking at the correct code.

Are there more logs not being included?  If so, can you share more complete
logs?  In particular, logs from the 10-minute delay would be especially
helpful.
