Posted to user@flink.apache.org by James Bucher <jb...@expedia.com> on 2017/09/06 17:43:37 UTC

Flink 1.2.1 JobManager Election Deadlock

Hey all,

Just wanted to report this for posterity in case someone else sees something similar. We were running Flink 1.2.1 in Kubernetes, in an HA setup with an external ZooKeeper ensemble and S3 for checkpointing (roughly the configuration sketched after the list below). We recently noticed a job that appeared to have deadlocked on JobManager leader election. We think the issue happened something like the following:

  1.  The job was up and running normally.
  2.  Some cluster event caused the JobManager Pod (process) to get restarted.
  3.  The JobManager came up again but got stuck in leader election. From then on the JobManager UI responded with "Service temporarily unavailable due to an ongoing leader election. Please refresh."
  4.  The JobManager never exited this state.
  5.  This state persisted across Pod restarts/deletes.
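
For reference, a rough sketch of the HA-related flink-conf.yaml keys for a setup like this (key names as in the Flink 1.2/1.3 documentation; the quorum, root path and bucket below are placeholders, not our real values):

# enable ZooKeeper-based high availability
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
high-availability.zookeeper.path.root: /my-cluster/flink
# metadata (job graphs, checkpoint pointers) goes to durable storage, e.g. S3
high-availability.zookeeper.storageDir: s3a://my-bucket/flink/recovery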

To pin this problem down we brought up the job in another Flink cluster and started debugging. As a first step I raised the logging level on the JobManager and applied the change (a sketch of the kind of log4j entries involved follows the excerpt). This resulted in the following log (full logs attached to this email):


2017-09-05 18:50:55,072 TRACE org.apache.flink.shaded.org.apache.curator.utils.DefaultTracerDriver  - Trace: GetDataBuilderImpl-Background - 4 ms

2017-09-05 18:50:55,075 DEBUG org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Received CHILD_ADDED event (path: /4e4cdc8fb8c1437c620cd4063bd265e1)

2017-09-05 18:50:55,088 DEBUG org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Received CHILD_ADDED event notification for job 4e4cdc8fb8c1437c620cd4063bd265e1

2017-09-05 18:50:56,072 DEBUG org.apache.zookeeper.ClientCnxn                               - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:null serverPath:null finished:false header:: 1,3  replyHeader:: 1,21476396435,0  request:: '/traveler-profile.us-west-2c.test.expedia.com,F  response:: s{4305881524,4305881524,1497456679255,1497456679255,0,3,0,0,0,3,21475572278}

2017-09-05 18:50:56,078 DEBUG org.apache.zookeeper.ClientCnxn                               - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:null serverPath:null finished:false header:: 2,3  replyHeader:: 2,21476396435,0  request:: '/traveler-profile.us-west-2c.test.expedia.com/krazyglue,F  response:: s{4305881525,4305881525,1497456679260,1497456679260,0,1,0,0,0,1,4305881526}

2017-09-05 18:50:56,079 DEBUG org.apache.zookeeper.ClientCnxn                               - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:null serverPath:null finished:false header:: 3,3  replyHeader:: 3,21476396435,0  request:: '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip,F  response:: s{4305881526,4305881526,1497456679263,1497456679263,0,1,0,0,0,1,4305881527}

2017-09-05 18:50:56,080 DEBUG org.apache.zookeeper.ClientCnxn                               - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:null serverPath:null finished:false header:: 4,3  replyHeader:: 4,21476396435,0  request:: '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink,F  response:: s{4305881527,4305881527,1497456679267,1497456679267,0,1,0,0,0,1,4305881528}

2017-09-05 18:50:56,081 DEBUG org.apache.zookeeper.ClientCnxn                               - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:null serverPath:null finished:false header:: 5,3  replyHeader:: 5,21476396435,0  request:: '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default,F  response:: s{4305881528,4305881528,1497456679270,1497456679270,0,27,0,0,0,5,21475449005}

2017-09-05 18:50:56,085 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.

2017-09-05 18:50:56,087 DEBUG org.apache.zookeeper.ClientCnxn                               - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:null serverPath:null finished:false header:: 6,3  replyHeader:: 6,21476396435,0  request:: '/traveler-profile.us-west-2c.test.expedia.com,F  response:: s{4305881524,4305881524,1497456679255,1497456679255,0,3,0,0,0,3,21475572278}

2017-09-05 18:50:56,087 DEBUG org.apache.zookeeper.ClientCnxn                               - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:null serverPath:null finished:false header:: 7,3  replyHeader:: 7,21476396435,0  request:: '/traveler-profile.us-west-2c.test.expedia.com/krazyglue,F  response:: s{4305881525,4305881525,1497456679260,1497456679260,0,1,0,0,0,1,4305881526}

2017-09-05 18:50:56,088 DEBUG org.apache.zookeeper.ClientCnxn                               - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:null serverPath:null finished:false header:: 8,3  replyHeader:: 8,21476396435,0  request:: '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip,F  response:: s{4305881526,4305881526,1497456679263,1497456679263,0,1,0,0,0,1,4305881527}

2017-09-05 18:50:56,090 DEBUG org.apache.zookeeper.ClientCnxn                               - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:null serverPath:null finished:false header:: 9,3  replyHeader:: 9,21476396435,0  request:: '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink,F  response:: s{4305881527,4305881527,1497456679267,1497456679267,0,1,0,0,0,1,4305881528}

2017-09-05 18:50:56,091 DEBUG org.apache.zookeeper.ClientCnxn                               - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:null serverPath:null finished:false header:: 10,3  replyHeader:: 10,21476396435,0  request:: '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default,F  response:: s{4305881528,4305881528,1497456679270,1497456679270,0,27,0,0,0,5,21475449005}

2017-09-05 18:50:56,092 DEBUG org.apache.zookeeper.ClientCnxn                               - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader serverPath:/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader finished:false header:: 11,3  replyHeader:: 11,21476396435,0  request:: '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader,T  response:: s{21475449005,21475449005,1503519415753,1503519415753,0,1,0,0,0,1,21475449006}

2017-09-05 18:50:56,092 TRACE org.apache.flink.shaded.org.apache.curator.utils.DefaultTracerDriver  - Trace: ExistsBuilderImpl-Background - 1 ms

2017-09-05 18:50:56,093 DEBUG org.apache.zookeeper.ClientCnxn                               - Reading reply sessionid:0x15e3df8b5a57b4b, packet:: clientPath:/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader serverPath:/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader finished:false header:: 12,4  replyHeader:: 12,21476396435,0  request:: '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader,T  response:: ,s{21475449005,21475449005,1503519415753,1503519415753,0,1,0,0,0,1,21475449006}

2017-09-05 18:50:56,093 TRACE org.apache.flink.shaded.org.apache.curator.utils.DefaultTracerDriver  - Trace: GetDataBuilderImpl-Background - 1 ms

2017-09-05 18:50:56,093 DEBUG org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Leader node has changed.
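
For anyone wanting to reproduce this level of detail, a minimal sketch of the log4j.properties entries that produce output like the above (assuming the stock log4j setup shipped with the Flink distribution; the exact levels are illustrative):

# Curator is shaded into Flink, so the shaded package name is the one to raise
log4j.logger.org.apache.flink.shaded.org.apache.curator=TRACE
# raw ZooKeeper client traffic (the ClientCnxn lines above)
log4j.logger.org.apache.zookeeper=DEBUG
# Flink's own leader election/retrieval services
log4j.logger.org.apache.flink.runtime.leaderelection=DEBUG
log4j.logger.org.apache.flink.runtime.leaderretrieval=DEBUG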

At this point I dug into the logs to see what was going on. It quickly became apparent that this probably had something to do with Apache Curator, which Flink relies on for leader election. I then grabbed a dump of the ZooKeeper tree using https://github.com/ctapmex/zkTreeUtil:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root path="/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink">
  <zknode name="default">
    <zknode name="leaderlatch">
      <zknode name="00000000000000000000000000000000">
        <zknode name="job_manager_lock"/>
      </zknode>
      <zknode ephemeral="true" name="_c_9cd3f122-65de-40e7-8e2b-aa34e6a85b83-latch-0000000020"/>
    </zknode>
    <zknode name="checkpoint-counter">
      <zknode name="4e4cdc8fb8c1437c620cd4063bd265e1" value="&#0;&#0;&#0;&#1;"/>
    </zknode>
    <zknode name="checkpoints">
      <zknode name="4e4cdc8fb8c1437c620cd4063bd265e1"/>
    </zknode>
    <zknode name="jobgraphs">
      <zknode name="4e4cdc8fb8c1437c620cd4063bd265e1" value="��&#0;&#5;sr&#0;;org.apache.flink.runtime.state.RetrievableStreamStateHandle&#0;&#1;&#30;&#24;�U�+&#2;&#0;&#1;L&#0;&#24;wrappedStreamStateHandlet&#0;2Lorg/apache/flink/runtime/state/StreamStateHandle;xpsr&#0;9org.apache.flink.runtime.state.filesystem.FileStateHandle&#4;�u�b�&#27;�&#2;&#0;&#2;J&#0;&#9;stateSizeL&#0;&#8;filePatht&#0;&#31;Lorg/apache/flink/core/fs/Path;xp&#0;&#0;&#0;&#0;&#0;&#15;h�sr&#0;&#29;org.apache.flink.core.fs.Path&#0;&#0;&#0;&#0;&#0;&#0;&#0;&#1;&#2;&#0;&#1;L&#0;&#3;urit&#0;&#14;Ljava/net/URI;xpsr&#0;&#12;java.net.URI�&#1;x.C�I�&#3;&#0;&#1;L&#0;&#6;stringt&#0;&#18;Ljava/lang/String;xpt&#0;�s3a://ewetest-traveler-profile/client-state/krazyglue/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/zookeeperCheckpoints/submittedJobGraph6507f982fe76x"/>
    </zknode>
    <zknode name="leader">
      <zknode name="00000000000000000000000000000000"/>
    </zknode>
  </zknode>
</root>

In a healthy cluster there is generally a node under “job_manager_lock” which represents the elected leader. This pointed to a problem with the Apache Curator framework. I opened up both Flink 1.2.1 and 1.3.2 to see which version of Curator each uses. Flink 1.3.2 uses Curator 2.12.0 whereas Flink 1.2.1 uses 2.8.0. Looking at the release notes for Curator there were quite a few bug fixes that might explain this (https://issues.apache.org/jira/browse/CURATOR-264 for example). I was curious to see if leader election would become unstuck if the cluster was upgraded to Flink 1.3.2. I upgraded the JobManager to 1.3.2 and sure enough leader election proceeded correctly. After upgrading the TaskManagers the job restored from its last checkpoint correctly and continued.
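
For anyone who wants to check this on their own cluster, the same thing can be inspected with the stock ZooKeeper CLI. A sketch, using the HA root from the dump above (substitute your own quorum address):

bin/zkCli.sh -server zk-host:2181

# a healthy JobManager shows up as an ephemeral latch child here; in our stuck cluster this was empty
ls /traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leaderlatch/00000000000000000000000000000000/job_manager_lock

# the published leader information lives under the leader node (in our dump this node had no value)
get /traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader/00000000000000000000000000000000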

Other interesting things of note:

  1.  Restarting/killing the JobManager didn't resolve the issue. I suspect this is due to bad state in ZooKeeper.
  2.  ZooKeeper was running throughout this time: we didn't see issues in our Kafka cluster, ZooKeeper had a leader elected correctly, and the ZooKeeper command line utility worked without issue.
  3.  Restarts of the JobManager appeared to correctly create ephemeral nodes in ZooKeeper to represent possible leaders (which also appeared to get cleaned up correctly); see the sketch after this list for the recipe that creates them.
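
For context, those ephemeral nodes come from Curator's LeaderLatch recipe, which Flink's leader election builds on: every contender creates an ephemeral sequential node under the latch path and the lowest sequence number wins. A minimal standalone sketch of that recipe (illustrative only, not Flink's actual election code; the quorum address and latch path are placeholders):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderLatchSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Each instance of this process is one leadership contender; Curator creates an
        // ephemeral sequential child under the latch path for it.
        LeaderLatch latch = new LeaderLatch(client, "/flink/default/leaderlatch/job_manager_lock");
        latch.addListener(new LeaderLatchListener() {
            @Override
            public void isLeader() {
                // leadership granted: roughly where Flink would publish the leader address
                System.out.println("became leader");
            }

            @Override
            public void notLeader() {
                System.out.println("lost leadership");
            }
        });
        latch.start();

        // keep the process alive; leadership changes arrive via the listener callbacks
        Thread.currentThread().join();
    }
}

In the dump above the job_manager_lock node has no children (the ephemeral latch node sits directly under leaderlatch instead), so presumably no contender ever sees itself as the lowest entry and the isLeader() callback never fires, which matches the symptom we saw.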

As far as I can tell this issue is resolved by the newer Apache Curator version in 1.3.2. It may be useful to keep an eye out for it, however.

Thanks,
James Bucher


Re: Flink 1.2.1 JobManager Election Deadlock

Posted by Mar_zieh <m....@gmail.com>.
Dear guys

Hello

I have a cluster of three physical nodes with Docker installed on each of
them. I configured Mesos, Hadoop, Marathon and ZooKeeper; they all ran without
any problems. Then I ran a Flink application on Marathon. After that, I could
open the Flink UI in Firefox. I tested a program on the Flink cluster and it
ran without errors.
The problem is that when I stop Docker and start it again, I can reach the
Hadoop and Mesos UIs, but the Flink UI is down with this error:

Service temporarily unavailable due to an ongoing leader election. Please
refresh

I checked that ZooKeeper is running fine. Would you please tell me how to
solve this issue?

Any help would be really appreciated.



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Flink 1.2.1 JobManager Election Deadlock

Posted by Ufuk Celebi <uc...@apache.org>.
Thanks for looking into this and finding out that it is (probably)
related to Curator. Very valuable information!

On Thu, Sep 7, 2017 at 3:09 PM, Timo Walther <tw...@apache.org> wrote:
> Thanks for informing us. As far as I know, we were not aware of any deadlock
> in the JobManager election. Let's hope that the updated Curator version
> fixed the problem. We will definitely keep an eye on this. Feel free to
> contact the dev@ mailing list if the problem still exists in 1.3.2.
>
> Regards,
> Timo
>
> On 06.09.17 at 22:58, James Bucher wrote:
>
> It seems I forgot to attach the full logs. They should be attached now.

Re: Flink 1.2.1 JobManager Election Deadlock

Posted by Timo Walther <tw...@apache.org>.
Thanks for informing us. As far as I know, we were not aware of any 
deadlock in the JobManager election. Let's hope that the updated Curator 
version fixed the problem. We will definitely keep an eye on this. Feel
free to contact the dev@ mailing list if the problem still exists in 1.3.2.

Regards,
Timo

On 06.09.17 at 22:58, James Bucher wrote:
> It seems I forgot to attach the full logs. They should be attached now.

Re: Flink 1.2.1 JobManager Election Deadlock

Posted by James Bucher <jb...@expedia.com>.
It seems I forgot to attach the full logs. They should be attached now.
