You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Chirag Dewan <ch...@yahoo.in> on 2018/02/14 05:12:42 UTC

Deploying Flink with JobManager HA on Docker Swarm/Kubernetes

Hi,
I am trying to deploy a Flink cluster (1 JM, 2TM) on a Docker Swarm. For JobManager HA, I have started a 3 node zookeeper service on the same swarm network and configured Flink's zookeeper quorum with zookeeper service instances. 
JobManager gets started with the LeaderElectionService and gets assigned a LeaderSessionID too, which I can see from the following log statements(attaching only related logs) :
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService   org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager was granted leadership with leader session ID Some(1f3b2ec6-77b6-4532-928f-ad8befd5202f).
 Trying to associate with JobManager leader akka.tcp://flink@jobmanager:6123/user/jobmanager Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#590681231] - leader session 1f3b2ec6-77b6-4532-928f-ad8befd5202f

But TaskManagers are not able to register with the JobManager and gives the following error:
Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,RegisterTaskManager(4fc8aceeae1e27e42b9f16df6c0cf5e3,4fc8aceeae1e27e42b9f16df6c0cf5e3 @ a118cdf39114 (dataPort=43017),cores=1, physMem=1044111360, heap=536870912, managed=324208384,1)) because the expected leader session ID 1f3b2ec6-77b6-4532-928f-ad8befd5202f did not equal the received leader session ID 00000000-0000-0000-0000-000000000000.

Seems like the ResourceManager was not able to retrieve the LeaderSessionID and passed 00 ID. 
One interesting thing I observed was a ZK version log:
The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.

Is this a ZK version problem? Should I be using ZK 3.4.6?
My configuration:
Flink Version : 1.4.0ZK version : 3.4.11 (I just pulled the latest image)
Thanks in advance. 
Chirag

Re: Deploying Flink with JobManager HA on Docker Swarm/Kubernetes

Posted by Aljoscha Krettek <al...@apache.org>.

Hi,

AFAIK, the JobGraph itself is not stored in ZK but in HDFS. ZK only stores a handle to the serialised JobGraph.

Best,
Aljoscha

> On 15. Feb 2018, at 04:59, Chirag Dewan <ch...@yahoo.in> wrote:
> 
> Thanks a lot Aljoscha.
> 
> I was doing a silly mistake. TaskManagers can now register with JobManager.
> 
> One more thing, does Flink now store Job Graphs on ZK too?
> 
> Regards,
> 
> Chirag
> 
> On Wednesday, 14 February, 2018, 8:06:14 PM IST, Aljoscha Krettek <al...@apache.org> wrote:
> 
> 
> It should be roughly the same settings that you use in your JobManager. They are described here: https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#zookeeper-based-ha-mode <https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#zookeeper-based-ha-mode>
> 
>> On 14. Feb 2018, at 15:32, Chirag Dewan <chirag.dewan22@yahoo.in <ma...@yahoo.in>> wrote:
>> 
>> Thanks Aljoscha.
>> 
>> I haven't checked that bit. Is there any configuration for TaskManagers to find ZK?
>> 
>> Regards,
>> 
>> Chirag
>> 
>> Sent from Yahoo Mail on Android <https://overview.mail.yahoo.com/mobile/?.src=Android>
>> On Wed, 14 Feb 2018 at 7:43 PM, Aljoscha Krettek
>> <aljoscha@apache.org <ma...@apache.org>> wrote:
>> Do you see in the logs whether the TaskManager correctly connect to ZooKeeper as well? They need this in order to find the JobManager leader.
>> 
>> Best,
>> Aljoscha
>> 
>>> On 14. Feb 2018, at 06:12, Chirag Dewan <chirag.dewan22@yahoo.in <ma...@yahoo.in>> wrote:
>>> 
>>> Hi,
>>> 
>>> I am trying to deploy a Flink cluster (1 JM, 2TM) on a Docker Swarm. For JobManager HA, I have started a 3 node zookeeper service on the same swarm network and configured Flink's zookeeper quorum with zookeeper service instances. 
>>> 
>>> JobManager gets started with the LeaderElectionService and gets assigned a LeaderSessionID too, which I can see from the following log statements(attaching only related logs) :
>>> 
>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService   org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
>>> JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager <> was granted leadership with leader session ID Some(1f3b2ec6-77b6-4532-928f-ad8befd5202f).
>>>  Trying to associate with JobManager leader akka.tcp://flink@jobmanager:6123/user/jobmanager <>
>>>  Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#590681231 <>] - leader session 1f3b2ec6-77b6-4532-928f-ad8befd5202f
>>> 
>>> 
>>> But TaskManagers are not able to register with the JobManager and gives the following error:
>>> 
>>> Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,RegisterTaskManager(4fc8aceeae1e27e42b9f16df6c0cf5e3,4fc8aceeae1e27e42b9f16df6c0cf5e3 @ a118cdf39114 (dataPort=43017),cores=1, physMem=1044111360, heap=536870912, managed=324208384,1)) because the expected leader session ID 1f3b2ec6-77b6-4532-928f-ad8befd5202f did not equal the received leader session ID 00000000-0000-0000-0000-000000000000.
>>> 
>>> Seems like the ResourceManager was not able to retrieve the LeaderSessionID and passed 00 ID. 
>>> 
>>> One interesting thing I observed was a ZK version log:
>>> 
>>> The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
>>> 
>>> Is this a ZK version problem? Should I be using ZK 3.4.6?
>>> 
>>> My configuration:
>>> 
>>> Flink Version : 1.4.0
>>> ZK version : 3.4.11 (I just pulled the latest image)
>>> 
>>> Thanks in advance. 
>>> 
>>> Chirag
>>> 
>> 
>

Re: Deploying Flink with JobManager HA on Docker Swarm/Kubernetes

Posted by Chirag Dewan <ch...@yahoo.in>.

Thanks a lot Aljoscha.
I was doing a silly mistake. TaskManagers can now register with JobManager.
One more thing, does Flink now store Job Graphs on ZK too?
Regards,
Chirag
On Wednesday, 14 February, 2018, 8:06:14 PM IST, Aljoscha Krettek <al...@apache.org> wrote:

It should be roughly the same settings that you use in your JobManager. They are described here: https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#zookeeper-based-ha-mode

On 14. Feb 2018, at 15:32, Chirag Dewan <ch...@yahoo.in> wrote:
Thanks Aljoscha.
I haven't checked that bit. Is there any configuration for TaskManagers to find ZK?
Regards,
Chirag

Sent from Yahoo Mail on Android

On Wed, 14 Feb 2018 at 7:43 PM, Aljoscha Krettek<al...@apache.org> wrote: Do you see in the logs whether the TaskManager correctly connect to ZooKeeper as well? They need this in order to find the JobManager leader.
Best,Aljoscha

On 14. Feb 2018, at 06:12, Chirag Dewan <ch...@yahoo.in> wrote:
Hi,
I am trying to deploy a Flink cluster (1 JM, 2TM) on a Docker Swarm. For JobManager HA, I have started a 3 node zookeeper service on the same swarm network and configured Flink's zookeeper quorum with zookeeper service instances.
JobManager gets started with the LeaderElectionService and gets assigned a LeaderSessionID too, which I can see from the following log statements(attaching only related logs) :
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager was granted leadership with leader session ID Some(1f3b2ec6-77b6-4532-928f-ad8befd5202f).
Trying to associate with JobManager leader akka.tcp://flink@jobmanager:6123/user/jobmanager Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#590681231] - leader session 1f3b2ec6-77b6-4532-928f-ad8befd5202f

But TaskManagers are not able to register with the JobManager and gives the following error:
Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,RegisterTaskManager(4fc8aceeae1e27e42b9f16df6c0cf5e3,4fc8aceeae1e27e42b9f16df6c0cf5e3 @ a118cdf39114 (dataPort=43017),cores=1, physMem=1044111360, heap=536870912, managed=324208384,1)) because the expected leader session ID 1f3b2ec6-77b6-4532-928f-ad8befd5202f did not equal the received leader session ID 00000000-0000-0000-0000-000000000000.

Seems like the ResourceManager was not able to retrieve the LeaderSessionID and passed 00 ID.
One interesting thing I observed was a ZK version log:
The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.

Is this a ZK version problem? Should I be using ZK 3.4.6?
My configuration:
Flink Version : 1.4.0ZK version : 3.4.11 (I just pulled the latest image)
Thanks in advance.
Chirag

Re: Deploying Flink with JobManager HA on Docker Swarm/Kubernetes

Posted by Aljoscha Krettek <al...@apache.org>.

It should be roughly the same settings that you use in your JobManager. They are described here: https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#zookeeper-based-ha-mode

> On 14. Feb 2018, at 15:32, Chirag Dewan <ch...@yahoo.in> wrote:
> 
> Thanks Aljoscha.
> 
> I haven't checked that bit. Is there any configuration for TaskManagers to find ZK?
> 
> Regards,
> 
> Chirag
> 
> Sent from Yahoo Mail on Android <https://overview.mail.yahoo.com/mobile/?.src=Android>
> On Wed, 14 Feb 2018 at 7:43 PM, Aljoscha Krettek
> <al...@apache.org> wrote:
> Do you see in the logs whether the TaskManager correctly connect to ZooKeeper as well? They need this in order to find the JobManager leader.
> 
> Best,
> Aljoscha
> 
>> On 14. Feb 2018, at 06:12, Chirag Dewan <chirag.dewan22@yahoo.in <ma...@yahoo.in>> wrote:
>> 
>> Hi,
>> 
>> I am trying to deploy a Flink cluster (1 JM, 2TM) on a Docker Swarm. For JobManager HA, I have started a 3 node zookeeper service on the same swarm network and configured Flink's zookeeper quorum with zookeeper service instances. 
>> 
>> JobManager gets started with the LeaderElectionService and gets assigned a LeaderSessionID too, which I can see from the following log statements(attaching only related logs) :
>> 
>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService   org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
>> JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager <> was granted leadership with leader session ID Some(1f3b2ec6-77b6-4532-928f-ad8befd5202f).
>>  Trying to associate with JobManager leader akka.tcp://flink@jobmanager:6123/user/jobmanager <>
>>  Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#590681231 <>] - leader session 1f3b2ec6-77b6-4532-928f-ad8befd5202f
>> 
>> 
>> But TaskManagers are not able to register with the JobManager and gives the following error:
>> 
>> Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,RegisterTaskManager(4fc8aceeae1e27e42b9f16df6c0cf5e3,4fc8aceeae1e27e42b9f16df6c0cf5e3 @ a118cdf39114 (dataPort=43017),cores=1, physMem=1044111360, heap=536870912, managed=324208384,1)) because the expected leader session ID 1f3b2ec6-77b6-4532-928f-ad8befd5202f did not equal the received leader session ID 00000000-0000-0000-0000-000000000000.
>> 
>> Seems like the ResourceManager was not able to retrieve the LeaderSessionID and passed 00 ID. 
>> 
>> One interesting thing I observed was a ZK version log:
>> 
>> The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
>> 
>> Is this a ZK version problem? Should I be using ZK 3.4.6?
>> 
>> My configuration:
>> 
>> Flink Version : 1.4.0
>> ZK version : 3.4.11 (I just pulled the latest image)
>> 
>> Thanks in advance. 
>> 
>> Chirag
>> 
>

Re: Deploying Flink with JobManager HA on Docker Swarm/Kubernetes

Posted by Chirag Dewan <ch...@yahoo.in>.

Thanks Aljoscha.
I haven't checked that bit. Is there any configuration for TaskManagers to find ZK?
Regards,
Chirag

Sent from Yahoo Mail on Android 
 
  On Wed, 14 Feb 2018 at 7:43 PM, Aljoscha Krettek<al...@apache.org> wrote:   Do you see in the logs whether the TaskManager correctly connect to ZooKeeper as well? They need this in order to find the JobManager leader.
Best,Aljoscha


On 14. Feb 2018, at 06:12, Chirag Dewan <ch...@yahoo.in> wrote:
Hi,
I am trying to deploy a Flink cluster (1 JM, 2TM) on a Docker Swarm. For JobManager HA, I have started a 3 node zookeeper service on the same swarm network and configured Flink's zookeeper quorum with zookeeper service instances. 
JobManager gets started with the LeaderElectionService and gets assigned a LeaderSessionID too, which I can see from the following log statements(attaching only related logs) :
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService   org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager was granted leadership with leader session ID Some(1f3b2ec6-77b6-4532-928f-ad8befd5202f).
 Trying to associate with JobManager leader akka.tcp://flink@jobmanager:6123/user/jobmanager Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#590681231] - leader session 1f3b2ec6-77b6-4532-928f-ad8befd5202f

But TaskManagers are not able to register with the JobManager and gives the following error:
Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,RegisterTaskManager(4fc8aceeae1e27e42b9f16df6c0cf5e3,4fc8aceeae1e27e42b9f16df6c0cf5e3 @ a118cdf39114 (dataPort=43017),cores=1, physMem=1044111360, heap=536870912, managed=324208384,1)) because the expected leader session ID 1f3b2ec6-77b6-4532-928f-ad8befd5202f did not equal the received leader session ID 00000000-0000-0000-0000-000000000000.

Seems like the ResourceManager was not able to retrieve the LeaderSessionID and passed 00 ID. 
One interesting thing I observed was a ZK version log:
The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.

Is this a ZK version problem? Should I be using ZK 3.4.6?
My configuration:
Flink Version : 1.4.0ZK version : 3.4.11 (I just pulled the latest image)
Thanks in advance. 
Chirag

Re: Deploying Flink with JobManager HA on Docker Swarm/Kubernetes

Posted by Aljoscha Krettek <al...@apache.org>.

Do you see in the logs whether the TaskManager correctly connect to ZooKeeper as well? They need this in order to find the JobManager leader.

Best,
Aljoscha

> On 14. Feb 2018, at 06:12, Chirag Dewan <ch...@yahoo.in> wrote:
> 
> Hi,
> 
> I am trying to deploy a Flink cluster (1 JM, 2TM) on a Docker Swarm. For JobManager HA, I have started a 3 node zookeeper service on the same swarm network and configured Flink's zookeeper quorum with zookeeper service instances. 
> 
> JobManager gets started with the LeaderElectionService and gets assigned a LeaderSessionID too, which I can see from the following log statements(attaching only related logs) :
> 
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService   org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
> JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager was granted leadership with leader session ID Some(1f3b2ec6-77b6-4532-928f-ad8befd5202f).
>  Trying to associate with JobManager leader akka.tcp://flink@jobmanager:6123/user/jobmanager
>  Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#590681231] - leader session 1f3b2ec6-77b6-4532-928f-ad8befd5202f
> 
> 
> But TaskManagers are not able to register with the JobManager and gives the following error:
> 
> Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,RegisterTaskManager(4fc8aceeae1e27e42b9f16df6c0cf5e3,4fc8aceeae1e27e42b9f16df6c0cf5e3 @ a118cdf39114 (dataPort=43017),cores=1, physMem=1044111360, heap=536870912, managed=324208384,1)) because the expected leader session ID 1f3b2ec6-77b6-4532-928f-ad8befd5202f did not equal the received leader session ID 00000000-0000-0000-0000-000000000000.
> 
> Seems like the ResourceManager was not able to retrieve the LeaderSessionID and passed 00 ID. 
> 
> One interesting thing I observed was a ZK version log:
> 
> The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
> 
> Is this a ZK version problem? Should I be using ZK 3.4.6?
> 
> My configuration:
> 
> Flink Version : 1.4.0
> ZK version : 3.4.11 (I just pulled the latest image)
> 
> Thanks in advance. 
> 
> Chirag
>