Posted to user@flink.apache.org by Helmut Zechmann <he...@adeven.com> on 2018/08/17 15:08:35 UTC

Flink Jobmanager Failover in HA mode

Hi all,

we have a problem with Flink 1.5.2 high availability in standalone mode.

We have two JobManagers running. When I shut down the main JobManager, the standby JobManager hits an error during failover.

Logs:


2018-08-17 14:38:16,478 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@seg-1.adjust.com:29095] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2018-08-17 14:38:31,449 WARN  akka.remote.transport.netty.NettyTransport                    - Remote connection to [null] failed with java.net.ConnectException: Connection refused: seg-1.adjust.com/178.162.219.66:29095
2018-08-17 14:38:31,451 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@seg-1.adjust.com:29095] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@seg-1.adjust.com:29095]] Caused by: [Connection refused: seg-1.adjust.com/178.162.219.66:29095]
2018-08-17 14:38:41,379 ERROR org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler  - Could not retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@seg-1.adjust.com:29095/user/dispatcher#-1599908403]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
	[... shortened ...]
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@seg-1.adjust.com:29095/user/dispatcher#-1599908403]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
	at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
	... 9 more
2018-08-17 14:38:48,005 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - http://seg-2.adjust.com:8083 was granted leadership with leaderSessionID=708d1a64-c353-448b-9101-7eb3f910970e
2018-08-17 14:38:48,005 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - ResourceManager akka.tcp://flink@seg-2.adjust.com:30169/user/resourcemanager was granted leadership with fencing token 8de829de14876a367a80d37194b944ee
2018-08-17 14:38:48,006 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Starting the SlotManager.
2018-08-17 14:38:48,007 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher akka.tcp://flink@seg-2.adjust.com:30169/user/dispatcher was granted leadership with fencing token 684f50f8-327c-47e1-a53c-931c4f4ea3e5
2018-08-17 14:38:48,007 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs.
2018-08-17 14:38:48,021 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(b951bbf518bcf6cc031be6d2ccc441bb, null).
2018-08-17 14:38:48,028 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(06ed64f48fa0a7cffde53b99cbaa073f, null).
2018-08-17 14:38:48,035 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error occurred in the cluster entrypoint.
java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
	at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
	[... shortened ...]
Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
	at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176)
	at org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:936)
	at org.apache.flink.runtime.dispatcher.Dispatcher.createJobManagerRunner(Dispatcher.java:291)
	at org.apache.flink.runtime.dispatcher.Dispatcher.runJob(Dispatcher.java:281)
	at org.apache.flink.util.function.ConsumerWithException.accept(ConsumerWithException.java:38)
	... 21 more
Caused by: java.lang.Exception: Cannot set up the user code libraries: /var/lib/flink/ceph/prod/1.5-batch/ha_state/1.5-batch/blob/job_b951bbf518bcf6cc031be6d2ccc441bb/blob_p-a26f62e3bbdcd8884dd18c42a3f6f202b9d2c6e7-0dc87a56862a1f799d515306ffeddfcb (No such file or directory)
	at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:134)
	... 25 more
Caused by: java.io.FileNotFoundException: /var/lib/flink/ceph/prod/1.5-batch/ha_state/1.5-batch/blob/job_b951bbf518bcf6cc031be6d2ccc441bb/blob_p-a26f62e3bbdcd8884dd18c42a3f6f202b9d2c6e7-0dc87a56862a1f799d515306ffeddfcb (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	[... shortened ...]
	... 25 more
2018-08-17 14:38:48,036 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Shutting down BLOB cache
2018-08-17 14:38:48,038 INFO  org.apache.flink.runtime.blob.BlobServer                      - Stopped BLOB server at 0.0.0.0:27073
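
To see which recovered job the missing file belongs to, the log lines above can be correlated with a short script. This is only a sketch, assuming the log format stays as quoted here; the function name and the regular expressions are illustrative and not part of Flink. Since the cluster entrypoint exits on the first fatal error, a job with no missing blob reported here may still be affected.

```python
import re

# Patterns derived from the log lines quoted above (assumed stable format).
RECOVERED = re.compile(r"Recovered SubmittedJobGraph\(([0-9a-f]{32}), ")
MISSING = re.compile(r"FileNotFoundException: (\S+) \(No such file or directory\)")
JOB_DIR = re.compile(r"/job_([0-9a-f]{32})/")


def missing_blobs_by_job(log_text):
    """Map each recovered job ID to the blob paths reported missing for it."""
    result = {job_id: [] for job_id in RECOVERED.findall(log_text)}
    for path in MISSING.findall(log_text):
        match = JOB_DIR.search(path)
        if match is None:
            continue
        job_id = match.group(1)
        # Only track jobs that were recovered; skip duplicate paths.
        if job_id in result and path not in result[job_id]:
            result[job_id].append(path)
    return result


SAMPLE_LOG = (
    "2018-08-17 14:38:48,021 INFO  org.apache.flink.runtime.jobmanager."
    "ZooKeeperSubmittedJobGraphStore  - Recovered "
    "SubmittedJobGraph(b951bbf518bcf6cc031be6d2ccc441bb, null).\n"
    "2018-08-17 14:38:48,028 INFO  org.apache.flink.runtime.jobmanager."
    "ZooKeeperSubmittedJobGraphStore  - Recovered "
    "SubmittedJobGraph(06ed64f48fa0a7cffde53b99cbaa073f, null).\n"
    "Caused by: java.io.FileNotFoundException: /var/lib/flink/ceph/prod/"
    "1.5-batch/ha_state/1.5-batch/blob/job_b951bbf518bcf6cc031be6d2ccc441bb/"
    "blob_p-a26f62e3bbdcd8884dd18c42a3f6f202b9d2c6e7-"
    "0dc87a56862a1f799d515306ffeddfcb (No such file or directory)\n"
)

for job_id, paths in missing_blobs_by_job(SAMPLE_LOG).items():
    print(job_id, "missing blobs:", len(paths))
```

For the excerpt above this reports one missing blob for job b951bbf518bcf6cc031be6d2ccc441bb and none for 06ed64f48fa0a7cffde53b99cbaa073f.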


Our HA config is:


high-availability: zookeeper
high-availability.cluster-id: 1.5-batch
high-availability.storageDir: file:///var/lib/flink/ceph//prod/1.5-batch/ha_state
high-availability.zookeeper.path.root: /1.5-batch
high-availability.zookeeper.quorum: kafka-4:2181,kafka-5:2181,kafka-6:2181
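
The path in the stack trace looks like <storageDir>/<cluster-id>/blob/job_<job id>/<blob key>. Below is a minimal sketch, assuming that layout (inferred only from the log above, not from the Flink documentation), that derives the blob directory from the config so it can be checked on each JobManager host. The helper names are made up for illustration, and whether a missing or unshared mount is actually the cause here is not established.

```python
import os
from pathlib import PurePosixPath
from urllib.parse import urlparse


def parse_flink_conf(text):
    """Parse simple 'key: value' lines; comments and blank lines are skipped."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(": ")
        if sep:
            conf[key.strip()] = value.strip()
    return conf


def expected_blob_dir(conf):
    """Build <storageDir>/<cluster-id>/blob (layout assumed from the stack trace)."""
    uri = urlparse(conf["high-availability.storageDir"])
    if uri.scheme != "file":
        raise ValueError("this sketch only handles file:// storage directories")
    # PurePosixPath collapses the doubled slash in 'ceph//prod'.
    return PurePosixPath(uri.path) / conf["high-availability.cluster-id"] / "blob"


CONF = """\
high-availability: zookeeper
high-availability.cluster-id: 1.5-batch
high-availability.storageDir: file:///var/lib/flink/ceph//prod/1.5-batch/ha_state
high-availability.zookeeper.path.root: /1.5-batch
high-availability.zookeeper.quorum: kafka-4:2181,kafka-5:2181,kafka-6:2181
"""

blob_dir = expected_blob_dir(parse_flink_conf(CONF))
# Run this on every JobManager host; the result should be the same everywhere.
print(blob_dir, "exists:", os.path.isdir(blob_dir))
```

With a file:// storage directory, both JobManagers need to see the same files under this path, so a differing result between the two hosts would be worth investigating first.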


Any ideas what might be the problem here?


Best,

Helmut

Re: Flink Jobmanager Failover in HA mode

Posted by Helmut Zechmann <he...@adeven.com>.

Hi Dominik,

all jobs on the cluster (batch-only jobs without state) were in status
FINISHED.

Best,

Helmut

On Fri, Aug 17, 2018 at 8:04 PM Dominik Wosiński <wo...@gmail.com> wrote:

> I have faced this issue, but in 1.4.0 IIRC. This seems to be related to
> https://issues.apache.org/jira/browse/FLINK-10011. What was the status of
> the jobs when the main JobManager was stopped?

Re: Flink Jobmanager Failover in HA mode

Posted by Dominik Wosiński <wo...@gmail.com>.

I have faced this issue, but in 1.4.0 IIRC. This seems to be related to
https://issues.apache.org/jira/browse/FLINK-10011. What was the status of
the jobs when the main JobManager was stopped?
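
One way to record the job status before stopping the leader is to save the output of the REST endpoint /jobs/overview and group the jobs by state. The sketch below assumes the response is a JSON object with a "jobs" array whose entries carry "jid" and "state" fields; please verify this against the REST API documentation for your Flink version. The sample payload is illustrative only, not real output from this cluster.

```python
import json
from collections import defaultdict


def jobs_by_state(overview_json):
    """Group job IDs by their reported state (field names are assumptions)."""
    overview = json.loads(overview_json)
    grouped = defaultdict(list)
    for job in overview.get("jobs", []):
        grouped[job["state"]].append(job["jid"])
    return dict(grouped)


# Illustrative payload only; the job IDs are taken from the log in this thread.
SAMPLE_OVERVIEW = json.dumps({
    "jobs": [
        {"jid": "b951bbf518bcf6cc031be6d2ccc441bb", "state": "FINISHED"},
        {"jid": "06ed64f48fa0a7cffde53b99cbaa073f", "state": "FINISHED"},
    ]
})

for state, job_ids in jobs_by_state(SAMPLE_OVERVIEW).items():
    print(state, job_ids)
```

If jobs that are already FINISHED still show up among the recovered job graphs after failover, that would fit the situation described in the linked issue.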
