Posted to user@flink.apache.org by Micah Wylde <mw...@lyft.com> on 2018/12/11 04:19:57 UTC

After job cancel, leftover ZK state prevents job manager startup

Hello,

We've been seeing an issue with several Flink 1.5.4 clusters that looks
like this:

1. Job is cancelled with a savepoint
2. The jar is deleted from our HA blobstore (S3)
3. The jobgraph in ZK is *not* deleted
4. We restart the cluster
5. Startup fails in recovery because the jar is not available, with the
stacktrace:

00:13:58.486 ERROR o.a.f.r.e.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
    at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.Exception: Cannot set up the user code libraries: No such file or directory: s3://streamingplatform-production/{JOB_NAME}/flink/highavailability/{JOB_NAME}/blob/job_5a3fe2c00c05efd3a552a1c6707d2c10/blob_p-6d585831f5c947335ac505b400cf8f3630cc706a-42355c2885b668b0bc5e15b856141b0

This superficially seems similar to several issues that have apparently
been fixed in 1.5.4, like FLINK-10255 and FLINK-10184.
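
In case it is useful for anyone debugging the same thing: the leftover entry
can be confirmed by listing the children of the HA job graph path in
ZooKeeper. Below is a minimal Curator sketch; the quorum address, the /flink
root and the cluster id are placeholders, so adjust them to your
high-availability.* configuration:

import java.util.List;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ListLeftoverJobGraphs {
    public static void main(String[] args) throws Exception {
        // Connect to the same quorum the cluster uses
        // (high-availability.zookeeper.quorum); "zk-host:2181" is a placeholder.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Default layout is <path.root>/<cluster-id>/jobgraphs/<job-id>;
        // "/flink/my-cluster-id" stands in for the actual cluster id.
        String jobGraphPath = "/flink/my-cluster-id/jobgraphs";
        List<String> jobIds = client.getChildren().forPath(jobGraphPath);
        for (String jobId : jobIds) {
            // Every entry listed here will be recovered on JobManager startup,
            // even if the blobs it references are already gone from S3.
            System.out.println("leftover job graph: " + jobId);
        }
        client.close();
    }
}

Any job id still listed under that path will be picked up during recovery,
which matches the failure above.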

Has anybody else seen this issue on 1.5.4 (or later) clusters? Or any
advice for debugging?

Thanks,
Micah

Re: After job cancel, leftover ZK state prevents job manager startup

Posted by Micah Wylde <mw...@lyft.com>.
We have a single JobManager in an HA setup. From looking at logs and
metrics, it appears that before the issue occurred there was a long (15s)
GC pause on the JobManager, which caused its ZooKeeper connection to time
out and triggered a leadership election. Because there is only one
JobManager, the same one became leader again after it recovered.

In the logs, we see:

2018-11-15 19:28:10.039 INFO Client session timed out, have not heard from
server in 27384ms for sessionid 0x3655a5782843c03, closing socket
connection and attempting reconnect
2018-11-15 19:28:10.138 INFO State change: SUSPENDED
2018-11-15 19:28:10.141 WARN Connection to ZooKeeper suspended. Can no
longer retrieve the leader from ZooKeeper.
2018-11-15 19:28:10.144 INFO JobManager for job {JOB}. Interval: 300000 ms.
Lookback: 3600000 ms, KafkaOffsetMins: 0 min
(853c2f252ac2e44de63025ec36efefdc) was revoked leadership at
akka.tcp://flink@{job}-production-internal.lyft.net:6123/user/jobmanager_0.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. The
contender akka.tcp://flink@{job}-
production-internal.lyft.net:6123/user/dispatcher no longer participates in
the leader election.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. Can no
longer retrieve the leader from ZooKeeper.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. The
contender akka.tcp://flink@{job}-
production-internal.lyft.net:6123/user/jobmanager_0 no longer participates
in the leader election.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. The
contender http://localhost:18080 no longer participates in the leader
election.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. The
contender akka.tcp://flink@{job}-
production-internal.lyft.net:6123/user/resourcemanager no longer
participates in the leader election.
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. Can no
longer retrieve the leader from ZooKeeper.
2018-11-15 19:28:10.148 WARN ZooKeeper connection SUSPENDING. Changes to
the submitted job graphs are not monitored (temporarily).
2018-11-15 19:28:10.148 INFO http://localhost:18080 lost leadership
2018-11-15 19:28:10.148 WARN Connection to ZooKeeper suspended. Can no
longer retrieve the leader from ZooKeeper.
2018-11-15 19:28:10.440 WARN SASL configuration failed:
javax.security.auth.login.LoginException: No JAAS configuration section
named 'Client' was found in specified JAAS configuration file:
'/tmp/jaas-265845598028730621.conf'. Will continue connection to Zookeeper
server without SASL authentication, if Zookeeper server allows it.
2018-11-15 19:28:10.440 INFO Opening socket connection to server
10.0.39.156/10.0.39.156:2181
2018-11-15 19:28:10.440 ERROR Authentication failed
2018-11-15 19:28:10.441 INFO Socket connection established to
10.0.39.156/10.0.39.156:2181, initiating session
2018-11-15 19:28:10.459 INFO Session establishment complete on server
10.0.39.156/10.0.39.156:2181, sessionid = 0x3655a5782843c03, negotiated
timeout = 40000
2018-11-15 19:28:10.459 INFO State change: RECONNECTED
2018-11-15 19:28:10.460 INFO Connection to ZooKeeper was reconnected.
Leader retrieval can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected.
Leader election can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected.
Leader election can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected.
Leader election can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected.
Leader election can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected.
Leader retrieval can be restarted.
2018-11-15 19:28:10.461 INFO ZooKeeper connection RECONNECTED. Changes to
the submitted job graphs are monitored again.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected.
Leader retrieval can be restarted.
2018-11-15 19:28:10.461 INFO Connection to ZooKeeper was reconnected.
Leader retrieval can be restarted.
2018-11-15 19:28:10.470 INFO JobManager runner for {job}. Interval: 300000
ms. Lookback: 3600000 ms, KafkaOffsetMins: 0 min
(853c2f252ac2e44de63025ec36efefdc) was granted leadership with session id
3409e3b7-8824-4352-a44d-9dac4b101221 at akka.tcp://flink@{job}-
production-internal.lyft.net:6123/user/jobmanager_0.
2018-11-15 19:28:10.480 INFO http://localhost:18080 was granted leadership
with leaderSessionID=764d2013-5812-4721-be24-1c47a58c0314
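
For context, this revoke/re-grant sequence is roughly what one would expect
from a Curator-based leader latch reacting to the connection suspension.
This is only an illustrative sketch (plain Curator, not Flink's actual
leader election code; the quorum address and latch path are placeholders):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // SUSPENDED is reported as soon as heartbeats are missed (e.g. during
        // a long GC pause); RECONNECTED once the session is re-established.
        client.getConnectionStateListenable().addListener(
                (CuratorFramework c, ConnectionState newState) ->
                        System.out.println("State change: " + newState));

        // The latch drops and may re-acquire leadership around that
        // suspension, which corresponds to the "was revoked leadership" /
        // "was granted leadership" pair in the JobManager log above.
        LeaderLatch latch = new LeaderLatch(client, "/leader-election-sketch");
        latch.addListener(new LeaderLatchListener() {
            @Override
            public void isLeader() {
                System.out.println("granted leadership");
            }

            @Override
            public void notLeader() {
                System.out.println("revoked leadership");
            }
        });
        latch.start();

        Thread.sleep(Long.MAX_VALUE); // keep the process alive to observe events
    }
}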




On Tue, Dec 11, 2018 at 1:35 AM Till Rohrmann <tr...@apache.org> wrote:

> Hi Micah,
>
> the problem looks indeed similar to FLINK-10255. Could you tell me your
> exact cluster setup (HA with stand by JobManagers?). Moreover, the logs of
> all JobManagers on DEBUG level would be helpful for further debugging.
>
> Cheers,
> Till
>
> On Tue, Dec 11, 2018 at 10:09 AM Stefan Richter <
> s.richter@data-artisans.com> wrote:
>
>> Hi,
>>
>> Thanks for reporting the problem, I think the exception trace looks
>> indeed very similar to traces in the discussion for FLINK-10184. I will
>> pull in Till who worked on the fix to hear his opinion. Maybe the current
>> fix only made the problem less likely to appear but is not complete, yet?
>>
>> Best,
>> Stefan

Re: After job cancel, leftover ZK state prevents job manager startup

Posted by Till Rohrmann <tr...@apache.org>.
Hi Micah,

the problem indeed looks similar to FLINK-10255. Could you tell me your
exact cluster setup (HA with standby JobManagers)? Moreover, the logs of
all JobManagers on DEBUG level would be helpful for further debugging.
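
If you are running with the default conf/log4j.properties that ships with
the distribution, switching the root logger from INFO to DEBUG should be
enough (adjust accordingly if you use a custom logging configuration):

log4j.rootLogger=DEBUG, file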

Cheers,
Till

On Tue, Dec 11, 2018 at 10:09 AM Stefan Richter <s....@data-artisans.com>
wrote:

> Hi,
>
> Thanks for reporting the problem, I think the exception trace looks indeed
> very similar to traces in the discussion for FLINK-10184. I will pull in
> Till who worked on the fix to hear his opinion. Maybe the current fix only
> made the problem less likely to appear but is not complete, yet?
>
> Best,
> Stefan

Re: After job cancel, leftover ZK state prevents job manager startup

Posted by Stefan Richter <s....@data-artisans.com>.
Hi,

Thanks for reporting the problem. I think the exception trace indeed looks very similar to the traces in the discussion for FLINK-10184. I will pull in Till, who worked on the fix, to hear his opinion. Maybe the current fix only made the problem less likely to appear but is not yet complete?

Best,
Stefan
