Posted to user@flink.apache.org by Javier Vegas <jv...@strava.com> on 2022/10/26 01:27:26 UTC
"Exec Failure java.io.EOFException null" message before taskmanagers crash

I have a Flink 1.15 app running in Kubernetes (v1.22) deployed via operator
1.2, using S3-based HA with 2 jobmanagers and 2 taskmanagers.

The app consumes a high-traffic Kafka topic and writes to a Cassandra
database. It had been running fine for 4 days, but at some point
the taskmanagers crashed. Looking through my logs, the oldest messages I
see are "Exec Failure java.io.EOFException null" from both taskmanagers at
exactly the same time, but there is no associated stack trace. After that,
the taskmanagers try to restart, but I see another "Exec Failure
java.io.EOFException null" message from one jobmanager and shortly after
the jobmanager sets the newly started taskmanagers to a failed state with
the message below. This repeats a couple more times until finally no more
taskmanagers try to come up, and the jobmanagers sit there throwing
RecipientUnreachableExceptions because there are no more
taskmanagers around.

Any idea what that "Exec Failure java.io.EOFException null" message is
about, or what I can do to debug it if it happens again?

Thanks,

Javier Vegas

Message from jobmanager:

Source: event-activity (2/4)#0 (090ef433d97011f8f595885a9bb39a28) switched from RUNNING to FAILED with failure cause:
org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for b7c72bb5b7570a7c981f761afa1b7ea6.
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1679)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$closeJob$18(TaskExecutor.java:1660)
    at java.base/java.util.Optional.ifPresent(Unknown Source)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJob(TaskExecutor.java:1658)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.onStop(TaskExecutor.java:462)
    at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.lambda$terminate$0(AkkaRpcActor.java:568)
    at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:567)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:191)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at akka.actor.Actor.aroundReceive(Actor.scala:537)
    at akka.actor.Actor.aroundReceive$(Actor.scala:535)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
    at akka.actor.ActorCell.invoke(ActorCell.scala:548)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
    at akka.dispatch.Mailbox.run(Mailbox.scala:231)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
Caused by: org.apache.flink.util.FlinkException: The TaskExecutor is shutting down.
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.onStop(TaskExecutor.java:456)
    ... 25 more