Posted to user@flink.apache.org by Olga Luganska <tr...@hotmail.com> on 2018/11/15 22:42:31 UTC

Standalone HA cluster: Fatal error occurred in the cluster entrypoint.

Hello,

I am running a Flink 1.6.1 standalone HA cluster. Today I am unable to start the cluster because of a "Fatal error in cluster entrypoint".
(I used to see this error when running Flink 1.5. After upgrading to 1.6.1, which had a fix for this bug, everything worked well for a while.)

Question: what exactly needs to be done to clean the "state handle store"?


2018-11-15 15:09:53,181 DEBUG org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - Fencing token not set: Ignoring message LocalFencedMessage(null, org.apache.flink.runtime.rpc.messages.RunAsync@21fd224c) because the fencing token is null.
2018-11-15 15:09:53,182 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error occurred in the cluster entrypoint.
java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /e13034f83a80072204facb2cec9ea6a3. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
        at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
        at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$1(FunctionUtils.java:61)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
        at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
        at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /e13034f83a80072204facb2cec9ea6a3. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
        at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:692)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:677)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:658)
        at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:817)
        at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$1(FunctionUtils.java:59)
        ... 9 more
Caused by: java.io.FileNotFoundException: /checkpoint_repo/ha/submittedJobGraphdd865937d674 (No such file or directory)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
        at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:142)
        at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68)
        at org.apache.flink.runtime.state.RetrievableStreamStateHandle.openInputStream(RetrievableStreamStateHandle.java:64)
        at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:57)
        at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:202)
        ... 14 more
2018-11-15 15:09:53,185 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Shutting down BLOB cache


thank you,

Olga

Re: Standalone HA cluster: Fatal error occurred in the cluster entrypoint.

Posted by Alexander Smirnov <al...@gmail.com>.

Hi all,

I am getting a similar exception while upgrading from Flink 1.4 to 1.6:

```
06 Feb 2019 14:37:34,080 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /689f43070c701826e19ac24841050ea1. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
    at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
    at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:74)
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
    at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /689f43070c701826e19ac24841050ea1. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
    at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
    at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
    at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
    at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
    at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
    at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
    ... 9 more
Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2
```

Is it safe to clean the ZooKeeper state as suggested in the logs? What kind
of information would I lose?

Thank you, Alexander
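
Editorial note: both traces in this thread carry the same top-level message, but the innermost "Caused by:" entry differs (a missing file in one case, a serialVersionUID mismatch in the other), and that entry is what hints at the actual problem. As a rough sketch only, the Python below pulls the innermost cause out of a pasted trace and tolerates lines that a mail client has hard-wrapped. The function names and the hint texts are hypothetical, are not part of Flink, and are based solely on the two traces posted here.

```python
import re
from typing import Tuple

# Hypothetical hints, derived only from the two traces in this thread.
HINTS = {
    "java.io.FileNotFoundException":
        "the state handle points at a job graph file that is missing "
        "from the HA storage directory",
    "java.io.InvalidClassException":
        "the stored job graph was probably serialized by an "
        "incompatible Flink version",
}

# Tail of a trace as it looks after hard wrapping by a mail client.
WRAPPED_SAMPLE = """\
Caused by: org.apache.flink.util.FlinkException: Could not retrieve
submitted JobGraph from state handle under
/689f43070c701826e19ac24841050ea1. This indicates that the retrieved state
handle is broken. Try cleaning the state handle store.
    at
org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
    ... 9 more
Caused by: java.io.InvalidClassException:
org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration;
local class incompatible: stream classdesc serialVersionUID =
-647384516034982626, local class serialVersionUID = 2
"""


def root_cause(trace: str) -> Tuple[str, str]:
    """Return (exception class, message) of the innermost 'Caused by:' entry."""
    marker = "Caused by:"
    start = trace.rfind(marker)
    if start == -1:
        raise ValueError("no 'Caused by:' entry found")
    tail = trace[start + len(marker):]
    # The entry ends where the first stack frame ("at ...") or an
    # elision line ("... N more") begins.
    frame = re.search(r"^\s*(at\b|\.\.\.)", tail, flags=re.MULTILINE)
    entry = tail[:frame.start()] if frame else tail
    entry = " ".join(entry.split())  # undo hard wrapping
    exc_class, _, message = entry.partition(": ")
    return exc_class, message


def hint(trace: str) -> str:
    """Map the innermost exception class to a hedged, human-readable hint."""
    exc_class, _ = root_cause(trace)
    return HINTS.get(exc_class, "unrecognized root cause, read the full trace")


if __name__ == "__main__":
    print(root_cause(WRAPPED_SAMPLE))
    print(hint(WRAPPED_SAMPLE))
```

A FileNotFoundException at the bottom suggests ZooKeeper still references a job graph file that no longer exists, while an InvalidClassException suggests the stored job graph cannot be read by the new Flink version. In both cases the entry in the state handle store is unusable as it stands.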


Re: Standalone HA cluster: Fatal error occurred in the cluster entrypoint.

Posted by Olga Luganska <tr...@hotmail.com>.

Hi, Miki

Thank you for the reply!

I have deleted the ZooKeeper data and was able to restart the cluster.

Olga


Re: Standalone HA cluster: Fatal error occurred in the cluster entrypoint.

Posted by miki haiat <mi...@gmail.com>.

I "solved" this issue by cleaning the ZooKeeper information and starting the
cluster again. All the checkpoint and job graph data will be erased, and
basically you will start a new cluster...

It happened to me a lot on 1.5.x.
On 1.6 things are running perfectly.
I'm not sure why this error is back again on 1.6.1?
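
Editorial note: "cleaning the ZooKeeper information" means removing the znodes that Flink's HA services wrote. As a sketch under stated assumptions, the Python helper below composes the znode path where the job graph entry named in the exception would live. It assumes the documented defaults for `high-availability.zookeeper.path.root` (`/flink`), `high-availability.cluster-id` (`/default`) and `high-availability.zookeeper.path.jobgraphs` (`/jobgraphs`). Your `flink-conf.yaml` may override any of these, and the helper name `job_graph_znode` is hypothetical.

```python
def job_graph_znode(job_id: str,
                    root: str = "/flink",          # assumed default
                    cluster_id: str = "/default",  # assumed default
                    jobgraphs: str = "/jobgraphs"  # assumed default
                    ) -> str:
    """Join the HA path settings and a job ID into one absolute znode path."""
    # Strip slashes so that "/flink" and "flink/" yield the same result.
    segments = [s.strip("/") for s in (root, cluster_id, jobgraphs, job_id)]
    return "/" + "/".join(s for s in segments if s)


if __name__ == "__main__":
    # Job ID taken from the exception message in the original post.
    print(job_graph_znode("/e13034f83a80072204facb2cec9ea6a3"))
```

With the cluster stopped, the resulting path can be inspected in `zkCli.sh` with `ls` or `get` before anything is removed, and deleted with `rmr` (ZooKeeper 3.4) or `deleteall` (3.5 and later). Deleting only the broken job graph node is narrower than wiping everything under the root, but either way the affected job is not recovered on startup and has to be resubmitted.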

