Posted to user@flink.apache.org by seuzxc <xc...@qq.com> on 2019/11/28 05:17:33 UTC

Re: JobGraphs not cleaned up in HA mode

Hi, I have the same problem with Flink 1.9.1. Is there any solution to fix it?
When Kubernetes redeploys the jobmanager, the error looks like the following (it seems
ZooKeeper did not remove the submitted job info, but the jobmanager removed the file):


Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
	at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
	at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
	at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
	at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
	at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
	at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
	... 9 more
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494 file=/flink/checkpoints/submittedJobGraph480ddf9572ed
	at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052)
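One way to confirm the mismatch, assuming the standard ZooKeeper CLI and Flink's default jobgraphs sub-path (the exact znode root, cluster id and storage directory depend on your high-availability.* settings, so the paths below are only placeholders):

# list what ZooKeeper still references (zkCli.sh ships with ZooKeeper, not with Flink)
bin/zkCli.sh -server <zk-quorum>
ls /<high-availability.zookeeper.path.root>/<cluster-id>/jobgraphs

# list what is actually left in the HA storage directory; every job graph node in
# ZooKeeper should still have a matching submittedJobGraph* file somewhere down here
ls <high-availability.storageDir>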



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: JobGraphs not cleaned up in HA mode

Posted by 曾祥才 <xc...@qq.com>.
The chk-* directory is not found. I think it is missing because the jobmanager removes it automatically, but why is it still in ZooKeeper?






------------------ Original Message ------------------
From: "Vijay Bhaskar" <bhaskar.ebay77@gmail.com>
Sent: Thursday, November 28, 2019, 8:31 PM
To: "曾祥才" <xcz200706@qq.com>
Cc: "vino yang" <yanghua1127@gmail.com>; "User-Flink" <user@flink.apache.org>
Subject: Re: JobGraphs not cleaned up in HA mode



One more thing: you configured:
high-availability.cluster-id: /cluster-test

It should be:
high-availability.cluster-id: cluster-test

I don't think this is a major issue, but in case it helps, you can check.

Can you check one more thing:
Is checkpointing happening or not?
Were you able to see the chk-* folders under the checkpoint directory?
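With the filesystem state backend, a rough sketch of what you would expect under the checkpoint directory while checkpointing is working (job id and checkpoint numbers are placeholders):

/flink/checkpoints/<job-id>/chk-42/_metadata
/flink/checkpoints/<job-id>/chk-43/_metadata   # state.checkpoints.num-retained: 2 keeps the last two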


Regards
Bhaskar


On Thu, Nov 28, 2019 at 5:00 PM 曾祥才 <xcz200706@qq.com> wrote:

Hi,
Is there any difference (for me, using NAS is more convenient to test currently)?
From the docs it seems HDFS, S3, NFS, etc. will all be fine.






------------------ Original Message ------------------
From: "vino yang" <yanghua1127@gmail.com>
Sent: Thursday, November 28, 2019, 7:17 PM
To: "曾祥才" <xcz200706@qq.com>
Cc: "Vijay Bhaskar" <bhaskar.ebay77@gmail.com>; "User-Flink" <user@flink.apache.org>
Subject: Re: JobGraphs not cleaned up in HA mode



Hi,

Why do you not use HDFS directly?


Best,
Vino


曾祥才 <xcz200706@qq.com> wrote on Thursday, November 28, 2019, 6:48 PM:



Does anyone have the same problem? Please help, thanks.






------------------ Original Message ------------------
From: "曾祥才" <xcz200706@qq.com>
Sent: Thursday, November 28, 2019, 2:46 PM
To: "Vijay Bhaskar" <bhaskar.ebay77@gmail.com>
Cc: "User-Flink" <user@flink.apache.org>
Subject: Re: JobGraphs not cleaned up in HA mode



The config (/flink is the NAS directory):


jobmanager.rpc.address: flink-jobmanager
taskmanager.numberOfTaskSlots: 16
web.upload.dir: /flink/webUpload
blob.server.port: 6124
jobmanager.rpc.port: 6123
taskmanager.rpc.port: 6122
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
high-availability: zookeeper
high-availability.cluster-id: /cluster-test
high-availability.storageDir: /flink/ha
high-availability.zookeeper.quorum: ****:2181
high-availability.jobmanager.port: 6123
high-availability.zookeeper.path.root: /flink/risk-insight
high-availability.zookeeper.path.checkpoints: /flink/zk-checkpoints
state.backend: filesystem
state.checkpoints.dir: file:///flink/checkpoints
state.savepoints.dir: file:///flink/savepoints
state.checkpoints.num-retained: 2
jobmanager.execution.failover-strategy: region
jobmanager.archive.fs.dir: file:///flink/archive/history





------------------ Original Message ------------------
From: "Vijay Bhaskar" <bhaskar.ebay77@gmail.com>
Sent: Thursday, November 28, 2019, 3:12 PM
To: "曾祥才" <xcz200706@qq.com>
Cc: "User-Flink" <user@flink.apache.org>
Subject: Re: JobGraphs not cleaned up in HA mode



Can you share the flink configuration once?

Regards
Bhaskar


On Thu, Nov 28, 2019 at 12:09 PM 曾祥才 <xcz200706@qq.com> wrote:

If I clean the ZooKeeper data, it runs fine. But the next time the jobmanager fails and is redeployed, the error occurs again.








------------------ Original Message ------------------
From: "Vijay Bhaskar" <bhaskar.ebay77@gmail.com>
Sent: Thursday, November 28, 2019, 3:05 PM
To: "曾祥才" <xcz200706@qq.com>
Subject: Re: JobGraphs not cleaned up in HA mode



Again it could not find the state store file: "Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199". Check why it is unable to find it.
The better approach is: clean up the ZooKeeper state, check your configurations, correct them, and restart the cluster.
Otherwise it always picks up the corrupted state from ZooKeeper and will never restart.
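For example, cleaning just this cluster's HA state might look roughly like the following, assuming the standard ZooKeeper CLI and the ZooKeeper paths from your configuration (stop the cluster first; rmr is the ZooKeeper 3.4 command, newer CLIs call it deleteall):

bin/zkCli.sh -server <zk-quorum>
ls /flink/risk-insight/cluster-test    # inspect what is stored for this cluster id
rmr /flink/risk-insight/cluster-test   # drop the whole sub-tree (job graphs, leader info, and so on)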


Regards
Bhaskar


On Thu, Nov 28, 2019 at 11:51 AM 曾祥才 <xcz200706@qq.com> wrote:

I've made a mistake (the log before is from another cluster). The full exception log is:


INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     - Recovering all persisted jobs.
2019-11-28 02:33:12,726 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl  - Starting the SlotManager.
2019-11-28 02:33:12,743 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Released locks of job graph 639170a9d710bacfd113ca66b2aacefa from ZooKeeper.
2019-11-28 02:33:12,744 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error occurred in the cluster entrypoint.
org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take leadership with session id 607e1fe0-f1b4-4af0-af16-4137d779defb.
	at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
	at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:594)
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
	at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
	at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
	... 7 more
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
	at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:190)
	at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:755)
	at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:740)
	at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:721)
	at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$27(Dispatcher.java:888)
	at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73)
	... 9 more
Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)







------------------ Original Message ------------------
From: "Vijay Bhaskar" <bhaskar.ebay77@gmail.com>
Sent: Thursday, November 28, 2019, 2:46 PM
To: "曾祥才" <xcz200706@qq.com>
Cc: "User-Flink" <user@flink.apache.org>
Subject: Re: JobGraphs not cleaned up in HA mode



Is it filesystem or Hadoop? If it is NAS, then why the exception "Caused by: org.apache.hadoop.hdfs.BlockMissingException"? It seems you configured a Hadoop state store but are giving it a NAS mount.
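If the NAS mount is what you intend, it may help to make the filesystem scheme explicit everywhere, so there is no ambiguity about which filesystem a scheme-less path resolves against. A sketch of the HA storage entry with explicit schemes (the hdfs:/// line is only an illustration of a shared store, not something taken from this thread):

high-availability.storageDir: file:///flink/ha
# or, if you move to a genuinely shared distributed store:
# high-availability.storageDir: hdfs:///flink/ha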


Regards
Bhaskar




On Thu, Nov 28, 2019 at 11:36 AM 曾祥才 <xcz200706@qq.com> wrote:

/flink/checkpoints is an external persistent store (a NAS directory mounted into the job manager).








------------------ Original Message ------------------
From: "Vijay Bhaskar" <bhaskar.ebay77@gmail.com>
Sent: Thursday, November 28, 2019, 2:29 PM
To: "曾祥才" <xcz200706@qq.com>
Cc: "user" <user@flink.apache.org>
Subject: Re: JobGraphs not cleaned up in HA mode



The following are the mandatory conditions to run in HA:

a) You should have a persistent, common external store that the jobmanager and task managers write their state to.
b) You should have a persistent external store for ZooKeeper to store the JobGraph.

ZooKeeper is referring to the path /flink/checkpoints/submittedJobGraph480ddf9572ed to get the job graph, but the jobmanager is unable to find it.
It seems /flink/checkpoints is not an external persistent store.
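Concretely, the split between the two stores looks roughly like this (names are illustrative; ZooKeeper only keeps a small state handle, while the actual job graph bytes live in the HA storage directory):

ZooKeeper:   <path.root>/<cluster-id>/jobgraphs/<job-id>               -> state handle pointing at the file below
Filesystem:  <high-availability.storageDir>/submittedJobGraph<suffix>  -> the serialized JobGraph itself

So if the filesystem side is not shared and persistent across jobmanager restarts, recovery fails exactly as in the stack trace above, even though ZooKeeper itself is healthy.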




Regards
Bhaskar


On Thu, Nov 28, 2019 at 10:43 AM seuzxc <xcz200706@qq.com&gt; wrote:

hi ,I've the same problem with flink 1.9.1 , any solution to fix it
 when the k8s redoploy jobmanager ,&nbsp; the error looks like (seems zk not
 remove submitted job info, but jobmanager remove the file):&nbsp; 
 
 
 Caused by: org.apache.flink.util.FlinkException: Could not retrieve
 submitted JobGraph from state handle under
 /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state
 handle is broken. Try cleaning the state handle store.
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
 &nbsp; &nbsp; &nbsp; &nbsp; ... 9 more
 Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain
 block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494
 file=/flink/checkpoints/submittedJobGraph480ddf9572ed
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052)
 
 
 
 --
 Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: JobGraphs not cleaned up in HA mode

Posted by Vijay Bhaskar <bh...@gmail.com>.
One more thing:
You configured:
high-availability.cluster-id: /cluster-test
it should be:
high-availability.cluster-id: cluster-test
I don't think this is major issue, in case it helps, you can check.
Can you check one more thing:
Is check pointing happening or not?
Were you able to see the chk-* folder under checkpoint directory?

Regards
Bhaskar

On Thu, Nov 28, 2019 at 5:00 PM 曾祥才 <xc...@qq.com> wrote:

> hi,
> Is there any deference (for me using nas is more convenient to test
> currently)?
> from the docs seems hdfs ,s3, nfs etc all will be fine.
>
>
>
> ------------------ 原始邮件 ------------------
> *发件人:* "vino yang"<ya...@gmail.com>;
> *发送时间:* 2019年11月28日(星期四) 晚上7:17
> *收件人:* "曾祥才"<xc...@qq.com>;
> *抄送:* "Vijay Bhaskar"<bh...@gmail.com>;"User-Flink"<
> user@flink.apache.org>;
> *主题:* Re: JobGraphs not cleaned up in HA mode
>
> Hi,
>
> Why do you not use HDFS directly?
>
> Best,
> Vino
>
> 曾祥才 <xc...@qq.com> 于2019年11月28日周四 下午6:48写道:
>
>>
>> anyone have the same problem? pls help, thks
>>
>>
>>
>> ------------------ 原始邮件 ------------------
>> *发件人:* "曾祥才"<xc...@qq.com>;
>> *发送时间:* 2019年11月28日(星期四) 下午2:46
>> *收件人:* "Vijay Bhaskar"<bh...@gmail.com>;
>> *抄送:* "User-Flink"<us...@flink.apache.org>;
>> *主题:* 回复: JobGraphs not cleaned up in HA mode
>>
>> the config  (/flink is the NASdirectory ):
>>
>> jobmanager.rpc.address: flink-jobmanager
>> taskmanager.numberOfTaskSlots: 16
>> web.upload.dir: /flink/webUpload
>> blob.server.port: 6124
>> jobmanager.rpc.port: 6123
>> taskmanager.rpc.port: 6122
>> jobmanager.heap.size: 1024m
>> taskmanager.heap.size: 1024m
>> high-availability: zookeeper
>> high-availability.cluster-id: /cluster-test
>> high-availability.storageDir: /flink/ha
>> high-availability.zookeeper.quorum: ****:2181
>> high-availability.jobmanager.port: 6123
>> high-availability.zookeeper.path.root: /flink/risk-insight
>> high-availability.zookeeper.path.checkpoints: /flink/zk-checkpoints
>> state.backend: filesystem
>> state.checkpoints.dir: file:///flink/checkpoints
>> state.savepoints.dir: file:///flink/savepoints
>> state.checkpoints.num-retained: 2
>> jobmanager.execution.failover-strategy: region
>> jobmanager.archive.fs.dir: file:///flink/archive/history
>>
>>
>>
>> ------------------ 原始邮件 ------------------
>> *发件人:* "Vijay Bhaskar"<bh...@gmail.com>;
>> *发送时间:* 2019年11月28日(星期四) 下午3:12
>> *收件人:* "曾祥才"<xc...@qq.com>;
>> *抄送:* "User-Flink"<us...@flink.apache.org>;
>> *主题:* Re: JobGraphs not cleaned up in HA mode
>>
>> Can you share the flink configuration once?
>>
>> Regards
>> Bhaskar
>>
>> On Thu, Nov 28, 2019 at 12:09 PM 曾祥才 <xc...@qq.com> wrote:
>>
>>> if i clean the zookeeper data , it runs fine .  but next time when the
>>> jobmanager failed and redeploy the error occurs again
>>>
>>>
>>>
>>>
>>> ------------------ 原始邮件 ------------------
>>> *发件人:* "Vijay Bhaskar"<bh...@gmail.com>;
>>> *发送时间:* 2019年11月28日(星期四) 下午3:05
>>> *收件人:* "曾祥才"<xc...@qq.com>;
>>> *主题:* Re: JobGraphs not cleaned up in HA mode
>>>
>>> Again it could not find the state store file: "Caused by:
>>> java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 "
>>>  Check why its unable to find.
>>> Better thing is: Clean up zookeeper state and check your configurations,
>>> correct them and restart cluster.
>>> Otherwise it always picks up corrupted state from zookeeper and it will
>>> never restart
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Thu, Nov 28, 2019 at 11:51 AM 曾祥才 <xc...@qq.com> wrote:
>>>
>>>> i've made a misstake( the log before is another cluster) . the full
>>>> exception log is :
>>>>
>>>>
>>>> INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      -
>>>> Recovering all persisted jobs.
>>>> 2019-11-28 02:33:12,726 INFO
>>>> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl  -
>>>> Starting the SlotManager.
>>>> 2019-11-28 02:33:12,743 INFO
>>>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>>>> Released locks of job graph 639170a9d710bacfd113ca66b2aacefa from
>>>> ZooKeeper.
>>>> 2019-11-28 02:33:12,744 ERROR
>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error
>>>> occurred in the cluster entrypoint.
>>>> org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take
>>>> leadership with session id 607e1fe0-f1b4-4af0-af16-4137d779defb.
>>>> at
>>>> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915)
>>>>
>>>> at
>>>> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>>>>
>>>> at
>>>> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>>>>
>>>> at
>>>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>>>>
>>>> at
>>>> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
>>>> at
>>>> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:594)
>>>>
>>>> at
>>>> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>>>>
>>>> at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
>>>> at
>>>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
>>>>
>>>> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>> at
>>>> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>
>>>> at
>>>> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>> at
>>>> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>
>>>> Caused by: java.lang.RuntimeException:
>>>> org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph
>>>> from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates
>>>> that the retrieved state handle is broken. Try cleaning the state handle
>>>> store.
>>>> at
>>>> org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>>>> at
>>>> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75)
>>>>
>>>> at
>>>> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
>>>> at
>>>> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>>>>
>>>> ... 7 more
>>>> Caused by: org.apache.flink.util.FlinkException: Could not retrieve
>>>> submitted JobGraph from state handle under
>>>> /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state
>>>> handle is broken. Try cleaning the state handle store.
>>>> at
>>>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:190)
>>>>
>>>> at
>>>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:755)
>>>>
>>>> at
>>>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:740)
>>>>
>>>> at
>>>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:721)
>>>>
>>>> at
>>>> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$27(Dispatcher.java:888)
>>>>
>>>> at
>>>> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73)
>>>>
>>>> ... 9 more
>>>> Caused by: java.io.FileNotFoundException:
>>>> /flink/ha/submittedJobGraph0c6bcff01199 (No such file or directory)
>>>> at java.io.FileInputStream.open0(Native Method)
>>>> at java.io.FileInputStream.open(FileInputStream.java:195)
>>>>
>>>>
>>>>
>>>>
>>>> ------------------ 原始邮件 ------------------
>>>> *发件人:* "Vijay Bhaskar"<bh...@gmail.com>;
>>>> *发送时间:* 2019年11月28日(星期四) 下午2:46
>>>> *收件人:* "曾祥才"<xc...@qq.com>;
>>>> *抄送:* "User-Flink"<us...@flink.apache.org>;
>>>> *主题:* Re: JobGraphs not cleaned up in HA mode
>>>>
>>>> Is it filesystem or hadoop? If its NAS then why the exception "Caused
>>>> by: org.apache.hadoop.hdfs.BlockMissingException: "
>>>> It seems you configured hadoop state store and giving NAS mount.
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>>
>>>>
>>>> On Thu, Nov 28, 2019 at 11:36 AM 曾祥才 <xc...@qq.com> wrote:
>>>>
>>>>> /flink/checkpoints  is a external persistent store (a nas directory
>>>>> mounts to the job manager)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ------------------ 原始邮件 ------------------
>>>>> *发件人:* "Vijay Bhaskar"<bh...@gmail.com>;
>>>>> *发送时间:* 2019年11月28日(星期四) 下午2:29
>>>>> *收件人:* "曾祥才"<xc...@qq.com>;
>>>>> *抄送:* "user"<us...@flink.apache.org>;
>>>>> *主题:* Re: JobGraphs not cleaned up in HA mode
>>>>>
>>>>> Following are the mandatory condition to run in HA:
>>>>>
>>>>> a) You should have persistent common external store for jobmanager and
>>>>> task managers to while writing the state
>>>>> b) You should have persistent external store for zookeeper to store
>>>>> the Jobgraph.
>>>>>
>>>>> Zookeeper is referring  path:
>>>>> /flink/checkpoints/submittedJobGraph480ddf9572ed  to get the job graph but
>>>>> jobmanager unable to find it.
>>>>> It seems /flink/checkpoints  is not the external persistent store
>>>>>
>>>>>
>>>>> Regards
>>>>> Bhaskar
>>>>>
>>>>> On Thu, Nov 28, 2019 at 10:43 AM seuzxc <xc...@qq.com> wrote:
>>>>>
>>>>>> hi ,I've the same problem with flink 1.9.1 , any solution to fix it
>>>>>> when the k8s redoploy jobmanager ,  the error looks like (seems zk not
>>>>>> remove submitted job info, but jobmanager remove the file):
>>>>>>
>>>>>>
>>>>>> Caused by: org.apache.flink.util.FlinkException: Could not retrieve
>>>>>> submitted JobGraph from state handle under
>>>>>> /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved
>>>>>> state
>>>>>> handle is broken. Try cleaning the state handle store.
>>>>>>         at
>>>>>>
>>>>>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
>>>>>>         at
>>>>>>
>>>>>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
>>>>>>         at
>>>>>>
>>>>>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
>>>>>>         at
>>>>>>
>>>>>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
>>>>>>         at
>>>>>>
>>>>>> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
>>>>>>         at
>>>>>>
>>>>>> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
>>>>>>         ... 9 more
>>>>>> Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not
>>>>>> obtain
>>>>>> block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494
>>>>>> file=/flink/checkpoints/submittedJobGraph480ddf9572ed
>>>>>>         at
>>>>>>
>>>>>> org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052)
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sent from:
>>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>>>>>>
>>>>>

回复: JobGraphs not cleaned up in HA mode

Posted by 曾祥才 <xc...@qq.com>.
hi,
Is there any deference (for me using nas is more convenient to test currently)?&nbsp; &nbsp;
from the docs seems hdfs ,s3, nfs etc all will be fine.






------------------&nbsp;原始邮件&nbsp;------------------
发件人:&nbsp;"vino yang"<yanghua1127@gmail.com&gt;;
发送时间:&nbsp;2019年11月28日(星期四) 晚上7:17
收件人:&nbsp;"曾祥才"<xcz200706@qq.com&gt;;
抄送:&nbsp;"Vijay Bhaskar"<bhaskar.ebay77@gmail.com&gt;;"User-Flink"<user@flink.apache.org&gt;;
主题:&nbsp;Re: JobGraphs not cleaned up in HA mode



Hi,

Why do you not use HDFS directly?


Best,
Vino


曾祥才 <xcz200706@qq.com&gt; 于2019年11月28日周四 下午6:48写道:



anyone have the same problem? pls help, thks






------------------&nbsp;原始邮件&nbsp;------------------
发件人:&nbsp;"曾祥才"<xcz200706@qq.com&gt;;
发送时间:&nbsp;2019年11月28日(星期四) 下午2:46
收件人:&nbsp;"Vijay Bhaskar"<bhaskar.ebay77@gmail.com&gt;;
抄送:&nbsp;"User-Flink"<user@flink.apache.org&gt;;
主题:&nbsp;回复: JobGraphs not cleaned up in HA mode



the config&nbsp; (/flink is the NASdirectory ):&nbsp; 


jobmanager.rpc.address: flink-jobmanager
taskmanager.numberOfTaskSlots: 16
web.upload.dir: /flink/webUpload
blob.server.port: 6124
jobmanager.rpc.port: 6123
taskmanager.rpc.port: 6122
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
high-availability: zookeeper
high-availability.cluster-id: /cluster-test
high-availability.storageDir: /flink/ha
high-availability.zookeeper.quorum: ****:2181
high-availability.jobmanager.port: 6123
high-availability.zookeeper.path.root: /flink/risk-insight
high-availability.zookeeper.path.checkpoints: /flink/zk-checkpoints
state.backend: filesystem
state.checkpoints.dir: file:///flink/checkpoints
state.savepoints.dir: file:///flink/savepoints
state.checkpoints.num-retained: 2
jobmanager.execution.failover-strategy: region
jobmanager.archive.fs.dir: file:///flink/archive/history

&nbsp;




------------------ 原始邮件 ------------------
发件人:&nbsp;"Vijay Bhaskar"<bhaskar.ebay77@gmail.com&gt;;
发送时间:&nbsp;2019年11月28日(星期四) 下午3:12
收件人:&nbsp;"曾祥才"<xcz200706@qq.com&gt;;
抄送:&nbsp;"User-Flink"<user@flink.apache.org&gt;;
主题:&nbsp;Re: JobGraphs not cleaned up in HA mode



Can you share the flink configuration once?

Regards
Bhaskar


On Thu, Nov 28, 2019 at 12:09 PM 曾祥才 <xcz200706@qq.com&gt; wrote:

if i clean the zookeeper data , it runs fine .&nbsp; but next time when the jobmanager failed and redeploy the error occurs again








------------------ 原始邮件 ------------------
发件人:&nbsp;"Vijay Bhaskar"<bhaskar.ebay77@gmail.com&gt;;
发送时间:&nbsp;2019年11月28日(星期四) 下午3:05
收件人:&nbsp;"曾祥才"<xcz200706@qq.com&gt;;

主题:&nbsp;Re: JobGraphs not cleaned up in HA mode



Again it could not find the state store file: "Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 "&nbsp;Check why its unable to find. 
Better thing is: Clean up zookeeper state and check your configurations, correct them and restart cluster.
Otherwise it always picks up corrupted state from zookeeper and it will never restart


Regards
Bhaskar


On Thu, Nov 28, 2019 at 11:51 AM 曾祥才 <xcz200706@qq.com&gt; wrote:

i've made a misstake( the log before is another cluster) . the full exception log is :


INFO&nbsp; org.apache.flink.runtime.dispatcher.StandaloneDispatcher&nbsp; &nbsp; &nbsp; - Recovering all persisted jobs. 
2019-11-28 02:33:12,726 INFO&nbsp; org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl&nbsp; - Starting the SlotManager. 
2019-11-28 02:33:12,743 INFO&nbsp; org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore&nbsp; - Released locks of job graph 639170a9d710bacfd113ca66b2aacefa from ZooKeeper. 
2019-11-28 02:33:12,744 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;- Fatal error occurred in the cluster entrypoint. 
org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take leadership with session id 607e1fe0-f1b4-4af0-af16-4137d779defb. 
	at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915) 
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) 
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) 
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) 
	at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575) 
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:594) 
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) 
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) 
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44) 
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 
Caused by: java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store. 
	at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199) 
	at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75) 
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) 
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) 
	... 7 more 
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store. 
	at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:190) 
	at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:755) 
	at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:740) 
	at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:721) 
	at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$27(Dispatcher.java:888) 
	at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73) 
	... 9 more 
Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 (No such file or directory) 
	at java.io.FileInputStream.open0(Native Method) 
	at java.io.FileInputStream.open(FileInputStream.java:195) 



&nbsp;




------------------ 原始邮件 ------------------
发件人:&nbsp;"Vijay Bhaskar"<bhaskar.ebay77@gmail.com&gt;;
发送时间:&nbsp;2019年11月28日(星期四) 下午2:46
收件人:&nbsp;"曾祥才"<xcz200706@qq.com&gt;;
抄送:&nbsp;"User-Flink"<user@flink.apache.org&gt;;
主题:&nbsp;Re: JobGraphs not cleaned up in HA mode



Is it filesystem or hadoop? If its NAS then why the exception "Caused by: org.apache.hadoop.hdfs.BlockMissingException: "It seems you configured hadoop state store and giving NAS mount. 


Regards
Bhaskar


&nbsp;


On Thu, Nov 28, 2019 at 11:36 AM 曾祥才 <xcz200706@qq.com&gt; wrote:

/flink/checkpoints&nbsp; is a external persistent store (a nas directory mounts to the job manager)








------------------ 原始邮件 ------------------
发件人:&nbsp;"Vijay Bhaskar"<bhaskar.ebay77@gmail.com&gt;;
发送时间:&nbsp;2019年11月28日(星期四) 下午2:29
收件人:&nbsp;"曾祥才"<xcz200706@qq.com&gt;;
抄送:&nbsp;"user"<user@flink.apache.org&gt;;
主题:&nbsp;Re: JobGraphs not cleaned up in HA mode



Following are the mandatory condition to run in HA:

a) You should have persistent common external store for jobmanager and task managers to while writing the state
b) You should have persistent external store for zookeeper to store the Jobgraph.


Zookeeper is referring&nbsp; path: /flink/checkpoints/submittedJobGraph480ddf9572ed&nbsp; to get the job graph but jobmanager unable to find it.
It seems /flink/checkpoints&nbsp; is not the external persistent store




Regards
Bhaskar


On Thu, Nov 28, 2019 at 10:43 AM seuzxc <xcz200706@qq.com&gt; wrote:

hi ,I've the same problem with flink 1.9.1 , any solution to fix it
 when the k8s redoploy jobmanager ,&nbsp; the error looks like (seems zk not
 remove submitted job info, but jobmanager remove the file):&nbsp; 
 
 
 Caused by: org.apache.flink.util.FlinkException: Could not retrieve
 submitted JobGraph from state handle under
 /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state
 handle is broken. Try cleaning the state handle store.
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
 &nbsp; &nbsp; &nbsp; &nbsp; ... 9 more
 Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain
 block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494
 file=/flink/checkpoints/submittedJobGraph480ddf9572ed
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052)
 
 
 
 --
 Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: JobGraphs not cleaned up in HA mode

Posted by vino yang <ya...@gmail.com>.
Hi,

Why do you not use HDFS directly?

Best,
Vino

曾祥才 <xc...@qq.com> 于2019年11月28日周四 下午6:48写道:

>
> anyone have the same problem? pls help, thks
>
>
>
> ------------------ 原始邮件 ------------------
> *发件人:* "曾祥才"<xc...@qq.com>;
> *发送时间:* 2019年11月28日(星期四) 下午2:46
> *收件人:* "Vijay Bhaskar"<bh...@gmail.com>;
> *抄送:* "User-Flink"<us...@flink.apache.org>;
> *主题:* 回复: JobGraphs not cleaned up in HA mode
>
> the config  (/flink is the NASdirectory ):
>
> jobmanager.rpc.address: flink-jobmanager
> taskmanager.numberOfTaskSlots: 16
> web.upload.dir: /flink/webUpload
> blob.server.port: 6124
> jobmanager.rpc.port: 6123
> taskmanager.rpc.port: 6122
> jobmanager.heap.size: 1024m
> taskmanager.heap.size: 1024m
> high-availability: zookeeper
> high-availability.cluster-id: /cluster-test
> high-availability.storageDir: /flink/ha
> high-availability.zookeeper.quorum: ****:2181
> high-availability.jobmanager.port: 6123
> high-availability.zookeeper.path.root: /flink/risk-insight
> high-availability.zookeeper.path.checkpoints: /flink/zk-checkpoints
> state.backend: filesystem
> state.checkpoints.dir: file:///flink/checkpoints
> state.savepoints.dir: file:///flink/savepoints
> state.checkpoints.num-retained: 2
> jobmanager.execution.failover-strategy: region
> jobmanager.archive.fs.dir: file:///flink/archive/history
>
>
>
> ------------------ 原始邮件 ------------------
> *发件人:* "Vijay Bhaskar"<bh...@gmail.com>;
> *发送时间:* 2019年11月28日(星期四) 下午3:12
> *收件人:* "曾祥才"<xc...@qq.com>;
> *抄送:* "User-Flink"<us...@flink.apache.org>;
> *主题:* Re: JobGraphs not cleaned up in HA mode
>
> Can you share the flink configuration once?
>
> Regards
> Bhaskar
>
> On Thu, Nov 28, 2019 at 12:09 PM 曾祥才 <xc...@qq.com> wrote:
>
>> if i clean the zookeeper data , it runs fine .  but next time when the
>> jobmanager failed and redeploy the error occurs again
>>
>>
>>
>>
>> ------------------ 原始邮件 ------------------
>> *发件人:* "Vijay Bhaskar"<bh...@gmail.com>;
>> *发送时间:* 2019年11月28日(星期四) 下午3:05
>> *收件人:* "曾祥才"<xc...@qq.com>;
>> *主题:* Re: JobGraphs not cleaned up in HA mode
>>
>> Again it could not find the state store file: "Caused by:
>> java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 "
>>  Check why its unable to find.
>> Better thing is: Clean up zookeeper state and check your configurations,
>> correct them and restart cluster.
>> Otherwise it always picks up corrupted state from zookeeper and it will
>> never restart
>>
>> Regards
>> Bhaskar
>>
>> On Thu, Nov 28, 2019 at 11:51 AM 曾祥才 <xc...@qq.com> wrote:
>>
>>> i've made a misstake( the log before is another cluster) . the full
>>> exception log is :
>>>
>>>
>>> INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      -
>>> Recovering all persisted jobs.
>>> 2019-11-28 02:33:12,726 INFO
>>> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl  -
>>> Starting the SlotManager.
>>> 2019-11-28 02:33:12,743 INFO
>>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>>> Released locks of job graph 639170a9d710bacfd113ca66b2aacefa from
>>> ZooKeeper.
>>> 2019-11-28 02:33:12,744 ERROR
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error
>>> occurred in the cluster entrypoint.
>>> org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take
>>> leadership with session id 607e1fe0-f1b4-4af0-af16-4137d779defb.
>>> at
>>> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915)
>>>
>>> at
>>> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>>>
>>> at
>>> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>>>
>>> at
>>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>>>
>>> at
>>> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
>>> at
>>> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:594)
>>>
>>> at
>>> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>>>
>>> at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
>>> at
>>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
>>>
>>> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>> at
>>> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>
>>> at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>> at
>>> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>
>>> Caused by: java.lang.RuntimeException:
>>> org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph
>>> from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates
>>> that the retrieved state handle is broken. Try cleaning the state handle
>>> store.
>>> at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>>> at
>>> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75)
>>>
>>> at
>>> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
>>> at
>>> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>>>
>>> ... 7 more
>>> Caused by: org.apache.flink.util.FlinkException: Could not retrieve
>>> submitted JobGraph from state handle under
>>> /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state
>>> handle is broken. Try cleaning the state handle store.
>>> at
>>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:190)
>>>
>>> at
>>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:755)
>>>
>>> at
>>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:740)
>>>
>>> at
>>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:721)
>>>
>>> at
>>> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$27(Dispatcher.java:888)
>>>
>>> at
>>> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73)
>>>
>>> ... 9 more
>>> Caused by: java.io.FileNotFoundException:
>>> /flink/ha/submittedJobGraph0c6bcff01199 (No such file or directory)
>>> at java.io.FileInputStream.open0(Native Method)
>>> at java.io.FileInputStream.open(FileInputStream.java:195)
>>>
>>>
>>>
>>>
>>> ------------------ 原始邮件 ------------------
>>> *发件人:* "Vijay Bhaskar"<bh...@gmail.com>;
>>> *发送时间:* 2019年11月28日(星期四) 下午2:46
>>> *收件人:* "曾祥才"<xc...@qq.com>;
>>> *抄送:* "User-Flink"<us...@flink.apache.org>;
>>> *主题:* Re: JobGraphs not cleaned up in HA mode
>>>
>>> Is it filesystem or hadoop? If its NAS then why the exception "Caused
>>> by: org.apache.hadoop.hdfs.BlockMissingException: "
>>> It seems you configured hadoop state store and giving NAS mount.
>>>
>>> Regards
>>> Bhaskar
>>>
>>>
>>>
>>> On Thu, Nov 28, 2019 at 11:36 AM 曾祥才 <xc...@qq.com> wrote:
>>>
>>>> /flink/checkpoints  is a external persistent store (a nas directory
>>>> mounts to the job manager)
>>>>
>>>>
>>>>
>>>>
>>>> ------------------ 原始邮件 ------------------
>>>> *发件人:* "Vijay Bhaskar"<bh...@gmail.com>;
>>>> *发送时间:* 2019年11月28日(星期四) 下午2:29
>>>> *收件人:* "曾祥才"<xc...@qq.com>;
>>>> *抄送:* "user"<us...@flink.apache.org>;
>>>> *主题:* Re: JobGraphs not cleaned up in HA mode
>>>>
>>>> Following are the mandatory condition to run in HA:
>>>>
>>>> a) You should have persistent common external store for jobmanager and
>>>> task managers to while writing the state
>>>> b) You should have persistent external store for zookeeper to store the
>>>> Jobgraph.
>>>>
>>>> Zookeeper is referring  path:
>>>> /flink/checkpoints/submittedJobGraph480ddf9572ed  to get the job graph but
>>>> jobmanager unable to find it.
>>>> It seems /flink/checkpoints  is not the external persistent store
>>>>
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>> On Thu, Nov 28, 2019 at 10:43 AM seuzxc <xc...@qq.com> wrote:
>>>>
>>>>> hi ,I've the same problem with flink 1.9.1 , any solution to fix it
>>>>> when the k8s redoploy jobmanager ,  the error looks like (seems zk not
>>>>> remove submitted job info, but jobmanager remove the file):
>>>>>
>>>>>
>>>>> Caused by: org.apache.flink.util.FlinkException: Could not retrieve
>>>>> submitted JobGraph from state handle under
>>>>> /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved
>>>>> state
>>>>> handle is broken. Try cleaning the state handle store.
>>>>>         at
>>>>>
>>>>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
>>>>>         at
>>>>>
>>>>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
>>>>>         at
>>>>>
>>>>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
>>>>>         at
>>>>>
>>>>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
>>>>>         at
>>>>>
>>>>> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
>>>>>         at
>>>>>
>>>>> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
>>>>>         ... 9 more
>>>>> Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not
>>>>> obtain
>>>>> block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494
>>>>> file=/flink/checkpoints/submittedJobGraph480ddf9572ed
>>>>>         at
>>>>>
>>>>> org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052)
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sent from:
>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>>>>>
>>>>

回复: JobGraphs not cleaned up in HA mode

Posted by 曾祥才 <xc...@qq.com>.
anyone have the same problem? pls help, thks






------------------&nbsp;原始邮件&nbsp;------------------
发件人:&nbsp;"曾祥才"<xcz200706@qq.com&gt;;
发送时间:&nbsp;2019年11月28日(星期四) 下午2:46
收件人:&nbsp;"Vijay Bhaskar"<bhaskar.ebay77@gmail.com&gt;;
抄送:&nbsp;"User-Flink"<user@flink.apache.org&gt;;
主题:&nbsp;回复: JobGraphs not cleaned up in HA mode



the config&nbsp; (/flink is the NASdirectory ):&nbsp; 


jobmanager.rpc.address: flink-jobmanager
taskmanager.numberOfTaskSlots: 16
web.upload.dir: /flink/webUpload
blob.server.port: 6124
jobmanager.rpc.port: 6123
taskmanager.rpc.port: 6122
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
high-availability: zookeeper
high-availability.cluster-id: /cluster-test
high-availability.storageDir: /flink/ha
high-availability.zookeeper.quorum: ****:2181
high-availability.jobmanager.port: 6123
high-availability.zookeeper.path.root: /flink/risk-insight
high-availability.zookeeper.path.checkpoints: /flink/zk-checkpoints
state.backend: filesystem
state.checkpoints.dir: file:///flink/checkpoints
state.savepoints.dir: file:///flink/savepoints
state.checkpoints.num-retained: 2
jobmanager.execution.failover-strategy: region
jobmanager.archive.fs.dir: file:///flink/archive/history

&nbsp;




------------------ 原始邮件 ------------------
发件人:&nbsp;"Vijay Bhaskar"<bhaskar.ebay77@gmail.com&gt;;
发送时间:&nbsp;2019年11月28日(星期四) 下午3:12
收件人:&nbsp;"曾祥才"<xcz200706@qq.com&gt;;
抄送:&nbsp;"User-Flink"<user@flink.apache.org&gt;;
主题:&nbsp;Re: JobGraphs not cleaned up in HA mode



Can you share the flink configuration once?

Regards
Bhaskar


On Thu, Nov 28, 2019 at 12:09 PM 曾祥才 <xcz200706@qq.com&gt; wrote:

if i clean the zookeeper data , it runs fine .&nbsp; but next time when the jobmanager failed and redeploy the error occurs again








------------------ 原始邮件 ------------------
发件人:&nbsp;"Vijay Bhaskar"<bhaskar.ebay77@gmail.com&gt;;
发送时间:&nbsp;2019年11月28日(星期四) 下午3:05
收件人:&nbsp;"曾祥才"<xcz200706@qq.com&gt;;

主题:&nbsp;Re: JobGraphs not cleaned up in HA mode



Again it could not find the state store file: "Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 "&nbsp;Check why its unable to find. 
Better thing is: Clean up zookeeper state and check your configurations, correct them and restart cluster.
Otherwise it always picks up corrupted state from zookeeper and it will never restart


Regards
Bhaskar


On Thu, Nov 28, 2019 at 11:51 AM 曾祥才 <xcz200706@qq.com&gt; wrote:

i've made a misstake( the log before is another cluster) . the full exception log is :


INFO&nbsp; org.apache.flink.runtime.dispatcher.StandaloneDispatcher&nbsp; &nbsp; &nbsp; - Recovering all persisted jobs. 
2019-11-28 02:33:12,726 INFO&nbsp; org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl&nbsp; - Starting the SlotManager. 
2019-11-28 02:33:12,743 INFO&nbsp; org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore&nbsp; - Released locks of job graph 639170a9d710bacfd113ca66b2aacefa from ZooKeeper. 
2019-11-28 02:33:12,744 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;- Fatal error occurred in the cluster entrypoint. 
org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take leadership with session id 607e1fe0-f1b4-4af0-af16-4137d779defb. 
	at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915) 
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) 
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) 
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) 
	at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575) 
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:594) 
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) 
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) 
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44) 
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 
Caused by: java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store. 
	at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199) 
	at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75) 
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) 
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) 
	... 7 more 
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store. 
	at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:190) 
	at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:755) 
	at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:740) 
	at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:721) 
	at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$27(Dispatcher.java:888) 
	at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73) 
	... 9 more 
Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 (No such file or directory) 
	at java.io.FileInputStream.open0(Native Method) 
	at java.io.FileInputStream.open(FileInputStream.java:195) 



&nbsp;




------------------ 原始邮件 ------------------
发件人:&nbsp;"Vijay Bhaskar"<bhaskar.ebay77@gmail.com&gt;;
发送时间:&nbsp;2019年11月28日(星期四) 下午2:46
收件人:&nbsp;"曾祥才"<xcz200706@qq.com&gt;;
抄送:&nbsp;"User-Flink"<user@flink.apache.org&gt;;
主题:&nbsp;Re: JobGraphs not cleaned up in HA mode



Is it filesystem or hadoop? If its NAS then why the exception "Caused by: org.apache.hadoop.hdfs.BlockMissingException: "It seems you configured hadoop state store and giving NAS mount. 


Regards
Bhaskar


&nbsp;


On Thu, Nov 28, 2019 at 11:36 AM 曾祥才 <xcz200706@qq.com&gt; wrote:

/flink/checkpoints&nbsp; is a external persistent store (a nas directory mounts to the job manager)








------------------ 原始邮件 ------------------
发件人:&nbsp;"Vijay Bhaskar"<bhaskar.ebay77@gmail.com&gt;;
发送时间:&nbsp;2019年11月28日(星期四) 下午2:29
收件人:&nbsp;"曾祥才"<xcz200706@qq.com&gt;;
抄送:&nbsp;"user"<user@flink.apache.org&gt;;
主题:&nbsp;Re: JobGraphs not cleaned up in HA mode



Following are the mandatory condition to run in HA:

a) You should have persistent common external store for jobmanager and task managers to while writing the state
b) You should have persistent external store for zookeeper to store the Jobgraph.


Zookeeper is referring&nbsp; path: /flink/checkpoints/submittedJobGraph480ddf9572ed&nbsp; to get the job graph but jobmanager unable to find it.
It seems /flink/checkpoints&nbsp; is not the external persistent store




Regards
Bhaskar


On Thu, Nov 28, 2019 at 10:43 AM seuzxc <xcz200706@qq.com&gt; wrote:

hi ,I've the same problem with flink 1.9.1 , any solution to fix it
 when the k8s redoploy jobmanager ,&nbsp; the error looks like (seems zk not
 remove submitted job info, but jobmanager remove the file):&nbsp; 
 
 
 Caused by: org.apache.flink.util.FlinkException: Could not retrieve
 submitted JobGraph from state handle under
 /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state
 handle is broken. Try cleaning the state handle store.
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
 &nbsp; &nbsp; &nbsp; &nbsp; ... 9 more
 Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain
 block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494
 file=/flink/checkpoints/submittedJobGraph480ddf9572ed
 &nbsp; &nbsp; &nbsp; &nbsp; at
 org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052)
 
 
 
 --
 Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

回复: JobGraphs not cleaned up in HA mode

Posted by 曾祥才 <xc...@qq.com>.
the config&nbsp; (/flink is the NASdirectory ):&nbsp;&nbsp;


jobmanager.rpc.address: flink-jobmanager
taskmanager.numberOfTaskSlots: 16
web.upload.dir: /flink/webUpload
blob.server.port: 6124
jobmanager.rpc.port: 6123
taskmanager.rpc.port: 6122
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
high-availability: zookeeper
high-availability.cluster-id: /cluster-test
high-availability.storageDir: /flink/ha
high-availability.zookeeper.quorum: ****:2181
high-availability.jobmanager.port: 6123
high-availability.zookeeper.path.root: /flink/risk-insight
high-availability.zookeeper.path.checkpoints: /flink/zk-checkpoints
state.backend: filesystem
state.checkpoints.dir: file:///flink/checkpoints
state.savepoints.dir: file:///flink/savepoints
state.checkpoints.num-retained: 2
jobmanager.execution.failover-strategy: region
jobmanager.archive.fs.dir: file:///flink/archive/history

&nbsp;




------------------&nbsp;原始邮件&nbsp;------------------
发件人:&nbsp;"Vijay Bhaskar"<bhaskar.ebay77@gmail.com&gt;;
发送时间:&nbsp;2019年11月28日(星期四) 下午3:12
收件人:&nbsp;"曾祥才"<xcz200706@qq.com&gt;;
抄送:&nbsp;"User-Flink"<user@flink.apache.org&gt;;
主题:&nbsp;Re: JobGraphs not cleaned up in HA mode



Can you share the flink configuration&nbsp;once?

Regards
Bhaskar


On Thu, Nov 28, 2019 at 12:09 PM 曾祥才 <xcz200706@qq.com&gt; wrote:

if i clean the zookeeper data , it runs fine .&nbsp; but next time when the jobmanager failed and redeploy the error occurs again








------------------&nbsp;原始邮件&nbsp;------------------
发件人:&nbsp;"Vijay Bhaskar"<bhaskar.ebay77@gmail.com&gt;;
发送时间:&nbsp;2019年11月28日(星期四) 下午3:05
收件人:&nbsp;"曾祥才"<xcz200706@qq.com&gt;;

主题:&nbsp;Re: JobGraphs not cleaned up in HA mode



Again it could not find the state store file: "Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199&nbsp;"&nbsp;Check why its unable to find.&nbsp;
Better thing is: Clean up zookeeper state and check your configurations, correct them and restart cluster.
Otherwise it always&nbsp;picks up corrupted state from zookeeper and it will never restart


Regards
Bhaskar


On Thu, Nov 28, 2019 at 11:51 AM 曾祥才 <xcz200706@qq.com&gt; wrote:

i've made a misstake( the log before is another cluster) . the full exception log is :


INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs.
2019-11-28 02:33:12,726 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl  - Starting the SlotManager.
2019-11-28 02:33:12,743 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Released locks of job graph 639170a9d710bacfd113ca66b2aacefa from ZooKeeper.
2019-11-28 02:33:12,744 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error occurred in the cluster entrypoint.
org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take leadership with session id 607e1fe0-f1b4-4af0-af16-4137d779defb.
	at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
	at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:594)
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
	at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
	at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
	... 7 more
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
	at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:190)
	at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:755)
	at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:740)
	at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:721)
	at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$27(Dispatcher.java:888)
	at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73)
	... 9 more
Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)







------------------ Original Message ------------------
From: "Vijay Bhaskar"<bhaskar.ebay77@gmail.com>;
Sent: Thursday, Nov 28, 2019, 2:46 PM
To: "曾祥才"<xcz200706@qq.com>;
Cc: "User-Flink"<user@flink.apache.org>;
Subject: Re: JobGraphs not cleaned up in HA mode



Is it a filesystem or Hadoop? If it's NAS, then why the exception "Caused by: org.apache.hadoop.hdfs.BlockMissingException:"? It seems you configured a Hadoop state store but are pointing it at a NAS mount.


Regards
Bhaskar




On Thu, Nov 28, 2019 at 11:36 AM 曾祥才 <xcz200706@qq.com> wrote:

/flink/checkpoints is an external persistent store (a NAS directory mounted on the job manager)








------------------ Original Message ------------------
From: "Vijay Bhaskar"<bhaskar.ebay77@gmail.com>;
Sent: Thursday, Nov 28, 2019, 2:29 PM
To: "曾祥才"<xcz200706@qq.com>;
Cc: "user"<user@flink.apache.org>;
Subject: Re: JobGraphs not cleaned up in HA mode



The following are the mandatory conditions to run in HA:

a) You should have a persistent, common external store to which the jobmanager and task managers write the state.
b) You should have a persistent external store for ZooKeeper to store the JobGraph.


ZooKeeper is referring to the path /flink/checkpoints/submittedJobGraph480ddf9572ed to get the job graph, but the jobmanager is unable to find it.
It seems /flink/checkpoints is not an external persistent store.




Regards
Bhaskar


On Thu, Nov 28, 2019 at 10:43 AM seuzxc <xcz200706@qq.com> wrote:

Hi, I have the same problem with Flink 1.9.1. Is there any solution to fix it?
 When k8s redeploys the jobmanager, the error looks like the following (it seems ZooKeeper does not
 remove the submitted job info, but the jobmanager removes the file):


 Caused by: org.apache.flink.util.FlinkException: Could not retrieve
 submitted JobGraph from state handle under
 /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state
 handle is broken. Try cleaning the state handle store.
         at
 org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
         at
 org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
         at
 org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
         at
 org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
         at
 org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
         at
 org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
         ... 9 more
 Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain
 block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494
 file=/flink/checkpoints/submittedJobGraph480ddf9572ed
         at
 org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052)



 --
 Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: JobGraphs not cleaned up in HA mode

Posted by Vijay Bhaskar <bh...@gmail.com>.
Can you share the flink configuration once?

Regards
Bhaskar
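If it helps, the HA- and state-related keys are usually the interesting part of the configuration to share; a quick way to pull just those out (the file location is a placeholder for wherever flink-conf.yaml lives in the image):

    # Show only the HA and state/checkpoint settings from the Flink configuration.
    grep -E 'high-availability|state\.' /opt/flink/conf/flink-conf.yaml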


Re: JobGraphs not cleaned up in HA mode

Posted by 曾祥才 <xc...@qq.com>.
If I clean the ZooKeeper data, it runs fine, but the next time the jobmanager fails and is redeployed the error occurs again.









Re: JobGraphs not cleaned up in HA mode

Posted by Vijay Bhaskar <bh...@gmail.com>.
Is it a filesystem or Hadoop? If it's NAS, then why the exception "Caused by:
org.apache.hadoop.hdfs.BlockMissingException:"?
It seems you configured a Hadoop state store but are pointing it at a NAS mount.

Regards
Bhaskar
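The scheme on the configured path decides which filesystem client Flink uses, which is one way to reconcile an HDFS exception with a NAS setup. A sketch of the difference, reusing the path from the exception purely as an illustration:

    # Resolved through the HDFS client, consistent with the BlockMissingException above:
    high-availability.storageDir: hdfs:///flink/checkpoints
    # Resolved through the local filesystem, which is what a NAS directory mounted on every node needs:
    high-availability.storageDir: file:///flink/checkpoints
    # With no scheme at all, Flink falls back to its configured default scheme (fs.default-scheme),
    # so it is worth checking which filesystem that actually points to in this setup.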




Re: JobGraphs not cleaned up in HA mode

Posted by 曾祥才 <xc...@qq.com>.
/flink/checkpoints is an external persistent store (a NAS directory mounted on the job manager)
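One quick way to verify that is to look for the submittedJobGraph file named in the exceptions earlier in the thread from inside the jobmanager pod; a sketch, with the pod name and directory names as placeholders:

    # Does the jobmanager container actually see the NAS-backed directories?
    kubectl exec flink-jobmanager-0 -- ls -l /flink/ha /flink/checkpoints
    # If the submittedJobGraph* file referenced by the exception is missing here while the
    # corresponding znode still exists in ZooKeeper, the two stores are out of sync.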









Re: JobGraphs not cleaned up in HA mode

Posted by Vijay Bhaskar <bh...@gmail.com>.
The following are the mandatory conditions to run in HA:

a) You should have a persistent, common external store to which the jobmanager and
task managers write the state.
b) You should have a persistent external store for ZooKeeper to store the
JobGraph.

ZooKeeper is referring to the path
/flink/checkpoints/submittedJobGraph480ddf9572ed to get the job graph, but the
jobmanager is unable to find it.
It seems /flink/checkpoints is not an external persistent store.


Regards
Bhaskar
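For completeness, a minimal sketch of what (a) and (b) usually translate to in flink-conf.yaml, assuming a ZooKeeper quorum and a directory that every jobmanager and taskmanager can reach under the same path; the hostnames, cluster id and paths below are placeholders, not this cluster's actual settings:

    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
    high-availability.zookeeper.path.root: /flink
    high-availability.cluster-id: my-cluster
    # Must be a shared, persistent location (hdfs://, s3://, or an explicitly mounted file:// path)
    # that survives pod restarts; this is where the serialized JobGraphs and checkpoint
    # metadata referenced from ZooKeeper actually live.
    high-availability.storageDir: file:///flink/ha
    state.backend: filesystem
    state.checkpoints.dir: file:///flink/checkpoints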
