Posted to user-zh@flink.apache.org by Han Xiao <xi...@chinaunicom.cn> on 2019/03/25 11:25:59 UTC

Flink HA mode: process hangs!!!

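Hello everyone, I am new to Flink and ran into a problem while deploying Flink HA; I would appreciate your help.
After starting Flink HA, the JobManager process hangs immediately. I am using Flink 1.7.2. The log below contains this error: File does not exist: /flink/ha/zookeeper/submittedJobGraphb05001535f91. What puzzles me is that neither my checkpoint directory nor my HA directory is this path, so why does Flink look there? No JobGraph was generated under the directories I configured. Flink keeps trying to retrieve the job graph /a5ffe00b0bc5688d9a7de5c62b8150e6 and cannot find it. I deleted all the related configured paths and rebuilt the cluster, but on startup it still tries to retrieve it. How can I stop Flink from retrieving this JobGraph so that my HA cluster runs healthily?


Error log: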
2019-03-25 18:55:00,742 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error occurred in the cluster entrypoint.
java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /a5ffe00b0bc5688d9a7de5c62b8150e6. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
        at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
        at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:74)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
.......
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /a5ffe00b0bc5688d9a7de5c62b8150e6. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
        at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
........
Caused by: java.io.FileNotFoundException: File does not exist: /flink/ha/zookeeper/submittedJobGraphb05001535f91
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2100)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2070)
.......
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /flink/ha/zookeeper/submittedJobGraphb05001535f91
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2100)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2070)
.......

Thanks!

Re: Re: Flink HA mode: process hangs!!!

Posted by Han Xiao <xi...@chinaunicom.cn>.
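This problem was resolved this morning. It was caused by a leftover failed jobGraph in ZooKeeper; deleting it restored the cluster.
Thank you very much. I hope to keep learning from you in the future.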


Thank you for your reply!


Re: Re: Flink HA mode: process hangs!!!

Posted by Zili Chen <wa...@gmail.com>.
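If you did not clean up the earlier ZooKeeper data, it may be that you previously configured high-availability.storageDir as
/flink/ha/zookeeper and later cleaned up HDFS, while ZooKeeper still holds the stale handle information.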

Best,
tison.
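
To illustrate the explanation above: with ZooKeeper HA, ZooKeeper keeps only a small handle per submitted job, and that handle points to a file in the HA storage directory. If the file is deleted but the handle is not, recovery fails with the error seen in this thread. The sketch below models that indirection with plain Python dictionaries. It is a simplified illustration, not Flink's actual code; the function name and the data layout are hypothetical.

```python
def find_dangling_handles(zk_handles, existing_files):
    """Return job IDs whose handle points to a file that no longer exists.

    zk_handles: dict mapping job ID -> file path recorded in the handle
    existing_files: set of paths currently present in the file system
    """
    return sorted(
        job_id
        for job_id, path in zk_handles.items()
        if path not in existing_files
    )


# Values taken from the error log in this thread.
zk_handles = {
    "a5ffe00b0bc5688d9a7de5c62b8150e6":
        "/flink/ha/zookeeper/submittedJobGraphb05001535f91",
}
# HDFS was cleaned, so the file the handle points to is gone.
existing_files = set()

dangling = find_dangling_handles(zk_handles, existing_files)
print(dangling)
```

Any job ID reported here is one that the JobManager would try, and fail, to recover on startup.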



Re: Re: Flink HA mode: process hangs!!!

Posted by Han Xiao <xi...@chinaunicom.cn>.
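Thank you very much for your answer. The problem was that ZooKeeper held the jobGraph of a failed job, so every cluster start tried to retrieve it. Deleting the leftover entry in ZooKeeper and restarting resolved it.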

 
Thank you for your reply!
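
A minimal sketch of the cleanup described above. It assumes Flink's default ZooKeeper layout, with job graphs stored under /flink/default/jobgraphs (derived from the defaults of high-availability.zookeeper.path.root, high-availability.cluster-id and high-availability.zookeeper.path.jobgraphs); verify the path against your own configuration before deleting anything. The zk argument is any object that offers get_children(path) and delete(path, recursive=...), which is the shape of the kazoo client. The helper name is hypothetical.

```python
import posixpath


def remove_stale_job_graphs(zk, stale_job_ids,
                            jobgraphs_path="/flink/default/jobgraphs"):
    """Delete the znodes of the given job IDs and return the deleted paths.

    jobgraphs_path is an assumption based on Flink's default HA settings.
    """
    deleted = []
    for child in zk.get_children(jobgraphs_path):
        if child in stale_job_ids:
            path = posixpath.join(jobgraphs_path, child)
            zk.delete(path, recursive=True)
            deleted.append(path)
    return deleted
```

With kazoo, zk would be a started KazooClient connected to the quorum from high-availability.zookeeper.quorum. Stop the Flink cluster first, and remove only the entries of jobs that do not need to be recovered.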

Re: Re: Flink HA mode: process hangs!!!

Posted by "baiyg25281@hundsun.com" <ba...@hundsun.com>.
Could this be related to this access-control setting?
high-availability.zookeeper.client.acl: open



baiyg25281@hundsun.com
 

Re: Re: Flink HA mode: process hangs!!!

Posted by Han Xiao <xi...@chinaunicom.cn>.
Hi, good morning, and thank you for your reply. Here are my configuration entries and parameters:

flink-conf.yaml
common:
jobmanager.rpc.address: test10
jobmanager.rpc.port: 6123
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
taskmanager.numberOfTaskSlots: 2
parallelism.default: 2
taskmanager.tmp.dirs: /app/tools/flink-1.7.2/tmp

High Availability:
high-availability: zookeeper
high-availability.storageDir: hdfs://test10:8020/flink/ha/   ## this directory is created normally, but it contains no jobGraph-related directory
high-availability.zookeeper.quorum: ip1:2181,ip2:2181,ip3:2181,ip4:2181,ip5:2181
high-availability.zookeeper.client.acl: open

Fault tolerance and checkpointing:
state.backend: filesystem
state.checkpoints.dir: hdfs://test10:8020/flink-checkpoints  ## this directory was not created

 Web Frontend:
rest.port: 8081

masters:          slaves:
test10:8081       test12
test11 : 8082     test13
                  test14

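That is the entire configuration. The path that the error message below tries to retrieve does not appear anywhere in my configuration, which really puzzles me.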

Thank you for your reply!

Re: Flink HA mode: process hangs!!!

Posted by Zili Chen <wa...@gmail.com>.
It looks like HDFS is looking for the submittedJobGraph at /flink/ha/zookeeper/submittedJobGraphb05001535f91,
which does not look right.

Flink HA requires both a ZooKeeper path and a file-system path for storing state. Try setting
high-availability.storageDir
to a valid HDFS path.

Best,
tison.
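
The check suggested above can be sketched as a small script. It is an illustration only: it reads flat "key: value" lines as they appear in a typical flink-conf.yaml, and the helper names parse_flink_conf and check_ha are hypothetical. Treating a storageDir without a scheme such as hdfs:// as a problem is an assumption of this sketch, not a rule enforced by Flink.

```python
from urllib.parse import urlparse


def parse_flink_conf(text):
    """Parse flat 'key: value' lines, dropping '#' comments and blank lines."""
    conf = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if ": " not in line:
            continue  # section labels such as 'common:' carry no value
        key, value = line.split(": ", 1)
        conf[key.strip()] = value.strip()
    return conf


def check_ha(conf):
    """Return a list of problems found in the ZooKeeper HA settings."""
    problems = []
    if conf.get("high-availability") != "zookeeper":
        problems.append("high-availability is not set to zookeeper")
    if not conf.get("high-availability.zookeeper.quorum"):
        problems.append("high-availability.zookeeper.quorum is missing")
    storage_dir = conf.get("high-availability.storageDir", "")
    if not storage_dir:
        problems.append("high-availability.storageDir is missing")
    elif not urlparse(storage_dir).scheme:
        problems.append(
            "high-availability.storageDir has no scheme such as hdfs://"
        )
    return problems


sample = """\
high-availability: zookeeper
high-availability.storageDir: hdfs://test10:8020/flink/ha/   ## HA metadata
high-availability.zookeeper.quorum: ip1:2181,ip2:2181,ip3:2181
"""
print(check_ha(parse_flink_conf(sample)))
```

An empty list means the three HA keys are present and the storage directory carries a scheme. It does not prove that the path exists or that ZooKeeper is free of stale entries.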



Re: Flink HA mode: process hangs!!!

Posted by Zili Chen <wa...@gmail.com>.
Could you share your HA configuration? In particular high-availability.storageDir; I suspect it may not be configured.
Best,
tison.

