Posted to user-zh@flink.apache.org by yanggang_it_job <ya...@163.com> on 2020/06/01 07:30:58 UTC

Discussion: checkpoint failures

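Recently, several jobs that use RocksDB as the state backend and HDFS as the remote file system have been failing frequently. The failures share the following characteristics:
1. The jobs had been running stably until they suddenly started failing on a particular day.
2. When this error shows up, several other jobs also fail their checkpoints with the same error.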


The error message is as follows:
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/stream/flink-checkpoints/19523bf083346eb80b409167e9b91b53/chk-43396/cef72b90-8492-4b09-8d1b-384b0ebe5768 could only be replicated to 0 nodes instead of minReplication (=1). There are 8 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1723)


Could you please take a look?
Thanks

Re: Discussion: checkpoint failures

Posted by Yun Tang <my...@live.com>.
Hi

The error "could only be replicated to 0 nodes instead of minReplication (=1)" is caused by an unstable HDFS cluster: HDFS could not replicate the data to any DataNode. It has nothing to do with Flink itself.
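
If the same failure has to be spotted across the logs of many jobs, a small parser can help confirm that they all hit this HDFS-side message. The following is only a minimal sketch: the regular expression is written against the exact wording quoted in this thread, and the names `PLACEMENT_FAILURE` and `parse_placement_failure` are hypothetical helpers, not part of Flink or Hadoop. Other Hadoop versions may word the message differently.

```python
import re

# Regex modelled on the NameNode message quoted in this thread.
# Assumption: the wording matches; other Hadoop versions may differ.
PLACEMENT_FAILURE = re.compile(
    r"File (?P<path>\S+) could only be replicated to (?P<replicated>\d+) nodes "
    r"instead of minReplication \(=(?P<min_replication>\d+)\)\.\s+"
    r"There are (?P<datanodes>\d+) datanode\(s\) running and "
    r"(?P<excluded>no|\d+) node\(s\) are excluded in this operation\."
)


def parse_placement_failure(line):
    """Return the fields of the HDFS message as a dict, or None if it does not match."""
    match = PLACEMENT_FAILURE.search(line)
    if match is None:
        return None
    excluded = match.group("excluded")
    return {
        "path": match.group("path"),
        "replicated": int(match.group("replicated")),
        "min_replication": int(match.group("min_replication")),
        "datanodes_running": int(match.group("datanodes")),
        # The message prints "no" instead of 0 when nothing was excluded.
        "excluded_nodes": 0 if excluded == "no" else int(excluded),
    }


# The log line from the original post.
sample = (
    "org.apache.hadoop.ipc.RemoteException(java.io.IOException): File "
    "/user/stream/flink-checkpoints/19523bf083346eb80b409167e9b91b53/"
    "chk-43396/cef72b90-8492-4b09-8d1b-384b0ebe5768 could only be replicated "
    "to 0 nodes instead of minReplication (=1). There are 8 datanode(s) "
    "running and no node(s) are excluded in this operation."
)

info = parse_placement_failure(sample)
print(info)
```

In the quoted line, 8 DataNodes are reported as running and none are excluded, yet 0 were chosen as write targets. That points at the state of the DataNodes themselves (for example free space or load, which would need to be checked on the HDFS side) rather than at the Flink job.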

Best regards,
Yun Tang
