Posted to user@flink.apache.org by Adrian Vasiliu <va...@fr.ibm.com> on 2019/10/14 13:10:11 UTC

FLINK-13497 / "Could not create file for checking if truncate works" / HDFS

Hello,

We recently upgraded our product from Flink 1.7.2 to Flink 1.9, and we are experiencing repeated job failures with:


java.lang.RuntimeException: Could not create file for checking if truncate works. You can disable support for truncate() completely via BucketingSink.setUseTruncate(false).
    at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.reflectTruncate(BucketingSink.java:645)
    at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.initializeState(BucketingSink.java:388)
    at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
    at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
    at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:281)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:878)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:392)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /okd-dev/3fe6b069-43bf-4d86-9762-4f501c9db16e could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and no node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1719)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3368)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3292)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:850)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:504)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)

    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1489)
    at org.apache.hadoop.ipc.Client.call(Client.java:1435)
    at org.apache.hadoop.ipc.Client.call(Client.java:1345)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy49.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:444)
    at sun.reflect.GeneratedMethodAccessor87.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
    at com.sun.proxy.$Proxy50.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1838)
    at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1638)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:704)



Reading through https://issues.apache.org/jira/browse/FLINK-13593, it looks related, but that issue is marked as fixed in 1.9.

The discussion there points to https://issues.apache.org/jira/browse/FLINK-13497, which is marked as unresolved, with the fix targeted for 1.10.

Could you shed any light on the following:
1/ Would you confirm that our stack trace is related to https://issues.apache.org/jira/browse/FLINK-13497?
2/ Is there any ETA for a 1.9.x release fixing it?
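
As we understand it, BucketingSink.reflectTruncate probes truncate() support by creating a small test file on HDFS, and it is the creation of that probe file itself which fails here, with the replication error as the nested cause. For completeness, the workaround suggested by the exception message would look roughly like this in a job (a minimal sketch; the path and the dummy source are illustrative, not our actual setup):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;

public class TruncateWorkaroundJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> events = env.fromElements("a", "b", "c"); // stand-in for the real source

        BucketingSink<String> sink = new BucketingSink<>("hdfs:///okd-dev/output"); // illustrative path
        // Skip the reflective truncate() probe that throws the RuntimeException above.
        // Note: without truncate(), the sink writes .valid-length metadata files on
        // recovery instead of truncating in-progress files in place.
        sink.setUseTruncate(false);

        events.addSink(sink);
        env.execute("bucketing-sink-without-truncate");
    }
}

We haven't adopted this, since it only sidesteps the probe rather than fixing the underlying HDFS write failure.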


Thanks

Adrian Vasiliu

  


Re: FLINK-13497 / "Could not create file for checking if truncate works" / HDFS

Posted by Congxian Qiu <qc...@gmail.com>.
Glad to hear it!

Best,
Congxian



RE: FLINK-13497 / "Could not create file for checking if truncate works" / HDFS

Posted by Adrian Vasiliu <va...@fr.ibm.com>.
Hi,

FYI, we've switched to a different Hadoop server and the issue vanished... It does look as if the cause was on the Hadoop side.
Thanks again, Congxian.
Adrian





RE: FLINK-13497 / "Could not create file for checking if truncate works" / HDFS

Posted by Adrian Vasiliu <va...@fr.ibm.com>.
Thanks, Congxian. The possible causes listed in the top-voted answer at
https://stackoverflow.com/questions/36015864/hadoop-be-replicated-to-0-nodes-instead-of-minreplication-1-there-are-1/36310025
do not seem to hold for us: we have other, quite similar Flink jobs using the same Hadoop server and root directory (under different HDFS paths), and they work. So in principle the configuration on the Hadoop server side shouldn't be the cause. Also, according to the Ambari monitoring tools, the Hadoop server is healthy, and we did restart it. We will nevertheless check all the points mentioned in the various answers, in particular the one about temp files.
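
In the meantime, an easy client-side sanity check we can script is to ask HDFS for its aggregate capacity, which covers the disk-space cause from that answer (a sketch; it assumes the standard Hadoop client configuration is on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

public class HdfsCapacityCheck {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS etc. from core-site.xml / hdfs-site.xml on the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        FsStatus status = fs.getStatus(); // aggregate capacity/usage across the datanodes
        System.out.printf("capacity=%d bytes, used=%d bytes, remaining=%d bytes%n",
                status.getCapacity(), status.getUsed(), status.getRemaining());
    }
}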

Thanks

Adrian





Re: FLINK-13497 / "Could not create file for checking if truncate works" / HDFS

Posted by Congxian Qiu <qc...@gmail.com>.
Hi,

From the given stack trace, maybe you could solve the "replication problem" first: "File /okd-dev/3fe6b069-43bf-4d86-9762-4f501c9db16e could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and no node(s) are excluded in this operation." The answer on Stack Overflow [1] may help.

[1] https://stackoverflow.com/questions/36015864/hadoop-be-replicated-to-0-nodes-instead-of-minreplication-1-there-are-1/36310025
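
If it helps, you can also reproduce the failing write outside Flink with a small probe against the same HDFS directory (a sketch only; the path is just an example, and the client picks up your usual core-site.xml / hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // standard Hadoop client config
        FileSystem fs = FileSystem.get(conf);
        Path probe = new Path("/okd-dev/.write-probe"); // example path near the failing file
        try (FSDataOutputStream out = fs.create(probe, true)) {
            // Should fail with the same RemoteException if no datanode accepts the block.
            out.writeUTF("probe");
        } finally {
            fs.delete(probe, false);
        }
        System.out.println("HDFS write OK");
    }
}

If this probe fails too, the problem is on the HDFS side (datanode disk space, datanode-namenode connectivity, exclusion rules) rather than in Flink.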
Best,
Congxian

