Posted to user@hadoop.apache.org by Yuzhang Han <yu...@gmail.com> on 2013/06/25 00:01:06 UTC

"could only be replicated to 0 nodes instead of minReplication" exception during job execution

Hello,

I am using YARN. I am getting exceptions from my namenode and datanode. They
are thrown when my reduce progress reaches 67%. The reduce phase is then
restarted from 0% several times, but it always fails again at the same point.
Can someone tell me what I should do? Many thanks!


Namenode log:

2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock:
blockMap updated: 10.224.2.190:50010 is added to
blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
2013-06-24 19:08:50,349 WARN
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy:
Not able to place enough replicas, still in need of 1 to reach 1
For more information, please enable DEBUG log level on
org.apache.commons.logging.impl.Log4JLogger
2013-06-24 19:08:50,350 ERROR
org.apache.hadoop.security.UserGroupInformation:
PriviledgedActionException as:ubuntu (auth:SIMPLE)
cause:java.io.IOException: File
/output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002
could only be replicated to 0 nodes instead of minReplication (=1).
There are 2 datanode(s) running and no node(s) are excluded in this
operation.
2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 9000, call
org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from
10.224.2.190:49375: error: java.io.IOException: File
/output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002
could only be replicated to 0 nodes instead of minReplication (=1).
There are 2 datanode(s) running and no node(s) are excluded in this
operation.
java.io.IOException: File
/output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002
could only be replicated to 0 nodes instead of minReplication (=1).
There are 2 datanode(s) running and no node(s) are excluded in this
operation.
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:416)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock:
blockMap updated: 10.224.2.190:50010 is added to
blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
2013-06-24 19:08:50,418 WARN
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy:
Not able to place enough replicas, still in need of 1 to reach 1
For more information, please enable DEBUG log level on
org.apache.commons.logging.impl.Log4JLogger



Datanode log:

2013-06-24 19:25:54,695 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022,
type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2013-06-24 19:25:54,699 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving
BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /10.35.99.108:59710 dest: /10.35.99.108:50010
2013-06-24 19:25:56,473 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for
BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
    at sun.nio.ch.IOUtil.read(IOUtil.java:224)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
    at java.lang.Thread.run(Thread.java:679)
2013-06-24 19:25:56,476 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026,
type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.

Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Posted by Yuzhang Han <yu...@gmail.com>.
Omkar, by the way, I see following exception in my nodemanager log. Do you
think it is related?

2013-06-24 16:15:48,039 INFO
org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.
2013-06-24 16:17:35,413 ERROR
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught
exception in status-updater
java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:135)
    at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:72)
    at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:343)
Caused by: com.google.protobuf.ServiceException: java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "ip-10-224-2-190/10.224.2.190"; destination host is: "ec2-54-224-142-46.compute-1.amazonaws.com":8031;
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:212)
    at $Proxy23.nodeHeartbeat(Unknown Source)
    at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:70)
    ... 1 more
Caused by: java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "ip-10-224-2-190/10.224.2.190"; destination host is: "ec2-54-224-142-46.compute-1.amazonaws.com":8031;
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:761)
    at org.apache.hadoop.ipc.Client.call(Client.java:1239)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
    ... 3 more
Caused by: java.io.IOException: Response is null.
    at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:949)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:844)
2013-06-24 16:17:37,420 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: ec2-54-224-142-46.compute-1.amazonaws.com/10.10.79.145:8031.
Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-24 16:17:38,421 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: ec2-54-224-142-46.compute-1.amazonaws.com/10.10.79.145:8031.
Already tried 1 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-24 16:17:39,422 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: ec2-54-224-142-46.compute-1.amazonaws.com/10.10.79.145:8031.
Already tried 2 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-24 16:17:40,304 INFO
org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is
stopped.
2013-06-24 16:17:40,305 INFO
org.apache.hadoop.yarn.service.AbstractService: Service:Dispatcher is
stopped.
2013-06-24 16:17:40,322 INFO org.mortbay.log: Stopped
SelectChannelConnector@0.0.0.0:8042
2013-06-24 16:17:40,422 INFO
org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer is
stopped.
2013-06-24 16:17:40,422 INFO org.apache.hadoop.ipc.Server: Stopping server
on 51111
2013-06-24 16:17:40,423 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: ec2-54-224-142-46.compute-1.amazonaws.com/10.10.79.145:8031.
Already tried 3 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-24 16:17:40,424 INFO org.apache.hadoop.ipc.Server: Stopping IPC
Server listener on 51111
2013-06-24 16:17:40,425 INFO org.apache.hadoop.ipc.Server: Stopping IPC
Server Responder
2013-06-24 16:17:40,426 INFO
org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler
is stopped.
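
For reference, the failures above are the NodeManager's heartbeat to the ResourceManager's resource tracker (port 8031) getting no response, after which the NodeManager shuts its services down. A rough way to narrow that down, assuming a standard Hadoop 2.x install (the hostname below is copied from the log; the property name and log path are typical defaults, not something confirmed in this thread):

    # Is the ResourceManager's resource tracker actually reachable on 8031?
    nc -vz ec2-54-224-142-46.compute-1.amazonaws.com 8031

    # Does the NodeManager config point at the right resource-tracker address?
    grep -A1 "yarn.resourcemanager.resource-tracker.address" $HADOOP_CONF_DIR/yarn-site.xml

    # Anything in the ResourceManager log around the same timestamps?
    tail -n 200 $HADOOP_YARN_HOME/logs/yarn-*-resourcemanager-*.log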


On Tue, Jun 25, 2013 at 12:04 AM, Yuzhang Han <yu...@gmail.com> wrote:

> Thank you, Omkar.
>
> I didn't see other errors on datanode and namenode logs. My namenode 50070
> interface shows
>
> Configured Capacity : 393.72 GB DFS Used : 60.86 GB Non DFS Used : 137.51
> GB DFS Remaining : 195.35 GB DFS Used% : 15.46% DFS Remaining% : 49.62% Block
> Pool Used : 60.86 GB Block Pool Used% : 15.46% DataNodes usages : Min %Median %Max %stdev %
>
> 14.55% 16.37% 16.37% 0.91%
>
>
> It doesn't imply insufficient disk space, does it? Can you think of any
> other possible cause of the exceptions?
>
>
> On Mon, Jun 24, 2013 at 6:17 PM, Omkar Joshi <oj...@hortonworks.com>wrote:
>
>> Hi,
>>
>> I see there are 2 datanodes and for some reason namenode is not able to
>> create even single replica for requested blocks. are you sure the system on
>> which these datanodes are running have sufficient disk space? Do you see
>> any other errors in datanode/namenode logs?
>>
>> What must be happening is as file creation in hdfs is failing it is
>> marking that reduce attempt as failure and restarting it. Keep checking
>> namenode state when it reaches 67%.
>>
>> Thanks,
>> Omkar Joshi
>> *Hortonworks Inc.* <http://www.hortonworks.com>
>>
>>
>> On Mon, Jun 24, 2013 at 3:01 PM, Yuzhang Han <yu...@gmail.com>wrote:
>>
>>> Hello,
>>>
>>> I am using YARN. I get some exceptions at my namenode and datanode. They
>>> are thrown when my Reduce progress gets 67%. Then, reduce phase is
>>> restarted from 0% several times, but always restarts at this point. Can
>>> someone tell me what I should do? Many thanks!
>>>
>>>
>>> Namenode log:
>>>
>>> 2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
>>> 2013-06-24 19:08:50,349 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
>>> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>>> 2013-06-24 19:08:50,350 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>>> 2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.224.2.190:49375: error: java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>>> java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>>> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155)
>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>> 	at javax.security.auth.Subject.doAs(Subject.java:416)
>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
>>> 2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
>>> 2013-06-24 19:08:50,418 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
>>> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>>>
>>>
>>>
>>> Datanode log:
>>>
>>> 2013-06-24 19:25:54,695 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
>>> BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022,
>>> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
>>> 2013-06-24 19:25:54,699 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving
>>> BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /
>>> 10.35.99.108:59710 dest: /10.35.99.108:50010
>>> 2013-06-24 19:25:56,473 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for
>>> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
>>> java.io.IOException: Connection reset by peer
>>>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>>>     at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>>>     at
>>> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>>     at
>>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>>>     at
>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
>>>     at
>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
>>>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>>>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>>>     at java.io.DataInputStream.read(DataInputStream.java:149)
>>>     at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>>>     at
>>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
>>>     at
>>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
>>>     at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
>>>     at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
>>>     at java.lang.Thread.run(Thread.java:679)
>>> 2013-06-24 19:25:56,476 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
>>> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026,
>>> type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
>>>
>>
>>
>

RE: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Posted by Devaraj k <de...@huawei.com>.
Could you check this page for any of these possible causes in your cluster?

http://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo
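
As a rough first pass on the usual suspects listed there (datanodes actually live and registered, free space on each datanode's data directories, open-file limits), the following should show it. This is only a sketch assuming a standard Hadoop 2.x install; the data directory path is an example, use whatever dfs.datanode.data.dir is set to in your hdfs-site.xml:

    # Are both datanodes live, and how much remaining capacity does each report?
    hdfs dfsadmin -report

    # Actual free space on the volume holding the datanode's data directory (example path)
    df -h /hadoop/dfs/data

    # Open-file limit for the user running the datanode
    ulimit -n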

Thanks
Devaraj k

From: Yuzhang Han [mailto:yuzhanghan1982@gmail.com]
Sent: 25 June 2013 09:34
To: user@hadoop.apache.org
Subject: Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Thank you, Omkar.
I didn't see any other errors in the datanode and namenode logs. My namenode 50070 interface shows:
Configured Capacity : 393.72 GB
DFS Used            : 60.86 GB
Non DFS Used        : 137.51 GB
DFS Remaining       : 195.35 GB
DFS Used%           : 15.46%
DFS Remaining%      : 49.62%
Block Pool Used     : 60.86 GB
Block Pool Used%    : 15.46%
DataNodes usages    : Min %    Median %   Max %    stdev %
                      14.55%   16.37%     16.37%   0.91%



It doesn't imply insufficient disk space, does it? Can you think of any other possible cause of the exceptions?

On Mon, Jun 24, 2013 at 6:17 PM, Omkar Joshi <oj...@hortonworks.com> wrote:
Hi,

I see there are 2 datanodes, and for some reason the namenode is not able to create even a single replica for the requested blocks. Are you sure the systems these datanodes are running on have sufficient disk space? Do you see any other errors in the datanode/namenode logs?

What must be happening is that, because file creation in HDFS is failing, the job is marking that reduce attempt as failed and restarting it. Keep checking the namenode state when the reduce reaches 67%.

Thanks,
Omkar Joshi
Hortonworks Inc.<http://www.hortonworks.com>

On Mon, Jun 24, 2013 at 3:01 PM, Yuzhang Han <yu...@gmail.com> wrote:
Hello,
I am using YARN. I am getting exceptions from my namenode and datanode. They are thrown when my reduce progress reaches 67%. The reduce phase is then restarted from 0% several times, but it always fails again at the same point. Can someone tell me what I should do? Many thanks!

Namenode log:

2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0

2013-06-24 19:08:50,349 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1

For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger

2013-06-24 19:08:50,350 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.

2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.224.2.190:49375: error: java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.

java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.

        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155)

        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)

        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)

        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)

        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)

        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)

        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)

        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)

        at java.security.AccessController.doPrivileged(Native Method)

        at javax.security.auth.Subject.doAs(Subject.java:416)

        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)

2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0

2013-06-24 19:08:50,418 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1

For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger


Datanode log:

2013-06-24 19:25:54,695 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2013-06-24 19:25:54,699 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /10.35.99.108:59710 dest: /10.35.99.108:50010
2013-06-24 19:25:56,473 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
    at sun.nio.ch.IOUtil.read(IOUtil.java:224)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
    at java.lang.Thread.run(Thread.java:679)
2013-06-24 19:25:56,476 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.



Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Posted by Yuzhang Han <yu...@gmail.com>.
Omkar, by the way, I see following exception in my nodemanager log. Do you
think it is related?

2013-06-24 16:15:48,039 INFO
org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.
2013-06-24 16:17:35,413 ERROR
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught
exception in status-updater
java.lang.reflect.UndeclaredThrowableException
    at
org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:135)
    at
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:72)
    at
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:343)
Caused by: com.google.protobuf.ServiceException: java.io.IOException:
Failed on local exception: java.io.IOException: Response is null.; Host
Details : local host is: "ip-10-224-2-190/10.224.2.190"; destination host
is: "ec2-54-224-142-46.compute-1.amazonaws.com":8031;
    at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:212)
    at $Proxy23.nodeHeartbeat(Unknown Source)
    at
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:70)
    ... 1 more
Caused by: java.io.IOException: Failed on local exception:
java.io.IOException: Response is null.; Host Details : local host is:
"ip-10-224-2-190/10.224.2.190"; destination host is: "
ec2-54-224-142-46.compute-1.amazonaws.com":8031;
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:761)
    at org.apache.hadoop.ipc.Client.call(Client.java:1239)
    at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
    ... 3 more
Caused by: java.io.IOException: Response is null.
    at
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:949)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:844)
2013-06-24 16:17:37,420 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: ec2-54-224-142-46.compute-1.amazonaws.com/10.10.79.145:8031.
Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-24 16:17:38,421 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: ec2-54-224-142-46.compute-1.amazonaws.com/10.10.79.145:8031.
Already tried 1 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-24 16:17:39,422 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: ec2-54-224-142-46.compute-1.amazonaws.com/10.10.79.145:8031.
Already tried 2 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-24 16:17:40,304 INFO
org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is
stopped.
2013-06-24 16:17:40,305 INFO
org.apache.hadoop.yarn.service.AbstractService: Service:Dispatcher is
stopped.
2013-06-24 16:17:40,322 INFO org.mortbay.log: Stopped
SelectChannelConnector@0.0.0.0:8042
2013-06-24 16:17:40,422 INFO
org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer is
stopped.
2013-06-24 16:17:40,422 INFO org.apache.hadoop.ipc.Server: Stopping server
on 51111
2013-06-24 16:17:40,423 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: ec2-54-224-142-46.compute-1.amazonaws.com/10.10.79.145:8031.
Already tried 3 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-24 16:17:40,424 INFO org.apache.hadoop.ipc.Server: Stopping IPC
Server listener on 51111
2013-06-24 16:17:40,425 INFO org.apache.hadoop.ipc.Server: Stopping IPC
Server Responder
2013-06-24 16:17:40,426 INFO
org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler
is stopped.


On Tue, Jun 25, 2013 at 12:04 AM, Yuzhang Han <yu...@gmail.com>wrote:

> Thank you, Omkar.
>
> I didn't see other errors on datanode and namenode logs. My namenode 50070
> interface shows
>
> Configured Capacity : 393.72 GB DFS Used : 60.86 GB Non DFS Used : 137.51
> GB DFS Remaining : 195.35 GB DFS Used% : 15.46% DFS Remaining% : 49.62% Block
> Pool Used : 60.86 GB Block Pool Used% : 15.46% DataNodes usages : Min %Median %Max %stdev %
>
> 14.55% 16.37% 16.37% 0.91%
>
>
> It doesn't imply insufficient disk space, does it? Can you think of any
> other possible cause of the exceptions?
>
>
> On Mon, Jun 24, 2013 at 6:17 PM, Omkar Joshi <oj...@hortonworks.com>wrote:
>
>> Hi,
>>
>> I see there are 2 datanodes and for some reason namenode is not able to
>> create even single replica for requested blocks. are you sure the system on
>> which these datanodes are running have sufficient disk space? Do you see
>> any other errors in datanode/namenode logs?
>>
>> What must be happening is as file creation in hdfs is failing it is
>> marking that reduce attempt as failure and restarting it. Keep checking
>> namenode state when it reaches 67%.
>>
>> Thanks,
>> Omkar Joshi
>> *Hortonworks Inc.* <http://www.hortonworks.com>
>>
>>
>> On Mon, Jun 24, 2013 at 3:01 PM, Yuzhang Han <yu...@gmail.com>wrote:
>>
>>> Hello,
>>>
>>> I am using YARN. I get some exceptions at my namenode and datanode. They
>>> are thrown when my Reduce progress gets 67%. Then, reduce phase is
>>> restarted from 0% several times, but always restarts at this point. Can
>>> someone tell me what I should do? Many thanks!
>>>
>>>
>>> Namenode log:
>>>
>>> 2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
>>> 2013-06-24 19:08:50,349 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
>>> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>>> 2013-06-24 19:08:50,350 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>>> 2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.224.2.190:49375: error: java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>>> java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>>> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155)
>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>> 	at javax.security.auth.Subject.doAs(Subject.java:416)
>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
>>> 2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
>>> 2013-06-24 19:08:50,418 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
>>> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>>>
>>>
>>>
>>> Datanode log:
>>>
>>> 2013-06-24 19:25:54,695 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
>>> BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022,
>>> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
>>> 2013-06-24 19:25:54,699 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving
>>> BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /
>>> 10.35.99.108:59710 dest: /10.35.99.108:50010
>>> 2013-06-24 19:25:56,473 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for
>>> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
>>> java.io.IOException: Connection reset by peer
>>>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>>>     at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>>>     at
>>> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>>     at
>>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>>>     at
>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
>>>     at
>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
>>>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>>>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>>>     at java.io.DataInputStream.read(DataInputStream.java:149)
>>>     at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>>>     at
>>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
>>>     at
>>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
>>>     at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
>>>     at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
>>>     at java.lang.Thread.run(Thread.java:679)
>>> 2013-06-24 19:25:56,476 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
>>> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026,
>>> type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
>>>
>>
>>
>

RE: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Posted by Devaraj k <de...@huawei.com>.
Could you check this page for any of these possible cause in your cluster.

http://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo

Thanks
Devaraj k

From: Yuzhang Han [mailto:yuzhanghan1982@gmail.com]
Sent: 25 June 2013 09:34
To: user@hadoop.apache.org
Subject: Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Thank you, Omkar.
I didn't see other errors on datanode and namenode logs. My namenode 50070 interface shows
Configured Capacity

:

393.72 GB

DFS Used

:

60.86 GB

Non DFS Used

:

137.51 GB

DFS Remaining

:

195.35 GB

DFS Used%

:

15.46%

DFS Remaining%

:

49.62%

Block Pool Used

:

60.86 GB

Block Pool Used%

:

15.46%

DataNodes usages

:

Min %

Median %

Max %

stdev %


14.55%

16.37%

16.37%

0.91%



It doesn't imply insufficient disk space, does it? Can you think of any other possible cause of the exceptions?

On Mon, Jun 24, 2013 at 6:17 PM, Omkar Joshi <oj...@hortonworks.com>> wrote:
Hi,

I see there are 2 datanodes and for some reason namenode is not able to create even single replica for requested blocks. are you sure the system on which these datanodes are running have sufficient disk space? Do you see any other errors in datanode/namenode logs?

What must be happening is as file creation in hdfs is failing it is marking that reduce attempt as failure and restarting it. Keep checking namenode state when it reaches 67%.

Thanks,
Omkar Joshi
Hortonworks Inc.<http://www.hortonworks.com>

On Mon, Jun 24, 2013 at 3:01 PM, Yuzhang Han <yu...@gmail.com>> wrote:
Hello,
I am using YARN. I get some exceptions at my namenode and datanode. They are thrown when my Reduce progress gets 67%. Then, reduce phase is restarted from 0% several times, but always restarts at this point. Can someone tell me what I should do? Many thanks!

Namenode log:

2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010<http://10.224.2.190:50010> is added to blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0

2013-06-24 19:08:50,349 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1

For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger

2013-06-24 19:08:50,350 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.

2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.224.2.190:49375<http://10.224.2.190:49375>: error: java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.

java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.

        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155)

        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)

        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)

        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)

        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)

        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)

        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)

        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)

        at java.security.AccessController.doPrivileged(Native Method)

        at javax.security.auth.Subject.doAs(Subject.java:416)

        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)

2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010<http://10.224.2.190:50010> is added to blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0

2013-06-24 19:08:50,418 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1

For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger


Datanode log:

2013-06-24 19:25:54,695 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2013-06-24 19:25:54,699 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /10.35.99.108:59710<http://10.35.99.108:59710> dest: /10.35.99.108:50010<http://10.35.99.108:50010>
2013-06-24 19:25:56,473 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
    at sun.nio.ch.IOUtil.read(IOUtil.java:224)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
    at java.lang.Thread.run(Thread.java:679)
2013-06-24 19:25:56,476 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.



Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Posted by Yuzhang Han <yu...@gmail.com>.
Omkar, by the way, I see following exception in my nodemanager log. Do you
think it is related?

2013-06-24 16:15:48,039 INFO
org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.
2013-06-24 16:17:35,413 ERROR
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught
exception in status-updater
java.lang.reflect.UndeclaredThrowableException
    at
org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:135)
    at
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:72)
    at
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:343)
Caused by: com.google.protobuf.ServiceException: java.io.IOException:
Failed on local exception: java.io.IOException: Response is null.; Host
Details : local host is: "ip-10-224-2-190/10.224.2.190"; destination host
is: "ec2-54-224-142-46.compute-1.amazonaws.com":8031;
    at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:212)
    at $Proxy23.nodeHeartbeat(Unknown Source)
    at
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:70)
    ... 1 more
Caused by: java.io.IOException: Failed on local exception:
java.io.IOException: Response is null.; Host Details : local host is:
"ip-10-224-2-190/10.224.2.190"; destination host is: "
ec2-54-224-142-46.compute-1.amazonaws.com":8031;
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:761)
    at org.apache.hadoop.ipc.Client.call(Client.java:1239)
    at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
    ... 3 more
Caused by: java.io.IOException: Response is null.
    at
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:949)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:844)
2013-06-24 16:17:37,420 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: ec2-54-224-142-46.compute-1.amazonaws.com/10.10.79.145:8031.
Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-24 16:17:38,421 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: ec2-54-224-142-46.compute-1.amazonaws.com/10.10.79.145:8031.
Already tried 1 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-24 16:17:39,422 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: ec2-54-224-142-46.compute-1.amazonaws.com/10.10.79.145:8031.
Already tried 2 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-24 16:17:40,304 INFO
org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is
stopped.
2013-06-24 16:17:40,305 INFO
org.apache.hadoop.yarn.service.AbstractService: Service:Dispatcher is
stopped.
2013-06-24 16:17:40,322 INFO org.mortbay.log: Stopped
SelectChannelConnector@0.0.0.0:8042
2013-06-24 16:17:40,422 INFO
org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer is
stopped.
2013-06-24 16:17:40,422 INFO org.apache.hadoop.ipc.Server: Stopping server
on 51111
2013-06-24 16:17:40,423 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: ec2-54-224-142-46.compute-1.amazonaws.com/10.10.79.145:8031.
Already tried 3 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-24 16:17:40,424 INFO org.apache.hadoop.ipc.Server: Stopping IPC
Server listener on 51111
2013-06-24 16:17:40,425 INFO org.apache.hadoop.ipc.Server: Stopping IPC
Server Responder
2013-06-24 16:17:40,426 INFO
org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler
is stopped.


On Tue, Jun 25, 2013 at 12:04 AM, Yuzhang Han <yu...@gmail.com>wrote:

> Thank you, Omkar.
>
> I didn't see other errors on datanode and namenode logs. My namenode 50070
> interface shows
>
> Configured Capacity : 393.72 GB DFS Used : 60.86 GB Non DFS Used : 137.51
> GB DFS Remaining : 195.35 GB DFS Used% : 15.46% DFS Remaining% : 49.62% Block
> Pool Used : 60.86 GB Block Pool Used% : 15.46% DataNodes usages : Min %Median %Max %stdev %
>
> 14.55% 16.37% 16.37% 0.91%
>
>
> It doesn't imply insufficient disk space, does it? Can you think of any
> other possible cause of the exceptions?
>
>
> On Mon, Jun 24, 2013 at 6:17 PM, Omkar Joshi <oj...@hortonworks.com>wrote:
>
>> Hi,
>>
>> I see there are 2 datanodes and for some reason namenode is not able to
>> create even single replica for requested blocks. are you sure the system on
>> which these datanodes are running have sufficient disk space? Do you see
>> any other errors in datanode/namenode logs?
>>
>> What must be happening is as file creation in hdfs is failing it is
>> marking that reduce attempt as failure and restarting it. Keep checking
>> namenode state when it reaches 67%.
>>
>> Thanks,
>> Omkar Joshi
>> *Hortonworks Inc.* <http://www.hortonworks.com>
>>
>>
>> On Mon, Jun 24, 2013 at 3:01 PM, Yuzhang Han <yu...@gmail.com>wrote:
>>
>>> Hello,
>>>
>>> I am using YARN. I get some exceptions at my namenode and datanode. They
>>> are thrown when my Reduce progress gets 67%. Then, reduce phase is
>>> restarted from 0% several times, but always restarts at this point. Can
>>> someone tell me what I should do? Many thanks!
>>>
>>>
>>> Namenode log:
>>>
>>> 2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
>>> 2013-06-24 19:08:50,349 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
>>> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>>> 2013-06-24 19:08:50,350 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>>> 2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.224.2.190:49375: error: java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>>> java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>>> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155)
>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>> 	at javax.security.auth.Subject.doAs(Subject.java:416)
>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
>>> 2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
>>> 2013-06-24 19:08:50,418 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
>>> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>>>
>>>
>>>
>>> Datanode log:
>>>
>>> 2013-06-24 19:25:54,695 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
>>> BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022,
>>> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
>>> 2013-06-24 19:25:54,699 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving
>>> BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /
>>> 10.35.99.108:59710 dest: /10.35.99.108:50010
>>> 2013-06-24 19:25:56,473 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for
>>> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
>>> java.io.IOException: Connection reset by peer
>>>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>>>     at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>>>     at
>>> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>>     at
>>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>>>     at
>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
>>>     at
>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
>>>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>>>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>>>     at java.io.DataInputStream.read(DataInputStream.java:149)
>>>     at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>>>     at
>>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
>>>     at
>>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
>>>     at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
>>>     at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
>>>     at java.lang.Thread.run(Thread.java:679)
>>> 2013-06-24 19:25:56,476 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
>>> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026,
>>> type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
>>>
>>
>>
>

Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Posted by Yuzhang Han <yu...@gmail.com>.
Omkar, by the way, I see the following exception in my nodemanager log. Do you
think it is related?

2013-06-24 16:15:48,039 INFO
org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.
2013-06-24 16:17:35,413 ERROR
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught
exception in status-updater
java.lang.reflect.UndeclaredThrowableException
    at
org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:135)
    at
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:72)
    at
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:343)
Caused by: com.google.protobuf.ServiceException: java.io.IOException:
Failed on local exception: java.io.IOException: Response is null.; Host
Details : local host is: "ip-10-224-2-190/10.224.2.190"; destination host
is: "ec2-54-224-142-46.compute-1.amazonaws.com":8031;
    at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:212)
    at $Proxy23.nodeHeartbeat(Unknown Source)
    at
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:70)
    ... 1 more
Caused by: java.io.IOException: Failed on local exception:
java.io.IOException: Response is null.; Host Details : local host is:
"ip-10-224-2-190/10.224.2.190"; destination host is: "
ec2-54-224-142-46.compute-1.amazonaws.com":8031;
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:761)
    at org.apache.hadoop.ipc.Client.call(Client.java:1239)
    at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
    ... 3 more
Caused by: java.io.IOException: Response is null.
    at
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:949)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:844)
2013-06-24 16:17:37,420 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: ec2-54-224-142-46.compute-1.amazonaws.com/10.10.79.145:8031.
Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-24 16:17:38,421 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: ec2-54-224-142-46.compute-1.amazonaws.com/10.10.79.145:8031.
Already tried 1 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-24 16:17:39,422 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: ec2-54-224-142-46.compute-1.amazonaws.com/10.10.79.145:8031.
Already tried 2 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-24 16:17:40,304 INFO
org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is
stopped.
2013-06-24 16:17:40,305 INFO
org.apache.hadoop.yarn.service.AbstractService: Service:Dispatcher is
stopped.
2013-06-24 16:17:40,322 INFO org.mortbay.log: Stopped
SelectChannelConnector@0.0.0.0:8042
2013-06-24 16:17:40,422 INFO
org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer is
stopped.
2013-06-24 16:17:40,422 INFO org.apache.hadoop.ipc.Server: Stopping server
on 51111
2013-06-24 16:17:40,423 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: ec2-54-224-142-46.compute-1.amazonaws.com/10.10.79.145:8031.
Already tried 3 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-24 16:17:40,424 INFO org.apache.hadoop.ipc.Server: Stopping IPC
Server listener on 51111
2013-06-24 16:17:40,425 INFO org.apache.hadoop.ipc.Server: Stopping IPC
Server Responder
2013-06-24 16:17:40,426 INFO
org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler
is stopped.
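
For context on the retries above: the NodeManager appears unable to heartbeat to the ResourceManager's resource-tracker address (port 8031). A minimal sketch, assuming only the standard yarn.resourcemanager.resource-tracker.address key and a yarn-site.xml on the classpath (the class name here is illustrative, not from this thread), that prints the address a NodeManager would resolve, so it can be compared with where the ResourceManager actually listens:

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    // Illustrative only: print the resource-tracker address resolved from the
    // classpath yarn-site.xml (the shipped default is 0.0.0.0:8031).
    public class PrintResourceTrackerAddress {
        public static void main(String[] args) {
            YarnConfiguration conf = new YarnConfiguration();
            System.out.println(conf.get("yarn.resourcemanager.resource-tracker.address"));
        }
    }

If that host:port does not resolve to the ResourceManager from the NodeManager machine, the repeated "Retrying connect" lines and the eventual shutdown of the NodeManager services above would be the expected symptom.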


On Tue, Jun 25, 2013 at 12:04 AM, Yuzhang Han <yu...@gmail.com>wrote:

> Thank you, Omkar.
>
> I didn't see other errors on datanode and namenode logs. My namenode 50070
> interface shows
>
> Configured Capacity : 393.72 GB
> DFS Used : 60.86 GB
> Non DFS Used : 137.51 GB
> DFS Remaining : 195.35 GB
> DFS Used% : 15.46%
> DFS Remaining% : 49.62%
> Block Pool Used : 60.86 GB
> Block Pool Used% : 15.46%
> DataNodes usages (Min % / Median % / Max % / stdev %) : 14.55% / 16.37% / 16.37% / 0.91%
>
>
> It doesn't imply insufficient disk space, does it? Can you think of any
> other possible cause of the exceptions?
>
>
> On Mon, Jun 24, 2013 at 6:17 PM, Omkar Joshi <oj...@hortonworks.com>wrote:
>
>> Hi,
>>
>> I see there are 2 datanodes and for some reason namenode is not able to
>> create even single replica for requested blocks. are you sure the system on
>> which these datanodes are running have sufficient disk space? Do you see
>> any other errors in datanode/namenode logs?
>>
>> What must be happening is as file creation in hdfs is failing it is
>> marking that reduce attempt as failure and restarting it. Keep checking
>> namenode state when it reaches 67%.
>>
>> Thanks,
>> Omkar Joshi
>> *Hortonworks Inc.* <http://www.hortonworks.com>
>>
>>
>> On Mon, Jun 24, 2013 at 3:01 PM, Yuzhang Han <yu...@gmail.com>wrote:
>>
>>> Hello,
>>>
>>> I am using YARN. I get some exceptions at my namenode and datanode. They
>>> are thrown when my Reduce progress gets 67%. Then, reduce phase is
>>> restarted from 0% several times, but always restarts at this point. Can
>>> someone tell me what I should do? Many thanks!
>>>
>>>
>>> Namenode log:
>>>
>>> 2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
>>> 2013-06-24 19:08:50,349 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
>>> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>>> 2013-06-24 19:08:50,350 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>>> 2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.224.2.190:49375: error: java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>>> java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>>> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155)
>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>> 	at javax.security.auth.Subject.doAs(Subject.java:416)
>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
>>> 2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
>>> 2013-06-24 19:08:50,418 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
>>> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>>>
>>>
>>>
>>> Datanode log:
>>>
>>> 2013-06-24 19:25:54,695 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
>>> BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022,
>>> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
>>> 2013-06-24 19:25:54,699 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving
>>> BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /
>>> 10.35.99.108:59710 dest: /10.35.99.108:50010
>>> 2013-06-24 19:25:56,473 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for
>>> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
>>> java.io.IOException: Connection reset by peer
>>>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>>>     at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>>>     at
>>> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>>     at
>>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>>>     at
>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
>>>     at
>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
>>>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>>>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>>>     at java.io.DataInputStream.read(DataInputStream.java:149)
>>>     at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>>>     at
>>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
>>>     at
>>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
>>>     at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
>>>     at
>>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
>>>     at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
>>>     at java.lang.Thread.run(Thread.java:679)
>>> 2013-06-24 19:25:56,476 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
>>> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026,
>>> type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
>>>
>>
>>
>

RE: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Posted by Devaraj k <de...@huawei.com>.
Could you check this page for any of these possible causes in your cluster?

http://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo
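
One of the causes discussed on that page, and the one Omkar asked about above, is a datanode without enough usable space on its data directories, even when the aggregate DFS Remaining looks healthy (dfs.datanode.du.reserved, if set, also reduces what a datanode will accept). A minimal local check, run on each datanode machine; the paths below are hypothetical placeholders for whatever dfs.datanode.data.dir points at on that node:

    import java.io.File;

    // Illustrative only: report usable space on each configured data directory.
    // Replace the placeholder paths with the node's dfs.datanode.data.dir values.
    public class DataDirSpaceCheck {
        public static void main(String[] args) {
            String[] dataDirs = { "/data/1/dfs/dn", "/data/2/dfs/dn" };  // hypothetical
            long gb = 1024L * 1024L * 1024L;
            for (String dir : dataDirs) {
                File f = new File(dir);
                System.out.println(dir + " : usable " + (f.getUsableSpace() / gb)
                        + " GB of " + (f.getTotalSpace() / gb) + " GB");
            }
        }
    }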

Thanks
Devaraj k

From: Yuzhang Han [mailto:yuzhanghan1982@gmail.com]
Sent: 25 June 2013 09:34
To: user@hadoop.apache.org
Subject: Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Thank you, Omkar.
I didn't see other errors on datanode and namenode logs. My namenode 50070 interface shows
Configured Capacity : 393.72 GB
DFS Used : 60.86 GB
Non DFS Used : 137.51 GB
DFS Remaining : 195.35 GB
DFS Used% : 15.46%
DFS Remaining% : 49.62%
Block Pool Used : 60.86 GB
Block Pool Used% : 15.46%
DataNodes usages (Min % / Median % / Max % / stdev %) : 14.55% / 16.37% / 16.37% / 0.91%



It doesn't imply insufficient disk space, does it? Can you think of any other possible cause of the exceptions?

On Mon, Jun 24, 2013 at 6:17 PM, Omkar Joshi <oj...@hortonworks.com> wrote:
Hi,

I see there are 2 datanodes and for some reason namenode is not able to create even single replica for requested blocks. are you sure the system on which these datanodes are running have sufficient disk space? Do you see any other errors in datanode/namenode logs?

What must be happening is as file creation in hdfs is failing it is marking that reduce attempt as failure and restarting it. Keep checking namenode state when it reaches 67%.

Thanks,
Omkar Joshi
Hortonworks Inc.<http://www.hortonworks.com>

On Mon, Jun 24, 2013 at 3:01 PM, Yuzhang Han <yu...@gmail.com> wrote:
Hello,
I am using YARN. I get some exceptions at my namenode and datanode. They are thrown when my Reduce progress gets 67%. Then, reduce phase is restarted from 0% several times, but always restarts at this point. Can someone tell me what I should do? Many thanks!

Namenode log:

2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010<http://10.224.2.190:50010> is added to blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0

2013-06-24 19:08:50,349 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1

For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger

2013-06-24 19:08:50,350 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.

2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.224.2.190:49375<http://10.224.2.190:49375>: error: java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.

java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.

        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155)

        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)

        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)

        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)

        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)

        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)

        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)

        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)

        at java.security.AccessController.doPrivileged(Native Method)

        at javax.security.auth.Subject.doAs(Subject.java:416)

        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)

2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010<http://10.224.2.190:50010> is added to blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0

2013-06-24 19:08:50,418 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1

For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger


Datanode log:

2013-06-24 19:25:54,695 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2013-06-24 19:25:54,699 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /10.35.99.108:59710<http://10.35.99.108:59710> dest: /10.35.99.108:50010<http://10.35.99.108:50010>
2013-06-24 19:25:56,473 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
    at sun.nio.ch.IOUtil.read(IOUtil.java:224)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
    at java.lang.Thread.run(Thread.java:679)
2013-06-24 19:25:56,476 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.



RE: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Posted by Devaraj k <de...@huawei.com>.
Could you check this page for any of these possible cause in your cluster.

http://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo

Thanks
Devaraj k

From: Yuzhang Han [mailto:yuzhanghan1982@gmail.com]
Sent: 25 June 2013 09:34
To: user@hadoop.apache.org
Subject: Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Thank you, Omkar.
I didn't see other errors on datanode and namenode logs. My namenode 50070 interface shows
Configured Capacity

:

393.72 GB

DFS Used

:

60.86 GB

Non DFS Used

:

137.51 GB

DFS Remaining

:

195.35 GB

DFS Used%

:

15.46%

DFS Remaining%

:

49.62%

Block Pool Used

:

60.86 GB

Block Pool Used%

:

15.46%

DataNodes usages

:

Min %

Median %

Max %

stdev %


14.55%

16.37%

16.37%

0.91%



It doesn't imply insufficient disk space, does it? Can you think of any other possible cause of the exceptions?

On Mon, Jun 24, 2013 at 6:17 PM, Omkar Joshi <oj...@hortonworks.com>> wrote:
Hi,

I see there are 2 datanodes and for some reason namenode is not able to create even single replica for requested blocks. are you sure the system on which these datanodes are running have sufficient disk space? Do you see any other errors in datanode/namenode logs?

What must be happening is as file creation in hdfs is failing it is marking that reduce attempt as failure and restarting it. Keep checking namenode state when it reaches 67%.

Thanks,
Omkar Joshi
Hortonworks Inc.<http://www.hortonworks.com>

On Mon, Jun 24, 2013 at 3:01 PM, Yuzhang Han <yu...@gmail.com>> wrote:
Hello,
I am using YARN. I get some exceptions at my namenode and datanode. They are thrown when my Reduce progress gets 67%. Then, reduce phase is restarted from 0% several times, but always restarts at this point. Can someone tell me what I should do? Many thanks!

Namenode log:

2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010<http://10.224.2.190:50010> is added to blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0

2013-06-24 19:08:50,349 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1

For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger

2013-06-24 19:08:50,350 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.

2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.224.2.190:49375<http://10.224.2.190:49375>: error: java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.

java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.

        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155)

        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)

        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)

        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)

        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)

        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)

        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)

        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)

        at java.security.AccessController.doPrivileged(Native Method)

        at javax.security.auth.Subject.doAs(Subject.java:416)

        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)

2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010<http://10.224.2.190:50010> is added to blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0

2013-06-24 19:08:50,418 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1

For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger


Datanode log:

2013-06-24 19:25:54,695 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2013-06-24 19:25:54,699 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /10.35.99.108:59710<http://10.35.99.108:59710> dest: /10.35.99.108:50010<http://10.35.99.108:50010>
2013-06-24 19:25:56,473 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
    at sun.nio.ch.IOUtil.read(IOUtil.java:224)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
    at java.lang.Thread.run(Thread.java:679)
2013-06-24 19:25:56,476 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.



Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Posted by Yuzhang Han <yu...@gmail.com>.
Thank you, Omkar.

I didn't see other errors on datanode and namenode logs. My namenode 50070
interface shows

Configured Capacity : 393.72 GB
DFS Used : 60.86 GB
Non DFS Used : 137.51 GB
DFS Remaining : 195.35 GB
DFS Used% : 15.46%
DFS Remaining% : 49.62%
Block Pool Used : 60.86 GB
Block Pool Used% : 15.46%
DataNodes usages (Min % / Median % / Max % / stdev %) : 14.55% / 16.37% / 16.37% / 0.91%


It doesn't imply insufficient disk space, does it? Can you think of any
other possible cause of the exceptions?
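
For what it is worth, those aggregate figures are self-consistent (60.86 + 137.51 + 195.35 = 393.72 GB), so if space is involved it is more likely a per-datanode or per-volume limit than a cluster-wide shortage. A minimal, illustrative sketch (assuming an HDFS client configuration on the classpath; fetching the per-datanode report may require HDFS superuser privileges) that prints both the aggregate numbers and a per-datanode report from the client side:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FsStatus;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    // Illustrative only: aggregate capacity/used/remaining, then one report per
    // datanode, so a single full node or volume is visible even when the
    // cluster-wide "DFS Remaining" looks fine.
    public class DfsSpaceReport {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            FsStatus status = fs.getStatus();
            System.out.println("capacity=" + status.getCapacity()
                    + " used=" + status.getUsed()
                    + " remaining=" + status.getRemaining());

            if (fs instanceof DistributedFileSystem) {
                for (DatanodeInfo dn : ((DistributedFileSystem) fs).getDataNodeStats()) {
                    System.out.println(dn.getDatanodeReport());
                }
            }
        }
    }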


On Mon, Jun 24, 2013 at 6:17 PM, Omkar Joshi <oj...@hortonworks.com> wrote:

> Hi,
>
> I see there are 2 datanodes and for some reason namenode is not able to
> create even single replica for requested blocks. are you sure the system on
> which these datanodes are running have sufficient disk space? Do you see
> any other errors in datanode/namenode logs?
>
> What must be happening is as file creation in hdfs is failing it is
> marking that reduce attempt as failure and restarting it. Keep checking
> namenode state when it reaches 67%.
>
> Thanks,
> Omkar Joshi
> *Hortonworks Inc.* <http://www.hortonworks.com>
>
>
> On Mon, Jun 24, 2013 at 3:01 PM, Yuzhang Han <yu...@gmail.com>wrote:
>
>> Hello,
>>
>> I am using YARN. I get some exceptions at my namenode and datanode. They
>> are thrown when my Reduce progress gets 67%. Then, reduce phase is
>> restarted from 0% several times, but always restarts at this point. Can
>> someone tell me what I should do? Many thanks!
>>
>>
>> Namenode log:
>>
>> 2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
>> 2013-06-24 19:08:50,349 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
>> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>> 2013-06-24 19:08:50,350 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>> 2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.224.2.190:49375: error: java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>> java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155)
>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
>> 	at java.security.AccessController.doPrivileged(Native Method)
>> 	at javax.security.auth.Subject.doAs(Subject.java:416)
>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
>> 2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
>> 2013-06-24 19:08:50,418 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
>> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>>
>>
>>
>> Datanode log:
>>
>> 2013-06-24 19:25:54,695 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
>> BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022,
>> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
>> 2013-06-24 19:25:54,699 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving
>> BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /
>> 10.35.99.108:59710 dest: /10.35.99.108:50010
>> 2013-06-24 19:25:56,473 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for
>> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
>> java.io.IOException: Connection reset by peer
>>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>>     at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>>     at
>> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>     at
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>>     at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
>>     at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
>>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>>     at java.io.DataInputStream.read(DataInputStream.java:149)
>>     at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>>     at
>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>>     at
>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
>>     at
>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>>     at
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
>>     at
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
>>     at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
>>     at
>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
>>     at
>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
>>     at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
>>     at java.lang.Thread.run(Thread.java:679)
>> 2013-06-24 19:25:56,476 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
>> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026,
>> type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
>>
>
>

Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Posted by Yuzhang Han <yu...@gmail.com>.
Thank you, Omkar.

I didn't see other errors on datanode and namenode logs. My namenode 50070
interface shows

Configured Capacity : 393.72 GB DFS Used : 60.86 GB Non DFS Used :
137.51 GB DFS
Remaining : 195.35 GB DFS Used% : 15.46% DFS Remaining% : 49.62% Block Pool
Used : 60.86 GB Block Pool Used% : 15.46% DataNodes usages : Min %
Median %Max %stdev %

14.55% 16.37% 16.37% 0.91%


It doesn't imply insufficient disk space, does it? Can you think of any
other possible cause of the exceptions?


On Mon, Jun 24, 2013 at 6:17 PM, Omkar Joshi <oj...@hortonworks.com> wrote:

> Hi,
>
> I see there are 2 datanodes and for some reason namenode is not able to
> create even single replica for requested blocks. are you sure the system on
> which these datanodes are running have sufficient disk space? Do you see
> any other errors in datanode/namenode logs?
>
> What must be happening is as file creation in hdfs is failing it is
> marking that reduce attempt as failure and restarting it. Keep checking
> namenode state when it reaches 67%.
>
> Thanks,
> Omkar Joshi
> *Hortonworks Inc.* <http://www.hortonworks.com>
>
>
> On Mon, Jun 24, 2013 at 3:01 PM, Yuzhang Han <yu...@gmail.com>wrote:
>
>> Hello,
>>
>> I am using YARN. I get some exceptions at my namenode and datanode. They
>> are thrown when my Reduce progress gets 67%. Then, reduce phase is
>> restarted from 0% several times, but always restarts at this point. Can
>> someone tell me what I should do? Many thanks!
>>
>>
>> Namenode log:
>>
>> 2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
>> 2013-06-24 19:08:50,349 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
>> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>> 2013-06-24 19:08:50,350 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>> 2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.224.2.190:49375: error: java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>> java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155)
>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
>> 	at java.security.AccessController.doPrivileged(Native Method)
>> 	at javax.security.auth.Subject.doAs(Subject.java:416)
>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
>> 2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
>> 2013-06-24 19:08:50,418 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
>> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>>
>>
>>
>> Datanode log:
>>
>> 2013-06-24 19:25:54,695 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
>> BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022,
>> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
>> 2013-06-24 19:25:54,699 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving
>> BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /
>> 10.35.99.108:59710 dest: /10.35.99.108:50010
>> 2013-06-24 19:25:56,473 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for
>> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
>> java.io.IOException: Connection reset by peer
>>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>>     at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>>     at
>> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>     at
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>>     at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
>>     at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
>>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>>     at java.io.DataInputStream.read(DataInputStream.java:149)
>>     at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>>     at
>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>>     at
>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
>>     at
>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>>     at
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
>>     at
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
>>     at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
>>     at
>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
>>     at
>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
>>     at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
>>     at java.lang.Thread.run(Thread.java:679)
>> 2013-06-24 19:25:56,476 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
>> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026,
>> type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
>>
>
>

Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Posted by Yuzhang Han <yu...@gmail.com>.
Thank you, Omkar.

I didn't see other errors on datanode and namenode logs. My namenode 50070
interface shows

Configured Capacity : 393.72 GB DFS Used : 60.86 GB Non DFS Used :
137.51 GB DFS
Remaining : 195.35 GB DFS Used% : 15.46% DFS Remaining% : 49.62% Block Pool
Used : 60.86 GB Block Pool Used% : 15.46% DataNodes usages : Min %
Median %Max %stdev %

14.55% 16.37% 16.37% 0.91%


It doesn't imply insufficient disk space, does it? Can you think of any
other possible cause of the exceptions?


On Mon, Jun 24, 2013 at 6:17 PM, Omkar Joshi <oj...@hortonworks.com> wrote:

> Hi,
>
> I see there are 2 datanodes and for some reason namenode is not able to
> create even single replica for requested blocks. are you sure the system on
> which these datanodes are running have sufficient disk space? Do you see
> any other errors in datanode/namenode logs?
>
> What must be happening is as file creation in hdfs is failing it is
> marking that reduce attempt as failure and restarting it. Keep checking
> namenode state when it reaches 67%.
>
> Thanks,
> Omkar Joshi
> *Hortonworks Inc.* <http://www.hortonworks.com>
>
>
> On Mon, Jun 24, 2013 at 3:01 PM, Yuzhang Han <yu...@gmail.com>wrote:
>
>> Hello,
>>
>> I am using YARN. I get some exceptions at my namenode and datanode. They
>> are thrown when my Reduce progress gets 67%. Then, reduce phase is
>> restarted from 0% several times, but always restarts at this point. Can
>> someone tell me what I should do? Many thanks!
>>
>>
>> Namenode log:
>>
>> 2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
>> 2013-06-24 19:08:50,349 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
>> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>> 2013-06-24 19:08:50,350 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>> 2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.224.2.190:49375: error: java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>> java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155)
>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
>> 	at java.security.AccessController.doPrivileged(Native Method)
>> 	at javax.security.auth.Subject.doAs(Subject.java:416)
>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
>> 2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
>> 2013-06-24 19:08:50,418 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
>> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>>
>>
>>
>> Datanode log:
>>
>> 2013-06-24 19:25:54,695 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
>> BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022,
>> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
>> 2013-06-24 19:25:54,699 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving
>> BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /
>> 10.35.99.108:59710 dest: /10.35.99.108:50010
>> 2013-06-24 19:25:56,473 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for
>> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
>> java.io.IOException: Connection reset by peer
>>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>>     at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>>     at
>> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>     at
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>>     at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
>>     at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
>>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>>     at java.io.DataInputStream.read(DataInputStream.java:149)
>>     at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>>     at
>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>>     at
>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
>>     at
>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>>     at
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
>>     at
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
>>     at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
>>     at
>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
>>     at
>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
>>     at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
>>     at java.lang.Thread.run(Thread.java:679)
>> 2013-06-24 19:25:56,476 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
>> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026,
>> type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
>>
>
>

Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Posted by Yuzhang Han <yu...@gmail.com>.
Thank you, Omkar.

I didn't see other errors on datanode and namenode logs. My namenode 50070
interface shows

Configured Capacity : 393.72 GB DFS Used : 60.86 GB Non DFS Used :
137.51 GB DFS
Remaining : 195.35 GB DFS Used% : 15.46% DFS Remaining% : 49.62% Block Pool
Used : 60.86 GB Block Pool Used% : 15.46% DataNodes usages : Min %
Median %Max %stdev %

14.55% 16.37% 16.37% 0.91%


It doesn't imply insufficient disk space, does it? Can you think of any
other possible cause of the exceptions?


On Mon, Jun 24, 2013 at 6:17 PM, Omkar Joshi <oj...@hortonworks.com> wrote:

> Hi,
>
> I see there are 2 datanodes and for some reason namenode is not able to
> create even single replica for requested blocks. are you sure the system on
> which these datanodes are running have sufficient disk space? Do you see
> any other errors in datanode/namenode logs?
>
> What must be happening is as file creation in hdfs is failing it is
> marking that reduce attempt as failure and restarting it. Keep checking
> namenode state when it reaches 67%.
>
> Thanks,
> Omkar Joshi
> *Hortonworks Inc.* <http://www.hortonworks.com>
>
>
> On Mon, Jun 24, 2013 at 3:01 PM, Yuzhang Han <yu...@gmail.com>wrote:
>
>> Hello,
>>
>> I am using YARN. I get some exceptions at my namenode and datanode. They
>> are thrown when my Reduce progress gets 67%. Then, reduce phase is
>> restarted from 0% several times, but always restarts at this point. Can
>> someone tell me what I should do? Many thanks!
>>
>>
>> Namenode log:
>>
>> 2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
>> 2013-06-24 19:08:50,349 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
>> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>> 2013-06-24 19:08:50,350 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>> 2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.224.2.190:49375: error: java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>> java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
>> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155)
>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
>> 	at java.security.AccessController.doPrivileged(Native Method)
>> 	at javax.security.auth.Subject.doAs(Subject.java:416)
>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
>> 2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
>> 2013-06-24 19:08:50,418 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
>> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>>
>>
>>
>> Datanode log:
>>
>> 2013-06-24 19:25:54,695 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
>> BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022,
>> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
>> 2013-06-24 19:25:54,699 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving
>> BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /
>> 10.35.99.108:59710 dest: /10.35.99.108:50010
>> 2013-06-24 19:25:56,473 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for
>> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
>> java.io.IOException: Connection reset by peer
>>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>>     at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>>     at
>> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>     at
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>>     at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
>>     at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
>>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>>     at java.io.DataInputStream.read(DataInputStream.java:149)
>>     at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>>     at
>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>>     at
>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
>>     at
>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>>     at
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
>>     at
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
>>     at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
>>     at
>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
>>     at
>> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
>>     at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
>>     at java.lang.Thread.run(Thread.java:679)
>> 2013-06-24 19:25:56,476 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
>> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026,
>> type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
>>
>
>

Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Posted by Omkar Joshi <oj...@hortonworks.com>.
Hi,

I see there are 2 datanodes, and for some reason the namenode is not able to
create even a single replica for the requested blocks. Are you sure the
systems on which these datanodes are running have sufficient disk space? Do
you see any other errors in the datanode/namenode logs?

What must be happening is that, because file creation in HDFS is failing,
the framework is marking that reduce attempt as failed and restarting it.
Keep checking the namenode state when the job reaches 67%.
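
(One rough way to "keep checking" without sitting on the 50070 page is a
small polling loop against the same client API. This is only a sketch under
the assumption that the client configuration points at the cluster; the
10-second interval is arbitrary.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

public class WatchDfsRemaining {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml/hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Poll remaining DFS space while the job runs; a sudden drop towards
        // zero around the 67% mark would point at transient space exhaustion.
        while (true) {
            FsStatus s = fs.getStatus();
            System.out.printf("remaining: %d MB of %d MB%n",
                    s.getRemaining() >> 20, s.getCapacity() >> 20);
            Thread.sleep(10000L);
        }
    }
}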

Thanks,
Omkar Joshi
*Hortonworks Inc.* <http://www.hortonworks.com>


On Mon, Jun 24, 2013 at 3:01 PM, Yuzhang Han <yu...@gmail.com>wrote:

> Hello,
>
> I am using YARN. I get some exceptions at my namenode and datanode. They
> are thrown when my Reduce progress gets 67%. Then, reduce phase is
> restarted from 0% several times, but always restarts at this point. Can
> someone tell me what I should do? Many thanks!
>
>
> Namenode log:
>
> 2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
> 2013-06-24 19:08:50,349 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
> 2013-06-24 19:08:50,350 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
> 2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.224.2.190:49375: error: java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
> java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:416)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
> 2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
> 2013-06-24 19:08:50,418 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>
>
>
> Datanode log:
>
> 2013-06-24 19:25:54,695 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
> BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022,
> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
> 2013-06-24 19:25:54,699 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving
> BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /
> 10.35.99.108:59710 dest: /10.35.99.108:50010
> 2013-06-24 19:25:56,473 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for
> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
> java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>     at
> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>     at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>     at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
>     at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>     at java.io.DataInputStream.read(DataInputStream.java:149)
>     at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>     at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>     at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
>     at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>     at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
>     at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
>     at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
>     at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
>     at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
>     at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
>     at java.lang.Thread.run(Thread.java:679)
> 2013-06-24 19:25:56,476 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026,
> type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
>

Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Posted by Ravi Prakash <ra...@ymail.com>.
I wonder if this is another instance of https://issues.apache.org/jira/browse/HDFS-1172. I had noticed this behavior when I ran Slive with extremely small file sizes.
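
(If it is the small-file pattern, it is easy to approximate outside of Slive.
The sketch below just writes a burst of tiny files with the standard
FileSystem API; the /tmp/smallfiles path, the file count and the 16-byte
payload are made-up values for illustration, not anything from this thread.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallFileBurst {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        byte[] payload = new byte[16]; // deliberately tiny files

        // Each tiny file still costs a full create()/addBlock()/write-pipeline
        // round trip on the namenode and datanodes, which is what stresses
        // block placement when many of them arrive in a burst.
        for (int i = 0; i < 10000; i++) {
            Path p = new Path("/tmp/smallfiles/file-" + i); // illustrative path
            FSDataOutputStream out = fs.create(p, true);
            try {
                out.write(payload);
            } finally {
                out.close();
            }
        }
        fs.close();
    }
}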




________________________________
 From: Yuzhang Han <yu...@gmail.com>
To: user@hadoop.apache.org 
Sent: Monday, June 24, 2013 3:01 PM
Subject: "could only be replicated to 0 nodes instead of minReplication" exception during job execution
 


Hello,


I am using YARN. I get some exceptions at my namenode and datanode. They are thrown when my Reduce progress gets 67%. Then, reduce phase is restarted from 0% several times, but always restarts at this point. Can someone tell me what I should do? Many thanks!


Namenode log:


2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
2013-06-24 19:08:50,349 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
2013-06-24 19:08:50,350 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.224.2.190:49375: error: java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:416)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
2013-06-24 19:08:50,418 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger


Datanode log:

2013-06-24 19:25:54,695 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2013-06-24 19:25:54,699 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /10.35.99.108:59710 dest: /10.35.99.108:50010
2013-06-24 19:25:56,473 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
    at sun.nio.ch.IOUtil.read(IOUtil.java:224)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
    at java.lang.Thread.run(Thread.java:679)
2013-06-24 19:25:56,476 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.

Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Posted by Ravi Prakash <ra...@ymail.com>.
I wonder if this is another instance of https://issues.apache.org/jira/browse/HDFS-1172 . I had noticed this behavior when I run Slive with extremely small file sizes. 




________________________________
 From: Yuzhang Han <yu...@gmail.com>
To: user@hadoop.apache.org 
Sent: Monday, June 24, 2013 3:01 PM
Subject: "could only be replicated to 0 nodes instead of minReplication" exception during job execution
 


Hello,


I am using YARN. I get some exceptions at my namenode and datanode. They are thrown when my Reduce progress gets 67%. Then, reduce phase is restarted from 0% several times, but always restarts at this point. Can someone tell me what I should do? Many thanks!


Namenode log:


2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
2013-06-24 19:08:50,349 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
2013-06-24 19:08:50,350 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.224.2.190:49375: error: java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation. at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454) at
 org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
2013-06-24 19:08:50,418 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger


Datanode log:

2013-06-24 19:25:54,695 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2013-06-24 19:25:54,699 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /10.35.99.108:59710 dest: /10.35.99.108:50010
2013-06-24 19:25:56,473 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
    at sun.nio.ch.IOUtil.read(IOUtil.java:224)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
    at java.lang.Thread.run(Thread.java:679)
2013-06-24 19:25:56,476 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.

Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Posted by Omkar Joshi <oj...@hortonworks.com>.
Hi,

I see there are 2 datanodes and for some reason namenode is not able to
create even single replica for requested blocks. are you sure the system on
which these datanodes are running have sufficient disk space? Do you see
any other errors in datanode/namenode logs?

What must be happening is as file creation in hdfs is failing it is marking
that reduce attempt as failure and restarting it. Keep checking namenode
state when it reaches 67%.

Thanks,
Omkar Joshi
*Hortonworks Inc.* <http://www.hortonworks.com>


On Mon, Jun 24, 2013 at 3:01 PM, Yuzhang Han <yu...@gmail.com>wrote:

> Hello,
>
> I am using YARN. I get some exceptions at my namenode and datanode. They
> are thrown when my Reduce progress gets 67%. Then, reduce phase is
> restarted from 0% several times, but always restarts at this point. Can
> someone tell me what I should do? Many thanks!
>
>
> Namenode log:
>
> 2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
> 2013-06-24 19:08:50,349 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
> 2013-06-24 19:08:50,350 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
> 2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.224.2.190:49375: error: java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
> java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:416)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
> 2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
> 2013-06-24 19:08:50,418 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>
>
>
> Datanode log:
>
> 2013-06-24 19:25:54,695 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
> BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022,
> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
> 2013-06-24 19:25:54,699 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving
> BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /
> 10.35.99.108:59710 dest: /10.35.99.108:50010
> 2013-06-24 19:25:56,473 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for
> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
> java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>     at
> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>     at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>     at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
>     at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>     at java.io.DataInputStream.read(DataInputStream.java:149)
>     at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>     at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>     at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
>     at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>     at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
>     at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
>     at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
>     at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
>     at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
>     at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
>     at java.lang.Thread.run(Thread.java:679)
> 2013-06-24 19:25:56,476 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026,
> type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
>

Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Posted by Omkar Joshi <oj...@hortonworks.com>.
Hi,

I see there are 2 datanodes and for some reason namenode is not able to
create even single replica for requested blocks. are you sure the system on
which these datanodes are running have sufficient disk space? Do you see
any other errors in datanode/namenode logs?

What must be happening is as file creation in hdfs is failing it is marking
that reduce attempt as failure and restarting it. Keep checking namenode
state when it reaches 67%.

Thanks,
Omkar Joshi
*Hortonworks Inc.* <http://www.hortonworks.com>


On Mon, Jun 24, 2013 at 3:01 PM, Yuzhang Han <yu...@gmail.com>wrote:

> Hello,
>
> I am using YARN. I get some exceptions at my namenode and datanode. They
> are thrown when my Reduce progress gets 67%. Then, reduce phase is
> restarted from 0% several times, but always restarts at this point. Can
> someone tell me what I should do? Many thanks!
>
>
> Namenode log:
>
> 2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
> 2013-06-24 19:08:50,349 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
> 2013-06-24 19:08:50,350 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
> 2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.224.2.190:49375: error: java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
> java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:416)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
> 2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
> 2013-06-24 19:08:50,418 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
> For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
>
>
>
> Datanode log:
>
> 2013-06-24 19:25:54,695 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
> BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022,
> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
> 2013-06-24 19:25:54,699 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving
> BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /
> 10.35.99.108:59710 dest: /10.35.99.108:50010
> 2013-06-24 19:25:56,473 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for
> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
> java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>     at
> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>     at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>     at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
>     at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>     at java.io.DataInputStream.read(DataInputStream.java:149)
>     at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>     at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>     at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
>     at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>     at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
>     at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
>     at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
>     at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
>     at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
>     at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
>     at java.lang.Thread.run(Thread.java:679)
> 2013-06-24 19:25:56,476 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
> BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026,
> type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
>

Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Posted by Ravi Prakash <ra...@ymail.com>.
I wonder if this is another instance of https://issues.apache.org/jira/browse/HDFS-1172 . I had noticed this behavior when I run Slive with extremely small file sizes. 




________________________________
 From: Yuzhang Han <yu...@gmail.com>
To: user@hadoop.apache.org 
Sent: Monday, June 24, 2013 3:01 PM
Subject: "could only be replicated to 0 nodes instead of minReplication" exception during job execution
 


Hello,


I am using YARN. I get some exceptions at my namenode and datanode. They are thrown when my Reduce progress gets 67%. Then, reduce phase is restarted from 0% several times, but always restarts at this point. Can someone tell me what I should do? Many thanks!


Namenode log:


2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
2013-06-24 19:08:50,349 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
2013-06-24 19:08:50,350 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.224.2.190:49375: error: java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation. at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454) at
 org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
2013-06-24 19:08:50,418 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger


Datanode log:

2013-06-24 19:25:54,695 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2013-06-24 19:25:54,699 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /10.35.99.108:59710 dest: /10.35.99.108:50010
2013-06-24 19:25:56,473 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
    at sun.nio.ch.IOUtil.read(IOUtil.java:224)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
    at java.lang.Thread.run(Thread.java:679)
2013-06-24 19:25:56,476 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.

Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Posted by Ravi Prakash <ra...@ymail.com>.
I wonder if this is another instance of https://issues.apache.org/jira/browse/HDFS-1172 . I had noticed this behavior when I run Slive with extremely small file sizes. 




________________________________
 From: Yuzhang Han <yu...@gmail.com>
To: user@hadoop.apache.org 
Sent: Monday, June 24, 2013 3:01 PM
Subject: "could only be replicated to 0 nodes instead of minReplication" exception during job execution
 


Hello,


I am using YARN. I get some exceptions at my namenode and datanode. They are thrown when my Reduce progress gets 67%. Then, reduce phase is restarted from 0% several times, but always restarts at this point. Can someone tell me what I should do? Many thanks!


Namenode log:


2013-06-24 19:08:50,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_654446797771285606_5062{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
2013-06-24 19:08:50,349 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
2013-06-24 19:08:50,350 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
2013-06-24 19:08:50,353 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.224.2.190:49375: error: java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation.
java.io.IOException: File /output/_temporary/1/_temporary/attempt_1372090853102_0001_r_000002_0/part-00002 could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and no node(s) are excluded in this operation. at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2155) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454) at
 org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
2013-06-24 19:08:50,413 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.224.2.190:50010 is added to blk_8924314838535676494_5063{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.224.2.190:50010|RBW]]} size 0
2013-06-24 19:08:50,418 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger


Datanode log:

2013-06-24 19:25:54,695 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1724882733-10.10.79.145-1372090400593:blk_-2417373821601940925_6022, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2013-06-24 19:25:54,699 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1724882733-10.10.79.145-1372090400593:blk_3177955398059619584_6033 src: /10.35.99.108:59710 dest: /10.35.99.108:50010
2013-06-24 19:25:56,473 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
    at sun.nio.ch.IOUtil.read(IOUtil.java:224)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:644)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
    at java.lang.Thread.run(Thread.java:679)
2013-06-24 19:25:56,476 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1724882733-10.10.79.145-1372090400593:blk_8751401862589207807_6026, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.

Re: "could only be replicated to 0 nodes instead of minReplication" exception during job execution

Posted by Omkar Joshi <oj...@hortonworks.com>.
Hi,

I see there are 2 datanodes and for some reason namenode is not able to
create even single replica for requested blocks. are you sure the system on
which these datanodes are running have sufficient disk space? Do you see
any other errors in datanode/namenode logs?

What must be happening is as file creation in hdfs is failing it is marking
that reduce attempt as failure and restarting it. Keep checking namenode
state when it reaches 67%.

Thanks,
Omkar Joshi
*Hortonworks Inc.* <http://www.hortonworks.com>

