Posted to hdfs-user@hadoop.apache.org by Murillo Flores <mu...@chaordic.com.br> on 2014/04/04 19:25:51 UTC

Exception in createBlockOutputStream trying to start application

Hello everybody,

I'm running a Hadoop 2.2.0 cluster with 120 "slaves", each running a datanode
and a nodemanager, and 1 "master" running the namenode, the secondary
namenode, the resource manager (YARN) and the JobHistoryServer. In this
cluster, my attempts to start an application that downloads all of its input
data from S3 occasionally fail with messages like the one below:

14/04/02 21:43:12 INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as
10.91.140.216:50010
 at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1166)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1088)
 at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
14/04/02 21:43:12 INFO hdfs.DFSClient: Abandoning
BP-1137777807-10.225.13.33-1396473453940:blk_1073741844_1020
14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to /
10.225.13.33:8020 from root sending #113
14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to /
10.225.13.33:8020 from root got value #113
14/04/02 21:43:12 DEBUG ipc.ProtobufRpcEngine: Call: abandonBlock took 2ms
14/04/02 21:43:12 INFO hdfs.DFSClient: Excluding datanode
10.91.140.216:50010
14/04/02 21:43:12 WARN hdfs.DFSClient: DataStreamer Exception
java.io.IOException: Unable to create new block.
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1100)
 at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
14/04/02 21:43:12 WARN hdfs.DFSClient: Could not get block locations.
Source file
"/tmp/hadoop-yarn/staging/root/.staging/job_1396473480140_0001/job.split" -
Aborting...
14/04/02 21:43:12 INFO mapreduce.JobSubmitter: Cleaning up the staging area
/tmp/hadoop-yarn/staging/root/.staging/job_1396473480140_0001
14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to /
10.225.13.33:8020 from root sending #114
14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to /
10.225.13.33:8020 from root got value #114
14/04/02 21:43:12 DEBUG ipc.ProtobufRpcEngine: Call: delete took 35ms
14/04/02 21:43:12 ERROR security.UserGroupInformation:
PriviledgedActionException as:root (auth:SIMPLE) cause:java.io.IOException:
Bad connect ack with firstBadLink as 10.91.140.216:50010
14/04/02 21:43:12 ERROR security.UserGroupInformation:
PriviledgedActionException as:root (auth:SIMPLE) cause:java.io.IOException:
Bad connect ack with firstBadLink as 10.91.140.216:50010
14/04/02 21:43:12 ERROR streaming.StreamJob: Error Launching job : Bad
connect ack with firstBadLink as 10.91.140.216:50010
 Streaming Command Failed!
14/04/02 21:43:12 ERROR hdfs.DFSClient: Failed to close file
/tmp/hadoop-yarn/staging/root/.staging/job_1396473480140_0001/job.split
java.io.IOException: Bad connect ack with firstBadLink as
10.91.140.216:50010
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1166)
 at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1088)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)


In the datanode logs, I'm seeing both "Connection timeout" and "Premature
EOF" exceptions. It looks like the first datanode that receives the data from
the namenode fails with an exception like the one below:

14/04/02 21:32:39 ERROR datanode.DataNode:
DataNode{data=FSDataset{dirpath='[/mnt/hadoop/current,
/mnt/data1/hadoop/current, /mnt/data2/hadoop/current,
/mnt/data3/hadoop/current]'}, localName='
ec2-54-80-176-6.compute-1.amazonaws.com:50010',
storageID='DS-1182379849-10.91.166.67-50010-1396473477988',
xmitsInProgress=0}:Exception transfering block
BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 to mirror
10.65.0.211:50010: java.net.SocketTimeoutException: 105000 millis timeout
while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.91.166.67:37776 remote=/
10.65.0.211:50010]
14/04/02 21:32:39 INFO datanode.DataNode: opWriteBlock
BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 received
exception java.net.SocketTimeoutException: 105000 millis timeout while
waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.91.166.67:37776 remote=/
10.65.0.211:50010]
14/04/02 21:32:39 ERROR datanode.DataNode:
ec2-54-80-176-6.compute-1.amazonaws.com:50010:DataXceiver error processing
WRITE_BLOCK operation  src: /10.225.13.33:36287 dest: /10.91.166.67:50010
java.net.SocketTimeoutException: 105000 millis timeout while waiting for
channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.91.166.67:37776 remote=/
10.65.0.211:50010]
 at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
 at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
 at java.io.FilterInputStream.read(FilterInputStream.java:83)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
 at
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1490)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:511)


And the other datanodes handling the same block (receiving it from another
datanode instead of the namenode?) fail with this:

14/04/02 21:30:54 DEBUG datanode.DataNode: opWriteBlock:
stage=PIPELINE_SETUP_CREATE, clientname=DFSClient_NONMAPREDUCE_1796898153_1
  block  =BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012,
newGs=0, bytesRcvd=[0, 0]
  targets=[10.91.165.2:50010, 10.31.188.219:50010, 10.91.137.130:50010,
10.91.133.28:50010, 10.91.140.216:50010, 10.65.2.239:50010,
10.91.134.151:50010, 10.91.132.32:50010]; pipelineSize=10, srcDataNode=:0
14/04/02 21:30:54 DEBUG datanode.DataNode: isDatanode=false, isClient=true,
isTransfer=false
14/04/02 21:30:54 DEBUG datanode.DataNode: writeBlock receive buf size
131071 tcp no delay true
14/04/02 21:30:54 INFO datanode.DataNode: Receiving
BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 src: /
10.91.166.67:37776 dest: /10.65.0.211:50010
14/04/02 21:30:54 DEBUG datanode.DataNode: BlockReceiver:
BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012
  isClient  =true, clientname=DFSClient_NONMAPREDUCE_1796898153_1
  isDatanode=false, srcDataNode=:0
  inAddr=/10.91.166.67:37776, myAddr=/10.65.0.211:50010
  cachingStrategy = CachingStrategy(dropBehind=null, readahead=null)
14/04/02 21:30:54 DEBUG datanode.DataNode: writeTo blockfile is
/mnt/hadoop/current/BP-1137777807-10.225.13.33-1396473453940/current/rbw/blk_1073741836
of size 0
14/04/02 21:30:54 DEBUG datanode.DataNode: writeTo metafile is
/mnt/hadoop/current/BP-1137777807-10.225.13.33-1396473453940/current/rbw/blk_1073741836_1012.meta
of size 0
 14/04/02 21:30:54 DEBUG datanode.DataNode: Connecting to datanode
10.91.165.2:50010
14/04/02 21:30:55 DEBUG ipc.Client: IPC Client (1818694875) connection to
ec2-54-81-119-149.compute-1.amazonaws.com/10.225.13.33:8020 from hadoop
sending #234
14/04/02 21:30:55 DEBUG ipc.Client: IPC Client (1818694875) connection to
ec2-54-81-119-149.compute-1.amazonaws.com/10.225.13.33:8020 from hadoop got
value #234
14/04/02 21:30:55 DEBUG ipc.ProtobufRpcEngine: Call:
blockReceivedAndDeleted took 3ms
--
14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
metricssystem.MetricsSystem.Sink_gangliaQsize, type int32, value 0, slope
both from hostname ec2-54-80-198-241.compute-1.amazonaws.com
14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
metricssystem.MetricsSystem.SnapshotAvgTime, type double, value
0.14285714285714288, slope both from hostname
ec2-54-80-198-241.compute-1.amazonaws.com
14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
metricssystem.MetricsSystem.Sink_gangliaNumOps, type float, value 87, slope
positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
metricssystem.MetricsSystem.NumActiveSinks, type int32, value 1, slope both
from hostname ec2-54-80-198-241.compute-1.amazonaws.com
14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
metricssystem.MetricsSystem.DroppedPubAll, type float, value 0, slope
positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
metricssystem.MetricsSystem.SnapshotNumOps, type float, value 615, slope
positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
metricssystem.MetricsSystem.NumAllSinks, type int32, value 0, slope both
from hostname ec2-54-80-198-241.compute-1.amazonaws.com
14/04/02 21:32:39 DEBUG impl.MetricsSinkAdapter: Done
14/04/02 21:32:41 INFO datanode.DataNode: Datanode 8 got response for
connect ack  from downstream datanode with firstbadlink as
10.91.133.28:50010
14/04/02 21:32:41 INFO datanode.DataNode: Datanode 8 forwarding connect ack
to upstream firstbadlink is 10.91.133.28:50010
 14/04/02 21:32:41 INFO datanode.DataNode: Exception for
BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012
java.io.IOException: Premature EOF from inputStream
 at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
 at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
 at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
 at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
 at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
 at java.lang.Thread.run(Thread.java:744)
14/04/02 21:32:41 INFO datanode.DataNode: PacketResponder:
BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012,
type=HAS_DOWNSTREAM_IN_PIPELINE: Thread is interrupted.
14/04/02 21:32:41 INFO datanode.DataNode: PacketResponder:
BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012,
type=HAS_DOWNSTREAM_IN_PIPELINE terminating
14/04/02 21:32:41 INFO datanode.DataNode: opWriteBlock
BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 received
exception java.io.IOException: Premature EOF from inputStream
14/04/02 21:32:41 ERROR datanode.DataNode:
ec2-54-80-198-241.compute-1.amazonaws.com:50010:DataXceiver error
processing WRITE_BLOCK operation  src: /10.91.166.67:37776 dest: /
10.65.0.211:50010
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
 at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
 at
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
 at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
 at
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)



I've already changed some configuration settings trying to prevent these
exceptions, like the ones below, and increased the maximum number of open
file descriptors and processes (ulimit). It's worth noting that none of them
have helped so far :/

dfs.datanode.max.xcievers -> 4096
dfs.client.block.write.retries -> 12
dfs.namenode.handler.count -> 64
dfs.replication -> 3
ulimit:
* soft nofile 65536
* hard nofile 65536
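
For reference, here's roughly how the HDFS settings above look in my
hdfs-site.xml (a sketch rather than the exact file; as far as I understand,
dfs.datanode.max.xcievers is the legacy spelling that Hadoop 2.x maps to
dfs.datanode.max.transfer.threads):

  <property>
    <!-- legacy name: dfs.datanode.max.xcievers -->
    <name>dfs.datanode.max.transfer.threads</name>
    <value>4096</value>
  </property>
  <property>
    <name>dfs.client.block.write.retries</name>
    <value>12</value>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>64</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>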

Does anyone have any clue about what could be going on in my cluster, or
where else I could look for some helpful information?
Thanks.

-- 
Murillo

Chaordic
www.chaordic.com.br
55 48 3232.3200

Re: Exception in createBlockOutputStream trying to start application

Posted by Murillo Flores <mu...@chaordic.com.br>.
Thanks for your answer, Azuryy.
Do you know of any configuration that we could tune to be more tolerant of
these errors? I'm not sure that 'dfs.client.block.write.retries' will serve
this purpose.
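
For what it's worth, the 105000 ms timeouts in the datanode logs make me
suspect the socket timeouts rather than the write retries. A sketch of what
I'm considering raising in hdfs-site.xml, assuming the Hadoop 2.x defaults of
60 s for reads and 480 s for writes (the values below are only illustrative;
I haven't verified that they actually help):

  <property>
    <!-- read timeout used by clients and datanode-to-datanode transfers;
         default 60000 ms as far as I can tell -->
    <name>dfs.client.socket-timeout</name>
    <value>180000</value>
  </property>
  <property>
    <!-- write timeout for datanode sockets; default 480000 ms -->
    <name>dfs.datanode.socket.write.timeout</name>
    <value>960000</value>
  </property>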




2014-04-04 19:44 GMT-03:00 Azuryy <az...@gmail.com>:

> Hi Murillo,
>
> Generally, EOF errors are caused by the network.
>
> You can ignore this exception; it just tells you that one DN in the pipeline
> is bad, and the DFSClient will avoid it next time.
>
>
> Sent from my iPhone5s


-- 
Murillo

Chaordic
www.chaordic.com.br
55 48 3232.3200

>  at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>  at java.io.FilterInputStream.read(FilterInputStream.java:83)
> at java.io.FilterInputStream.read(FilterInputStream.java:83)
>  at
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1490)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:511)
>
>
> And the other datanodes dealing with the same block (receiving it from
> another datanode instead of the namenode?) fail with this:
>
> 14/04/02 21:30:54 DEBUG datanode.DataNode: opWriteBlock:
> stage=PIPELINE_SETUP_CREATE, clientname=DFSClient_NONMAPREDUCE_1796898153_1
>   block  =BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012,
> newGs=0, bytesRcvd=[0, 0]
>   targets=[10.91.165.2:50010, 10.31.188.219:50010, 10.91.137.130:50010,
> 10.91.133.28:50010, 10.91.140.216:50010, 10.65.2.239:50010,
> 10.91.134.151:50010, 10.91.132.32:50010]; pipelineSize=10, srcDataNode=:0
> 14/04/02 21:30:54 DEBUG datanode.DataNode: isDatanode=false,
> isClient=true, isTransfer=false
> 14/04/02 21:30:54 DEBUG datanode.DataNode: writeBlock receive buf size
> 131071 tcp no delay true
> 14/04/02 21:30:54 INFO datanode.DataNode: Receiving
> BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 src: /
> 10.91.166.67:37776 dest: /10.65.0.211:50010
> 14/04/02 21:30:54 DEBUG datanode.DataNode: BlockReceiver:
> BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012
>   isClient  =true, clientname=DFSClient_NONMAPREDUCE_1796898153_1
>   isDatanode=false, srcDataNode=:0
>   inAddr=/10.91.166.67:37776, myAddr=/10.65.0.211:50010
>   cachingStrategy = CachingStrategy(dropBehind=null, readahead=null)
> 14/04/02 21:30:54 DEBUG datanode.DataNode: writeTo blockfile is
> /mnt/hadoop/current/BP-1137777807-10.225.13.33-1396473453940/current/rbw/blk_1073741836
> of size 0
> 14/04/02 21:30:54 DEBUG datanode.DataNode: writeTo metafile is
> /mnt/hadoop/current/BP-1137777807-10.225.13.33-1396473453940/current/rbw/blk_1073741836_1012.meta
> of size 0
>  14/04/02 21:30:54 DEBUG datanode.DataNode: Connecting to datanode
> 10.91.165.2:50010
> 14/04/02 21:30:55 DEBUG ipc.Client: IPC Client (1818694875) connection to
> ec2-54-81-119-149.compute-1.amazonaws.com/10.225.13.33:8020 from hadoop
> sending #234
> 14/04/02 21:30:55 DEBUG ipc.Client: IPC Client (1818694875) connection to
> ec2-54-81-119-149.compute-1.amazonaws.com/10.225.13.33:8020 from hadoop
> got value #234
> 14/04/02 21:30:55 DEBUG ipc.ProtobufRpcEngine: Call:
> blockReceivedAndDeleted took 3ms
> --
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
> metricssystem.MetricsSystem.Sink_gangliaQsize, type int32, value 0, slope
> both from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
> metricssystem.MetricsSystem.SnapshotAvgTime, type double, value
> 0.14285714285714288, slope both from hostname
> ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
> metricssystem.MetricsSystem.Sink_gangliaNumOps, type float, value 87, slope
> positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
> metricssystem.MetricsSystem.NumActiveSinks, type int32, value 1, slope both
> from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
> metricssystem.MetricsSystem.DroppedPubAll, type float, value 0, slope
> positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
> metricssystem.MetricsSystem.SnapshotNumOps, type float, value 615, slope
> positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric
> metricssystem.MetricsSystem.NumAllSinks, type int32, value 0, slope both
> from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG impl.MetricsSinkAdapter: Done
> 14/04/02 21:32:41 INFO datanode.DataNode: Datanode 8 got response for
> connect ack  from downstream datanode with firstbadlink as
> 10.91.133.28:50010
> 14/04/02 21:32:41 INFO datanode.DataNode: Datanode 8 forwarding connect
> ack to upstream firstbadlink is 10.91.133.28:50010
>  14/04/02 21:32:41 INFO datanode.DataNode: Exception for
> BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012
> java.io.IOException: Premature EOF from inputStream
>  at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>  at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>  at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
> at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
>  at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
>  at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
>  at java.lang.Thread.run(Thread.java:744)
> 14/04/02 21:32:41 INFO datanode.DataNode: PacketResponder:
> BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012,
> type=HAS_DOWNSTREAM_IN_PIPELINE: Thread is interrupted.
> 14/04/02 21:32:41 INFO datanode.DataNode: PacketResponder:
> BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012,
> type=HAS_DOWNSTREAM_IN_PIPELINE terminating
> 14/04/02 21:32:41 INFO datanode.DataNode: opWriteBlock
> BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 received
> exception java.io.IOException: Premature EOF from inputStream
> 14/04/02 21:32:41 ERROR datanode.DataNode:
> ec2-54-80-198-241.compute-1.amazonaws.com:50010:DataXceiver error
> processing WRITE_BLOCK operation  src: /10.91.166.67:37776 dest: /
> 10.65.0.211:50010
> java.io.IOException: Premature EOF from inputStream
> at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
>  at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>  at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
> at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
>  at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
>  at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
>
>
>
> I've already changed some configurations trying to prevent these exceptions
> from happening, like the ones below, and increased the maximum number of open
> file descriptors and processes (ulimit). It's worth noting that none of them
> have helped so far :/
>
> dfs.datanode.max.xcievers -> 4096
> dfs.client.block.write.retries -> 12
> dfs.namenode.handler.count -> 64
> dfs.replication -> 3
> ulimit:
> * soft nofile 65536
> * hard nofile 65536
>
> Does anyone have any clue about what could be going on at my cluster, or
> where else could I take a look to get some helpful information?
> Thanks.
>
> --
> *Murillo*
>
> Chaordic
> *www.chaordic.com.br <http://www.chaordic.com.br/>*
> 55 48 3232.3200
>
>
>
> --
> *Murillo*
>
> Chaordic
> *www.chaordic.com.br <http://www.chaordic.com.br/>*
> 55 48 3232.3200
>
>


-- 
*Murillo*

 Chaordic
*www.chaordic.com.br <http://www.chaordic.com.br/>*
55 48 3232.3200

Re: Exception in createBlockOutputStream trying to start application

Posted by Azuryy <az...@gmail.com>.
Hi Murillo,

Generally, an EOF error is caused by the network.

You can ignore this exception; it just tells you that one DN in the pipeline is bad, and the DFS client will avoid that node next time.


Sent from my iPhone5s
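
On the timeout side: the 105000 ms figures in the quoted datanode logs look like the default 60 s read timeout plus a per-node extension for the long pipeline (the job.split staging file is written with mapreduce.client.submit.file.replication, default 10, rather than dfs.replication, which is probably why the logs show pipelineSize=10). If the links really are that slow, these are the usual knobs; an hdfs-site.xml sketch with illustrative values only, keeping in mind that raising them masks a slow node or link rather than fixing it:

<property>
  <!-- base socket read timeout in ms; datanodes add a small per-node extension for pipeline transfers -->
  <name>dfs.client.socket-timeout</name>
  <value>120000</value>
</property>
<property>
  <!-- socket write timeout in ms used when streaming block data to the next datanode -->
  <name>dfs.datanode.socket.write.timeout</name>
  <value>600000</value>
</property>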

> On 2014年4月5日, at 1:25, Murillo Flores <mu...@chaordic.com.br> wrote:
> 
> Hello everybody,
> 
> I'm running a hadoop 2.2.0 cluster with 120 "slaves" running a datanode and a nodemanager each, and 1 "master" running the namenode, the secondary namenode, the resource manager (yarn) and the JobHistoryServer. In this cluster, eventually my attempts to start an application that downloads all it's input data from s3 fails with messages like this one below:
> 
> 14/04/02 21:43:12 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as 10.91.140.216:50010
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1166)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1088)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> 14/04/02 21:43:12 INFO hdfs.DFSClient: Abandoning BP-1137777807-10.225.13.33-1396473453940:blk_1073741844_1020
> 14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to /10.225.13.33:8020 from root sending #113
> 14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to /10.225.13.33:8020 from root got value #113
> 14/04/02 21:43:12 DEBUG ipc.ProtobufRpcEngine: Call: abandonBlock took 2ms
> 14/04/02 21:43:12 INFO hdfs.DFSClient: Excluding datanode 10.91.140.216:50010
> 14/04/02 21:43:12 WARN hdfs.DFSClient: DataStreamer Exception
> java.io.IOException: Unable to create new block.
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1100)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> 14/04/02 21:43:12 WARN hdfs.DFSClient: Could not get block locations. Source file "/tmp/hadoop-yarn/staging/root/.staging/job_1396473480140_0001/job.split" - Aborting...
> 14/04/02 21:43:12 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/root/.staging/job_1396473480140_0001
> 14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to /10.225.13.33:8020 from root sending #114
> 14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to /10.225.13.33:8020 from root got value #114
> 14/04/02 21:43:12 DEBUG ipc.ProtobufRpcEngine: Call: delete took 35ms
> 14/04/02 21:43:12 ERROR security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:java.io.IOException: Bad connect ack with firstBadLink as 10.91.140.216:50010
> 14/04/02 21:43:12 ERROR security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:java.io.IOException: Bad connect ack with firstBadLink as 10.91.140.216:50010
> 14/04/02 21:43:12 ERROR streaming.StreamJob: Error Launching job : Bad connect ack with firstBadLink as 10.91.140.216:50010
> Streaming Command Failed!
> 14/04/02 21:43:12 ERROR hdfs.DFSClient: Failed to close file /tmp/hadoop-yarn/staging/root/.staging/job_1396473480140_0001/job.split
> java.io.IOException: Bad connect ack with firstBadLink as 10.91.140.216:50010
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1166)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1088)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> 
> In the datanodes log, I'm seeing both "Connection timeout" and "Premature EOF" exceptions. Looks like the first datanode that receives the data from the namenode fails with an exception like this one below:
> 
> 14/04/02 21:32:39 ERROR datanode.DataNode: DataNode{data=FSDataset{dirpath='[/mnt/hadoop/current, /mnt/data1/hadoop/current, /mnt/data2/hadoop/current, /mnt/data3/hadoop/current]'}, localName='ec2-54-80-176-6.compute-1.amazonaws.com:50010', storageID='DS-1182379849-10.91.166.67-50010-1396473477988', xmitsInProgress=0}:Exception transfering block BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 to mirror 10.65.0.211:50010: java.net.SocketTimeoutException: 105000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.91.166.67:37776 remote=/10.65.0.211:50010]
> 14/04/02 21:32:39 INFO datanode.DataNode: opWriteBlock BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 received exception java.net.SocketTimeoutException: 105000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.91.166.67:37776 remote=/10.65.0.211:50010]
> 14/04/02 21:32:39 ERROR datanode.DataNode: ec2-54-80-176-6.compute-1.amazonaws.com:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.225.13.33:36287 dest: /10.91.166.67:50010
> java.net.SocketTimeoutException: 105000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.91.166.67:37776 remote=/10.65.0.211:50010]
> 	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
> 	at java.io.FilterInputStream.read(FilterInputStream.java:83)
> 	at java.io.FilterInputStream.read(FilterInputStream.java:83)
> 	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1490)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:511)
> 
> And the other datanodes dealing with the same block (receiving it from another datanode instead of the namenode?) fail with this:
> 
> 14/04/02 21:30:54 DEBUG datanode.DataNode: opWriteBlock: stage=PIPELINE_SETUP_CREATE, clientname=DFSClient_NONMAPREDUCE_1796898153_1
>   block  =BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012, newGs=0, bytesRcvd=[0, 0]
>   targets=[10.91.165.2:50010, 10.31.188.219:50010, 10.91.137.130:50010, 10.91.133.28:50010, 10.91.140.216:50010, 10.65.2.239:50010, 10.91.134.151:50010, 10.91.132.32:50010]; pipelineSize=10, srcDataNode=:0
> 14/04/02 21:30:54 DEBUG datanode.DataNode: isDatanode=false, isClient=true, isTransfer=false
> 14/04/02 21:30:54 DEBUG datanode.DataNode: writeBlock receive buf size 131071 tcp no delay true
> 14/04/02 21:30:54 INFO datanode.DataNode: Receiving BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 src: /10.91.166.67:37776 dest: /10.65.0.211:50010
> 14/04/02 21:30:54 DEBUG datanode.DataNode: BlockReceiver: BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012
>   isClient  =true, clientname=DFSClient_NONMAPREDUCE_1796898153_1
>   isDatanode=false, srcDataNode=:0
>   inAddr=/10.91.166.67:37776, myAddr=/10.65.0.211:50010
>   cachingStrategy = CachingStrategy(dropBehind=null, readahead=null)
> 14/04/02 21:30:54 DEBUG datanode.DataNode: writeTo blockfile is /mnt/hadoop/current/BP-1137777807-10.225.13.33-1396473453940/current/rbw/blk_1073741836 of size 0
> 14/04/02 21:30:54 DEBUG datanode.DataNode: writeTo metafile is /mnt/hadoop/current/BP-1137777807-10.225.13.33-1396473453940/current/rbw/blk_1073741836_1012.meta of size 0
> 14/04/02 21:30:54 DEBUG datanode.DataNode: Connecting to datanode 10.91.165.2:50010
> 14/04/02 21:30:55 DEBUG ipc.Client: IPC Client (1818694875) connection to ec2-54-81-119-149.compute-1.amazonaws.com/10.225.13.33:8020 from hadoop sending #234
> 14/04/02 21:30:55 DEBUG ipc.Client: IPC Client (1818694875) connection to ec2-54-81-119-149.compute-1.amazonaws.com/10.225.13.33:8020 from hadoop got value #234
> 14/04/02 21:30:55 DEBUG ipc.ProtobufRpcEngine: Call: blockReceivedAndDeleted took 3ms
> --
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.Sink_gangliaQsize, type int32, value 0, slope both from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.SnapshotAvgTime, type double, value 0.14285714285714288, slope both from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.Sink_gangliaNumOps, type float, value 87, slope positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.NumActiveSinks, type int32, value 1, slope both from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.DroppedPubAll, type float, value 0, slope positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.SnapshotNumOps, type float, value 615, slope positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.NumAllSinks, type int32, value 0, slope both from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG impl.MetricsSinkAdapter: Done
> 14/04/02 21:32:41 INFO datanode.DataNode: Datanode 8 got response for connect ack  from downstream datanode with firstbadlink as 10.91.133.28:50010
> 14/04/02 21:32:41 INFO datanode.DataNode: Datanode 8 forwarding connect ack to upstream firstbadlink is 10.91.133.28:50010
> 14/04/02 21:32:41 INFO datanode.DataNode: Exception for BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012
> java.io.IOException: Premature EOF from inputStream
> 	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
> 	at java.lang.Thread.run(Thread.java:744)
> 14/04/02 21:32:41 INFO datanode.DataNode: PacketResponder: BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012, type=HAS_DOWNSTREAM_IN_PIPELINE: Thread is interrupted.
> 14/04/02 21:32:41 INFO datanode.DataNode: PacketResponder: BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
> 14/04/02 21:32:41 INFO datanode.DataNode: opWriteBlock BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 received exception java.io.IOException: Premature EOF from inputStream
> 14/04/02 21:32:41 ERROR datanode.DataNode: ec2-54-80-198-241.compute-1.amazonaws.com:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.91.166.67:37776 dest: /10.65.0.211:50010
> java.io.IOException: Premature EOF from inputStream
> 	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
> 
> 
> I've already changed some configurations trying to prevent these exceptions from happening, like the ones below, and increased the maximum number of open file descriptors and processes (ulimit). It's worth noting that none of them have helped so far :/
> 
> dfs.datanode.max.xcievers -> 4096
> dfs.client.block.write.retries -> 12
> dfs.namenode.handler.count -> 64
> dfs.replication -> 3
> ulimit:
> * soft nofile 65536
> * hard nofile 65536
> 
> Does anyone have any clue about what could be going on at my cluster, or where else could I take a look to get some helpful information?
> Thanks.
> 
> -- 
> Murillo
> 
> Chaordic
> www.chaordic.com.br
> 55 48 3232.3200
> 
> 
> 
> -- 
> Murillo
> 
> Chaordic
> www.chaordic.com.br
> 55 48 3232.3200

Re: Exception in createBlockOutputStream trying to start application

Posted by Azuryy <az...@gmail.com>.
Hi Mirillo,

Generally EOF error was caused by network 

You can ignore this Exception, which just tell you one DN in the pipe line is bad, dfsclient will avoid locate it next time.


Sent from my iPhone5s

> On 2014��4��5��, at 1:25, Murillo Flores <mu...@chaordic.com.br> wrote:
> 
> Hello everybody,
> 
> I'm running a hadoop 2.2.0 cluster with 120 "slaves" running a datanode and a nodemanager each, and 1 "master" running the namenode, the secondary namenode, the resource manager (yarn) and the JobHistoryServer. In this cluster, eventually my attempts to start an application that downloads all it's input data from s3 fails with messages like this one below:
> 
> 14/04/02 21:43:12 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as 10.91.140.216:50010
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1166)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1088)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> 14/04/02 21:43:12 INFO hdfs.DFSClient: Abandoning BP-1137777807-10.225.13.33-1396473453940:blk_1073741844_1020
> 14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to /10.225.13.33:8020 from root sending #113
> 14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to /10.225.13.33:8020 from root got value #113
> 14/04/02 21:43:12 DEBUG ipc.ProtobufRpcEngine: Call: abandonBlock took 2ms
> 14/04/02 21:43:12 INFO hdfs.DFSClient: Excluding datanode 10.91.140.216:50010
> 14/04/02 21:43:12 WARN hdfs.DFSClient: DataStreamer Exception
> java.io.IOException: Unable to create new block.
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1100)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> 14/04/02 21:43:12 WARN hdfs.DFSClient: Could not get block locations. Source file "/tmp/hadoop-yarn/staging/root/.staging/job_1396473480140_0001/job.split" - Aborting...
> 14/04/02 21:43:12 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/root/.staging/job_1396473480140_0001
> 14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to /10.225.13.33:8020 from root sending #114
> 14/04/02 21:43:12 DEBUG ipc.Client: IPC Client (1577412535) connection to /10.225.13.33:8020 from root got value #114
> 14/04/02 21:43:12 DEBUG ipc.ProtobufRpcEngine: Call: delete took 35ms
> 14/04/02 21:43:12 ERROR security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:java.io.IOException: Bad connect ack with firstBadLink as 10.91.140.216:50010
> 14/04/02 21:43:12 ERROR security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:java.io.IOException: Bad connect ack with firstBadLink as 10.91.140.216:50010
> 14/04/02 21:43:12 ERROR streaming.StreamJob: Error Launching job : Bad connect ack with firstBadLink as 10.91.140.216:50010
> Streaming Command Failed!
> 14/04/02 21:43:12 ERROR hdfs.DFSClient: Failed to close file /tmp/hadoop-yarn/staging/root/.staging/job_1396473480140_0001/job.split
> java.io.IOException: Bad connect ack with firstBadLink as 10.91.140.216:50010
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1166)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1088)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> 
> In the datanodes log, I'm seeing both "Connection timeout" and "Premature EOF" exceptions. Looks like the first datanode that receives the data from the namenode fails with an exception like this one below:
> 
> 14/04/02 21:32:39 ERROR datanode.DataNode: DataNode{data=FSDataset{dirpath='[/mnt/hadoop/current, /mnt/data1/hadoop/current, /mnt/data2/hadoop/current, /mnt/data3/hadoop/current]'}, localName='ec2-54-80-176-6.compute-1.amazonaws.com:50010', storageID='DS-1182379849-10.91.166.67-50010-1396473477988', xmitsInProgress=0}:Exception transfering block BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 to mirror 10.65.0.211:50010: java.net.SocketTimeoutException: 105000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.91.166.67:37776 remote=/10.65.0.211:50010]
> 14/04/02 21:32:39 INFO datanode.DataNode: opWriteBlock BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 received exception java.net.SocketTimeoutException: 105000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.91.166.67:37776 remote=/10.65.0.211:50010]
> 14/04/02 21:32:39 ERROR datanode.DataNode: ec2-54-80-176-6.compute-1.amazonaws.com:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.225.13.33:36287 dest: /10.91.166.67:50010
> java.net.SocketTimeoutException: 105000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.91.166.67:37776 remote=/10.65.0.211:50010]
> 	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
> 	at java.io.FilterInputStream.read(FilterInputStream.java:83)
> 	at java.io.FilterInputStream.read(FilterInputStream.java:83)
> 	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1490)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:511)
> 
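> If I'm reading the 2.2.0 DataXceiver behaviour correctly, the 105000 ms figure is just the default 60 s socket read timeout plus a 5 s extension per downstream datanode still in the pipeline (9 of them here, given the pipelineSize=10 shown in the next trace). A minimal arithmetic sketch, assuming that formula (the property name and the 5 s per-node extension are my reading of the defaults, not something verified on this cluster):
> 
> public class PipelineTimeoutCheck {
>     public static void main(String[] args) {
>         // Assumed formula: base socket timeout + a fixed extension per downstream node.
>         int socketTimeoutMs = 60_000;    // dfs.client.socket-timeout default (60 s)
>         int perNodeExtensionMs = 5_000;  // assumed 5 s extension per downstream datanode
>         int downstreamNodes = 9;         // pipelineSize=10 minus the node itself
>         System.out.println(socketTimeoutMs + perNodeExtensionMs * downstreamNodes + " ms"); // prints 105000 ms
>     }
> }
> 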
> And the other datanodes handling the same block (each receiving it from the previous datanode in the pipeline rather than from the namenode) fail with this (a toy sketch of the pipeline data path follows the trace):
> 
> 14/04/02 21:30:54 DEBUG datanode.DataNode: opWriteBlock: stage=PIPELINE_SETUP_CREATE, clientname=DFSClient_NONMAPREDUCE_1796898153_1
>   block  =BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012, newGs=0, bytesRcvd=[0, 0]
>   targets=[10.91.165.2:50010, 10.31.188.219:50010, 10.91.137.130:50010, 10.91.133.28:50010, 10.91.140.216:50010, 10.65.2.239:50010, 10.91.134.151:50010, 10.91.132.32:50010]; pipelineSize=10, srcDataNode=:0
> 14/04/02 21:30:54 DEBUG datanode.DataNode: isDatanode=false, isClient=true, isTransfer=false
> 14/04/02 21:30:54 DEBUG datanode.DataNode: writeBlock receive buf size 131071 tcp no delay true
> 14/04/02 21:30:54 INFO datanode.DataNode: Receiving BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 src: /10.91.166.67:37776 dest: /10.65.0.211:50010
> 14/04/02 21:30:54 DEBUG datanode.DataNode: BlockReceiver: BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012
>   isClient  =true, clientname=DFSClient_NONMAPREDUCE_1796898153_1
>   isDatanode=false, srcDataNode=:0
>   inAddr=/10.91.166.67:37776, myAddr=/10.65.0.211:50010
>   cachingStrategy = CachingStrategy(dropBehind=null, readahead=null)
> 14/04/02 21:30:54 DEBUG datanode.DataNode: writeTo blockfile is /mnt/hadoop/current/BP-1137777807-10.225.13.33-1396473453940/current/rbw/blk_1073741836 of size 0
> 14/04/02 21:30:54 DEBUG datanode.DataNode: writeTo metafile is /mnt/hadoop/current/BP-1137777807-10.225.13.33-1396473453940/current/rbw/blk_1073741836_1012.meta of size 0
> 14/04/02 21:30:54 DEBUG datanode.DataNode: Connecting to datanode 10.91.165.2:50010
> 14/04/02 21:30:55 DEBUG ipc.Client: IPC Client (1818694875) connection to ec2-54-81-119-149.compute-1.amazonaws.com/10.225.13.33:8020 from hadoop sending #234
> 14/04/02 21:30:55 DEBUG ipc.Client: IPC Client (1818694875) connection to ec2-54-81-119-149.compute-1.amazonaws.com/10.225.13.33:8020 from hadoop got value #234
> 14/04/02 21:30:55 DEBUG ipc.ProtobufRpcEngine: Call: blockReceivedAndDeleted took 3ms
> --
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.Sink_gangliaQsize, type int32, value 0, slope both from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.SnapshotAvgTime, type double, value 0.14285714285714288, slope both from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.Sink_gangliaNumOps, type float, value 87, slope positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.NumActiveSinks, type int32, value 1, slope both from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.DroppedPubAll, type float, value 0, slope positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.SnapshotNumOps, type float, value 615, slope positive from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG ganglia.GangliaSink31: Emitting metric metricssystem.MetricsSystem.NumAllSinks, type int32, value 0, slope both from hostname ec2-54-80-198-241.compute-1.amazonaws.com
> 14/04/02 21:32:39 DEBUG impl.MetricsSinkAdapter: Done
> 14/04/02 21:32:41 INFO datanode.DataNode: Datanode 8 got response for connect ack  from downstream datanode with firstbadlink as 10.91.133.28:50010
> 14/04/02 21:32:41 INFO datanode.DataNode: Datanode 8 forwarding connect ack to upstream firstbadlink is 10.91.133.28:50010
> 14/04/02 21:32:41 INFO datanode.DataNode: Exception for BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012
> java.io.IOException: Premature EOF from inputStream
> 	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
> 	at java.lang.Thread.run(Thread.java:744)
> 14/04/02 21:32:41 INFO datanode.DataNode: PacketResponder: BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012, type=HAS_DOWNSTREAM_IN_PIPELINE: Thread is interrupted.
> 14/04/02 21:32:41 INFO datanode.DataNode: PacketResponder: BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
> 14/04/02 21:32:41 INFO datanode.DataNode: opWriteBlock BP-1137777807-10.225.13.33-1396473453940:blk_1073741836_1012 received exception java.io.IOException: Premature EOF from inputStream
> 14/04/02 21:32:41 ERROR datanode.DataNode: ec2-54-80-198-241.compute-1.amazonaws.com:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.91.166.67:37776 dest: /10.65.0.211:50010
> java.io.IOException: Premature EOF from inputStream
> 	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
> 
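> Reading the two traces together: the upstream datanode gives up after its timeout and closes the pipeline socket, the node below it then sees "Premature EOF from inputStream" on the connection it was reading from, and the client is told the first bad link in the chain. A toy sketch of the per-node data path (not Hadoop's actual code; ack handling omitted), just to show why one slow or unreachable node breaks the whole pipeline:
> 
> import java.io.*;
> 
> public class PipelineStage {
>     // Each datanode in the write pipeline reads packets from its upstream peer
>     // and forwards them downstream while writing a local copy; if the upstream
>     // socket closes early, the read below ends prematurely.
>     static void relay(InputStream fromUpstream, OutputStream toDownstream,
>                       OutputStream toLocalDisk) throws IOException {
>         byte[] buf = new byte[64 * 1024];
>         int n;
>         while ((n = fromUpstream.read(buf)) != -1) {
>             toLocalDisk.write(buf, 0, n);
>             toDownstream.write(buf, 0, n);
>         }
>     }
> 
>     public static void main(String[] args) throws IOException {
>         ByteArrayOutputStream disk = new ByteArrayOutputStream();
>         ByteArrayOutputStream downstream = new ByteArrayOutputStream();
>         relay(new ByteArrayInputStream("packet".getBytes()), downstream, disk);
>         System.out.println(disk.size() + " bytes stored, " + downstream.size() + " forwarded");
>     }
> }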
> 
> I've already changed some configuration settings trying to prevent these exceptions, like the ones below (a sketch of how to double-check the values actually in effect follows the list), and increased the maximum number of open file descriptors and processes (ulimit). It's worth noting that none of this has helped so far :/
> 
> dfs.datanode.max.xcievers -> 4096
> dfs.client.block.write.retries -> 12
> dfs.namenode.handler.count -> 64
> dfs.replication -> 3
> ulimit:
> * soft nofile 65536
> * hard nofile 65536
> 
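> As far as I know, in Hadoop 2.x dfs.datanode.max.xcievers is only a deprecated alias for dfs.datanode.max.transfer.threads, so it should still take effect, but I wanted a way to confirm what the daemons actually see. A small sketch using the standard Configuration API (the class name and the printed fallbacks are just for illustration; it only needs hdfs-site.xml on the classpath):
> 
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hdfs.HdfsConfiguration;
> 
> public class ShowDfsSettings {
>     public static void main(String[] args) {
>         // HdfsConfiguration loads hdfs-default.xml/hdfs-site.xml and registers
>         // the deprecated-property mappings, so either property name resolves.
>         Configuration conf = new HdfsConfiguration();
>         System.out.println("dfs.datanode.max.transfer.threads = "
>                 + conf.getInt("dfs.datanode.max.transfer.threads", 4096));
>         System.out.println("dfs.client.block.write.retries    = "
>                 + conf.getInt("dfs.client.block.write.retries", 3));
>         System.out.println("dfs.namenode.handler.count        = "
>                 + conf.getInt("dfs.namenode.handler.count", 10));
>         System.out.println("dfs.replication                   = "
>                 + conf.getInt("dfs.replication", 3));
>     }
> }
> 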
> Does anyone have any clue about what could be going on in my cluster, or where else I could look for helpful information?
> Thanks.
> 
> -- 
> Murillo
> 
> Chaordic
> www.chaordic.com.br
> 55 48 3232.3200
> 