Posted to user@hadoop.apache.org by Joseph Naegele <jn...@grierforensics.com> on 2016/12/20 16:11:55 UTC

SocketTimeoutException in DataXceiver

Hi folks,

I'm experiencing the exact symptoms of HDFS-770 (https://issues.apache.org/jira/browse/HDFS-770) using Spark and a basic HDFS deployment. Everything is running locally on a single machine, using Hadoop 2.7.3. My HDFS deployment consists of a single 8 TB disk with replication disabled; otherwise everything is vanilla Hadoop 2.7.3. My Spark job uses a Hive ORC writer to write a dataset to disk. The dataset itself is < 100 GB uncompressed, ~17 GB compressed.

It does not appear to be a Spark issue. The datanode's logs show it receives the first ~500 packets for a block, then nothing for a minute, then the default channel read timeout of 60000 ms causes the exception:

2016-12-19 18:36:50,632 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1695049761-192.168.2.211-1479228275669:blk_1073957413_216632 received exception java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:55866]
2016-12-19 18:36:50,632 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: lamport.grierforensics.com:50010:DataXceiver error processing WRITE_BLOCK operation  src: /127.0.0.1:55866 dst: /127.0.0.1:50010
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:55866]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        ...

On the Spark side, all is well until the datanode's socket exception results in Spark experiencing a DFSOutputStream ResponseProcessor exception, followed by Spark aborting due to all datanodes being bad:

2016-12-19 18:36:59.014 WARN DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-1695049761-192.168.2.211-1479228275669:blk_1073957413_216632
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)

...
Caused by: java.io.IOException: All datanodes 127.0.0.1:50010 are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1206)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1004)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:548)

I haven't tried adjusting the timeout yet, for the same reason given by the reporter of HDFS-770: I'm running everything locally, with no other tasks running on the system, so why would I need a socket read timeout greater than 60 seconds? I haven't observed any CPU, memory, or disk bottlenecks.
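
For reference, my understanding is that the 60000 ms in the DataNode exception comes from dfs.client.socket-timeout, which the DataNode reads from its own hdfs-site.xml (dfs.datanode.socket.write.timeout covers the write direction). If I do end up experimenting with larger values, a quick check like the sketch below, run against the same configuration directory the DataNode uses, should confirm what the daemon will actually pick up. This is just my own sketch (the class name is arbitrary; the fallback defaults should be the stock Hadoop 2.7 values):

    // Minimal sketch: print the socket timeouts the local Hadoop configuration
    // resolves to. Changing them for the DataNode means editing the DataNode's
    // hdfs-site.xml and restarting it; setting them only on the client side
    // won't change the 60000 ms read timeout seen in the DataNode log.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hdfs.HdfsConfiguration;

    public class ShowDfsTimeouts {
        public static void main(String[] args) {
            // HdfsConfiguration loads core-site.xml/hdfs-site.xml from the classpath.
            Configuration conf = new HdfsConfiguration();
            // Read timeout used by the DataXceiver (the 60000 ms in the exception above).
            System.out.println("dfs.client.socket-timeout = "
                    + conf.getInt("dfs.client.socket-timeout", 60 * 1000));
            // Write timeout used when the DataNode streams data out.
            System.out.println("dfs.datanode.socket.write.timeout = "
                    + conf.getInt("dfs.datanode.socket.write.timeout", 8 * 60 * 1000));
        }
    }

As far as I understand, the same keys can also be passed through Spark as spark.hadoop.dfs.client.socket-timeout etc., but that only affects the DFS client, not the DataNode side shown in the log above.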

Lowering the number of cores used by Spark does help alleviate the problem, but doesn't eliminate it, which led me to believe the issue may be disk contention (i.e. too many client writers?), but again, I haven't observed any disk IO bottlenecks at all.

Does anyone else still experience HDFS-770 (https://issues.apache.org/jira/browse/HDFS-770) and is there a general approach/solution?

Thanks

---
Joe Naegele
Grier Forensics





Re: SocketTimeoutException in DataXceiver

Posted by Wei-Chiu Chuang <we...@cloudera.com>.
This looks like a general issue, and there are multiple possible explanations. It could be a flaky NIC or a flaky network switch.

On the other hand, if the DataNode is busy and all DataXceiver threads are in use (4096 by default), this error may also be seen on the client side. Take a look at your DataNode log and see if you spot error messages like "Xceiver count 4097 exceeds the limit of concurrent xcievers: 4096". If that is the case, try increasing dfs.datanode.max.transfer.threads. Depending on the incoming traffic and the application, you might double or even quadruple that number.
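
As a rough illustration, the property lives in the DataNode's hdfs-site.xml and requires a DataNode restart. A small check like the one below (my own sketch, with an arbitrary class name) shows the limit the daemon will actually pick up, which you can compare against the counts in those log messages:

    // Minimal sketch: report the effective DataXceiver thread limit.
    // 4096 is the Hadoop 2.x default for this property; doubling it to 8192
    // would be a typical first step, but the right value depends on your load.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hdfs.HdfsConfiguration;

    public class ShowXceiverLimit {
        public static void main(String[] args) {
            // Run with the DataNode's configuration directory on the classpath.
            Configuration conf = new HdfsConfiguration();
            int limit = conf.getInt("dfs.datanode.max.transfer.threads", 4096);
            System.out.println("dfs.datanode.max.transfer.threads = " + limit);
        }
    }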

Your second error message is interesting. It might be corrupt blocks on the DataNode (either hardware or software -- a few known bugs can lead to this; I haven't checked whether they are fixed in Hadoop 2.7.3, and there could be other undiscovered bugs). It might also be due to an unresponsive DataNode (garbage collection pause, kernel pause -- there are a few scenarios in which the kernel can pause a process). You will need to look at the DataNode logs and the kernel dmesg log to understand why, and that is often time-consuming.


-- 
A very happy Clouderan