Posted to user@hadoop.apache.org by java8964 <ja...@hotmail.com> on 2015/08/24 17:49:48 UTC

HDFS throughput problem after upgrade to Hadoop 2.2.0

Hi, 
Recently we upgraded our production cluster from Hadoop V1.1.0 to V2.2.0. One issue we found is that HDFS throughput is worse than before.
We see a lot of "Timeout Exception" entries in the datanode logs. Here is the basic information about our cluster:
1) One HDFS NameNode
2) One HDFS Secondary NameNode
3) 42 Data/Task nodes
4) 2 Edge nodes
First, we observed that some HDFS clients (using "hadoop fs -put") got the following message on the console:
java.net.SocketTimeoutException: 65000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.20.95.157:20888 remote=/10.20.95.157:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
From what I have seen so far, the message always complains about writing to the third datanode in the pipeline, and the write fails and has to retry.
Most of the time the HDFS write succeeds after the retry, but we get a lot of "Timeout Exception" occurrences in the datanode logs.
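For what it's worth, this is roughly how I reproduce and time a single write through the normal client path (the local file and HDFS path below are just placeholders for the test):

    # create a 1 GB test file and time a single put; paths are placeholders
    dd if=/dev/zero of=/tmp/hdfs-write-test bs=1M count=1024
    hadoop fs -rm -skipTrash /tmp/hdfs-write-test
    time hadoop fs -put /tmp/hdfs-write-test /tmp/hdfs-write-test

When the client hits one of these retries, the roughly 65-second wait shows up directly in the elapsed time.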
Then we added the following settings to "hdfs-site.xml":
  <property>
    <name>dfs.client.socket-timeout</name>
    <value>180000</value>
  </property>
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>960000</value>
  </property>
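For quick experiments we also override the same two values per command on the client side with the generic -D options instead of editing the config file everywhere (the paths are placeholders; I am assuming "hadoop fs" picks up -D the way the other tools do, which seems to be the case for us):

    # one-off client-side override of the same timeouts for a single put
    hadoop fs -D dfs.client.socket-timeout=180000 \
              -D dfs.datanode.socket.write.timeout=960000 \
              -put /tmp/hdfs-write-test /tmp/hdfs-write-test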
But we still see lots of Timeout Exceptions, just with the longer timeout value, like the following:

2015-08-16 11:10:36,466 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode{data=FSDataset{dirpath='[/data1/hdfs/data/current, /data2/hdfs/data/current, /data3/hdfs/data/current, /data0/hdfs/data/current]'}, localName='p2-bigin144.ad.prodcc.net:50010', storageID='DS-709172270-10.20.95.176-50010-1427848090396', xmitsInProgress=0}:Exception transfering block BP-834217708-10.20.95.130-1438701195738:blk_1074671541_1099532031180 to mirror 10.20.95.162:50010: java.net.SocketTimeoutException: 185000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.20.95.176:32663 remote=/10.20.95.162:50010]
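To get a sense of how widespread this is, I count the timeout occurrences per remote datanode in the logs, roughly like this (the log location below is specific to our install):

    # count SocketTimeoutException occurrences per remote datanode address
    grep "SocketTimeoutException" /var/log/hadoop/hadoop-hdfs-datanode-*.log \
        | grep -o "remote=/[0-9.]*" \
        | sort | uniq -c | sort -rn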
What I have found out so far:
1) The timeout exceptions happen on connections to many different nodes (the destination IPs keep changing), so it doesn't look like a single bad datanode is causing this.
2) "dfs.datanode.handler.count" is already set to 10, the same as before the upgrade. I think that is enough handler threads for the datanode.
3) Our daily HDFS usage didn't change significantly before/after the upgrade.
4) I am still trying to find out whether anything changed on the network around the upgrade, but so far the network team says nothing did (the checks I am running myself are sketched below).
5) This is in our own data center, not in a public cloud.
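In case it helps point to a direction, these are the kinds of checks I am running on the datanodes (treat this as a sketch: the interface name is just ours, and jstack has to run as the same user as the DataNode process):

    # NIC errors/drops on the datanode interface (eth0 is just our interface name)
    ip -s link show eth0

    # TCP retransmission counters on the host
    netstat -s | grep -i retrans

    # rough count of active DataXceiver threads inside the DataNode JVM
    jstack $(pgrep -f org.apache.hadoop.hdfs.server.datanode.DataNode) | grep -c DataXceiver

    # overall cluster/datanode view from the namenode
    hdfs dfsadmin -report | head -40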
Has anyone faced similar issues before? What should I check to find the root cause of this? Most MR jobs and HDFS operations do succeed eventually, but their performance is impacted by this.
Thanks
Yong