Posted to user@hbase.apache.org by Ken Weiner <ke...@gumgum.com> on 2009/12/23 20:10:00 UTC

Recurring Hadoop DataNode ERROR in logs

We have seen the following HADOOP error occur about 100 times a day spread
out throughout the day on each RegionServer/DataNode in our always-on
HBase/Hadoop cluster.

From hadoop-gumgum-datanode-xxxxxxxxxxxx.log

2009-12-23 09:58:29,717 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(10.255.9.187:50010,
storageID=DS-1057956046-10.255.9.187-50010-1248395287725,
infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while
waiting for channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected
local=/10.255.9.187:50010 remote=/10.255.9.187:46154]
	at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
	at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
	at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
	at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
	at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
	at java.lang.Thread.run(Thread.java:619)


Are other people seeing this error too?  How serious is it?  Can it be
prevented?

I found a few things that seem related, but I'm not sure how they apply to
the HBase environment:
http://issues.apache.org/jira/browse/HDFS-693
https://issues.apache.org/jira/browse/HADOOP-3831

Info on our environment:
1 Node: Master/NameNode/JobTracker (EC2 m1.large)
3 Nodes: RegionServer/DataNode/TaskTracker (EC2 m1.large)

Thanks!

-Ken Weiner
 GumGum & BEDROCK

Re: Recurring Hadoop DataNode ERROR in logs

Posted by stack <st...@duboce.net>.
On Wed, Dec 23, 2009 at 2:24 PM, Ken Weiner <ke...@gumgum.com> wrote:

> HBase itself isn't complaining. I only see the error from Hadoop.  I wasn't
> sure if there are some side effects I'm not aware of.  Is it causing any
> problems to my application?


IIUC, nothing but some extra latency while the dropped socket is
reestablished.



> I see the error occur even when my system is
> under relatively light load with no MR jobs running.  I also forgot to
> mention that I saw this error with HBase 0.20.0/Hadoop 0.20.0 and I
> recently
> upgraded to HBase 0.20.2/Hadoop 0.20.1 and the error still occurs at the
> same rate after the upgrade.
>

Yeah.  This would not be surprising.  This is hdfs behavior.

You could up the default timeout in HDFS if you wanted less
incidence: dfs.datanode.socket.write.timeout (setting it to 0 disables it
entirely). But then sockets would never time out.
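
For anyone who wants to try it, a minimal sketch of the relevant
hdfs-site.xml entry on each datanode, inside the <configuration> element
(the 480000 ms default is the same 480000 millis in the log above; 960000
is just an illustrative value, and the datanodes would need a restart to
pick it up):

  <!-- hdfs-site.xml: write timeout for DataXceiver sockets, in milliseconds -->
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <!-- default is 480000 (8 minutes); raise it to see fewer of these errors,
         or set it to 0 to disable the write timeout entirely -->
    <value>960000</value>
  </property>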

St.Ack



> Ken
>
> On Wed, Dec 23, 2009 at 11:49 AM, stack <st...@duboce.net> wrote:
>
> > Yeah, see HADOOP-3831.   It looks like datanode timing out unused
> > connections.  As I understand it, later, when dfsclient wants to use this
> > block, it just sets up the socket again -- silently, transparently below
> > the level at which the application can see.  Do I have it right?  Is hbase
> > itself complaining?
> >
> > St.Ack
> >
> >
>

Re: Recurring Hadoop DataNode ERROR in logs

Posted by Ken Weiner <ke...@gumgum.com>.
HBase itself isn't complaining. I only see the error from Hadoop.  I wasn't
sure if there are some side effects I'm not aware of.  Is it causing any
problems to my application?  I see the error occur even when my system is
under relatively light load with no MR jobs running.  I also forgot to
mention that I saw this error with HBase 0.20.0/Hadoop 0.20.0 and I recently
upgraded to HBase 0.20.2/Hadoop 0.20.1 and the error still occurs at the
same rate after the upgrade.

Ken

On Wed, Dec 23, 2009 at 11:49 AM, stack <st...@duboce.net> wrote:

> Yeah, see HADOOP-3831.   It looks like datanode timing out unused
> connections.  As I understand it, later, when dfsclient wants to use this
> block, it just sets up the socket again -- silently, transparently below
> the level at which the application can see.  Do I have it right?  Is hbase
> itself complaining?
>
> St.Ack
>
>

Re: Recurring Hadoop DataNode ERROR in logs

Posted by stack <st...@duboce.net>.
Yeah, see HADOOP-3831.   It looks like datanode timing out unused
connections.  As I understand it, later, when dfsclient wants to use this
block, it just sets up the socket again -- silently, transparently below the
level at which the application can see.  Do I have it right?  Is hbase
itself complaining?

St.Ack

On Wed, Dec 23, 2009 at 11:10 AM, Ken Weiner <ke...@gumgum.com> wrote:

> We have seen the following HADOOP error occur about 100 times a day spread
> out throughout the day on each RegionServer/DataNode in our always-on
> HBase/Hadoop cluster.
>
> From hadoop-gumgum-datanode-xxxxxxxxxxxx.log
>
> 2009-12-23 09:58:29,717 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(10.255.9.187:50010,
> storageID=DS-1057956046-10.255.9.187-50010-1248395287725,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.net.SocketTimeoutException: 480000 millis timeout while
> waiting for channel to be ready for write. ch :
> java.nio.channels.SocketChannel[connected
> local=/10.255.9.187:50010 remote=/10.255.9.187:46154]
>        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
>        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
>        at java.lang.Thread.run(Thread.java:619)
>
>
> Are other people seeing this error too?  How serious is it?  Can it be
> prevented?
>
> I found a few things that seem related, but I'm not sure how they apply to
> the HBase environment:
> http://issues.apache.org/jira/browse/HDFS-693
> https://issues.apache.org/jira/browse/HADOOP-3831
>
> Info on our environment:
> 1 Node: Master/NameNode/JobTracker (EC2 m1.large)
> 3 Nodes: RegionServer/DataNode/TaskTracker (EC2 m1.large)
>
> Thanks!
>
> -Ken Weiner
>  GumGum & BEDROCK
>

Re: Recurring Hadoop DataNode ERROR in logs

Posted by Lars George <la...@gmail.com>.
Hi Ken,

Yes, I have the same on a smallish cluster under heavy MR-based import
load. I am discussing with JD - maybe it is HDFS-101 related. Someone
needs to apply the patch and try.

Lars

On Dec 23, 2009, at 20:10, Ken Weiner <ke...@gumgum.com> wrote:

> We have seen the following HADOOP error occur about 100 times a day  
> spread
> out throughout the day on each RegionServer/DataNode in our always-on
> HBase/Hadoop cluster.
>
> From hadoop-gumgum-datanode-xxxxxxxxxxxx.log
>
> 2009-12-23 09:58:29,717 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(10.255.9.187:50010,
> storageID=DS-1057956046-10.255.9.187-50010-1248395287725,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.net.SocketTimeoutException: 480000 millis timeout while
> waiting for channel to be ready for write. ch :
> java.nio.channels.SocketChannel[connected
> local=/10.255.9.187:50010 remote=/10.255.9.187:46154]
>    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
>    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
>    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
>    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
>    at java.lang.Thread.run(Thread.java:619)
>
>
> Are other people seeing this error too?  How serious is it?  Can it be
> prevented?
>
> I found a few things that seem related, but I'm not sure how they  
> apply to
> the HBase environment:
> http://issues.apache.org/jira/browse/HDFS-693
> https://issues.apache.org/jira/browse/HADOOP-3831
>
> Info on our environment:
> 1 Node: Master/NameNode/JobTracker (EC2 m1.large)
> 3 Nodes: RegionServer/DataNode/TaskTracker (EC2 m1.large)
>
> Thanks!
>
> -Ken Weiner
> GumGum & BEDROCK