Posted to user@hbase.apache.org by Wayne <wa...@gmail.com> on 2011/01/06 14:46:49 UTC

Node Shutdown Problems

I had another node go down last night. There was no load at the time; it just
seems it had issues and shut itself down. Any help would be greatly
appreciated. Why would the file system go away? Is this a hadoop problem, a
hardware problem, or something else?

Here is a sampling of the logs. Please let me know what other information is
needed to help figure this out.

This is the first sign of trouble.

07:47:51,079 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream
ResponseProcessor exception  for block
blk_-5666628798929448999_724103java.net.SocketTimeoutException: 69000
millis timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel

07:47:51,080 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for
block blk_-5666628798929448999_724103 bad datanode[0]

It fails 6 recovery attempts against each of 3 nodes (including itself) for
the same block.

07:47:51,129 WARN org.apache.hadoop.hdfs.DFSClient: Failed recovery
attempt #0 from primary datanode x.x.x.x6:50010
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.ipc.RemoteException: java.io.IOException:
blk_-5666628798929448999_724103 is already commited, storedBlock ==
null.

Then it logs 30+ fatal errors regarding the WAL.

2011-01-06 07:48:42,075 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer:
java.io.IOException: Error Recovery for block
blk_-5666628798929448999_724103 failed  because recovery from primary
datanode x.x.x.x8:50010 failed 6 times.  Pipeline was x.x.x.x8:50010.
Aborting...
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2741)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:2172)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2371)
2011-01-06 07:48:42,077 FATAL
org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append.
Requesting close of hlog


Finally it gives up and we see the message below.

2011-01-06 07:48:42,233 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region
server serverName=x.x.x.x7,60020,1294257428578, load=(requests=76,
regions=320, usedHeap=3511, maxHeap=7987): File System not available
java.io.IOException: File system is not available


In case it helps, the hadoop logs show the following 6x.

07:48:30,936 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2
on 50020, call recoverBlock(blk_-5666628798929448999_724103, false,
[Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@3d40e2db) from
x.x.x.x7:42652: error: org.apache.hadoop.ipc.RemoteException:
java.io.IOException: blk_-5666628798929448999_724103 is already
commited, storedBlock == null.


The other nodes it was trying to talk to had no hbase log entries of
interest, but their hadoop logs looked like the following, 6x.

2011-01-06 07:48:37,013 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 50020, call recoverBlock(blk_-5666628798929448999_724103,
false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@58c2ebf) from
x.x.x.x7:42979: error: org.apache.hadoop.ipc.RemoteException:
java.io.IOException: blk_-5666628798929448999_724103 is already
commited, storedBlock == null.


I am not sure if it is related, but I see java.net.SocketTimeoutException
happening a lot on most of the nodes in the hadoop logs (not at the same time
as the node going down). I see some recent activity related to this, but I am
not sure if it has anything to do with our problems. I also see errors like
the one below; they may be totally unrelated, but as I look closely at the
logs I am seeing these things for the first time.

05:45:33,059 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(x.x.x.x8:50010,
storageID=DS-201495735-x.x.x.x8-50010-1293567644535, infoPort=50075,
ipcPort=50020):DataXceiver
java.io.IOException: Block blk_3051235367015871067_707839 is not valid.


Thanks in advance for any assistance anyone can provide.

Re: Node Shutdown Problems

Posted by Wayne <wa...@gmail.com>.
Thanks for the reply. We are running hadoop branch-0.20-append. We have
upped our xceivers to 4096 and restarted the hadoop cluster. We had upped
them before but had not yet restarted. Hopefully, once again, this is our
mistake in not setting things up correctly from the start. So far so good.
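
In case it helps anyone else searching the archives, this is roughly the
setting we changed (a sketch, assuming the 0.20-era property name, which
really is spelled "xcievers"); it goes in hdfs-site.xml on every datanode
and only takes effect after the datanodes are restarted:

<!-- hdfs-site.xml on each datanode -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>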

Thanks for your help.


On Fri, Jan 7, 2011 at 1:09 AM, Stack <st...@duboce.net> wrote:

> On Thu, Jan 6, 2011 at 5:46 AM, Wayne <wa...@gmail.com> wrote:
> > I had another node go down last night. No load at the time, it just seems
> it
> > had issues and shut itself down. Any help would be greatly appreciated.
> Why
> > would the file system go away?
>
> Incorrect ulimit or xceivers, or it was shut down from under hbase.
>
>
> > Here is a sampling of the logs. Please let me know what other information
> is
> > needed to help figure this out.
> >
> > This is the first sign of trouble.
> >
> > 07:47:51,079 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream
> > ResponseProcessor exception  for block
> > blk_-5666628798929448999_724103java.net.SocketTimeoutException: 69000
> > millis timeout while waiting for channel to be ready for read. ch :
> > java.nio.channels.SocketChannel
> >
> > 07:47:51,080 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for
> > block blk_-5666628798929448999_724103 bad datanode[0]
> >
> > It fails 6 recovery attempts against each of 3 nodes (including itself)
> > for the same block
> >
>
> Check the datanode logs, in particular those of the nodes participating in
> this block's replication.  They may tell you something.  Grepping for the
> block in the namenode log can also be revealing.
>
>
>
>
> > 2011-01-06 07:48:42,075 ERROR
> > org.apache.hadoop.hbase.regionserver.HRegionServer:
> > java.io.IOException: Error Recovery for block
> > blk_-5666628798929448999_724103 failed  because recovery from primary
> > datanode x.x.x.x8:50010 failed 6 times.  Pipeline was x.x.x.x8:50010.
> > Aborting...
> >        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2741)
> >        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:2172)
> >        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2371)
> > 2011-01-06 07:48:42,077 FATAL
> > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append.
> > Requesting close of hlog
> >
>
>
> That's as though you were running with only one replica, or the other
> replicas had died on the dfsclient and had been marked bad.
>
> The regionserver will shut itself down if it can't write to the
> filesystem, to protect against data loss.
>
> >
> > Finally it gives up and we see the message below.
> >
> > 2011-01-06 07:48:42,233 FATAL
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region
> > server serverName=x.x.x.x7,60020,1294257428578, load=(requests=76,
> > regions=320, usedHeap=3511, maxHeap=7987): File System not available
> > java.io.IOException: File system is not available
> >
>
> Yeah, as far as it's concerned, the FS is hosed.
>
>
> What Hadoop are you running on top of? (Sorry, you've probably already
> said in earlier emails.)
>
>
> > I am not sure if it is related but I see java.net.SocketTimeoutException
> > happening a lot on most of the nodes in the hadoop log (not at the same
> time
> > as the node going down). I see some recent activity related to this, but
> I
> > am not sure if it has anything to do with our problems. I also see errors
> > like the one below, may be totally unrelated but as I look closely at the
> > logs I see these things for the first time.
> >
> > 05:45:33,059 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> > DatanodeRegistration(x.x.x.x8:50010,
> > storageID=DS-201495735-x.x.x.x8-50010-1293567644535, infoPort=50075,
> > ipcPort=50020):DataXceiver
> > java.io.IOException: Block blk_3051235367015871067_707839 is not valid.
> >
>
> If you grep for this block in the namenode log, does it tell you anything?
>
> Thanks Wayne,
> St.Ack
>
>
> >
> > Thanks in advance for any assistance anyone can provide.
> >
>

Re: Node Shutdown Problems

Posted by Stack <st...@duboce.net>.
On Thu, Jan 6, 2011 at 5:46 AM, Wayne <wa...@gmail.com> wrote:
> I had another node go down last night. No load at the time, it just seems it
> had issues and shut itself down. Any help would be greatly appreciated. Why
> would the file system go away?

Incorrect ulimit or xceivers, or it was shut down from under hbase.
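
A quick way to check both, as a rough sketch (the config and log paths below
are only guesses and will differ per install):

# run as the user that starts the datanode and regionserver
ulimit -n    # open-file limit; the stock 1024 is too low for hbase

# the xceivers ceiling the datanode is actually configured with
grep -A1 'dfs.datanode.max.xcievers' /etc/hadoop/conf/hdfs-site.xml

# a datanode that has hit the ceiling logs something like
# "exceeds the limit of concurrent xcievers"
grep -c 'exceeds the limit of concurrent xcievers' /var/log/hadoop/*datanode*.log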


> Here is a sampling of the logs. Please let me know what other information is
> needed to help figure this out.
>
> This is the first sign of trouble.
>
> 07:47:51,079 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream
> ResponseProcessor exception  for block
> blk_-5666628798929448999_724103java.net.SocketTimeoutException: 69000
> millis timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel
>
> 07:47:51,080 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for
> block blk_-5666628798929448999_724103 bad datanode[0]
>
> It fails 6 recovery attempts against each of 3 nodes (including itself) for
> the same block
>

Check the datanode logs, in particular those of the nodes participating in
this block's replication.  They may tell you something.  Grepping for the
block in the namenode log can also be revealing.
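
Something along these lines (the log locations are just a guess; adjust to
wherever your logs actually live):

# the block's history on the namenode: allocation, pipeline, recovery attempts
grep 'blk_-5666628798929448999' /var/log/hadoop/hadoop-*-namenode-*.log

# and on each datanode that held a replica
grep 'blk_-5666628798929448999' /var/log/hadoop/hadoop-*-datanode-*.log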




> 2011-01-06 07:48:42,075 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> java.io.IOException: Error Recovery for block
> blk_-5666628798929448999_724103 failed  because recovery from primary
> datanode x.x.x.x8:50010 failed 6 times.  Pipeline was x.x.x.x8:50010.
> Aborting...
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2741)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:2172)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2371)
> 2011-01-06 07:48:42,077 FATAL
> org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append.
> Requesting close of hlog
>


That's as though you were running with only one replica, or the other
replicas had died on the dfsclient and had been marked bad.

The regionserver will shut itself down if it can't write to the
filesystem, to protect against data loss.

>
> Finally it gives up and we see the message below.
>
> 2011-01-06 07:48:42,233 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region
> server serverName=x.x.x.x7,60020,1294257428578, load=(requests=76,
> regions=320, usedHeap=3511, maxHeap=7987): File System not available
> java.io.IOException: File system is not available
>

Yeah, as far as it's concerned, the FS is hosed.


What Hadoop are you running on top of? (Sorry, you've probably already
said in earlier emails.)


> I am not sure if it is related but I see java.net.SocketTimeoutException
> happening a lot on most of the nodes in the hadoop log (not at the same time
> as the node going down). I see some recent activity related to this, but I
> am not sure if it has anything to do with our problems. I also see errors
> like the one below, may be totally unrelated but as I look closely at the
> logs I see these things for the first time.
>
> 05:45:33,059 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(x.x.x.x8:50010,
> storageID=DS-201495735-x.x.x.x8-50010-1293567644535, infoPort=50075,
> ipcPort=50020):DataXceiver
> java.io.IOException: Block blk_3051235367015871067_707839 is not valid.
>

If you grep for this block in the namenode log, does it tell you anything?
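
For example (again, the path is just a guess for your layout):

# where the block came from and whether it had already been deleted/invalidated
grep 'blk_3051235367015871067' /var/log/hadoop/hadoop-*-namenode-*.log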

Thanks Wayne,
St.Ack


>
> Thanks in advance for any assistance anyone can provide.
>