Posted to user@hbase.apache.org by Andy Sautins <an...@returnpath.net> on 2011/04/10 01:35:05 UTC

DFS stability running HBase and dfs.datanode.handler.count...

    I ran across a mailing list posting from 1/4/2009 that seemed to indicate increasing dfs.datanode.handler.count could help improve DFS stability (http://mail-archives.apache.org/mod_mbox/hbase-user/200901.mbox/%3C49605FE0.9040509@duboce.net%3E ).  The posting seems to indicate the wiki was updated, but I don't see anything in the wiki about increasing dfs.datanode.handler.count.  I have seen a few other notes that show examples raising dfs.datanode.handler.count, including an Intel article (http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/ ) and the Pro Hadoop book, but other than that the only other mention I see is a Cloudera post that seems lukewarm on increasing dfs.datanode.handler.count (http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/ ).
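    For concreteness, the change in question would be an hdfs-site.xml entry on each datanode along these lines (the value shown is only illustrative, not a recommendation, and the datanodes would need a restart to pick it up):

<configuration>
  <!-- Illustrative only: the default for this era of Hadoop is 3. -->
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>8</value>
    <description>Number of handler threads for the datanode's IPC server.</description>
  </property>
</configuration>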

    Given that the post is from 2009, I thought I'd ask if anyone has had any success improving HBase/DFS stability by increasing dfs.datanode.handler.count.  The specific error we are seeing somewhat frequently (a few hundred times per day) in the datanode logs is as follows:

2011-04-09 00:12:48,035 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.18.0.33:50010, storageID=DS-1501576934-10.18.0.33-50010-1296248656454, infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: Block blk_-163126943925471435_28809750 is not valid.

   The above seems to correspond to ClosedChannelExceptions in the hbase regionserver logs, as well as some warnings about long writes to the hlog (some taking 50+ seconds).

    The biggest end-user-facing issue we are seeing is that Task Trackers keep getting blacklisted.  It's quite possible our problem is unrelated to HBase entirely, but I thought it was worth asking given what we've been seeing.

   We are currently running 0.91 on an 18-node cluster with ~3k total regions, and each region server is running with 2G of memory.

   Any insight would be appreciated.

   Thanks

    Andy

Re: DFS stability running HBase and dfs.datanode.handler.count...

Posted by Jean-Daniel Cryans <jd...@apache.org>.
You can confirm if it's really a GC issue by taking a look at the
master log; if the log splitting started before the errors in the
region server log, then something happened to the region server. It's
usually GC, but it could be something else.

The fact that mslab helped does point to a GC issue.

Regarding configurations, you can tweak zookeeper.session.timeout to
set the timeout higher than the default of 180000ms (3 minutes). It
might also be worth making sure that your region server and ZooKeeper
agree on the same timeout when the region server starts; depending on
how you start ZooKeeper it might end up being different. The line
looks like this:

17:39:20,189 INFO org.apache.zookeeper.server.NIOServerCnxn:
Established session 0xsome_hex with negotiated timeout 180000 for
client blah_address
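
As a sketch, the client-side setting lives in hbase-site.xml,
something like the below (the 300000 value is only an example, and
the ZooKeeper server's own session-timeout limits can still negotiate
it down, which is what the log line above tells you):

<configuration>
  <!-- Illustrative values; check the negotiated timeout in the ZK log
       after restarting to confirm what actually took effect. -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>300000</value>
    <description>Requested ZooKeeper session timeout in milliseconds.</description>
  </property>
</configuration>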

J-D

On Mon, Apr 18, 2011 at 9:33 AM, Andy Sautins
<an...@returnpath.net> wrote:
>
>   J-D, thanks for the response.  I suspect you are right that the error message was a red herring.  However, I don't see "Client session timed out" in any of our logs when the region server dies so I'm not sure if it is the situation you are thinking of.  Note that we are going to enable GC logging to get an idea of how long we pause in GC in our regionservers.  We also set hbase.hregion.memstore.mslab.enabled to true in hbase-site.xml which seemed to have helped, although I don't have too many datapoints at this point to confirm.  Before we enabled mslab we had region servers die every day or two.  After enabling mslab we just went 4 days without a regionserver dying which was better.
>
>   Following down our latest Region server that died, it seems to die because it can't close the hlog which sounds like the situation you mentioned in your response, although I'm not sure I fully understand IO Fencing.  The stack traces are below.
>
>   Are there any configuration parameters that could help keep the regionserver from dying when it encounters this situation?  We generally just restart the regionserver and it recovers nicely, but it would be ideal to not need to login and re-start regionservers due to this situation.
>
>   At this point it sounds like my best bet is to try to understand if we are still seeing big GC pauses even with mslab enabled to identify if that is our problem.
>
>   Thanks for your help
>
>   Andy
>
>
>
> 2011-04-18 02:04:12,862 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append. Requesting close of hlog
> java.io.IOException: Error Recovery for block blk_-4878958038737199123_30707596 failed  because recovery from primary datanode 10.18.0.28:50010 failed 6 times.  Pipeline was 10.18.0.28:50010. Aborting...
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
> 2011-04-18 02:04:12,862 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append. Requesting close of hlog
> java.io.IOException: Error Recovery for block blk_-4878958038737199123_30707596 failed  because recovery from primary datanode 10.18.0.28:50010 failed 6 times.  Pipeline was 10.18.0.28:50010. Aborting...
>
>  The above exception seems to correlate to the following error on the datanode.
>
> 2011-04-18 02:04:08,773 WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to getBlockMetaDataInfo for block (=blk_-4878958038737199123_30707596) from datanode (=10.18.0.28:50010)
> java.io.IOException: Block blk_-4878958038737199123_30711904 length is 45855569 does not match block file length 45853290
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset.validateBlockMetadata(FSDataset.java:1661)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset.startBlockRecovery(FSDataset.java:1921)
>        at org.apache.hadoop.hdfs.server.datanode.DataNode.startBlockRecovery(DataNode.java:1658)
>        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1772)
>        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1930)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>        at java.lang.reflect.Method.invoke(Unknown Source)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1415)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1411)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Unknown Source)
>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1409)
> 2011-04-18 02:04:08,774 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 50020, call recoverBlock(blk_-4878958038737199123_30707596, false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@2cda6e31) from 10.18.0.28:52651: error: java.io.IOException: All datanodes failed: block=blk_-4878958038737199123_30707596, datanodeids=[10.18.0.28:50010]
>
>  And the block does look like it is associated with a log file:
>
> 2011-04-18 01:37:17,661 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /user/hbase/.logs/hd28.dfs.returnpath.net,60020,1302812828956/hd28.dfs.returnpath.net%3A60020.1303112237535. blk_-4878958038737199123_30707596
>

RE: DFS stability running HBase and dfs.datanode.handler.count...

Posted by Andy Sautins <an...@returnpath.net>.
   J-D, thanks for the response.  I suspect you are right that the error message was a red herring.  However, I don't see "Client session timed out" in any of our logs when the region server dies, so I'm not sure it is the situation you are thinking of.  Note that we are going to enable GC logging to get an idea of how long our regionservers pause in GC.  We also set hbase.hregion.memstore.mslab.enabled to true in hbase-site.xml, which seems to have helped, although I don't have many data points yet to confirm.  Before we enabled mslab we had region servers die every day or two.  After enabling mslab we just went 4 days without a regionserver dying, which is an improvement.
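
   For reference, the mslab change is just this single flag in hbase-site.xml (the description text below is my paraphrase, and the region servers need a restart to pick it up):

<configuration>
  <property>
    <name>hbase.hregion.memstore.mslab.enabled</name>
    <value>true</value>
    <description>Enable the MemStore-Local Allocation Buffer, which reduces old-generation heap fragmentation under write load.</description>
  </property>
</configuration>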

   Following up on our latest region server death, it seems to have died because it couldn't close the hlog, which sounds like the situation you mentioned in your response, although I'm not sure I fully understand the IO fencing.  The stack traces are below.

   Are there any configuration parameters that could help keep the regionserver from dying when it encounters this situation?  We generally just restart the regionserver and it recovers nicely, but it would be ideal not to have to log in and restart regionservers when this happens.

   At this point it sounds like my best bet is to determine whether we are still seeing big GC pauses even with mslab enabled, to confirm whether that is our problem.

   Thanks for your help

   Andy



2011-04-18 02:04:12,862 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append. Requesting close of hlog
java.io.IOException: Error Recovery for block blk_-4878958038737199123_30707596 failed  because recovery from primary datanode 10.18.0.28:50010 failed 6 times.  Pipeline was 10.18.0.28:50010. Aborting...
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
2011-04-18 02:04:12,862 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append. Requesting close of hlog
java.io.IOException: Error Recovery for block blk_-4878958038737199123_30707596 failed  because recovery from primary datanode 10.18.0.28:50010 failed 6 times.  Pipeline was 10.18.0.28:50010. Aborting...

  The above exception seems to correlate to the following error on the datanode.

2011-04-18 02:04:08,773 WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to getBlockMetaDataInfo for block (=blk_-4878958038737199123_30707596) from datanode (=10.18.0.28:50010)
java.io.IOException: Block blk_-4878958038737199123_30711904 length is 45855569 does not match block file length 45853290
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.validateBlockMetadata(FSDataset.java:1661)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.startBlockRecovery(FSDataset.java:1921)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.startBlockRecovery(DataNode.java:1658)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1772)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1930)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1415)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1411)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Unknown Source)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1409)
2011-04-18 02:04:08,774 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 50020, call recoverBlock(blk_-4878958038737199123_30707596, false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@2cda6e31) from 10.18.0.28:52651: error: java.io.IOException: All datanodes failed: block=blk_-4878958038737199123_30707596, datanodeids=[10.18.0.28:50010]

  And the block does look like it is associated with a log file:

2011-04-18 01:37:17,661 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /user/hbase/.logs/hd28.dfs.returnpath.net,60020,1302812828956/hd28.dfs.returnpath.net%3A60020.1303112237535. blk_-4878958038737199123_30707596




-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
Sent: Thursday, April 14, 2011 11:20 AM
To: user@hbase.apache.org
Subject: Re: DFS stability running HBase and dfs.datanode.handler.count...

This is probably a red herring, for example if the region server had a
big GC pause then the master could have already split the log and the
region server wouldn't be able to close it (that's our version of IO
fencing). So from that exception look back in the log and see if
there's anything like :

INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have
not heard from server in some_big_number ms

J-D

On Thu, Apr 14, 2011 at 7:24 AM, Andy Sautins
<an...@returnpath.net> wrote:
>
>  Thanks for the response stack.  Yes we tried increasing dfs.datanode.handler.count to 8.   At this point I would say it didn't seem to resolve the issue we are seeing, but we it also doesn't seem to be hurting anything so for right now we're going to leave it in at 8 while we continue to debug.
>
>  In regard to the original error I posted ( Block 'x' is not valid ) we have chased that down thanks to your suggestion of looking at the logs for the history of the block.  It _looks_ like our 'is not valid' block errors are unrelated and due to chmod or deleting mapreduce output directories directly after a run.  We are still isolating that but it looks like it's not HBase releated so I'll move that to another list.  Thank you very much for your debugging suggestions.
>
>   The one issue we are still seeing is that we will occasionally have a regionserver die with the following exception.  I need to chase that down a little more but it seems similar to a post from 2/13/2011 (http://www.mail-archive.com/user@hbase.apache.org/msg05550.html ) that I'm not sure was ever resolved or not.  If anyone has any insight on how to debug the following error a little more I would appreciate any thoughts you might have.
>
> 2011-04-14 06:05:13,001 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /user/hbase/.logs/hd10.dfs.returnpath.net,60020,1302555127291/hd10.dfs.returnpath.net%3A60020.1302781635921 : java.io.IOException: Error Recovery for block blk_1315316969665710488_29842654 failed  because recovery from primary datanode 10.18.0.16:50010 failed 6 times.  Pipeline was 10.18.0.16:50010. Aborting...
> java.io.IOException: Error Recovery for block blk_1315316969665710488_29842654 failed  because recovery from primary datanode 10.18.0.16:50010 failed 6 times.  Pipeline was 10.18.0.16:50010. Aborting...
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
>
> Other than the above exception causing a region server to die occasionally everything seems to be working well.
>
> Note we have now upgraded to Cloudera CDH Version 3 Update 0 ( hadoop 0.20.2+923.21 and hbase 0.90.1+15.18 ) and still see the above exception.  We do have ulimit set ( memory unlimited and files 32k ) for the user running hbase.
>
> Thanks again for your help
>
>  Andy
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> Sent: Sunday, April 10, 2011 1:16 PM
> To: user@hbase.apache.org
> Cc: Andy Sautins
> Subject: Re: DFS stability running HBase and dfs.datanode.handler.count...
>
> Did you try upping it Andy?  Andrew Purtell's recommendation though old would have come of experience.  The Intel article reads like sales but there is probably merit to its suggestion.  The Cloudera article is more unsure about the effect of upping handlers though it allows "...could be set a bit higher."
>
> I just looked at our prod frontend and its set to 3 still.  I don't see your exceptions in our DN log.
>
> What version of hadoop?  You say hbase 0.91.  You mean 0.90.1?
>
> ulimit and nproc are set sufficiently high for hadoop/hbase user?
>
> If you grep 163126943925471435_28809750 in namenode log, do you see a delete occur before a later open?
>
> St.Ack
>
> On Sat, Apr 9, 2011 at 4:35 PM, Andy Sautins <an...@returnpath.net> wrote:
>>
>>    I ran across an mailing list posting from 1/4/2009 that seemed to indicate increasing dfs.datanode.handler.count could help improve DFS stability (http://mail-archives.apache.org/mod_mbox/hbase-user/200901.mbox/%3C49605FE0.9040509@duboce.net%3E ).  The posting seems to indicate the wiki was updated, but I don't seen anything in the wiki about increasing dfs.datanode.handler.count.   I have seen a few other notes that seem to show examples that have raised dfs.datanode.handler.count including one from an IBM article (http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/ ) and the Pro Hadoop book, but other than that the only other mention I see is from cloudera seems luke-warm on increasing dfs.datanode.handler.count (http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/ ).
>>
>>    Given the post is from 2009 I thought I'd ask if anyone has had any success improving stability of HBase/DFS when increasing dfs.datanode.handler.count.  The specific error we are seeing somewhat  frequently ( few hundred times per day ) in the datanode longs is as follows:
>>
>> 2011-04-09 00:12:48,035 ERROR
>> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> DatanodeRegistration(10.18.0.33:50010,
>> storageID=DS-1501576934-10.18.0.33-50010-1296248656454,
>> infoPort=50075, ipcPort=50020):DataXceiver
>> java.io.IOException: Block blk_-163126943925471435_28809750 is not valid.
>>
>>   The above seems to correspond to ClosedChannelExceptions in the hbase regionserver logs as well as some warnings about long write to hlog ( some in the 50+ seconds ).
>>
>>    The biggest end-user facing issue we are seeing is that Task Trackers keep getting blacklisted.  It's quite possible our problem is unrelated to anything HBase, but I thought it was worth asking given what we've been seeing.
>>
>>   We are currently running 0.91 on an 18 node cluster with ~3k total regions and each region server is running with 2G of memory.
>>
>>   Any insight would be appreciated.
>>
>>   Thanks
>>
>>    Andy
>>
>

Re: DFS stability running HBase and dfs.datanode.handler.count...

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Then this would be a heap limitation problem; have a look at your GC
log during the compaction.

J-D

On Thu, Apr 14, 2011 at 10:31 AM, Ben Aldrich <vr...@gmail.com> wrote:
> Our heapsize is set to 2gb, I think my dev issue was because I was running
> things off of a few vm's. Even though the compaction is in another thread it
> would still fail to respond during major compaction.
>
> -Ben
>
> On Thu, Apr 14, 2011 at 11:26 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:
>
>> Ben, the compaction is done in a background thread, it doesn't block
>> anything. Now if you had a heap close to 2GB, you could easily run
>> into issues.
>>
>> J-D
>>
>> On Thu, Apr 14, 2011 at 10:23 AM, Ben Aldrich <vr...@gmail.com> wrote:
>> > Just to chime in here, the other thing we changed was our max_file_size
>> is
>> > now set to 2gb instead of 512mb. This could be causing long compaction
>> > times. If a compaction takes too long it won't respond and can be marked
>> as
>> > dead. I have had this happen on my dev cluster a few times.
>> >
>> > -Ben
>> >
>> > On Thu, Apr 14, 2011 at 11:20 AM, Jean-Daniel Cryans <
>> jdcryans@apache.org>wrote:
>> >
>> >> This is probably a red herring, for example if the region server had a
>> >> big GC pause then the master could have already split the log and the
>> >> region server wouldn't be able to close it (that's our version of IO
>> >> fencing). So from that exception look back in the log and see if
>> >> there's anything like :
>> >>
>> >> INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have
>> >> not heard from server in some_big_number ms
>> >>
>> >> J-D
>> >>
>> >> On Thu, Apr 14, 2011 at 7:24 AM, Andy Sautins
>> >> <an...@returnpath.net> wrote:
>> >> >
>> >> >  Thanks for the response stack.  Yes we tried increasing
>> >> dfs.datanode.handler.count to 8.   At this point I would say it didn't
>> seem
>> >> to resolve the issue we are seeing, but we it also doesn't seem to be
>> >> hurting anything so for right now we're going to leave it in at 8 while
>> we
>> >> continue to debug.
>> >> >
>> >> >  In regard to the original error I posted ( Block 'x' is not valid )
>> we
>> >> have chased that down thanks to your suggestion of looking at the logs
>> for
>> >> the history of the block.  It _looks_ like our 'is not valid' block
>> errors
>> >> are unrelated and due to chmod or deleting mapreduce output directories
>> >> directly after a run.  We are still isolating that but it looks like
>> it's
>> >> not HBase releated so I'll move that to another list.  Thank you very
>> much
>> >> for your debugging suggestions.
>> >> >
>> >> >   The one issue we are still seeing is that we will occasionally have
>> a
>> >> regionserver die with the following exception.  I need to chase that
>> down a
>> >> little more but it seems similar to a post from 2/13/2011 (
>> >> http://www.mail-archive.com/user@hbase.apache.org/msg05550.html ) that
>> I'm
>> >> not sure was ever resolved or not.  If anyone has any insight on how to
>> >> debug the following error a little more I would appreciate any thoughts
>> you
>> >> might have.
>> >> >
>> >> > 2011-04-14 06:05:13,001 ERROR org.apache.hadoop.hdfs.DFSClient:
>> Exception
>> >> closing file /user/hbase/.logs/hd10.dfs.returnpath.net
>> >> ,60020,1302555127291/hd10.dfs.returnpath.net%3A60020.1302781635921 :
>> >> java.io.IOException: Error Recovery for block
>> >> blk_1315316969665710488_29842654 failed  because recovery from primary
>> >> datanode 10.18.0.16:50010 failed 6 times.  Pipeline was
>> 10.18.0.16:50010.
>> >> Aborting...
>> >> > java.io.IOException: Error Recovery for block
>> >> blk_1315316969665710488_29842654 failed  because recovery from primary
>> >> datanode 10.18.0.16:50010 failed 6 times.  Pipeline was
>> 10.18.0.16:50010.
>> >> Aborting...
>> >> >        at
>> >>
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
>> >> >        at
>> >>
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
>> >> >
>> >> > Other than the above exception causing a region server to die
>> >> occasionally everything seems to be working well.
>> >> >
>> >> > Note we have now upgraded to Cloudera CDH Version 3 Update 0 ( hadoop
>> >> 0.20.2+923.21 and hbase 0.90.1+15.18 ) and still see the above
>> exception.
>> >>  We do have ulimit set ( memory unlimited and files 32k ) for the user
>> >> running hbase.
>> >> >
>> >> > Thanks again for your help
>> >> >
>> >> >  Andy
>> >> >
>> >> > -----Original Message-----
>> >> > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
>> >> Stack
>> >> > Sent: Sunday, April 10, 2011 1:16 PM
>> >> > To: user@hbase.apache.org
>> >> > Cc: Andy Sautins
>> >> > Subject: Re: DFS stability running HBase and
>> >> dfs.datanode.handler.count...
>> >> >
>> >> > Did you try upping it Andy?  Andrew Purtell's recommendation though
>> old
>> >> would have come of experience.  The Intel article reads like sales but
>> there
>> >> is probably merit to its suggestion.  The Cloudera article is more
>> unsure
>> >> about the effect of upping handlers though it allows "...could be set a
>> bit
>> >> higher."
>> >> >
>> >> > I just looked at our prod frontend and its set to 3 still.  I don't
>> see
>> >> your exceptions in our DN log.
>> >> >
>> >> > What version of hadoop?  You say hbase 0.91.  You mean 0.90.1?
>> >> >
>> >> > ulimit and nproc are set sufficiently high for hadoop/hbase user?
>> >> >
>> >> > If you grep 163126943925471435_28809750 in namenode log, do you see a
>> >> delete occur before a later open?
>> >> >
>> >> > St.Ack
>> >> >
>> >> > On Sat, Apr 9, 2011 at 4:35 PM, Andy Sautins <
>> >> andy.sautins@returnpath.net> wrote:
>> >> >>
>> >> >>    I ran across an mailing list posting from 1/4/2009 that seemed to
>> >> indicate increasing dfs.datanode.handler.count could help improve DFS
>> >> stability (
>> >>
>> http://mail-archives.apache.org/mod_mbox/hbase-user/200901.mbox/%3C49605FE0.9040509@duboce.net%3E).
>>  The posting seems to indicate the wiki was updated, but I don't seen
>> >> anything in the wiki about increasing dfs.datanode.handler.count.   I
>> have
>> >> seen a few other notes that seem to show examples that have raised
>> >> dfs.datanode.handler.count including one from an IBM article (
>> >>
>> http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/)
>> and the Pro Hadoop book, but other than that the only other mention I see
>> >> is from cloudera seems luke-warm on increasing
>> dfs.datanode.handler.count (
>> >>
>> http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/
>> ).
>> >> >>
>> >> >>    Given the post is from 2009 I thought I'd ask if anyone has had
>> any
>> >> success improving stability of HBase/DFS when increasing
>> >> dfs.datanode.handler.count.  The specific error we are seeing somewhat
>> >>  frequently ( few hundred times per day ) in the datanode longs is as
>> >> follows:
>> >> >>
>> >> >> 2011-04-09 00:12:48,035 ERROR
>> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> >> >> DatanodeRegistration(10.18.0.33:50010,
>> >> >> storageID=DS-1501576934-10.18.0.33-50010-1296248656454,
>> >> >> infoPort=50075, ipcPort=50020):DataXceiver
>> >> >> java.io.IOException: Block blk_-163126943925471435_28809750 is not
>> >> valid.
>> >> >>
>> >> >>   The above seems to correspond to ClosedChannelExceptions in the
>> hbase
>> >> regionserver logs as well as some warnings about long write to hlog (
>> some
>> >> in the 50+ seconds ).
>> >> >>
>> >> >>    The biggest end-user facing issue we are seeing is that Task
>> Trackers
>> >> keep getting blacklisted.  It's quite possible our problem is unrelated
>> to
>> >> anything HBase, but I thought it was worth asking given what we've been
>> >> seeing.
>> >> >>
>> >> >>   We are currently running 0.91 on an 18 node cluster with ~3k total
>> >> regions and each region server is running with 2G of memory.
>> >> >>
>> >> >>   Any insight would be appreciated.
>> >> >>
>> >> >>   Thanks
>> >> >>
>> >> >>    Andy
>> >> >>
>> >> >
>> >>
>> >
>>
>

Re: DFS stability running HBase and dfs.datanode.handler.count...

Posted by Ben Aldrich <vr...@gmail.com>.
Our heapsize is set to 2GB. I think my dev issue was because I was
running things off of a few VMs. Even though the compaction is in
another thread, the region server would still fail to respond during
major compaction.

-Ben

On Thu, Apr 14, 2011 at 11:26 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> Ben, the compaction is done in a background thread, it doesn't block
> anything. Now if you had a heap close to 2GB, you could easily run
> into issues.
>
> J-D
>
> On Thu, Apr 14, 2011 at 10:23 AM, Ben Aldrich <vr...@gmail.com> wrote:
> > Just to chime in here, the other thing we changed was our max_file_size
> is
> > now set to 2gb instead of 512mb. This could be causing long compaction
> > times. If a compaction takes too long it won't respond and can be marked
> as
> > dead. I have had this happen on my dev cluster a few times.
> >
> > -Ben
> >
> > On Thu, Apr 14, 2011 at 11:20 AM, Jean-Daniel Cryans <
> jdcryans@apache.org>wrote:
> >
> >> This is probably a red herring, for example if the region server had a
> >> big GC pause then the master could have already split the log and the
> >> region server wouldn't be able to close it (that's our version of IO
> >> fencing). So from that exception look back in the log and see if
> >> there's anything like :
> >>
> >> INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have
> >> not heard from server in some_big_number ms
> >>
> >> J-D
> >>
> >> On Thu, Apr 14, 2011 at 7:24 AM, Andy Sautins
> >> <an...@returnpath.net> wrote:
> >> >
> >> >  Thanks for the response stack.  Yes we tried increasing
> >> dfs.datanode.handler.count to 8.   At this point I would say it didn't
> seem
> >> to resolve the issue we are seeing, but we it also doesn't seem to be
> >> hurting anything so for right now we're going to leave it in at 8 while
> we
> >> continue to debug.
> >> >
> >> >  In regard to the original error I posted ( Block 'x' is not valid )
> we
> >> have chased that down thanks to your suggestion of looking at the logs
> for
> >> the history of the block.  It _looks_ like our 'is not valid' block
> errors
> >> are unrelated and due to chmod or deleting mapreduce output directories
> >> directly after a run.  We are still isolating that but it looks like
> it's
> >> not HBase releated so I'll move that to another list.  Thank you very
> much
> >> for your debugging suggestions.
> >> >
> >> >   The one issue we are still seeing is that we will occasionally have
> a
> >> regionserver die with the following exception.  I need to chase that
> down a
> >> little more but it seems similar to a post from 2/13/2011 (
> >> http://www.mail-archive.com/user@hbase.apache.org/msg05550.html ) that
> I'm
> >> not sure was ever resolved or not.  If anyone has any insight on how to
> >> debug the following error a little more I would appreciate any thoughts
> you
> >> might have.
> >> >
> >> > 2011-04-14 06:05:13,001 ERROR org.apache.hadoop.hdfs.DFSClient:
> Exception
> >> closing file /user/hbase/.logs/hd10.dfs.returnpath.net
> >> ,60020,1302555127291/hd10.dfs.returnpath.net%3A60020.1302781635921 :
> >> java.io.IOException: Error Recovery for block
> >> blk_1315316969665710488_29842654 failed  because recovery from primary
> >> datanode 10.18.0.16:50010 failed 6 times.  Pipeline was
> 10.18.0.16:50010.
> >> Aborting...
> >> > java.io.IOException: Error Recovery for block
> >> blk_1315316969665710488_29842654 failed  because recovery from primary
> >> datanode 10.18.0.16:50010 failed 6 times.  Pipeline was
> 10.18.0.16:50010.
> >> Aborting...
> >> >        at
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
> >> >        at
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
> >> >
> >> > Other than the above exception causing a region server to die
> >> occasionally everything seems to be working well.
> >> >
> >> > Note we have now upgraded to Cloudera CDH Version 3 Update 0 ( hadoop
> >> 0.20.2+923.21 and hbase 0.90.1+15.18 ) and still see the above
> exception.
> >>  We do have ulimit set ( memory unlimited and files 32k ) for the user
> >> running hbase.
> >> >
> >> > Thanks again for your help
> >> >
> >> >  Andy
> >> >
> >> > -----Original Message-----
> >> > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> >> Stack
> >> > Sent: Sunday, April 10, 2011 1:16 PM
> >> > To: user@hbase.apache.org
> >> > Cc: Andy Sautins
> >> > Subject: Re: DFS stability running HBase and
> >> dfs.datanode.handler.count...
> >> >
> >> > Did you try upping it Andy?  Andrew Purtell's recommendation though
> old
> >> would have come of experience.  The Intel article reads like sales but
> there
> >> is probably merit to its suggestion.  The Cloudera article is more
> unsure
> >> about the effect of upping handlers though it allows "...could be set a
> bit
> >> higher."
> >> >
> >> > I just looked at our prod frontend and its set to 3 still.  I don't
> see
> >> your exceptions in our DN log.
> >> >
> >> > What version of hadoop?  You say hbase 0.91.  You mean 0.90.1?
> >> >
> >> > ulimit and nproc are set sufficiently high for hadoop/hbase user?
> >> >
> >> > If you grep 163126943925471435_28809750 in namenode log, do you see a
> >> delete occur before a later open?
> >> >
> >> > St.Ack
> >> >
> >> > On Sat, Apr 9, 2011 at 4:35 PM, Andy Sautins <
> >> andy.sautins@returnpath.net> wrote:
> >> >>
> >> >>    I ran across an mailing list posting from 1/4/2009 that seemed to
> >> indicate increasing dfs.datanode.handler.count could help improve DFS
> >> stability (
> >>
> http://mail-archives.apache.org/mod_mbox/hbase-user/200901.mbox/%3C49605FE0.9040509@duboce.net%3E).
>  The posting seems to indicate the wiki was updated, but I don't seen
> >> anything in the wiki about increasing dfs.datanode.handler.count.   I
> have
> >> seen a few other notes that seem to show examples that have raised
> >> dfs.datanode.handler.count including one from an IBM article (
> >>
> http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/)
> and the Pro Hadoop book, but other than that the only other mention I see
> >> is from cloudera seems luke-warm on increasing
> dfs.datanode.handler.count (
> >>
> http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/
> ).
> >> >>
> >> >>    Given the post is from 2009 I thought I'd ask if anyone has had
> any
> >> success improving stability of HBase/DFS when increasing
> >> dfs.datanode.handler.count.  The specific error we are seeing somewhat
> >>  frequently ( few hundred times per day ) in the datanode longs is as
> >> follows:
> >> >>
> >> >> 2011-04-09 00:12:48,035 ERROR
> >> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> >> DatanodeRegistration(10.18.0.33:50010,
> >> >> storageID=DS-1501576934-10.18.0.33-50010-1296248656454,
> >> >> infoPort=50075, ipcPort=50020):DataXceiver
> >> >> java.io.IOException: Block blk_-163126943925471435_28809750 is not
> >> valid.
> >> >>
> >> >>   The above seems to correspond to ClosedChannelExceptions in the
> hbase
> >> regionserver logs as well as some warnings about long write to hlog (
> some
> >> in the 50+ seconds ).
> >> >>
> >> >>    The biggest end-user facing issue we are seeing is that Task
> Trackers
> >> keep getting blacklisted.  It's quite possible our problem is unrelated
> to
> >> anything HBase, but I thought it was worth asking given what we've been
> >> seeing.
> >> >>
> >> >>   We are currently running 0.91 on an 18 node cluster with ~3k total
> >> regions and each region server is running with 2G of memory.
> >> >>
> >> >>   Any insight would be appreciated.
> >> >>
> >> >>   Thanks
> >> >>
> >> >>    Andy
> >> >>
> >> >
> >>
> >
>

Re: DFS stability running HBase and dfs.datanode.handler.count...

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Ben, the compaction is done in a background thread; it doesn't block
anything. Now if you had a heap close to 2GB, you could easily run
into issues.

J-D

On Thu, Apr 14, 2011 at 10:23 AM, Ben Aldrich <vr...@gmail.com> wrote:
> Just to chime in here, the other thing we changed was our max_file_size is
> now set to 2gb instead of 512mb. This could be causing long compaction
> times. If a compaction takes too long it won't respond and can be marked as
> dead. I have had this happen on my dev cluster a few times.
>
> -Ben
>
> On Thu, Apr 14, 2011 at 11:20 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:
>
>> This is probably a red herring, for example if the region server had a
>> big GC pause then the master could have already split the log and the
>> region server wouldn't be able to close it (that's our version of IO
>> fencing). So from that exception look back in the log and see if
>> there's anything like :
>>
>> INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have
>> not heard from server in some_big_number ms
>>
>> J-D
>>
>> On Thu, Apr 14, 2011 at 7:24 AM, Andy Sautins
>> <an...@returnpath.net> wrote:
>> >
>> >  Thanks for the response stack.  Yes we tried increasing
>> dfs.datanode.handler.count to 8.   At this point I would say it didn't seem
>> to resolve the issue we are seeing, but we it also doesn't seem to be
>> hurting anything so for right now we're going to leave it in at 8 while we
>> continue to debug.
>> >
>> >  In regard to the original error I posted ( Block 'x' is not valid ) we
>> have chased that down thanks to your suggestion of looking at the logs for
>> the history of the block.  It _looks_ like our 'is not valid' block errors
>> are unrelated and due to chmod or deleting mapreduce output directories
>> directly after a run.  We are still isolating that but it looks like it's
>> not HBase releated so I'll move that to another list.  Thank you very much
>> for your debugging suggestions.
>> >
>> >   The one issue we are still seeing is that we will occasionally have a
>> regionserver die with the following exception.  I need to chase that down a
>> little more but it seems similar to a post from 2/13/2011 (
>> http://www.mail-archive.com/user@hbase.apache.org/msg05550.html ) that I'm
>> not sure was ever resolved or not.  If anyone has any insight on how to
>> debug the following error a little more I would appreciate any thoughts you
>> might have.
>> >
>> > 2011-04-14 06:05:13,001 ERROR org.apache.hadoop.hdfs.DFSClient: Exception
>> closing file /user/hbase/.logs/hd10.dfs.returnpath.net
>> ,60020,1302555127291/hd10.dfs.returnpath.net%3A60020.1302781635921 :
>> java.io.IOException: Error Recovery for block
>> blk_1315316969665710488_29842654 failed  because recovery from primary
>> datanode 10.18.0.16:50010 failed 6 times.  Pipeline was 10.18.0.16:50010.
>> Aborting...
>> > java.io.IOException: Error Recovery for block
>> blk_1315316969665710488_29842654 failed  because recovery from primary
>> datanode 10.18.0.16:50010 failed 6 times.  Pipeline was 10.18.0.16:50010.
>> Aborting...
>> >        at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
>> >        at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
>> >
>> > Other than the above exception causing a region server to die
>> occasionally everything seems to be working well.
>> >
>> > Note we have now upgraded to Cloudera CDH Version 3 Update 0 ( hadoop
>> 0.20.2+923.21 and hbase 0.90.1+15.18 ) and still see the above exception.
>>  We do have ulimit set ( memory unlimited and files 32k ) for the user
>> running hbase.
>> >
>> > Thanks again for your help
>> >
>> >  Andy
>> >
>> > -----Original Message-----
>> > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
>> Stack
>> > Sent: Sunday, April 10, 2011 1:16 PM
>> > To: user@hbase.apache.org
>> > Cc: Andy Sautins
>> > Subject: Re: DFS stability running HBase and
>> dfs.datanode.handler.count...
>> >
>> > Did you try upping it Andy?  Andrew Purtell's recommendation though old
>> would have come of experience.  The Intel article reads like sales but there
>> is probably merit to its suggestion.  The Cloudera article is more unsure
>> about the effect of upping handlers though it allows "...could be set a bit
>> higher."
>> >
>> > I just looked at our prod frontend and its set to 3 still.  I don't see
>> your exceptions in our DN log.
>> >
>> > What version of hadoop?  You say hbase 0.91.  You mean 0.90.1?
>> >
>> > ulimit and nproc are set sufficiently high for hadoop/hbase user?
>> >
>> > If you grep 163126943925471435_28809750 in namenode log, do you see a
>> delete occur before a later open?
>> >
>> > St.Ack
>> >
>> > On Sat, Apr 9, 2011 at 4:35 PM, Andy Sautins <
>> andy.sautins@returnpath.net> wrote:
>> >>
>> >>    I ran across an mailing list posting from 1/4/2009 that seemed to
>> indicate increasing dfs.datanode.handler.count could help improve DFS
>> stability (
>> http://mail-archives.apache.org/mod_mbox/hbase-user/200901.mbox/%3C49605FE0.9040509@duboce.net%3E).  The posting seems to indicate the wiki was updated, but I don't seen
>> anything in the wiki about increasing dfs.datanode.handler.count.   I have
>> seen a few other notes that seem to show examples that have raised
>> dfs.datanode.handler.count including one from an IBM article (
>> http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/) and the Pro Hadoop book, but other than that the only other mention I see
>> is from cloudera seems luke-warm on increasing dfs.datanode.handler.count (
>> http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/).
>> >>
>> >>    Given the post is from 2009 I thought I'd ask if anyone has had any
>> success improving stability of HBase/DFS when increasing
>> dfs.datanode.handler.count.  The specific error we are seeing somewhat
>>  frequently ( few hundred times per day ) in the datanode longs is as
>> follows:
>> >>
>> >> 2011-04-09 00:12:48,035 ERROR
>> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> >> DatanodeRegistration(10.18.0.33:50010,
>> >> storageID=DS-1501576934-10.18.0.33-50010-1296248656454,
>> >> infoPort=50075, ipcPort=50020):DataXceiver
>> >> java.io.IOException: Block blk_-163126943925471435_28809750 is not
>> valid.
>> >>
>> >>   The above seems to correspond to ClosedChannelExceptions in the hbase
>> regionserver logs as well as some warnings about long write to hlog ( some
>> in the 50+ seconds ).
>> >>
>> >>    The biggest end-user facing issue we are seeing is that Task Trackers
>> keep getting blacklisted.  It's quite possible our problem is unrelated to
>> anything HBase, but I thought it was worth asking given what we've been
>> seeing.
>> >>
>> >>   We are currently running 0.91 on an 18 node cluster with ~3k total
>> regions and each region server is running with 2G of memory.
>> >>
>> >>   Any insight would be appreciated.
>> >>
>> >>   Thanks
>> >>
>> >>    Andy
>> >>
>> >
>>
>

Re: DFS stability running HBase and dfs.datanode.handler.count...

Posted by Ben Aldrich <vr...@gmail.com>.
Just to chime in here, the other thing we changed was max_file_size,
which is now set to 2GB instead of 512MB. This could be causing long
compaction times. If a compaction takes too long, the region server
won't respond and can be marked as dead. I have had this happen on my
dev cluster a few times.
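
For reference, if that max_file_size refers to the
hbase.hregion.max.filesize property (it could also be a per-table
MAX_FILESIZE attribute), the cluster-wide form of the change would
look roughly like this in hbase-site.xml:

<configuration>
  <!-- Illustrative: 2147483648 bytes = 2GB. Larger regions mean fewer
       regions per server but bigger, longer-running compactions. -->
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>2147483648</value>
    <description>Split a region once its largest store file exceeds this size, in bytes.</description>
  </property>
</configuration>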

-Ben

On Thu, Apr 14, 2011 at 11:20 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> This is probably a red herring, for example if the region server had a
> big GC pause then the master could have already split the log and the
> region server wouldn't be able to close it (that's our version of IO
> fencing). So from that exception look back in the log and see if
> there's anything like :
>
> INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have
> not heard from server in some_big_number ms
>
> J-D
>
> On Thu, Apr 14, 2011 at 7:24 AM, Andy Sautins
> <an...@returnpath.net> wrote:
> >
> >  Thanks for the response stack.  Yes we tried increasing
> dfs.datanode.handler.count to 8.   At this point I would say it didn't seem
> to resolve the issue we are seeing, but we it also doesn't seem to be
> hurting anything so for right now we're going to leave it in at 8 while we
> continue to debug.
> >
> >  In regard to the original error I posted ( Block 'x' is not valid ) we
> have chased that down thanks to your suggestion of looking at the logs for
> the history of the block.  It _looks_ like our 'is not valid' block errors
> are unrelated and due to chmod or deleting mapreduce output directories
> directly after a run.  We are still isolating that but it looks like it's
> not HBase releated so I'll move that to another list.  Thank you very much
> for your debugging suggestions.
> >
> >   The one issue we are still seeing is that we will occasionally have a
> regionserver die with the following exception.  I need to chase that down a
> little more but it seems similar to a post from 2/13/2011 (
> http://www.mail-archive.com/user@hbase.apache.org/msg05550.html ) that I'm
> not sure was ever resolved or not.  If anyone has any insight on how to
> debug the following error a little more I would appreciate any thoughts you
> might have.
> >
> > 2011-04-14 06:05:13,001 ERROR org.apache.hadoop.hdfs.DFSClient: Exception
> closing file /user/hbase/.logs/hd10.dfs.returnpath.net
> ,60020,1302555127291/hd10.dfs.returnpath.net%3A60020.1302781635921 :
> java.io.IOException: Error Recovery for block
> blk_1315316969665710488_29842654 failed  because recovery from primary
> datanode 10.18.0.16:50010 failed 6 times.  Pipeline was 10.18.0.16:50010.
> Aborting...
> > java.io.IOException: Error Recovery for block
> blk_1315316969665710488_29842654 failed  because recovery from primary
> datanode 10.18.0.16:50010 failed 6 times.  Pipeline was 10.18.0.16:50010.
> Aborting...
> >        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
> >        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
> >
> > Other than the above exception causing a region server to die
> occasionally everything seems to be working well.
> >
> > Note we have now upgraded to Cloudera CDH Version 3 Update 0 ( hadoop
> 0.20.2+923.21 and hbase 0.90.1+15.18 ) and still see the above exception.
>  We do have ulimit set ( memory unlimited and files 32k ) for the user
> running hbase.
> >
> > Thanks again for your help
> >
> >  Andy
> >
> > -----Original Message-----
> > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> Stack
> > Sent: Sunday, April 10, 2011 1:16 PM
> > To: user@hbase.apache.org
> > Cc: Andy Sautins
> > Subject: Re: DFS stability running HBase and
> dfs.datanode.handler.count...
> >
> > Did you try upping it Andy?  Andrew Purtell's recommendation though old
> would have come of experience.  The Intel article reads like sales but there
> is probably merit to its suggestion.  The Cloudera article is more unsure
> about the effect of upping handlers though it allows "...could be set a bit
> higher."
> >
> > I just looked at our prod frontend and its set to 3 still.  I don't see
> your exceptions in our DN log.
> >
> > What version of hadoop?  You say hbase 0.91.  You mean 0.90.1?
> >
> > ulimit and nproc are set sufficiently high for hadoop/hbase user?
> >
> > If you grep 163126943925471435_28809750 in namenode log, do you see a
> delete occur before a later open?
> >
> > St.Ack
> >
> > On Sat, Apr 9, 2011 at 4:35 PM, Andy Sautins <
> andy.sautins@returnpath.net> wrote:
> >>
> >>    I ran across an mailing list posting from 1/4/2009 that seemed to
> indicate increasing dfs.datanode.handler.count could help improve DFS
> stability (
> http://mail-archives.apache.org/mod_mbox/hbase-user/200901.mbox/%3C49605FE0.9040509@duboce.net%3E).  The posting seems to indicate the wiki was updated, but I don't seen
> anything in the wiki about increasing dfs.datanode.handler.count.   I have
> seen a few other notes that seem to show examples that have raised
> dfs.datanode.handler.count including one from an IBM article (
> http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/) and the Pro Hadoop book, but other than that the only other mention I see
> is from cloudera seems luke-warm on increasing dfs.datanode.handler.count (
> http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/).
> >>
> >>    Given the post is from 2009 I thought I'd ask if anyone has had any
> success improving stability of HBase/DFS when increasing
> dfs.datanode.handler.count.  The specific error we are seeing somewhat
>  frequently ( few hundred times per day ) in the datanode longs is as
> follows:
> >>
> >> 2011-04-09 00:12:48,035 ERROR
> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> DatanodeRegistration(10.18.0.33:50010,
> >> storageID=DS-1501576934-10.18.0.33-50010-1296248656454,
> >> infoPort=50075, ipcPort=50020):DataXceiver
> >> java.io.IOException: Block blk_-163126943925471435_28809750 is not
> valid.
> >>
> >>   The above seems to correspond to ClosedChannelExceptions in the hbase
> regionserver logs as well as some warnings about long write to hlog ( some
> in the 50+ seconds ).
> >>
> >>    The biggest end-user facing issue we are seeing is that Task Trackers
> keep getting blacklisted.  It's quite possible our problem is unrelated to
> anything HBase, but I thought it was worth asking given what we've been
> seeing.
> >>
> >>   We are currently running 0.91 on an 18 node cluster with ~3k total
> regions and each region server is running with 2G of memory.
> >>
> >>   Any insight would be appreciated.
> >>
> >>   Thanks
> >>
> >>    Andy
> >>
> >
>

Re: DFS stability running HBase and dfs.datanode.handler.count...

Posted by Jean-Daniel Cryans <jd...@apache.org>.
This is probably a red herring; for example, if the region server had
a big GC pause then the master could have already split the log and
the region server wouldn't be able to close it (that's our version of
IO fencing). So from that exception, look back in the log and see if
there's anything like:

INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have
not heard from server in some_big_number ms

J-D

On Thu, Apr 14, 2011 at 7:24 AM, Andy Sautins
<an...@returnpath.net> wrote:
>
>  Thanks for the response stack.  Yes we tried increasing dfs.datanode.handler.count to 8.   At this point I would say it didn't seem to resolve the issue we are seeing, but we it also doesn't seem to be hurting anything so for right now we're going to leave it in at 8 while we continue to debug.
>
>  In regard to the original error I posted ( Block 'x' is not valid ) we have chased that down thanks to your suggestion of looking at the logs for the history of the block.  It _looks_ like our 'is not valid' block errors are unrelated and due to chmod or deleting mapreduce output directories directly after a run.  We are still isolating that but it looks like it's not HBase releated so I'll move that to another list.  Thank you very much for your debugging suggestions.
>
>   The one issue we are still seeing is that we will occasionally have a regionserver die with the following exception.  I need to chase that down a little more but it seems similar to a post from 2/13/2011 (http://www.mail-archive.com/user@hbase.apache.org/msg05550.html ) that I'm not sure was ever resolved or not.  If anyone has any insight on how to debug the following error a little more I would appreciate any thoughts you might have.
>
> 2011-04-14 06:05:13,001 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /user/hbase/.logs/hd10.dfs.returnpath.net,60020,1302555127291/hd10.dfs.returnpath.net%3A60020.1302781635921 : java.io.IOException: Error Recovery for block blk_1315316969665710488_29842654 failed  because recovery from primary datanode 10.18.0.16:50010 failed 6 times.  Pipeline was 10.18.0.16:50010. Aborting...
> java.io.IOException: Error Recovery for block blk_1315316969665710488_29842654 failed  because recovery from primary datanode 10.18.0.16:50010 failed 6 times.  Pipeline was 10.18.0.16:50010. Aborting...
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
>
> Other than the above exception causing a region server to die occasionally everything seems to be working well.
>
> Note we have now upgraded to Cloudera CDH Version 3 Update 0 ( hadoop 0.20.2+923.21 and hbase 0.90.1+15.18 ) and still see the above exception.  We do have ulimit set ( memory unlimited and files 32k ) for the user running hbase.
>
> Thanks again for your help
>
>  Andy
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> Sent: Sunday, April 10, 2011 1:16 PM
> To: user@hbase.apache.org
> Cc: Andy Sautins
> Subject: Re: DFS stability running HBase and dfs.datanode.handler.count...
>
> Did you try upping it Andy?  Andrew Purtell's recommendation though old would have come of experience.  The Intel article reads like sales but there is probably merit to its suggestion.  The Cloudera article is more unsure about the effect of upping handlers though it allows "...could be set a bit higher."
>
> I just looked at our prod frontend and its set to 3 still.  I don't see your exceptions in our DN log.
>
> What version of hadoop?  You say hbase 0.91.  You mean 0.90.1?
>
> ulimit and nproc are set sufficiently high for hadoop/hbase user?
>
> If you grep 163126943925471435_28809750 in namenode log, do you see a delete occur before a later open?
>
> St.Ack
>
> On Sat, Apr 9, 2011 at 4:35 PM, Andy Sautins <an...@returnpath.net> wrote:
>>
>>    I ran across an mailing list posting from 1/4/2009 that seemed to indicate increasing dfs.datanode.handler.count could help improve DFS stability (http://mail-archives.apache.org/mod_mbox/hbase-user/200901.mbox/%3C49605FE0.9040509@duboce.net%3E ).  The posting seems to indicate the wiki was updated, but I don't seen anything in the wiki about increasing dfs.datanode.handler.count.   I have seen a few other notes that seem to show examples that have raised dfs.datanode.handler.count including one from an IBM article (http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/ ) and the Pro Hadoop book, but other than that the only other mention I see is from cloudera seems luke-warm on increasing dfs.datanode.handler.count (http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/ ).
>>
>>    Given the post is from 2009 I thought I'd ask if anyone has had any success improving stability of HBase/DFS when increasing dfs.datanode.handler.count.  The specific error we are seeing somewhat  frequently ( few hundred times per day ) in the datanode longs is as follows:
>>
>> 2011-04-09 00:12:48,035 ERROR
>> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> DatanodeRegistration(10.18.0.33:50010,
>> storageID=DS-1501576934-10.18.0.33-50010-1296248656454,
>> infoPort=50075, ipcPort=50020):DataXceiver
>> java.io.IOException: Block blk_-163126943925471435_28809750 is not valid.
>>
>>   The above seems to correspond to ClosedChannelExceptions in the hbase regionserver logs as well as some warnings about long write to hlog ( some in the 50+ seconds ).
>>
>>    The biggest end-user facing issue we are seeing is that Task Trackers keep getting blacklisted.  It's quite possible our problem is unrelated to anything HBase, but I thought it was worth asking given what we've been seeing.
>>
>>   We are currently running 0.91 on an 18 node cluster with ~3k total regions and each region server is running with 2G of memory.
>>
>>   Any insight would be appreciated.
>>
>>   Thanks
>>
>>    Andy
>>
>

RE: DFS stability running HBase and dfs.datanode.handler.count...

Posted by Andy Sautins <an...@returnpath.net>.
  Thanks for the response, Stack.  Yes, we tried increasing dfs.datanode.handler.count to 8.  At this point I would say it didn't seem to resolve the issue we are seeing, but it also doesn't seem to be hurting anything, so for right now we're going to leave it at 8 while we continue to debug.

  In regard to the original error I posted (Block 'x' is not valid), we have chased that down thanks to your suggestion of looking at the logs for the history of the block.  It _looks_ like our 'is not valid' block errors are unrelated and due to chmod-ing or deleting mapreduce output directories directly after a run.  We are still isolating that, but it looks like it's not HBase related, so I'll move that to another list.  Thank you very much for your debugging suggestions.

   The one issue we are still seeing is that we will occasionally have a regionserver die with the following exception.  I need to chase that down a little more, but it seems similar to a post from 2/13/2011 (http://www.mail-archive.com/user@hbase.apache.org/msg05550.html ) that I'm not sure was ever resolved.  If anyone has any insight on how to debug the following error a little more, I would appreciate any thoughts you might have.

2011-04-14 06:05:13,001 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /user/hbase/.logs/hd10.dfs.returnpath.net,60020,1302555127291/hd10.dfs.returnpath.net%3A60020.1302781635921 : java.io.IOException: Error Recovery for block blk_1315316969665710488_29842654 failed  because recovery from primary datanode 10.18.0.16:50010 failed 6 times.  Pipeline was 10.18.0.16:50010. Aborting...
java.io.IOException: Error Recovery for block blk_1315316969665710488_29842654 failed  because recovery from primary datanode 10.18.0.16:50010 failed 6 times.  Pipeline was 10.18.0.16:50010. Aborting...
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)

Other than the above exception causing a region server to die occasionally, everything seems to be working well.

Note we have now upgraded to Cloudera CDH Version 3 Update 0 ( hadoop 0.20.2+923.21 and hbase 0.90.1+15.18 ) and still see the above exception.  We do have ulimit set ( memory unlimited and files 32k ) for the user running hbase.  

Thanks again for your help

 Andy

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Sunday, April 10, 2011 1:16 PM
To: user@hbase.apache.org
Cc: Andy Sautins
Subject: Re: DFS stability running HBase and dfs.datanode.handler.count...

Did you try upping it Andy?  Andrew Purtell's recommendation though old would have come of experience.  The Intel article reads like sales but there is probably merit to its suggestion.  The Cloudera article is more unsure about the effect of upping handlers though it allows "...could be set a bit higher."

I just looked at our prod frontend and its set to 3 still.  I don't see your exceptions in our DN log.

What version of hadoop?  You say hbase 0.91.  You mean 0.90.1?

ulimit and nproc are set sufficiently high for hadoop/hbase user?

If you grep 163126943925471435_28809750 in namenode log, do you see a delete occur before a later open?

St.Ack

On Sat, Apr 9, 2011 at 4:35 PM, Andy Sautins <an...@returnpath.net> wrote:
>
>    I ran across an mailing list posting from 1/4/2009 that seemed to indicate increasing dfs.datanode.handler.count could help improve DFS stability (http://mail-archives.apache.org/mod_mbox/hbase-user/200901.mbox/%3C49605FE0.9040509@duboce.net%3E ).  The posting seems to indicate the wiki was updated, but I don't seen anything in the wiki about increasing dfs.datanode.handler.count.   I have seen a few other notes that seem to show examples that have raised dfs.datanode.handler.count including one from an IBM article (http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/ ) and the Pro Hadoop book, but other than that the only other mention I see is from cloudera seems luke-warm on increasing dfs.datanode.handler.count (http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/ ).
>
>    Given the post is from 2009 I thought I'd ask if anyone has had any success improving stability of HBase/DFS when increasing dfs.datanode.handler.count.  The specific error we are seeing somewhat  frequently ( few hundred times per day ) in the datanode longs is as follows:
>
> 2011-04-09 00:12:48,035 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> DatanodeRegistration(10.18.0.33:50010, 
> storageID=DS-1501576934-10.18.0.33-50010-1296248656454, 
> infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Block blk_-163126943925471435_28809750 is not valid.
>
>   The above seems to correspond to ClosedChannelExceptions in the hbase regionserver logs as well as some warnings about long write to hlog ( some in the 50+ seconds ).
>
>    The biggest end-user facing issue we are seeing is that Task Trackers keep getting blacklisted.  It's quite possible our problem is unrelated to anything HBase, but I thought it was worth asking given what we've been seeing.
>
>   We are currently running 0.91 on an 18 node cluster with ~3k total regions and each region server is running with 2G of memory.
>
>   Any insight would be appreciated.
>
>   Thanks
>
>    Andy
>

Re: DFS stability running HBase and dfs.datanode.handler.count...

Posted by Stack <st...@duboce.net>.
Did you try upping it, Andy?  Andrew Purtell's recommendation, though
old, would have come from experience.  The Intel article reads like
sales, but there is probably merit to its suggestion.  The Cloudera
article is more unsure about the effect of upping handlers, though it
allows "...could be set a bit higher."

I just looked at our prod frontend and it's set to 3 still.  I don't
see your exceptions in our DN log.

What version of hadoop?  You say hbase 0.91.  You mean 0.90.1?

ulimit and nproc are set sufficiently high for hadoop/hbase user?

If you grep 163126943925471435_28809750 in namenode log, do you see a
delete occur before a later open?

St.Ack

On Sat, Apr 9, 2011 at 4:35 PM, Andy Sautins
<an...@returnpath.net> wrote:
>
>    I ran across an mailing list posting from 1/4/2009 that seemed to indicate increasing dfs.datanode.handler.count could help improve DFS stability (http://mail-archives.apache.org/mod_mbox/hbase-user/200901.mbox/%3C49605FE0.9040509@duboce.net%3E ).  The posting seems to indicate the wiki was updated, but I don't seen anything in the wiki about increasing dfs.datanode.handler.count.   I have seen a few other notes that seem to show examples that have raised dfs.datanode.handler.count including one from an IBM article (http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/ ) and the Pro Hadoop book, but other than that the only other mention I see is from cloudera seems luke-warm on increasing dfs.datanode.handler.count (http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/ ).
>
>    Given the post is from 2009 I thought I'd ask if anyone has had any success improving stability of HBase/DFS when increasing dfs.datanode.handler.count.  The specific error we are seeing somewhat  frequently ( few hundred times per day ) in the datanode longs is as follows:
>
> 2011-04-09 00:12:48,035 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.18.0.33:50010, storageID=DS-1501576934-10.18.0.33-50010-1296248656454, infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Block blk_-163126943925471435_28809750 is not valid.
>
>   The above seems to correspond to ClosedChannelExceptions in the hbase regionserver logs as well as some warnings about long write to hlog ( some in the 50+ seconds ).
>
>    The biggest end-user facing issue we are seeing is that Task Trackers keep getting blacklisted.  It's quite possible our problem is unrelated to anything HBase, but I thought it was worth asking given what we've been seeing.
>
>   We are currently running 0.91 on an 18 node cluster with ~3k total regions and each region server is running with 2G of memory.
>
>   Any insight would be appreciated.
>
>   Thanks
>
>    Andy
>