Posted to user@hbase.apache.org by Eran Kutner <er...@gigya.com> on 2013/06/18 14:11:16 UTC

Slow log splitting (Hbase 0.94.6)

Hi,
We had a brute-force cluster shutdown event that was followed by log
recovery when the cluster went back online.
The cluster took hours to split the logs and recover the regions, which
might have made sense since we have quite a lot of regions (around 13K),
but the weird thing is that there was no obvious bottleneck during the
recovery process. CPU was almost idle on all the nodes, IO was at 5-20%
utilization, memory was OK, and the network wasn't overloaded, but it was
still slow.
Any idea what could be slowing it down?
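
For what it's worth, here is roughly how I've been checking the
split-related settings. The property names and defaults below are from my
reading of the 0.94 defaults (hbase.master.distributed.log.splitting should
default to true, and hbase.regionserver.hlog.splitlog.writer.threads to 3),
and the config path assumes a standard CDH layout, so please correct me if
any of that is wrong:

  # Sketch only: see whether distributed log splitting is on and whether any
  # of the split-related knobs have been overridden (path is illustrative).
  grep -B1 -A1 -E 'distributed\.log\.splitting|splitlog' \
      /etc/hbase/conf/hbase-site.xml
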

Thanks.

-eran

Re: Slow log splitting (Hbase 0.94.6)

Posted by Eran Kutner <er...@gigya.com>.
Sorry, I forgot to mention that we're using CDH 4.3, so Hadoop 2.0.0.

I'm not sure exactly what to look for in the NameNode logs, but grepping for
"lease" only produced a handful of results, all of which seem benign.
There were two occurrences of this:
2013-06-17 09:58:40,709 INFO
org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease.  Holder:
HDFS_NameNode, pendingcreates: 7] has expired hard limit
2013-06-17 09:58:40,709 WARN BlockStateChange: BLOCK*
BlockInfoUnderConstruction.initLeaseRecovery: No blocks found, lease
removed.
2013-06-17 09:58:40,709 WARN BlockStateChange: BLOCK*
BlockInfoUnderConstruction.initLeaseRecovery: No blocks found, lease
removed.
2013-06-17 09:58:40,710 WARN BlockStateChange: BLOCK*
BlockInfoUnderConstruction.initLeaseRecovery: No blocks found, lease
removed.
2013-06-17 09:58:40,710 WARN BlockStateChange: BLOCK*
BlockInfoUnderConstruction.initLeaseRecovery: No blocks found, lease
removed.
2013-06-17 09:58:40,710 WARN BlockStateChange: BLOCK*
BlockInfoUnderConstruction.initLeaseRecovery: No blocks found, lease
removed.
2013-06-17 09:58:40,710 WARN BlockStateChange: BLOCK*
BlockInfoUnderConstruction.initLeaseRecovery: No blocks found, lease
removed.
2013-06-17 09:58:40,710 WARN BlockStateChange: BLOCK*
BlockInfoUnderConstruction.initLeaseRecovery: No blocks found, lease
removed.

and one occurrence of this:
2013-06-17 10:04:25,012 INFO
org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease.  Holder:
DFSClient_NONMAPREDUCE_1958633016_62, pendingcreates: 1] has expired hard
limit
2013-06-17 10:04:25,012 WARN org.apache.hadoop.hdfs.StateChange: BLOCK*
internalReleaseLease: All existing blocks are COMPLETE, lease removed, file
closed.


However, I did notice quite a lot of these errors (almost a million of
them):
INFO org.apache.hadoop.security.JniBasedUnixGroupsMapping: Error getting
groups for hbase: No entry for user

I don't know why the NN is even trying to resolve the user's security
groups, since I have dfs.permissions set to false. Could this be the cause
of the problem?
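
In case it helps, here is roughly how I'd check whether the NN host can
resolve the hbase user at all. This is just a sketch to run on the NameNode
host; also note that in Hadoop 2 the property is dfs.permissions.enabled,
though as far as I know the old dfs.permissions name is still honored:

  # On the NameNode host: can the OS resolve the "hbase" user and its groups?
  id hbase
  getent passwd hbase
  # Ask HDFS itself which groups it maps the user to
  # (the 'hdfs groups' subcommand should exist in Hadoop 2, if I'm not mistaken)
  hdfs groups hbase
  # The permissions switch lives in hdfs-site.xml, e.g.:
  #   <property>
  #     <name>dfs.permissions.enabled</name>
  #     <value>false</value>
  #   </property>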


-eran


On Tue, Jun 18, 2013 at 3:15 PM, Ted Yu <yu...@gmail.com> wrote:

> What Hadoop version are you using?
>
> Can you check the NameNode log to see if lease recovery took a long time?
>
> Cheers
>
> On Jun 18, 2013, at 5:11 AM, Eran Kutner <er...@gigya.com> wrote:
>
> > Hi,
> > We had a brute-force cluster shutdown event that was followed by log
> > recovery when the cluster went back online.
> > The cluster took hours to split the logs and recover the regions, which
> > might have made sense since we have quite a lot of regions (around 13K),
> > but the weird thing is that there was no obvious bottleneck during the
> > recovery process. CPU was almost idle on all the nodes, IO was at 5-20%
> > utilization, memory was OK, and the network wasn't overloaded, but it was
> > still slow.
> > Any idea what could be slowing it down?
> >
> > Thanks.
> >
> > -eran
>

Re: Slow log splitting (Hbase 0.94.6)

Posted by Ted Yu <yu...@gmail.com>.
What Hadoop version are you using?

Can you check the NameNode log to see if lease recovery took a long time?

Cheers

On Jun 18, 2013, at 5:11 AM, Eran Kutner <er...@gigya.com> wrote:

> Hi,
> We had a brute-force cluster shutdown event that was followed by log
> recovery when the cluster went back online.
> The cluster took hours to split the logs and recover the regions, which
> might have made sense since we have quite a lot of regions (around 13K),
> but the weird thing is that there was no obvious bottleneck during the
> recovery process. CPU was almost idle on all the nodes, IO was at 5-20%
> utilization, memory was OK, and the network wasn't overloaded, but it was
> still slow.
> Any idea what could be slowing it down?
> 
> Thanks.
> 
> -eran