Posted to dev@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2008/11/27 03:10:44 UTC

[jira] Updated: (HBASE-616) " We slept XXXXXX ms, ten times longer than scheduled: 3000" happens frequently.

     [ https://issues.apache.org/jira/browse/HBASE-616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-616:
------------------------

    Fix Version/s: 0.19.0

jgray ran his uploader. Oddly, everything seemed to work fine and the upload appeared to finish cleanly, but then the GC log hung like this:

{code}
4702.183: [GC 723285K->644523K(903424K), 0.1020550 secs]
4741.964: [GC 751595K->674674K(903488K), 0.1030280 secs]
5386.336: [GC 781746K->698672K(903744K), 299.1547740 secs]
5685.743: [Full GC
{code}

It was hung after the emission of 'Full GC'.

Eventually it cleared with the GC having been hung for 400 seconds:

{code}
4643.716: [GC 693488K->616512K(897216K), 0.0775520 secs]
4702.183: [GC 723285K->644523K(903424K), 0.1020550 secs]
4741.964: [GC 751595K->674674K(903488K), 0.1030280 secs]
5386.336: [GC 781746K->698672K(903744K), 299.1547740 secs]
5685.743: [Full GC 698672K->694148K(903744K), 434.7056360 secs]
6161.052: [Full GC 789962K->630359K(903744K), 23.9435320 secs]
6197.831: [Full GC 737687K->478895K(903744K), 28.7396540 secs]
6261.250: [GC 586222K->481646K(903744K), 0.0385590 secs]
6305.213: [GC 588974K->481846K(905472K), 0.1706230 secs]
6359.292: [GC 591478K->482681K(906048K), 0.0147330 secs]
6469.725: [GC 592313K->482763K(907072K), 0.0059020 secs]
...
{code}
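
As an aside, pauses like these are easy to miss when eyeballing a long GC log. Something along the lines of the sketch below can flag them; this is not a tool that ships with HBase, it just assumes the verbose GC format shown above and an arbitrary 10-second threshold:

{code}
// Rough, illustrative scanner for long pauses in a -verbose:gc log whose
// lines look like "5386.336: [GC 781746K->698672K(903744K), 299.1547740 secs]".
// The 10-second threshold is an assumption, not anything HBase-specific.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcPauseScan {
  // Capture the pause duration that precedes " secs]" at the end of a GC line.
  private static final Pattern PAUSE =
      Pattern.compile(", ([0-9]+\\.[0-9]+) secs\\]\\s*$");

  public static void main(String[] args) throws Exception {
    double thresholdSecs = 10.0;  // assumed threshold; tune as needed
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      Matcher m = PAUSE.matcher(line);
      if (m.find() && Double.parseDouble(m.group(1)) > thresholdSecs) {
        System.out.println("Long pause: " + line);
      }
    }
    in.close();
  }
}
{code}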

Looking in ganglia, the whole cluster had gone into heavy swap.  Jon, looking at top, etc., saw that the machines had loads of free memory.  We could not explain the combination of swapping, low CPU, and free memory.

Whatever the cause, the cluster swap seems to be hanging GC, which in turn upsets hbase.
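
That would square with the Sleeper warnings quoted in the issue description below: the check there amounts to complaining when a sleep runs ten times longer than scheduled. A minimal sketch of that kind of check (an illustration only, not the actual org.apache.hadoop.hbase.util.Sleeper code):

{code}
// Sketch of a sleep-overshoot check like the one behind the warnings quoted
// below; NOT the actual org.apache.hadoop.hbase.util.Sleeper implementation.
public class SleepCheck {
  public static void sleepAndWarn(long periodMs) throws InterruptedException {
    long start = System.currentTimeMillis();
    Thread.sleep(periodMs);
    long slept = System.currentTimeMillis() - start;
    // A long GC pause or heavy swapping stalls the thread far past its period.
    if (slept > 10 * periodMs) {
      System.err.println("We slept " + slept
          + "ms, ten times longer than scheduled: " + periodMs);
    }
  }
}
{code}

If the thread that reports to the master stalls like that for long enough, the region server gives up, which matches the FATAL 'unable to report to master' line in the quoted log.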

Looking in the logs, nothing is really obvious, though hdfs and deletes are suspect (pset has reported deletes being problematic in the past; so has streamy).  We see a bunch of deleted commit files around that time, but looking at block deletes in the datanode, it seems to be deleting batches pretty constantly over time.  The datanodes at the time were handling startup of a new MR job that was opening loads of files.

One thing to try is a rerun with delete of commit logs disabled.
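
For that rerun, the deletion could sit behind a switch so it can be turned off without a rebuild. The sketch below is purely illustrative: the property name hbase.regionserver.deletecommitlogs and the helper are hypothetical, not existing HBase configuration:

{code}
// Hypothetical guard for the test run described above. The property name
// "hbase.regionserver.deletecommitlogs" is NOT a real HBase setting; it just
// illustrates putting commit-log deletion behind a switch a rerun can disable.
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogCleanup {
  private static final Log LOG = LogFactory.getLog(LogCleanup.class);

  static void maybeDeleteOldLog(Configuration conf, FileSystem fs, Path oldLog)
      throws IOException {
    if (conf.getBoolean("hbase.regionserver.deletecommitlogs", true)) {
      fs.delete(oldLog, true);   // normal path: remove the old commit log
    } else {
      LOG.info("Skipping delete of " + oldLog + " for this test run");
    }
  }
}
{code}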

> " We slept XXXXXX ms, ten times longer than scheduled: 3000" happens frequently.
> --------------------------------------------------------------------------------
>
>                 Key: HBASE-616
>                 URL: https://issues.apache.org/jira/browse/HBASE-616
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>             Fix For: 0.19.0
>
>
> Just saw the below in a log... all in a row on the one server.
> {code}
>    4493 2008-05-05 18:08:17,512 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 34557ms, ten times longer than scheduled: 3000
>    4494 2008-05-05 18:11:08,879 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 30576ms, ten times longer than scheduled: 3000
>    4495 2008-05-05 18:30:45,056 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 1091720ms, ten times longer than scheduled: 3000
>    4496 2008-05-05 18:30:45,056 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 1094209ms, ten times longer than scheduled: 10000
>    4497 2008-05-05 18:30:45,429 FATAL org.apache.hadoop.hbase.HRegionServer: unable to report to master for 1092093 milliseconds - aborting server
> {code}
> We're seeing these kinds of outages pretty frequently.  In the case above, it was a small cluster that was using TableReduce to insert.  MR, HDFS, and HBase were all running on the same nodes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.