Posted to dev@hbase.apache.org by Zhoushuaifeng <zh...@huawei.com> on 2011/04/25 13:39:37 UTC

"NOT flushing memstore for region" keep on printing for half an hour

Hi,
The version is HBase 0.90.1.
"NOT flushing memstore for region" kept printing for half an hour in the regionserver, so I restarted HBase. I think there may be a deadlock or a cycle.
I know that when splitting a region, it will doClose() the region, set writestate.writesEnabled = false, and may run a close preflush. This makes the flush fail and print "NOT flushing memstore for region", but it should finish after a while.
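
To illustrate the guard involved, here is a small, self-contained toy model of it (my own illustration, not the actual HBase source; class and method names such as FlushGuardDemo are made up). It shows why every flush request is refused once writesEnabled stays false:

// Toy model of the memstore flush guard (illustrative only, not HBase code).
public class FlushGuardDemo {

  static class WriteState {
    boolean flushing = false;
    boolean writesEnabled = true;
  }

  private final WriteState writestate = new WriteState();

  // Mimics the check at the top of a memstore flush: refuse if a flush is
  // already running or if writes have been disabled (e.g. by a region close).
  boolean flushcache(String region) {
    synchronized (writestate) {
      if (!writestate.flushing && writestate.writesEnabled) {
        writestate.flushing = true;
      } else {
        System.out.println("NOT flushing memstore for region " + region
            + ", flushing=" + writestate.flushing
            + ", writesEnabled=" + writestate.writesEnabled);
        return false;
      }
    }
    try {
      // ... snapshot the memstore and flush the stores here ...
      return true;
    } finally {
      synchronized (writestate) {
        writestate.flushing = false;
      }
    }
  }

  // What doClose() does before a split: disable writes for this region.
  void disableWrites() {
    synchronized (writestate) {
      writestate.writesEnabled = false;
    }
  }

  public static void main(String[] args) {
    FlushGuardDemo region = new FlushGuardDemo();
    region.disableWrites();   // the split starts closing the parent
    // Every later flush request is refused until writesEnabled is restored.
    region.flushcache("ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610.");
    region.flushcache("ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610.");
  }
}

Here is the regionserver log around the split: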

2011-04-18 16:28:27,960 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610. because regionserver60020.cacheFlusher; priority=-1, compaction queue size=1
2011-04-18 16:28:30,171 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610.
2011-04-18 16:28:30,171 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610. has too many store files; delaying flush up to 90000ms
2011-04-18 16:28:32,119 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200
2011-04-18 16:28:32,285 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/linux253,60020,1303123943360/linux253%3A60020.1303124206693, entries=5226, filesize=255913736. New hlog /hbase/.logs/linux253,60020,1303123943360/linux253%3A60020.1303124311822
2011-04-18 16:28:32,287 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Found 1 hlogs to remove out of total 2; oldest outstanding sequenceid is 11037 from region 031f37c9c23fcab17797b06b90205610
2011-04-18 16:28:32,288 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: moving old hlog file /hbase/.logs/linux253,60020,1303123943360/linux253%3A60020.1303123945481 whose highest sequenceid is 6052 to /hbase/.oldlogs/linux253%3A60020.1303123945481
2011-04-18 16:28:42,701 INFO org.apache.hadoop.hbase.regionserver.Store: Completed major compaction of 4 file(s), new file=hdfs://10.18.52.108:9000/hbase/ufdr/031f37c9c23fcab17797b06b90205610/value/4398465741579485290, size=281.4m; total size for store is 468.8m
2011-04-18 16:28:42,712 INFO org.apache.hadoop.hbase.regionserver.HRegion: completed compaction on region ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610. after 1mins, 40sec
2011-04-18 16:28:42,741 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting split of region ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610.
2011-04-18 16:28:42,770 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610.: disabling compactions & flushes
2011-04-18 16:28:42,770 INFO org.apache.hadoop.hbase.regionserver.HRegion: Running close preflush of ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610.
2011-04-18 16:28:42,771 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610., current region memstore size 105.6m
2011-04-18 16:28:42,818 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished snapshotting, commencing flushing stores
2011-04-18 16:28:42,846 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610., flushing=false, writesEnabled=false
2011-04-18 16:28:42,849 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610.
2011-04-18 16:28:42,849 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610., flushing=false, writesEnabled=false
......
2011-04-18 17:04:08,803 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610., flushing=false, writesEnabled=false
2011-04-18 17:04:08,803 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610.
Mon Apr 18 17:04:24 IST 2011 Starting regionserver on linux253
ulimit -n 1024

Zhou Shuaifeng(Frank)



Re: "NOT flushing memstore for region" keep on printing for half an hour

Posted by Zhoushuaifeng <zh...@huawei.com>.
I think I may have found the problem:

1. When splitting a region, it closes the parent region and sets writestate.writesEnabled = false:

  private List<StoreFile> doClose(final boolean abort)
  throws IOException {
    synchronized (writestate) {
      // Disable compacting and flushing by background threads for this
      // region.
      writestate.writesEnabled = false;

2. If the memstore is large enough, a preflush will happen:

    if (!abort && !wasFlushing && worthPreFlushing()) {
      LOG.info("Running close preflush of " + this.getRegionNameAsString());
      internalFlushcache();
    }
    this.closing.set(true);
    lock.writeLock().lock();

3. An IOException happened, so the preflush failed and closing the parent failed:
    createSplitDir(this.parent.getFilesystem(), this.splitdir);
    this.journal.add(JournalEntry.CREATE_SPLIT_DIR);

    List<StoreFile> hstoreFilesToSplit = this.parent.close(false);
    if (hstoreFilesToSplit == null) {


4. The split rollback is called, but the split state is still "CREATE_SPLIT_DIR", so only cleanupSplitDir will happen.
    while (iterator.hasPrevious()) {
      JournalEntry je = iterator.previous();
      switch(je) {
      case CREATE_SPLIT_DIR:
        cleanupSplitDir(fs, this.splitdir);
        break;

      case CLOSED_PARENT_REGION:

5. What about writestate.writesEnabled? It stays false and nothing resets it. So even though the split is rolled back, no flush can succeed in the parent region.
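
A minimal sketch of the kind of fix I have in mind (hypothetical, not an actual patch for HBASE-3821; reenableWrites() is a made-up helper): have rollback() restore the parent's write flag no matter which journal stage was reached:

    // Hypothetical sketch only -- not a committed HBASE-3821 change.
    // At the end of SplitTransaction.rollback(), after replaying the journal:
    while (iterator.hasPrevious()) {
      JournalEntry je = iterator.previous();
      switch (je) {
      case CREATE_SPLIT_DIR:
        cleanupSplitDir(fs, this.splitdir);
        break;
      case CLOSED_PARENT_REGION:
        // existing handling that reopens the parent region
        break;
      // ... other journal stages ...
      }
    }
    // The parent may have had writes disabled by a close() that then failed
    // before CLOSED_PARENT_REGION was ever journaled, so restore the flag
    // unconditionally (reenableWrites() is a made-up helper that would just
    // set writestate.writesEnabled back to true under the writestate lock).
    this.parent.reenableWrites();

That way, even when close() fails partway through, background flushes on the parent can make progress again after the rollback.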


Zhou Shuaifeng(Frank)


-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: April 26, 2011 11:41
To: dev@hbase.apache.org
Cc: Yanlijun
Subject: Re: "NOT flushing memstore for region" keep on printing for half an hour

2011/4/25 Zhoushuaifeng <zh...@huawei.com>:
> Thanks St,
> I found that when closing the parent region to prepare for the split, a preflush occurred. But for some reason an IOException was thrown from the HDFS client, which made the preflush fail, and then the split failed.
> We have made some changes to HDFS and will check why the exception happened.
> But I don't know the details of how HBase handles the exception. When rolling back the split operation, will it reset writestate.writesEnabled to true?
> If not, writestate.writesEnabled will stay false and cause all other flush operations to fail.
>

If we fail a split, we roll back.  Usually the rollback is clean (there is a
unit test that exercises various split failures and verifies that rollback
works), but for sure there could be bugs in here.

I would be interested in the log of that regionserver.

Thanks,
St.Ack

RE: "NOT flushing memstore for region" keep on printing for half an hour

Posted by Ramkrishna S Vasudevan <ra...@huawei.com>.
Hi St,
Please find my analysis on HBASE-3821:
try {
  LOG.info("Running rollback of failed split of " +
    parent.getRegionNameAsString() + "; " + ioe.getMessage());
  st.rollback(this.server);
  LOG.info("Successful rollback of failed split of " +
    parent.getRegionNameAsString());
} catch (RuntimeException e) {
  // If failed rollback, kill this server to avoid having a hole in table.
  LOG.info("Failed rollback of failed split of " +
    parent.getRegionNameAsString() + " -- aborting server", e);
  this.server.abort("Failed split");
}
return;

Here, when we try to roll back, there is again a chance of getting an
IOException due to DFS errors. But we catch RuntimeException and not
IOException, so the exception propagates to run() in CompactSplitThread.

There again we don't handle it, so the exception that occurred during the
rollback goes unhandled.
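
A sketch of the kind of change this suggests (illustrative only, not a committed patch): catch IOException from the rollback as well and treat it like the RuntimeException case, so it never leaks out of run():

try {
  LOG.info("Running rollback of failed split of " +
    parent.getRegionNameAsString() + "; " + ioe.getMessage());
  st.rollback(this.server);
  LOG.info("Successful rollback of failed split of " +
    parent.getRegionNameAsString());
} catch (RuntimeException e) {
  // If rollback failed, kill this server to avoid leaving a hole in the table.
  LOG.info("Failed rollback of failed split of " +
    parent.getRegionNameAsString() + " -- aborting server", e);
  this.server.abort("Failed split");
} catch (IOException e) {
  // DFS errors can surface during rollback too; abort rather than letting
  // the exception propagate unhandled to CompactSplitThread.run().
  LOG.info("Failed rollback of failed split of " +
    parent.getRegionNameAsString() + " -- aborting server", e);
  this.server.abort("Failed split");
}
return;
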
Regards
Ram

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Tuesday, April 26, 2011 11:56 PM
To: dev@hbase.apache.org
Subject: Re: "NOT flushing memstore for region" keep on printing for half an hour

Nice digging Zhou.  If you have a patch, we'd love to have it.
St.Ack

On Tue, Apr 26, 2011 at 4:34 AM, Zhoushuaifeng <zh...@huawei.com> wrote:
> Hi St,
> I have filed an issue in the HBase JIRA:
> https://issues.apache.org/jira/browse/HBASE-3821
> It includes my analysis; please check it.
> If it's reasonable, I can provide a patch to fix it.
>
> Zhou Shuaifeng(Frank)
>


Re: "NOT flushing memstore for region" keep on printing for half an hour

Posted by Stack <st...@duboce.net>.
Nice digging Zhou.  If you have a patch, we'd love to have it.
St.Ack

On Tue, Apr 26, 2011 at 4:34 AM, Zhoushuaifeng <zh...@huawei.com> wrote:
> Hi St,
> I have filed an issue in the HBase JIRA:
> https://issues.apache.org/jira/browse/HBASE-3821
> It includes my analysis; please check it.
> If it's reasonable, I can provide a patch to fix it.
>
> Zhou Shuaifeng(Frank)
>

Re: "NOT flushing memstore for region" keep on printing for half an hour

Posted by Zhoushuaifeng <zh...@huawei.com>.
Hi St,
I have filed an issue in the HBase JIRA:
https://issues.apache.org/jira/browse/HBASE-3821
It includes my analysis; please check it.
If it's reasonable, I can provide a patch to fix it.

Zhou Shuaifeng(Frank)

Re: "NOT flushing memstore for region" keep on printing for half an hour

Posted by Stack <st...@duboce.net>.
2011/4/25 Zhoushuaifeng <zh...@huawei.com>:
> Thanks St,
> I found that when closing the parent region to prepare for the split, a preflush occurred. But for some reason an IOException was thrown from the HDFS client, which made the preflush fail, and then the split failed.
> We have made some changes to HDFS and will check why the exception happened.
> But I don't know the details of how HBase handles the exception. When rolling back the split operation, will it reset writestate.writesEnabled to true?
> If not, writestate.writesEnabled will stay false and cause all other flush operations to fail.
>

If we fail a split, we roll back.  Usually the rollback is clean (there is a
unit test that exercises various split failures and verifies that rollback
works), but for sure there could be bugs in here.

I would be interested in the log of that regionserver.

Thanks,
St.Ack

Re: "NOT flushing memstore for region" keep on printing for half an hour

Posted by Zhoushuaifeng <zh...@huawei.com>.
Thanks St,
I found that when closing the parent region to prepare for the split, a preflush occurred. But for some reason an IOException was thrown from the HDFS client, which made the preflush fail, and then the split failed.
We have made some changes to HDFS and will check why the exception happened.
But I don't know the details of how HBase handles the exception. When rolling back the split operation, will it reset writestate.writesEnabled to true?
If not, writestate.writesEnabled will stay false and cause all other flush operations to fail.

Zhou Shuaifeng(Frank)

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: April 26, 2011 0:10
To: dev@hbase.apache.org
Cc: Yanlijun
Subject: Re: "NOT flushing memstore for region" keep on printing for half an hour

Please fix this:

ulimit -n 1024

Read the requirements section:
http://hbase.apache.org/book/notsoquick.html#requirements.  In
particular see the ulimit/nproc section.

Regarding the below, we are in this bit of code:
http://hbase.apache.org/xref/org/apache/hadoop/hbase/regionserver/HRegion.html#869

It looks like that region is closing at the time as you say.  Maybe
there is something in here.  Please file an issue and include more of
the log and we'll dig in.

Thanks for writing the list.

St.Ack


Re: "NOT flushing memstore for region" keep on printing for half an hour

Posted by Stack <st...@duboce.net>.
Please fix this:

ulimit -n 1024

Read the requirements section:
http://hbase.apache.org/book/notsoquick.html#requirements.  In
particular see the ulimit/nproc section.

Regarding the below, we are in this bit of code:
http://hbase.apache.org/xref/org/apache/hadoop/hbase/regionserver/HRegion.html#869

It looks like that region is closing at the time as you say.  Maybe
there is something in here.  Please file an issue and include more of
the log and we'll dig in.

Thanks for writing the list.

St.Ack


On Mon, Apr 25, 2011 at 4:39 AM, Zhoushuaifeng <zh...@huawei.com> wrote:
> Hi,
> The version is HBase 0.90.1.
> "NOT flushing memstore for region" kept printing for half an hour in the regionserver, so I restarted HBase. I think there may be a deadlock or a cycle.
> I know that when splitting a region, it will doClose() the region, set writestate.writesEnabled = false, and may run a close preflush. This makes the flush fail and print "NOT flushing memstore for region", but it should finish after a while.

RE: "NOT flushing memstore for region" keep on printing for half an hour

Posted by Ramkrishna S Vasudevan <ra...@huawei.com>.
Hi,
The preflush will happen whenever the worthPreFlushing() condition is
satisfied, and it then calls internalFlushcache().

That is where this problem occurs.
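
For reference, worthPreFlushing() is roughly a size check: the close path only bothers with a preflush when the memstore is bigger than a small threshold. Paraphrased from memory, so treat the exact config name and default as assumptions:

// Roughly what the 0.90 check looks like (paraphrased; verify against the source).
private boolean worthPreFlushing() {
  // Preflush before taking the close lock only if the memstore is larger
  // than hbase.hregion.preclose.flush.size (default around 5 MB).
  return this.memstoreSize.get() >
    this.conf.getLong("hbase.hregion.preclose.flush.size", 1024 * 1024 * 5);
}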

Regards
Ram

