Posted to issues@hbase.apache.org by "Matt Kapilevich (JIRA)" <ji...@apache.org> on 2014/07/28 17:29:41 UTC

[jira] [Commented] (HBASE-10499) In write heavy scenario one of the regions does not get flushed causing RegionTooBusyException

    [ https://issues.apache.org/jira/browse/HBASE-10499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076303#comment-14076303 ] 

Matt Kapilevich commented on HBASE-10499:
-----------------------------------------

We're also hitting this issue. The RegionServer is unable to flush one of the regions. I am seeing this in the logs:

{code}
2014-07-24 14:58:42,970 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: Completed compaction: Request = regionName=users,53400000000000000000000000000000000000,1404907131059.6eae263f1f0fd698cca73964118e16b0., storeName=ids, fileCount=3, fileSize=185.0 M, priority=6, time=2961355376528391; duration=7sec
2014-07-24 14:58:44,778 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region users,87080000000000000000000000000000000000,1404907131062.2a8dd55c1011e92f4e32d705196f1e15. after a delay of 9557
2014-07-24 14:58:54,778 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region users,87080000000000000000000000000000000000,1404907131062.2a8dd55c1011e92f4e32d705196f1e15. after a delay of 9316
2014-07-24 14:59:04,778 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region users,87080000000000000000000000000000000000,1404907131062.2a8dd55c1011e92f4e32d705196f1e15. after a delay of 7454
2014-07-24 14:59:14,778 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region users,87080000000000000000000000000000000000,1404907131062.2a8dd55c1011e92f4e32d705196f1e15. after a delay of 11197
2014-07-24 14:59:24,778 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region users,87080000000000000000000000000000000000,1404907131062.2a8dd55c1011e92f4e32d705196f1e15. after a delay of 19777
2014-07-24 14:59:34,778 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region users,87080000000000000000000000000000000000,1404907131062.2a8dd55c1011e92f4e32d705196f1e15. after a delay of 20814
2014-07-24 14:59:44,779 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region users,87080000000000000000000000000000000000,1404907131062.2a8dd55c1011e92f4e32d705196f1e15. after a delay of 22542
2014-07-24 14:59:54,779 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region users,87080000000000000000000000000000000000,1404907131062.2a8dd55c1011e92f4e32d705196f1e15. after a delay of 16096
2014-07-24 15:00:04,779 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region users,87080000000000000000000000000000000000,1404907131062.2a8dd55c1011e92f4e32d705196f1e15. after a delay of 21794
2014-07-24 15:00:14,779 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region users,87080000000000000000000000000000000000,1404907131062.2a8dd55c1011e92f4e32d705196f1e15. after a delay of 20873
{code}

The unfortunate thing here is that the RegionServer never recovers, even after we stop doing Puts.

We're on 0.96.1.1-cdh5.0.2.

Are there any workarounds for this? Would setting hbase.hstore.flusher.count to 2 help? (I'm running with the default of 1.)
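
For reference, what I have in mind is something like this in hbase-site.xml on each RegionServer (only a guess at a mitigation based on the documented property, not a verified fix for this bug):

{code}
<!-- hbase-site.xml on each RegionServer: raise the number of memstore flush
     threads from the default of 1. A guess at a mitigation, not a verified fix. -->
<property>
  <name>hbase.hstore.flusher.count</name>
  <value>2</value>
</property>
{code}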

> In write heavy scenario one of the regions does not get flushed causing RegionTooBusyException
> ----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-10499
>                 URL: https://issues.apache.org/jira/browse/HBASE-10499
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Critical
>             Fix For: 0.99.0
>
>         Attachments: HBASE-10499.patch, hbase-root-master-ip-10-157-0-229.zip, hbase-root-regionserver-ip-10-93-128-92.zip, master_4e39.log, master_576f.log, rs_4e39.log, rs_576f.log, t1.dump, t2.dump, workloada_0.98.dat
>
>
> I got this while testing the 0.98 RC, but I am not sure if it is specific to this version. It doesn't seem so to me.
> It is also similar to HBASE-5312 and HBASE-5568.
> Using 10 threads I do writes to 4 RSs using YCSB. The table created has 200 regions. In one of the runs with a 0.98 server and 0.98 client I hit this problem: the number of hlogs kept growing and the system requested flushes for many regions.
> One by one everything was flushed except one region, which remained unflushed. The ripple effect of this on the client side:
> {code}
> com.yahoo.ycsb.DBException: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 54 actions: RegionTooBusyException: 54 times,
>         at com.yahoo.ycsb.db.HBaseClient.cleanup(HBaseClient.java:245)
>         at com.yahoo.ycsb.DBWrapper.cleanup(DBWrapper.java:73)
>         at com.yahoo.ycsb.ClientThread.run(Client.java:307)
> Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 54 actions: RegionTooBusyException: 54 times,
>         at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:187)
>         at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$500(AsyncProcess.java:171)
>         at org.apache.hadoop.hbase.client.AsyncProcess.getErrors(AsyncProcess.java:897)
>         at org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:961)
>         at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1225)
>         at com.yahoo.ycsb.db.HBaseClient.cleanup(HBaseClient.java:232)
>         ... 2 more
> {code}
> On one of the RSs:
> {code}
> 2014-02-11 08:45:58,714 INFO  [regionserver60020.logRoller] wal.FSHLog: Too many hlogs: logs=38, maxlogs=32; forcing flush of 23 regions(s): 97d8ae2f78910cc5ded5fbb1ddad8492, d396b8a1da05c871edcb68a15608fdf2, 01a68742a1be3a9705d574ad68fec1d7, 1250381046301e7465b6cf398759378e, 127c133f47d0419bd5ab66675aff76d4, 9f01c5d25ddc6675f750968873721253, 29c055b5690839c2fa357cd8e871741e, ca4e33e3eb0d5f8314ff9a870fc43463, acfc6ae756e193b58d956cb71ccf0aa3, 187ea304069bc2a3c825bc10a59c7e84, 0ea411edc32d5c924d04bf126fa52d1e, e2f9331fc7208b1b230a24045f3c869e, d9309ca864055eddf766a330352efc7a, 1a71bdf457288d449050141b5ff00c69, 0ba9089db28e977f86a27f90bbab9717, fdbb3242d3b673bbe4790a47bc30576f, bbadaa1f0e62d8a8650080b824187850, b1a5de30d8603bd5d9022e09c574501b, cc6a9fabe44347ed65e7c325faa72030, 313b17dbff2497f5041b57fe13fa651e, 6b788c498503ddd3e1433a4cd3fb4e39, 3d71274fe4f815882e9626e1cfa050d1, acc43e4b42c1a041078774f4f20a3ff5
> ......................................................
> 2014-02-11 08:47:49,580 INFO  [regionserver60020.logRoller] wal.FSHLog: Too many hlogs: logs=53, maxlogs=32; forcing flush of 2 regions(s): fdbb3242d3b673bbe4790a47bc30576f, 6b788c498503ddd3e1433a4cd3fb4e39
> {code}
> {code}
> 2014-02-11 09:42:44,237 INFO  [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region usertable,user3654,1392107806977.fdbb3242d3b673bbe4790a47bc30576f. after a delay of 16689
> 2014-02-11 09:42:44,237 INFO  [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region usertable,user6264,1392107806983.6b788c498503ddd3e1433a4cd3fb4e39. after a delay of 15868
> 2014-02-11 09:42:54,238 INFO  [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region usertable,user3654,1392107806977.fdbb3242d3b673bbe4790a47bc30576f. after a delay of 20847
> 2014-02-11 09:42:54,238 INFO  [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region usertable,user6264,1392107806983.6b788c498503ddd3e1433a4cd3fb4e39. after a delay of 20099
> 2014-02-11 09:43:04,238 INFO  [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region usertable,user3654,1392107806977.fdbb3242d3b673bbe4790a47bc30576f. after a delay of 8677
> {code}
> {code}
> 2014-02-11 10:31:21,020 INFO  [regionserver60020.logRoller] wal.FSHLog: Too many hlogs: logs=54, maxlogs=32; forcing flush of 1 regions(s): fdbb3242d3b673bbe4790a47bc30576f
> {code}
> I restarted another RS and there were movements of other regions, but this region stayed with the RS that has this issue.  One important observation is that in HRegion.internalFlushcache() we should add a debug log here:
> {code}
>     // If nothing to flush, return and avoid logging start/stop flush.
>     if (this.memstoreSize.get() <= 0) {
>       return false;
>     }
> {code}
> Because we can see that the region is requested for a flush but the flush does not happen, and no flush-related messages are printed in the logs. So for some reason this memstoreSize has become 0 (I assume this).  The earlier bugs were also due to a similar reason.
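> To make this concrete, here is a rough sketch of the kind of debug log meant here (the field and logger names are assumed from the snippet above, not taken from an actual patch):
> {code}
>     // Sketch only: log why the flush is being skipped instead of returning silently.
>     if (this.memstoreSize.get() <= 0) {
>       if (LOG.isDebugEnabled()) {
>         LOG.debug("Skipping flush for " + this
>             + " because memstoreSize=" + this.memstoreSize.get());
>       }
>       return false;
>     }
> {code}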



--
This message was sent by Atlassian JIRA
(v6.2#6252)