Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2010/04/03 07:33:27 UTC
[jira] Commented: (HBASE-2375) Make decision to split based on aggregate size of all StoreFiles and revisit related config params

    [ https://issues.apache.org/jira/browse/HBASE-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853081#action_12853081 ] 

stack commented on HBASE-2375:
------------------------------

In 'testForceSplitMultiFamily', do you intend to fill two families or just the second family?  Is the bug that we don't split if the second family is over the split limit?

In Store#compact, where we do ' if(forceSplit) {   '... is this ok if there are multiple families?  With multiple families, one of the families will have the best midkey... is that what we split on, or is it just the first?  (I see later that MemStoreFlusher may have the same problem, in that it will split on the first family that meets the criteria.)  Maybe this is ok for now.  Perhaps file an improvement issue to improve on this behavior.
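
To make the multi-family question concrete, here is a minimal sketch of the two choices. It is not code from the patch: FamilyInfo and both method names are hypothetical, and "best midkey" is not defined in this thread, so the sketch assumes it means the midkey of the largest family that is over the split limit.

```java
import java.util.Arrays;
import java.util.List;

class MidKeySketch {
    /** Hypothetical per-family summary; this is not an HBase class. */
    static final class FamilyInfo {
        final String family;
        final long aggregateSize; // total bytes across the family's StoreFiles
        final String midKey;      // midkey this family would propose, or null

        FamilyInfo(String family, long aggregateSize, String midKey) {
            this.family = family;
            this.aggregateSize = aggregateSize;
            this.midKey = midKey;
        }
    }

    /** Behavior asked about above: take the first family that meets the criteria. */
    static String firstQualifyingMidKey(List<FamilyInfo> families, long splitLimit) {
        for (FamilyInfo f : families) {
            if (f.aggregateSize >= splitLimit && f.midKey != null) {
                return f.midKey;
            }
        }
        return null; // no family is over the limit
    }

    /** Assumed alternative: take the midkey of the largest qualifying family. */
    static String largestFamilyMidKey(List<FamilyInfo> families, long splitLimit) {
        FamilyInfo best = null;
        for (FamilyInfo f : families) {
            if (f.aggregateSize < splitLimit || f.midKey == null) {
                continue;
            }
            if (best == null || f.aggregateSize > best.aggregateSize) {
                best = f;
            }
        }
        return best == null ? null : best.midKey;
    }

    public static void main(String[] args) {
        List<FamilyInfo> families = Arrays.asList(
            new FamilyInfo("a", 300L, "row-m"),
            new FamilyInfo("b", 900L, "row-q"));
        System.out.println(firstQualifyingMidKey(families, 256L));
        System.out.println(largestFamilyMidKey(families, 256L));
    }
}
```

When both families are over the limit, the two methods can return different keys, which is the case the question above is about.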

I do not follow this bit:

{code}
+      // Do not trigger any splits otherwise, so always return references=true
+      // which will prevent splitting.
+       
       if (!fs.exists(this.regionCompactionDir) &&
           !fs.mkdirs(this.regionCompactionDir)) {
         LOG.warn("Mkdir on " + this.regionCompactionDir.toString() + " failed");
-        return checkSplit(forceSplit);
+        return DO_NOT_SPLIT;
       }
{code}

If we failed to make the compaction dir, we used to check for a split; now we don't split.  What's up here?

Same later in the file at #238 in patch and at #247.

Should we change this data member name and the configuration that feeds it, or at least its description, to explain that we are now checking the size of the Store rather than the maximum file size?

Thinking about it, we'll now split more often because we split sooner (see your first narrative above, Jon): there is no need to wait for a compaction to finish before we split, and that takes time.  Also, a split will make more references than in the past, because in the past we'd usually split one file after a big compaction.  Now we split before compaction, so the 3 or 4 files in the family will all be split using References.  I'm not sure if this will make any difference.  Just stating what will happen.

This change in default needs to make it out to hbase-default.xml: 

484 +        conf.getInt("hbase.hstore.compactionThreshold", 5);

Otherwise, patch looks good.  Did you try it?





> Make decision to split based on aggregate size of all StoreFiles and revisit related config params
> --------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2375
>                 URL: https://issues.apache.org/jira/browse/HBASE-2375
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: regionserver
>    Affects Versions: 0.20.3
>            Reporter: Jonathan Gray
>            Priority: Critical
>             Fix For: 0.20.4, 0.21.0
>
>         Attachments: HBASE-2375-v8.patch
>
>
> Currently we will make the decision to split a region when a single StoreFile in a single family exceeds the maximum region size.  This issue is about changing the decision to split to be based on the aggregate size of all StoreFiles in a single family (but still not aggregating across families).  This would move a check to split after flushes rather than after compactions.  This issue should also deal with revisiting our default values for some related configuration parameters.
> The motivating factor for this change comes from watching the behavior of RegionServers during heavy write scenarios.
> Today the default behavior goes like this:
> - We fill up regions, and as long as you are not under global RS heap pressure, you will write out 64MB (hbase.hregion.memstore.flush.size) StoreFiles.
> - After we get 3 StoreFiles (hbase.hstore.compactionThreshold) we trigger a compaction on this region.
> - Compaction queues notwithstanding, this will create a 192MB file, not triggering a split based on max region size (hbase.hregion.max.filesize).
> - You'll then flush two more 64MB MemStores and hit the compactionThreshold and trigger a compaction.
> - You end up with 192 + 64 + 64 in a single compaction.  This will create a single 320MB file and will trigger a split.
> - While you are performing the compaction (which now writes out 64MB more than the split size, so is about 5X slower than the time it takes to do a single flush), you are still taking on additional writes into MemStore.
> - Compaction finishes, decision to split is made, region is closed.  The region now has to flush whichever edits made it to MemStore while the compaction ran.  This flushing, in our tests, is by far the dominating factor in how long data is unavailable during a split.  We measured about 1 second to do the region closing, master assignment, reopening.  Flushing could take 5-6 seconds, during which time the region is unavailable.
> - The daughter regions re-open on the same RS.  Immediately when the StoreFiles are opened, a compaction is triggered across all of their StoreFiles because they contain references.  Since we cannot currently split a split, we need to not hang on to these references for long.
> This described behavior is really bad because of how often we have to rewrite data onto HDFS.  Imports are usually just IO bound as the RS waits to flush and compact.  In the above example, the first cell to be inserted into this region ends up being written to HDFS 4 times (initial flush, first compaction w/ no split decision, second compaction w/ split decision, third compaction on daughter region).  In addition, we leave a large window where we take on edits (during the second compaction of 320MB) and then must make the region unavailable as we flush it.
> If we increased the compactionThreshold to be 5 and determined splits based on aggregate size, the behavior becomes:
> - We fill up regions, and as long as you are not under global RS heap pressure, you will write out 64MB (hbase.hregion.memstore.flush.size) StoreFiles.
> - After each MemStore flush, we calculate the aggregate size of all StoreFiles.  We can also check the compactionThreshold.  For the first three flushes, neither would hit its limit.  On the fourth flush, we would see total aggregate size = 256MB and decide to split.
> - Decision to split is made, region is closed.  This time, the region just has to flush out whichever edits made it to the MemStore during the snapshot/flush of the previous MemStore.  So this time window has shrunk by more than 75% as it was the time to write 64MB from memory not 320MB from aggregating 5 hdfs files.  This will greatly reduce the time data is unavailable during splits.
> - The daughter regions re-open on the same RS.  Immediately when the StoreFiles are opened, a compaction is triggered across all of their StoreFiles because they contain references.  This would stay the same.
> In this example, we only write a given cell twice (instead of 4 times), on the original flush and post-split to remove references, while drastically reducing data unavailability during splits.  The other benefit of post-split compaction (which doesn't change) is that we then get good data locality as the resulting StoreFile will be written to the local DataNode.  In another jira, we should deal with opening up one of the daughter regions on a different RS to distribute load better, but that's outside the scope of this one.
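
The arithmetic in the description above can be checked with a small simulation. This is a sketch under stated assumptions, not HBase code: the class and method names are hypothetical, the 64MB flush size comes from the description, the 256MB split limit is inferred from "64MB more than the split size" and "aggregate size = 256MB", and the exact comparisons (greater-than for the single-file check, greater-or-equal for the aggregate check) are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class SplitPolicySketch {
    static final long MB = 1024L * 1024L;
    static final long FLUSH_SIZE = 64 * MB;    // hbase.hregion.memstore.flush.size
    static final long MAX_FILESIZE = 256 * MB; // hbase.hregion.max.filesize (assumed value)

    /**
     * Current behavior as described: compact once compactionThreshold files
     * exist, then split only if the single compacted file exceeds the limit.
     * Returns the number of flushes before the split decision.
     */
    static int flushesUntilSplitSingleFile(int compactionThreshold) {
        List<Long> files = new ArrayList<>();
        int flushes = 0;
        while (true) {
            files.add(FLUSH_SIZE);
            flushes++;
            if (files.size() >= compactionThreshold) {
                long merged = 0;
                for (long size : files) {
                    merged += size;
                }
                files.clear();
                files.add(merged); // compaction rewrites everything into one file
                if (merged > MAX_FILESIZE) {
                    return flushes;
                }
            }
        }
    }

    /**
     * Proposed behavior: after every flush, split as soon as the aggregate
     * size of the family's StoreFiles reaches the limit. No compaction runs
     * first, provided the compaction threshold is not hit earlier.
     */
    static int flushesUntilSplitAggregate() {
        long total = 0;
        int flushes = 0;
        while (true) {
            total += FLUSH_SIZE;
            flushes++;
            if (total >= MAX_FILESIZE) {
                return flushes;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(flushesUntilSplitSingleFile(3)); // prints 5
        System.out.println(flushesUntilSplitAggregate());   // prints 4
    }
}
```

With a compactionThreshold of 3, the single-file check splits after the fifth flush and two compactions (192MB, then 320MB). The aggregate check splits after the fourth flush, before a threshold of 5 would trigger any compaction.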
