Posted to issues@hbase.apache.org by "Sean Busbey (JIRA)" <ji...@apache.org> on 2019/08/16 20:11:00 UTC

[jira] [Commented] (HBASE-22749) Distributed MOB compactions

    [ https://issues.apache.org/jira/browse/HBASE-22749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909365#comment-16909365 ] 

Sean Busbey commented on HBASE-22749:
-------------------------------------

h2. region sizing - splitting, normalizers, etc

Need to expressly state whether or not this change to per-region accounting will alter the current assumption that, when the feature is in use, MOB data is not counted when determining region size for decisions to normalize or split.

h2. write amplification

The current description of the unified compactor’s handling of MOB data doesn’t include anything about the kind of mob file partitioning that was previously done. I think this will behave a lot like the example from Section 3.1.1 in the MOB Design v5 document you reference, specifically the case where MOB data is segregated in a dedicated CF. We still end up with unbounded write amplification.

Consider this use case, which I think is in line with the assumptions laid out in both your description and in the MOB Design v5 document:

* Table with 50k regions
* MOB values that are 300KiB
* No updates, no deletes
* periodic flushes set to 6 hours
* periodic major and mob compaction set to weekly
* infrequent writes (slow enough so that only periodic flushes happen, but enough that all regions have a mob write)

Under the current MOB implementation with a monthly partition policy, I can reason that:

* we’ll be generating 200k hfiles in the mob directory per day due to periodic flushes 
* at the end of the first week we’ll have 1.4m new hfiles, which we’ll probably compact into a low-double-digit number of hfiles
* at the end of the second week, we’ll have 1.4m new hfiles plus the results of the first compaction, which we’ll again probably compact into a low-double-digit number of hfiles
* at the end of the third week, same thing again
* at the end of the fourth week, same thing again
* after that fourth week things repeat, but the files generated will be in a new partition, so anything from the prior partition won’t be rewritten again.

In the steady state:

* we should have a number of hfiles that stays under the limits of HDFS
* for a given MOB value, we should write it to HDFS no more than 5 times (flush + between 1 and 4 compactions)

So that means we have a write amplification of ~5x regardless of splits or merges from normalization.
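
For reference, a back-of-envelope check of those numbers (plain shell arithmetic; nothing here beyond the assumptions already listed above):

{code}
# 50k regions, periodic flush every 6 hours => 4 flushes/day, 1 mob hfile per region per flush
echo $(( 50000 * 4 ))        # mob hfiles per day      = 200000
echo $(( 50000 * 4 * 7 ))    # new mob hfiles per week = 1400000
# worst case writes per MOB value within a monthly partition:
# 1 flush + up to 4 weekly compactions = 5  =>  ~5x write amplification
{code}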

For the new design I don’t think there’s currently any bound. If I use the default compaction strategy:

* We’ll still be generating 200k hfiles per day
* at the end of the first week we’ll have 1.4m new hfiles, which we’ll compact to 50k hfiles.
* at the end of the second week we’ll have 1.4m new hfiles plus the existing 50k hfiles, and we’ll compact to 50k hfiles
* at the end of the third week, same thing
* at the end of the fourth week, same thing
* this will repeat until each of the 50k files hits 10 GiB - 20 GiB depending on configs (~35-70k cells)

At the extreme of exactly 1 mob value per region per periodic flush, that would mean 1-2 thousand weeks. Splits over that time period would probably mean we keep rewriting indefinitely. So the amount of amplification is essentially going to be driven by the periodicity of the mob compaction chore.
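
Roughly where the 1-2 thousand weeks comes from, assuming exactly 1 mob value of 300KiB per region per periodic flush (4 flushes/day):

{code}
# cells needed before a region's mob data reaches the 10 GiB - 20 GiB ceiling
echo $(( 10 * 1024 * 1024 / 300 ))         # ~35k cells at 10 GiB
echo $(( 20 * 1024 * 1024 / 300 ))         # ~70k cells at 20 GiB
# cells added per region per week: 4 flushes/day * 7 days = 28
echo $(( 10 * 1024 * 1024 / 300 / 28 ))    # ~1250 weeks to reach 10 GiB
echo $(( 20 * 1024 * 1024 / 300 / 28 ))    # ~2500 weeks to reach 20 GiB
{code}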

With default configs we can get memstores holding ~1GiB of MOB values that still only flush periodically, so this can remain a non-trivial amount of load on HDFS.

If we enable partial major mob compaction we’d avoid writing the values repeatedly, but we’d run up against HDFS limitations in ~10 days.
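
(Presumably the ~10 days is just the per-flush mob files accumulating uncompacted against per-directory and NameNode object limits, e.g.:)

{code}
echo $(( 200000 * 10 ))    # ~2 million mob hfiles accumulated after 10 days
{code}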

h2. MOB compaction request chore and Partial major mob compaction

It’s a bit confusing going through the “Partial major MOB compaction” section where it currently sits in the write-up. As I understand things, you’re essentially describing a strategy for the process that has to pick particular regions to issue major compaction requests against, instead of just requesting that the whole table be compacted. Since this is an optimization of cluster IO use that’s possible _once we have per-region accounting and maintenance of MOB data_, I think it’d be clearer if it came in a section _after_ you describe the “scalable MOB compactions” stuff.

Instead of starting that section off with the description of the compaction request chore, you could explain the accounting changes to store maintenance and the resulting changes to cleaning, and then end with the explanation of how folks won’t have to schedule maintenance tasks themselves, in a section labeled as the description of the “MOB Compaction Request Chore” that includes the description of the region prioritization strategy. Another good strategy to mention there is prioritizing the store files we know haven’t been converted to include the accounting information the cleaner needs.

h2. split tracking for the above

Could we do this with entries in hbase:meta or a journal instead of individual files? It’s going to get very messy when there are tables with tens-of-thousands or hundreds-of-thousands of regions.
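
Purely for illustration, per-region mob accounting state could ride along in hbase:meta next to the existing info columns and be inspected with the shell; any mob accounting column/qualifier would be a new, made-up addition, not something that exists today:

{code}
# hypothetical: look at what hbase:meta already tracks per region for table_1
echo "scan 'hbase:meta', {STARTROW => 'table_1,', LIMIT => 5}" | hbase shell -n
{code}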

h2. metrics needed

Some metrics that would be useful while reading the design:

* compaction request chore, especially when using partial major mob compaction: time to evaluate, number of regions with/without siblings, number of regions selected
* mob cleaner chore: time to evaluate, # mob files referenced, # mob files not referenced, # old store files, # skipped store files (e.g. from bulk load)
* unified compact/flush: number of new reference files / number of passed-through reference files / number of no-longer-needed references (not cells, specifically to have an idea of the size of the reference metadata we’re keeping around)
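
Whatever names get picked, it’d be good if they surface through the usual metrics endpoints so they’re easy to scrape; e.g. something like the following (the grep is just a placeholder, since none of these metric names exist yet):

{code}
# check region server metrics via the standard /jmx servlet (default info port 16030);
# none of the proposed mob chore/compaction metrics exist yet, so the names are TBD
curl -s 'http://regionserver.example.com:16030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=Server' | grep -i mob
{code}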

h2. performance considerations section

Include a description of tests to run to compare before and after performance, e.g. using the Load Test Tool to write X records on single-node, 5-node, and 10-node clusters with values of ~100KiB, ~1MiB, and ~10MiB:

{code}
hbase ltt -mob_threshold 102400 -generator \
    org.apache.hadoop.hbase.util.LoadTestDataGeneratorWithMOB:example_mob:102400:104857 \
    -num_keys 10000 -write 3:1024 -tn table_1 -families plain,example_mob
{code}
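
Assuming the generator arguments keep the same layout as above (mob column family : min mob value size : max mob value size), the ~1MiB case would just scale those sizes; a sketch, with sizes and table name as placeholders:

{code}
hbase ltt -mob_threshold 102400 -generator \
    org.apache.hadoop.hbase.util.LoadTestDataGeneratorWithMOB:example_mob:1048576:1073741 \
    -num_keys 10000 -write 3:1024 -tn table_2 -families plain,example_mob
{code}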



h2. MOB 2.0 Compaction for small MOB (below 50KiB) - remove this section, since we said it is a non-goal

h2. MOB 2.0 functional testing (stress tool / fault injections)

* Should be framed in terms of using ltt to do the data load/read, e.g. something like the sample command above.
* What about Chaos Monkey for HDFS failures / RS failures / Master failures? If it’s not necessary, include the reasoning why not. (Possible invocation sketched below.)
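
If Chaos Monkey is in scope, the ingest-based integration test already has a MOB flavor that can be pointed at a monkey policy, something along these lines (class and monkey names as they exist in hbase-it today):

{code}
hbase org.apache.hadoop.hbase.IntegrationTestIngestWithMOB -m slowDeterministic
{code}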


h2. tools updates

* Don’t snapshots need to change? Intuitively I would expect us to need to walk the hfiles from the snapshot to get the list of mob files to include, rather than just including everything in the mobdir?
* Similarly, doesn’t incremental backup/restore need to adapt?
* Prior to this refactoring I could use the offline compaction tool to offload my non-mob data compaction and the external mob compaction tool to offload my mob data. What about updating the offline compaction tool to optionally handle mob data, if needed?
* The HFile Pretty Printer tool needs some updates. It should include something about the mob reference metadata (like a count in normal mode and a listing in verbose mode, maybe?). The mob integrity check option needs to include validating that all the references found in cells are also present in the file metadata (example invocation below).
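
For the pretty printer piece, I’m thinking of the existing invocation below ({{-m}} prints file metadata, {{-v}} for verbose); the mob reference count/listing discussed above would be new output on top of that, and the path is just a placeholder:

{code}
hbase hfile -m -v -f hdfs:///hbase/data/default/table_1/<region>/<family>/<store-file>
{code}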



> Distributed MOB compactions 
> ----------------------------
>
>                 Key: HBASE-22749
>                 URL: https://issues.apache.org/jira/browse/HBASE-22749
>             Project: HBase
>          Issue Type: New Feature
>          Components: mob
>            Reporter: Vladimir Rodionov
>            Assignee: Vladimir Rodionov
>            Priority: Major
>         Attachments: HBase-MOB-2.0-v1.pdf, HBase-MOB-2.0-v2.1.pdf, HBase-MOB-2.0-v2.pdf
>
>
> There are several  drawbacks in the original MOB 1.0  (Moderate Object Storage) implementation, which can limit the adoption of the MOB feature:  
> # MOB compactions are executed in a Master as a chore, which limits scalability because all I/O goes through a single HBase Master server. 
> # Yarn/Mapreduce framework is required to run MOB compactions in a scalable way, but this won’t work in a stand-alone HBase cluster.
> # Two separate compactors for MOB and for regular store files and their interactions can result in data loss (see HBASE-22075)
> The design goal for MOB 2.0 was to provide a 100% MOB 1.0-compatible implementation which is free of the above drawbacks and can be used as a drop-in replacement in existing MOB deployments. So, these are the design goals of MOB 2.0:
> # Make MOB compactions scalable without relying on Yarn/Mapreduce framework
> # Provide unified compactor for both MOB and regular store files
> # Make it more robust, especially w.r.t. data loss.
> # Simplify and reduce the overall MOB code.
> # Provide 100% compatible implementation with MOB 1.0.
> # No migration of data should be required between MOB 1.0 and MOB 2.0 - just software upgrade.


