Posted to dev@hbase.apache.org by Mikael Sitruk <mi...@gmail.com> on 2012/02/19 23:05:04 UTC

Re: Major Compaction Concerns

A follow-up...
1. My CFs were already configured with BF; they used ROWCOL (I didn't pay
attention to that at the time I wrote my answers).
2. I see from the logs that the BF is already at 100% - is that bad? Should I
add more memory for the BF?
3. HLog compression (HBASE-4608) is not scheduled yet - is that intentional?
4. Compaction.ratio is only in the 0.92.x releases, so I cannot use it yet.
5. All the other patches are also for 0.92/0.94, so my situation will not
improve until then, besides playing with the log rolling size and the max
number of store files.
6. I have also noticed that in a workload of pure inserts (no reads, empty
regions, new keys) the store files on the RS can reach more than 4500
files; nevertheless, in an update/read scenario the store files did not
exceed 1500 files per region (flush throttling was active there but not in
the insert case). Is there an explanation for that?
7. I also have a fresh 0.92 install and am checking the behavior there
(additional results soon, hopefully).

Mikael.S


On Sat, Jan 14, 2012 at 11:30 PM, Mikael Sitruk <mi...@gmail.com> wrote:

> Wow, thank you very much for all those precious explanations, pointers, and
> examples. It's a lot to ingest... I will try them (at least what I can with
> 0.90.4 - yes, I'm upgrading from 0.90.1 to 0.90.4) and keep you informed.
> BTW, I'm already using compression (GZ); the current data is randomized, so
> I don't get as much gain as you mentioned (I think I'm around 30% only).
> It seems that BF is one of the major things I need to look at, along with
> the compaction.ratio, and I need different settings for my different CFs
> (one CF has a small set of columns and each update changes 50% of them -->
> ROWCOL; the second CF always gets a new column per update --> ROW).
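>
> As a sketch, if I understand the shell correctly, switching the existing
> CFs could look like this (table/CF names are placeholders; I believe 0.90
> requires disabling the table first):
>
>   hbase> disable 'mytable'
>   hbase> alter 'mytable', {NAME => 'small_cf', BLOOMFILTER => 'ROWCOL'}
>   hbase> alter 'mytable', {NAME => 'wide_cf', BLOOMFILTER => 'ROW'}
>   hbase> enable 'mytable'
>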
> I'm not keeping more than one version either, and you wrote that this is
> not a point query.
>
> A suggestion: perhaps take all those examples/explanations and add them
> to the book for future reference.
>
> Regards,
> Mikael.S
>
>
> On Sat, Jan 14, 2012 at 4:06 AM, Nicolas Spiegelberg <ns...@fb.com> wrote:
>
>> >I'm sorry, but I don't understand; of course I have disk and network
>> >saturation, and the flush stops flushing because it is waiting for the
>> >compaction to finish. Since a major compaction was triggered, all the
>> >stores (a large number) present on the disks (7 disks per RS) will be
>> >grabbed for major compaction, and I/O is affected. The network is also
>> >affected, since all RSs are major compacting at the same time and
>> >replicating files at the same time (1GB network).
>>
>> When you have an IO problem, there are multiple pieces at play that you
>> can adjust:
>>
>> Write: HLog, Flush, Compaction
>> Read: Point Query, Scan
>>
>> If your writes are far more than your reads, then you should relax one of
>> the write pieces.
>> - HLog: You can't really adjust HLog IO outside of key compression
>> (HBASE-4608)
>> - Flush: You can adjust your compression.  None->LZO == 5x compression.
>> LZO->GZ == 2x compression.  Both are at the expense of CPU.  HBASE-4241
>> minimizes flush IO significantly in the update-heavy use case (discussed
>> this in the last email).
>> - Compaction: You can lower the compaction ratio to minimize the amount of
>> rewrites over time.  That's why I suggested changing the ratio from 1.2 ->
>> 0.25.  This gives a ~50% IO reduction (blog post on this forthcoming @
>> http://www.facebook.com/UsingHBase ).
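>>
>> For illustration, the ratio is set server-side in hbase-site.xml; a
>> minimal sketch (0.25 mirrors the suggestion above, and the right value
>> depends on your workload):
>>
>>   <property>
>>     <name>hbase.hstore.compaction.ratio</name>
>>     <value>0.25</value>
>>   </property>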
>>
>> However, you may have a lot more reads than you think.  For example, let's
>> say your read:write ratio is 1:10, so significantly write dominated.  Without
>> any of the optimizations I listed in the previous email, your real read
>> ratio is multiplied by the StoreFile count (because you naively read all
>> StoreFiles).  So let's say, during congestion, you have 20 StoreFiles.
>> 1*20:10 means that you're now 2:1 read dominated.  You need features to
>> reduce the number of StoreFiles you scan when the StoreFile count is high.
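>>
>> Spelled out with the numbers above:
>>
>>   effective read:write = (reads x StoreFiles) : writes
>>                        = (1 x 20) : 10
>>                        = 2 : 1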
>>
>> - Point Query: bloom filters (HBASE-1200, HBASE-2794), lazy seek
>> (HBASE-4465), and seek optimizations (HBASE-4433, HBASE-4434, HBASE-4469,
>> HBASE-4532)
>> - Scan: not as many optimizations here.  They mostly revolve around proper
>> usage & seek-next optimization when using filters.  I don't have the JIRA
>> numbers here, but probably a half-dozen small tweaks were added to 0.92.
>>
>> >I don't have an increment workload (the workload either updates columns
>> >on a CF or adds a column on a CF for the same key), so how will those
>> >patches help?
>>
>> Increment & read->update workloads end up picking up roughly the same
>> optimizations.  Adding a column to an existing row is no different from
>> adding a new row as far as optimizations are concerned, because there's
>> nothing to de-dupe.
>>
>> >I'm not saying this is a bad thing; it's just an observation from our
>> >test: HBase will slow down the flush when too many store files are
>> >present, which adds pressure on GC and memory, affecting performance.
>> >The update workload does not send all the row content for a certain key,
>> >so only partial data is written. In order to get the whole row, I presume
>> >that reading the newest Store is not enough ("all" stores need to be
>> >read, collecting the most up-to-date fields to rebuild a full row), or am
>> >I missing something?
>>
>> Reading all row columns is the same as doing a scan.  You're not doing a
>> point query if you don't specify the exact key (columns) you're looking
>> for.  Setting versions to unlimited, then getting all versions of a
>> particular ROW+COL would also be considered a scan vs a point query as far
>> as optimizations are concerned.
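>>
>> As a rough sketch in the Java client API (table/row/column names are
>> placeholders), the difference looks like this:
>>
>>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>   import org.apache.hadoop.hbase.client.Get;
>>   import org.apache.hadoop.hbase.client.HTable;
>>   import org.apache.hadoop.hbase.client.Result;
>>   import org.apache.hadoop.hbase.util.Bytes;
>>
>>   public class PointVsRowGet {
>>     public static void main(String[] args) throws Exception {
>>       HTable table = new HTable(HBaseConfiguration.create(), "mytable");
>>       // Point query: exact row + column, so a ROWCOL bloom can rule out
>>       // StoreFiles that don't contain this ROW+COL.
>>       Get point = new Get(Bytes.toBytes("row1"));
>>       point.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"));
>>       Result pointResult = table.get(point);
>>       // Whole-row read: no column specified, so it is treated like a scan
>>       // for optimization purposes; ROWCOL blooms can't filter for it.
>>       Get wholeRow = new Get(Bytes.toBytes("row1"));
>>       Result rowResult = table.get(wholeRow);
>>     }
>>   }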
>>
>> >1. If I did not set a specific property for bloom filters (BF), does
>> >that mean I'm not using them (the book only refers to BF with regard to
>> >CFs)?
>>
>> By default, bloom filters are disabled, so you need to enable them to get
>> the optimizations.  This is by design.  Bloom Filters trade off cache
>> space for low-overhead probabilistic queries.  The default is 8 bytes per
>> bloom entry (key) & a 1% false positive rate.  You can use 'bin/hbase
>> org.apache.hadoop.hbase.io.hfile.HFile' (look at help, then -f to specify
>> a StoreFile and then use -m for meta info) to see your StoreFile's average
>> KV size.  If size(KV) == 100 bytes, then blooms use 8% of the space in
>> cache, which is better than loading the StoreFile block only to get a
>> miss.
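>>
>> Concretely, that invocation looks like this (the StoreFile path is a
>> placeholder for a real HDFS path):
>>
>>   $ bin/hbase org.apache.hadoop.hbase.io.hfile.HFile -f <path-to-storefile> -m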
>>
>> Whether to use a ROW or ROWCOL bloom filter depends on your write & read
>> pattern.  If you read the entire row at a time, use a ROW bloom.  If you
>> point query, ROW or ROWCOL are both options.  If you write all columns for
>> a row at the same time, definitely use a ROW bloom.  If you have a small
>> column range and you update them at different rates/times, then a ROWCOL
>> bloom filter may be more helpful.  ROWCOL is really useful if a scan query
>> for a ROW will normally return results, but a point query for a ROWCOL may
>> have a high miss rate.  A perfect example is storing unique hash-values
>> for a user on disk.  You'd use 'user' as the row & the hash as the column.
>> In most instances, the hash won't be a duplicate, so a ROWCOL bloom would
>> be better.
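>>
>> A minimal sketch of that user/hash example via the Java admin API (class,
>> table, and CF names are illustrative, and this assumes the 0.90/0.92-era
>> bloom API):
>>
>>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>   import org.apache.hadoop.hbase.HColumnDescriptor;
>>   import org.apache.hadoop.hbase.HTableDescriptor;
>>   import org.apache.hadoop.hbase.client.HBaseAdmin;
>>   import org.apache.hadoop.hbase.regionserver.StoreFile;
>>
>>   public class CreateUserHashTable {
>>     public static void main(String[] args) throws Exception {
>>       HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
>>       HTableDescriptor table = new HTableDescriptor("user_hashes");
>>       // Row key = user, column qualifier = hash value: a point query for a
>>       // ROW+COL usually misses, so a ROWCOL bloom skips most StoreFiles.
>>       HColumnDescriptor cf = new HColumnDescriptor("h");
>>       cf.setBloomFilterType(StoreFile.BloomType.ROWCOL);
>>       table.addFamily(cf);
>>       admin.createTable(table);
>>     }
>>   }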
>>
>> >3. How can we ensure that compaction will not suck too much I/O if we
>> >cannot control major compaction?
>>
>> TCP Congestion Control will ensure that a single TCP socket won't consume
>> too much bandwidth, so that part of compactions is automatically handled.
>> The part that you need to handle is the number of simultaneous TCP sockets
>> (currently 1 until multi-threaded compactions) & the aggregate data volume
>> transferred over time.  As I said, this is controlled by compaction.ratio.
>> If temporarily high StoreFile counts cause you to bottleneck, slight
>> latency variance is an annoyance of the current compaction algorithm, but
>> the underlying problem you should be looking at solving is the system's
>> inability to filter out the unnecessary StoreFiles.
>>
>>


-- 
Mikael.S

Re: Major Compaction Concerns

Posted by Nicolas Spiegelberg <ns...@fb.com>.
>1. My CFs were already configured with BF; they used ROWCOL (I didn't pay
>attention to that at the time I wrote my answers).
>2. I see from the logs that the BF is already at 100% - is that bad? Should
>I add more memory for the BF?

Since Bloom Filters are a probabilistic optimization, it's kinda hard to
analyze your efficiency.  Mostly, we rely on theory and a little bit of
experimentation.  Basically, you want your key queries to have a high miss
rate on HFiles.  This doesn't mean that the key doesn't exist in the
Store.  It just means that you're not constantly writing to it, so it
doesn't exist in all N StoreFiles.  Optimally, you want 1 of the blooms to
hit (key exists in file) and N-1 to miss.  Metrics that you can look at
(I'm not sure in which versions these were introduced):

keymaybeinbloomcnt : number of bloom hits
keynotinbloomcnt : number of bloom misses.
staticbloomsizekb : size that bloom data takes up in memory (HFileV1)


Note that per-CF metrics are added in 0.94, so you can watch bloom
efficiency at a finer granularity.

>3. HLog compression (HBASE-4608) is not scheduled yet - is that intentional?

There's limited bandwidth and this is an open source project, so... :)

>4. Compaction.ratio is only in the 0.92.x releases, so I cannot use it yet.

"hbase.hstore.compaction.ratio" is in 0.90
(https://svn.apache.org/repos/asf/hbase/branches/0.90/src/main/java/org/apa
che/hadoop/hbase/regionserver/Store.java)


>6. I have also noticed that in a workload of pure inserts (no reads, empty
>regions, new keys) the store files on the RS can reach more than 4500
>files; nevertheless, in an update/read scenario the store files did not
>exceed 1500 files per region (flush throttling was active there but not in
>the insert case). Is there an explanation for that?

That depends on the size of your major-compacted data.  Updates will
dedupe and lower your compaction volume, whereas pure inserts are all
net-new data, so every flush adds to the total volume that compactions
must rewrite.