Posted to issues@hbase.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2011/01/04 20:45:47 UTC

[jira] Created: (HBASE-3411) manually compact memstores?

manually compact memstores?
---------------------------

                 Key: HBASE-3411
                 URL: https://issues.apache.org/jira/browse/HBASE-3411
             Project: HBase
          Issue Type: Brainstorming
          Components: regionserver
            Reporter: Todd Lipcon
         Attachments: hbase-3411.txt

I have a theory and some experiments that indicate our heap fragmentation issues have to do with the KV buffers from memstores ending up entirely interleaved in the old gen. I had a bit of wine and came up with a wacky idea to have a thread which continuously defragments memstore data buffers into contiguous segments, hopefully to keep old gen fragmentation down.

It didn't seem to work just yet, but I wanted to show the patch to some people.
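
For illustration, a minimal sketch of the idea (not the attached patch), assuming a hypothetical Cell class that just wraps a slice of a byte[]; the real memstore kvset and its KeyValues are more involved:

{code}
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a KeyValue backed by a slice of a byte[].
final class Cell {
    final byte[] buf;
    final int offset;
    final int length;

    Cell(byte[] buf, int offset, int length) {
        this.buf = buf;
        this.offset = offset;
        this.length = length;
    }
}

final class MemstoreDefragmenter {
    /**
     * Deep-copy every cell's bytes into one contiguous byte[] and rebuild the
     * cell list on top of it; the old, interleaved buffers become garbage that
     * the collector can reclaim as larger contiguous regions.
     */
    static List<Cell> compact(List<Cell> cells) {
        int total = 0;
        for (Cell c : cells) {
            total += c.length;
        }
        byte[] contiguous = new byte[total];
        List<Cell> compacted = new ArrayList<>(cells.size());
        int pos = 0;
        for (Cell c : cells) {
            System.arraycopy(c.buf, c.offset, contiguous, pos, c.length);
            compacted.add(new Cell(contiguous, pos, c.length));
            pos += c.length;
        }
        return compacted;
    }
}
{code}

A background thread would run something like this periodically per memstore and swap the compacted list in, dropping the old interleaved buffers for the collector to reclaim.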

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] [Resolved] (HBASE-3411) manually compact memstores?

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon resolved HBASE-3411.
--------------------------------

    Resolution: Won't Fix

This seems to have been obviated by MSLAB in later 0.90 releases and in 0.92+.
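
Roughly, MSLAB avoids the fragmentation by copying incoming cell data into a handful of large per-memstore chunks, so a flush frees whole chunks and leaves big contiguous holes in the old gen instead of millions of tiny interleaved ones. A simplified sketch of that allocation pattern (illustrative names only, not the actual MemStoreLAB code):

{code}
// Simplified illustration of slab-style allocation: each memstore copies
// incoming cell data into a few large chunks, so when it flushes, whole
// chunks become garbage at once.
final class SlabAllocator {
    /** Where a piece of data ended up: which chunk and at what offset. */
    static final class Allocation {
        final byte[] chunk;
        final int offset;
        Allocation(byte[] chunk, int offset) { this.chunk = chunk; this.offset = offset; }
    }

    private static final int CHUNK_SIZE = 2 * 1024 * 1024; // 2MB, illustrative
    private byte[] currentChunk = new byte[CHUNK_SIZE];
    private int position = 0;

    Allocation allocate(byte[] data, int offset, int length) {
        if (length > CHUNK_SIZE) {
            // Oversized values fall back to their own dedicated buffer.
            byte[] own = new byte[length];
            System.arraycopy(data, offset, own, 0, length);
            return new Allocation(own, 0);
        }
        if (position + length > CHUNK_SIZE) {
            currentChunk = new byte[CHUNK_SIZE]; // roll over to a fresh chunk
            position = 0;
        }
        System.arraycopy(data, offset, currentChunk, position, length);
        Allocation a = new Allocation(currentChunk, position);
        position += length;
        return a;
    }
}
{code}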
                
> manually compact memstores?
> ---------------------------
>
>                 Key: HBASE-3411
>                 URL: https://issues.apache.org/jira/browse/HBASE-3411
>             Project: HBase
>          Issue Type: Brainstorming
>          Components: regionserver
>            Reporter: Todd Lipcon
>         Attachments: hbase-3411.txt
>
>
> I have a theory and some experiments that indicate our heap fragmentation issues have to do with the KV buffers from memstores ending up entirely interleaved in the old gen. I had a bit of wine and came up with a wacky idea to have a thread which continuously defragments memstore data buffers into contiguous segments, hopefully to keep old gen fragmentation down.
> It didn't seem to work just yet, but I wanted to show the patch to some people.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (HBASE-3411) manually compact memstores?

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HBASE-3411:
-------------------------------

    Attachment: hbase-3411.txt

Here it is in all its buggy glory

> manually compact memstores?
> ---------------------------
>
>                 Key: HBASE-3411
>                 URL: https://issues.apache.org/jira/browse/HBASE-3411
>             Project: HBase
>          Issue Type: Brainstorming
>          Components: regionserver
>            Reporter: Todd Lipcon
>         Attachments: hbase-3411.txt
>
>
> I have a theory and some experiments that indicate our heap fragmentation issues have to do with the KV buffers from memstores ending up entirely interleaved in the old gen. I had a bit of wine and came up with a wacky idea to have a thread which continuously defragments memstore data buffers into contiguous segments, hopefully to keep old gen fragmentation down.
> It didn't seem to work just yet, but I wanted to show the patch to some people.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-3411) manually compact memstores?

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977445#action_12977445 ] 

stack commented on HBASE-3411:
------------------------------

IRC log where Todd explains more about what he was up to:
{code}
11:27 < tlipcon> i was up late last night trying a crazy experiment
11:27 -!- Infin1ty [~Infin1ty@pdpc/supporter/active/infin1ty] has joined #hbase
11:27 < tlipcon> sadly it didn't work, but i may have a JVM bug to blame for it
11:27 < tlipcon> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6999988
11:27 < jdcryans> stop breaking everything! ;)
11:28 < tlipcon> so I need to try again on some different jvm versions/options
11:28  * tlipcon added a thread that does something counterintuitive
11:28 < tlipcon> it runs through memstores and copies all the bytes just for the hell of it
11:28 < tlipcon> and drops the old kvset
11:28 < jdcryans> to generate garbage?
11:28 < tlipcon> basically makes a deep copy of the kvset
11:28 < tlipcon> to compact all the byte[] references into a single new byte[]
11:29 < tlipcon> my theory is that all our CMS issues are because the heap gets really fragmented. I wrote a simple program that inserts randomly into a bunch of kvsets and saw the same behavior
11:29 < larsgeorge> oh, your own allocator, tres chique
11:29 < tlipcon> it makes sense because our insertion order is random across regions, and so when the data is promoted into old gen it's completely fragmented, like ABCABACADDBCADABAD if those are different memstores
11:29 < tlipcon> then when we flush a memstore we end up with AB ABA ADDB ADABAD
11:30 < tlipcon> ie fragmentation
11:30 < tlipcon> so this thread just runs through memstores and recopies them into contiguous buffers and drops the old one, in an attempt to defragment the old gen continuously
11:30 < tlipcon> unfortunately it didn't work :)
11:30 < larsgeorge> tlipcon: you are THAT close to allocating your own DirectByteBuffer and use that.... I know it!
11:30 < tlipcon> yea, that gets tricky though
11:31 < tlipcon> unless we add a copy for every read, or some kind of ref counting
11:31 < tlipcon> the issue is deallocating them
11:31 < tlipcon> actually have another idea, too... basically use jni to mmap(MAP_ANON) a big chunk of VM for each memstore
11:32 < tlipcon> then we compact into there
11:32 < tlipcon> or just allocate from there for memstore data
11:32 < tlipcon> then when the memstore flushes we just munmap it
11:32 < tlipcon> would end up using lots of vmem but our res size should be identical if not better
11:33 < tlipcon> but again the issue is those leaked references for zero-copy
11:33 < tlipcon> need some more atomic counters floating around
11:33 < tlipcon> but I think it's doable
11:33 < St^Ack> tlipcon: compact all byte references into one?
11:33 < tlipcon> St^Ack: yea that's what it's doing now, ish
11:34 < tlipcon> i did a little mini heuristic
11:34 < tlipcon> it takes the total data size of the memstore, and compacts any byte buffers smaller than 1/4 that size or something
11:34 < tlipcon> otherwise i found that big memstores wasted a lot of copying
11:41 < tlipcon> it'll probably make more sense
11:41 < St^Ack> thanks honey
11:41 < St^Ack> Your ABC picture above, is that kvs or is it memstores you are drawing?
11:41 < tlipcon> it's byte[]s in the heap
11:41 < tlipcon> right
11:41 < St^Ack> k
11:42 < tlipcon> basically the data from memstores ends up interleaved throughout the heap
11:42 < tlipcon> so when we free one we only free up a bunch of kv-sized segments in the old gen
11:42 < tlipcon> sorry I should say kv-buffer-sized
11:44 -!- ak2 [cfabb465@gateway/web/freenode/ip.207.171.180.101] has joined #hbase
11:45 -!- rberger [~rberger@adsl-99-48-184-49.dsl.snfc21.sbcglobal.net] has quit [Remote host closed the connection]
11:46 -!- rberger [~rberger@adsl-99-48-184-49.dsl.snfc21.sbcglobal.net] has joined #hbase
11:46  * St^Ack looking
...
11:46 < tlipcon> i'm sure it's buggy, i got various byte buffer overflow errors and such
...
11:51 < St^Ack> tlipcon: oh, this is pretty basic todd...
11:52 < St^Ack> why don't it work?
...
11:52 < tlipcon> you mean my memorycompactor crap?
11:52 < St^Ack> yeah
11:52 < tlipcon> unclear as of yet
11:52 < tlipcon> it was 3am so I went to bed
11:52 < St^Ack> smile
11:52 -!- matt_c [~matt_c@gateway.the-worldco.com] has quit [Read error: Connection reset by peer]
11:52 < tlipcon> one theory is that now we're generating garbage twice as fast ;-)
11:52 -!- matt_c [~matt_c@gateway.the-worldco.com] has joined #hbase
11:52 < tlipcon> and still odd sizes... so we just have fragmentation at a larger granularity
11:53 < tlipcon> so perhaps having the compactor compact into slabs of a small number of preset sizes would work
11:53 -!- amoksoft [~Adium@204.15.3.162] has quit [Ping timeout: 272 seconds]
11:53 -!- cheddar [~cheddar@c-24-5-65-170.hsd1.ca.comcast.net] has joined #hbase
11:54 -!- matt_c_ [~matt_c@gateway.the-worldco.com] has joined #hbase
11:55 -!- posix4e [~posix4e@38.102.147.105] has joined #hbase
11:56 < St^Ack> that new sun bug is a downer
11:57 < dj_ryan> hmm fragmentation eh
11:58 -!- amoksoft [~Adium@204.15.3.162] has joined #hbase
11:58 -!- matt_c [~matt_c@gateway.the-worldco.com] has quit [Ping timeout: 260 seconds]
11:59 < tlipcon> yea i hacked around for a while last night with the jhat API also
11:59 -!- patrick_angeles [~Adium@cpe-24-193-230-114.nyc.res.rr.com] has joined #hbase
12:00 < dj_ryan> what does the jhat api get you?
12:01 < dj_ryan> 3401 looks interesting
12:01 < tlipcon> looking at a heap dump to see how the objects actually end up laid out in old gen
12:01 < tlipcon> i did a little test app and the jhat analyzer for that
12:01 < tlipcon> but didn't try it on a 5G heap dump yet
....
12:04 < St^Ack> That sun bug you posted would seem to indicate that we should recommend folks NOT run > u20
12:04 < tlipcon> St^Ack: well, it's unclear
12:04 < tlipcon> because the bug was potentially introduced by another "fix" which "improved" the fragmentation behavior
12:05 < tlipcon> so we should really do some kind of test to see
12:05 < tlipcon> the very best thing we could do, I think, is to have a fake regionserver that doesn't touch disk, but goes through all the motions of servicing a workload
12:05 < St^Ack> tlipcon: i'd read about u21 fix for fragmentation ... bummer that it makes it worse
12:05 < tlipcon> then we could try it with different JVM options, send it to the hotspot people to use as a test case, etc
12:06 < St^Ack> tlipcon: ok
12:06 < St^Ack> i could work on that
12:06 < tlipcon> but it's a messy prospect
12:06 < tlipcon> i tried to make a little fake one the other day, but i don't think it's quite realistic enough
12:07 < St^Ack> so you can't gen the promotion failures?
12:07 < St^Ack> the ycsb is no good for that because all keys same size?
12:07 < tlipcon> ycsb generates it real easily on a real cluster
12:07 < dj_ryan> we can get zipfian for length
12:08 < tlipcon> even with a simple YCSB workload I get the same frag issue
12:08 < tlipcon> I need to read the JDK source a bit more to understand how the promotion copy works
12:08 < tlipcon> I think it needs to find contiguous space for all the promoted objects
12:09 < dj_ryan> that is a good question
12:09 < dj_ryan> i doubt that is the case
12:09 < tlipcon> there's a concept of a PLAB
12:09 < tlipcon> but the details out there are kind of hazy
...
{code}
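
To make the size heuristic from the chat concrete (only recopy backing buffers smaller than about a quarter of the memstore's total data size, so big memstores don't waste copying), here is a rough sketch over plain byte[] buffers; the real patch would also have to repoint each KeyValue at its new offset in the merged buffer:

{code}
import java.util.ArrayList;
import java.util.List;

// Rough sketch of the heuristic described above (not the attached patch):
// copy only buffers smaller than a quarter of the total data size into one
// contiguous buffer, and leave the big ones alone.
final class SelectiveCompactor {
    static List<byte[]> compact(List<byte[]> buffers) {
        long total = 0;
        for (byte[] b : buffers) {
            total += b.length;
        }
        long threshold = total / 4;

        int smallTotal = 0;
        for (byte[] b : buffers) {
            if (b.length < threshold) {
                smallTotal += b.length;
            }
        }

        byte[] merged = new byte[smallTotal];
        int pos = 0;
        List<byte[]> kept = new ArrayList<>();
        for (byte[] b : buffers) {
            if (b.length < threshold) {
                System.arraycopy(b, 0, merged, pos, b.length);
                pos += b.length;
            } else {
                kept.add(b); // already big enough; copying it gains little
            }
        }
        if (smallTotal > 0) {
            kept.add(merged);
        }
        return kept;
    }
}
{code}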

> manually compact memstores?
> ---------------------------
>
>                 Key: HBASE-3411
>                 URL: https://issues.apache.org/jira/browse/HBASE-3411
>             Project: HBase
>          Issue Type: Brainstorming
>          Components: regionserver
>            Reporter: Todd Lipcon
>         Attachments: hbase-3411.txt
>
>
> I have a theory and some experiments that indicate our heap fragmentation issues have to do with the KV buffers from memstores ending up entirely interleaved in the old gen. I had a bit of wine and came up with a wacky idea to have a thread which continuously defragments memstore data buffers into contiguous segments, hopefully to keep old gen fragmentation down.
> It didn't seem to work just yet, but I wanted to show the patch to some people.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.