Posted to dev@hbase.apache.org by Andrew Purtell <ap...@apache.org> on 2011/03/19 19:15:21 UTC

minor compaction bug (was: upsert case performance problem)

See below.

Doing some testing on that, I let the mapreduce program run overnight, with an hbase shell flushing every 60 seconds. The result on two tables was:

562 store files!

   ip-10-170-34-18.us-west-1.compute.internal:60020 1300494200158
       requests=51, regions=1, usedHeap=1980, maxHeap=3960
       akamai.ip,,1300494562755.1b0614eaecca0d232d7315ff4a3ebb87.
             stores=1, storefiles=562, storefileSizeMB=310, memstoreSizeMB=1, storefileIndexSizeMB=2

528 store files!

    ip-10-170-49-35.us-west-1.compute.internal:60020 1300494214101
        requests=79, regions=1, usedHeap=1830, maxHeap=3960
        akamai.domain,,1300494560898.af85225ae650574dbc4caa34df8b6a35.
             stores=1, storefiles=528, storefileSizeMB=460, memstoreSizeMB=3, storefileIndexSizeMB=3

... so that killed performance after a while ...

Here's something else.

   - Andy

--- On Sat, 3/19/11, Andrew Purtell <ap...@apache.org> wrote:

From: Andrew Purtell <ap...@apache.org>
Subject: upsert case performance problem (doubts about ConcurrentSkipListMap)
To: dev@hbase.apache.org
Date: Saturday, March 19, 2011, 11:10 AM

I have a mapreduce task put together for experimentation which does a lot of Increments over three tables and Puts to a fourth. I set writeToWAL to false. My HBase includes the patch that fixes serialization of writeToWAL for Increments. MemstoreLAB is enabled but is probably not a factor; I still need to test to exclude it.

After starting the job on a test cluster on EC2 with 20 mappers over 10 slaves, I initially see 10-15K ops/sec/server. This performance drops over a short time and stabilizes around 1K ops/sec/server. So I flush the tables with the shell. Immediately after flushing the tables, performance is back up to 10-15K ops/sec/server. If I don't flush, performance remains low indefinitely. If I flush only the table receiving the Gets, performance remains low.

If I set the shell to flush in a loop every 60 seconds, performance repeatedly drops during that interval, then recovers after flushing.

When Gary and I went to NCHC in Taiwan, we saw a presenter from PhiCloud describe something similar to this regarding 0.89DR. He measured the performance of the memstore for a get-and-put use case over time and graphed it; the curve looked like a staircase trending toward O(n). This was a surprising result, since ConcurrentSkipListMap#put is supposed to run in O(log n). His workaround was to flush after some fixed number of gets+puts, 1000 I think. At the time we weren't sure what was going on given the language barrier.
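For what it's worth, here is a self-contained sketch (plain JDK, all class and row names mine, a deliberate simplification of how versioned KeyValues sort in the memstore) of one mechanism consistent with that flush behavior: if every increment inserts a new versioned entry rather than replacing the old one, a single hot counter accumulates one map entry per update, so n keeps growing even though each individual put stays O(log n).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Simulates the memstore pattern for one hot counter: each "increment"
// inserts a NEW entry keyed by (row, version) instead of replacing the
// old one, so the map grows by one entry per update until a flush.
public class UpsertSketch {
    // Newest version sorts first for a given row, mimicking KeyValue order.
    record Key(String row, long version) implements Comparable<Key> {
        public int compareTo(Key o) {
            int c = row.compareTo(o.row);
            return c != 0 ? c : Long.compare(o.version, version); // descending
        }
    }

    static ConcurrentSkipListMap<Key, Long> fill(int n) {
        ConcurrentSkipListMap<Key, Long> memstore = new ConcurrentSkipListMap<>();
        long version = 0;
        for (int i = 0; i < n; i++) {
            // Read the newest version for the row, then insert a newer one.
            Map.Entry<Key, Long> newest =
                memstore.ceilingEntry(new Key("row1", Long.MAX_VALUE));
            long current = newest == null ? 0 : newest.getValue();
            memstore.put(new Key("row1", ++version), current + 1);
        }
        return memstore;
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<Key, Long> memstore = fill(10_000);
        // One counter, 10,000 increments -> 10,000 live entries until a flush.
        System.out.println(memstore.size() + " entries, newest value "
            + memstore.firstEntry().getValue());
        memstore.clear(); // the "flush": n resets to zero
        System.out.println(memstore.size() + " entries after flush");
    }
}
```

Flushing (here, clear()) resets n, which would be consistent with throughput recovering immediately after a shell flush.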

Sound familiar?

I don't claim to really understand what is going on, but I need to get to the bottom of this. I'm going to look at it in depth starting Monday.

   - Andy



      

Re: minor compaction bug (was: upsert case performance problem)

Posted by Andrew Purtell <ap...@apache.org>.
Hi Dhruba,

> another bottleneck that I am seeing is that all transactions need to come to
> a halt when rolling hlogs, the reason being that all transactions need to be
> drained before we can close the hlog

I didn't measure the rate, but I'd expect log rolls quite often given a constant as-many-writes-as-we-can-push workload.

The memstore performance limitation suggested by the impact of flushing was the dominant factor.

> InitialOccupancyFactor
> what is the size of ur NewGen?

This is what I'm testing with: -Xmx4000m -Xms4000m -Xmn400m \
  -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseParNewGC \
  -XX:+CMSParallelRemarkEnabled -XX:MaxGCPauseMillis=100 \
  -XX:+UseMembar

> how many client threads 

20 single-threaded mapreduce clients

> and how many region server handler threads are u using?

100 per rs

> For increment operation, I introduced the concept of a 
> ModifyableKeyValue whereby every increment actually updates
> the same KeyValue record if found in the MemStore (instead
> of creating a new KeyValue record and re-inserting
> it into memstore).

Patch! Patch! Patch! :-) :-)
(I'd ... consider ... trying it.)
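In the meantime, a toy sketch of the idea (plain JDK, names mine, not Dhruba's actual patch): map each counter to one mutable cell and bump it in place, so the map holds one entry per counter no matter how many increments arrive. The cost glossed over here is losing the immutable-KeyValue versioning semantics the memstore otherwise relies on.

```java
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

// Toy version of "update in place": instead of inserting a new immutable
// entry per increment, each counter maps to a single mutable cell that is
// bumped in place, so the map never grows with the update count.
public class InPlaceIncrement {
    static final ConcurrentSkipListMap<String, AtomicLong> counters =
        new ConcurrentSkipListMap<>();

    static long increment(String row, long delta) {
        // computeIfAbsent creates the cell once; later increments reuse it.
        return counters.computeIfAbsent(row, r -> new AtomicLong())
                       .addAndGet(delta);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10_000; i++) increment("row1", 1);
        System.out.println(counters.size());            // 1
        System.out.println(counters.get("row1").get()); // 10000
    }
}
```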

   - Andy

--- On Sat, 3/19/11, Dhruba Borthakur <dh...@gmail.com> wrote:

> From: Dhruba Borthakur <dh...@gmail.com>
> Subject: Re: minor compaction bug (was: upsert case performance problem)
> To: dev@hbase.apache.org, apurtell@apache.org
> Date: Saturday, March 19, 2011, 10:24 PM
> Hi andrew,
> 
> I have been doing a set of experiments for the last month on a workload
> that is purely "increments". I too have seen that the performance drops when
> the memstore fills up. My guess is that although the complexity is O(log n),
> when n is large the time needed to insert/lookup could still be large. It
> would have been nice if it were a HashMap instead of a tree, but the
> tradeoff is that we would have to sort it while writing to the hfile.
> 
> another bottleneck that I am seeing is that all transactions need to come to
> a halt when rolling hlogs, the reason being that all transactions need to be
> drained before we can close the hlog. how frequently is this occurring in ur
> case?
> 
> how much GC are u seeing, and what is the InitialOccupancyFactor for the JVM?
> I have set InitialOccupancyFactor to 40 in my case. what is the size of ur
> NewGen?
> 
> how many client threads and how many region server handler threads are u
> using?
> 
> For increment operation, I introduced the concept of a ModifyableKeyValue
> whereby every increment actually updates the same KeyValue record if found
> in the MemStore (instead of creating a new KeyValue record and re-inserting
> it into memstore).
> 
> I am very interested in exchanging notes on what else u find,
> thanks,
> dhruba
> 
> -- 
> Connect to me at http://www.facebook.com/dhruba
> 


      

Re: minor compaction bug (was: upsert case performance problem)

Posted by Dhruba Borthakur <dh...@gmail.com>.
Hi andrew,

I have been doing a set of experiments for the last month on a workload
that is purely "increments". I too have seen that the performance drops when
the memstore fills up. My guess is that although the complexity is O(log n),
when n is large the time needed to insert/lookup could still be large. It
would have been nice if it were a HashMap instead of a tree, but the
tradeoff is that we would have to sort it while writing to the hfile.
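A minimal sketch of that tradeoff (plain JDK, illustrative names): the HashMap makes each upsert O(1) with no tree to rebalance, but because hfiles must be written in key order, the flush has to sort every key, paying the deferred O(n log n) at write-out time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hash-vs-tree tradeoff: O(1) upserts into a HashMap, but an O(n log n)
// sort at flush time because hfile output must be in key order.
public class HashMemstoreSketch {
    static final Map<String, Long> memstore = new HashMap<>();

    static void upsert(String row, long delta) {
        memstore.merge(row, delta, Long::sum); // O(1) expected, no rebalance
    }

    // "Flush": sort the keys once so output is in hfile key order.
    static List<String> flushOrder() {
        List<String> keys = new ArrayList<>(memstore.keySet());
        keys.sort(null); // natural order; the deferred O(n log n) cost
        return keys;
    }

    public static void main(String[] args) {
        upsert("row-b", 2);
        upsert("row-a", 1);
        upsert("row-a", 1);
        System.out.println(flushOrder());          // [row-a, row-b]
        System.out.println(memstore.get("row-a")); // 2
    }
}
```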

another bottleneck that I am seeing is that all transactions need to come to
a halt when rolling hlogs, the reason being that all transactions need to be
drained before we can close the hlog. how frequently is this occurring in ur
case?
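That drain can be sketched with a ReadWriteLock (plain JDK, illustrative names; the real HLog uses its own synchronization, and this single-threaded demo leaves the counters unsynchronized): appends share the read lock, while a roll takes the exclusive write lock, so the roll waits for in-flight appends to finish and stalls new ones until the new hlog is open.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of drain-on-roll: appends share the read lock; a roll takes the
// exclusive write lock, so it blocks until all in-flight appends finish
// and new appends block until the roll completes.
public class LogRollSketch {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    long appended = 0;
    long rolls = 0;

    void append(byte[] edit) {
        lock.readLock().lock();   // many appends may hold this concurrently
        try {
            appended++;           // stand-in for writing the edit to the hlog
        } finally {
            lock.readLock().unlock();
        }
    }

    void roll() {
        lock.writeLock().lock();  // granted only once all appends drain
        try {
            rolls++;              // stand-in for close-old/open-new hlog
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        LogRollSketch log = new LogRollSketch();
        for (int i = 0; i < 1000; i++) log.append(new byte[0]);
        log.roll();               // would block here while appends drain
        System.out.println(log.appended + " appends, " + log.rolls + " rolls");
    }
}
```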

how much GC are u seeing, and what is the InitialOccupancyFactor for the JVM?
I have set InitialOccupancyFactor to 40 in my case. what is the size of ur
NewGen?

how many client threads and how many region server handler threads are u
using?

For increment operation, I introduced the concept of a ModifyableKeyValue
whereby every increment actually updates the same KeyValue record if found
in the MemStore (instead of creating a new KeyValue record and re-inserting
it into memstore).

I am very interested in exchanging notes on what else u find,
thanks,
dhruba



-- 
Connect to me at http://www.facebook.com/dhruba