Posted to dev@hbase.apache.org by Ted Yu <yu...@gmail.com> on 2011/10/24 19:19:23 UTC

CSLM performance Was: SILT - nice keyvalue store paper

Akash:
Take a look at the following two possible replacements for CSLM:
https://github.com/mspiegel/lockfreeskiptree
https://github.com/nbronson/snaptree

About your test, I got advice from another expert:

- the JVM warm-up penalty is being accrued by the CSLM run
- Use Thread.yield() to end a request, as otherwise the active thread may be
able to run through its time slice without incurring much lock contention.
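
For illustration, a minimal harness applying both suggestions might look like
the sketch below (class name and sizes are invented for the example, not taken
from the attached test): one untimed warm-up pass so the CSLM run does not also
pay the JIT cost, and a Thread.yield() after every simulated request.

    import java.util.concurrent.ConcurrentSkipListMap;
    import java.util.concurrent.CountDownLatch;

    public class YieldingCslmBench {
        static final ConcurrentSkipListMap<Long, Long> MAP =
            new ConcurrentSkipListMap<Long, Long>();

        // One pass: 'threads' workers, each doing 'ops' put/get/remove requests,
        // yielding at the end of every request.
        static long pass(final int threads, final int ops) throws InterruptedException {
            final CountDownLatch done = new CountDownLatch(threads);
            long start = System.nanoTime();
            for (int t = 0; t < threads; t++) {
                final long base = (long) t * ops;
                new Thread(new Runnable() {
                    public void run() {
                        for (int i = 0; i < ops; i++) {
                            long k = base + i;
                            MAP.put(k, k);      // one simulated request
                            MAP.get(k);
                            MAP.remove(k);
                            Thread.yield();     // give up the time slice between requests
                        }
                        done.countDown();
                    }
                }).start();
            }
            done.await();
            return (System.nanoTime() - start) / 1000000L;
        }

        public static void main(String[] args) throws InterruptedException {
            pass(100, 10000);                             // warm-up pass, timing discarded
            System.out.println(pass(100, 10000) + "ms");  // measured pass
        }
    }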

You can publish your code and results under HBASE-3992.

Thanks

On Sun, Oct 23, 2011 at 5:05 PM, Jonathan Gray <jg...@fb.com> wrote:

> Oh, and when running these experiments, you should look at the impact of
> the order in which they are run, whether you run them multiple times per
> JVM instance, etc.  Basically, you need to be cognizant of the HotSpot
> optimizations the JVM is doing at runtime.
>
> > -----Original Message-----
> > From: Jonathan Gray [mailto:jgray@fb.com]
> > Sent: Sunday, October 23, 2011 4:20 PM
> > To: dev@hbase.apache.org
> > Subject: RE: SILT - nice keyvalue store paper
> >
> > Very nice experiment, Akash.  Keep getting your hands dirty and digging!
>  :)
> >
> > I think your results might change if you bump the test up to 1000 threads
> > or so.  100 threads can still perform okay when there's a global lock, but
> > the contention at 1000 threads will kill you, and that's when CSLM should
> > do much better.  (1000 handler threads is approx. what I run with on RS in
> > prod.)  Though I am a bit surprised that at 100 threads the TreeMap was
> > significantly faster.  Your inconsistent results are a bit odd; you might
> > try an order of magnitude more operations per thread.  You might also
> > gather some statistics about tree size and per-operation latency.
> >
> > I've done some isolated CSLM benchmarks in the past and have never been
> > able to reproduce any of the slowness people suggest.  I recall trying
> > some impractically large MemStores and everything still being quite fast.
> >
> > Over in Cassandra, I believe they have a two-level CSLM, with the first
> > map key being the row and then the columns for each row in their own CSLM.
> > I've been told this is somewhat of a pain point for them.  And keep in
> > mind they have one shard/region per node and we generally have several
> > smaller MemStores on each node (tens to thousands).  Not sure we would
> > want to try that.  There could be some interesting optimizations if you
> > had very specific issues; for example, if you had a ton of reads to
> > MemStore and not many writes, you could keep some kind of mirrored
> > hashmap.
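
As an aside, a rough sketch of that mirrored-map idea (purely illustrative,
not anything that exists in HBase; keeping the two maps consistent under
concurrent writers would need more care than shown here):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentNavigableMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    // Point reads come from the hash map; ordered scans still use the skip list.
    class MirroredStore<K, V> {
        private final ConcurrentNavigableMap<K, V> sorted =
            new ConcurrentSkipListMap<K, V>();
        private final ConcurrentHashMap<K, V> mirror = new ConcurrentHashMap<K, V>();

        void put(K key, V value) {
            sorted.put(key, value);   // ordered structure, needed for scans
            mirror.put(key, value);   // O(1) lookups for the read-heavy path
        }

        V get(K key) {
            return mirror.get(key);   // avoids walking the skip list for point reads
        }

        ConcurrentNavigableMap<K, V> scanFrom(K fromKey) {
            return sorted.tailMap(fromKey, true);
        }
    }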
> >
> > And for writes, the WAL is definitely the latency bottleneck.  But if you
> > are doing lots of small operations, so your WALEdits are not large, and
> > with some of the HLog batching features going into trunk, you end up with
> > hundreds of requests per HLog sync.  And although the syncs are higher
> > latency, with batching you end up getting high throughput.  And the
> > bottleneck shifts.
> >
> > Each sync will take approx. 1-5ms, so let's say 250 requests per HLog
> > sync batch and 4ms per sync: that's 250 syncs/sec * 250 req/sync = 62.5k
> > req/sec.  (62.5k req/sec * 100 bytes/req is about 6.25MB/sec of WAL
> > traffic, very reasonable.)  If you're mixing in reads as well (or if
> > you're doing increments, which do a read and a write), then this adds to
> > the CPU usage and contention without adding to HLog throughput.
> >
> > All of a sudden the bottleneck becomes CPU/contention and not HLog
> > latency or throughput.  Highly concurrent increments/counters with a
> > largely in-memory dataset can easily be CPU bottlenecked.
> >
> > For one specific application Dhruba and I worked on, we made some good
> > improvements in CPU efficiency by reducing the number of operations and
> > increasing efficiency on the CSLM: doing things like always taking a
> > tailMap and working from that instead of starting at the root node, using
> > an iterator() and taking advantage of the available remove() semantics,
> > or simply just mutating things that are normally immutable :)
> > Unfortunately many of these optimizations were semi-horrid hacks and
> > introduced things like ModifiableKeyValues, so they haven't all made
> > their way to Apache.
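
A toy example of the tailMap/iterator pattern being described (this is not the
Facebook patch, just a sketch of the access pattern): instead of issuing
repeated lookups that each start from the head of the skip list, position once
with tailMap() and then walk and delete through the iterator, which on CSLM
views supports remove().

    import java.util.Iterator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentNavigableMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    class CslmRangeDelete {
        // Remove all entries with keys in [from, to] by walking a tailMap view.
        static int removeRange(ConcurrentSkipListMap<Long, String> map, long from, long to) {
            ConcurrentNavigableMap<Long, String> tail = map.tailMap(from, true);
            Iterator<Map.Entry<Long, String>> it = tail.entrySet().iterator();
            int removed = 0;
            while (it.hasNext()) {
                Map.Entry<Long, String> e = it.next();
                if (e.getKey() > to) {
                    break;           // walked past the range of interest
                }
                it.remove();         // the weakly consistent iterator supports remove()
                removed++;
            }
            return removed;
        }
    }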
> >
> > In the end, after our optimizations, the real-world workload Dhruba and I
> > were working with was not all in-memory, so the bottleneck in production
> > became the random reads (so increasing the block cache hit ratio is the
> > focus) rather than CPU contention or HLog throughput.
> >
> > JG
> >
> > From: Akash Ashok [mailto:thehellmaker@gmail.com]
> > Sent: Sunday, October 23, 2011 2:57 AM
> > To: dev@hbase.apache.org
> > Subject: Re: SILT - nice keyvalue store paper
> >
> > I was running some similar tests and came across a surprising finding. I
> > compared reads and writes on ConcurrentSkipListMap (which the MemStore
> > uses) and a synchronized TreeMap (literally a TreeMap wrapped in
> > synchronization), executing concurrent reads, writes and deletes on both
> > of them.  Surprisingly, the synchronized TreeMap performed better, though
> > just slightly better, than the ConcurrentSkipListMap which
> > KeyValueSkipListSet uses.
> >
> > Here is the output of a few runs.
> >
> > Sometimes the difference was considerable:
> > Using HBaseMap it took 20438ms
> > Using TreeMap it took 11613ms
> > Time Difference: 8825ms
> >
> > And sometimes the difference was negligible:
> > Using HBaseMap it took 13370ms
> > Using TreeMap it took 9482ms
> > Time Difference: 3888ms
> >
> > I'm attaching the Java test file which I wrote.
> > This might be a very minor difference, but it is still surprising
> > considering the fact that ConcurrentSkipListMap uses fancy two-level
> > indexes which are said to improve deletion performance.
> >
> > And here are the details of the test run:
> > 100 threads each fetching 1,000,000 records
> > 100 threads each adding 1,000,000 records
> > 100 threads each deleting 1,000,000 records
> > (reads, writes and deletes running simultaneously)
> >
> > Cheers,
> > Akash A
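
Since the test file is only attached to the original mail, here is a
stripped-down sketch of the kind of comparison described above (thread and
record counts reduced; class and method names are invented for the example):

    import java.util.Collections;
    import java.util.Map;
    import java.util.SortedMap;
    import java.util.TreeMap;
    import java.util.concurrent.ConcurrentSkipListMap;
    import java.util.concurrent.CountDownLatch;

    public class MapContentionBench {
        // Time 'threads' concurrent writers, readers and deleters against one map.
        static long time(final Map<Long, Long> map, final int threads, final int ops)
                throws InterruptedException {
            final CountDownLatch done = new CountDownLatch(threads * 3);
            long start = System.nanoTime();
            for (int t = 0; t < threads; t++) {
                final long base = (long) t * ops;
                new Thread(new Runnable() {   // writer
                    public void run() {
                        for (int i = 0; i < ops; i++) map.put(base + i, base + i);
                        done.countDown();
                    }
                }).start();
                new Thread(new Runnable() {   // reader
                    public void run() {
                        for (int i = 0; i < ops; i++) map.get(base + i);
                        done.countDown();
                    }
                }).start();
                new Thread(new Runnable() {   // deleter
                    public void run() {
                        for (int i = 0; i < ops; i++) map.remove(base + i);
                        done.countDown();
                    }
                }).start();
            }
            done.await();
            return (System.nanoTime() - start) / 1000000L;
        }

        public static void main(String[] args) throws InterruptedException {
            SortedMap<Long, Long> syncTree =
                Collections.synchronizedSortedMap(new TreeMap<Long, Long>());
            long cslmMs = time(new ConcurrentSkipListMap<Long, Long>(), 100, 100000);
            long treeMs = time(syncTree, 100, 100000);
            System.out.println("CSLM: " + cslmMs + "ms, synchronized TreeMap: " + treeMs + "ms");
        }
    }

Note the reader and deleter threads race the writer, so many get()/remove()
calls will miss; that mirrors the "simultaneously" aspect of the description
rather than a steady-state workload.
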
> > On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
> > On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nk...@gmail.com> wrote:
> > > I would think that the bottleneck for insert is the WAL part?
> > > It would be possible to do a kind of memory-list preparation during
> > > the WAL insertion and, if the WAL insertion is confirmed, do the
> > > insertion in the memory list. But it's strange for the in-memory
> > > insertion to matter compared to the insertion on disk...
> > >
> > Yes, the WAL is the long pole when writing.  But MemStore has issues too;
> > Dhruba mentions CPU above.  Reading and writing it is also 'slow'.
> > St.Ack
>
>

Re: CSLM performance Was: SILT - nice keyvalue store paper

Posted by Ted Dunning <td...@maprtech.com>.
No.  The problem is that you want to emulate real-world behavior, which is
probably closer to 1000 threads each doing a single transaction and yielding
than anything else.  For instance, if your traffic originates from a
web-farm, each transaction will be in a thread that yields when it finishes
the current request.  That probably will only result in a few transactions,
and each transaction will be handled by a region server in a different
thread.  How this reflects onto the threads that actually touch the map being
considered is an interesting question, but there are likely to be a hundred
or more such threads.

Inserts, on the other hand, are much more likely to come from a single
thread doing lots of inserts, but each insert will be a separate request and
thus be handled by a different thread on the HBase regionserver side.

The problem that this causes is that you can blow the cache as access to the
table is handed from core to core and socket to socket.
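
A purely illustrative version of that traffic shape (thread count and key
scheme invented here): spawn a fresh, short-lived thread per transaction
instead of reusing long-running workers.

    import java.util.concurrent.ConcurrentSkipListMap;
    import java.util.concurrent.CountDownLatch;

    public class OneTransactionPerThread {
        public static void main(String[] args) throws InterruptedException {
            final ConcurrentSkipListMap<Long, Long> map =
                new ConcurrentSkipListMap<Long, Long>();
            final int transactions = 1000;                 // e.g. 1000 single-shot handlers
            final CountDownLatch done = new CountDownLatch(transactions);
            long start = System.nanoTime();
            for (int i = 0; i < transactions; i++) {
                final long key = i;
                new Thread(new Runnable() {
                    public void run() {
                        map.put(key, key);   // the single transaction for this handler
                        Thread.yield();      // yield as the request finishes
                        done.countDown();
                    }
                }).start();
            }
            done.await();
            System.out.println(transactions + " transactions in "
                + (System.nanoTime() - start) / 1000000L + "ms");
        }
    }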

On Mon, Oct 24, 2011 at 11:14 PM, Akash Ashok <th...@gmail.com> wrote:

> Wouldn't it be enough if I increase the number of iterations by a factor
> of, say, 100 per thread?
>

Re: CSLM performance Was: SILT - nice keyvalue store paper

Posted by Li Pi <li...@idle.li>.
http://www.azulsystems.com/events/javaone_2009/session/2009_J1_Benchmark.pdf

Might also be useful.

On Tue, Oct 25, 2011 at 9:07 AM, Ted Yu <yu...@gmail.com> wrote:
> http://code.google.com/p/caliper/ might be useful.

Re: CSLM performance Was: SILT - nice keyvalue store paper

Posted by Ted Yu <yu...@gmail.com>.
http://code.google.com/p/caliper/ might be useful.

On Tue, Oct 25, 2011 at 7:36 AM, Andrew Purtell <ap...@apache.org> wrote:

> Also please make note of what JVM version and report it with the results.

Re: CSLM performance Was: SILT - nice keyvalue store paper

Posted by Andrew Purtell <ap...@apache.org>.
Also please make note of what JVM version you use and report it with the results.


Best regards,


   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)


----- Original Message -----
> From: Li Pi <li...@idle.li>
> To: dev@hbase.apache.org
> Cc: 
> Sent: Monday, October 24, 2011 11:27 PM
> Subject: Re: CSLM performance Was: SILT - nice keyvalue store paper
> 
> You might also want to run the code a few times, and only take results
> from the latter half. Let the JVM warm up and JIT things for a fair
> comparison.

Re: CSLM performance Was: SILT - nice keyvalue store paper

Posted by Li Pi <li...@idle.li>.
You might also want to run the code a few times, and only take results
from the latter half. Let the JVM warm up and JIT things for a fair
comparison.
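
A sketch of that approach, assuming the test is wrapped in a method that runs
one timed pass (runOnePass() below is a stand-in, not part of the attached
test): run several rounds and only report the second half.

    public class WarmupRounds {
        // Stand-in for one full timed pass of the actual benchmark.
        static long runOnePass() {
            long start = System.nanoTime();
            // ... exercise the map here ...
            return (System.nanoTime() - start) / 1000000L;
        }

        public static void main(String[] args) {
            int rounds = 10;
            long[] times = new long[rounds];
            for (int r = 0; r < rounds; r++) {
                times[r] = runOnePass();
            }
            long sum = 0;
            for (int r = rounds / 2; r < rounds; r++) {  // discard the first half (warm-up)
                sum += times[r];
            }
            System.out.println("avg of last " + (rounds / 2) + " rounds: "
                + sum / (rounds / 2) + "ms");
        }
    }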

Re: CSLM performance Was: SILT - nice keyvalue store paper

Posted by Akash Ashok <th...@gmail.com>.
On Mon, Oct 24, 2011 at 10:49 PM, Ted Yu <yu...@gmail.com> wrote:

> Akash:
> Take a look at the following two possible replacements for CSLM:
> https://github.com/mspiegel/lockfreeskiptree
> https://github.com/nbronson/snaptree

Thanks Ted :) :) :). I was pretty much on the lookout for other structures. I
tried
http://g.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/SyncSortedMap.html
but it didn't perform that great. I will look into these replacements.


>
> About your test, I got advice from another expert:
>
> - the JVM warm-up penalty is being accrued by the CSLM run
> - Use Thread.yield() to end a request, as otherwise the active thread may
> be able to run through its time slice without incurring much lock
> contention.
>

Hmm, I was just wondering. Since each thread runs 10000+ inserts and deletes,
it has to context switch quite a few times during its lifecycle, right?
Wouldn't it be enough if I increase the number of iterations by a factor of,
say, 100 per thread?


