Posted to dev@hbase.apache.org by Todd Lipcon <to...@cloudera.com> on 2011/10/13 07:42:23 UTC

SILT - nice keyvalue store paper

Read this paper last night and thought it had some nice ideas that
would be applicable to HBase:

http://www.cs.cmu.edu/~dga/papers/silt-sosp2011.pdf

I especially like the highly compact indexing scheme (entropy-coded tries)

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

Re: SILT - nice keyvalue store paper

Posted by Akash Ashok <th...@gmail.com>.
Values corrected. There was a mistake in the number of records.


> And here are the details about the test run.
> 100 threads each fetching 10,000 records
> 100 threads each adding 10,000 records.
> 100 threads each deleting 10,000 records
> ( Reads, Writes and deletes simultaneously )
>
> Cheers,
> Akash A
>
>
> On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
>
>> On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nk...@gmail.com> wrote:
>> > I would think that the bottleneck for insert is the wal part?
>> > It would be possible to do a kind of memory list preparation during the
>> wal
>> > insertion, and if the wal insertion is confirmed, do the insertion in
>> the
>> > memory list. But it's strange to have the insertion in memory important
>> vs.
>> > the insertion on disk...
>> >
>>
>> Yes, WAL is the long pole writing.  But MemStore has issues too;
>> Dhruba says cpu above.  Reading and writing it is also 'slow'.
>> St.Ack
>>
>
>

Re: SILT - nice keyvalue store paper

Posted by Akash Ashok <th...@gmail.com>.
On Mon, Oct 24, 2011 at 4:50 AM, Jonathan Gray <jg...@fb.com> wrote:

> Very nice experiment, Akash.  Keep getting your hands dirty and digging!
>  :)
>
> I think your results might change if you bump the test up to 1000 threads
> or so.  100 threads can still perform okay when there's a global lock but
> the contention at 1000 threads will kill you and that's when CSLM should do
> much better.  (1000 handler threads is approx. what I run with on RS in
> prod).  Though I am a bit surprised that at 100 threads the TreeMap was
> significantly faster.  Your inconsistent results are a bit odd, you might
> try an order of magnitude more operations per thread.  You might also gather
> some statistics about tree size and per operation latency.
>

Hey JG. At 1000 threads the situation reversed: CSLM performed much better,
as you said. I then took it a little overboard to see what happens at
10,000 threads, and the globally synchronized TreeMap again performed better
consistently, which was super duper surprising. I am still trying to figure
out why this is happening and will post the results soon.


> I've done some isolated CSLM benchmarks in the past and have never been
> able to reproduce any of the slowness people suggest.  I recall trying some
> impractically large MemStores and everything still being quite fast.


> Over in Cassandra, I believe they have a two-level CSLM with the first map
> key being the row and then the columns for each row in their own CSLM.  I've
> been told this is somewhat of a pain point for them.  And keep in mind they
> have one shard/region per node and we generally have several smaller
> MemStores on each node (tens to thousands).  Not sure we would want to try
> that.  There could be some interesting optimizations if you had very
> specific issues, like if you had a ton of reads to MemStore and not many
> writes you could keep some kind of mirrored hashmap.
>
> And for writes, the WAL is definitely the latency bottleneck.  But if you
> are doing lots of small operations, so your WALEdits are not large, and with
> some of the HLog batching features going in to trunk, you end up with
> hundreds of requests per HLog sync.  And although the syncs are higher
> latency, with batching you end up getting high throughput.  And the
> bottleneck shifts.
>
> Each sync will take approx. 1-5ms, so let's say 250 requests per HLog sync
> batch, 4ms per sync, so 62.5k req/sec.  (62.5k req/sec * 100 bytes/req = 6.25 MB/sec,
> very reasonable).  If you're mixing in reads as well (or if you're doing
> increments which do a read and write), then this adds to the CPU usage and
> contention without adding to HLog throughput.
>
> All of a sudden the bottleneck becomes CPU/contention and not HLog latency
> or throughput.  Highly concurrent increments/counters with a largely
> in-memory dataset can easily be CPU bottlenecked.
>
> For one specific application Dhruba and I worked on, we made some good
> improvements in CPU efficiency by reducing the number of operations and
> increasing efficiency on the CSLM.  Doing things like always taking a
> tailMap and working from that instead of starting at the root node, using an
> iterator() and taking advantage of the available remove() semantics, or
> simply just mutating things that are normally immutable :)  Unfortunately
> many of these optimizations were semi-horrid hacks and introduced things
> like ModifiableKeyValues, so they all haven't made their way to apache.
>
> In the end, after our optimizations, the real world workload Dhruba and I
> were working with was not all in-memory so the bottleneck in production
> became the random reads (so increasing the block cache hit ratio is the
> focus) rather than CPU contention or HLog throughput.
>
> JG
>
> From: Akash Ashok [mailto:thehellmaker@gmail.com]
> Sent: Sunday, October 23, 2011 2:57 AM
> To: dev@hbase.apache.org
> Subject: Re: SILT - nice keyvalue store paper
>
> I was running some similar tests and came across a surprising finding. I
> compared reads and writes on ConcurrentSkipListMap (which the memstore uses)
> and a synchronized TreeMap (which was literally a TreeMap wrapped in
> synchronization). I executed concurrent reads, writes and deletes on both of them.
> Surprisingly, the synchronized TreeMap performed better, though just
> slightly, than the ConcurrentSkipListMap which KeyValueSkipListSet uses.
>
> Here is the output of a few runs
>
> Sometimes the difference was considerable
> Using HBaseMap it took 20438ms
> Using TreeMap it took 11613ms
> Time Difference:8825ms
>
> And sometimes the difference was negligible
> Using HBaseMap it took 13370ms
> Using TreeMap it took 9482ms
> Time Difference:3888ms
>
> I've attached the test Java file which I wrote.
> This might be a very minor difference, but it is still surprising considering
> that ConcurrentSkipListMap uses fancy two-level indexes which they say
> improve deletion performance.
>
> And here are the details about the test run.
> 100 threads each fetching 1,000,000 records
> 100 threads each adding 1,000,000 records.
> 100 threads each deleting 1,000,000 records
> ( Reads, Writes and deletes simultaneously )
>
> Cheers,
> Akash A
> On Sun, Oct 23, 2011 at 3:25 AM, Stack <stack@duboce.net> wrote:
> On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nkeywal@gmail.com> wrote:
> > I would think that the bottleneck for insert is the wal part?
> > It would be possible to do a kind of memory list preparation during the
> wal
> > insertion, and if the wal insertion is confirmed, do the insertion in the
> > memory list. But it's strange to have the insertion in memory important
> vs.
> > the insertion on disk...
> >
> Yes, WAL is the long pole writing.  But MemStore has issues too;
> Dhruba says cpu above.  Reading and writing it is also 'slow'.
> St.Ack
>
>

RE: SILT - nice keyvalue store paper

Posted by Vladimir Rodionov <vr...@carrieriq.com>.
There is nothing crazy about async I/O. Netty (which I presume is the underlying network library in the latest Hadoop) is totally async.
You can run 1,000, 10,000 and maybe more threads on Linux, but performance-wise this would be a sub-optimal decision.
With a 20 ms time slice on an 8-core CPU, some threads can wait for their scheduling slot for up to 2.5 seconds (1,000 threads),
25 seconds (10,000 threads) or 250 seconds (100,000 threads). This is the worst-case scenario, of course. If you do not care about latency it probably won't hurt you too much.
Other problems with scheduling a large number of threads:

1. time the kernel scheduler spends doing nothing useful (from the application's point of view).
2. L1/L2/L3 cache thrashing

One more: keeping the total number of active threads below the number of physical cores (or hardware threads in SPARC TX) is the #1 requirement of a high-performance application.
This lets you cut thread synchronization cost by utilizing otherwise expensive things like spin locks (busy loops). Stack has already
mentioned the LMAX Disruptor framework, which totally bypasses standard Java thread synchronization (they use CAS, busy loops and thread yielding).
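
To make the time-slice arithmetic above concrete, here is a minimal sketch (hypothetical class name; worst-case assumption that every
other runnable thread gets a full slice ahead of the one we care about; this is just the reasoning above, not a measurement):

    public class SchedulingDelaySketch {
        // Worst case: all runnable threads run one full time slice each, spread across the cores.
        static long worstCaseWaitMs(int runnableThreads, int cores, int timeSliceMs) {
            return (long) Math.ceil((double) runnableThreads / cores) * timeSliceMs;
        }

        public static void main(String[] args) {
            System.out.println(worstCaseWaitMs(1000, 8, 20));    // 2500 ms   (~2.5 s)
            System.out.println(worstCaseWaitMs(10000, 8, 20));   // 25000 ms  (~25 s)
            System.out.println(worstCaseWaitMs(100000, 8, 20));  // 250000 ms (~250 s)
        }
    }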

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: lars hofhansl [lhofhansl@yahoo.com]
Sent: Monday, October 24, 2011 12:04 PM
To: dev@hbase.apache.org
Subject: Re: SILT - nice keyvalue store paper

I am not sure I would generally agree with this statement.
The Linux kernel can easily handle 10,000s of threads (you just need to keep the default stack size small).
(There were tests done with 1M threads too.)

Doing async IO is all the rage today (see node.js and friends), but that also is not necessarily done with a full
understanding of the performance characteristics.

Just my $0.02.


----- Original Message -----
From: Vladimir Rodionov <vr...@carrieriq.com>
To: "dev@hbase.apache.org" <de...@hbase.apache.org>
Cc:
Sent: Monday, October 24, 2011 11:43 AM
Subject: RE: SILT - nice keyvalue store paper

Jonathan, 1000 threads is bad application design. They will kill you even w/o contention. In my opinion, over-use of *modern* Java concurrency framework
machinery w/o a real understanding of its benefits is a major contributor to the poor performance and poor scalability of Java applications. Before making any decision
on using a fancy data structure which can potentially affect application performance, it is always a good idea to run some benchmarks.


Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: Jonathan Gray [jgray@fb.com]
Sent: Sunday, October 23, 2011 4:20 PM
To: dev@hbase.apache.org
Subject: RE: SILT - nice keyvalue store paper

Very nice experiment, Akash.  Keep getting your hands dirty and digging!  :)

I think your results might change if you bump the test up to 1000 threads or so.  100 threads can still perform okay when there's a global lock but the contention at 1000 threads will kill you and that's when CSLM should do much better.  (1000 handler threads is approx. what I run with on RS in prod).  Though I am a bit surprised that at 100 threads the TreeMap was significantly faster.  Your inconsistent results are a bit odd, you might try an order of magnitude more operations per thread.  You might also gather some statistics about tree size and per operation latency.

I've done some isolated CSLM benchmarks in the past and have never been able to reproduce any of the slowness people suggest.  I recall trying some impractically large MemStores and everything still being quite fast.

Over in Cassandra, I believe they have a two-level CSLM with the first map key being the row and then the columns for each row in their own CSLM.  I've been told this is somewhat of a pain point for them.  And keep in mind they have one shard/region per node and we generally have several smaller MemStores on each node (tens to thousands).  Not sure we would want to try that.  There could be some interesting optimizations if you had very specific issues, like if you had a ton of reads to MemStore and not many writes you could keep some kind of mirrored hashmap.

And for writes, the WAL is definitely the latency bottleneck.  But if you are doing lots of small operations, so your WALEdits are not large, and with some of the HLog batching features going in to trunk, you end up with hundreds of requests per HLog sync.  And although the syncs are higher latency, with batching you end up getting high throughput.  And the bottleneck shifts.

Each sync will take approx. 1-5ms, so let's say 250 requests per HLog sync batch, 4ms per sync, so 62.5k req/sec.  (62.5k req/sec * 100 bytes/req = 6.25 MB/sec, very reasonable).  If you're mixing in reads as well (or if you're doing increments which do a read and write), then this adds to the CPU usage and contention without adding to HLog throughput.

All of a sudden the bottleneck becomes CPU/contention and not HLog latency or throughput.  Highly concurrent increments/counters with a largely in-memory dataset can easily be CPU bottlenecked.

For one specific application Dhruba and I worked on, we made some good improvements in CPU efficiency by reducing the number of operations and increasing efficiency on the CSLM.  Doing things like always taking a tailMap and working from that instead of starting at the root node, using an iterator() and taking advantage of the available remove() semantics, or simply just mutating things that are normally immutable :)  Unfortunately many of these optimizations were semi-horrid hacks and introduced things like ModifiableKeyValues, so they all haven't made their way to apache.

In the end, after our optimizations, the real world workload Dhruba and I were working with was not all in-memory so the bottleneck in production became the random reads (so increasing the block cache hit ratio is the focus) rather than CPU contention or HLog throughput.

JG

From: Akash Ashok [mailto:thehellmaker@gmail.com]
Sent: Sunday, October 23, 2011 2:57 AM
To: dev@hbase.apache.org
Subject: Re: SILT - nice keyvalue store paper

I was running some similar tests and came across a surprising finding. I compared reads and writes on ConcurrentSkipListMap (which the memstore uses) and a synchronized TreeMap (which was literally a TreeMap wrapped in synchronization). I executed concurrent reads, writes and deletes on both of them.
Surprisingly, the synchronized TreeMap performed better, though just slightly, than the ConcurrentSkipListMap which KeyValueSkipListSet uses.

Here is the output of a few runs

Sometimes the difference was considerable
Using HBaseMap it took 20438ms
Using TreeMap it took 11613ms
Time Difference:8825ms

And sometimes the difference was negligible
Using HBaseMap it took 13370ms
Using TreeMap it took 9482ms
Time Difference:3888ms

I've attached the test Java file which I wrote.
This might be a very minor difference, but it is still surprising considering that ConcurrentSkipListMap uses fancy two-level indexes which they say improve deletion performance.

And here are the details about the test run.
100 threads each fetching 1,000,000 records
100 threads each adding 1,000,000 records.
100 threads each deleting 1,000,000 records
( Reads, Writes and deletes simultaneously )

Cheers,
Akash A
On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nk...@gmail.com> wrote:
> I would think that the bottleneck for insert is the wal part?
> It would be possible to do a kind of memory list preparation during the wal
> insertion, and if the wal insertion is confirmed, do the insertion in the
> memory list. But it's strange to have the insertion in memory important vs.
> the insertion on disk...
>
Yes, WAL is the long pole writing.  But MemStore has issues too;
Dhruba says cpu above.  Reading and writing it is also 'slow'.
St.Ack

Re: SILT - nice keyvalue store paper

Posted by lars hofhansl <lh...@yahoo.com>.
I am not sure I would generally agree with this statement.
The Linux kernel can easily handle 10,000s of threads (you just need to keep the default stack size small).
(There were tests done with 1M threads too.)

Doing async IO is all the rage today (see node.js and friends), but that also is not necessarily done with a full
understanding of the performance characteristics.

Just my $0.02.


----- Original Message -----
From: Vladimir Rodionov <vr...@carrieriq.com>
To: "dev@hbase.apache.org" <de...@hbase.apache.org>
Cc: 
Sent: Monday, October 24, 2011 11:43 AM
Subject: RE: SILT - nice keyvalue store paper

Jonathan, 1000 threads is bad application design. They will kill you even w/o contention. In my opinion, over-use of *modern* Java concurrency framework
machinery w/o a real understanding of its benefits is a major contributor to the poor performance and poor scalability of Java applications. Before making any decision
on using a fancy data structure which can potentially affect application performance, it is always a good idea to run some benchmarks.


Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: Jonathan Gray [jgray@fb.com]
Sent: Sunday, October 23, 2011 4:20 PM
To: dev@hbase.apache.org
Subject: RE: SILT - nice keyvalue store paper

Very nice experiment, Akash.  Keep getting your hands dirty and digging!  :)

I think your results might change if you bump the test up to 1000 threads or so.  100 threads can still perform okay when there's a global lock but the contention at 1000 threads will kill you and that's when CSLM should do much better.  (1000 handler threads is approx. what I run with on RS in prod).  Though I am a bit surprised that at 100 threads the TreeMap was significantly faster.  Your inconsistent results are a bit odd, you might try an order of magnitude more operations per thread.  You might also gather some statistics about tree size and per operation latency.

I've done some isolated CSLM benchmarks in the past and have never been able to reproduce any of the slowness people suggest.  I recall trying some impractically large MemStores and everything still being quite fast.

Over in Cassandra, I believe they have a two-level CSLM with the first map key being the row and then the columns for each row in their own CSLM.  I've been told this is somewhat of a pain point for them.  And keep in mind they have one shard/region per node and we generally have several smaller MemStores on each node (tens to thousands).  Not sure we would want to try that.  There could be some interesting optimizations if you had very specific issues, like if you had a ton of reads to MemStore and not many writes you could keep some kind of mirrored hashmap.

And for writes, the WAL is definitely the latency bottleneck.  But if you are doing lots of small operations, so your WALEdits are not large, and with some of the HLog batching features going in to trunk, you end up with hundreds of requests per HLog sync.  And although the syncs are higher latency, with batching you end up getting high throughput.  And the bottleneck shifts.

Each sync will take approx. 1-5ms, so let's say 250 requests per HLog sync batch, 4ms per sync, so 62.5k req/sec.  (62.5k req/sec * 100 bytes/req = 6.25 MB/sec, very reasonable).  If you're mixing in reads as well (or if you're doing increments which do a read and write), then this adds to the CPU usage and contention without adding to HLog throughput.

All of a sudden the bottleneck becomes CPU/contention and not HLog latency or throughput.  Highly concurrent increments/counters with a largely in-memory dataset can easily be CPU bottlenecked.

For one specific application Dhruba and I worked on, we made some good improvements in CPU efficiency by reducing the number of operations and increasing efficiency on the CSLM.  Doing things like always taking a tailMap and working from that instead of starting at the root node, using an iterator() and taking advantage of the available remove() semantics, or simply just mutating things that are normally immutable :)  Unfortunately many of these optimizations were semi-horrid hacks and introduced things like ModifiableKeyValues, so they all haven't made their way to apache.

In the end, after our optimizations, the real world workload Dhruba and I were working with was not all in-memory so the bottleneck in production became the random reads (so increasing the block cache hit ratio is the focus) rather than CPU contention or HLog throughput.

JG

From: Akash Ashok [mailto:thehellmaker@gmail.com]
Sent: Sunday, October 23, 2011 2:57 AM
To: dev@hbase.apache.org
Subject: Re: SILT - nice keyvalue store paper

I was running some similar tests and came across a surprising finding. I compared reads and writes on ConcurrentSkipListMap (which the memstore uses) and a synchronized TreeMap (which was literally a TreeMap wrapped in synchronization). I executed concurrent reads, writes and deletes on both of them.
Surprisingly, the synchronized TreeMap performed better, though just slightly, than the ConcurrentSkipListMap which KeyValueSkipListSet uses.

Here is the output of a few runs

Sometimes the difference was considerable
Using HBaseMap it took 20438ms
Using TreeMap it took 11613ms
Time Difference:8825ms

And sometimes the difference was negligible
Using HBaseMap it took 13370ms
Using TreeMap it took 9482ms
Time Difference:3888ms

I've attached the test Java file which I wrote.
This might be a very minor difference, but it is still surprising considering that ConcurrentSkipListMap uses fancy two-level indexes which they say improve deletion performance.

And here are the details about the test run.
100 threads each fetching 1,000,000 records
100 threads each adding 1,000,000 records.
100 threads each deleting 1,000,000 records
( Reads, Writes and deletes simultaneously )

Cheers,
Akash A
On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nk...@gmail.com> wrote:
> I would think that the bottleneck for insert is the wal part?
> It would be possible to do a kind of memory list preparation during the wal
> insertion, and if the wal insertion is confirmed, do the insertion in the
> memory list. But it's strange to have the insertion in memory important vs.
> the insertion on disk...
>
Yes, WAL is the long pole writing.  But MemStore has issues too;
Dhruba says cpu above.  Reading and writing it is also 'slow'.
St.Ack

RE: SILT - nice keyvalue store paper

Posted by Vladimir Rodionov <vr...@carrieriq.com>.
Jonathan, 1000 threads is bad application design. They will kill you even w/o contention. In my opinion, over-use of *modern* Java concurrency framework
machinery w/o a real understanding of its benefits is a major contributor to the poor performance and poor scalability of Java applications. Before making any decision
on using a fancy data structure which can potentially affect application performance, it is always a good idea to run some benchmarks.


Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: Jonathan Gray [jgray@fb.com]
Sent: Sunday, October 23, 2011 4:20 PM
To: dev@hbase.apache.org
Subject: RE: SILT - nice keyvalue store paper

Very nice experiment, Akash.  Keep getting your hands dirty and digging!  :)

I think your results might change if you bump the test up to 1000 threads or so.  100 threads can still perform okay when there's a global lock but the contention at 1000 threads will kill you and that's when CSLM should do much better.  (1000 handler threads is approx. what I run with on RS in prod).  Though I am a bit surprised that at 100 threads the TreeMap was significantly faster.  Your inconsistent results are a bit odd, you might try an order of magnitude more operations per thread.  You might also gather some statistics about tree size and per operation latency.

I've done some isolated CSLM benchmarks in the past and have never been able to reproduce any of the slowness people suggest.  I recall trying some impractically large MemStores and everything still being quite fast.

Over in Cassandra, I believe they have a two-level CSLM with the first map key being the row and then the columns for each row in their own CSLM.  I've been told this is somewhat of a pain point for them.  And keep in mind they have one shard/region per node and we generally have several smaller MemStores on each node (tens to thousands).  Not sure we would want to try that.  There could be some interesting optimizations if you had very specific issues, like if you had a ton of reads to MemStore and not many writes you could keep some kind of mirrored hashmap.

And for writes, the WAL is definitely the latency bottleneck.  But if you are doing lots of small operations, so your WALEdits are not large, and with some of the HLog batching features going in to trunk, you end up with hundreds of requests per HLog sync.  And although the syncs are higher latency, with batching you end up getting high throughput.  And the bottleneck shifts.

Each sync will take approx. 1-5ms, so let's say 250 requests per HLog sync batch, 4ms per sync, so 62.5k req/sec.  (62.5k req/sec * 100 bytes/req = 6.25 MB/sec, very reasonable).  If you're mixing in reads as well (or if you're doing increments which do a read and write), then this adds to the CPU usage and contention without adding to HLog throughput.

All of a sudden the bottleneck becomes CPU/contention and not HLog latency or throughput.  Highly concurrent increments/counters with a largely in-memory dataset can easily be CPU bottlenecked.

For one specific application Dhruba and I worked on, we made some good improvements in CPU efficiency by reducing the number of operations and increasing efficiency on the CSLM.  Doing things like always taking a tailMap and working from that instead of starting at the root node, using an iterator() and taking advantage of the available remove() semantics, or simply just mutating things that are normally immutable :)  Unfortunately many of these optimizations were semi-horrid hacks and introduced things like ModifiableKeyValues, so they all haven't made their way to apache.

In the end, after our optimizations, the real world workload Dhruba and I were working with was not all in-memory so the bottleneck in production became the random reads (so increasing the block cache hit ratio is the focus) rather than CPU contention or HLog throughput.

JG

From: Akash Ashok [mailto:thehellmaker@gmail.com]
Sent: Sunday, October 23, 2011 2:57 AM
To: dev@hbase.apache.org
Subject: Re: SILT - nice keyvalue store paper

I was running some similar tests and came across a surprising finding. I compared reads and writes on ConcurrentSkipListMap (which the memstore uses) and a synchronized TreeMap (which was literally a TreeMap wrapped in synchronization). I executed concurrent reads, writes and deletes on both of them.
Surprisingly, the synchronized TreeMap performed better, though just slightly, than the ConcurrentSkipListMap which KeyValueSkipListSet uses.

Here is the output of a few runs

Sometimes the difference was considerable
Using HBaseMap it took 20438ms
Using TreeMap it took 11613ms
Time Difference:8825ms

And sometimes the difference was negligible
Using HBaseMap it took 13370ms
Using TreeMap it took 9482ms
Time Difference:3888ms

I've attached the test Java file which I wrote.
This might be a very minor difference, but it is still surprising considering that ConcurrentSkipListMap uses fancy two-level indexes which they say improve deletion performance.

And here are the details about the test run.
100 threads each fetching 1,000,000 records
100 threads each adding 1,000,000 records.
100 threads each deleting 1,000,000 records
( Reads, Writes and deletes simultaneously )

Cheers,
Akash A
On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nk...@gmail.com> wrote:
> I would think that the bottleneck for insert is the wal part?
> It would be possible to do a kind of memory list preparation during the wal
> insertion, and if the wal insertion is confirmed, do the insertion in the
> memory list. But it's strange to have the insertion in memory important vs.
> the insertion on disk...
>
Yes, WAL is the long pole writing.  But MemStore has issues too;
Dhruba says cpu above.  Reading and writing it is also 'slow'.
St.Ack


Re: SILT - nice keyvalue store paper

Posted by Mi...@emc.com.
A 3-year-old article on developerWorks,
http://www.ibm.com/developerworks/java/library/j-benchmark1/index.html, is
very useful for benchmarking Java code.

- milind

---
Milind Bhandarkar
Greenplum Labs, EMC
(Disclaimer: Opinions expressed in this email are those of the author, and
do not necessarily represent the views of any organization, past or
present, the author might be affiliated with.)



On 10/23/11 5:05 PM, "Jonathan Gray" <jg...@fb.com> wrote:

>Oh, and when running these experiments, you should look at the impact at
>which order they are run in, whether you run them multiple times per JVM
>instance, etc.  Basically, you need to be cognizant of the HotSpot
>optimizations the JVM is doing at runtime.
>
>> -----Original Message-----
>> From: Jonathan Gray [mailto:jgray@fb.com]
>> Sent: Sunday, October 23, 2011 4:20 PM
>> To: dev@hbase.apache.org
>> Subject: RE: SILT - nice keyvalue store paper
>> 
>> Very nice experiment, Akash.  Keep getting your hands dirty and
>>digging!  :)
>> 
>> I think your results might change if you bump the test up to 1000
>>threads or
>> so.  100 threads can still perform okay when there's a global lock but
>>the
>> contention at 1000 threads will kill you and that's when CSLM should do
>>much
>> better.  (1000 handler threads is approx. what I run with on RS in
>>prod).
>> Though I am a bit surprised that at 100 threads the TreeMap was
>>significantly
>> faster.  Your inconsistent results are a bit odd, you might try an
>>order of
>> magnitude more operations per thread.  You might also gather some
>> statistics about tree size and per operation latency.
>> 
>> I've done some isolated CSLM benchmarks in the past and have never been
>> able to reproduce any of the slowness people suggest.  I recall trying
>>some
>> impractically large MemStores and everything still being quite fast.
>> 
>> Over in Cassandra, I believe they have a two-level CSLM with the first
>>map
>> key being the row and then the columns for each row in their own CSLM.
>> I've been told this is somewhat of a pain point for them.  And keep in
>>mind
>> they have one shard/region per node and we generally have several
>>smaller
>> MemStores on each node (tens to thousands).  Not sure we would want to
>> try that.  There could be some interesting optimizations if you had very
>> specific issues, like if you had a ton of reads to MemStore and not many
>> writes you could keep some kind of mirrored hashmap.
>> 
>> And for writes, the WAL is definitely the latency bottleneck.  But if
>>you are
>> doing lots of small operations, so your WALEdits are not large, and
>>with some
>> of the HLog batching features going in to trunk, you end up with
>>hundreds of
>> requests per HLog sync.  And although the syncs are higher latency, with
>> batching you end up getting high throughput.  And the bottleneck shifts.
>> 
>> Each sync will take approx. 1-5ms, so let's say 250 requests per HLog
>>sync
>> batch, 4ms per sync, so 62.5k req/sec.  (62.5k req/sec * 100 bytes/req = 6.25 MB/sec,
>> very reasonable).  If you're mixing in reads as well (or if you're doing
>> increments which do a read and write), then this adds to the CPU usage
>>and
>> contention without adding to HLog throughput.
>> 
>> All of a sudden the bottleneck becomes CPU/contention and not HLog
>> latency or throughput.  Highly concurrent increments/counters with a
>>largely
>> in-memory dataset can easily be CPU bottlenecked.
>> 
>> For one specific application Dhruba and I worked on, we made some good
>> improvements in CPU efficiency by reducing the number of operations and
>> increasing efficiency on the CSLM.  Doing things like always taking a
>>tailMap
>> and working from that instead of starting at the root node, using an
>>iterator()
>> and taking advantage of the available remove() semantics, or simply just
>> mutating things that are normally immutable :)  Unfortunately many of
>>these
>> optimizations were semi-horrid hacks and introduced things like
>> ModifiableKeyValues, so they all haven't made their way to apache.
>> 
>> In the end, after our optimizations, the real world workload Dhruba and
>>I
>> were working with was not all in-memory so the bottleneck in production
>> became the random reads (so increasing the block cache hit ratio is the
>> focus) rather than CPU contention or HLog throughput.
>> 
>> JG
>> 
>> From: Akash Ashok [mailto:thehellmaker@gmail.com]
>> Sent: Sunday, October 23, 2011 2:57 AM
>> To: dev@hbase.apache.org
>> Subject: Re: SILT - nice keyvalue store paper
>> 
>> I was running some similar tests and came across a surprising finding. I
>> compared reads and writes on ConcurrentSkipListMap (which the memstore
>> uses) and a synchronized TreeMap (which was literally a TreeMap wrapped in
>> synchronization). I executed concurrent reads, writes and deletes on both of
>> them.
>> Surprisingly, the synchronized TreeMap performed better, though just
>> slightly, than the ConcurrentSkipListMap which KeyValueSkipListSet uses.
>>
>> Here is the output of a few runs
>> 
>> Sometimes the difference was considerable Using HBaseMap it took
>> 20438ms Using TreeMap it took 11613ms Time Difference:8825ms
>> 
>> And sometimes the difference was negligible Using HBaseMap it took
>> 13370ms Using TreeMap it took 9482ms Time Difference:3888ms
>> 
>> I've attached the test Java file which I wrote.
>> This might be a very minor difference, but it is still surprising considering
>> that ConcurrentSkipListMap uses fancy two-level indexes which they say
>> improve deletion performance.
>>
>> And here are the details about the test run.
>> 100 threads each fetching 1,000,000 records
>> 100 threads each adding 1,000,000 records.
>> 100 threads each deleting 1,000,000 records ( Reads, Writes and deletes
>> simultaneously )
>> 
>> Cheers,
>> Akash A
>> On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
>> On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nk...@gmail.com> wrote:
>> > I would think that the bottleneck for insert is the wal part?
>> > It would be possible to do a kind of memory list preparation during
>> > the wal insertion, and if the wal insertion is confirmed, do the
>> > insertion in the memory list. But it's strange to have the insertion
>>in
>> memory important vs.
>> > the insertion on disk...
>> >
>> Yes, WAL is the long pole writing.  But MemStore has issues too; Dhruba
>>says
>> cpu above.  Reading and writing it is also 'slow'.
>> St.Ack
>
>


RE: SILT - nice keyvalue store paper

Posted by Jonathan Gray <jg...@fb.com>.
Oh, and when running these experiments, you should look at the impact of the order in which they are run, whether you run them multiple times per JVM instance, etc.  Basically, you need to be cognizant of the HotSpot optimizations the JVM is doing at runtime.
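
As a minimal illustration of that point for a hand-rolled benchmark (hypothetical code, not from the attached test): discard a few warm-up passes so the JIT has compiled the hot paths before timing starts, and keep a running result so the work is not optimized away.

    public class WarmupSketch {
        // Stand-in for the map operations being measured.
        static long work() {
            long sum = 0;
            for (int i = 0; i < 1000000; i++) {
                sum += i;
            }
            return sum;
        }

        public static void main(String[] args) {
            long blackhole = 0;
            for (int i = 0; i < 5; i++) {
                blackhole += work();                 // warm-up passes, timings discarded
            }
            int iterations = 20;
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                blackhole += work();                 // measured passes
            }
            System.out.println((System.nanoTime() - start) / iterations + " ns/iteration");
            System.out.println("ignore: " + blackhole); // keeps the JIT from eliminating work()
        }
    }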

> -----Original Message-----
> From: Jonathan Gray [mailto:jgray@fb.com]
> Sent: Sunday, October 23, 2011 4:20 PM
> To: dev@hbase.apache.org
> Subject: RE: SILT - nice keyvalue store paper
> 
> Very nice experiment, Akash.  Keep getting your hands dirty and digging!  :)
> 
> I think your results might change if you bump the test up to 1000 threads or
> so.  100 threads can still perform okay when there's a global lock but the
> contention at 1000 threads will kill you and that's when CSLM should do much
> better.  (1000 handler threads is approx. what I run with on RS in prod).
> Though I am a bit surprised that at 100 threads the TreeMap was significantly
> faster.  Your inconsistent results are a bit odd, you might try an order of
> magnitude more operations per thread.  You might also gather some
> statistics about tree size and per operation latency.
> 
> I've done some isolated CSLM benchmarks in the past and have never been
> able to reproduce any of the slowness people suggest.  I recall trying some
> impractically large MemStores and everything still being quite fast.
> 
> Over in Cassandra, I believe they have a two-level CSLM with the first map
> key being the row and then the columns for each row in their own CSLM.
> I've been told this is somewhat of a pain point for them.  And keep in mind
> they have one shard/region per node and we generally have several smaller
> MemStores on each node (tens to thousands).  Not sure we would want to
> try that.  There could be some interesting optimizations if you had very
> specific issues, like if you had a ton of reads to MemStore and not many
> writes you could keep some kind of mirrored hashmap.
> 
> And for writes, the WAL is definitely the latency bottleneck.  But if you are
> doing lots of small operations, so your WALEdits are not large, and with some
> of the HLog batching features going in to trunk, you end up with hundreds of
> requests per HLog sync.  And although the syncs are higher latency, with
> batching you end up getting high throughput.  And the bottleneck shifts.
> 
> Each sync will take approx. 1-5ms, so let's say 250 requests per HLog sync
> batch, 4ms per sync, so 62.5k req/sec.  (62.5k req/sec * 100 bytes/req = 6.25 MB/sec,
> very reasonable).  If you're mixing in reads as well (or if you're doing
> increments which do a read and write), then this adds to the CPU usage and
> contention without adding to HLog throughput.
> 
> All of a sudden the bottleneck becomes CPU/contention and not HLog
> latency or throughput.  Highly concurrent increments/counters with a largely
> in-memory dataset can easily be CPU bottlenecked.
> 
> For one specific application Dhruba and I worked on, we made some good
> improvements in CPU efficiency by reducing the number of operations and
> increasing efficiency on the CSLM.  Doing things like always taking a tailMap
> and working from that instead of starting at the root node, using an iterator()
> and taking advantage of the available remove() semantics, or simply just
> mutating things that are normally immutable :)  Unfortunately many of these
> optimizations were semi-horrid hacks and introduced things like
> ModifiableKeyValues, so they all haven't made their way to apache.
> 
> In the end, after our optimizations, the real world workload Dhruba and I
> were working with was not all in-memory so the bottleneck in production
> became the random reads (so increasing the block cache hit ratio is the
> focus) rather than CPU contention or HLog throughput.
> 
> JG
> 
> From: Akash Ashok [mailto:thehellmaker@gmail.com]
> Sent: Sunday, October 23, 2011 2:57 AM
> To: dev@hbase.apache.org
> Subject: Re: SILT - nice keyvalue store paper
> 
> I was running some similar tests and came across a surprising finding. I
> compared reads and writes on ConcurrentSkipListMap (which the memstore
> uses) and a synchronized TreeMap (which was literally a TreeMap wrapped in
> synchronization). I executed concurrent reads, writes and deletes on both of
> them.
> Surprisingly, the synchronized TreeMap performed better, though just
> slightly, than the ConcurrentSkipListMap which KeyValueSkipListSet uses.
>
> Here is the output of a few runs
> 
> Sometimes the difference was considerable Using HBaseMap it took
> 20438ms Using TreeMap it took 11613ms Time Difference:8825ms
> 
> And sometimes the difference was negligible Using HBaseMap it took
> 13370ms Using TreeMap it took 9482ms Time Difference:3888ms
> 
> I've attached the test Java file which I wrote.
> This might be a very minor difference, but it is still surprising considering
> that ConcurrentSkipListMap uses fancy two-level indexes which they say
> improve deletion performance.
>
> And here are the details about the test run.
> 100 threads each fetching 1,000,000 records
> 100 threads each adding 1,000,000 records.
> 100 threads each deleting 1,000,000 records ( Reads, Writes and deletes
> simultaneously )
> 
> Cheers,
> Akash A
> On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
> On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nk...@gmail.com> wrote:
> > I would think that the bottleneck for insert is the wal part?
> > It would be possible to do a kind of memory list preparation during
> > the wal insertion, and if the wal insertion is confirmed, do the
> > insertion in the memory list. But it's strange to have the insertion in
> memory important vs.
> > the insertion on disk...
> >
> Yes, WAL is the long pole writing.  But MemStore has issues too; Dhruba says
> cpu above.  Reading and writing it is also 'slow'.
> St.Ack


RE: SILT - nice keyvalue store paper

Posted by Jonathan Gray <jg...@fb.com>.
Very nice experiment, Akash.  Keep getting your hands dirty and digging!  :)

I think your results might change if you bump the test up to 1000 threads or so.  100 threads can still perform okay when there's a global lock but the contention at 1000 threads will kill you and that's when CSLM should do much better.  (1000 handler threads is approx. what I run with on RS in prod).  Though I am a bit surprised that at 100 threads the TreeMap was significantly faster.  Your inconsistent results are a bit odd, you might try an order of magnitude more operations per thread.  You might also gather some statistics about tree size and per operation latency.

I've done some isolated CSLM benchmarks in the past and have never been able to reproduce any of the slowness people suggest.  I recall trying some impractically large MemStores and everything still being quite fast.

Over in Cassandra, I believe they have a two-level CSLM with the first map key being the row and then the columns for each row in their own CSLM.  I've been told this is somewhat of a pain point for them.  And keep in mind they have one shard/region per node and we generally have several smaller MemStores on each node (tens to thousands).  Not sure we would want to try that.  There could be some interesting optimizations if you had very specific issues, like if you had a ton of reads to MemStore and not many writes you could keep some kind of mirrored hashmap.
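
As a rough sketch of that two-level layout (an editorial illustration with String keys, not code from either project):

    import java.util.concurrent.ConcurrentNavigableMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    public class TwoLevelMemstoreSketch {
        // Outer map: row -> that row's columns, each row held in its own skip list.
        private final ConcurrentNavigableMap<String, ConcurrentNavigableMap<String, byte[]>> rows =
            new ConcurrentSkipListMap<String, ConcurrentNavigableMap<String, byte[]>>();

        public void put(String row, String column, byte[] value) {
            ConcurrentNavigableMap<String, byte[]> columns = rows.get(row);
            if (columns == null) {
                ConcurrentNavigableMap<String, byte[]> fresh =
                    new ConcurrentSkipListMap<String, byte[]>();
                ConcurrentNavigableMap<String, byte[]> existing = rows.putIfAbsent(row, fresh);
                columns = (existing == null) ? fresh : existing;
            }
            columns.put(column, value);
        }

        public byte[] get(String row, String column) {
            ConcurrentNavigableMap<String, byte[]> columns = rows.get(row);
            return (columns == null) ? null : columns.get(column);
        }
    }

Single-row reads then only touch that row's inner map, but a full scan has to walk the outer map and every inner map in turn, which is presumably part of the pain point mentioned.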

And for writes, the WAL is definitely the latency bottleneck.  But if you are doing lots of small operations, so your WALEdits are not large, and with some of the HLog batching features going in to trunk, you end up with hundreds of requests per HLog sync.  And although the syncs are higher latency, with batching you end up getting high throughput.  And the bottleneck shifts.

Each sync will take approx. 1-5ms, so let's say 250 requests per HLog sync batch, 4ms per sync, so 62.5k req/sec.  (62.5k req/sec * 100 bytes/req = 6.25 MB/sec, very reasonable).  If you're mixing in reads as well (or if you're doing increments which do a read and write), then this adds to the CPU usage and contention without adding to HLog throughput.

All of a sudden the bottleneck becomes CPU/contention and not HLog latency or throughput.  Highly concurrent increments/counters with a largely in-memory dataset can easily be CPU bottlenecked.

For one specific application Dhruba and I worked on, we made some good improvements in CPU efficiency by reducing the number of operations and increasing efficiency on the CSLM.  Doing things like always taking a tailMap and working from that instead of starting at the root node, using an iterator() and taking advantage of the available remove() semantics, or simply just mutating things that are normally immutable :)  Unfortunately many of these optimizations were semi-horrid hacks and introduced things like ModifiableKeyValues, so they all haven't made their way to apache.
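
A rough sketch of that access pattern (resuming from a tailMap instead of re-seeking from the root, and deleting through the iterator); the Long keys here are placeholders, not the ModifiableKeyValue hacks described above:

    import java.util.Iterator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentNavigableMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    public class TailMapScanSketch {
        public static void main(String[] args) {
            ConcurrentNavigableMap<Long, String> memstore =
                new ConcurrentSkipListMap<Long, String>();
            for (long ts = 0; ts < 1000; ts++) {
                memstore.put(ts, "edit-" + ts);
            }

            // Resume from the last key seen instead of searching again from the root.
            long lastSeen = 500;
            ConcurrentNavigableMap<Long, String> tail = memstore.tailMap(lastSeen, false);

            // Walk forward with an iterator and drop entries via its remove(),
            // instead of issuing a separate remove(key) lookup for each deletion.
            Iterator<Map.Entry<Long, String>> it = tail.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<Long, String> e = it.next();
                if (e.getKey() % 2 == 0) {
                    it.remove();
                }
            }
            System.out.println("remaining entries: " + memstore.size());
        }
    }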

In the end, after our optimizations, the real world workload Dhruba and I were working with was not all in-memory so the bottleneck in production became the random reads (so increasing the block cache hit ratio is the focus) rather than CPU contention or HLog throughput.

JG

From: Akash Ashok [mailto:thehellmaker@gmail.com]
Sent: Sunday, October 23, 2011 2:57 AM
To: dev@hbase.apache.org
Subject: Re: SILT - nice keyvalue store paper

I was running some similar tests and came across a surprising finding. I compared reads and writes on ConcurrentSkipListMap (which the memstore uses) and a synchronized TreeMap (which was literally a TreeMap wrapped in synchronization). I executed concurrent reads, writes and deletes on both of them.
Surprisingly, the synchronized TreeMap performed better, though just slightly, than the ConcurrentSkipListMap which KeyValueSkipListSet uses.

Here is the output of a few runs

Sometimes the difference was considerable
Using HBaseMap it took 20438ms
Using TreeMap it took 11613ms
Time Difference:8825ms

And sometimes the difference was negligible
Using HBaseMap it took 13370ms
Using TreeMap it took 9482ms
Time Difference:3888ms

I've attached the test Java file which I wrote.
This might be a very minor difference, but it is still surprising considering that ConcurrentSkipListMap uses fancy two-level indexes which they say improve deletion performance.

And here are the details about the test run.
100 threads each fetching 1,000,000 records
100 threads each adding 1,000,000 records.
100 threads each deleting 1,000,000 records
( Reads, Writes and deletes simultaneously )

Cheers,
Akash A
On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nk...@gmail.com> wrote:
> I would think that the bottleneck for insert is the wal part?
> It would be possible to do a kind of memory list preparation during the wal
> insertion, and if the wal insertion is confirmed, do the insertion in the
> memory list. But it's strange to have the insertion in memory important vs.
> the insertion on disk...
>
Yes, WAL is the long pole writing.  But MemStore has issues too;
Dhruba says cpu above.  Reading and writing it is also 'slow'.
St.Ack


Re: SILT - nice keyvalue store paper

Posted by Akash Ashok <th...@gmail.com>.
I was running some similar tests and came across a surprising finding. I
compared reads and writes on ConcurrentSkipListMap (which the memstore uses)
and a synchronized TreeMap (which was literally a TreeMap wrapped in
synchronization). I executed concurrent reads, writes and deletes on both of them.
Surprisingly, the synchronized TreeMap performed better, though just
slightly, than the ConcurrentSkipListMap which KeyValueSkipListSet uses.

Here is the output of a few runs

Sometimes the difference was considerable
Using HBaseMap it took 20438ms
Using TreeMap it took 11613ms
Time Difference:8825ms

And sometimes the difference was negligible
Using HBaseMap it took 13370ms
Using TreeMap it took 9482ms
Time Difference:3888ms

I've attached the test Java file which I wrote.
This might be a very minor difference, but it is still surprising considering
that ConcurrentSkipListMap uses fancy two-level indexes which they say
improve deletion performance.

And here are the details about the test run.
100 threads each fetching 1,000,000 records
100 threads each adding 1,000,000 records.
100 threads each deleting 1,000,000 records
( Reads, Writes and deletes simultaneously )
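
The attached test file is not preserved in this archive; a minimal sketch of that kind of comparison (hypothetical class name, counts scaled down, uniformly random keys) might look like:

    import java.util.Collections;
    import java.util.Map;
    import java.util.Random;
    import java.util.TreeMap;
    import java.util.concurrent.ConcurrentSkipListMap;
    import java.util.concurrent.CountDownLatch;

    public class MapContentionSketch {
        static final int THREADS = 100;          // per operation type
        static final int OPS_PER_THREAD = 10000;

        static void startWorker(final Map<Long, Long> map, final CountDownLatch done, final int mode) {
            new Thread() {
                public void run() {
                    Random rnd = new Random();
                    for (int i = 0; i < OPS_PER_THREAD; i++) {
                        long key = rnd.nextInt(1000000);
                        if (mode == 0) {
                            map.get(key);        // reader
                        } else if (mode == 1) {
                            map.put(key, key);   // writer
                        } else {
                            map.remove(key);     // deleter
                        }
                    }
                    done.countDown();
                }
            }.start();
        }

        static long run(Map<Long, Long> map) throws InterruptedException {
            CountDownLatch done = new CountDownLatch(THREADS * 3);
            long start = System.currentTimeMillis();
            for (int t = 0; t < THREADS; t++) {
                startWorker(map, done, 0);
                startWorker(map, done, 1);
                startWorker(map, done, 2);
            }
            done.await();
            return System.currentTimeMillis() - start;
        }

        public static void main(String[] args) throws InterruptedException {
            long cslm = run(new ConcurrentSkipListMap<Long, Long>());
            long sync = run(Collections.synchronizedMap(new TreeMap<Long, Long>()));
            System.out.println("CSLM: " + cslm + "ms, synchronized TreeMap: " + sync + "ms");
        }
    }

Note this measures a uniformly random mix; the relative numbers can shift a lot with key distribution, value sizes, and the JVM warm-up effects Jonathan mentions elsewhere in the thread.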

Cheers,
Akash A

On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:

> On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nk...@gmail.com> wrote:
> > I would think that the bottleneck for insert is the wal part?
> > It would be possible to do a kind of memory list preparation during the
> wal
> > insertion, and if the wal insertion is confirmed, do the insertion in the
> > memory list. But it's strange to have the insertion in memory important
> vs.
> > the insertion on disk...
> >
>
> Yes, WAL is the long pole writing.  But MemStore has issues too;
> Dhruba says cpu above.  Reading and writing it is also 'slow'.
> St.Ack
>

Re: SILT - nice keyvalue store paper

Posted by Stack <st...@duboce.net>.
On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nk...@gmail.com> wrote:
> I would think that the bottleneck for insert is the wal part?
> It would be possible to do a kind of memory list preparation during the wal
> insertion, and if the wal insertion is confirmed, do the insertion in the
> memory list. But it's strange to have the insertion in memory important vs.
> the insertion on disk...
>

Yes, WAL is the long pole writing.  But MemStore has issues too;
Dhruba says cpu above.  Reading and writing it is also 'slow'.
St.Ack

Re: SILT - nice keyvalue store paper

Posted by N Keywal <nk...@gmail.com>.
Hi,

On Sat, Oct 22, 2011 at 11:33 PM, Stack <st...@duboce.net> wrote:

> On Sat, Oct 22, 2011 at 1:17 AM, Dhruba Borthakur <dh...@gmail.com>
> wrote:
>
> How would you scan a HashSet in order?  Deletes across spans
> (families or all older than a particular timestamp)?
>
> On the other hand, I did have a chat w/ the LMAX folks recently.  They
> had made a point earlier in the day that java Collections and
> Concurrent are well due an overhaul (as an aside in a talk whose
> general thrust was revisit all assumptions especially atop modern
> hardware).  I was asking what the underpinnings of a modern Collection
> might look like and in particular described our issue with
> ConcurrentSkipListMap.  One of the boys threw out the notion of
> catching the inserts in their Disruptor data structure and then
> sorting the data structure.
>

I would think that the bottleneck for insert is the WAL part?
It would be possible to do a kind of memory-list preparation during the WAL
insertion, and if the WAL insertion is confirmed, do the insertion in the
memory list. But it's strange for the in-memory insertion to matter compared with
the insertion on disk...

Re: SILT - nice keyvalue store paper

Posted by Stack <st...@duboce.net>.
On Sat, Oct 22, 2011 at 1:17 AM, Dhruba Borthakur <dh...@gmail.com> wrote:
> One of the current problems with hbase eating lots of cpu is the fact that
> the memstore is a sortedset. Instead, we can make it a hashset so that
> lookup and insertions are much faster. At the time of flushing, we can sort
> the snapshot memstore and write it out to hdfs. This will decrease latencies
> of Puts to a great extent. I will experiment on how this will fare with
> real-life workload.
>

How would you scan a HashSet in order?  Deletes across spans
(families or all older than a particular timestamp)?

On the other hand, I did have a chat w/ the LMAX folks recently.  They
had made a point earlier in the day that java Collections and
Concurrent are well due an overhaul (as an aside in a talk whose
general thrust was revisit all assumptions especially atop modern
hardware).  I was asking what the underpinnings of a modern Collection
might look like and in particular described our issue with
ConcurrentSkipListMap.  One of the boys threw out the notion of
catching the inserts in their Disruptor data structure and then
sorting the data structure.  Seemed a bit of a silly suggestion at the
time but perhaps if we added MVCC to the mix and the read point moved
on after the completion of a sort.  We'd be juggling lots of sorted
lists in memory....
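
Very roughly, and purely to illustrate the shape of that idea (no Disruptor, and MVCC reduced to a batch counter), it would mean appending edits unsorted and only exposing them to readers once a whole batch has been sorted:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class SortOnPublishSketch {
        private final List<String> unsorted = new ArrayList<String>();          // pending inserts
        private final List<List<String>> sortedBatches = new ArrayList<List<String>>();
        private int readPoint = 0;                                              // batches visible to readers

        public synchronized void append(String edit) {
            unsorted.add(edit);
        }

        // Sort the pending edits as one batch and move the read point past it.
        public synchronized void publishBatch() {
            List<String> batch = new ArrayList<String>(unsorted);
            unsorted.clear();
            Collections.sort(batch);
            sortedBatches.add(batch);
            readPoint = sortedBatches.size();
        }

        // Readers only ever see fully sorted, published batches.
        public synchronized List<List<String>> snapshotForRead() {
            return new ArrayList<List<String>>(sortedBatches.subList(0, readPoint));
        }
    }

A real reader would merge-scan those sorted batches, which is exactly the "juggling lots of sorted lists in memory" part.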

> I have also been playing around with a 5 node test cluster that has
> flashdrives. The flash drives are mounted as xfs filesystems.
> A LookasideCacheFileSystem (http://bit.ly/pnGju0) that is a client side
> layered filter driver on top of hdfs. When HBase flushes data to HDFS, it is
> cached transparently in the LookasideCacheFileSystem.
> The LookasideCacheFileSystem uses the flash drive as a cache. The assumption
> here is that recently flushed hfiles are more likely to be accessed than the
> data in HFiles that were flushed earlier (not yet messed with major
> compactions). I will be measuring the performance benefit of this
> configuration.
>

That sounds sweet Dhruba.
St.Ack

Re: SILT - nice keyvalue store paper

Posted by Mi...@emc.com.
Dhruba,

>I have also been playing around with a 5 node test cluster that has
>flashdrives. The flash drives are mounted as xfs filesystems.
>A LookasideCacheFileSystem (http://bit.ly/pnGju0) that is a client side
>layered filter driver on top of hdfs. When HBase flushes data to HDFS, it
>is
>cached transparently in the LookasideCacheFileSystem.
>The LookasideCacheFileSystem uses the flash drive as a cache. The
>assumption
>here is that recently flushed hfiles are more likely to be accessed than
>the
>data in HFiles that were flushed earlier (not yet messed with major
>compactions). I will be measuring the performance benefit of this
>configuration.

I am extremely interested in the performance results of your experiment.

Please keep me posted, thanks.

As a side note, is there a Jira to incorporate LookasideCacheFileSystem in
Hadoop?

- milind

---
Milind Bhandarkar
Greenplum Labs, EMC
(Disclaimer: Opinions expressed in this email are those of the author, and
do not necessarily represent the views of any organization, past or
present, the author might be affiliated with.)


Re: SILT - nice keyvalue store paper

Posted by Dhruba Borthakur <dh...@gmail.com>.
Hi Todd,

Thanks for forwarding this paper. I have been mulling similar ideas in my
head for some time.

One of the current problems with HBase eating lots of CPU is the fact that
the memstore is a sorted set. Instead, we can make it a hash set so that
lookups and insertions are much faster. At flush time, we can sort
the snapshot memstore and write it out to HDFS. This will decrease the latency
of Puts to a great extent. I will experiment with how this fares with a
real-life workload.
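
A minimal sketch of that idea (hash map on the write path, sorting only when the flush snapshot is taken; this glosses over the races a real memstore snapshot has to handle):

    import java.util.TreeMap;
    import java.util.concurrent.ConcurrentHashMap;

    public class HashMemstoreSketch {
        private volatile ConcurrentHashMap<String, byte[]> active =
            new ConcurrentHashMap<String, byte[]>();

        public void put(String key, byte[] value) {
            active.put(key, value);        // no comparator work on the write path
        }

        public byte[] get(String key) {
            return active.get(key);        // point lookups stay cheap; ordered scans need the sort below
        }

        // Swap in a fresh map and sort the old one once, on its way out to HFiles.
        public TreeMap<String, byte[]> snapshotAndSort() {
            ConcurrentHashMap<String, byte[]> snapshot = active;
            active = new ConcurrentHashMap<String, byte[]>();
            return new TreeMap<String, byte[]>(snapshot);   // sorted view to write out
        }
    }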

I have also been playing around with a 5-node test cluster that has
flash drives. The flash drives are mounted as xfs filesystems.
A LookasideCacheFileSystem (http://bit.ly/pnGju0) is a client-side
layered filter driver on top of HDFS. When HBase flushes data to HDFS, it is
cached transparently in the LookasideCacheFileSystem.
The LookasideCacheFileSystem uses the flash drive as a cache. The assumption
here is that recently flushed hfiles are more likely to be accessed than the
data in HFiles that were flushed earlier (not yet messed with major
compactions). I will be measuring the performance benefit of this
configuration.

thanks,
dhruba


On Wed, Oct 12, 2011 at 10:42 PM, Todd Lipcon <to...@cloudera.com> wrote:

> Read this paper last night and thought it had some nice ideas that
> would be applicable to HBase:
>
> http://www.cs.cmu.edu/~dga/papers/silt-sosp2011.pdf
>
> I especially like the highly compact indexing scheme (entropy-coded tries)
>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Subscribe to my posts at http://www.facebook.com/dhruba