Posted to user@cassandra.apache.org by Erik Forsberg <fo...@opera.com> on 2014/08/18 15:21:23 UTC

LeveledCompaction, streaming bulkload, and lots of small sstables

Hi!

I'm bulkloading via streaming from Hadoop to my Cassandra cluster. This
results in a rather large set of relatively small (~1MiB) sstables, as
the number of mappers that generate sstables on the Hadoop cluster is high.

With SizeTieredCompactionStrategy, the Cassandra cluster would quickly
compact all these small sstables into decently sized sstables.

With LeveledCompactionStrategy however, it takes a much longer time. I
have multithreaded_compaction: true, but it is only taking on 32
sstables at a time in one single compaction task, so when it starts with
~1500 sstables, it takes quite some time. I'm not running out of I/O.
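
As a rough back-of-the-envelope sketch of why this takes a while: if each
compaction task merges at most 32 sstables into one (the figure observed
above), the number of tasks needed to collapse ~1500 small sstables can be
counted as below. The function name and the assumption that every task
produces exactly one output sstable are illustrative simplifications, not a
description of how Cassandra actually schedules compactions.

```python
import math
from typing import List


def count_merge_tasks(num_sstables: int, fan_in: int = 32) -> List[int]:
    """Return the number of merge tasks in each pass until one sstable remains.

    Simplifying assumption (not Cassandra's real scheduler): each task
    merges up to `fan_in` sstables into exactly one output sstable.
    """
    tasks_per_pass = []
    remaining = num_sstables
    while remaining > 1:
        tasks = math.ceil(remaining / fan_in)
        tasks_per_pass.append(tasks)
        remaining = tasks  # each task leaves one sstable behind
    return tasks_per_pass


passes = count_merge_tasks(1500)
print(passes, sum(passes))  # [47, 2, 1] 50
```

Under these assumptions nearly all of the work is in the first pass, so
anything that lets those first-pass tasks run side by side matters most.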

Is there some configuration knob I can tune to make this happen faster?
I'm getting a bit confused by the description for min_sstable_size,
bucket_high, bucket_low etc - and I'm not sure if they apply in this case.

I'm pondering options for decreasing the number of sstables being
streamed from the Hadoop side, but whether that is possible remains to be seen.

Thanks!
\EF

Re: LeveledCompaction, streaming bulkload, and lots of small sstables

Posted by Erik Forsberg <fo...@opera.com>.
On 2014-08-18 19:52, Robert Coli wrote:
> On Mon, Aug 18, 2014 at 6:21 AM, Erik Forsberg <forsberg@opera.com> wrote:
> 
>     Is there some configuration knob I can tune to make this happen faster?
>     I'm getting a bit confused by the description for min_sstable_size,
>     bucket_high, bucket_low etc - and I'm not sure if they apply in this
>     case.
> 
> 
> You probably don't want to use multi-threaded compaction; it has been
> removed upstream.
> 
> nodetool setcompactionthroughput 0
> 
> Assuming you have enough IO headroom etc.

OK. I disabled multithreaded and gave it a bit more throughput to play
with, but I still don't think that's the full story.

What I see is the following case:

1) My Hadoop cluster is bulkloading around 1000 sstables to the
Cassandra cluster.

2) Cassandra will start compacting.

With SizeTiered, I would see multiple ongoing compactions on the CF in
question, each taking on 32 sstables and compacting to one, all of them
running at the same time.

With Leveled, I see only one compaction, taking on 32 sstables and
compacting them to one. When that finishes, it starts another one. So
it's essentially a serial process, and it takes much longer than it
does with SizeTiered. While this compaction is ongoing, read
performance is not very good.
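
To put a rough number on the serial-versus-parallel difference, here is a
minimal sketch. It assumes every compaction task costs the same and that
concurrent tasks do not slow each other down (plausible only while there is
I/O headroom). The concurrency of 4 is a hypothetical value chosen for
illustration, not something measured on this cluster, and the function name
is made up for the example.

```python
import math


def wall_clock_units(num_tasks: int, concurrent_tasks: int) -> int:
    """Time units needed to run `num_tasks` equal-cost tasks.

    Assumes `concurrent_tasks` of them can run at once without contention.
    """
    if concurrent_tasks < 1:
        raise ValueError("concurrent_tasks must be at least 1")
    return math.ceil(num_tasks / concurrent_tasks)


# ~1000 streamed sstables, merged 32 at a time, gives 32 first-pass tasks.
first_pass_tasks = math.ceil(1000 / 32)

serial = wall_clock_units(first_pass_tasks, 1)    # one task at a time
parallel = wall_clock_units(first_pass_tasks, 4)  # hypothetical concurrency

print(first_pass_tasks, serial, parallel)  # 32 32 8
```

In this simplified model the first pass takes four times as long when the
tasks run one after another, which is in line with the difference described
above.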

http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2
mentions that LCS is parallelized in Cassandra 1.2, but maybe that patch
doesn't cover my use case (although I realize that my use case is maybe
a bit weird).

So my question is whether this is something I can tune. I'm running 1.2.18
now, but am strongly considering an upgrade to 2.0.x.

Regards,
\EF



Re: LeveledCompaction, streaming bulkload, and lots of small sstables

Posted by Robert Coli <rc...@eventbrite.com>.
On Mon, Aug 18, 2014 at 6:21 AM, Erik Forsberg <fo...@opera.com> wrote:

> Is there some configuration knob I can tune to make this happen faster?
> I'm getting a bit confused by the description for min_sstable_size,
> bucket_high, bucket_low etc - and I'm not sure if they apply in this case.
>

You probably don't want to use multi-threaded compaction; it has been
removed upstream.

nodetool setcompactionthroughput 0

Assuming you have enough IO headroom etc.

=Rob