Posted to user@cassandra.apache.org by Robert Wille <rw...@fold3.com> on 2014/01/28 16:57:00 UTC

Heavy update dataset and compaction

I have a dataset which is heavy on updates. The updates are actually
performed by inserting new records and deleting the old ones the following
day. Some records might be updated (replaced) a thousand times before they
are finished.
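
Roughly, the pattern looks like this (a simplified sketch with placeholder
table and column names, using the Python driver; none of these identifiers
come from my actual schema):

    from uuid import uuid4
    from cassandra.cluster import Cluster

    # Connect to a local node and a placeholder keyspace.
    session = Cluster(['127.0.0.1']).connect('my_keyspace')

    def replace_record(doc_id, old_version, payload):
        # An "update" is really a brand-new row for the new version...
        new_version = uuid4()
        session.execute(
            "INSERT INTO records (doc_id, version, payload) VALUES (%s, %s, %s)",
            (doc_id, new_version, payload))
        # ...followed (the next day, in practice) by a delete of the old row.
        session.execute(
            "DELETE FROM records WHERE doc_id = %s AND version = %s",
            (doc_id, old_version))
        return new_version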

As I watch SSTables get created and compacted on my staging server (I
haven't gone live with this yet), it appears that if I let the compactor do
its default behavior, I'll probably end up consuming several times the
amount of disk space as is actually required. I probably need to
periodically trigger a major compaction if I want to avoid that. However,
I've read that major compactions aren't really recommended. I'd like to get
people's take on this. I'd also be interested in people's recommendations on
compaction strategy and other compaction-related configuration settings.

Thanks

Robert



Re: Heavy update dataset and compaction

Posted by Robert Wille <rw...@fold3.com>.
> 
> Perhaps a log structured database with immutable data files is not best suited
> for this use case?

Perhaps not, but I have other data structures I'm moving to Cassandra as
well. This is just the first. Cassandra has actually worked quite well for
this first step, in spite of it not being an optimal tool for this use case.
And, I have to say that records being modified a thousand times is an
extreme case. Most of my data is far less volatile (perhaps dozens of
times).

I greatly appreciate all the information contributed to this mailing list.
It's a great resource.

Robert




Re: Heavy update dataset and compaction

Posted by Robert Coli <rc...@eventbrite.com>.
On Tue, Jan 28, 2014 at 7:57 AM, Robert Wille <rw...@fold3.com> wrote:

> I have a dataset which is heavy on updates. The updates are actually
> performed by inserting new records and deleting the old ones the following
> day. Some records might be updated (replaced) a thousand times before they
> are finished.
>

Perhaps a log structured database with immutable data files is not best
suited for this use case?

Are you deleting rows or columns each day?

As I watch SSTables get created and compacted on my staging server (I
> haven't gone live with this yet), it appears that if I let the compactor do
> its default behavior, I'll probably end up consuming several times the
> amount of disk space as is actually required. I probably need to
> periodically trigger a major compaction if I want to avoid that. However,
> I've read that major compactions aren't really recommended. I'd like to get
> people's take on this. I'd also be interested in people's recommendations
> on compaction strategy and other compaction-related configuration settings.
>

This is getting to be a FAQ... but... briefly:

1) Yes, you are correct about the amount of wasted space. This is why most
people avoid write patterns with lots of overwrites.
2) The docs used to say something incoherent about major compactions, but
suffice it to say that running them regularly is often a viable solution
(see the sketch after this list). They are the optimal way Cassandra has
available to merge data.
3) If you really have some problem related to your One Huge SSTable, you
can always use sstablesplit to split it into N smaller ones.
4) If you really don't want to run a major compaction, you can either use
Level compaction (which has its own caveats) or use checksstablegarbage [1]
and UserDefinedCompaction to strategically compact SSTables by hand.
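
A rough sketch of how 2) and 3) can be driven from a cron-style script
(keyspace, table, and file path are placeholders, not from this thread):

    import subprocess

    # 2) A scheduled major compaction merges all SSTables for one table.
    subprocess.check_call(['nodetool', 'compact', 'my_keyspace', 'records'])

    # 3) If the resulting single large SSTable ever becomes a problem, split
    #    it into ~100 MB pieces. sstablesplit should be run while the node is
    #    down; the data file path below is a placeholder.
    subprocess.check_call(
        ['sstablesplit', '--size', '100',
         '/var/lib/cassandra/data/my_keyspace/records/my_keyspace-records-jb-1-Data.db'])

    # 4) UserDefinedCompaction is a JMX operation (forceUserDefinedCompaction
    #    on the CompactionManager MBean); driving JMX is out of scope here.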

=Rob
[1] https://github.com/cloudian/support-tools#checksstablegarbage

Re: Heavy update dataset and compaction

Posted by Nate McCall <na...@thelastpickle.com>.
LeveledCompactionStrategy is ideal for update-heavy workloads. If you are
using a pre-1.2.8 version, make sure you raise sstable_size_in_mb to the new
default of 160.
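
For example, switching an existing table over with the Python driver (a
sketch; the keyspace and table names are placeholders):

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('my_keyspace')

    # Switch the table to leveled compaction and set the SSTable target
    # size to the newer 160 MB default.
    session.execute("""
        ALTER TABLE records
        WITH compaction = {'class': 'LeveledCompactionStrategy',
                           'sstable_size_in_mb': 160}
    """)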

Also, keep an eye on "Average live cells per slice" and "Average tombstones
per slice" (available in versions > 1.2.11 - so I guess just upgrade if you
are using an older version and not in production yet) in nodetool cfstats
to make sure your reads are not traversing too many tombstones.
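
A quick way to pull just those two counters out of nodetool cfstats (a
sketch; the keyspace.table argument is a placeholder):

    import subprocess

    # Print only the slice statistics for one table.
    output = subprocess.check_output(
        ['nodetool', 'cfstats', 'my_keyspace.records'])
    for line in output.decode().splitlines():
        if 'per slice' in line:
            print(line.strip())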


On Tue, Jan 28, 2014 at 9:57 AM, Robert Wille <rw...@fold3.com> wrote:

> I have a dataset which is heavy on updates. The updates are actually
> performed by inserting new records and deleting the old ones the following
> day. Some records might be updated (replaced) a thousand times before they
> are finished.
>
> As I watch SSTables get created and compacted on my staging server (I
> haven't gone live with this yet), it appears that if I let the compactor do
> its default behavior, I'll probably end up consuming several times the
> amount of disk space as is actually required. I probably need to
> periodically trigger a major compaction if I want to avoid that. However,
> I've read that major compactions aren't really recommended. I'd like to get
> people's take on this. I'd also be interested in people's recommendations
> on compaction strategy and other compaction-related configuration settings.
>
> Thanks
>
> Robert
>



-- 
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com