Posted to user@cassandra.apache.org by Alexandru Sicoe <ad...@gmail.com> on 2012/12/05 15:40:22 UTC

Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.

Hi guys,
Sorry for the late follow-up, but I waited until I had run major compactions on
all 3 nodes before replying with my findings.

Basically we were successful on two of the nodes. They both took ~2 days
and 11 hours to complete, and at the end we saw one very large file of ~900GB
with the rest much smaller (the overall size decreased). This is what we
expected!

But on the 3rd node, we suspect the major compaction didn't actually finish
its job. First of all, nodetool compact returned much earlier than on the other
nodes - after one day and 15 hrs. Secondly, of the 1.4TB initially on the node,
only about 36GB were freed up (so it is almost the same size as before). We saw
nothing in the server log (debug not enabled). Below I pasted some more details
about file sizes before and after compaction on this third node and the disk
occupancy.

The situation is maybe not so dramatic for us because in less than 2 weeks
we will have downtime until after the new year. During this window we can
completely delete all the data in the cluster and start fresh with 1-month TTLs
(as suggested by Aaron) and an 8GB heap (as suggested by Alain - thanks).
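
For reference, the column TTL is set per write; from cassandra-cli it would
look something like the line below (using the Data CF from below as an example;
the key and column names are placeholders; 2592000 seconds = 30 days):

SET Data['some_row']['some_col'] = 'some_value' WITH TTL = 2592000;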

Questions:

1) Do you expect problems with the 3rd node during 2 more weeks of
operations, under the conditions seen below?
[Note: we expect the minor compactions to keep building up files but never
actually get around to compacting the large file, and thus not need much
temporary extra disk space].

2) Should we restart with leveled compaction next year?
[Note: Aaron was right, we have 1-week rows which get deleted after 1 month,
which means older rows end up in big files => to free up space with SizeTiered
we will have no choice but to run major compactions, and we don't know whether
they will keep working given that we accumulate ~1TB / node / month. You can
see we are at the limit!]

3) In case we keep SizeTiered:

    - How can we improve the performance of our major compactions? (we left
all config parameters at their defaults). Would increasing the compaction
throughput interfere with writes and reads? What about multi-threaded
compaction? (some example knobs are sketched after this question)

    - Do we still need to run regular repair operations as well? Do these
also do a major compaction or are they completely separate operations?

[Note: we have 3 nodes with RF=2, inserting at consistency level ONE and
reading at consistency level ALL. We read primarily for exporting - we export
1 week's worth of data at a time].
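
The knobs we have in mind for this are roughly the following - the values are
only examples, not settings we have tested:

nodetool -h $HOSTNAME setcompactionthroughput 32    # runtime throttle in MB/s (0 = unthrottled)
nodetool -h $HOSTNAME repair -pr                    # repair only this node's primary range

plus, in cassandra.yaml (restart needed): compaction_throughput_mb_per_sec,
concurrent_compactors and multithreaded_compaction (the latter is off by
default in 1.1).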

4) Should we consider increasing the cluster capacity?
[We generate ~5 million new rows every week, which shouldn't come close to
the hundreds of millions of rows per node mentioned by Aaron as the volumes
that would create problems with bloom filters and indexes].
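
In the meantime we plan to keep an eye on the per-CF bloom filter footprint
with something like the line below; if it grows too large, the knobs Aaron
referred to are, as far as I understand, index_interval in cassandra.yaml and
the per-CF bloom_filter_fp_chance:

nodetool -h $HOSTNAME cfstats | egrep 'Column Family:|Bloom Filter Space Used'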

Cheers,
Alex
------------------

The situation in the data folder

    before calling nodetool compact:

du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
444G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
376G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-46431-Data.db
305G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-68959-Data.db
39G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-7352-Data.db
78G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-74076-Data.db
81G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-79663-Data.db
205M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80370-Data.db
20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80968-Data.db
20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-82330-Data.db
20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-83710-Data.db
4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84015-Data.db
4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84356-Data.db
4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84696-Data.db
333M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84707-Data.db
92M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84712-Data.db
92M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84717-Data.db
99M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84722-Data.db
2.5G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-tmp-he-84723-Data.db
1.4T    total

    after nodetool compact returned:

du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
444G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
910G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84723-Data.db
19G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-86229-Data.db
19G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87639-Data.db
5.0G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87923-Data.db
4.8G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88261-Data.db
338M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88271-Data.db
339M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88292-Data.db
339M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88312-Data.db
98M


Looking at the disk occupancy of the logical partition where the data
folder is:

df /data_bst
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdb1            2927242720 1482502260 1444740460  51% /data_bst


and the situation in the cluster

nodetool -h $HOSTNAME ring (before major compaction)
Address         DC          Rack        Status State   Load            Effective-Ownership Token
                                                                                           113427455640312821154458202477256070484
10.146.44.17    datacenter1 rack1       Up     Normal  1.37 TB         66.67%              0
10.146.44.18    datacenter1 rack1       Up     Normal  1.04 TB         66.67%              56713727820156410577229101238628035242
10.146.44.32    datacenter1 rack1       Up     Normal  1.14 TB         66.67%              113427455640312821154458202477256070484

nodetool -h $HOSTNAME ring (after major compaction) (Note: we were inserting data in the meantime)
Address         DC          Rack        Status State   Load            Effective-Ownership Token
                                                                                           113427455640312821154458202477256070484
10.146.44.17    datacenter1 rack1       Up     Normal  1.38 TB         66.67%              0
10.146.44.18    datacenter1 rack1       Up     Normal  1.08 TB         66.67%              56713727820156410577229101238628035242
10.146.44.32    datacenter1 rack1       Up     Normal  1.19 TB         66.67%              113427455640312821154458202477256070484




On Fri, Nov 23, 2012 at 2:16 AM, aaron morton <aa...@thelastpickle.com> wrote:

> >  From what I know having too much data on one node is bad, not really
> sure why, but  I think that performance will go down due to the size of
> indexes and bloom filters (I may be wrong on the reasons but I'm quite sure
> you can't store too much data per node).
> If you have many hundreds of millions of rows on a node the memory needed
> for bloom filters and index sampling can be significant. These can both be
> tuned.
>
> If you have 1.1T per node the time to do a compaction, repair or upgrade
> may be very significant. Also the time taken to copy this data should you
> need to remove or replace a node may be prohibitive.
>
> > 2. Switch to Leveled compaction strategy.
> I would avoid making a change like that on an unstable / at risk system.
>
> > - Our usage pattern is write once, read once (export) and delete once!
>
>  The column TTL may be of use to you, it removes the need to do a delete.
>
> > - We were thinking of relying on the automatic minor compactions to free
> up space for us but as..
> There are some usage patterns which make life harder for STS. For example
> if you have very long lived rows that are written to and deleted a lot. Row
> fragments that have been around for a while will end up in bigger files,
> and these files get compacted less often.
>
> In this situation, if you are running low on disk space and you think
> there is a lot of deleted data in there, I would run a major compaction. A
> word of warning though: if you do this you will need to continue to do it
> regularly. Major compaction creates a single big file that will not get
> compacted often. There are ways to resolve this, and moving to LDB may
> help in the future.
>
> If you are stuck and worried about disk space it's what I would do. Once
> you are stable again then look at LDB
> http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 23/11/2012, at 9:18 AM, Alain RODRIGUEZ <ar...@gmail.com> wrote:
>
> > Hi Alexandru,
> >
> > "We are running a 3 node Cassandra 1.1.5 cluster with a 3TB Raid 0 disk
> per node for the data dir and separate disk for the commitlog, 12 cores, 24
> GB RAM"
> >
> > I think you should tune your architecture in a very different way. From
> what I know having too much data on one node is bad, not really sure why,
> but  I think that performance will go down due to the size of indexes and
> bloom filters (I may be wrong on the reasons but I'm quite sure you can't
> store too much data per node).
> >
> > Anyway, I think 6 nodes with half of these resources (6 cores / 12GB) would
> be better if you have the choice.
> >
> > "(12GB to Cassandra heap)."
> >
> > The max heap recommended is 8GB because if you use more than these 8GB
> the GC jobs will start decreasing your performance.
> >
> > "We now have 1.1 TB worth of data per node (RF = 2)."
> >
> > You should use RF=3 unless one of consistency or SPOF doesn't
> matter to you.
> >
> > With RF=2 you are obliged to write at CL.one to remove the single point
> of failure.
> >
> > "1. Start issuing regular major compactions (nodetool compact).
> >      - This is not recommended:
> >             - Stops minor compactions.
> >             - Major performance hit on node (very bad for us because
> need to be taking data all the time)."
> >
> > Actually, major compaction *does not* stop minor compactions. What
> happens is that due to the size of the sstable that remains
> after your major compaction, it will never be compacted with the upcoming
> new sstables, and because of that, your read performance will go down until
> you run another major compaction.
> >
> > "2. Switch to Leveled compaction strategy.
> >       - It is mentioned to help with deletes and disk space usage. Can
> someone confirm?"
> >
> > From what I know, Leveled compaction will not free disk space. It will
> allow you to use a greater percentage of your total disk space (50% max for
> size-tiered compaction vs. about 80% for leveled compaction)
> >
> > "Our usage pattern is write once, read once (export) and delete once! "
> >
> > In this case, I think that leveled compaction fits your needs.
> >
> > "Can anyone suggest which (if any) is better? Are there better
> solutions?"
> >
> > Are your sstables compressed? You have 2 types of built-in compression
> and you may use them depending on the model of each of your CF.
> >
> > see:
> http://www.datastax.com/docs/1.1/operations/tuning#configure-compression
> >
> > Alain
> >
> > 2012/11/22 Alexandru Sicoe <ad...@gmail.com>
> > We are running a 3 node Cassandra 1.1.5 cluster with a 3TB Raid 0 disk
> per node for the data dir and separate disk for the commitlog, 12 cores, 24
> GB RAM (12GB to Cassandra heap).
> >
>
>

RE: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.

Posted by "Poziombka, Wade L" <wa...@intel.com>.
Duh, sorry. That estimate is for 2 TB, which would be 15 nodes at RF = 3.

From: Poziombka, Wade L [mailto:wade.l.poziombka@intel.com]
Sent: Friday, December 07, 2012 7:15 AM
To: user@cassandra.apache.org
Subject: RE: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.

So if my calculations are correct a terabyte-sized database would require a minimum of 15 nodes (RF = 3).  Does that sound about right?

2000 GB / 400 GB per node * RF of 3 = 15 nodes

From: aaron morton [mailto:aaron@thelastpickle.com]
Sent: Thursday, December 06, 2012 9:43 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.

Meaning terabyte size databases.
Lots of people have TB sized systems. Just add more nodes.
300 to 400 GB is just a rough guideline. The bigger picture is considering how routine and non-routine maintenance tasks are going to be carried out.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 7/12/2012, at 4:38 AM, Edward Capriolo <ed...@gmail.com>> wrote:

http://wiki.apache.org/cassandra/LargeDataSetConsiderations
On Thu, Dec 6, 2012 at 9:53 AM, Poziombka, Wade L <wa...@intel.com>> wrote:
"Having so much data on each node is a potential bad day."

Is this discussed somewhere in the Cassandra documentation (limits, practices etc.)?  We are also trying to load up quite a lot of data and have hit memory issues (bloom filters etc.) in 1.0.10.  I would like to read up on big data usage of Cassandra, meaning terabyte-sized databases.

I do get your point about the amount of time required to recover a downed node. But this 300-400GB business is interesting to me.

Thanks in advance.

Wade

From: aaron morton [mailto:aaron@thelastpickle.com<ma...@thelastpickle.com>]
Sent: Wednesday, December 05, 2012 9:23 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.

Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected!
I would recommend having up to 300GB to 400GB per node on a regular HDD with 1Gb networking.

But on the 3rd node, we suspect major compaction didn't actually finish it's job...
The file list looks odd. Check the time stamps, on the files. You should not have files older than when compaction started.

8GB heap
The default is 4GB max nowadays.

1) Do you expect problems with the 3rd node during 2 weeks more of operations, in the conditions seen below?
I cannot answer that.

2) Should we restart with leveled compaction next year?
I would run some tests to see how it works for your workload.

4) Should we consider increasing the cluster capacity?
IMHO yes.
You may also want to do some experiments with turning compression on if it is not already enabled.

Having so much data on each node is a potential bad day. If instead you had to move or repair one of those nodes how long would it take for cassandra to stream all the data over ? (Or to rsync the data over.) How long does it take to run nodetool repair on the node ?

With RF 3, if you lose a node you have lost your redundancy. It's important to have a plan about how to get it back and how long it may take.

Hope that helps.

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com<http://www.thelastpickle.com/>

On 6/12/2012, at 3:40 AM, Alexandru Sicoe <ad...@gmail.com>> wrote:

Hi guys,
Sorry for the late follow-up but I waited to run major compactions on all 3 nodes at a time before replying with my findings.

Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected!

But on the 3rd node, we suspect major compaction didn't actually finish it's job. First of all nodetool compact returned much earlier than the rest - after one day and 15 hrs. Secondly from the 1.4TBs initially on the node only about 36GB were freed up (almost the same size as before). Saw nothing in the server log (debug not enabled). Below I pasted some more details about file sizes before and after compaction on this third node and disk occupancy.

The situation is maybe not so dramatic for us because in less than 2 weeks we will have a down time till after the new year. During this we can completely delete all the data in the cluster and start fresh with TTLs for 1 month (as suggested by Aaron and 8GB heap as suggested by Alain - thanks).

Questions:

1) Do you expect problems with the 3rd node during 2 weeks more of operations, in the conditions seen below?
[Note: we expect the minor compactions to continue building up files but never really getting to compacting the large file and thus not needing much temporarily extra disk space].

2) Should we restart with leveled compaction next year?
[Note: Aaron was right, we have 1 week rows which get deleted after 1 month which means older rows end up in big files => to free up space with SizeTiered we will have no choice but run major compactions which we don't know if they will work provided that we get at ~1TB / node / 1 month. You can see we are at the limit!]

3) In case we keep SizeTiered:

    - How can we improve the performance of our major compactions? (we left all config parameters as default). Would increasing compactions throughput interfere with writes and reads? What about multi-threaded compactions?

    - Do we still need to run regular repair operations as well? Do these also do a major compaction or are they completely separate operations?

[Note: we have 3 nodes with RF=2 and inserting at consistency level 1 and reading at consistency level ALL. We read primarily for exporting reasons - we export 1 week worth of data at a time].

4) Should we consider increasing the cluster capacity?
[We generate ~5million new rows every week which shouldn't come close to the hundreds of millions of rows on a node mentioned by Aaron which are the volumes that would create problems with bloom filters and indexes].

Cheers,
Alex
------------------

The situation in the data folder

    before calling nodetool comapact:

du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
444G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
376G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-46431-Data.db
305G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-68959-Data.db
39G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-7352-Data.db
78G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-74076-Data.db
81G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-79663-Data.db
205M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80370-Data.db
20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80968-Data.db
20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-82330-Data.db
20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-83710-Data.db
4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84015-Data.db
4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84356-Data.db
4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84696-Data.db
333M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84707-Data.db
92M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84712-Data.db
92M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84717-Data.db
99M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84722-Data.db
2.5G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-tmp-he-84723-Data.db
1.4T    total

    after nodetool comapact returned:

du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
444G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
910G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84723-Data.db
19G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-86229-Data.db
19G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87639-Data.db
5.0G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87923-Data.db
4.8G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88261-Data.db
338M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88271-Data.db
339M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88292-Data.db
339M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88312-Data.db
98M


Looking at the disk occupancy for the logical partition where the data folder is in:

df /data_bst
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdb1            2927242720 1482502260 1444740460  51% /data_bst


and the situation in the cluster

nodetool -h $HOSTNAME ring (before major compaction)
Address         DC          Rack        Status State   Load            Effective-Ownership Token
                                                                                           113427455640312821154458202477256070484
10.146.44.17    datacenter1 rack1       Up     Normal  1.37 TB         66.67%              0
10.146.44.18    datacenter1 rack1       Up     Normal  1.04 TB         66.67%              56713727820156410577229101238628035242
10.146.44.32    datacenter1 rack1       Up     Normal  1.14 TB         66.67%              113427455640312821154458202477256070484

nodetool -h $HOSTNAME ring (after major compaction) (Note we were inserting data in the meantime)
Address         DC          Rack        Status State   Load            Effective-Ownership Token
                                                                                           113427455640312821154458202477256070484
10.146.44.17    datacenter1 rack1       Up     Normal  1.38 TB         66.67%              0
10.146.44.18    datacenter1 rack1       Up     Normal  1.08 TB         66.67%              56713727820156410577229101238628035242
10.146.44.32    datacenter1 rack1       Up     Normal  1.19 TB         66.67%              113427455640312821154458202477256070484

On Fri, Nov 23, 2012 at 2:16 AM, aaron morton <aa...@thelastpickle.com>> wrote:
>  From what I know having too much data on one node is bad, not really sure why, but  I think that performance will go down due to the size of indexes and bloom filters (I may be wrong on the reasons but I'm quite sure you can't store too much data per node).
If you have many hundreds of millions of rows on a node the memory needed for bloom filters and index sampling can be significant. These can both be tuned.

If you have 1.1T per node the time to do a compaction, repair or upgrade may be very significant. Also the time taken to copy this data should you need to remove or replace a node may be prohibitive.

> 2. Switch to Leveled compaction strategy.
I would avoid making a change like that on an unstable / at risk system.

> - Our usage pattern is write once, read once (export) and delete once!

 The column TTL may be of use to you, it removes the need to do a delete.

> - We were thinking of relying on the automatic minor compactions to free up space for us but as..
There are some usage patterns which make life harder for STS. For example if you have very long lived rows that are written to and deleted a lot. Row fragments that have been around for a while will end up in bigger files, and these files get compacted less often.

In this situation, if you are running low on disk space and you think there is a lot of deleted data in there, I would run a major compaction. A word or warning though, if do this you will need to continue to do it regularly. Major compaction creates a single big file, that will not get compaction often. There are ways to resolve this, and moving to LDB may help in the future.

If you are stuck and worried about disk space it's what I would do. Once you are stable again then look at LDB http://www.datastax.com/dev/blog/when-to-use-leveled-compaction

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com<http://www.thelastpickle.com/>

On 23/11/2012, at 9:18 AM, Alain RODRIGUEZ <ar...@gmail.com>> wrote:

> Hi Alexandru,
>
> "We are running a 3 node Cassandra 1.1.5 cluster with a 3TB Raid 0 disk per node for the data dir and separate disk for the commitlog, 12 cores, 24 GB RAM"
>
> I think you should tune your architecture in a very different way. From what I know having too much data on one node is bad, not really sure why, but  I think that performance will go down due to the size of indexes and bloom filters (I may be wrong on the reasons but I'm quite sure you can't store too much data per node).
>
> Anyway, I am 6 nodes with half of these resources (6 cores / 12GB) would be better if you have the choice.
>
> "(12GB to Cassandra heap)."
>
> The max heap recommanded is 8GB because if you use more than these 8GB the Gc jobs will start decreasing your performance.
>
> "We now have 1.1 TB worth of data per node (RF = 2)."
>
> You should use RF=3 unless one out of consistency or SPOF  doesn't matter to you.
>
> With RF=2 you are obliged to write at CL.one to remove the single point of failure.
>
> "1. Start issuing regular major compactions (nodetool compact).
>      - This is not recommended:
>             - Stops minor compactions.
>             - Major performance hit on node (very bad for us because need to be taking data all the time)."
>
> Actually, major compaction *does not* stop minor compactions. What happens is that due to the size of the size of the sstable that remains after your major compaction, it will never be compacted with the upcoming new sstables, and because of that, your read performance will go down until you run an other major compaction.
>
> "2. Switch to Leveled compaction strategy.
>       - It is mentioned to help with deletes and disk space usage. Can someone confirm?"
>
> From what I know, Leveled compaction will not free disk space. It will allow you to use a greater percentage of your total disk space (50% max for sized tier compaction vs about 80% for leveled compaction)
>
> "Our usage pattern is write once, read once (export) and delete once! "
>
> In this case, I think that leveled compaction fits your needs.
>
> "Can anyone suggest which (if any) is better? Are there better solutions?"
>
> Are your sstable compressed ? You have 2 types of built-in compression and you may use them depending on the model of each of your CF.
>
> see: http://www.datastax.com/docs/1.1/operations/tuning#configure-compression
>
> Alain
>
> 2012/11/22 Alexandru Sicoe <ad...@gmail.com>>
> We are running a 3 node Cassandra 1.1.5 cluster with a 3TB Raid 0 disk per node for the data dir and separate disk for the commitlog, 12 cores, 24 GB RAM (12GB to Cassandra heap).
>





Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.

Posted by "Hiller, Dean" <De...@nrel.gov>.
When you turn on compression, which should be enabled, that should change quite a bit as well.  I am curious, though, how many nodes with RF=3 on average have a terabyte each, as you would hope it is a very low number if you plan on scaling to a petabyte someday.
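
For example, on 1.1 compression can be enabled per column family from
cassandra-cli with something along these lines (keyspace and CF names taken
from the thread above, chunk length only an example), followed by nodetool
upgradesstables to rewrite the existing sstables in compressed form:

use ATLAS;
UPDATE COLUMN FAMILY Data WITH compression_options = {sstable_compression:SnappyCompressor, chunk_length_kb:64};

nodetool -h $HOSTNAME upgradesstables ATLAS Data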

Later,
Dean

From: Poziombka, Wade L <wa...@intel.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Friday, December 7, 2012 6:15 AM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: RE: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.

So if my calculations are correct a terabyte sized database would require a minimum of 15 nodes (RF = 3).  That sound about right?

2000 / 400 * RF

From: aaron morton [mailto:aaron@thelastpickle.com]
Sent: Thursday, December 06, 2012 9:43 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.

Meaning terabyte size databases.
Lots of people have TB sized systems. Just add more nodes.
300 to 400 Gb is just a rough guideline. The bigger picture is considering how routine and non routine maintenance tasks are going to be carried out.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 7/12/2012, at 4:38 AM, Edward Capriolo <ed...@gmail.com>> wrote:


http://wiki.apache.org/cassandra/LargeDataSetConsiderations

On Thu, Dec 6, 2012 at 9:53 AM, Poziombka, Wade L <wa...@intel.com>> wrote:

“Having so much data on each node is a potential bad day.”

Is this discussed somewhere on the Cassandra documentation (limits, practices etc)?  We are also trying to load up quite a lot of data and have hit memory issues (bloom filter etc.) in 1.0.10.  I would like to read up on big data usage of Cassandra.  Meaning terabyte size databases.

I do get your point about the amount of time required to recover downed node. But this 300-400MB business is interesting to me.

Thanks in advance.

Wade

From: aaron morton [mailto:aaron@thelastpickle.com<ma...@thelastpickle.com>]
Sent: Wednesday, December 05, 2012 9:23 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.

Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected!
I would recommend having up to 300MB to 400MB per node on a regular HDD with 1GB networking.

But on the 3rd node, we suspect major compaction didn't actually finish it's job…
The file list looks odd. Check the time stamps, on the files. You should not have files older than when compaction started.

8GB heap
The default is 4GB max now days.

1) Do you expect problems with the 3rd node during 2 weeks more of operations, in the conditions seen below?
I cannot answer that.

2) Should we restart with leveled compaction next year?
I would run some tests to see how it works for you workload.

4) Should we consider increasing the cluster capacity?
IMHO yes.
You may also want to do some experiments with turing compression on if it not already enabled.

Having so much data on each node is a potential bad day. If instead you had to move or repair one of those nodes how long would it take for cassandra to stream all the data over ? (Or to rsync the data over.) How long does it take to run nodetool repair on the node ?

With RF 3, if you lose a node you have lost your redundancy. It's important to have a plan about how to get it back and how long it may take.

Hope that helps.

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com<http://www.thelastpickle.com/>

On 6/12/2012, at 3:40 AM, Alexandru Sicoe <ad...@gmail.com>> wrote:

Hi guys,
Sorry for the late follow-up but I waited to run major compactions on all 3 nodes at a time before replying with my findings.

Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected!

But on the 3rd node, we suspect major compaction didn't actually finish it's job. First of all nodetool compact returned much earlier than the rest - after one day and 15 hrs. Secondly from the 1.4TBs initially on the node only about 36GB were freed up (almost the same size as before). Saw nothing in the server log (debug not enabled). Below I pasted some more details about file sizes before and after compaction on this third node and disk occupancy.

The situation is maybe not so dramatic for us because in less than 2 weeks we will have a down time till after the new year. During this we can completely delete all the data in the cluster and start fresh with TTLs for 1 month (as suggested by Aaron and 8GB heap as suggested by Alain - thanks).

Questions:

1) Do you expect problems with the 3rd node during 2 weeks more of operations, in the conditions seen below?
[Note: we expect the minor compactions to continue building up files but never really getting to compacting the large file and thus not needing much temporarily extra disk space].

2) Should we restart with leveled compaction next year?
[Note: Aaron was right, we have 1 week rows which get deleted after 1 month which means older rows end up in big files => to free up space with SizeTiered we will have no choice but run major compactions which we don't know if they will work provided that we get at ~1TB / node / 1 month. You can see we are at the limit!]

3) In case we keep SizeTiered:

    - How can we improve the performance of our major compactions? (we left all config parameters as default). Would increasing compactions throughput interfere with writes and reads? What about multi-threaded compactions?

    - Do we still need to run regular repair operations as well? Do these also do a major compaction or are they completely separate operations?

[Note: we have 3 nodes with RF=2 and inserting at consistency level 1 and reading at consistency level ALL. We read primarily for exporting reasons - we export 1 week worth of data at a time].

4) Should we consider increasing the cluster capacity?
[We generate ~5million new rows every week which shouldn't come close to the hundreds of millions of rows on a node mentioned by Aaron which are the volumes that would create problems with bloom filters and indexes].

Cheers,
Alex
------------------

The situation in the data folder

    before calling nodetool comapact:

du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
444G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
376G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-46431-Data.db
305G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-68959-Data.db
39G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-7352-Data.db
78G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-74076-Data.db
81G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-79663-Data.db
205M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80370-Data.db
20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80968-Data.db
20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-82330-Data.db
20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-83710-Data.db
4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84015-Data.db
4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84356-Data.db
4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84696-Data.db
333M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84707-Data.db
92M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84712-Data.db
92M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84717-Data.db
99M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84722-Data.db
2.5G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-tmp-he-84723-Data.db
1.4T    total

    after nodetool comapact returned:

du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
444G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
910G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84723-Data.db
19G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-86229-Data.db
19G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87639-Data.db
5.0G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87923-Data.db
4.8G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88261-Data.db
338M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88271-Data.db
339M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88292-Data.db
339M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88312-Data.db
98M


Looking at the disk occupancy for the logical partition where the data folder is in:

df /data_bst
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdb1            2927242720 1482502260 1444740460  51% /data_bst


and the situation in the cluster

nodetool -h $HOSTNAME ring (before major compaction)
Address         DC          Rack        Status State   Load            Effective-Ownership Token
                                                                                           113427455640312821154458202477256070484
10.146.44.17    datacenter1 rack1       Up     Normal  1.37 TB         66.67%              0
10.146.44.18    datacenter1 rack1       Up     Normal  1.04 TB         66.67%              56713727820156410577229101238628035242
10.146.44.32    datacenter1 rack1       Up     Normal  1.14 TB         66.67%              113427455640312821154458202477256070484

nodetool -h $HOSTNAME ring (after major compaction) (Note we were inserting data in the meantime)
Address         DC          Rack        Status State   Load            Effective-Ownership Token
                                                                                           113427455640312821154458202477256070484
10.146.44.17    datacenter1 rack1       Up     Normal  1.38 TB         66.67%              0
10.146.44.18    datacenter1 rack1       Up     Normal  1.08 TB         66.67%              56713727820156410577229101238628035242
10.146.44.32    datacenter1 rack1       Up     Normal  1.19 TB         66.67%              113427455640312821154458202477256070484

On Fri, Nov 23, 2012 at 2:16 AM, aaron morton <aa...@thelastpickle.com>> wrote:
>  From what I know having too much data on one node is bad, not really sure why, but  I think that performance will go down due to the size of indexes and bloom filters (I may be wrong on the reasons but I'm quite sure you can't store too much data per node).
If you have many hundreds of millions of rows on a node the memory needed for bloom filters and index sampling can be significant. These can both be tuned.

If you have 1.1T per node the time to do a compaction, repair or upgrade may be very significant. Also the time taken to copy this data should you need to remove or replace a node may be prohibitive.

> 2. Switch to Leveled compaction strategy.
I would avoid making a change like that on an unstable / at risk system.

> - Our usage pattern is write once, read once (export) and delete once!

 The column TTL may be of use to you, it removes the need to do a delete.

> - We were thinking of relying on the automatic minor compactions to free up space for us but as..
There are some usage patterns which make life harder for STS. For example if you have very long lived rows that are written to and deleted a lot. Row fragments that have been around for a while will end up in bigger files, and these files get compacted less often.

In this situation, if you are running low on disk space and you think there is a lot of deleted data in there, I would run a major compaction. A word or warning though, if do this you will need to continue to do it regularly. Major compaction creates a single big file, that will not get compaction often. There are ways to resolve this, and moving to LDB may help in the future.

If you are stuck and worried about disk space it's what I would do. Once you are stable again then look at LDB http://www.datastax.com/dev/blog/when-to-use-leveled-compaction

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com<http://www.thelastpickle.com/>

On 23/11/2012, at 9:18 AM, Alain RODRIGUEZ <ar...@gmail.com>> wrote:

> Hi Alexandru,
>
> "We are running a 3 node Cassandra 1.1.5 cluster with a 3TB Raid 0 disk per node for the data dir and separate disk for the commitlog, 12 cores, 24 GB RAM"
>
> I think you should tune your architecture in a very different way. From what I know having too much data on one node is bad, not really sure why, but  I think that performance will go down due to the size of indexes and bloom filters (I may be wrong on the reasons but I'm quite sure you can't store too much data per node).
>
> Anyway, I am 6 nodes with half of these resources (6 cores / 12GB) would be better if you have the choice.
>
> "(12GB to Cassandra heap)."
>
> The max heap recommanded is 8GB because if you use more than these 8GB the Gc jobs will start decreasing your performance.
>
> "We now have 1.1 TB worth of data per node (RF = 2)."
>
> You should use RF=3 unless one out of consistency or SPOF  doesn't matter to you.
>
> With RF=2 you are obliged to write at CL.one to remove the single point of failure.
>
> "1. Start issuing regular major compactions (nodetool compact).
>      - This is not recommended:
>             - Stops minor compactions.
>             - Major performance hit on node (very bad for us because need to be taking data all the time)."
>
> Actually, major compaction *does not* stop minor compactions. What happens is that due to the size of the size of the sstable that remains after your major compaction, it will never be compacted with the upcoming new sstables, and because of that, your read performance will go down until you run an other major compaction.
>
> "2. Switch to Leveled compaction strategy.
>       - It is mentioned to help with deletes and disk space usage. Can someone confirm?"
>
> From what I know, Leveled compaction will not free disk space. It will allow you to use a greater percentage of your total disk space (50% max for sized tier compaction vs about 80% for leveled compaction)
>
> "Our usage pattern is write once, read once (export) and delete once! "
>
> In this case, I think that leveled compaction fits your needs.
>
> "Can anyone suggest which (if any) is better? Are there better solutions?"
>
> Are your sstable compressed ? You have 2 types of built-in compression and you may use them depending on the model of each of your CF.
>
> see: http://www.datastax.com/docs/1.1/operations/tuning#configure-compression
>
> Alain
>
> 2012/11/22 Alexandru Sicoe <ad...@gmail.com>>
> We are running a 3 node Cassandra 1.1.5 cluster with a 3TB Raid 0 disk per node for the data dir and separate disk for the commitlog, 12 cores, 24 GB RAM (12GB to Cassandra heap).
>





RE: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.

Posted by "Poziombka, Wade L" <wa...@intel.com>.
So if my calculations are correct a terabyte-sized database would require a minimum of 15 nodes (RF = 3).  Does that sound about right?

2000 GB / 400 GB per node * RF of 3 = 15 nodes

From: aaron morton [mailto:aaron@thelastpickle.com]
Sent: Thursday, December 06, 2012 9:43 PM
To: user@cassandra.apache.org
Subject: Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.

Meaning terabyte size databases.
Lots of people have TB sized systems. Just add more nodes.
300 to 400 Gb is just a rough guideline. The bigger picture is considering how routine and non routine maintenance tasks are going to be carried out.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 7/12/2012, at 4:38 AM, Edward Capriolo <ed...@gmail.com>> wrote:


http://wiki.apache.org/cassandra/LargeDataSetConsiderations

On Thu, Dec 6, 2012 at 9:53 AM, Poziombka, Wade L <wa...@intel.com>> wrote:

"Having so much data on each node is a potential bad day."

Is this discussed somewhere on the Cassandra documentation (limits, practices etc)?  We are also trying to load up quite a lot of data and have hit memory issues (bloom filter etc.) in 1.0.10.  I would like to read up on big data usage of Cassandra.  Meaning terabyte size databases.

I do get your point about the amount of time required to recover downed node. But this 300-400MB business is interesting to me.

Thanks in advance.

Wade

From: aaron morton [mailto:aaron@thelastpickle.com<ma...@thelastpickle.com>]
Sent: Wednesday, December 05, 2012 9:23 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.

Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected!
I would recommend having up to 300MB to 400MB per node on a regular HDD with 1GB networking.

But on the 3rd node, we suspect major compaction didn't actually finish it's job...
The file list looks odd. Check the time stamps, on the files. You should not have files older than when compaction started.

8GB heap
The default is 4GB max now days.

1) Do you expect problems with the 3rd node during 2 weeks more of operations, in the conditions seen below?
I cannot answer that.

2) Should we restart with leveled compaction next year?
I would run some tests to see how it works for you workload.

4) Should we consider increasing the cluster capacity?
IMHO yes.
You may also want to do some experiments with turing compression on if it not already enabled.

Having so much data on each node is a potential bad day. If instead you had to move or repair one of those nodes how long would it take for cassandra to stream all the data over ? (Or to rsync the data over.) How long does it take to run nodetool repair on the node ?

With RF 3, if you lose a node you have lost your redundancy. It's important to have a plan about how to get it back and how long it may take.

Hope that helps.

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com<http://www.thelastpickle.com/>

On 6/12/2012, at 3:40 AM, Alexandru Sicoe <ad...@gmail.com>> wrote:

Hi guys,
Sorry for the late follow-up but I waited to run major compactions on all 3 nodes at a time before replying with my findings.

Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected!

But on the 3rd node, we suspect major compaction didn't actually finish it's job. First of all nodetool compact returned much earlier than the rest - after one day and 15 hrs. Secondly from the 1.4TBs initially on the node only about 36GB were freed up (almost the same size as before). Saw nothing in the server log (debug not enabled). Below I pasted some more details about file sizes before and after compaction on this third node and disk occupancy.

The situation is maybe not so dramatic for us because in less than 2 weeks we will have a down time till after the new year. During this we can completely delete all the data in the cluster and start fresh with TTLs for 1 month (as suggested by Aaron and 8GB heap as suggested by Alain - thanks).

Questions:

1) Do you expect problems with the 3rd node during 2 weeks more of operations, in the conditions seen below?
[Note: we expect the minor compactions to continue building up files but never really getting to compacting the large file and thus not needing much temporarily extra disk space].

2) Should we restart with leveled compaction next year?
[Note: Aaron was right, we have 1 week rows which get deleted after 1 month which means older rows end up in big files => to free up space with SizeTiered we will have no choice but run major compactions which we don't know if they will work provided that we get at ~1TB / node / 1 month. You can see we are at the limit!]

3) In case we keep SizeTiered:

    - How can we improve the performance of our major compactions? (we left all config parameters as default). Would increasing compactions throughput interfere with writes and reads? What about multi-threaded compactions?

    - Do we still need to run regular repair operations as well? Do these also do a major compaction or are they completely separate operations?

[Note: we have 3 nodes with RF=2 and inserting at consistency level 1 and reading at consistency level ALL. We read primarily for exporting reasons - we export 1 week worth of data at a time].

4) Should we consider increasing the cluster capacity?
[We generate ~5million new rows every week which shouldn't come close to the hundreds of millions of rows on a node mentioned by Aaron which are the volumes that would create problems with bloom filters and indexes].

Cheers,
Alex
------------------

The situation in the data folder

    before calling nodetool comapact:

du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
444G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
376G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-46431-Data.db
305G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-68959-Data.db
39G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-7352-Data.db
78G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-74076-Data.db
81G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-79663-Data.db
205M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80370-Data.db
20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80968-Data.db
20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-82330-Data.db
20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-83710-Data.db
4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84015-Data.db
4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84356-Data.db
4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84696-Data.db
333M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84707-Data.db
92M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84712-Data.db
92M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84717-Data.db
99M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84722-Data.db
2.5G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-tmp-he-84723-Data.db
1.4T    total

    after nodetool comapact returned:

du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
444G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
910G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84723-Data.db
19G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-86229-Data.db
19G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87639-Data.db
5.0G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87923-Data.db
4.8G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88261-Data.db
338M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88271-Data.db
339M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88292-Data.db
339M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88312-Data.db
98M


Looking at the disk occupancy for the logical partition where the data folder is in:

df /data_bst
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdb1            2927242720 1482502260 1444740460  51% /data_bst


and the situation in the cluster

nodetool -h $HOSTNAME ring (before major compaction)
Address         DC          Rack        Status State   Load            Effective-Ownership Token
                                                                                           113427455640312821154458202477256070484
10.146.44.17    datacenter1 rack1       Up     Normal  1.37 TB         66.67%              0
10.146.44.18    datacenter1 rack1       Up     Normal  1.04 TB         66.67%              56713727820156410577229101238628035242
10.146.44.32    datacenter1 rack1       Up     Normal  1.14 TB         66.67%              113427455640312821154458202477256070484

nodetool -h $HOSTNAME ring (after major compaction) (Note we were inserting data in the meantime)
Address         DC          Rack        Status State   Load            Effective-Ownership Token
                                                                                           113427455640312821154458202477256070484
10.146.44.17    datacenter1 rack1       Up     Normal  1.38 TB         66.67%              0
10.146.44.18    datacenter1 rack1       Up     Normal  1.08 TB         66.67%              56713727820156410577229101238628035242
10.146.44.32    datacenter1 rack1       Up     Normal  1.19 TB         66.67%              113427455640312821154458202477256070484

On Fri, Nov 23, 2012 at 2:16 AM, aaron morton <aa...@thelastpickle.com>> wrote:
>  From what I know having too much data on one node is bad, not really sure why, but  I think that performance will go down due to the size of indexes and bloom filters (I may be wrong on the reasons but I'm quite sure you can't store too much data per node).
If you have many hundreds of millions of rows on a node the memory needed for bloom filters and index sampling can be significant. These can both be tuned.

If you have 1.1T per node the time to do a compaction, repair or upgrade may be very significant. Also the time taken to copy this data should you need to remove or replace a node may be prohibitive.

> 2. Switch to Leveled compaction strategy.
I would avoid making a change like that on an unstable / at risk system.

> - Our usage pattern is write once, read once (export) and delete once!

 The column TTL may be of use to you, it removes the need to do a delete.

> - We were thinking of relying on the automatic minor compactions to free up space for us but as..
There are some usage patterns which make life harder for STS. For example if you have very long lived rows that are written to and deleted a lot. Row fragments that have been around for a while will end up in bigger files, and these files get compacted less often.

In this situation, if you are running low on disk space and you think there is a lot of deleted data in there, I would run a major compaction. A word or warning though, if do this you will need to continue to do it regularly. Major compaction creates a single big file, that will not get compaction often. There are ways to resolve this, and moving to LDB may help in the future.

If you are stuck and worried about disk space it's what I would do. Once you are stable again then look at LDB http://www.datastax.com/dev/blog/when-to-use-leveled-compaction

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com<http://www.thelastpickle.com/>

On 23/11/2012, at 9:18 AM, Alain RODRIGUEZ <ar...@gmail.com>> wrote:

> Hi Alexandru,
>
> "We are running a 3 node Cassandra 1.1.5 cluster with a 3TB Raid 0 disk per node for the data dir and separate disk for the commitlog, 12 cores, 24 GB RAM"
>
> I think you should tune your architecture in a very different way. From what I know having too much data on one node is bad, not really sure why, but  I think that performance will go down due to the size of indexes and bloom filters (I may be wrong on the reasons but I'm quite sure you can't store too much data per node).
>
> Anyway, I am 6 nodes with half of these resources (6 cores / 12GB) would be better if you have the choice.
>
> "(12GB to Cassandra heap)."
>
> The max heap recommanded is 8GB because if you use more than these 8GB the Gc jobs will start decreasing your performance.
>
> "We now have 1.1 TB worth of data per node (RF = 2)."
>
> You should use RF=3 unless one out of consistency or SPOF  doesn't matter to you.
>
> With RF=2 you are obliged to write at CL.one to remove the single point of failure.
>
> "1. Start issuing regular major compactions (nodetool compact).
>      - This is not recommended:
>             - Stops minor compactions.
>             - Major performance hit on node (very bad for us because need to be taking data all the time)."
>
> Actually, major compaction *does not* stop minor compactions. What happens is that due to the size of the size of the sstable that remains after your major compaction, it will never be compacted with the upcoming new sstables, and because of that, your read performance will go down until you run an other major compaction.
>
> "2. Switch to Leveled compaction strategy.
>       - It is mentioned to help with deletes and disk space usage. Can someone confirm?"
>
> From what I know, Leveled compaction will not free disk space. It will allow you to use a greater percentage of your total disk space (50% max for sized tier compaction vs about 80% for leveled compaction)
>
> "Our usage pattern is write once, read once (export) and delete once! "
>
> In this case, I think that leveled compaction fits your needs.
>
> "Can anyone suggest which (if any) is better? Are there better solutions?"
>
> Are your sstable compressed ? You have 2 types of built-in compression and you may use them depending on the model of each of your CF.
>
> see: http://www.datastax.com/docs/1.1/operations/tuning#configure-compression
>
> Alain
>
> 2012/11/22 Alexandru Sicoe <ad...@gmail.com>>
> We are running a 3 node Cassandra 1.1.5 cluster with a 3TB Raid 0 disk per node for the data dir and separate disk for the commitlog, 12 cores, 24 GB RAM (12GB to Cassandra heap).
>





Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.

Posted by aaron morton <aa...@thelastpickle.com>.
> Meaning terabyte size databases. 
> 
Lots of people have TB sized systems. Just add more nodes. 
300 to 400 GB is just a rough guideline. The bigger picture is considering how routine and non-routine maintenance tasks are going to be carried out.

Cheers
  
-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com


Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.

Posted by Edward Capriolo <ed...@gmail.com>.
http://wiki.apache.org/cassandra/LargeDataSetConsiderations



RE: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.

Posted by "Poziombka, Wade L" <wa...@intel.com>.
"Having so much data on each node is a potential bad day."

Is this discussed somewhere in the Cassandra documentation (limits, practices etc)?  We are also trying to load up quite a lot of data and have hit memory issues (bloom filter etc.) in 1.0.10.  I would like to read up on big data usage of Cassandra.  Meaning terabyte size databases.

I do get your point about the amount of time required to recover a downed node. But this 300-400GB business is interesting to me.

Thanks in advance.

Wade



Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.

Posted by aaron morton <aa...@thelastpickle.com>.
> Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected!
I would recommend having up to 300GB to 400GB per node on a regular HDD with 1GB networking. 

> But on the 3rd node, we suspect major compaction didn't actually finish its job…
The file list looks odd. Check the timestamps on the files. You should not have files older than when the compaction started. 
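
For example, something like this (using the data directory from your listing) will show the newest files first:

ls -lt /data_bst/cassandra/data/ATLAS/Data/*-Data.db | head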

> 8GB heap 
The default is 4GB max nowadays. 
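
If you do set the heap explicitly it normally goes in conf/cassandra-env.sh, roughly like this (8G is just the figure discussed in this thread, not a blanket recommendation):

MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="1200M"    # the comments in that file suggest roughly 100MB per physical core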

> 1) Do you expect problems with the 3rd node during 2 weeks more of operations, in the conditions seen below? 
I cannot answer that. 

> 2) Should we restart with leveled compaction next year? 
I would run some tests to see how it works for your workload. 

> 4) Should we consider increasing the cluster capacity?
IMHO yes.
You may also want to do some experiments with turning compression on if it is not already enabled. 
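
If I remember the cassandra-cli syntax correctly it is something along these lines (check the option names against the DataStax 1.1 tuning docs):

cassandra-cli -h $HOSTNAME
[default@unknown] use ATLAS;
[default@ATLAS] update column family Data with compression_options = {sstable_compression:SnappyCompressor, chunk_length_kb:64};

Existing sstables should only pick up compression as they are rewritten, e.g. by compaction or nodetool upgradesstables.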

Having so much data on each node is a potential bad day. If instead you had to move or repair one of those nodes, how long would it take for Cassandra to stream all the data over? (Or to rsync the data over.) How long does it take to run nodetool repair on the node?
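
Even a rough timing run on one node tells you a lot, e.g. (keyspace / column family names from your listing):

time nodetool -h $HOSTNAME repair ATLAS Data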

With RF 3, if you lose a node you have lost your redundancy. It's important to have a plan about how to get it back and how long it may take.   

Hope that helps. 

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 6/12/2012, at 3:40 AM, Alexandru Sicoe <ad...@gmail.com> wrote:

> Hi guys,
> Sorry for the late follow-up but I waited to run major compactions on all 3 nodes at a time before replying with my findings.
> 
> Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected!
> 
> But on the 3rd node, we suspect major compaction didn't actually finish its job. First of all nodetool compact returned much earlier than the rest - after one day and 15 hrs. Secondly, from the 1.4TBs initially on the node only about 36GB were freed up (almost the same size as before). Saw nothing in the server log (debug not enabled). Below I pasted some more details about file sizes before and after compaction on this third node and disk occupancy.
> 
> The situation is maybe not so dramatic for us because in less than 2 weeks we will have a down time till after the new year. During this we can completely delete all the data in the cluster and start fresh with TTLs for 1 month (as suggested by Aaron and 8GB heap as suggested by Alain - thanks).
> 
> Questions:
> 
> 1) Do you expect problems with the 3rd node during 2 weeks more of operations, in the conditions seen below? 
> [Note: we expect the minor compactions to continue building up files but never really getting to compacting the large file and thus not needing much temporarily extra disk space].
> 
> 2) Should we restart with leveled compaction next year? 
> [Note: Aaron was right, we have 1 week rows which get deleted after 1 month which means older rows end up in big files => to free up space with SizeTiered we will have no choice but run major compactions which we don't know if they will work provided that we get at ~1TB / node / 1 month. You can see we are at the limit!]
> 
> 3) In case we keep SizeTiered:
> 
>     - How can we improve the performance of our major compactions? (we left all config parameters as default). Would increasing compactions throughput interfere with writes and reads? What about multi-threaded compactions?
> 
>     - Do we still need to run regular repair operations as well? Do these also do a major compaction or are they completely separate operations? 
> 
> [Note: we have 3 nodes with RF=2 and inserting at consistency level 1 and reading at consistency level ALL. We read primarily for exporting reasons - we export 1 week worth of data at a time].
> 
> 4) Should we consider increasing the cluster capacity?
> [We generate ~5million new rows every week which shouldn't come close to the hundreds of millions of rows on a node mentioned by Aaron which are the volumes that would create problems with bloom filters and indexes].
> 
> Cheers,
> Alex
> ------------------
> 
> The situation in the data folder 
> 
>     before calling nodetool compact:
> 
> du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
> 444G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
> 376G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-46431-Data.db
> 305G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-68959-Data.db
> 39G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-7352-Data.db
> 78G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-74076-Data.db
> 81G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-79663-Data.db
> 205M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80370-Data.db
> 20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80968-Data.db
> 20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-82330-Data.db
> 20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-83710-Data.db
> 4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84015-Data.db
> 4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84356-Data.db
> 4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84696-Data.db
> 333M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84707-Data.db
> 92M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84712-Data.db
> 92M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84717-Data.db
> 99M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84722-Data.db
> 2.5G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-tmp-he-84723-Data.db
> 1.4T    total
> 
>     after nodetool compact returned:
> 
> du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
> 444G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
> 910G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84723-Data.db
> 19G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-86229-Data.db
> 19G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87639-Data.db
> 5.0G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87923-Data.db
> 4.8G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88261-Data.db
> 338M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88271-Data.db
> 339M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88292-Data.db
> 339M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88312-Data.db
> 98M  
> 
> 
> Looking at the disk occupancy for the logical partition where the data folder is in:
> 
> df /data_bst
> Filesystem           1K-blocks      Used Available Use% Mounted on
> /dev/sdb1            2927242720 1482502260 1444740460  51% /data_bst
> 
> 
> and the situation in the cluster
> 
> nodetool -h $HOSTNAME ring (before major compaction)
> Address         DC          Rack        Status State   Load            Effective-Ownership Token                                       
>                                                                                            113427455640312821154458202477256070484     
> 10.146.44.17    datacenter1 rack1       Up     Normal  1.37 TB         66.67%              0                                           
> 10.146.44.18    datacenter1 rack1       Up     Normal  1.04 TB         66.67%              56713727820156410577229101238628035242      
> 10.146.44.32    datacenter1 rack1       Up     Normal  1.14 TB         66.67%              113427455640312821154458202477256070484
> 
> nodetool -h $HOSTNAME ring (after major compaction) (Note we were inserting data in the meantime)
> Address         DC          Rack        Status State   Load            Effective-Ownership Token                                       
>                                                                                            113427455640312821154458202477256070484     
> 10.146.44.17    datacenter1 rack1       Up     Normal  1.38 TB         66.67%              0                                           
> 10.146.44.18    datacenter1 rack1       Up     Normal  1.08 TB         66.67%              56713727820156410577229101238628035242      
> 10.146.44.32    datacenter1 rack1       Up     Normal  1.19 TB         66.67%              113427455640312821154458202477256070484