Posted to user@cassandra.apache.org by Julie <ju...@nextcentury.com> on 2010/07/07 17:56:39 UTC

Cassandra disk space utilization

Hi guys,
I have what may be a dumb question, but I am confused by how much disk space is 
being used by my Cassandra nodes.  I have 10 nodes in my cluster with a 
replication factor of 3.  After I write 1,000,000 rows to the database (100kB 
each), I see that they have been distributed very evenly, about 100,000 rows 
per node, but because of the replication factor of 3, each node contains about 
300,000 rows.  This is all good.  Since my rows are 100kB each, I expect each 
node to store about 30GB of data; however, that is not what I am seeing.  
Instead, I am seeing some nodes that do not hit any compaction exceptions yet 
report their space used as MUCH more than that.  Here's one using 106 GB 
of disk.  My disks are only 160 GB, so this is cutting it very close, and I 
thought my node would be able to store more data than this.
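
For reference, that expectation works out as follows (a rough back-of-the-envelope
sketch in Python, not part of the actual test code):

# Expected per-node load from the figures above: 1M rows of ~100kB, RF=3, 10 nodes.
rows = 1000000
row_bytes = 100 * 1000            # ~100kB per row
replication_factor = 3
nodes = 10
per_node_gb = rows * row_bytes * replication_factor / float(nodes) / 1e9
print("expected per-node load: ~%.0f GB" % per_node_gb)   # ~30 GB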

I only use a single column family so here is the cfstats output from one of my 
nodes (server5):

		Column Family: Standard1
		SSTable count: 12
		Space used (live): 113946099884
		Space used (total): 113946099884
		Memtable Columns Count: 0
		Memtable Data Size: 0
		Memtable Switch Count: 451
		Read Count: 31786
		Read Latency: 161.429 ms.
		Write Count: 300633
		Write Latency: 0.124 ms.
		Pending Tasks: 0
		Key cache: disabled
		Row cache capacity: 3000
		Row cache size: 3000
		Row cache hit rate: 0.38331340841880074
		Compacted row minimum size: 100220
		Compacted row maximum size: 100225
		Compacted row mean size: 100224
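
For comparison, dividing the reported live space by the write count (which should
roughly equal the number of rows stored on this node, since every key is unique)
gives the implied on-disk size per row (a rough check only):

# Implied on-disk size per row on this node, from the cfstats output above.
# Assumes the write count (~300k) is roughly the number of rows stored here.
space_used_live = 113946099884
rows_on_node = 300633
kb_per_row = space_used_live / float(rows_on_node) / 1000
print("~%.0f kB per row on disk" % kb_per_row)   # ~379 kB vs. a ~100 kB compacted mean row size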

Note that I wrote these 1M rows of data yesterday and the system has had 24 
hours to digest it. There are no exceptions in the system.log file.  Here's 
the tail end of it:

...
INFO [SSTABLE-CLEANUP-TIMER] 2010-07-06 16:13:43,162
SSTableDeletingReference.java (line 104) Deleted
/var/lib/cassandra/data/Keyspace1/Standard1-430-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-07-06 16:13:43,269
SSTableDeletingReference.java (line 104) Deleted
/var/lib/cassandra/data/Keyspace1/Standard1-445-Data.db
 INFO [COMPACTION-POOL:1] 2010-07-06 16:35:21,718 CompactionManager.java (line 
246) Compacting []
 INFO [Timer-1] 2010-07-06 17:01:01,907 Gossiper.java (line 179) InetAddress 
/10.248.107.19 is now dead.
 INFO [GMFD:1] 2010-07-06 17:01:42,039 Gossiper.java (line 568) InetAddress 
/10.248.107.19 is now UP
 INFO [COMPACTION-POOL:1] 2010-07-06 17:35:21,306 CompactionManager.java (line 
246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-07-06 18:35:20,802 CompactionManager.java (line 
246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-07-06 19:35:20,389 CompactionManager.java (line 
246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-07-06 20:35:19,934 CompactionManager.java (line 
246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-07-06 21:35:19,582 CompactionManager.java (line 
246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-07-06 22:35:19,233 CompactionManager.java (line 
246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-07-06 23:35:18,593 CompactionManager.java (line 
246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-07-07 00:35:18,076 CompactionManager.java (line 
246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-07-07 01:35:17,673 CompactionManager.java (line 
246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-07-07 02:35:17,172 CompactionManager.java (line 
246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-07-07 03:35:16,784 CompactionManager.java (line 
246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-07-07 04:35:16,383 CompactionManager.java (line 
246) Compacting []

Thank you for your help!!
Julie



Re: RackAwareStrategy vs RackUnAwareStrategy on AWS EC2 cloud

Posted by Dave Viner <da...@pobox.com>.
Hi,

Can you post the stress test code and storage.conf used?

I have a cluster in EC2 using RackAware.  However, I am in 1 region
(us-east-1) but 2 Availability Zones.  Amazon helps to ensure that AZs are
isolated from each other, creating a fail-resistant cluster.  But staying in
the same region allows for higher throughput numbers.

Dave Viner


On Fri, Jul 9, 2010 at 10:36 AM, maneela a <ma...@yahoo.com> wrote:

> Are there any known performance issues if cassandra cluster launched with
> RackAwareStrategy because I see huge performance difference between
> RackAwareStrategy vs RackUnAwareStrategy.  Here are details:
>
>
>
> we have a cluster setup with 4 EC2 X large nodes, 3 of them are running in
> East region and 4th one is running in West region and they all communicate
> with each other through VPN tunnel interface which is only way we found to
> achieve ring architecture across Amazon cloud regions:
>
>
>
> we are able to process 3.5K write operations per second when we used
> RackUnAwareStrategy whereas
>
>
> :/home/ubuntu/cassandra/contrib/py_stress# ./stress.py -o insert -n 80000
> -y regular -d ec2-xxx-xxx-xxx-xx.compute-1.amazonaws.com --threads 100
> --keep-going
>
> total,interval_op_rate,avg_latency,elapsed_time
>
> 35935,3593,0.0289930914479,10
>
> 70531,3459,0.0289145907593,20
>
> 80000,946,0.0267288666213,30
>
>
> whereas we are able to process only 250 write operations per second when we
> used RackAwareStrategy
>
>
> :/home/ubuntu/cassandra/contrib/py_stress# ./stress.py -o insert -n 80000
> -y regular -d ec2-xxx-xxx-xxx-xx.compute-1.amazonaws.com --threads 100
> --keep-going
>
> total,interval_op_rate,avg_latency,elapsed_time
>
> 2327,232,0.434396038355,10
>
> 4772,244,0.40946514036,20
>
> 7383,261,0.384504625415,30
>
> 9924,254,0.392919449861,40
>
> 12525,260,0.383832110482,50
>
> 15158,263,0.378838069983,60
>
> 17784,262,0.383219807364,70
>
> 20416,263,0.381646275973,80
>
> 23030,261,0.382550528602,90
>
> 25644,261,0.384442176815,100
>
> 28268,262,0.380935921084,110
>
> 30910,264,0.377376309224,120
>
> 33541,263,0.385158945698,130
>
> 36119,257,0.387976026517,140
>
> 38735,261,0.382333525368,150
>
> 41342,260,0.38413751514,160
>
> 43925,258,0.387684800391,170
>
> 46642,271,0.36899637237,180
>
> 49291,264,0.378489510164,190
>
> 51931,264,0.3793784538,200
>
> 54573,264,0.378474057217,210
>
> 57253,268,0.374258003573,220
>
> 59884,263,0.380020038658,230
>
> 62484,260,0.387267011954,240
>
> 64728,224,0.439328571054,250
>
> 67340,261,0.389221810455,260
>
> 69920,258,0.386144905127,270
>
> 72531,261,0.384242234948,280
>
> 75202,267,0.372129596605,290
>
> 77843,264,0.354621512291,300
>
> 80000,215,0.183918378283,310
>
> Thanks in advance
>
> Niru
>
>
>

RE: Help! Cassandra disk space utilization WAY higher than I would expect

Posted by Stu Hood <st...@rackspace.com>.
Cassandra currently has a very high constant per-row overhead of around 40 bytes. Additionally, there is around 12 bytes of overhead per column. Finally, column names are repeated for each row.
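
For scale, against the ~100kB rows in this thread those constants are tiny. A
rough sketch (it assumes a single ~100kB column per row, which the thread does
not actually state):

# Rough per-row metadata overhead from the figures above, assuming a single
# ~100kB column per row (an assumption; the actual column layout isn't stated).
row_overhead = 40          # bytes of constant per-row overhead
column_overhead = 12       # bytes of per-column overhead
column_name_bytes = 10     # repeated column name, assumed ~10 bytes
row_payload = 100 * 1000   # ~100kB of actual data per row
overhead = row_overhead + column_overhead + column_name_bytes
print("~%d bytes of overhead per row, ~%.3f%% of the payload"
      % (overhead, 100.0 * overhead / row_payload))   # ~62 bytes, well under 0.1%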

CASSANDRA-674 and CASSANDRA-1207 will help with these overheads, but they will not be fixed until 0.8. The file format change should bring lovely things like compression and variable-length encoding, from which Cassandra will gain huge benefits.

But, "disk is cheap"... the solution for now is to add more nodes. And why not?

Thanks,
Stu


-----Original Message-----
From: "Julie" <ju...@nextcentury.com>
Sent: Friday, July 9, 2010 9:58am
To: user@cassandra.apache.org
Subject: Help! Cassandra disk space utilization WAY higher than I would expect

Hi guys,
I am on the hook to explain why 30GB of data is filling up 106GB of disk space
since this is concerning information for my project.  

We are very excited about the possibility of using Cassandra but need to
understand this anomaly in order to feel confident.  Does anyone know why this
could be happening?

cfstats reports that space used live is equal to space used total, so I think the
data is truly taking up 106GB; I just can't explain why.

 		Space used (live): 113946099884
 		Space used (total): 113946099884

Thank you for any guidance!
Julie






Re: RackAwareStrategy vs RackUnAwareStrategy on AWS EC2 cloud

Posted by maneela a <ma...@yahoo.com>.
ConsistencyLevel.ONE is the default option in stress.py, so I am using the default.

--- On Fri, 7/9/10, Bill de hÓra <bi...@dehora.net> wrote:

From: Bill de hÓra <bi...@dehora.net>
Subject: Re: RackAwareStrategy vs RackUnAwareStrategy on AWS EC2 cloud
To: user@cassandra.apache.org
Date: Friday, July 9, 2010, 2:12 PM

  east: A B C 

  west: D

Perhaps you are blocking on a write to D - what's your quorum/rf set up
as?


Bill

On Fri, 2010-07-09 at 10:36 -0700, maneela a wrote:
>         Are there any known performance issues if cassandra cluster
>         launched with RackAwareStrategy because I see huge performance
>         difference between RackAwareStrategy vs RackUnAwareStrategy.
>         Here are details:
>         
>         
>         
>         we have a cluster setup with 4 EC2 X large nodes, 3 of them
>         are running in East region and 4th one is running in West
>         region and they all communicate with each other through VPN
>         tunnel interface which is only way we found to achieve ring
>         architecture across Amazon cloud regions:
>         
>         
>         
>         we are able to process 3.5K write operations per second when
>         we used RackUnAwareStrategy whereas 
>         
>         
>         :/home/ubuntu/cassandra/contrib/py_stress# ./stress.py -o
>         insert -n 80000 -y regular -d
>         ec2-xxx-xxx-xxx-xx.compute-1.amazonaws.com --threads 100
>         --keep-going
>         
>         total,interval_op_rate,avg_latency,elapsed_time
>         
>         35935,3593,0.0289930914479,10
>         
>         70531,3459,0.0289145907593,20
>         
>         80000,946,0.0267288666213,30
>         
>         
>         whereas we are able to process only 250 write operations per
>         second when we used RackAwareStrategy
>         
>         
>         :/home/ubuntu/cassandra/contrib/py_stress# ./stress.py -o
>         insert -n 80000 -y regular -d
>         ec2-xxx-xxx-xxx-xx.compute-1.amazonaws.com --threads 100
>         --keep-going
>         
>         total,interval_op_rate,avg_latency,elapsed_time
>         
>         2327,232,0.434396038355,10
>         
>         4772,244,0.40946514036,20
>         
>         7383,261,0.384504625415,30
>         
>         9924,254,0.392919449861,40
>         
>         12525,260,0.383832110482,50
>         
>         15158,263,0.378838069983,60
>         
>         17784,262,0.383219807364,70
>         
>         20416,263,0.381646275973,80
>         
>         23030,261,0.382550528602,90
>         
>         25644,261,0.384442176815,100
>         
>         28268,262,0.380935921084,110
>         
>         30910,264,0.377376309224,120
>         
>         33541,263,0.385158945698,130
>         
>         36119,257,0.387976026517,140
>         
>         38735,261,0.382333525368,150
>         
>         41342,260,0.38413751514,160
>         
>         43925,258,0.387684800391,170
>         
>         46642,271,0.36899637237,180
>         
>         49291,264,0.378489510164,190
>         
>         51931,264,0.3793784538,200
>         
>         54573,264,0.378474057217,210
>         
>         57253,268,0.374258003573,220
>         
>         59884,263,0.380020038658,230
>         
>         62484,260,0.387267011954,240
>         
>         64728,224,0.439328571054,250
>         
>         67340,261,0.389221810455,260
>         
>         69920,258,0.386144905127,270
>         
>         72531,261,0.384242234948,280
>         
>         75202,267,0.372129596605,290
>         
>         77843,264,0.354621512291,300
>         
>         80000,215,0.183918378283,310
>         
>         
>         
>         Thanks in advance
>         
>         
>         Niru
> 





      

Re: RackAwareStrategy vs RackUnAwareStrategy on AWS EC2 cloud

Posted by Bill de hÓra <bi...@dehora.net>.
  east: A B C 

  west: D

Perhaps you are blocking on a write to D - what's your quorum/rf set up
as?


Bill

On Fri, 2010-07-09 at 10:36 -0700, maneela a wrote:
>         Are there any known performance issues if cassandra cluster
>         launched with RackAwareStrategy because I see huge performance
>         difference between RackAwareStrategy vs RackUnAwareStrategy.
>         Here are details:
>         
>         
>         
>         we have a cluster setup with 4 EC2 X large nodes, 3 of them
>         are running in East region and 4th one is running in West
>         region and they all communicate with each other through VPN
>         tunnel interface which is only way we found to achieve ring
>         architecture across Amazon cloud regions:
>         
>         
>         
>         we are able to process 3.5K write operations per second when
>         we used RackUnAwareStrategy whereas 
>         
>         
>         :/home/ubuntu/cassandra/contrib/py_stress# ./stress.py -o
>         insert -n 80000 -y regular -d
>         ec2-xxx-xxx-xxx-xx.compute-1.amazonaws.com --threads 100
>         --keep-going
>         
>         total,interval_op_rate,avg_latency,elapsed_time
>         
>         35935,3593,0.0289930914479,10
>         
>         70531,3459,0.0289145907593,20
>         
>         80000,946,0.0267288666213,30
>         
>         
>         whereas we are able to process only 250 write operations per
>         second when we used RackAwareStrategy
>         
>         
>         :/home/ubuntu/cassandra/contrib/py_stress# ./stress.py -o
>         insert -n 80000 -y regular -d
>         ec2-xxx-xxx-xxx-xx.compute-1.amazonaws.com --threads 100
>         --keep-going
>         
>         total,interval_op_rate,avg_latency,elapsed_time
>         
>         2327,232,0.434396038355,10
>         
>         4772,244,0.40946514036,20
>         
>         7383,261,0.384504625415,30
>         
>         9924,254,0.392919449861,40
>         
>         12525,260,0.383832110482,50
>         
>         15158,263,0.378838069983,60
>         
>         17784,262,0.383219807364,70
>         
>         20416,263,0.381646275973,80
>         
>         23030,261,0.382550528602,90
>         
>         25644,261,0.384442176815,100
>         
>         28268,262,0.380935921084,110
>         
>         30910,264,0.377376309224,120
>         
>         33541,263,0.385158945698,130
>         
>         36119,257,0.387976026517,140
>         
>         38735,261,0.382333525368,150
>         
>         41342,260,0.38413751514,160
>         
>         43925,258,0.387684800391,170
>         
>         46642,271,0.36899637237,180
>         
>         49291,264,0.378489510164,190
>         
>         51931,264,0.3793784538,200
>         
>         54573,264,0.378474057217,210
>         
>         57253,268,0.374258003573,220
>         
>         59884,263,0.380020038658,230
>         
>         62484,260,0.387267011954,240
>         
>         64728,224,0.439328571054,250
>         
>         67340,261,0.389221810455,260
>         
>         69920,258,0.386144905127,270
>         
>         72531,261,0.384242234948,280
>         
>         75202,267,0.372129596605,290
>         
>         77843,264,0.354621512291,300
>         
>         80000,215,0.183918378283,310
>         
>         
>         
>         Thanks in advance
>         
>         
>         Niru
> 



Re: RackAwareStrategy vs RackUnAwareStrategy on AWS EC2 cloud

Posted by Joe Stump <jo...@joestump.net>.
On Jul 9, 2010, at 1:16 PM, maneela a wrote:

> Is there any way to mark cassandra node to keep it as just for replication purpose and not to be as Primary for any data range in the ring? 

I believe there is. This is what we're doing, but we do all of our writes via a queue. Derek or Mike from SimpleGeo (both on the list) can probably chime in with a better explanation than I can.

--Joe


Re: RackAwareStrategy vs RackUnAwareStrategy on AWS EC2 cloud

Posted by maneela a <ma...@yahoo.com>.
Thanks for your quick reply, Joe. I forgot to mention that we are using PropertyFileEndPointSnitch to tell Cassandra about our network topology; below is the property file used by that class:

cat rack.properties
10.9.0.6=east:r1b
10.9.0.18=east:r1c
10.9.0.14=east:r1d
10.9.0.10=west:r1a
default=east:rdef

At first glance, VPNCubed uses OpenVPN software as part of the transport layer, so I did not look into it much deeper. I will work on it to see if it helps in our setup.

We are looking for something like an Oracle primary/standby solution, where write operations happen only on a primary set of nodes (in our case, the nodes running in different AZs in the East region) and one copy of each data block is replicated to the node running in the West region, so that reads can be served locally in either region; we have applications that consume Cassandra data running in both the East and West regions. We can accept write latency between regions because readers won't look for the data immediately; there is roughly a 5-10 minute gap between write and read operations.

Is there any way to mark a Cassandra node as replication-only, so that it is not the primary for any data range in the ring?
Niru

--- On Fri, 7/9/10, Joe Stump <jo...@joestump.net> wrote:

From: Joe Stump <jo...@joestump.net>
Subject: Re: RackAwareStrategy vs RackUnAwareStrategy on AWS EC2 cloud
To: user@cassandra.apache.org
Date: Friday, July 9, 2010, 1:41 PM

We had similar issues when we started running Cassandra on EC2 between multiple AZ's (not regions; we're working up to that shortly). We ended up building a rack aware strategy specific to AWS, which is posted somewhere in JIRA. Basically it uses the AWS API to ensure that replicants are stored in each AZ. We then ensure that our clients are only reading from nodes in a given AZ. What I'm guessing is that you're seeing latency issues between regions combined with a higher consistency level than what we use.
Also, SSL tunnels are hard to scale from the management side. The Amazon folks have told us that VPNCubed is a better solution for such things.
--Joe

On Jul 9, 2010, at 11:36 AM, maneela a wrote:
Are there any known performance issues if cassandra cluster launched with RackAwareStrategy because I see huge performance difference between RackAwareStrategy vs RackUnAwareStrategy.  Here are details:

we have a cluster setup with 4 EC2 X large nodes, 3 of them are running in East region and 4th one is running in West region and they all communicate with each other through VPN tunnel interface which is only way we found to achieve ring architecture across Amazon cloud regions:

we are able to process 3.5K write operations per second when we used RackUnAwareStrategy whereas 
:/home/ubuntu/cassandra/contrib/py_stress# ./stress.py -o insert -n 80000 -y regular -d ec2-xxx-xxx-xxx-xx.compute-1.amazonaws.com --threads 100 --keep-going
total,interval_op_rate,avg_latency,elapsed_time
35935,3593,0.0289930914479,10
70531,3459,0.0289145907593,20
80000,946,0.0267288666213,30
whereas we are able to process only 250 write operations per second when we used RackAwareStrategy
:/home/ubuntu/cassandra/contrib/py_stress# ./stress.py -o insert -n 80000 -y regular -d ec2-xxx-xxx-xxx-xx.compute-1.amazonaws.com --threads 100 --keep-going
total,interval_op_rate,avg_latency,elapsed_time
2327,232,0.434396038355,10
4772,244,0.40946514036,20
7383,261,0.384504625415,30
9924,254,0.392919449861,40
12525,260,0.383832110482,50
15158,263,0.378838069983,60
17784,262,0.383219807364,70
20416,263,0.381646275973,80
23030,261,0.382550528602,90
25644,261,0.384442176815,100
28268,262,0.380935921084,110
30910,264,0.377376309224,120
33541,263,0.385158945698,130
36119,257,0.387976026517,140
38735,261,0.382333525368,150
41342,260,0.38413751514,160
43925,258,0.387684800391,170
46642,271,0.36899637237,180
49291,264,0.378489510164,190
51931,264,0.3793784538,200
54573,264,0.378474057217,210
57253,268,0.374258003573,220
59884,263,0.380020038658,230
62484,260,0.387267011954,240
64728,224,0.439328571054,250
67340,261,0.389221810455,260
69920,258,0.386144905127,270
72531,261,0.384242234948,280
75202,267,0.372129596605,290
77843,264,0.354621512291,300
80000,215,0.183918378283,310
Thanks in advance
Niru








      



      

Re: RackAwareStrategy vs RackUnAwareStrategy on AWS EC2 cloud

Posted by Joe Stump <jo...@joestump.net>.
We had similar issues when we started running Cassandra on EC2 between multiple AZ's (not regions; we're working up to that shortly). We ended up building a rack aware strategy specific to AWS, which is posted somewhere in JIRA. Basically it uses the AWS API to ensure that replicants are stored in each AZ. We then ensure that our clients are only reading from nodes in a given AZ. What I'm guessing is that you're seeing latency issues between regions combined with a higher consistency level than what we use.
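
For reference, one way such a strategy can discover placement is via EC2's
instance metadata service (a minimal sketch of that lookup only, not the
strategy posted in JIRA; it must run on the EC2 instance itself):

from urllib.request import urlopen

# Ask the EC2 instance metadata service which availability zone this node is in.
# 169.254.169.254 is the standard metadata endpoint, reachable only from inside
# an EC2 instance.
url = "http://169.254.169.254/latest/meta-data/placement/availability-zone"
az = urlopen(url, timeout=2).read().decode()
print(az)   # e.g. "us-east-1a"; a rack-aware strategy can map this to a "rack"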

Also, SSL tunnels are hard to scale from the management side. The Amazon folks have told us that VPNCubed is a better solution for such things.

--Joe


On Jul 9, 2010, at 11:36 AM, maneela a wrote:

> Are there any known performance issues if cassandra cluster launched with RackAwareStrategy because I see huge performance difference between RackAwareStrategy vs RackUnAwareStrategy.  Here are details:
> 
> 
> we have a cluster setup with 4 EC2 X large nodes, 3 of them are running in East region and 4th one is running in West region and they all communicate with each other through VPN tunnel interface which is only way we found to achieve ring architecture across Amazon cloud regions:
> 
> 
> we are able to process 3.5K write operations per second when we used RackUnAwareStrategy whereas 
> 
> :/home/ubuntu/cassandra/contrib/py_stress# ./stress.py -o insert -n 80000 -y regular -d ec2-xxx-xxx-xxx-xx.compute-1.amazonaws.com --threads 100 --keep-going
> total,interval_op_rate,avg_latency,elapsed_time
> 35935,3593,0.0289930914479,10
> 70531,3459,0.0289145907593,20
> 80000,946,0.0267288666213,30
> 
> whereas we are able to process only 250 write operations per second when we used RackAwareStrategy
> 
> :/home/ubuntu/cassandra/contrib/py_stress# ./stress.py -o insert -n 80000 -y regular -d ec2-xxx-xxx-xxx-xx.compute-1.amazonaws.com --threads 100 --keep-going
> total,interval_op_rate,avg_latency,elapsed_time
> 2327,232,0.434396038355,10
> 4772,244,0.40946514036,20
> 7383,261,0.384504625415,30
> 9924,254,0.392919449861,40
> 12525,260,0.383832110482,50
> 15158,263,0.378838069983,60
> 17784,262,0.383219807364,70
> 20416,263,0.381646275973,80
> 23030,261,0.382550528602,90
> 25644,261,0.384442176815,100
> 28268,262,0.380935921084,110
> 30910,264,0.377376309224,120
> 33541,263,0.385158945698,130
> 36119,257,0.387976026517,140
> 38735,261,0.382333525368,150
> 41342,260,0.38413751514,160
> 43925,258,0.387684800391,170
> 46642,271,0.36899637237,180
> 49291,264,0.378489510164,190
> 51931,264,0.3793784538,200
> 54573,264,0.378474057217,210
> 57253,268,0.374258003573,220
> 59884,263,0.380020038658,230
> 62484,260,0.387267011954,240
> 64728,224,0.439328571054,250
> 67340,261,0.389221810455,260
> 69920,258,0.386144905127,270
> 72531,261,0.384242234948,280
> 75202,267,0.372129596605,290
> 77843,264,0.354621512291,300
> 80000,215,0.183918378283,310
> 
> Thanks in advance
> 
> Niru
> 


RackAwareStrategy vs RackUnAwareStrategy on AWS EC2 cloud

Posted by maneela a <ma...@yahoo.com>.
Are there any known performance issues when a Cassandra cluster is launched with RackAwareStrategy? I see a huge performance difference between RackAwareStrategy and RackUnAwareStrategy.  Here are the details:




We have a cluster of 4 EC2 X-Large nodes; 3 of them are running in the East region and the 4th is running in the West region, and they all communicate with each other through a VPN tunnel interface, which is the only way we found to achieve a ring architecture across Amazon cloud regions:




We were able to process 3.5K write operations per second when we used RackUnAwareStrategy:


:/home/ubuntu/cassandra/contrib/py_stress# ./stress.py -o insert -n 80000 -y regular -d ec2-xxx-xxx-xxx-xx.compute-1.amazonaws.com --threads 100 --keep-going
total,interval_op_rate,avg_latency,elapsed_time
35935,3593,0.0289930914479,10
70531,3459,0.0289145907593,20
80000,946,0.0267288666213,30


whereas we were able to process only 250 write operations per second when we used RackAwareStrategy:


:/home/ubuntu/cassandra/contrib/py_stress# ./stress.py -o insert -n 80000 -y regular -d ec2-xxx-xxx-xxx-xx.compute-1.amazonaws.com --threads 100 --keep-going
total,interval_op_rate,avg_latency,elapsed_time
2327,232,0.434396038355,10
4772,244,0.40946514036,20
7383,261,0.384504625415,30
9924,254,0.392919449861,40
12525,260,0.383832110482,50
15158,263,0.378838069983,60
17784,262,0.383219807364,70
20416,263,0.381646275973,80
23030,261,0.382550528602,90
25644,261,0.384442176815,100
28268,262,0.380935921084,110
30910,264,0.377376309224,120
33541,263,0.385158945698,130
36119,257,0.387976026517,140
38735,261,0.382333525368,150
41342,260,0.38413751514,160
43925,258,0.387684800391,170
46642,271,0.36899637237,180
49291,264,0.378489510164,190
51931,264,0.3793784538,200
54573,264,0.378474057217,210
57253,268,0.374258003573,220
59884,263,0.380020038658,230
62484,260,0.387267011954,240
64728,224,0.439328571054,250
67340,261,0.389221810455,260
69920,258,0.386144905127,270
72531,261,0.384242234948,280
75202,267,0.372129596605,290
77843,264,0.354621512291,300
80000,215,0.183918378283,310
Thanks in advance
Niru
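
A rough way to read the stress.py output above: overall throughput is just the
final cumulative total divided by the final elapsed time (a quick sketch, not
part of stress.py):

# Overall write throughput implied by the final line of each run above.
rack_unaware = 80000 / 30.0    # ~2,700 ops/s overall; interval peaks near 3.5K/s
rack_aware = 80000 / 310.0     # ~260 ops/s, in line with the ~250/s figure quoted
print("RackUnAware: ~%d ops/s, RackAware: ~%d ops/s" % (rack_unaware, rack_aware))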


      

Help! Cassandra disk space utilization WAY higher than I would expect

Posted by Julie <ju...@nextcentury.com>.
Hi guys,
I am on the hook to explain why 30GB of data is filling up 106GB of disk space
since this is concerning information for my project.  

We are very excited about the possibility of using Cassandra but need to
understand this anomaly in order to feel confident.  Does anyone know why this
could be happening?

cfstats reports that space used live is equal to space used total, so I think the
data is truly taking up 106GB; I just can't explain why.

 		Space used (live): 113946099884
 		Space used (total): 113946099884

Thank you for any guidance!
Julie




Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Peter Schuller <pe...@infidyne.com>.
> One thing to keep in mind is that SSTables are not actually removed
> from disk until the garbage collector has identified the relevant
> in-memory structures as garbage (there is a note on the wiki about

However I forgot that the 'load' reported by nodetool ring does not, I
think, represent on-disk size. So nevermind on this bit.

-- 
/ Peter Schuller

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Peter Schuller <pe...@infidyne.com>.
> So the manual compaction did help somewhat but did not get the nodes down to the
> size of their raw data.  There are still multiple SSTables on most nodes.
>
> At 4:02pm, ran nodetool cleanup on every node.
>
> At 4:12pm, nodes are taking up the expected amount of space and all nodes are
> using exactly 1 SSTable (fully compacted):

One thing to keep in mind is that SSTables are not actually removed
from disk until the garbage collector has identified the relevant
in-memory structures as garbage (there is a note on the wiki about
this somewhere; it's a way to avoid the complexity of keeping track of
when an sstable becomes safe to delete).

I may be wrong, but I did a quick check and did not find an obvious GC
trigger in the codepath for the 'cleanup' command. So while I'm not
sure why the cleanup would necessarily help other than generally
generating garbage and perhaps triggering a GC, a delay in actually
freeing disk space can probably be attributed to the GC.

(The reason I don't understand why cleanup would help is that even if
cleanup did trigger sufficient garbage generation that CMS kicks in
and does a mark/sweep, thus triggering the deletion of old sstables,
presumably the cleanup itself would produce new sstables that would
then have to wait anyway. Unless there is some code path to avoid
doing that if nothing at all changes in the sstables as a result of
the cleanup... I don't know.)

-- 
/ Peter Schuller
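
A toy illustration of that mechanism (plain Python as a model of the idea, not
Cassandra's actual Java code): deleting the data file is tied to the reclamation
of an in-memory reference object rather than done eagerly when the sstable
becomes obsolete.

import gc
import os
import weakref

class SSTableReference(object):
    """Toy stand-in for SSTableDeletingReference: the on-disk file is removed
    only once this in-memory reference object is reclaimed."""
    def __init__(self, path):
        self.path = path
        weakref.finalize(self, os.remove, path)   # runs once this object is reclaimed

path = "Standard1-toy-Data.db"
open(path, "w").close()              # stand-in for an sstable left behind by compaction
ref = SSTableReference(path)

del ref        # compaction made it obsolete: the last reference is dropped
gc.collect()   # on the JVM the file lingers until a GC like this actually runs
print(os.path.exists(path))          # False once the reference has been reclaimed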

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Peter Schuller <pe...@infidyne.com>.
> sstables waiting for the GC to trigger actual file removal. *However*,
> and this is what I meant with my follow-up, that still does not
> explain the data from her post unless 'nodetool ring' reports total
> sstable size rather than the total size of live sstables.

As far as I can tell, the information 'nodetool ring' prints comes
from SSTableTracker's liveSize. This seems to be updated in replace()
which is used by the compaction manager after compaction. So I do
believe 'nodetool ring' should *not* include the size of sstables that
have been compacted, regardless of GC and whether they still remain on
disk.

-- 
/ Peter Schuller

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Rob Coli <rc...@digg.com>.
On 8/6/10 3:30 PM, Peter Schuller wrote:
> *However*,
> and this is what I meant with my follow-up, that still does not
> explain the data from her post unless 'nodetool ring' reports total
> sstable size rather than the total size of live sstables.

Relatively limited time available to respond to this post, but FYI :

src/java/org/apache/cassandra/tools/NodeCmd.java
"
     public void printRing(PrintStream outs)
...
         Map<String, String> loadMap = probe.getLoadMap();
"

probe object is of type NodeProbe, so..

src/java/org/apache/cassandra/tools/NodeProbe.java
"
         Map<String, String> loadMap = ssProxy.getLoadMap();
"

ssProxy object is a MBean proxy to Storage Service methods..

./cassandra-0.6/src/java/org/apache/cassandra/service/StorageService.java
"
     public double getLoad()
...
                 bytes += cfs.getLiveDiskSpaceUsed();
"

In other words, "nodetool ring" should be reporting only the live disk 
space used.

> Just to be clear, my perhaps misuse of the term "obsolete" only refers
> to sstables that have been successfully compacted together with others
> into a new sstable which replaces the old ones (hence making them
> "obsolete"). I do not mean to imply that they contain obsolete
> columns.

Ok, thought that's what you meant. Thx for the clarification.

=Rob

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Peter Schuller <pe...@infidyne.com>.
> Your post refers to "obsolete" sstables, but the only thing that makes them
> "obsolete" in this case is that they have been compacted?

Yes.

> As I understand Julie's case, she is :
>
> a) initializing her cluster
> b) inserting some number of unique keys with CL.ALL
> c) noticing that more disk space (6x?) than is expected is used
> d) but that she gets expected usage if she does a major compaction
>
> In other words, the problem isn't "temporary disk space occupied during the
> compact", it's permanent disk space occupied unless she compacts.

Sorry, I re-read my previous message and I wasn't being very clear and
I was probably a bit confused too ;)

The reason I mention temporary spikes during compaction is that these
are fully expected to roughly double disk space use. I did not mean to
imply that this is what she is seeing, since she's specifically
waiting for background compaction to complete. Rather, my point was
that as long as we are only talking about roughly a doubling of disk
space, regardless of its cause, it is not worse than what you may
expect anyway, even if temporarily, during active compaction.

I still have no explanation for lingering data, other than obsolete
sstables waiting for the GC to trigger actual file removal. *However*,
and this is what I meant with my follow-up, that still does not
explain the data from her post unless 'nodetool ring' reports total
sstable size rather than the total size of live sstables.

But let's suppose that's wrong and 'nodetool ring' somehow does
include all sstable data; in such a case it seems that the data from
Julie's latest post may be consistent with obsolete sstables, since
the repeated attempts at compaction/cleanup may very well have
triggered a CMS sweep, thus eventually freeing the sstables (while
simply leaving the cluster idle would not have done so normally).

It is also worth noting that this kind of space waste (sstables
hanging around waiting for a GC) is not likely to scale to larger data
sets as the probability of triggering a CMS sweep within a reasonable
period after compaction, increases with the amount of traffic you
throw at the cluster. So if you're e.g. writing 100 gb of data, you're
pretty unlikely to not trigger a CMS sweep within the first few
percent (assuming you don't have a huge 250 gb heap or something). For
this reason (even disregarding that cassandra tries to trigger a GC
when disk runs low) I would not count them as expected to be
"cumulative" on top of any temporary spikes resulting from compaction;
so in effect the idea is that these effects should normally not matter
since you need the disk space to deal with the compaction spikes
anyway, and you are unlikely to have both compaction spikes *and*
these obsolete sstables at the same time even without the GC
triggering cassandra does.

> There is clearly overhead from there being multiple SSTables with multiple
> bloom filters and multiple indexes. But from my understanding, that does not
> fully account for the difference in disk usage she is seeing. If it is 6x
> across the whole cluster, it seems unlikely that the meta information is 5x
> the size of the actual information.

Definitely agreed. I should have made it clearer that I was only
addressing the post.

Julie, are you still seeing 6x and similar factors of waste in your
reproducible test cases, or was the 6x factor limited to your initial
experience with real data?

The reason I'm barking up this tree is that I figure we may be
observing the results of two independent problems here, and if the
smaller test case can be explained away by lack of GC, then it
probably doesn't help figuring out what happened in the original
problem scenario.

Hmm. Maybe a 'nodetool gc' would be in order to make it easy to
trigger a gc without going for jconsole/JMX manually.

> I haven't been following this thread very closely, but I don't think
> "obsolete" SSTables should be relevant, because she's not doing UPDATE or
> DELETE and she hasn't changed cluster topography (the "cleanup" case).

Even a workload strictly limited to writing new data with unique keys
and column names will cause sstables to become obsolete, as long as
you write enough data to reach the compaction threshold.

Just to be clear, my perhaps misuse of the term "obsolete" only refers
to sstables that have been successfully compacted together with others
into a new sstable which replaces the old ones (hence making them
"obsolete"). I do not mean to imply that they contain obsolete
columns.

-- 
/ Peter Schuller

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Rob Coli <rc...@digg.com>.
On 8/20/10 1:58 PM, Julie wrote:
> Julie<julie.sugar<at>  nextcentury.com>  writes:
>
> Please see previous post but is hinted handoff a factor if the CL is set to ALL?

Your previous post looks like a flush or compaction is causing the node 
to mark its neighbors down. Do you see correlation between memtable 
flushes or compaction and the GMFD marking? Are you running a version of 
Cassandra (<=0.6.4) which has the MESSAGE-DESERIALIZER-POOL bug, and/or 
are you seeing a backup there? That can cause GMFD to incorrectly mark 
nodes down..

https://issues.apache.org/jira/browse/CASSANDRA-1358

However you are correct, I missed that you are writing at CL.ALL. With 
CL.ALL, you should get an exception if nodes are unreachable during a 
write. This means that Hinted Handoff is unlikely to be the culprit for 
your usage bloat. Since you clearly have nodes which are occasionally 
considered unreachable, are you seeing those exceptions?

=Rob

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Julie <ju...@nextcentury.com>.
Julie <julie.sugar <at> nextcentury.com> writes:

Please see previous post but is hinted handoff a factor if the CL is set to ALL?




Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Julie <ju...@nextcentury.com>.
Robert Coli <rcoli <at> digg.com> writes:

> Check the size of the Hinted Handoff CF? If your nodes are flapping
> under sustained write, they could be storing a non-trivial number of
> hinted handoff rows? Probably not 5x usage though..
> 
> http://wiki.apache.org/cassandra/Operations
> "
> The reason why you run nodetool cleanup on all live nodes [after
> replacing a node] is to remove old Hinted Handoff writes stored for
> the dead node. 
> =Rob


Please see below and let me know if you think hinted handoff is to blame.  

I do see some down/up activity according to the gossiping on the nodes.  
Interestingly, I see mass "deaths" being detected on servers 1, 2, 4, 5, and 
6.  Each server detects the "mass death" at a unique time, making it look as 
though it's the server that's detecting the mass death that is really the 
culprit.  Server3, the bloated node, is not having this problem.

As for nodes being reported as going down and coming back up (always 
quickly), down/up events are reported for each server as follows:
server1: 16 times
server2: 17 times
server3: 12 times
server4: 18 times
server5: 24 times
server6: 20 times
server7: 13 times
server8: 12 times

Again, server3 is looking like a very healthy node, so you wouldn't think it
would have a backlog of hinted handoffs coming its way when the writes complete.
Server5 and server6 seem to be the least healthy.

Here's the streaming that took place on all 8 nodes, following the 800,000 row
write (writes completed at 13:47):

server1-system.log: INFO [Thread-25] 2010-08-16 12:59:49,858
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-172-Data.db
server1-system.log: INFO [Thread-31] 2010-08-16 18:12:01,829 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-475-Data.db

server3-system.log: INFO [Thread-25] 2010-08-16 14:34:06,827 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-467-Data.db
server3-system.log: INFO [Thread-28] 2010-08-16 15:38:34,685 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-475-Data.db 
server3-system.log: INFO [Thread-34] 2010-08-16 17:24:57,584 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-477-Data.db
server3-system.log: INFO [Thread-31] 2010-08-16 17:26:37,281 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-476-Data.db

server4-system.log: INFO [Thread-25] 2010-08-16 18:30:19,313 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-472-Data.db
server4-system.log: INFO [Thread-28] 2010-08-16 18:33:07,141 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-473-Data.db

server6-system.log: INFO [Thread-25] 2010-08-16 16:53:43,108 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-470-Data.db
server6-system.log: INFO [Thread-28] 2010-08-16 17:58:15,031 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-471-Data.db

server7-system.log: INFO [Thread-25] 2010-08-16 12:39:54,342 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-110-Data.db
server7-system.log: INFO [Thread-28] 2010-08-16 12:46:16,067 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-144-Data.db
server7-system.log: INFO [Thread-31] 2010-08-16 14:51:24,585 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-469-Data.db
server7-system.log: INFO [Thread-34] 2010-08-16 15:17:22,168  
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-472-Data.db

server8-system.log: INFO [Thread-25] 2010-08-16 12:46:39,462 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-133-Data.db
server8-system.log: INFO [Thread-28] 2010-08-16 15:56:38,124 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-473-Data.db
server8-system.log: INFO [Thread-31] 2010-08-16 17:05:24,805 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-475-Data.db
server8-system.log: INFO [Thread-34] 2010-08-16 18:52:03,416 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-476-Data.db

Do you see any red flags?

Thanks for your help!



Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Robert Coli <rc...@digg.com>.
On Thu, Aug 19, 2010 at 7:23 AM, Julie <ju...@nextcentury.com> wrote:
> At this point, I logged in.  The data distribution on this node was 122GB.  I
> started performing a manual nodetool cleanup.

Check the size of the Hinted Handoff CF? If your nodes are flapping
under sustained write, they could be storing a non-trivial number of
hinted handoff rows? Probably not 5x usage though..

http://wiki.apache.org/cassandra/Operations
"
The reason why you run nodetool cleanup on all live nodes [after
replacing a node] is to remove old Hinted Handoff writes stored for
the dead node.
"

You could relatively quickly determine whether Hinted Handoff is
implicated by running your test with the feature turned off.

https://issues.apache.org/jira/browse/CASSANDRA-894

=Rob

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Julie <ju...@nextcentury.com>.
Peter Schuller <peter.schuller <at> infidyne.com> writes:

> Without necessarily dumping all the information - approximately what
> do they contain? Do they contain anything about compactions,
> anti-compactions, streaming, etc?
> 
> With an idle node after taking writes, I *think* the only expected
> disk I/O (once it has settled) would be a memtable flush triggered by
> memtable_flush_after_mins, and possibly compactions resulting from
> that (depending on how close one were to triggering compaction prior
> to the memtable flush). Whatever is causing additional sstables to be
> written, even if somehow triggered incorrectly, I'd hope that they
> were logged still.

******************************************************
Hi Peter,

Thanks for your ideas.  Here is what I see in the bloated node's system.log file
(condensed) starting 4 minutes after the last write.  At this time, data usage
on this node was only around 28GB (as expected).
******************************************************

INFO [COMPACTION-POOL:1] 2010-08-16 13:51:38,432 CompactionManager.java
 (line 246) Compacting ...
 INFO [COMPACTION-POOL:1] 2010-08-16 14:20:02,840 CompactionManager.java
 (line 246) Compacting ...
 INFO [COMPACTION-POOL:1] 2010-08-16 14:28:37,085 CompactionManager.java
 (line 246) Compacting ...

 INFO [Thread-25] 2010-08-16 14:34:06,827 StreamCompletionHandler.java
 (line 64) Streaming added
 /var/lib/cassandra/data/Keyspace1/Standard1-467-Data.db

 INFO [Timer-1] 2010-08-16 14:34:31,655 Gossiper.java (line 179)
 InetAddress /10.254.107.178 is now dead.
 INFO [GMFD:1] 2010-08-16 14:35:03,012 Gossiper.java (line 568)
 InetAddress /10.254.107.178 is now UP

 INFO [COMPACTION-POOL:1] 2010-08-16 14:36:04,808 CompactionManager.java
 (line 246) Compacting ...
 INFO [COMPACTION-POOL:1] 2010-08-16 14:40:31,283 CompactionManager.java
 (line 246) Compacting ...
 INFO [FLUSH-TIMER] 2010-08-16 14:50:46,570 ColumnFamilyStore.java (line 357)
 Standard1 has reached its threshold; switching in a fresh Memtable at
 CommitLogContext(file=
 '/var/lib/cassandra/commitlog/CommitLog-1281980828674.log', position=87493038)


 INFO [FLUSH-TIMER] 2010-08-16 14:50:46,602 ColumnFamilyStore.java (line 609)
 Enqueuing flush of Memtable(Standard1)@31873581
 INFO [FLUSH-WRITER-POOL:1] 2010-08-16 14:50:46,647 Memtable.java (line 148)
 Writing Memtable(Standard1)@31873581
 INFO [GC inspection] 2010-08-16 14:50:47,350 GCInspector.java (line 110)
 GC for ConcurrentMarkSweep: 362 ms, 506927464 reclaimed leaving 129128920 used;
 max is 1172766720
 INFO [FLUSH-WRITER-POOL:1] 2010-08-16 14:50:51,137 Memtable.java (line 162)
 Completed flushing /var/lib/cassandra/data/Keyspace1/Standard1-472-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 14:50:58,411
 SSTableDeletingReference.java (line 104)
 Deleted /var/lib/cassandra/data/Keyspace1/Standard1-223-Data.db
...
...
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 14:51:40,662
 SSTableDeletingReference.java (line 104)
 Deleted /var/lib/cassandra/data/Keyspace1/Standard1-407-Data.db

 INFO [Timer-1] 2010-08-16 14:51:40,772 Gossiper.java (line 179)
 InetAddress /10.254.242.159 is now dead.

 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 14:51:41,011
 SSTableDeletingReference.java (line 104)
 Deleted /var/lib/cassandra/data/Keyspace1/Standard1-279-Data.db
 ...
 ...
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 14:52:22,206
 SSTableDeletingReference.java (line 104)
 Deleted /var/lib/cassandra/data/Keyspace1/Standard1-389-Data.db

 INFO [GMFD:1] 2010-08-16 14:52:34,906 Gossiper.java (line 568)
 InetAddress /10.254.242.159 is now UP

 INFO [COMPACTION-POOL:1] 2010-08-16 14:58:48,958 CompactionManager.java
 (line 246) Compacting ...
 INFO [COMPACTION-POOL:1] 2010-08-16 15:02:43,444 CompactionManager.java
 (line 246) Compacting ...
 INFO [COMPACTION-POOL:1] 2010-08-16 15:05:08,837 CompactionManager.java
 (line 246) Compacting []

 INFO [Timer-1] 2010-08-16 15:17:38,697 Gossiper.java (line 179)
 InetAddress /10.254.234.226 is now dead.
 INFO [GMFD:1] 2010-08-16 15:18:37,471 Gossiper.java (line 568)
 InetAddress /10.254.234.226 is now UP


 INFO [Thread-28] 2010-08-16 15:38:34,685 StreamCompletionHandler.java
 (line 64)
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-475-Data.db

 INFO [COMPACTION-POOL:1] 2010-08-16 16:05:08,811 CompactionManager.java
 (line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-16 17:05:08,792 CompactionManager.java
 (line 246) Compacting []

 INFO [Timer-1] 2010-08-16 17:05:55,583 Gossiper.java (line 179)
 InetAddress /10.254.242.159 is now dead.
 INFO [GMFD:1] 2010-08-16 17:06:19,614 Gossiper.java (line 568)
 InetAddress /10.254.242.159 is now UP


 INFO [Thread-34] 2010-08-16 17:24:57,584 StreamCompletionHandler.java
 (line 64)
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-477-Data.db

 INFO [Timer-1] 2010-08-16 17:25:19,206 Gossiper.java (line 179)
 InetAddress /10.210.157.187 is now dead.
 INFO [GMFD:1] 2010-08-16 17:25:53,096 Gossiper.java (line 568)
 InetAddress /10.210.157.187 is now UP


 INFO [Thread-31] 2010-08-16 17:26:37,281 StreamCompletionHandler.java
 (line 64)
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-476-Data.db
 INFO [Timer-1] 2010-08-16 17:27:01,939 Gossiper.java (line 179)
 InetAddress /10.254.234.226 is now dead.
 INFO [GMFD:1] 2010-08-16 17:27:24,115 Gossiper.java (line 568)
 InetAddress /10.254.234.226 is now UP


 INFO [COMPACTION-POOL:1] 2010-08-16 18:05:08,764 CompactionManager.java
 (line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-16 19:05:08,741 CompactionManager.java
 (line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-16 20:05:08,718 CompactionManager.java
 (line 246) Compacting []
INFO [COMPACTION-POOL:1] 2010-08-16 21:05:08,694 CompactionManager.java
 (line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-16 22:05:08,674 CompactionManager.java
 (line 246) Compacting []

******************************************************
At this point, I logged in.  The data distribution on this node was 122GB.  I
started performing a manual nodetool cleanup.
******************************************************





Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Peter Schuller <pe...@infidyne.com>.
> I actually have the log files from all 8 nodes if it helps to diagnose what
> activity was going on behind the scenes.  I really need to understand how this
> happened.

Without necessarily dumping all the information - approximately what
do they contain? Do they contain anything about compactions,
anti-compactions, streaming, etc?

With an idle node after taking writes, I *think* the only expected
disk I/O (once it has settled) would be a memtable flush triggered by
memtable_flush_after_mins, and possibly compactions resulting from
that (depending on how close one were to triggering compaction prior
to the memtable flush). Whatever is causing additional sstables to be
written, even if somehow triggered incorrectly, I'd hope that they
were logged still.

What about something like a gossiping issue with some kind of
disagreement about token space? But even then, why would nodes
spontaneously start pushing data - my understanding is that this is
only triggered by administrative operations right now, which seems
confirmed by:

   http://wiki.apache.org/cassandra/Streaming

Assuming the log files contain some kind of activity such as
compaction/streaming/etc; do they correlate well in time with each
other and/or something else?

-- 
/ Peter Schuller

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Julie <ju...@nextcentury.com>.
Jonathan Ellis <jbellis <at> gmail.com> writes:

> 
> If you read the stack traces you pasted, the node in question ran out
> of diskspace.  When you have < 25% space free this is not surprising.
> 
> But fundamentally you are missing something important from your story
> here.  Disk space doesn't just increase spontaneously with "absolutely
> no activity."


I agree, it's not surprising for a node to run out of disk space when it is
occupying 122GB (133GB including the still present compacted files) on a 160GB
drive.  What I don't understand is how 30GB of data grew into 122GB.

I could send you the Cassandra log files from the time period after the 800,000
rows of writes completed.  There was no activity other than Cassandra doing
whatever work it does behind the scenes.

The 800,000 unique key writes started at 12:15pm and completed at 1:47pm.  These
8 VMs were created solely for this test and I was the only one logged on.  After
the writes were done, Cassandra did what it does behind the scenes; there was no
human intervention and no clients running or connecting to it.  I didn't log in
again to check on things until 11pm that night which is when I saw the uneven
data distribution (122GB on one of the nodes).

I actually have the log files from all 8 nodes if it helps to diagnose what
activity was going on behind the scenes.  I really need to understand how this
happened.

Thanks again for your help,
Julie




Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Edward Capriolo <ed...@gmail.com>.
On Wed, Aug 18, 2010 at 10:51 AM, Jonathan Ellis <jb...@gmail.com> wrote:
> If you read the stack traces you pasted, the node in question ran out
> of diskspace.  When you have < 25% space free this is not surprising.
>
> But fundamentally you are missing something important from your story
> here.  Disk space doesn't just increase spontaneously with "absolutely
> no activity."
>
> On Wed, Aug 18, 2010 at 9:36 AM, Julie <ju...@nextcentury.com> wrote:
>>
>> Rob Coli <rcoli <at> digg.com> writes:
>>
>>
>>> As I understand Julie's case, she is :
>>> a) initializing her cluster
>>> b) inserting some number of unique keys with CL.ALL
>>> c) noticing that more disk space (6x?) than is expected is used
>>> d) but that she gets expected usage if she does a major compaction
>>> In other words, the problem isn't "temporary disk space occupied during
>>> the compact", it's permanent disk space occupied unless she compacts.
>>>
>>> Julie : when compaction occurs, it logs the number of bytes that it
>>> started with and the number it ended with, as well as the number of keys
>>> involved in the compaction. What do these messages say?
>>>
>>> example line :
>>> INFO [COMPACTION-POOL:1] 2010-08-06 13:48:00,328 CompactionManager.java
>>> (line 398) Compacted to /path/to/MyColumnFamily-26-Data.db.
>>> 999999999/888888888 bytes for 12345678 keys.  Time: 123456ms.
>>
>> Rob -
>> I reran the original test: 8 nodes in the cluster (160GB drives on each
>> node). Populated each node with 30GB of data using unique keys and CL.ALL
>> and repFactor=3.  (Wrote 10GB of data to each node but with the repFactor=3,
>> it results in about 30GB of data on each node.)
>>
>> One hour after the last write, the ring distribution looks excellent:
>> Address       Status     Load          Range
>>  Ring
>>                                       170141183460469231731687303715884105728
>> 10.210.198.64 Up         28.32 GB      21267647932558653966460912964485513216
>>  |<--|
>> 10.210.157.187Up         28.07 GB      42535295865117307932921825928971026432
>>  |   ^
>> 10.206.34.194 Up         28.12 GB      63802943797675961899382738893456539648
>>  v   |
>> 10.254.107.178Up         28.15 GB      85070591730234615865843651857942052864
>>  |   ^
>> 10.254.234.226Up         28.02 GB      106338239662793269832304564822427566080
>>  v   |
>> 10.254.242.159Up         27.96 GB      127605887595351923798765477786913079296
>>  |   ^
>> 10.214.18.198 Up         28.18 GB      148873535527910577765226390751398592512
>>  v   |
>> 10.214.26.118 Up         29.82 GB      170141183460469231731687303715884105728
>>  |-->|
>>
>> But 8 hours after the last write (absolutely no activity), things don't
>> look as good:
>>
>> Address       Status     Load          Range
>>  Ring
>>                                       170141183460469231731687303715884105728
>> 10.210.198.64 Up         30.28 GB      21267647932558653966460912964485513216
>>  |<--|
>> 10.210.157.187Up         28.12 GB      42535295865117307932921825928971026432
>>  |   ^
>> 10.206.34.194 Up         122.41 GB     63802943797675961899382738893456539648
>>  v   |
>> 10.254.107.178Up         33.89 GB      85070591730234615865843651857942052864
>>  |   ^
>> 10.254.234.226Up         28.01 GB      106338239662793269832304564822427566080
>>  v   |
>> 10.254.242.159Up         72.58 GB      127605887595351923798765477786913079296
>>  |   ^
>> 10.214.18.198 Up         83.41 GB      148873535527910577765226390751398592512
>>  v   |
>> 10.214.26.118 Up         62.01 GB      170141183460469231731687303715884105728
>>  |-->|
>>
>> …
>>
>> So nodetool cleanup seems to have resulted in a fatal error on my overly (>4x)
>> bloated node.
>>
>> Can anyone help me with understanding why this happened, taking into account
>> the node should only contain 30GB of data on a 160GB hard drive?  I have the 8
>> system.log files from the 24 hour period (the nodes were only alive for 24
>> hours total).
>>
>> Thank you!
>> Julie
>>
>>
>>
>>
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>

I noticed something funny about repair as well. I upgraded to 0.6.3 and
then joined a node with bootstrap set to false.

So the node was empty. I ran REPAIR on the new node. Later I noticed
that the node had 2-4x the data it should have had.

Triggering REPAIR should cause anticompaction on the closest nodes on
the ring. Those nodes should then stream ONLY the data that belongs on
the new node to the new node. Right? Judging by the size of the new node,
I assume that all the contents of both neighbors were streamed to it.
(This could be wrong though.) Sorry, I have no good test case to
replicate this.

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Jonathan Ellis <jb...@gmail.com>.
If you read the stack traces you pasted, the node in question ran out
of disk space.  When you have < 25% space free this is not surprising.

But fundamentally you are missing something important from your story
here.  Disk space doesn't just increase spontaneously with "absolutely
no activity."

On Wed, Aug 18, 2010 at 9:36 AM, Julie <ju...@nextcentury.com> wrote:
>
> …
>
> So nodetool cleanup seems to have resulted in a fatal error on my overly (>4x)
> bloated node.
>
> Can anyone help me with understanding why this happened, taking into account
> the node should only contain 30GB of data on a 160GB hard drive?  I have the 8
> system.log files from the 24 hour period (the nodes were only alive for 24
> hours total).
>
> Thank you!
> Julie
>
>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Julie <ju...@nextcentury.com>.
Rob Coli <rcoli <at> digg.com> writes:


> As I understand Julie's case, she is :
> a) initializing her cluster
> b) inserting some number of unique keys with CL.ALL
> c) noticing that more disk space (6x?) than is expected is used
> d) but that she gets expected usage if she does a major compaction
> In other words, the problem isn't "temporary disk space occupied during
> the compact", it's permanent disk space occupied unless she compacts.
> 
> Julie : when compaction occurs, it logs the number of bytes that it 
> started with and the number it ended with, as well as the number of keys 
> involved in the compaction. What do these messages say?
> 
> example line :
> INFO [COMPACTION-POOL:1] 2010-08-06 13:48:00,328 CompactionManager.java 
> (line 398) Compacted to /path/to/MyColumnFamily-26-Data.db. 
> 999999999/888888888 bytes for 12345678 keys.  Time: 123456ms.
 
Rob -
I reran the original test: 8 nodes in the cluster (160GB drives on each 
node). Populated each node with 30GB of data using unique keys and CL.ALL 
and repFactor=3.  (Wrote 10GB of data to each node but with the repFactor=3, 
it results in about 30GB of data on each node.)  

One hour after the last write, the ring distribution looks excellent:
Address       Status     Load          Range                                   
  Ring
                                       170141183460469231731687303715884105728    
10.210.198.64 Up         28.32 GB      21267647932558653966460912964485513216  
  |<--|
10.210.157.187Up         28.07 GB      42535295865117307932921825928971026432  
  |   ^
10.206.34.194 Up         28.12 GB      63802943797675961899382738893456539648  
  v   |
10.254.107.178Up         28.15 GB      85070591730234615865843651857942052864  
  |   ^
10.254.234.226Up         28.02 GB      106338239662793269832304564822427566080 
  v   |
10.254.242.159Up         27.96 GB      127605887595351923798765477786913079296 
  |   ^
10.214.18.198 Up         28.18 GB      148873535527910577765226390751398592512 
  v   |
10.214.26.118 Up         29.82 GB      170141183460469231731687303715884105728 
  |-->|

But 8 hours after the last write (absolutely no activity), things don't 
look as good:

Address       Status     Load          Range                                   
  Ring
                                       170141183460469231731687303715884105728    
10.210.198.64 Up         30.28 GB      21267647932558653966460912964485513216  
  |<--|
10.210.157.187Up         28.12 GB      42535295865117307932921825928971026432  
  |   ^
10.206.34.194 Up         122.41 GB     63802943797675961899382738893456539648  
  v   |
10.254.107.178Up         33.89 GB      85070591730234615865843651857942052864  
  |   ^
10.254.234.226Up         28.01 GB      106338239662793269832304564822427566080 
  v   |
10.254.242.159Up         72.58 GB      127605887595351923798765477786913079296 
  |   ^
10.214.18.198 Up         83.41 GB      148873535527910577765226390751398592512 
  v   |
10.214.26.118 Up         62.01 GB      170141183460469231731687303715884105728 
  |-->|

The 122.41GB node is named ec2-server3.  Here's what cfstats reports:
On ec2-server3:

          Write Latency: 0.21970486121446028 ms.
          Pending Tasks: 0
                   Column Family: Standard1
                   SSTable count: 9
                        Space used (live): 131438293207
                        Space used (total): 143577216419
                   Memtable Columns Count: 0
                   Memtable Data Size: 0
                   Memtable Switch Count: 454
                   Read Count: 0
                   Read Latency: NaN ms.
                   Write Count: 302373
                   Write Latency: 0.220 ms.
                   Pending Tasks: 0
                   Key cache capacity: 200000
                   Key cache size: 0
                   Key cache hit rate: NaN
                   Row cache: disabled
                   Compacted row minimum size: 100316
                   Compacted row maximum size: 100323
                   Compacted row mean size: 100322

On ec2-server3, df reports:
/dev/sdb             153899044 140784388   5297032  97% /mnt

So this node should (I would think) contain 30GB of data on a 160GB hard 
drive but somehow it has grown to 122 GB of data (live space) plus several 
compacted files that have not yet been deleted (total space).  Keep in 
mind that no deletions or updates have taken place, just unique key writes.

At this point, here are the compaction lines from the ec2-server3 system.log
file, taking place after all writes had completed:

INFO [COMPACTION-POOL:1] 2010-08-16 14:11:58,614 CompactionManager.java (line 
326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-465-Data.db. 
8595448738/8595448738 bytes for 85678 keys.  Time: 984905ms.
 INFO [COMPACTION-POOL:1] 2010-08-16 14:20:02,825 CompactionManager.java (line 
326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-466-Data.db. 
2157339354/2157339354 bytes for 21504 keys.  Time: 484188ms.
 INFO [COMPACTION-POOL:1] 2010-08-16 14:28:37,066 CompactionManager.java (line 
326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-468-Data.db.  
2157339329/2157339329 bytes for 21504 keys.  Time: 514226ms.
 INFO [COMPACTION-POOL:1] 2010-08-16 14:36:04,806 CompactionManager.java (line 
326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-469-Data.db. 
2157339417/2157339417 bytes for 21504 keys.  Time: 447720ms.
 INFO [COMPACTION-POOL:1] 2010-08-16 14:40:31,234 CompactionManager.java (line 
326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-470-Data.db.  
2157339202/2157339202 bytes for 21504 keys.  Time: 266418ms.
 INFO [COMPACTION-POOL:1] 2010-08-16 14:58:48,943 CompactionManager.java (line 
326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-471-Data.db.  
8629357302/8629357302 bytes for 86016 keys.  Time: 1097659ms.
 INFO [COMPACTION-POOL:1] 2010-08-16 15:02:43,428 CompactionManager.java (line
326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-473-Data.db.  
2157338687/2157338687 bytes for 21504 keys.  Time: 234464ms.
 INFO [COMPACTION-POOL:1] 2010-08-16 15:05:08,807 CompactionManager.java (line 
326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-474-Data.db. 
1345327705/1345327705 bytes for 13410 keys.  Time: 145363ms.

Here’s what’s in the data directory on this node:

-rw-r--r-- 1 root root  9518412981 Aug 16 13:36 Standard1-247-Data.db
-rw-r--r-- 1 root root  8595448738 Aug 16 14:11 Standard1-465-Data.db
-rw-r--r-- 1 root root 22968417153 Aug 16 14:34 Standard1-467-Data.db
-rw-r--r-- 1 root root  8629357302 Aug 16 14:58 Standard1-471-Data.db
-rw-r--r-- 1 root root  2157338687 Aug 16 15:02 Standard1-473-Data.db
-rw-r--r-- 1 root root  1345327705 Aug 16 15:05 Standard1-474-Data.db
-rw-r--r-- 1 root root 28995571589 Aug 16 15:38 Standard1-475-Data.db
-rw-r--r-- 1 root root 30063580839 Aug 16 17:26 Standard1-476-Data.db
-rw-r--r-- 1 root root 19091100731 Aug 16 17:24 Standard1-477-Data.db
Plus a bunch of compacted files. 

At this point, I performed a manual "nodetool cleanup" on the super-large
node, and *very bad things* happened.  Here's an excerpt from the large
node's log file upon issuing the cleanup:

INFO [COMPACTION-POOL:1] 2010-08-16 23:04:04,230 CompactionManager.java (line 
345) AntiCompacting [org.apache.cassandra.io.SSTableReader(
path='/var/lib/cassandra/data/Keyspace1/Standard1-474-Data.db'),
org.apache.cassandra.io.SSTableReader(
path='/var/lib/cassandra/data/Keyspace1/Standard1-471-Data.db'),
org.apache.cassandra.io.SSTableReader(
path='/var/lib/cassandra/data/Keyspace1/Standard1-473-Data.db'),
org.apache.cassandra.io.SSTableReader(
path='/var/lib/cassandra/data/Keyspace1/Standard1-477-Data.db'),
org.apache.cassandra.io.SSTableReader(
path='/var/lib/cassandra/data/Keyspace1/Standard1-475-Data.db'),
org.apache.cassandra.io.SSTableReader(
path='/var/lib/cassandra/data/Keyspace1/Standard1-467-Data.db'),
org.apache.cassandra.io.SSTableReader(
path='/var/lib/cassandra/data/Keyspace1/Standard1-247-Data.db'),
org.apache.cassandra.io.SSTableReader(
path='/var/lib/cassandra/data/Keyspace1/Standard1-476-Data.db'),
org.apache.cassandra.io.SSTableReader(
path='/var/lib/cassandra/data/Keyspace1/Standard1-465-Data.db')]

 INFO [COMPACTION-POOL:1] 2010-08-16 23:04:04,230 StorageService.java 
(line 1499) requesting GC to free disk space
 INFO [GC inspection] 2010-08-16 23:04:04,470 GCInspector.java (line 110) 
GC for ConcurrentMarkSweep: 233 ms, 299442152 reclaimed leaving 20577560 
used; max is 1172766720
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:14,641 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-419-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:17,974 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-466-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,100 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-428-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,225 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-415-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,374 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-441-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,514 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-461-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,623 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-442-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,740 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-425-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,891 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-447-Data.db
…
INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:25,331 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-451-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:25,423 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-450-Data.db
ERROR [COMPACTION-POOL:1] 2010-08-16 23:04:25,525 
DebuggableThreadPoolExecutor.java (line 94) Error in executor futuretask
java.util.concurrent.ExecutionException: 
java.lang.UnsupportedOperationException: disk full
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
        at java.util.concurrent.FutureTask.get(FutureTask.java:83)
        at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.after
Execute (DebuggableThreadPoolExecutor.java:86)
        at
org.apache.cassandra.db.CompactionManager$CompactionExecutor.after
Execute (CompactionManager.java:582)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask
(ThreadPoolExecutor.java:888)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.UnsupportedOperationException: disk full
        at 
org.apache.cassandra.db.CompactionManager.doAntiCompaction
(CompactionManager.java:351)
        at 
org.apache.cassandra.db.CompactionManager.doCleanupCompaction
(CompactionManager.java:417)
        at 
org.apache.cassandra.db.CompactionManager.access$400
(CompactionManager.java:49)
        at 
org.apache.cassandra.db.CompactionManager$2.call(CompactionManager.java:130)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask
(ThreadPoolExecutor.java:886)
        ... 2 more
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:25,558 
SSTableDeletingReference.java (line 104) 
Deleted /var/lib/cassandra/data/Keyspace1/Standard1-449-Data.db
…
INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:33,265 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-453-Data.db
 INFO [COMPACTION-POOL:1] 2010-08-16 23:05:08,656 CompactionManager.java 
(line 246) Compacting []

The next morning (19 hours later):

INFO [COMPACTION-POOL:1] 2010-08-16 23:05:08,656 CompactionManager.java 
(line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-17 00:05:08,637 CompactionManager.java 
(line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-17 01:05:08,607 CompactionManager.java 
(line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-17 02:05:08,581 CompactionManager.java 
(line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-17 03:05:08,568 CompactionManager.java 
(line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-17 04:05:08,532 CompactionManager.java 
(line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-17 05:05:08,505 CompactionManager.java 
(line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-17 06:05:08,494 CompactionManager.java 
(line 246) Compacting []

Also, 19 hours later, ring distribution is the same:

Address       Status     Load          Range                                   
  Ring
                                       170141183460469231731687303715884105728    
10.210.198.64 Up         30.28 GB      21267647932558653966460912964485513216  
  |<--|
10.210.157.187Up         28.12 GB      42535295865117307932921825928971026432  
  |   ^
10.206.34.194 Up         122.41 GB     63802943797675961899382738893456539648  
  v   |
10.254.107.178Up         33.89 GB      85070591730234615865843651857942052864  
  |   ^
10.254.234.226Up         28.01 GB      106338239662793269832304564822427566080 
  v   |
10.254.242.159Up         72.58 GB      127605887595351923798765477786913079296 
  |   ^
10.214.18.198 Up         83.41 GB      148873535527910577765226390751398592512 
  v   |
10.214.26.118 Up         62.01 GB      170141183460469231731687303715884105728 
  |-->|

So nodetool cleanup seems to have resulted in a fatal error on my overly (>4x)
bloated node.

Can anyone help me with understanding why this happened, taking into account 
the node should only contain 30GB of data on a 160GB hard drive?  I have the 8
system.log files from the 24 hour period (the nodes were only alive for 24
hours total).

Thank you!
Julie




Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Julie <ju...@nextcentury.com>.
Rob Coli <rcoli <at> digg.com> writes:

> As I understand Julie's case, she is :
> 
> a) initializing her cluster
> b) inserting some number of unique keys with CL.ALL
> c) noticing that more disk space (6x?) than is expected is used
> d) but that she gets expected usage if she does a major compaction

Yes, this is the scenario in my initial post.

> There is clearly overhead from there being multiple SSTables with 
> multiple bloom filters and multiple indexes. But from my understanding, 
> that does not fully account for the difference in disk usage she is 
> seeing. If it is 6x across the whole cluster, it seems unlikely that the 
> meta information is 5x the size of the actual information.
> 
> Julie : when compaction occurs, it logs the number of bytes that it 
> started with and the number it ended with, as well as the number of keys 
> involved in the compaction. What do these messages say?
> 
> example line :
> 
> INFO [COMPACTION-POOL:1] 2010-08-06 13:48:00,328 CompactionManager.java 
> (line 398) Compacted to /path/to/MyColumnFamily-26-Data.db. 
> 999999999/888888888 bytes for 12345678 keys.  Time: 123456ms.
> 
> =Rob

I will need to re-try this scenario today.  My subsequent posts were from
writing unique keys and then updating them, which is a totally different test
and *should* result in excess SSTable sizes.  I will retry writing unique keys,
then wait for the nodes to settle, and get back to you.  THANK YOU!!!




Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Rob Coli <rc...@digg.com>.
On 8/5/10 11:51 AM, Peter Schuller wrote:
> Also, the variation in disk space in your most recent post looks
> entirely as expected to me and nothing really extreme. The temporary
> disk space occupied during the compact/cleanup would easily be as high
> as your original disk space usage to begin with, and the fact that
> you're reaching the 5-7 GB per node level after a cleanup has
> completed fully and all obsolete sstables have been removed

Your post refers to "obsolete" sstables, but the only thing that makes 
them "obsolete" in this case is that they have been compacted?

As I understand Julie's case, she is :

a) initializing her cluster
b) inserting some number of unique keys with CL.ALL
c) noticing that more disk space (6x?) than is expected is used
d) but that she gets expected usage if she does a major compaction

In other words, the problem isn't "temporary disk space occupied during 
the compact", it's permanent disk space occupied unless she compacts.

There is clearly overhead from there being multiple SSTables with 
multiple bloom filters and multiple indexes. But from my understanding, 
that does not fully account for the difference in disk usage she is 
seeing. If it is 6x across the whole cluster, it seems unlikely that the 
meta information is 5x the size of the actual information.

I haven't been following this thread very closely, but I don't think 
"obsolete" SSTables should be relevant, because she's not doing UPDATE 
or DELETE and she hasn't changed cluster topology (the "cleanup" case).

Julie : when compaction occurs, it logs the number of bytes that it 
started with and the number it ended with, as well as the number of keys 
involved in the compaction. What do these messages say?

example line :

INFO [COMPACTION-POOL:1] 2010-08-06 13:48:00,328 CompactionManager.java 
(line 398) Compacted to /path/to/MyColumnFamily-26-Data.db. 
999999999/888888888 bytes for 12345678 keys.  Time: 123456ms.

=Rob

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Peter Schuller <pe...@infidyne.com>.
Oh and,

> Nodetool cleanup works so beautifully, that I am wondering if there is any harm
> in using "nodetool cleanup" in a cron job on a live system that is actively
> processing reads and writes to the database?

since a cleanup/compact is supposed to trigger a full compaction,
that's generating a lot of I/O that is normally deferred. Doing it
nightly won't scale in the general case (suppose you have terabytes
of data...). But assuming compaction does not have a sufficiently
detrimental effect on your application and your data size is small
enough that a compaction finishes within some reasonable amount of
time, I don't see a problem with it.

That said, since cleanup by design actually permanently removes data
not belonging to nodes, it might be dangerous in the event that there
is somehow confusion over what data nodes are responsible for. A
regular compaction should not, as far as I know, ever remove data not
belonging to the node. So I can certainly see a danger there;
personally I'd probably want to avoid automating 'cleanup' for that
reason.

But if everything works reasonably, it still seems to me that you
should not be seeing extreme wastes of disk space. If you truly need to
compact nightly to save space, you might be running too close to your
maximum disk capacity anyway. That said, if you're still seeing extreme
amounts of "extra" disk space, I don't think there is yet an
explanation for that in this thread.

Also, the variation in disk space in your most recent post looks
entirely as expected to me and nothing really extreme. The temporary
disk space occupied during the compact/cleanup would easily be as high
as your original disk space usage to begin with, and the fact that
you're reaching the 5-7 GB per node level after a cleanup has
completed fully and all obsolete sstables have been removed does not
necessarily help you, since each future cleanup/compaction will
typically double your disk space usage anyway (even if temporarily).

-- 
/ Peter Schuller

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Julie <ju...@nextcentury.com>.
Jonathan Ellis <jbellis <at> gmail.com> writes:

> 
> did you try compact instead of cleanup, anyway?
> 

Hi Jonathan, 
Thanks for your reply. Actually, I didn't use compact, I used cleanup.  But I
did some testing with compact today since you mentioned it. Using nodetool
compact does reduce the disk usage on each node, but it doesn't drop to the
amount I would expect until I run nodetool cleanup on every node.  Cleanup
seems to force all the SSTables to combine into one.
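
For reference, a simple loop can drive those commands against every node; this
is only a sketch (the ec2-serverN hostnames follow the naming used elsewhere in
this thread, and the host flag may need adjusting for your nodetool version):

    # run a major compaction on each node, then a cleanup on each node
    for ii in $(seq 1 8); do
        nodetool -host ec2-server$ii compact
    done
    for ii in $(seq 1 8); do
        nodetool -host ec2-server$ii cleanup
    done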

Here are the results of my experiments with cleanup and compact.

At 3:25pm, here’s what my ring distribution was:

Address       Status     Load          Range                                   
  Ring
                                       170141183460469231731687303715884105728    
10.248.54.192 Up         12.58 GB      21267647932558653966460912964485513216  
  |<--|
10.248.254.15 Up         12.56 GB      42535295865117307932921825928971026432  
  |   ^
10.248.135.239Up         13.63 GB      63802943797675961899382738893456539648  
  v   |
10.248.199.79 Up         11.64 GB      85070591730234615865843651857942052864  
  |   ^
10.249.166.80 Up         12.18 GB      106338239662793269832304564822427566080 
  v   |
10.248.223.191Up         11.18 GB      127605887595351923798765477786913079296 
  |   ^
10.248.122.240Up         11.19 GB      148873535527910577765226390751398592512 
  v   |
10.248.34.80  Up         11.45 GB      170141183460469231731687303715884105728 
  |-->|


Ran nodetool compact on every node.  Then at 4pm, once quiescent:

Address       Status     Load          Range                                   
  Ring
                                       170141183460469231731687303715884105728    
10.248.54.192 Up         6.93 GB       21267647932558653966460912964485513216  
  |<--|
10.248.254.15 Up         6.72 GB       42535295865117307932921825928971026432  
  |   ^
10.248.135.239Up         13.96 GB      63802943797675961899382738893456539648  
  v   |
10.248.199.79 Up         7.15 GB       85070591730234615865843651857942052864  
  |   ^
10.249.166.80 Up         7.98 GB       106338239662793269832304564822427566080 
  v   |
10.248.223.191Up         6.76 GB       127605887595351923798765477786913079296 
  |   ^
10.248.122.240Up         6.58 GB       148873535527910577765226390751398592512 
  v   |
10.248.34.80  Up         6.8 GB        170141183460469231731687303715884105728 
  |-->|

So the manual compaction did help somewhat but did not get the nodes down to the
size of their raw data.  There are still multiple SSTables on most nodes.

At 4:02pm, ran nodetool cleanup on every node.

At 4:12pm, nodes are taking up the expected amount of space and all nodes are
using exactly 1 SSTable (fully compacted):

Address       Status     Load          Range                                   
  Ring
                                       170141183460469231731687303715884105728    
10.248.54.192 Up         5.64 GB       21267647932558653966460912964485513216  
  |<--|
10.248.254.15 Up         5.64 GB       42535295865117307932921825928971026432  
  |   ^
10.248.135.239Up         5.65 GB       63802943797675961899382738893456539648  
  v   |
10.248.199.79 Up         5.59 GB       85070591730234615865843651857942052864  
  |   ^
10.249.166.80 Up         5.57 GB       106338239662793269832304564822427566080 
  v   |
10.248.223.191Up         5.55 GB       127605887595351923798765477786913079296 
  |   ^
10.248.122.240Up         5.57 GB       148873535527910577765226390751398592512 
  v   |
10.248.34.80  Up         5.59 GB       170141183460469231731687303715884105728 
  |-->|


Nodetool cleanup works so beautifully that I am wondering if there is any harm
in using "nodetool cleanup" in a cron job on a live system that is actively
processing reads and writes to the database?
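
Something like the following is the kind of cron entry I have in mind (purely
illustrative; the schedule, user, paths, and host flag are placeholders, not
anything from my actual setup):

    # /etc/cron.d/cassandra-cleanup : run cleanup nightly at 03:00 on this node
    0 3 * * * root /opt/cassandra/bin/nodetool -host localhost cleanup >> /var/log/cassandra-cleanup.log 2>&1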

Julie




Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Jonathan Ellis <jb...@gmail.com>.
did you try compact instead of cleanup, anyway?

On Tue, Jul 27, 2010 at 1:08 PM, Julie <ju...@nextcentury.com> wrote:
> …
>
> 24 hours after my writes, the data is evenly distributed according to
> cfstats (I see almost identical numRows from node to node) but there is
> a lot of extra disk space being used on some nodes, again according to
> cfstats.  This disk usage drops back down to 2.7GB (exactly what I expect
> since that's how much raw data is on each node) when I run "nodetool
> cleanup".
>
> I am confused why there is anything to clean up 24 hours after my last
> write? All nodes in the cluster are fully up and aware of each other
> before I begin the writes.  The only other thing that could possibly be
> considered unusual is I cycle through all 8 nodes, rather than
> communicating with a single Cassandra node.  I use a write consistency
> setting of ALL.  I can't see how these would increase the amount of disk
> space used but just mentioning it.
>
> Any help would be greatly appreciated,
> Julie



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Julie <ju...@nextcentury.com>.
Peter Schuller <peter.schuller <at> infidyne.com> writes:

> > a) cleanup is a superset of compaction, so if you've been doing
> > overwrites at all then it will reduce space used for that reason
> 

Hi Peter and Jonathan,

In my test, I write 80,000 rows (100KB each row) to an 8 node cluster.  The
80,000 rows all have unique keys '1' through '80000' so no overwriting is
occurring.  I also don't do any deletes.  I simply write the 80,000 rows to 
the 8 node cluster which should be about 1GB of data times 3 (replication 
factor=3) on each node.  

The only thing I am doing special, is I use Random Partitioning and set the
Initial Token on each node to try to get the data evenly distributed:

    # Create tokens for the RandomPartitioner that evenly divide token space
    # The RandomPartitioner hashes keys into integer tokens in the range 0 to
    # 2^127.
    # So we simply divide that space into N equal sections.
    # serverCount = the number of Cassandra nodes in the cluster

    for ((ii=1; ii<=serverCount; ii++)); do
        host=ec2-server$ii
        echo Generating InitialToken for server on $host
        token=$(bc <<EOF
($ii*(2^127))/$serverCount
EOF
)
        echo host=$host initialToken=$token
        echo "<InitialToken>$token</InitialToken>" >> storage-conf-node.xml
        cat storage-conf-node.xml
    ...
    done
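
As a quick sanity check of that formula, serverCount=8 yields the evenly
spaced tokens that appear in the Range column of the nodetool ring listings
elsewhere in this thread; for the first and last nodes:

    $ echo "(1*(2^127))/8" | bc
    21267647932558653966460912964485513216
    $ echo "(8*(2^127))/8" | bc
    170141183460469231731687303715884105728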

24 hours after my writes, the data is evenly distributed according to 
cfstats (I see almost identical numRows from node to node) but there is 
a lot of extra disk space being used on some nodes, again according to 
cfstats.  This disk usage drops back down to 2.7GB (exactly what I expect 
since that's how much raw data is on each node) when I run "nodetool 
cleanup".

I am confused about why there is anything to clean up 24 hours after my last
write. All nodes in the cluster are fully up and aware of each other
before I begin the writes.  The only other thing that could possibly be
considered unusual is that I cycle through all 8 nodes, rather than
communicating with a single Cassandra node.  I use a write consistency
setting of ALL.  I can't see how these would increase the amount of disk
space used, but I mention them just in case.

Any help would be greatly appreciated,
Julie






Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Peter Schuller <pe...@infidyne.com>.
> Minor compactions (see
> http://wiki.apache.org/cassandra/MemtableSSTable) will try to keep the
> growth in check but it is by no means limited to 2x.

Sorry I was being unclear. I was rather thinking along the lines of a
doubling of data triggering an implicit major compaction. However I
was wrong anyway, since minimumCompactionThreshold in
CompactionManager is set to 4.

This does make me realize that the actual worst-case spike of disk
space usage is decidedly non-trivial to figure out, even if we are
allowed to assume that compaction speed is equal to or greater than
the speed of writes.

-- 
/ Peter Schuller

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, Jul 27, 2010 at 9:26 AM, Peter Schuller
<pe...@infidyne.com> wrote:
> I had failed to consider over-writes as a possible culprit (since
> removals were stated not to be done). However thinking about it I
> believe the effect of this should be limited to roughly a doubling of
> disk space in the absolute worst case of over-writing all data in the
> absolute worst possible order (such as writing everything twice in the
> same order).

Minor compactions (see
http://wiki.apache.org/cassandra/MemtableSSTable) will try to keep the
growth in check but it is by no means limited to 2x.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Peter Schuller <pe...@infidyne.com>.
> a) cleanup is a superset of compaction, so if you've been doing
> overwrites at all then it will reduce space used for that reason

I had failed to consider over-writes as a possible culprit (since
removals were stated not to be done). However thinking about it I
believe the effect of this should be limited to roughly a doubling of
disk space in the absolute worst case of over-writing all data in the
absolute worst possible order (such as writing everything twice in the
same order).

Or more accurately, it should be limited to wasting as much space
as the size of the overwritten values. If you're overwriting with
larger values, it will no longer be a "doubling" relative to the
actual live data set.

Julie, did you do over-writes or was your disk space measurements
based on the state of the cluster after an initial set of writes of
unique values?

-- 
/ Peter Schuller

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Jonathan Ellis <jb...@gmail.com>.
On Fri, Jul 23, 2010 at 8:57 AM, Julie <ju...@nextcentury.com> wrote:
> But in my focused testing today I see that if I run nodetool "cleanup" on the
> nodes taking up way more space than I expect, I see multiple SSTables being
> combined into 1 or 2 and the live disk usage going way down, down to what I know
> the raw data requires.
>
> This is great news!  I haven't tested it on hugely bloated nodes yet (where the
> disk usage is 6X the size of the raw data) since I haven't reproduced that
> problem today, but I would think using nodetool "cleanup" will work.
>
> I just have two questions:
>
>       (1) How can I set up Cassandra to do this automatically, to allow my
> nodes to store more data?

You'd have to use cron or a similar external service.
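
Something along these lines in the cassandra user's crontab would do it (the
schedule and the nodetool path are just examples; point it at wherever
nodetool lives in your install):

    # m h dom mon dow  command: run cleanup on this node every Sunday at 03:00
    0 3 * * 0  /opt/cassandra/bin/nodetool -host localhost cleanup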

>       (2) I am a bit confused why cleanup is working this way since the doc
> claims it just cleans up keys no longer belonging to this node.  I have 8 nodes
> and do a simple sequential write of 10,000 keys to each of them.  I'm using
> random partitioning and give each node an Initial Token that should force even
> spacing of tokens around the hash space:

a) cleanup is a superset of compaction, so if you've been doing
overwrites at all then it will reduce space used for that reason
b) if you have added, moved, or removed any nodes then you will have
"keys no longer belonging to this node"

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Julie <ju...@nextcentury.com>.
Jonathan Ellis <jbellis <at> gmail.com> writes:

> 
> then obsolete sstables are not your culprit.
> 

I believe I figured out how to force my node disk usage to go down.  I had been
letting Cassandra perform its own data management, and did not use nodetool to
force anything since in our real system, the data will need to be managed
automatically, without much human intervention.

But in my focused testing today I see that if I run nodetool "cleanup" on the
nodes taking up way more space than I expect, I see multiple SSTables being
combined into 1 or 2 and the live disk usage going way down, down to what I know
the raw data requires.

This is great news!  I haven't tested it on hugely bloated nodes yet (where the
disk usage is 6X the size of the raw data) since I haven't reproduced that
problem today, but I would think using nodetool "cleanup" will work.

I just have two questions:

       (1) How can I set up Cassandra to do this automatically, to allow my
nodes to store more data? 
 
       (2) I am a bit confused why cleanup is working this way since the doc
claims it just cleans up keys no longer belonging to this node.  I have 8 nodes
and do a simple sequential write of 10,000 keys to each of them.  I'm using
random partitioning and give each node an Initial Token that should force even
spacing of tokens around the hash space:

# Create tokens for the RandomPartitioner that evenly divide token space
# The RandomPartitioner hashes keys into integer tokens in the range 0 to
# 2^127.
# So we simply divide that space into N equal sections.

for ((ii=1; ii<=serverCount; ii++)); do
    host=ec2-server$ii
    echo Generating InitialToken for server on $host
    token=$(bc<<-EOF
        ($ii*(2^127))/$serverCount
EOF)
    echo host=$host initialToken=$token
    echo "<InitialToken>$token</InitialToken>" >> storage-conf-node.xml
    cat storage-conf-node.xml
done
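
As a sanity check, the same formula can be run by hand with bc (this is plain
arithmetic, nothing Cassandra-specific); for serverCount=8 the tokens come out
as multiples of 2^127/8, i.e. evenly spaced around the ring:

    # Print the token each of the 8 nodes should receive.
    for ((ii=1; ii<=8; ii++)); do
        echo "node $ii: $(echo "($ii*(2^127))/8" | bc)"
    done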

If tokens truly were being evenly distributed, I wouldn't expect there to be a
plethora of keys to redistribute.  (All my rows are 100KB long, one column.)  So
I'm not sure why cleanup is having this big an effect on my disk space usage.

If you can tell me how to automate this and why it's working, I would love it.

Thanks for your help!
Julie






Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Jonathan Ellis <jb...@gmail.com>.
then obsolete sstables are not your culprit.

On Thu, Jul 8, 2010 at 8:32 AM, Julie <ju...@nextcentury.com> wrote:
> Jonathan Ellis <jbellis <at> gmail.com> writes:
>
>> "SSTables that are obsoleted by a compaction are deleted
>> asynchronously when the JVM performs a GC. You can force a GC from
>> jconsole if necessary, but Cassandra will force one itself if it
>> detects that it is low on space. A compaction marker is also added to
>> obsolete sstables so they can be deleted on startup if the server does
>> not perform a GC before being restarted.
>>
>> "CFStoreMBean exposes sstable space used as getLiveDiskSpaceUsed (only
>> includes size of non-obsolete files) and getTotalDiskSpaceUsed
>> (includes everything)."
>>
>
> Thank you so much for your help.
> If I'm reading this right, it sounds like the extra 76 GB of disk space being
> used could be due to SSTables that are obsolete due to compaction but not yet
> deleted.  But wouldn't I then see a big difference in cfstats between
> Space used (live) and Space used (total)?  Here's what is being reported for
> this node:
>                Space used (live): 113946099884
>                Space used (total): 113946099884
>
>
>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Julie <ju...@nextcentury.com>.
Jonathan Ellis <jbellis <at> gmail.com> writes:

> "SSTables that are obsoleted by a compaction are deleted
> asynchronously when the JVM performs a GC. You can force a GC from
> jconsole if necessary, but Cassandra will force one itself if it
> detects that it is low on space. A compaction marker is also added to
> obsolete sstables so they can be deleted on startup if the server does
> not perform a GC before being restarted.
> 
> "CFStoreMBean exposes sstable space used as getLiveDiskSpaceUsed (only
> includes size of non-obsolete files) and getTotalDiskSpaceUsed
> (includes everything)."
> 

Thank you so much for your help.
If I'm reading this right, it sounds like the extra 76 GB of disk space being
used could be due to SSTables that are obsolete due to compaction but not yet
deleted.  But wouldn't I then see a big difference in cfstats between
Space used (live) and Space used (total)?  Here's what is being reported for
this node:
		Space used (live): 113946099884
		Space used (total): 113946099884
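
For reference, those two lines come straight out of cfstats on the node
itself, e.g.:

    # "live" should exclude obsolete-but-not-yet-deleted sstables; "total"
    # includes them, so a gap between the two would mean obsolete sstables
    # are still sitting on disk.
    nodetool -host localhost cfstats | grep 'Space used'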





Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Jonathan Ellis <jb...@gmail.com>.
On Wed, Jul 7, 2010 at 1:22 PM, Julie <ju...@nextcentury.com> wrote:
> Jonathan Ellis <jbellis <at> gmail.com> writes:
>
>> On Wed, Jul 7, 2010 at 12:10 PM, Julie <julie.sugar <at> nextcentury.com>
> wrote:
>> >
>> > This doesn't explain why 30 GB of data is taking up 106 GB of disk 24 hours
>> > after all writes have completed.  Compactions should be complete, no?
>>
>> http://wiki.apache.org/cassandra/MemtableSSTable should help.
>
> Thank you for your response, Jonathan.  I may be misunderstanding the page you
> referred me to.  Does it say I can expect that I will have temporary periods
> when 2x my existing data size will be required for disk space?  This I can plan
> for.

Yes, it does, but I was primarily referring to these paragraphs:

"SSTables that are obsoleted by a compaction are deleted
asynchronously when the JVM performs a GC. You can force a GC from
jconsole if necessary, but Cassandra will force one itself if it
detects that it is low on space. A compaction marker is also added to
obsolete sstables so they can be deleted on startup if the server does
not perform a GC before being restarted.

"CFStoreMBean exposes sstable space used as getLiveDiskSpaceUsed (only
includes size of non-obsolete files) and getTotalDiskSpaceUsed
(includes everything)."

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Julie <ju...@nextcentury.com>.
Jonathan Ellis <jbellis <at> gmail.com> writes:

> On Wed, Jul 7, 2010 at 12:10 PM, Julie <julie.sugar <at> nextcentury.com> 
wrote:
> >
> > This doesn't explain why 30 GB of data is taking up 106 GB of disk 24 hours
> > after all writes have completed.  Compactions should be complete, no?
> 
> http://wiki.apache.org/cassandra/MemtableSSTable should help.

Thank you for your response, Jonathan.  I may be misunderstanding the page you 
referred me to.  Does it say I can expect that I will have temporary periods
when 2x my existing data size will be required for disk space?  This I can plan
for.  

But is there some sort of rule of thumb that it could require 3x or 4x the data
size in disk space during digestion of the data?

Thank you again for your help!
Julie




Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Jonathan Ellis <jb...@gmail.com>.
On Wed, Jul 7, 2010 at 12:10 PM, Julie <ju...@nextcentury.com> wrote:
> I am thinking that the timestamps and column names should be included in the
> column family stats, which basically says 300,000 rows that are 100KB each=30
> GB.  My rows only have 1 column so there should only be one timestamp.  My
> column name is only 10 bytes long.
>
> This doesn't explain why 30 GB of data is taking up 106 GB of disk 24 hours
> after all writes have completed.  Compactions should be complete, no?

http://wiki.apache.org/cassandra/MemtableSSTable should help.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Jordan Pittier - Rezel <jo...@rezel.net>.
I see the same thing here. I have tried to do some maths including
timestamps, column names, keys and raw data, but in the end Cassandra reports
a cluster size 2 to 3 times bigger than the raw data. I am surely
missing something in my formula, plus I have a lot of free hard drive space, so
it's not a big issue to me. Just puzzling.

On Wed, Jul 7, 2010 at 7:17 PM, Peter Schuller
<pe...@infidyne.com>wrote:

> > I am thinking that the timestamps and column names should be included in
> the
> > column family stats, which basically says 300,000 rows that are 100KB
> each=30
> > GB.  My rows only have 1 column so there should only be one timestamp.
>  My
> > column name is only 10 bytes long.
> >
> > This doesn't explain why 30 GB of data is taking up 106 GB of disk 24
> hours
> > after all writes have completed.  Compactions should be complete, no?
>
> Nope, it sounds fishy to me. Presuming that compaction is not actively
> running in the background still (should be obvious from logs and/or
> CPU usage and/or disk I/O).
>
> --
> / Peter Schuller
>

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Peter Schuller <pe...@infidyne.com>.
> I am thinking that the timestamps and column names should be included in the
> column family stats, which basically says 300,000 rows that are 100KB each=30
> GB.  My rows only have 1 column so there should only be one timestamp.  My
> column name is only 10 bytes long.
>
> This doesn't explain why 30 GB of data is taking up 106 GB of disk 24 hours
> after all writes have completed.  Compactions should be complete, no?

Nope, it sounds fishy to me. Presuming that compaction is not actively
running in the background still (should be obvious from logs and/or
CPU usage and/or disk I/O).

-- 
/ Peter Schuller

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Julie <ju...@nextcentury.com>.
Rob Coli <rcoli <at> digg.com> writes:
 
> Is your workload straight INSERT or does it contain UPDATE and/or 
> DELETE? If your workload contains UPDATE/DELETE and GCGraceSeconds (10 
> days by default) hasn't passed, you might have a non-trivial number of 
> tombstone rows. Only major compactions clean up tombstones in the 
> current implementation and even given improvements [1] they will always 
> be necessary in some cases.
> 
> =Rob

I am in the testing phase so I'm writing dummy data using keys numbered '1' to
'1000000', writing each row exactly once.  I'm using batch_mutate() but in this
particular case my batch size is 1.  I do not do any deletes and I am not
overwriting any rows since I write each key (1-1,000,000) only once.  So I don't
think I should have any tombstones, unless there's something going on that I
don't know about.

Thanks for your help,
Julie 




Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Rob Coli <rc...@digg.com>.
On 7/7/10 10:10 AM, Julie wrote:

> This doesn't explain why 30 GB of data is taking up 106 GB of disk 24 hours
> after all writes have completed.  Compactions should be complete, no?

Is your workload straight INSERT or does it contain UPDATE and/or 
DELETE? If your workload contains UPDATE/DELETE and GCGraceSeconds (10 
days by default) hasn't passed, you might have a non-trivial number of 
tombstone rows. Only major compactions clean up tombstones in the 
current implementation and even given improvements [1] they will always 
be necessary in some cases.

=Rob

[1] https://issues.apache.org/jira/browse/CASSANDRA-1074


Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Julie <ju...@nextcentury.com>.
Peter Schuller <peter.schuller <at> infidyne.com> writes:

> > Keep in mind that there is additional data storage overhead, including 
> > timestamps and column names. Because the schema can vary from row to row, 
> > the column names are stored with each row, in addition to the data. Disk
> > space-efficiency is not a primary design goal for Cassandra.
> 
> If the rows that are 200k (or was it 100k) are not single columns but
> rather lots and lots of smaller columns, then this will be
> significant.
> 
> In addition, during compaction there is the potential for using twice
> the amount of disk in a column family (during a major compaction all
> data will at some point exist in duplicates).

I am thinking that the timestamps and column names should be included in the 
column family stats, which basically says 300,000 rows that are 100KB each=30 
GB.  My rows only have 1 column so there should only be one timestamp.  My 
column name is only 10 bytes long.  

This doesn't explain why 30 GB of data is taking up 106 GB of disk 24 hours 
after all writes have completed.  Compactions should be complete, no?

Julie



Re: Cassandra disk space utilization

Posted by Peter Schuller <pe...@infidyne.com>.
> Keep in mind that there is additional data storage overhead, including
> timestamps and column names. Because the schema can vary from row to row,
> the column names are stored with each row, in addition to the data. Disk
> space-efficiency is not a primary design goal for Cassandra.

If the rows that are 200k (or was it 100k) are not single columns but
rather lots and lots of smaller columns, then this will be
significant.

In addition, during compaction there is the potential for using twice
the amount of disk in a column family (during a major compaction all
data will at some point exist in duplicates).

-- 
/ Peter Schuller

Re: Cassandra disk space utilization

Posted by Mason Hale <ma...@onespot.com>.
Hi Julie --

Keep in mind that there is additional data storage overhead, including
timestamps and column names. Because the schema can vary from row to row,
the column names are stored with each row, in addition to the data. Disk
space-efficiency is not a primary design goal for Cassandra.

Mason



On Wed, Jul 7, 2010 at 10:56 AM, Julie <ju...@nextcentury.com> wrote:

> Hi guys,
> I have what may be a dumb question but I am confused by how much disk space
> is
> being used by my Cassandra nodes.  I have 10 nodes in my cluster with a
> replication factor of 3.  After I write 1,000,000 rows to the database
> (100kB
> each), I see that they have been distributed very evenly, about 100,000
> rows
> per node but because of the replication factor of 3, each node contains
> about
> 300,000 rows.  This is all good.  Since my rows are 100kB each, I expect
> each
> node to store about 30GB of data, however that is not what I am seeing.
> Instead, I am seeing some nodes that do not experience any compaction
> exceptions but report their space used as MUCH more.  Here's one using 106
> GB
> of disk.  My disks are only 160 GB so this is at the bleeding edge and I
> thought my node would be able to store more data.
>
> I only use a single column family so here is the cfstats output from one of
> my
> nodes (server5):
>
>                Column Family: Standard1
>                SSTable count: 12
>                Space used (live): 113946099884
>                Space used (total): 113946099884
>                Memtable Columns Count: 0
>                Memtable Data Size: 0
>                Memtable Switch Count: 451
>                Read Count: 31786
>                Read Latency: 161.429 ms.
>                Write Count: 300633
>                Write Latency: 0.124 ms.
>                Pending Tasks: 0
>                Key cache: disabled
>                Row cache capacity: 3000
>                Row cache size: 3000
>                Row cache hit rate: 0.38331340841880074
>                Compacted row minimum size: 100220
>                Compacted row maximum size: 100225
>                Compacted row mean size: 100224
>
> Note that I wrote these 1M rows of data yesterday and the system has had 24
> hours to digest it. There are no exceptions in the system.log file.  Here's
> the tail end of it:
>
> ...
> INFO [SSTABLE-CLEANUP-TIMER] 2010-07-06 16:13:43,162
> SSTableDeletingReference.java (line 104) Deleted
> /var/lib/cassandra/data/Keyspace1/Standard1-430-Data.db
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-07-06 16:13:43,269
> SSTableDeletingReference.java (line 104) Deleted
> /var/lib/cassandra/data/Keyspace1/Standard1-445-Data.db
>  INFO [COMPACTION-POOL:1] 2010-07-06 16:35:21,718 CompactionManager.java
> (line
> 246) Compacting []
>  INFO [Timer-1] 2010-07-06 17:01:01,907 Gossiper.java (line 179)
> InetAddress
> /10.248.107.19 is now dead.
>  INFO [GMFD:1] 2010-07-06 17:01:42,039 Gossiper.java (line 568) InetAddress
> /10.248.107.19 is now UP
>  INFO [COMPACTION-POOL:1] 2010-07-06 17:35:21,306 CompactionManager.java
> (line
> 246) Compacting []
>  INFO [COMPACTION-POOL:1] 2010-07-06 18:35:20,802 CompactionManager.java
> (line
> 246) Compacting []
>  INFO [COMPACTION-POOL:1] 2010-07-06 19:35:20,389 CompactionManager.java
> (line
> 246) Compacting []
>  INFO [COMPACTION-POOL:1] 2010-07-06 20:35:19,934 CompactionManager.java
> (line
> 246) Compacting []
>  INFO [COMPACTION-POOL:1] 2010-07-06 21:35:19,582 CompactionManager.java
> (line
> 246) Compacting []
>  INFO [COMPACTION-POOL:1] 2010-07-06 22:35:19,233 CompactionManager.java
> (line
> 246) Compacting []
>  INFO [COMPACTION-POOL:1] 2010-07-06 23:35:18,593 CompactionManager.java
> (line
> 246) Compacting []
>  INFO [COMPACTION-POOL:1] 2010-07-07 00:35:18,076 CompactionManager.java
> (line
> 246) Compacting []
>  INFO [COMPACTION-POOL:1] 2010-07-07 01:35:17,673 CompactionManager.java
> (line
> 246) Compacting []
>  INFO [COMPACTION-POOL:1] 2010-07-07 02:35:17,172 CompactionManager.java
> (line
> 246) Compacting []
>  INFO [COMPACTION-POOL:1] 2010-07-07 03:35:16,784 CompactionManager.java
> (line
> 246) Compacting []
>  INFO [COMPACTION-POOL:1] 2010-07-07 04:35:16,383 CompactionManager.java
> (line
> 246) Compacting []
>
> Thank you for your help!!
> Julie
>
>
>