Posted to user@cassandra.apache.org by Julie <ju...@nextcentury.com> on 2010/08/04 22:47:05 UTC

Re: Cassandra disk space utilization WAY higher than I would expect

Jonathan Ellis <jbellis <at> gmail.com> writes:

> 
> did you try compact instead of cleanup, anyway?
> 

Hi Jonathan, 
Thanks for your reply. Actually, I didn't use compact, I used cleanup.  But I
did some testing with compact today since you mentioned it.  Running nodetool
compact does reduce the disk usage on each node, but I don't see usage drop to
the amount I would expect until I run nodetool cleanup on every node; cleanup
seems to force all the SSTables to combine into one.

Here are the results of my experiments with cleanup and compact.

At 3:25pm, here’s what my ring distribution was:

Address         Status  Load       Range                                     Ring
                                   170141183460469231731687303715884105728
10.248.54.192   Up      12.58 GB   21267647932558653966460912964485513216   |<--|
10.248.254.15   Up      12.56 GB   42535295865117307932921825928971026432   |   ^
10.248.135.239  Up      13.63 GB   63802943797675961899382738893456539648   v   |
10.248.199.79   Up      11.64 GB   85070591730234615865843651857942052864   |   ^
10.249.166.80   Up      12.18 GB   106338239662793269832304564822427566080  v   |
10.248.223.191  Up      11.18 GB   127605887595351923798765477786913079296  |   ^
10.248.122.240  Up      11.19 GB   148873535527910577765226390751398592512  v   |
10.248.34.80    Up      11.45 GB   170141183460469231731687303715884105728  |-->|


Ran nodetool compact on every node.  Then at 4pm, once quiescent:

Address         Status  Load       Range                                     Ring
                                   170141183460469231731687303715884105728
10.248.54.192   Up      6.93 GB    21267647932558653966460912964485513216   |<--|
10.248.254.15   Up      6.72 GB    42535295865117307932921825928971026432   |   ^
10.248.135.239  Up      13.96 GB   63802943797675961899382738893456539648   v   |
10.248.199.79   Up      7.15 GB    85070591730234615865843651857942052864   |   ^
10.249.166.80   Up      7.98 GB    106338239662793269832304564822427566080  v   |
10.248.223.191  Up      6.76 GB    127605887595351923798765477786913079296  |   ^
10.248.122.240  Up      6.58 GB    148873535527910577765226390751398592512  v   |
10.248.34.80    Up      6.8 GB     170141183460469231731687303715884105728  |-->|

So the manual compaction did help somewhat but did not get the nodes down to the
size of their raw data.  There are still multiple SSTables on most nodes.

At 4:02pm, ran nodetool cleanup on every node.

At 4:12pm, nodes are taking up the expected amount of space and all nodes are
using exactly 1 SSTable (fully compacted):

Address         Status  Load       Range                                     Ring
                                   170141183460469231731687303715884105728
10.248.54.192   Up      5.64 GB    21267647932558653966460912964485513216   |<--|
10.248.254.15   Up      5.64 GB    42535295865117307932921825928971026432   |   ^
10.248.135.239  Up      5.65 GB    63802943797675961899382738893456539648   v   |
10.248.199.79   Up      5.59 GB    85070591730234615865843651857942052864   |   ^
10.249.166.80   Up      5.57 GB    106338239662793269832304564822427566080  v   |
10.248.223.191  Up      5.55 GB    127605887595351923798765477786913079296  |   ^
10.248.122.240  Up      5.57 GB    148873535527910577765226390751398592512  v   |
10.248.34.80    Up      5.59 GB    170141183460469231731687303715884105728  |-->|


Nodetool cleanup works so beautifully that I am wondering whether there is any
harm in running "nodetool cleanup" from a cron job on a live system that is
actively processing reads and writes to the database?
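
For example (purely illustrative; the schedule, install path, and JMX port
below are placeholders, not my actual setup):

  # run cleanup weekly, Sundays at 3am, against the local node
  0 3 * * 0  /opt/cassandra/bin/nodetool --host localhost --port 8080 cleanup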

Julie




Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Peter Schuller <pe...@infidyne.com>.
> One thing to keep in mind is that SSTables are not actually removed
> from disk until the garbage collector has identified the relevant
> in-memory structures as garbage (there is a note on the wiki about

However, I forgot that the 'load' reported by nodetool ring does not, I
think, represent on-disk size. So never mind this bit.

-- 
/ Peter Schuller

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Peter Schuller <pe...@infidyne.com>.
> So the manual compaction did help somewhat but did not get the nodes down to the
> size of their raw data.  There are still multiple SSTables on most nodes.
>
> At 4:02pm, ran nodetool cleanup on every node.
>
> At 4:12pm, nodes are taking up the expected amount of space and all nodes are
> using exactly 1 SSTable (fully compacted):

One thing to keep in mind is that SSTables are not actually removed
from disk until the garbage collector has identified the relevant
in-memory structures as garbage (there is a note on the wiki about
this somewhere; it's a way to avoid the complexity of keeping track of
when an sstable becomes safe to delete).
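
If I understand it correctly, the general shape is something like the
following (just a sketch of the delete-on-GC idea, not the actual Cassandra
code): each sstable reader gets a phantom reference, and a background task
deletes the files once the GC has enqueued that reference, i.e. once nothing
can possibly still be reading them:

  import java.io.File;
  import java.lang.ref.PhantomReference;
  import java.lang.ref.Reference;
  import java.lang.ref.ReferenceQueue;
  import java.util.Collections;
  import java.util.HashSet;
  import java.util.Set;

  class SSTableHandle {
      final File dataFile;                 // the on-disk -Data.db file
      SSTableHandle(File dataFile) { this.dataFile = dataFile; }
  }

  class DeletingReference extends PhantomReference<SSTableHandle> {
      final File dataFile;                 // kept here; the referent is gone by the time we run
      volatile boolean deleteOnCollect;    // set once the sstable has been compacted away

      DeletingReference(SSTableHandle handle, ReferenceQueue<SSTableHandle> queue) {
          super(handle, queue);
          this.dataFile = handle.dataFile;
      }

      void markCompacted() { deleteOnCollect = true; }
  }

  class SSTableReaper {
      private final ReferenceQueue<SSTableHandle> queue = new ReferenceQueue<SSTableHandle>();
      // strong refs to the phantom references themselves, so they survive until enqueued
      private final Set<DeletingReference> pending =
          Collections.synchronizedSet(new HashSet<DeletingReference>());

      DeletingReference track(SSTableHandle handle) {
          DeletingReference ref = new DeletingReference(handle, queue);
          pending.add(ref);
          return ref;
      }

      // run periodically from a timer: anything on the queue is unreachable,
      // so its files are safe to remove without any reader bookkeeping
      void reap() throws InterruptedException {
          Reference<? extends SSTableHandle> ref;
          while ((ref = queue.remove(100)) != null) {
              DeletingReference dref = (DeletingReference) ref;
              pending.remove(dref);
              if (dref.deleteOnCollect)
                  dref.dataFile.delete();
          }
      }
  }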

I may be wrong, but I did a quick check and did not find an obvious GC
trigger in the codepath for the 'cleanup' command. So while I'm not
sure why the cleanup would necessarily help other than generally
generating garbage and perhaps triggering a GC, a delay in actually
freeing disk space can probably be attributed to the GC.

(The reason I don't understand why cleanup would help is that even if
cleanup did trigger sufficient garbage generation that CMS kicks in
and does a mark/sweep, thus triggering the deletion of old sstables,
presumably the cleanup itself would produce new sstables that would
then have to wait anyway. Unless there is some code path to avoid
doing that if nothing at all changes in the sstables as a result of
the cleanup... I don't know.)

-- 
/ Peter Schuller

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Peter Schuller <pe...@infidyne.com>.
> sstables waiting for the GC to trigger actual file removal. *However*,
> and this is what I meant with my follow-up, that still does not
> explain the data from her post unless 'nodetool ring' reports total
> sstable size rather than the total size of live sstables.

As far as I can tell, the information 'nodetool ring' prints comes
from SSTableTracker's liveSize. This seems to be updated in replace()
which is used by the compaction manager after compaction. So I do
believe 'nodetool ring' should *not* include the size of sstables that
have been compacted, regardless of GC and whether they still remain on
disk.
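
Very roughly, the bookkeeping I have in mind looks like this (a simplified
sketch, not the actual SSTableTracker source): the tracker only sums the
sstables it currently considers live, and replace() swaps compaction inputs
out of that set immediately, whether or not their files are still on disk:

  import java.util.Collection;
  import java.util.Set;
  import java.util.concurrent.CopyOnWriteArraySet;
  import java.util.concurrent.atomic.AtomicLong;

  class SSTable {
      final String path;
      final long bytesOnDisk;
      SSTable(String path, long bytesOnDisk) {
          this.path = path;
          this.bytesOnDisk = bytesOnDisk;
      }
  }

  class LiveSSTableTracker {
      private final Set<SSTable> live = new CopyOnWriteArraySet<SSTable>();
      private final AtomicLong liveSize = new AtomicLong();

      void add(SSTable sstable) {
          if (live.add(sstable))
              liveSize.addAndGet(sstable.bytesOnDisk);
      }

      // called by the compaction manager after a compaction finishes:
      // the inputs stop counting toward the reported load right away,
      // regardless of when the GC lets their files actually be deleted
      void replace(Collection<SSTable> compacted, Collection<SSTable> replacements) {
          for (SSTable old : compacted) {
              if (live.remove(old))
                  liveSize.addAndGet(-old.bytesOnDisk);
          }
          for (SSTable fresh : replacements)
              add(fresh);
      }

      long getLiveSize() { return liveSize.get(); }
  }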

-- 
/ Peter Schuller

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Rob Coli <rc...@digg.com>.
On 8/6/10 3:30 PM, Peter Schuller wrote:
> *However*,
> and this is what I meant with my follow-up, that still does not
> explain the data from her post unless 'nodetool ring' reports total
> sstable size rather than the total size of live sstables.

Relatively limited time available to respond to this post, but FYI :

src/java/org/apache/cassandra/tools/NodeCmd.java
"
     public void printRing(PrintStream outs)
...
         Map<String, String> loadMap = probe.getLoadMap();
"

probe object is of type NodeProbe, so..

src/java/org/apache/cassandra/tools/NodeProbe.java
"
         Map<String, String> loadMap = ssProxy.getLoadMap();
"

ssProxy object is a MBean proxy to Storage Service methods..

./cassandra-0.6/src/java/org/apache/cassandra/service/StorageService.java
"
     public double getLoad()
...
                 bytes += cfs.getLiveDiskSpaceUsed();
"

In other words, "nodetool ring" should be reporting only the live disk 
space used.

> Just to be clear, my perhaps misuse of the term "obsolete" only refers
> to sstables that have been successfully compacted together with others
> into a new sstable which replaces the old ones (hence making them
> "obsolete"). I do not mean to imply that they contain obsolete
> columns.

Ok, thought that's what you meant. Thx for the clarification.

=Rob

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Peter Schuller <pe...@infidyne.com>.
> Your post refers to "obsolete" sstables, but the only thing that makes them
> "obsolete" in this case is that they have been compacted?

Yes.

> As I understand Julie's case, she is :
>
> a) initializing her cluster
> b) inserting some number of unique keys with CL.ALL
> c) noticing that more disk space (6x?) than is expected is used
> d) but that she gets expected usage if she does a major compaction
>
> In other words, the problem isn't "temporary disk space occupied during the
> compact", it's permanent disk space occupied unless she compacts.

Sorry, I re-read my previous message and I wasn't being very clear and
I was probably a bit confused too ;)

The reason I mention temporary spikes during compaction is that these
are fully expected to roughly double disk space use. I did not mean to
imply that this is what she is seeing, since she's specifically
waiting for background compaction to complete. Rather, my point was
that as long as we are only talking about roughly a doubling of disk
space, regardless of its cause, it is not worse than what you may
expect anyway, even if temporarily, during active compaction.
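(As a rough worked example: fully compacting roughly 28 GB of live sstables
can briefly require close to 56 GB on disk, since the input files cannot be
removed until the new sstable has been written out completely.)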

I still have no explanation for lingering data, other than obsolete
sstables waiting for the GC to trigger actual file removal. *However*,
and this is what I meant with my follow-up, that still does not
explain the data from her post unless 'nodetool ring' reports total
sstable size rather than the total size of live sstables.

But let's suppose that's wrong and 'nodetool ring' somehow does
include all sstable data; in such a case it seems that the data from
Julie's latest post may be consistent with obsolete sstables, since
the repeated attempts at compaction/cleanup may very well have
triggered a CMS sweep, thus eventually freeing the sstables (while
simply leaving the cluster idle would not have done so normally).

It is also worth noting that this kind of space waste (sstables hanging
around waiting for a GC) is not likely to scale to larger data sets, because
the probability of triggering a CMS sweep within a reasonable period after
compaction increases with the amount of traffic you throw at the cluster. If
you're writing, say, 100 GB of data, you're very unlikely not to trigger a
CMS sweep within the first few percent of the run (assuming you don't have a
huge 250 GB heap or something). For this reason (even disregarding that
Cassandra tries to trigger a GC when disk runs low), I would not expect this
waste to be "cumulative" on top of the temporary spikes resulting from
compaction. In effect, these effects should normally not matter: you need the
disk space to handle the compaction spikes anyway, and you are unlikely to
have both compaction spikes *and* these obsolete sstables at the same time,
even without the explicit GC triggering Cassandra does.

> There is clearly overhead from there being multiple SSTables with multiple
> bloom filters and multiple indexes. But from my understanding, that does not
> fully account for the difference in disk usage she is seeing. If it is 6x
> across the whole cluster, it seems unlikely that the meta information is 5x
> the size of the actual information.

Definitely agreed. I should have made it clearer that I was only
addressing the post.

Julie, are you still seeing 6x and similar factors of waste in your
reproducible test cases, or was the 6x factor limited to your initial
experience with real data?

The reason I'm barking up this tree is that I figure we may be
observing the results of two independent problems here, and if the
smaller test case can be explained away by lack of GC, then it
probably doesn't help figuring out what happened in the original
problem scenario.

Hmm. Maybe a 'nodetool gc' would be in order to make it easy to
trigger a gc without going for jconsole/JMX manually.
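
Until then, something like this little JMX client would do the same thing
(sketch only; the host and port are placeholders, and 0.6 exposes JMX on
port 8080 by default as far as I remember):

  import javax.management.MBeanServerConnection;
  import javax.management.ObjectName;
  import javax.management.remote.JMXConnector;
  import javax.management.remote.JMXConnectorFactory;
  import javax.management.remote.JMXServiceURL;

  public class ForceGc {
      public static void main(String[] args) throws Exception {
          String host = args.length > 0 ? args[0] : "localhost";
          String port = args.length > 1 ? args[1] : "8080";
          JMXServiceURL url = new JMXServiceURL(
              "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
          JMXConnector connector = JMXConnectorFactory.connect(url);
          try {
              MBeanServerConnection mbs = connector.getMBeanServerConnection();
              // the standard java.lang:type=Memory MBean exposes a gc() operation
              mbs.invoke(new ObjectName("java.lang:type=Memory"), "gc", null, null);
          } finally {
              connector.close();
          }
      }
  }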

> I haven't been following this thread very closely, but I don't think
> "obsolete" SSTables should be relevant, because she's not doing UPDATE or
> DELETE and she hasn't changed cluster topography (the "cleanup" case).

Even a workload strictly limited to writing new data with unique keys
and column names will cause sstables to become obsolete, as long as
you write enough data to reach the compaction threshold.
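For example, with the default minimum compaction threshold of 4 (assuming the
defaults haven't been changed), every time four sstables of similar size
accumulate they are merged into one and the four inputs become obsolete, even
though every key written was unique.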

Just to be clear, my perhaps misuse of the term "obsolete" only refers
to sstables that have been successfully compacted together with others
into a new sstable which replaces the old ones (hence making them
"obsolete"). I do not mean to imply that they contain obsolete
columns.

-- 
/ Peter Schuller

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Rob Coli <rc...@digg.com>.
On 8/20/10 1:58 PM, Julie wrote:
> Julie<julie.sugar<at>  nextcentury.com>  writes:
>
> Please see previous post but is hinted handoff a factor if the CL is set to ALL?

Your previous post looks like a flush or compaction is causing the node
to mark its neighbors down. Do you see a correlation between memtable
flushes or compactions and the GMFD markings? Are you running a version of
Cassandra (<= 0.6.4) which has the MESSAGE-DESERIALIZER-POOL bug, and/or
are you seeing a backup there? That can cause GMFD to incorrectly mark
nodes down:

https://issues.apache.org/jira/browse/CASSANDRA-1358

However, you are correct; I missed that you are writing at CL.ALL. With
CL.ALL, you should get an exception if nodes are unreachable during a
write, which means that Hinted Handoff is unlikely to be the culprit for
your usage bloat. Since you clearly have nodes which are occasionally
considered unreachable, are you seeing those exceptions?

=Rob

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Julie <ju...@nextcentury.com>.
Julie <julie.sugar <at> nextcentury.com> writes:

Please see previous post but is hinted handoff a factor if the CL is set to ALL?




Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Julie <ju...@nextcentury.com>.
Robert Coli <rcoli <at> digg.com> writes:

> Check the size of the Hinted Handoff CF? If your nodes are flapping
> under sustained write, they could be storing a non-trivial number of
> hinted handoff rows? Probably not 5x usage though..
> 
> http://wiki.apache.org/cassandra/Operations
> "
> The reason why you run nodetool cleanup on all live nodes [after
> replacing a node] is to remove old Hinted Handoff writes stored for
> the dead node. 
> =Rob


Please see below and let me know if you think hinted handoff is to blame.  

I do see some down/up activity according to the gossiping on the nodes.  
Interestingly, I see mass "deaths" being detected on servers 1, 2, 4, 5, and 
6.  Each server detects the "mass death" at a unique time, making it look as 
though it's the server that's detecting the mass death that is really the 
culprit.  Server3, the bloated node, is not having this problem.

As far as nodes being reported as going down and coming back up (always
quickly), the down/up transitions reported for each server are as follows:
server1: 16 times
server2: 17 times
server3: 12 times
server4: 18 times
server5: 24 times
server6: 20 times
server7: 13 times
server8: 12 times

Again, server3 is looking like a very healthy node, so you wouldn't think it
would have a backlog of hinted handoffs coming its way when the writes complete.
Server5 and server6 seem to be the least healthy.

Here's the streaming that took place on all 8 nodes, following the 800,000 row
write (writes completed at 13:47):

server1-system.log: INFO [Thread-25] 2010-08-16 12:59:49,858
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-172-Data.db
server1-system.log: INFO [Thread-31] 2010-08-16 18:12:01,829 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-475-Data.db

server3-system.log: INFO [Thread-25] 2010-08-16 14:34:06,827 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-467-Data.db
server3-system.log: INFO [Thread-28] 2010-08-16 15:38:34,685 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-475-Data.db 
server3-system.log: INFO [Thread-34] 2010-08-16 17:24:57,584 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-477-Data.db
server3-system.log: INFO [Thread-31] 2010-08-16 17:26:37,281 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-476-Data.db

server4-system.log: INFO [Thread-25] 2010-08-16 18:30:19,313 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-472-Data.db
server4-system.log: INFO [Thread-28] 2010-08-16 18:33:07,141 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-473-Data.db

server6-system.log: INFO [Thread-25] 2010-08-16 16:53:43,108 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-470-Data.db
server6-system.log: INFO [Thread-28] 2010-08-16 17:58:15,031 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-471-Data.db

server7-system.log: INFO [Thread-25] 2010-08-16 12:39:54,342 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-110-Data.db
server7-system.log: INFO [Thread-28] 2010-08-16 12:46:16,067 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-144-Data.db
server7-system.log: INFO [Thread-31] 2010-08-16 14:51:24,585 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-469-Data.db
server7-system.log: INFO [Thread-34] 2010-08-16 15:17:22,168  
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-472-Data.db

server8-system.log: INFO [Thread-25] 2010-08-16 12:46:39,462 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-133-Data.db
server8-system.log: INFO [Thread-28] 2010-08-16 15:56:38,124 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-473-Data.db
server8-system.log: INFO [Thread-31] 2010-08-16 17:05:24,805 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-475-Data.db
server8-system.log: INFO [Thread-34] 2010-08-16 18:52:03,416 
 StreamCompletionHandler.java (line 64) 
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-476-Data.db

Do you see any red flags?

Thanks for your help!



Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Robert Coli <rc...@digg.com>.
On Thu, Aug 19, 2010 at 7:23 AM, Julie <ju...@nextcentury.com> wrote:
> At this point, I logged in.  The data distribution on this node was 122GB.  I
> started performing a manual nodetool cleanup.

Check the size of the Hinted Handoff CF? If your nodes are flapping
under sustained write, they could be storing a non-trivial number of
hinted handoff rows? Probably not 5x usage though..

http://wiki.apache.org/cassandra/Operations
"
The reason why you run nodetool cleanup on all live nodes [after
replacing a node] is to remove old Hinted Handoff writes stored for
the dead node.
"

You could relatively quickly determine whether Hinted Handoff is
implicated by running your test with the feature turned off.

https://issues.apache.org/jira/browse/CASSANDRA-894
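
If memory serves, in 0.6 that is the HintedHandoffEnabled element in
storage-conf.xml (illustrative snippet; check your own config for the exact
element and default):

  <!-- conf/storage-conf.xml -->
  <HintedHandoffEnabled>false</HintedHandoffEnabled>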

=Rob

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Julie <ju...@nextcentury.com>.
Peter Schuller <peter.schuller <at> infidyne.com> writes:

> Without necessarily dumping all the information - approximately what
> do they contain? Do they contain anything about compactions,
> anti-compactions, streaming, etc?
> 
> With an idle node after taking writes, I *think* the only expected
> disk I/O (once it has settled) would be a memtable flush triggered by
> memtable_flush_after_mins, and possibly compactions resulting from
> that (depending on how close one were to triggering compaction prior
> to the memtable flush). Whatever is causing additional sstables to be
> written, even if somehow triggered incorrectly, I'd hope that they
> were logged still.

******************************************************
Hi Peter,

Thanks for your ideas.  Here is what I see in the bloated node's system.log file
(condensed) starting 4 minutes after the last write.  At this time, data usage
on this node was only around 28GB (as expected).
******************************************************

INFO [COMPACTION-POOL:1] 2010-08-16 13:51:38,432 CompactionManager.java
 (line 246) Compacting ...
 INFO [COMPACTION-POOL:1] 2010-08-16 14:20:02,840 CompactionManager.java
 (line 246) Compacting ...
 INFO [COMPACTION-POOL:1] 2010-08-16 14:28:37,085 CompactionManager.java
 (line 246) Compacting ...

 INFO [Thread-25] 2010-08-16 14:34:06,827 StreamCompletionHandler.java
 (line 64) Streaming added
 /var/lib/cassandra/data/Keyspace1/Standard1-467-Data.db

 INFO [Timer-1] 2010-08-16 14:34:31,655 Gossiper.java (line 179)
 InetAddress /10.254.107.178 is now dead.
 INFO [GMFD:1] 2010-08-16 14:35:03,012 Gossiper.java (line 568)
 InetAddress /10.254.107.178 is now UP

 INFO [COMPACTION-POOL:1] 2010-08-16 14:36:04,808 CompactionManager.java
 (line 246) Compacting ...
 INFO [COMPACTION-POOL:1] 2010-08-16 14:40:31,283 CompactionManager.java
 (line 246) Compacting ...
 INFO [FLUSH-TIMER] 2010-08-16 14:50:46,570 ColumnFamilyStore.java (line 357)
 Standard1 has reached its threshold; switching in a fresh Memtable at
 CommitLogContext(file=
 '/var/lib/cassandra/commitlog/CommitLog-1281980828674.log', position=87493038)


 INFO [FLUSH-TIMER] 2010-08-16 14:50:46,602 ColumnFamilyStore.java (line 609)
 Enqueuing flush of Memtable(Standard1)@31873581
 INFO [FLUSH-WRITER-POOL:1] 2010-08-16 14:50:46,647 Memtable.java (line 148)
 Writing Memtable(Standard1)@31873581
 INFO [GC inspection] 2010-08-16 14:50:47,350 GCInspector.java (line 110)
 GC for ConcurrentMarkSweep: 362 ms, 506927464 reclaimed leaving 129128920 used;
 max is 1172766720
 INFO [FLUSH-WRITER-POOL:1] 2010-08-16 14:50:51,137 Memtable.java (line 162)
 Completed flushing /var/lib/cassandra/data/Keyspace1/Standard1-472-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 14:50:58,411
 SSTableDeletingReference.java (line 104)
 Deleted /var/lib/cassandra/data/Keyspace1/Standard1-223-Data.db
...
...
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 14:51:40,662
 SSTableDeletingReference.java (line 104)
 Deleted /var/lib/cassandra/data/Keyspace1/Standard1-407-Data.db

 INFO [Timer-1] 2010-08-16 14:51:40,772 Gossiper.java (line 179)
 InetAddress /10.254.242.159 is now dead.

 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 14:51:41,011
 SSTableDeletingReference.java (line 104)
 Deleted /var/lib/cassandra/data/Keyspace1/Standard1-279-Data.db
 ...
 ...
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 14:52:22,206
 SSTableDeletingReference.java (line 104)
 Deleted /var/lib/cassandra/data/Keyspace1/Standard1-389-Data.db

 INFO [GMFD:1] 2010-08-16 14:52:34,906 Gossiper.java (line 568)
 InetAddress /10.254.242.159 is now UP

 INFO [COMPACTION-POOL:1] 2010-08-16 14:58:48,958 CompactionManager.java
 (line 246) Compacting ...
 INFO [COMPACTION-POOL:1] 2010-08-16 15:02:43,444 CompactionManager.java
 (line 246) Compacting ...
 INFO [COMPACTION-POOL:1] 2010-08-16 15:05:08,837 CompactionManager.java
 (line 246) Compacting []

 INFO [Timer-1] 2010-08-16 15:17:38,697 Gossiper.java (line 179)
 InetAddress /10.254.234.226 is now dead.
 INFO [GMFD:1] 2010-08-16 15:18:37,471 Gossiper.java (line 568)
 InetAddress /10.254.234.226 is now UP


 INFO [Thread-28] 2010-08-16 15:38:34,685 StreamCompletionHandler.java
 (line 64)
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-475-Data.db

 INFO [COMPACTION-POOL:1] 2010-08-16 16:05:08,811 CompactionManager.java
 (line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-16 17:05:08,792 CompactionManager.java
 (line 246) Compacting []

 INFO [Timer-1] 2010-08-16 17:05:55,583 Gossiper.java (line 179)
 InetAddress /10.254.242.159 is now dead.
 INFO [GMFD:1] 2010-08-16 17:06:19,614 Gossiper.java (line 568)
 InetAddress /10.254.242.159 is now UP


 INFO [Thread-34] 2010-08-16 17:24:57,584 StreamCompletionHandler.java
 (line 64)
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-477-Data.db

 INFO [Timer-1] 2010-08-16 17:25:19,206 Gossiper.java (line 179)
 InetAddress /10.210.157.187 is now dead.
 INFO [GMFD:1] 2010-08-16 17:25:53,096 Gossiper.java (line 568)
 InetAddress /10.210.157.187 is now UP


 INFO [Thread-31] 2010-08-16 17:26:37,281 StreamCompletionHandler.java
 (line 64)
 Streaming added /var/lib/cassandra/data/Keyspace1/Standard1-476-Data.db
 INFO [Timer-1] 2010-08-16 17:27:01,939 Gossiper.java (line 179)
 InetAddress /10.254.234.226 is now dead.
 INFO [GMFD:1] 2010-08-16 17:27:24,115 Gossiper.java (line 568)
 InetAddress /10.254.234.226 is now UP


 INFO [COMPACTION-POOL:1] 2010-08-16 18:05:08,764 CompactionManager.java
 (line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-16 19:05:08,741 CompactionManager.java
 (line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-16 20:05:08,718 CompactionManager.java
 (line 246) Compacting []
INFO [COMPACTION-POOL:1] 2010-08-16 21:05:08,694 CompactionManager.java
 (line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-16 22:05:08,674 CompactionManager.java
 (line 246) Compacting []

******************************************************
At this point, I logged in.  The data distribution on this node was 122GB.  I
started performing a manual nodetool cleanup.
******************************************************





Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Peter Schuller <pe...@infidyne.com>.
> I actually have the log files from all 8 nodes if it helps to diagnose what
> activity was going on behind the scenes.  I really need to understand how this
> happened.

Without necessarily dumping all the information - approximately what
do they contain? Do they contain anything about compactions,
anti-compactions, streaming, etc?

With an idle node after taking writes, I *think* the only expected
disk I/O (once it has settled) would be a memtable flush triggered by
memtable_flush_after_mins, and possibly compactions resulting from
that (depending on how close one were to triggering compaction prior
to the memtable flush). Whatever is causing additional sstables to be
written, even if somehow triggered incorrectly, I'd hope that they
were logged still.

What about something like a gossiping issue with some kind of
disagreement about token space? But even then, why would nodes
spontaneously start pushing data - my understanding is that this is
only triggered by administrative operations right now, which seems
confirmed by:

   http://wiki.apache.org/cassandra/Streaming

Assuming the log files contain some kind of activity such as
compaction/streaming/etc; do they correlate well in time with each
other and/or something else?

-- 
/ Peter Schuller

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Julie <ju...@nextcentury.com>.
Jonathan Ellis <jbellis <at> gmail.com> writes:

> 
> If you read the stack traces you pasted, the node in question ran out
> of diskspace.  When you have < 25% space free this is not surprising.
> 
> But fundamentally you are missing something important from your story
> here.  Disk space doesn't just increase spontaneously with "absolutely
> no activity."


I agree, it's not surprising for a node to run out of disk space when it is
occupying 122GB (133GB including the still present compacted files) on a 160GB
drive.  What I don't understand is how 30GB of data grew into 122GB.

I could send you the Cassandra log files from the time period after the 800,000
rows of writes complete.  There was no activity other than Cassandra doing
whatever work it does behind the scenes.

The 800,000 unique key writes started at 12:15pm and completed at 1:47pm.
These 8 VMs were created solely for this test and I was the only one logged
on.  After the writes were done, Cassandra did what it does behind the
scenes; there was no human intervention and no clients running or connecting
to it.  I didn't log in again to check on things until 11pm that night, which
is when I saw the uneven data distribution (122GB on one of the nodes).

I actually have the log files from all 8 nodes if it helps to diagnose what
activity was going on behind the scenes.  I really need to understand how this
happened.

Thanks again for your help,
Julie




Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Edward Capriolo <ed...@gmail.com>.
On Wed, Aug 18, 2010 at 10:51 AM, Jonathan Ellis <jb...@gmail.com> wrote:
> If you read the stack traces you pasted, the node in question ran out
> of diskspace.  When you have < 25% space free this is not surprising.
>
> But fundamentally you are missing something important from your story
> here.  Disk space doesn't just increase spontaneously with "absolutely
> no activity."
>
> On Wed, Aug 18, 2010 at 9:36 AM, Julie <ju...@nextcentury.com> wrote:
>>
>> Rob Coli <rcoli <at> digg.com> writes:
>>
>>
>>> As I understand Julie's case, she is :
>>> a) initializing her cluster
>>> b) inserting some number of unique keys with CL.ALL
>>> c) noticing that more disk space (6x?) than is expected is used
>>> d) but that she gets expected usage if she does a major compaction
>>> In other words, the problem isn't "temporary disk space occupied during
>>> the compact", it's permanent disk space occupied unless she compacts.
>>>
>>> Julie : when compaction occurs, it logs the number of bytes that it
>>> started with and the number it ended with, as well as the number of keys
>>> involved in the compaction. What do these messages say?
>>>
>>> example line :
>>> INFO [COMPACTION-POOL:1] 2010-08-06 13:48:00,328 CompactionManager.java
>>> (line 398) Compacted to /path/to/MyColumnFamily-26-Data.db.
>>> 999999999/888888888 bytes for 12345678 keys.  Time: 123456ms.
>>
>> Rob -
>> I reran the original test: 8 nodes in the cluster (160GB drives on each
>> node). Populated each node with 30GB of data using unique keys and CL.ALL
>> and repFactor=3.  (Wrote 10GB of data to each node but with the repFactor=3,
>> it results in about 30GB of data on each node.)
>>
>> One hour after the last write, the ring distribution looks excellent:
>> Address       Status     Load          Range
>>  Ring
>>                                       170141183460469231731687303715884105728
>> 10.210.198.64 Up         28.32 GB      21267647932558653966460912964485513216
>>  |<--|
>> 10.210.157.187Up         28.07 GB      42535295865117307932921825928971026432
>>  |   ^
>> 10.206.34.194 Up         28.12 GB      63802943797675961899382738893456539648
>>  v   |
>> 10.254.107.178Up         28.15 GB      85070591730234615865843651857942052864
>>  |   ^
>> 10.254.234.226Up         28.02 GB      106338239662793269832304564822427566080
>>  v   |
>> 10.254.242.159Up         27.96 GB      127605887595351923798765477786913079296
>>  |   ^
>> 10.214.18.198 Up         28.18 GB      148873535527910577765226390751398592512
>>  v   |
>> 10.214.26.118 Up         29.82 GB      170141183460469231731687303715884105728
>>  |-->|
>>
>> But 8 hours after the last write (absolutely no activity), things don't
>> look as good:
>>
>> Address       Status     Load          Range
>>  Ring
>>                                       170141183460469231731687303715884105728
>> 10.210.198.64 Up         30.28 GB      21267647932558653966460912964485513216
>>  |<--|
>> 10.210.157.187Up         28.12 GB      42535295865117307932921825928971026432
>>  |   ^
>> 10.206.34.194 Up         122.41 GB     63802943797675961899382738893456539648
>>  v   |
>> 10.254.107.178Up         33.89 GB      85070591730234615865843651857942052864
>>  |   ^
>> 10.254.234.226Up         28.01 GB      106338239662793269832304564822427566080
>>  v   |
>> 10.254.242.159Up         72.58 GB      127605887595351923798765477786913079296
>>  |   ^
>> 10.214.18.198 Up         83.41 GB      148873535527910577765226390751398592512
>>  v   |
>> 10.214.26.118 Up         62.01 GB      170141183460469231731687303715884105728
>>  |-->|
>>
>> The 122.41GB node is named ec2-server3.  Here's what cfstats reports:
>> On ec2-server3:
>>
>>          Write Latency: 0.21970486121446028 ms.
>>          Pending Tasks: 0
>>                   Column Family: Standard1
>>                   SSTable count: 9
>>                        Space used (live): 131438293207
>>                        Space used (total): 143577216419
>>                   Memtable Columns Count: 0
>>                   Memtable Data Size: 0
>>                   Memtable Switch Count: 454
>>                   Read Count: 0
>>                   Read Latency: NaN ms.
>>                   Write Count: 302373
>>                   Write Latency: 0.220 ms.
>>                   Pending Tasks: 0
>>                   Key cache capacity: 200000
>>                   Key cache size: 0
>>                   Key cache hit rate: NaN
>>                   Row cache: disabled
>>                   Compacted row minimum size: 100316
>>                   Compacted row maximum size: 100323
>>                   Compacted row mean size: 100322
>>
>> On ec2-server3, df reports:
>> /dev/sdb             153899044 140784388   5297032  97% /mnt
>>
>> So this node should (I would think) contain 30GB of data on a 160GB hard
>> drive but somehow it has grown to 122 GB of data (live space) plus several
>> compacted files that have not yet been deleted (total space).  Keep in
>> mind that no deletions or updates have taken place, just unique key writes.
>>
>> At this point, here are the compaction lines from the ec2-server3 system.log
>> file, taking place after all writes had completed:
>>
>> INFO [COMPACTION-POOL:1] 2010-08-16 14:11:58,614 CompactionManager.java (line
>> 326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-465-Data.db.
>> 8595448738/8595448738 bytes for 85678 keys.  Time: 984905ms.
>>  INFO [COMPACTION-POOL:1] 2010-08-16 14:20:02,825 CompactionManager.java (line
>> 326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-466-Data.db.
>> 2157339354/2157339354 bytes for 21504 keys.  Time: 484188ms.
>>  INFO [COMPACTION-POOL:1] 2010-08-16 14:28:37,066 CompactionManager.java (line
>> 326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-468-Data.db.
>> 2157339329/2157339329 bytes for 21504 keys.  Time: 514226ms.
>>  INFO [COMPACTION-POOL:1] 2010-08-16 14:36:04,806 CompactionManager.java (line
>> 326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-469-Data.db.
>> 2157339417/2157339417 bytes for 21504 keys.  Time: 447720ms.
>>  INFO [COMPACTION-POOL:1] 2010-08-16 14:40:31,234 CompactionManager.java (line
>> 326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-470-Data.db.
>> 2157339202/2157339202 bytes for 21504 keys.  Time: 266418ms.
>>  INFO [COMPACTION-POOL:1] 2010-08-16 14:58:48,943 CompactionManager.java (line
>> 326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-471-Data.db.
>> 8629357302/8629357302 bytes for 86016 keys.  Time: 1097659ms.
>>  INFO [COMPACTION-POOL:1] 2010-08-16 15:02:43,428 CompactionManager.java (line
>> 326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-473-Data.db.
>> 2157338687/2157338687 bytes for 21504 keys.  Time: 234464ms.
>>  INFO [COMPACTION-POOL:1] 2010-08-16 15:05:08,807 CompactionManager.java (line
>> 326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-474-Data.db.
>> 1345327705/1345327705 bytes for 13410 keys.  Time: 145363ms.
>>
>> Here’s what’s in the data directory on this node:
>>
>> -rw-r--r-- 1 root root  9518412981 Aug 16 13:36 Standard1-247-Data.db
>> -rw-r--r-- 1 root root  8595448738 Aug 16 14:11 Standard1-465-Data.db
>> -rw-r--r-- 1 root root 22968417153 Aug 16 14:34 Standard1-467-Data.db
>> -rw-r--r-- 1 root root  8629357302 Aug 16 14:58 Standard1-471-Data.db
>> -rw-r--r-- 1 root root  2157338687 Aug 16 15:02 Standard1-473-Data.db
>> -rw-r--r-- 1 root root  1345327705 Aug 16 15:05 Standard1-474-Data.db
>> -rw-r--r-- 1 root root 28995571589 Aug 16 15:38 Standard1-475-Data.db
>> -rw-r--r-- 1 root root 30063580839 Aug 16 17:26 Standard1-476-Data.db
>> -rw-r--r-- 1 root root 19091100731 Aug 16 17:24 Standard1-477-Data.db
>> Plus a bunch of compacted files.
>>
>> At this point, I performed a manual "nodetool cleanup" on the super large
>> node when *very bad things* happened.  Here's an excerpt from the large
>> node's log file upon issuing the cleanup:
>>
>> INFO [COMPACTION-POOL:1] 2010-08-16 23:04:04,230 CompactionManager.java (line
>> 345) AntiCompacting [org.apache.cassandra.io.SSTableReader(
>> path='/var/lib/cassandra/data/Keyspace1/Standard1-474-Data.db'),
>> org.apache.cassandra.io.SSTableReader(
>> path='/var/lib/cassandra/data/Keyspace1/Standard1-471-Data.db'),
>> org.apache.cassandra.io.SSTableReader(
>> path='/var/lib/cassandra/data/Keyspace1/Standard1-473-Data.db'),
>> org.apache.cassandra.io.SSTableReader(
>> path='/var/lib/cassandra/data/Keyspace1/Standard1-477-Data.db'),
>> org.apache.cassandra.io.SSTableReader(
>> path='/var/lib/cassandra/data/Keyspace1/Standard1-475-Data.db'),
>> org.apache.cassandra.io.SSTableReader(
>> path='/var/lib/cassandra/data/Keyspace1/Standard1-467-Data.db'),
>> org.apache.cassandra.io.SSTableReader(
>> path='/var/lib/cassandra/data/Keyspace1/Standard1-247-Data.db'),
>> org.apache.cassandra.io.SSTableReader(
>> path='/var/lib/cassandra/data/Keyspace1/Standard1-476-Data.db'),
>> org.apache.cassandra.io.SSTableReader(
>> path='/var/lib/cassandra/data/Keyspace1/Standard1-465-Data.db')]
>>
>>  INFO [COMPACTION-POOL:1] 2010-08-16 23:04:04,230 StorageService.java
>> (line 1499) requesting GC to free disk space
>>  INFO [GC inspection] 2010-08-16 23:04:04,470 GCInspector.java (line 110)
>> GC for ConcurrentMarkSweep: 233 ms, 299442152 reclaimed leaving 20577560
>> used; max is 1172766720
>>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:14,641
>> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
>> /data/Keyspace1/Standard1-419-Data.db
>>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:17,974
>> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
>> /data/Keyspace1/Standard1-466-Data.db
>>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,100
>> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
>> /data/Keyspace1/Standard1-428-Data.db
>>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,225
>> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
>> /data/Keyspace1/Standard1-415-Data.db
>>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,374
>> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
>> /data/Keyspace1/Standard1-441-Data.db
>>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,514
>> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
>> /data/Keyspace1/Standard1-461-Data.db
>>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,623
>> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
>> /data/Keyspace1/Standard1-442-Data.db
>>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,740
>> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
>> /data/Keyspace1/Standard1-425-Data.db
>>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,891
>> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
>> /data/Keyspace1/Standard1-447-Data.db
>> …
>> INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:25,331
>> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
>> /data/Keyspace1/Standard1-451-Data.db
>>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:25,423
>> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
>> /data/Keyspace1/Standard1-450-Data.db
>> ERROR [COMPACTION-POOL:1] 2010-08-16 23:04:25,525
>> DebuggableThreadPoolExecutor.java (line 94) Error in executor futuretask
>> java.util.concurrent.ExecutionException:
>> java.lang.UnsupportedOperationException: disk full
>>        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
>>        at java.util.concurrent.FutureTask.get(FutureTask.java:83)
>>        at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.after
>> Execute (DebuggableThreadPoolExecutor.java:86)
>>        at
>> org.apache.cassandra.db.CompactionManager$CompactionExecutor.after
>> Execute (CompactionManager.java:582)
>>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask
>> (ThreadPoolExecutor.java:888)
>>        at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>        at java.lang.Thread.run(Thread.java:619)
>> Caused by: java.lang.UnsupportedOperationException: disk full
>>        at
>> org.apache.cassandra.db.CompactionManager.doAntiCompaction
>> (CompactionManager.java:351)
>>        at
>> org.apache.cassandra.db.CompactionManager.doCleanupCompaction
>> (CompactionManager.java:417)
>>        at
>> org.apache.cassandra.db.CompactionManager.access$400
>> (CompactionManager.java:49)
>>        at
>> org.apache.cassandra.db.CompactionManager$2.call(CompactionManager.java:130)
>>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>        at
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask
>> (ThreadPoolExecutor.java:886)
>>        ... 2 more
>>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:25,558
>> SSTableDeletingReference.java (line 104)
>> Deleted /var/lib/cassandra/data/Keyspace1/Standard1-449-Data.db
>> …
>> INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:33,265
>> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
>> /data/Keyspace1/Standard1-453-Data.db
>>  INFO [COMPACTION-POOL:1] 2010-08-16 23:05:08,656 CompactionManager.java
>> (line 246) Compacting []
>>
>> The next morning (19 hours later):
>>
>> INFO [COMPACTION-POOL:1] 2010-08-16 23:05:08,656 CompactionManager.java
>> (line 246) Compacting []
>>  INFO [COMPACTION-POOL:1] 2010-08-17 00:05:08,637 CompactionManager.java
>> (line 246) Compacting []
>>  INFO [COMPACTION-POOL:1] 2010-08-17 01:05:08,607 CompactionManager.java
>> (line 246) Compacting []
>>  INFO [COMPACTION-POOL:1] 2010-08-17 02:05:08,581 CompactionManager.java
>> (line 246) Compacting []
>>  INFO [COMPACTION-POOL:1] 2010-08-17 03:05:08,568 CompactionManager.java
>> (line 246) Compacting []
>>  INFO [COMPACTION-POOL:1] 2010-08-17 04:05:08,532 CompactionManager.java
>> (line 246) Compacting []
>>  INFO [COMPACTION-POOL:1] 2010-08-17 05:05:08,505 CompactionManager.java
>> (line 246) Compacting []
>>  INFO [COMPACTION-POOL:1] 2010-08-17 06:05:08,494 CompactionManager.java
>> (line 246) Compacting []
>>
>> Also, 19 hours later, ring distribution is the same:
>>
>> Address       Status     Load          Range
>>  Ring
>>                                       170141183460469231731687303715884105728
>> 10.210.198.64 Up         30.28 GB      21267647932558653966460912964485513216
>>  |<--|
>> 10.210.157.187Up         28.12 GB      42535295865117307932921825928971026432
>>  |   ^
>> 10.206.34.194 Up         122.41 GB     63802943797675961899382738893456539648
>>  v   |
>> 10.254.107.178Up         33.89 GB      85070591730234615865843651857942052864
>>  |   ^
>> 10.254.234.226Up         28.01 GB      106338239662793269832304564822427566080
>>  v   |
>> 10.254.242.159Up         72.58 GB      127605887595351923798765477786913079296
>>  |   ^
>> 10.214.18.198 Up         83.41 GB      148873535527910577765226390751398592512
>>  v   |
>> 10.214.26.118 Up         62.01 GB      170141183460469231731687303715884105728
>>  |-->|
>>
>> So nodetool cleanup seems to have resulted in a fatal error on my overly (>4x)
>> bloated node.
>>
>> Can anyone help me with understanding why this happened, taking into account
>> the node should only contain 30GB of data on a 160GB hard drive?  I have the 8
>> system.log files from the 24 hour period (the nodes were only alive for 24
>> hours total).
>>
>> Thank you!
>> Julie
>>
>>
>>
>>
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>

I noticed something funny about repair as well. I upgraded to 0.6.3 and then
joined a node with bootstrap set to false.

So the node was empty. I ran REPAIR on the new node. Later I noticed that the
node had 2-4x the data it should have had.

Triggering REPAIR should cause anti-compaction on the closest nodes on the
ring. Those nodes should then stream ONLY the data that belongs on the new
node to the new node. Right? Judging by the size of the new node, I assumed
that all the contents of both nodes were streamed to this node (this could
be wrong, though). Sorry, I have no good test case to replicate this.

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Jonathan Ellis <jb...@gmail.com>.
If you read the stack traces you pasted, the node in question ran out
of diskspace.  When you have < 25% space free this is not surprising.

But fundamentally you are missing something important from your story
here.  Disk space doesn't just increase spontaneously with "absolutely
no activity."

On Wed, Aug 18, 2010 at 9:36 AM, Julie <ju...@nextcentury.com> wrote:
>
> Rob Coli <rcoli <at> digg.com> writes:
>
>
>> As I understand Julie's case, she is :
>> a) initializing her cluster
>> b) inserting some number of unique keys with CL.ALL
>> c) noticing that more disk space (6x?) than is expected is used
>> d) but that she gets expected usage if she does a major compaction
>> In other words, the problem isn't "temporary disk space occupied during
>> the compact", it's permanent disk space occupied unless she compacts.
>>
>> Julie : when compaction occurs, it logs the number of bytes that it
>> started with and the number it ended with, as well as the number of keys
>> involved in the compaction. What do these messages say?
>>
>> example line :
>> INFO [COMPACTION-POOL:1] 2010-08-06 13:48:00,328 CompactionManager.java
>> (line 398) Compacted to /path/to/MyColumnFamily-26-Data.db.
>> 999999999/888888888 bytes for 12345678 keys.  Time: 123456ms.
>
> Rob -
> I reran the original test: 8 nodes in the cluster (160GB drives on each
> node). Populated each node with 30GB of data using unique keys and CL.ALL
> and repFactor=3.  (Wrote 10GB of data to each node but with the repFactor=3,
> it results in about 30GB of data on each node.)
>
> One hour after the last write, the ring distribution looks excellent:
> Address       Status     Load          Range
>  Ring
>                                       170141183460469231731687303715884105728
> 10.210.198.64 Up         28.32 GB      21267647932558653966460912964485513216
>  |<--|
> 10.210.157.187Up         28.07 GB      42535295865117307932921825928971026432
>  |   ^
> 10.206.34.194 Up         28.12 GB      63802943797675961899382738893456539648
>  v   |
> 10.254.107.178Up         28.15 GB      85070591730234615865843651857942052864
>  |   ^
> 10.254.234.226Up         28.02 GB      106338239662793269832304564822427566080
>  v   |
> 10.254.242.159Up         27.96 GB      127605887595351923798765477786913079296
>  |   ^
> 10.214.18.198 Up         28.18 GB      148873535527910577765226390751398592512
>  v   |
> 10.214.26.118 Up         29.82 GB      170141183460469231731687303715884105728
>  |-->|
>
> But 8 hours after the last write (absolutely no activity), things don't
> look as good:
>
> Address       Status     Load          Range
>  Ring
>                                       170141183460469231731687303715884105728
> 10.210.198.64 Up         30.28 GB      21267647932558653966460912964485513216
>  |<--|
> 10.210.157.187Up         28.12 GB      42535295865117307932921825928971026432
>  |   ^
> 10.206.34.194 Up         122.41 GB     63802943797675961899382738893456539648
>  v   |
> 10.254.107.178Up         33.89 GB      85070591730234615865843651857942052864
>  |   ^
> 10.254.234.226Up         28.01 GB      106338239662793269832304564822427566080
>  v   |
> 10.254.242.159Up         72.58 GB      127605887595351923798765477786913079296
>  |   ^
> 10.214.18.198 Up         83.41 GB      148873535527910577765226390751398592512
>  v   |
> 10.214.26.118 Up         62.01 GB      170141183460469231731687303715884105728
>  |-->|
>
> The 122.41GB node is named ec2-server3.  Here's what cfstats reports:
> On ec2-server3:
>
>          Write Latency: 0.21970486121446028 ms.
>          Pending Tasks: 0
>                   Column Family: Standard1
>                   SSTable count: 9
>                        Space used (live): 131438293207
>                        Space used (total): 143577216419
>                   Memtable Columns Count: 0
>                   Memtable Data Size: 0
>                   Memtable Switch Count: 454
>                   Read Count: 0
>                   Read Latency: NaN ms.
>                   Write Count: 302373
>                   Write Latency: 0.220 ms.
>                   Pending Tasks: 0
>                   Key cache capacity: 200000
>                   Key cache size: 0
>                   Key cache hit rate: NaN
>                   Row cache: disabled
>                   Compacted row minimum size: 100316
>                   Compacted row maximum size: 100323
>                   Compacted row mean size: 100322
>
> On ec2-server3, df reports:
> /dev/sdb             153899044 140784388   5297032  97% /mnt
>
> So this node should (I would think) contain 30GB of data on a 160GB hard
> drive but somehow it has grown to 122 GB of data (live space) plus several
> compacted files that have not yet been deleted (total space).  Keep in
> mind that no deletions or updates have taken place, just unique key writes.
>
> At this point, here are the compaction lines from the ec2-server3 system.log
> file, taking place after all writes had completed:
>
> INFO [COMPACTION-POOL:1] 2010-08-16 14:11:58,614 CompactionManager.java (line
> 326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-465-Data.db.
> 8595448738/8595448738 bytes for 85678 keys.  Time: 984905ms.
>  INFO [COMPACTION-POOL:1] 2010-08-16 14:20:02,825 CompactionManager.java (line
> 326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-466-Data.db.
> 2157339354/2157339354 bytes for 21504 keys.  Time: 484188ms.
>  INFO [COMPACTION-POOL:1] 2010-08-16 14:28:37,066 CompactionManager.java (line
> 326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-468-Data.db.
> 2157339329/2157339329 bytes for 21504 keys.  Time: 514226ms.
>  INFO [COMPACTION-POOL:1] 2010-08-16 14:36:04,806 CompactionManager.java (line
> 326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-469-Data.db.
> 2157339417/2157339417 bytes for 21504 keys.  Time: 447720ms.
>  INFO [COMPACTION-POOL:1] 2010-08-16 14:40:31,234 CompactionManager.java (line
> 326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-470-Data.db.
> 2157339202/2157339202 bytes for 21504 keys.  Time: 266418ms.
>  INFO [COMPACTION-POOL:1] 2010-08-16 14:58:48,943 CompactionManager.java (line
> 326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-471-Data.db.
> 8629357302/8629357302 bytes for 86016 keys.  Time: 1097659ms.
>  INFO [COMPACTION-POOL:1] 2010-08-16 15:02:43,428 CompactionManager.java (line
> 326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-473-Data.db.
> 2157338687/2157338687 bytes for 21504 keys.  Time: 234464ms.
>  INFO [COMPACTION-POOL:1] 2010-08-16 15:05:08,807 CompactionManager.java (line
> 326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-474-Data.db.
> 1345327705/1345327705 bytes for 13410 keys.  Time: 145363ms.
>
> Here’s what’s in the data directory on this node:
>
> -rw-r--r-- 1 root root  9518412981 Aug 16 13:36 Standard1-247-Data.db
> -rw-r--r-- 1 root root  8595448738 Aug 16 14:11 Standard1-465-Data.db
> -rw-r--r-- 1 root root 22968417153 Aug 16 14:34 Standard1-467-Data.db
> -rw-r--r-- 1 root root  8629357302 Aug 16 14:58 Standard1-471-Data.db
> -rw-r--r-- 1 root root  2157338687 Aug 16 15:02 Standard1-473-Data.db
> -rw-r--r-- 1 root root  1345327705 Aug 16 15:05 Standard1-474-Data.db
> -rw-r--r-- 1 root root 28995571589 Aug 16 15:38 Standard1-475-Data.db
> -rw-r--r-- 1 root root 30063580839 Aug 16 17:26 Standard1-476-Data.db
> -rw-r--r-- 1 root root 19091100731 Aug 16 17:24 Standard1-477-Data.db
> Plus a bunch of compacted files.
>
> At this point, I performed a manual "nodetool cleanup" on the super large
> node when *very bad things* happened.  Here's an excerpt from the large
> node's log file upon issuing the cleanup:
>
> INFO [COMPACTION-POOL:1] 2010-08-16 23:04:04,230 CompactionManager.java (line
> 345) AntiCompacting [org.apache.cassandra.io.SSTableReader(
> path='/var/lib/cassandra/data/Keyspace1/Standard1-474-Data.db'),
> org.apache.cassandra.io.SSTableReader(
> path='/var/lib/cassandra/data/Keyspace1/Standard1-471-Data.db'),
> org.apache.cassandra.io.SSTableReader(
> path='/var/lib/cassandra/data/Keyspace1/Standard1-473-Data.db'),
> org.apache.cassandra.io.SSTableReader(
> path='/var/lib/cassandra/data/Keyspace1/Standard1-477-Data.db'),
> org.apache.cassandra.io.SSTableReader(
> path='/var/lib/cassandra/data/Keyspace1/Standard1-475-Data.db'),
> org.apache.cassandra.io.SSTableReader(
> path='/var/lib/cassandra/data/Keyspace1/Standard1-467-Data.db'),
> org.apache.cassandra.io.SSTableReader(
> path='/var/lib/cassandra/data/Keyspace1/Standard1-247-Data.db'),
> org.apache.cassandra.io.SSTableReader(
> path='/var/lib/cassandra/data/Keyspace1/Standard1-476-Data.db'),
> org.apache.cassandra.io.SSTableReader(
> path='/var/lib/cassandra/data/Keyspace1/Standard1-465-Data.db')]
>
>  INFO [COMPACTION-POOL:1] 2010-08-16 23:04:04,230 StorageService.java
> (line 1499) requesting GC to free disk space
>  INFO [GC inspection] 2010-08-16 23:04:04,470 GCInspector.java (line 110)
> GC for ConcurrentMarkSweep: 233 ms, 299442152 reclaimed leaving 20577560
> used; max is 1172766720
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:14,641
> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
> /data/Keyspace1/Standard1-419-Data.db
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:17,974
> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
> /data/Keyspace1/Standard1-466-Data.db
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,100
> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
> /data/Keyspace1/Standard1-428-Data.db
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,225
> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
> /data/Keyspace1/Standard1-415-Data.db
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,374
> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
> /data/Keyspace1/Standard1-441-Data.db
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,514
> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
> /data/Keyspace1/Standard1-461-Data.db
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,623
> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
> /data/Keyspace1/Standard1-442-Data.db
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,740
> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
> /data/Keyspace1/Standard1-425-Data.db
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,891
> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
> /data/Keyspace1/Standard1-447-Data.db
> …
> INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:25,331
> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
> /data/Keyspace1/Standard1-451-Data.db
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:25,423
> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
> /data/Keyspace1/Standard1-450-Data.db
> ERROR [COMPACTION-POOL:1] 2010-08-16 23:04:25,525
> DebuggableThreadPoolExecutor.java (line 94) Error in executor futuretask
> java.util.concurrent.ExecutionException:
> java.lang.UnsupportedOperationException: disk full
>        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
>        at java.util.concurrent.FutureTask.get(FutureTask.java:83)
>        at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.after
> Execute (DebuggableThreadPoolExecutor.java:86)
>        at
> org.apache.cassandra.db.CompactionManager$CompactionExecutor.after
> Execute (CompactionManager.java:582)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask
> (ThreadPoolExecutor.java:888)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.UnsupportedOperationException: disk full
>        at
> org.apache.cassandra.db.CompactionManager.doAntiCompaction
> (CompactionManager.java:351)
>        at
> org.apache.cassandra.db.CompactionManager.doCleanupCompaction
> (CompactionManager.java:417)
>        at
> org.apache.cassandra.db.CompactionManager.access$400
> (CompactionManager.java:49)
>        at
> org.apache.cassandra.db.CompactionManager$2.call(CompactionManager.java:130)
>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask
> (ThreadPoolExecutor.java:886)
>        ... 2 more
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:25,558
> SSTableDeletingReference.java (line 104)
> Deleted /var/lib/cassandra/data/Keyspace1/Standard1-449-Data.db
> …
> INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:33,265
> SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
> /data/Keyspace1/Standard1-453-Data.db
>  INFO [COMPACTION-POOL:1] 2010-08-16 23:05:08,656 CompactionManager.java
> (line 246) Compacting []
>
> The next morning (19 hours later):
>
> INFO [COMPACTION-POOL:1] 2010-08-16 23:05:08,656 CompactionManager.java
> (line 246) Compacting []
>  INFO [COMPACTION-POOL:1] 2010-08-17 00:05:08,637 CompactionManager.java
> (line 246) Compacting []
>  INFO [COMPACTION-POOL:1] 2010-08-17 01:05:08,607 CompactionManager.java
> (line 246) Compacting []
>  INFO [COMPACTION-POOL:1] 2010-08-17 02:05:08,581 CompactionManager.java
> (line 246) Compacting []
>  INFO [COMPACTION-POOL:1] 2010-08-17 03:05:08,568 CompactionManager.java
> (line 246) Compacting []
>  INFO [COMPACTION-POOL:1] 2010-08-17 04:05:08,532 CompactionManager.java
> (line 246) Compacting []
>  INFO [COMPACTION-POOL:1] 2010-08-17 05:05:08,505 CompactionManager.java
> (line 246) Compacting []
>  INFO [COMPACTION-POOL:1] 2010-08-17 06:05:08,494 CompactionManager.java
> (line 246) Compacting []
>
> Also, 19 hours later, ring distribution is the same:
>
> Address       Status     Load          Range
>  Ring
>                                       170141183460469231731687303715884105728
> 10.210.198.64 Up         30.28 GB      21267647932558653966460912964485513216
>  |<--|
> 10.210.157.187Up         28.12 GB      42535295865117307932921825928971026432
>  |   ^
> 10.206.34.194 Up         122.41 GB     63802943797675961899382738893456539648
>  v   |
> 10.254.107.178Up         33.89 GB      85070591730234615865843651857942052864
>  |   ^
> 10.254.234.226Up         28.01 GB      106338239662793269832304564822427566080
>  v   |
> 10.254.242.159Up         72.58 GB      127605887595351923798765477786913079296
>  |   ^
> 10.214.18.198 Up         83.41 GB      148873535527910577765226390751398592512
>  v   |
> 10.214.26.118 Up         62.01 GB      170141183460469231731687303715884105728
>  |-->|
>
> So nodetool cleanup seems to have resulted in a fatal error on my overly (>4x)
> bloated node.
>
> Can anyone help me understand why this happened, given that the node should
> only contain 30GB of data on a 160GB hard drive?  I have the 8 system.log
> files from the 24-hour period (the nodes were only alive for 24 hours total).
>
> Thank you!
> Julie
>
>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Julie <ju...@nextcentury.com>.
Rob Coli <rcoli <at> digg.com> writes:


> As I understand Julie's case, she is :
> a) initializing her cluster
> b) inserting some number of unique keys with CL.ALL
> c) noticing that more disk space (6x?) than is expected is used
> d) but that she gets expected usage if she does a major compaction
> In other words, the problem isn't "temporary disk space occupied during
> the compact", it's permanent disk space occupied unless she compacts.
> 
> Julie : when compaction occurs, it logs the number of bytes that it 
> started with and the number it ended with, as well as the number of keys 
> involved in the compaction. What do these messages say?
> 
> example line :
> INFO [COMPACTION-POOL:1] 2010-08-06 13:48:00,328 CompactionManager.java 
> (line 398) Compacted to /path/to/MyColumnFamily-26-Data.db. 
> 999999999/888888888 bytes for 12345678 keys.  Time: 123456ms.
 
Rob -
I reran the original test: 8 nodes in the cluster, each with a 160GB drive. 
I populated the cluster with unique keys using CL.ALL and repFactor=3, 
writing 10GB of data to each node; with the replication factor of 3, that 
works out to about 30GB of data on each node.
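
As a quick sanity check, this is the arithmetic behind the "about 30GB per 
node" expectation (just the test parameters above, nothing measured; a 
throwaway sketch, not code from the actual test client):

public class ExpectedLoad {
    public static void main(String[] args) {
        int nodes = 8;
        double rawPerNodeGb = 10.0;     // unique data written to each node
        int replicationFactor = 3;      // repFactor=3
        double totalRawGb = nodes * rawPerNodeGb;               // 80 GB of unique data cluster-wide
        double totalStoredGb = totalRawGb * replicationFactor;  // 240 GB once replicated
        double expectedPerNodeGb = totalStoredGb / nodes;       // ~30 GB per node
        System.out.printf("expected load per node: ~%.0f GB%n", expectedPerNodeGb);
    }
}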

One hour after the last write, the ring distribution looks excellent:
Address       Status     Load          Range                                   
  Ring
                                       170141183460469231731687303715884105728    
10.210.198.64 Up         28.32 GB      21267647932558653966460912964485513216  
  |<--|
10.210.157.187Up         28.07 GB      42535295865117307932921825928971026432  
  |   ^
10.206.34.194 Up         28.12 GB      63802943797675961899382738893456539648  
  v   |
10.254.107.178Up         28.15 GB      85070591730234615865843651857942052864  
  |   ^
10.254.234.226Up         28.02 GB      106338239662793269832304564822427566080 
  v   |
10.254.242.159Up         27.96 GB      127605887595351923798765477786913079296 
  |   ^
10.214.18.198 Up         28.18 GB      148873535527910577765226390751398592512 
  v   |
10.214.26.118 Up         29.82 GB      170141183460469231731687303715884105728 
  |-->|

But 8 hours after the last write (absolutely no activity), things don't 
look as good:

Address       Status     Load          Range                                   
  Ring
                                       170141183460469231731687303715884105728    
10.210.198.64 Up         30.28 GB      21267647932558653966460912964485513216  
  |<--|
10.210.157.187Up         28.12 GB      42535295865117307932921825928971026432  
  |   ^
10.206.34.194 Up         122.41 GB     63802943797675961899382738893456539648  
  v   |
10.254.107.178Up         33.89 GB      85070591730234615865843651857942052864  
  |   ^
10.254.234.226Up         28.01 GB      106338239662793269832304564822427566080 
  v   |
10.254.242.159Up         72.58 GB      127605887595351923798765477786913079296 
  |   ^
10.214.18.198 Up         83.41 GB      148873535527910577765226390751398592512 
  v   |
10.214.26.118 Up         62.01 GB      170141183460469231731687303715884105728 
  |-->|

The 122.41GB node is named ec2-server3.  Here's what cfstats reports:
On ec2-server3:

          Write Latency: 0.21970486121446028 ms.
          Pending Tasks: 0
                   Column Family: Standard1
                   SSTable count: 9
                        Space used (live): 131438293207
                        Space used (total): 143577216419
                   Memtable Columns Count: 0
                   Memtable Data Size: 0
                   Memtable Switch Count: 454
                   Read Count: 0
                   Read Latency: NaN ms.
                   Write Count: 302373
                   Write Latency: 0.220 ms.
                   Pending Tasks: 0
                   Key cache capacity: 200000
                   Key cache size: 0
                   Key cache hit rate: NaN
                   Row cache: disabled
                   Compacted row minimum size: 100316
                   Compacted row maximum size: 100323
                   Compacted row mean size: 100322

On ec2-server3, df reports:
/dev/sdb             153899044 140784388   5297032  97% /mnt

So this node should (I would think) contain 30GB of data on a 160GB hard 
drive but somehow it has grown to 122 GB of data (live space) plus several 
compacted files that have not yet been deleted (total space).  Keep in 
mind that no deletions or updates have taken place, just unique key writes.
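
For reference, converting the cfstats byte counts above into GB (the values 
are copied straight from the output; the gap between "total" and "live" 
should be the compacted files still awaiting deletion). This is just a 
throwaway sketch for the conversion:

public class SpaceUsed {
    public static void main(String[] args) {
        long liveBytes  = 131438293207L;   // "Space used (live)" from cfstats
        long totalBytes = 143577216419L;   // "Space used (total)" from cfstats
        double gb = 1024.0 * 1024 * 1024;
        System.out.printf("live:              %.2f GB%n", liveBytes / gb);   // ~122.4 GB
        System.out.printf("total:             %.2f GB%n", totalBytes / gb);  // ~133.7 GB
        System.out.printf("awaiting deletion: %.2f GB%n", (totalBytes - liveBytes) / gb);
    }
}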

At this point, here are the compaction lines from the ec2-server3 system.log
file, taking place after all writes had completed:

INFO [COMPACTION-POOL:1] 2010-08-16 14:11:58,614 CompactionManager.java (line 
326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-465-Data.db. 
8595448738/8595448738 bytes for 85678 keys.  Time: 984905ms.
 INFO [COMPACTION-POOL:1] 2010-08-16 14:20:02,825 CompactionManager.java (line 
326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-466-Data.db. 
2157339354/2157339354 bytes for 21504 keys.  Time: 484188ms.
 INFO [COMPACTION-POOL:1] 2010-08-16 14:28:37,066 CompactionManager.java (line 
326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-468-Data.db.  
2157339329/2157339329 bytes for 21504 keys.  Time: 514226ms.
 INFO [COMPACTION-POOL:1] 2010-08-16 14:36:04,806 CompactionManager.java (line 
326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-469-Data.db. 
2157339417/2157339417 bytes for 21504 keys.  Time: 447720ms.
 INFO [COMPACTION-POOL:1] 2010-08-16 14:40:31,234 CompactionManager.java (line 
326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-470-Data.db.  
2157339202/2157339202 bytes for 21504 keys.  Time: 266418ms.
 INFO [COMPACTION-POOL:1] 2010-08-16 14:58:48,943 CompactionManager.java (line 
326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-471-Data.db.  
8629357302/8629357302 bytes for 86016 keys.  Time: 1097659ms.
 INFO [COMPACTION-POOL:1] 2010-08-16 15:02:43,428 CompactionManager.java (line
326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-473-Data.db.  
2157338687/2157338687 bytes for 21504 keys.  Time: 234464ms.
 INFO [COMPACTION-POOL:1] 2010-08-16 15:05:08,807 CompactionManager.java (line 
326) Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-474-Data.db. 
1345327705/1345327705 bytes for 13410 keys.  Time: 145363ms.
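
Note that every one of those lines reports the same byte count before and 
after the slash, so these compactions are not reclaiming any space at all. 
To make the before/after ratio easy to eyeball, something like the following 
throwaway parser works (not part of Cassandra; it just assumes the message 
format shown above, with the first number being the starting size and the 
second the ending size, per Rob's description):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CompactionLogCheck {
    // Matches: "Compacted to <path>. <before>/<after> bytes for <keys> keys."
    private static final Pattern LINE = Pattern.compile(
            "Compacted to (\\S+)\\.\\s+(\\d+)/(\\d+) bytes for (\\d+) keys");

    public static void main(String[] args) {
        String line = "Compacted to /var/lib/cassandra/data/Keyspace1/Standard1-465-Data.db. "
                + "8595448738/8595448738 bytes for 85678 keys.  Time: 984905ms.";
        Matcher m = LINE.matcher(line);
        if (m.find()) {
            long before = Long.parseLong(m.group(2));
            long after  = Long.parseLong(m.group(3));
            System.out.printf("%s: %d -> %d bytes (ratio %.3f) for %s keys%n",
                    m.group(1), before, after, (double) after / before, m.group(4));
        }
    }
}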

Here’s what’s in the data directory on this node:

-rw-r--r-- 1 root root  9518412981 Aug 16 13:36 Standard1-247-Data.db
-rw-r--r-- 1 root root  8595448738 Aug 16 14:11 Standard1-465-Data.db
-rw-r--r-- 1 root root 22968417153 Aug 16 14:34 Standard1-467-Data.db
-rw-r--r-- 1 root root  8629357302 Aug 16 14:58 Standard1-471-Data.db
-rw-r--r-- 1 root root  2157338687 Aug 16 15:02 Standard1-473-Data.db
-rw-r--r-- 1 root root  1345327705 Aug 16 15:05 Standard1-474-Data.db
-rw-r--r-- 1 root root 28995571589 Aug 16 15:38 Standard1-475-Data.db
-rw-r--r-- 1 root root 30063580839 Aug 16 17:26 Standard1-476-Data.db
-rw-r--r-- 1 root root 19091100731 Aug 16 17:24 Standard1-477-Data.db
Plus a bunch of compacted files. 

At this point, I performed a manual "nodetool cleanup" on the oversized node, 
and that is when *very bad things* happened.  Here's an excerpt from that 
node's log file after issuing the cleanup:

INFO [COMPACTION-POOL:1] 2010-08-16 23:04:04,230 CompactionManager.java (line 
345) AntiCompacting [org.apache.cassandra.io.SSTableReader(
path='/var/lib/cassandra/data/Keyspace1/Standard1-474-Data.db'),
org.apache.cassandra.io.SSTableReader(
path='/var/lib/cassandra/data/Keyspace1/Standard1-471-Data.db'),
org.apache.cassandra.io.SSTableReader(
path='/var/lib/cassandra/data/Keyspace1/Standard1-473-Data.db'),
org.apache.cassandra.io.SSTableReader(
path='/var/lib/cassandra/data/Keyspace1/Standard1-477-Data.db'),
org.apache.cassandra.io.SSTableReader(
path='/var/lib/cassandra/data/Keyspace1/Standard1-475-Data.db'),
org.apache.cassandra.io.SSTableReader(
path='/var/lib/cassandra/data/Keyspace1/Standard1-467-Data.db'),
org.apache.cassandra.io.SSTableReader(
path='/var/lib/cassandra/data/Keyspace1/Standard1-247-Data.db'),
org.apache.cassandra.io.SSTableReader(
path='/var/lib/cassandra/data/Keyspace1/Standard1-476-Data.db'),
org.apache.cassandra.io.SSTableReader(
path='/var/lib/cassandra/data/Keyspace1/Standard1-465-Data.db')]

 INFO [COMPACTION-POOL:1] 2010-08-16 23:04:04,230 StorageService.java 
(line 1499) requesting GC to free disk space
 INFO [GC inspection] 2010-08-16 23:04:04,470 GCInspector.java (line 110) 
GC for ConcurrentMarkSweep: 233 ms, 299442152 reclaimed leaving 20577560 
used; max is 1172766720
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:14,641 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-419-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:17,974 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-466-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,100 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-428-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,225 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-415-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,374 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-441-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,514 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-461-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,623 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-442-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,740 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-425-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:18,891 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-447-Data.db
…
INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:25,331 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-451-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:25,423 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-450-Data.db
ERROR [COMPACTION-POOL:1] 2010-08-16 23:04:25,525 
DebuggableThreadPoolExecutor.java (line 94) Error in executor futuretask
java.util.concurrent.ExecutionException: 
java.lang.UnsupportedOperationException: disk full
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
        at java.util.concurrent.FutureTask.get(FutureTask.java:83)
        at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.after
Execute (DebuggableThreadPoolExecutor.java:86)
        at
org.apache.cassandra.db.CompactionManager$CompactionExecutor.after
Execute (CompactionManager.java:582)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask
(ThreadPoolExecutor.java:888)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.UnsupportedOperationException: disk full
        at 
org.apache.cassandra.db.CompactionManager.doAntiCompaction
(CompactionManager.java:351)
        at 
org.apache.cassandra.db.CompactionManager.doCleanupCompaction
(CompactionManager.java:417)
        at 
org.apache.cassandra.db.CompactionManager.access$400
(CompactionManager.java:49)
        at 
org.apache.cassandra.db.CompactionManager$2.call(CompactionManager.java:130)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask
(ThreadPoolExecutor.java:886)
        ... 2 more
 INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:25,558 
SSTableDeletingReference.java (line 104) 
Deleted /var/lib/cassandra/data/Keyspace1/Standard1-449-Data.db
…
INFO [SSTABLE-CLEANUP-TIMER] 2010-08-16 23:04:33,265 
SSTableDeletingReference.java (line 104) Deleted /var/lib/cassandra
/data/Keyspace1/Standard1-453-Data.db
 INFO [COMPACTION-POOL:1] 2010-08-16 23:05:08,656 CompactionManager.java 
(line 246) Compacting []

The next morning (19 hours later):

INFO [COMPACTION-POOL:1] 2010-08-16 23:05:08,656 CompactionManager.java 
(line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-17 00:05:08,637 CompactionManager.java 
(line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-17 01:05:08,607 CompactionManager.java 
(line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-17 02:05:08,581 CompactionManager.java 
(line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-17 03:05:08,568 CompactionManager.java 
(line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-17 04:05:08,532 CompactionManager.java 
(line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-17 05:05:08,505 CompactionManager.java 
(line 246) Compacting []
 INFO [COMPACTION-POOL:1] 2010-08-17 06:05:08,494 CompactionManager.java 
(line 246) Compacting []

Also, 19 hours later, ring distribution is the same:

Address       Status     Load          Range                                   
  Ring
                                       170141183460469231731687303715884105728    
10.210.198.64 Up         30.28 GB      21267647932558653966460912964485513216  
  |<--|
10.210.157.187Up         28.12 GB      42535295865117307932921825928971026432  
  |   ^
10.206.34.194 Up         122.41 GB     63802943797675961899382738893456539648  
  v   |
10.254.107.178Up         33.89 GB      85070591730234615865843651857942052864  
  |   ^
10.254.234.226Up         28.01 GB      106338239662793269832304564822427566080 
  v   |
10.254.242.159Up         72.58 GB      127605887595351923798765477786913079296 
  |   ^
10.214.18.198 Up         83.41 GB      148873535527910577765226390751398592512 
  v   |
10.214.26.118 Up         62.01 GB      170141183460469231731687303715884105728 
  |-->|

So nodetool cleanup seems to have resulted in a fatal error on my overly (>4x)
bloated node.

Can anyone help me understand why this happened, given that the node should 
only contain 30GB of data on a 160GB hard drive?  I have the 8 system.log 
files from the 24-hour period (the nodes were only alive for 24 hours total).

Thank you!
Julie




Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Julie <ju...@nextcentury.com>.
Rob Coli <rcoli <at> digg.com> writes:

> As I understand Julie's case, she is :
> 
> a) initializing her cluster
> b) inserting some number of unique keys with CL.ALL
> c) noticing that more disk space (6x?) than is expected is used
> d) but that she gets expected usage if she does a major compaction

Yes, this is the scenario in my initial post.

> There is clearly overhead from there being multiple SSTables with 
> multiple bloom filters and multiple indexes. But from my understanding, 
> that does not fully account for the difference in disk usage she is 
> seeing. If it is 6x across the whole cluster, it seems unlikely that the 
> meta information is 5x the size of the actual information.
> 
> Julie : when compaction occurs, it logs the number of bytes that it 
> started with and the number it ended with, as well as the number of keys 
> involved in the compaction. What do these messages say?
> 
> example line :
> 
> INFO [COMPACTION-POOL:1] 2010-08-06 13:48:00,328 CompactionManager.java 
> (line 398) Compacted to /path/to/MyColumnFamily-26-Data.db. 
> 999999999/888888888 bytes for 12345678 keys.  Time: 123456ms.
> 
> =Rob

I will need to retry this scenario today.  My subsequent posts were from
writing unique keys and then updating them, which is a totally different test
and *should* result in excess SSTable sizes.  I will retry writing unique keys,
wait for the nodes to settle, and get back to you.  THANK YOU!!!




Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Rob Coli <rc...@digg.com>.
On 8/5/10 11:51 AM, Peter Schuller wrote:
> Also, the variation in disk space in your most recent post looks
> entirely as expected to me and nothing really extreme. The temporary
> disk space occupied during the compact/cleanup would easily be as high
> as your original disk space usage to begin with, and the fact that
> you're reaching the 5-7 GB per node level after a cleanup has
> completed fully and all obsolete sstables have been removed

Your post refers to "obsolete" sstables, but the only thing that makes 
them "obsolete" in this case is that they have been compacted?

As I understand Julie's case, she is :

a) initializing her cluster
b) inserting some number of unique keys with CL.ALL
c) noticing that more disk space (6x?) than is expected is used
d) but that she gets expected usage if she does a major compaction

In other words, the problem isn't "temporary disk space occupied during 
the compact", it's permanent disk space occupied unless she compacts.

There is clearly overhead from there being multiple SSTables with 
multiple bloom filters and multiple indexes. But from my understanding, 
that does not fully account for the difference in disk usage she is 
seeing. If it is 6x across the whole cluster, it seems unlikely that the 
meta information is 5x the size of the actual information.

I haven't been following this thread very closely, but I don't think 
"obsolete" SSTables should be relevant, because she's not doing UPDATE 
or DELETE and she hasn't changed the cluster topology (the "cleanup" case).

Julie : when compaction occurs, it logs the number of bytes that it 
started with and the number it ended with, as well as the number of keys 
involved in the compaction. What do these messages say?

example line :

INFO [COMPACTION-POOL:1] 2010-08-06 13:48:00,328 CompactionManager.java 
(line 398) Compacted to /path/to/MyColumnFamily-26-Data.db. 
999999999/888888888 bytes for 12345678 keys.  Time: 123456ms.

=Rob

Re: Cassandra disk space utilization WAY higher than I would expect

Posted by Peter Schuller <pe...@infidyne.com>.
Oh and,

> Nodetool cleanup works so beautifully, that I am wondering if there is any harm
> in using "nodetool cleanup" in a cron job on a live system that is actively
> processing reads and writes to the database?

since a cleanup/compact is supposed to trigger a full compaction, it
generates a lot of I/O that is normally deferred. Doing it nightly won't
scale in the general case (suppose you have terabytes of data...). But
assuming compaction does not have too detrimental an effect on your
application, and your data size is small enough that a compaction finishes
within a reasonable amount of time, I don't see a problem with it.

That said, since cleanup by design permanently removes data that does not
belong to the node, it might be dangerous in the event that there
is somehow confusion over what data nodes are responsible for. A
regular compaction should not, as far as I know, ever remove data not
belonging to the node. So I can certainly see a danger there;
personally I'd probably want to avoid automating 'cleanup' for that
reason.

But if everything works reasonably, it still seems to me that you
should not be seeing extreme waste of disk space. If you truly need to
compact nightly to save space, you might be running too close to your
maximum disk capacity anyway. That said, if you're still seeing extreme
amounts of "extra" disk space, I don't think there is yet an
explanation for that in this thread.

Also, the variation in disk space in your most recent post looks
entirely as expected to me and nothing really extreme. The temporary
disk space occupied during the compact/cleanup would easily be as high
as your original disk space usage to begin with, and the fact that
you're reaching the 5-7 GB per node level after a cleanup has
completed fully and all obsolete sstables have been removed, does not
necessarily help you, since each future cleanup/compaction will
typically double your disk space usage anyway (even if temporarily).
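
To put rough numbers on that point (just a sketch; the 30GB and 160GB figures 
are your numbers from earlier in the thread, not anything measured here):

public class CompactionHeadroom {
    public static void main(String[] args) {
        double steadyStateGb = 30.0;   // roughly what one node holds after a full compaction
        double diskGb = 160.0;         // the drive size on each node
        // During a major compaction/cleanup the old SSTables and the newly
        // written one coexist until the obsolete files are actually deleted,
        // so plan for roughly double the steady-state footprint at the peak.
        double peakGb = steadyStateGb * 2;
        System.out.printf("peak during compaction: ~%.0f GB of a %.0f GB disk%n",
                peakGb, diskGb);
    }
}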

-- 
/ Peter Schuller