You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by Dan Larsen <da...@techba.se> on 2009/10/09 13:52:50 UTC

Irresponsive nodes

Hi again :-)

O.k... New problem...
I have an Amazon EC2 node with 4 "CPUs" and 7.5 GB of RAM.
Running CommitLog on 1 disk and data on another.
Cassandra 0.4.0 - (yes I have checked... correct version :-P)
6GB set in the cassandra.in.sh.

I started throwing data at it, without problems.
All of a sudden, the node becomes irresponsive.

I only have 6.6GB of data in the DBs.

I experienced the same thing, while running much smaller nodes.

I tried restarting cassandra (kill [pid]).

When it starts up, it goes crazy for a while, trying to fill up the  
RAM or something.
Then it stops filling RAM, but keeps a load of ~100% CPU.
It doesn't respond to anything, but a nodeprobe info, which responds,  
but VERY slowly.


The log doesn't give me anything - not that I can understand anyways...

[.....]
  INFO [main] 2009-10-09 11:23:37,320 CassandraDaemon.java (line 142)  
Cassandra starting up...
  INFO [PERIODIC-FLUSHER-POOL:1] 2009-10-09 11:24:40,239  
ColumnFamilyStore.java (line 369) LocationInfo has reached its  
threshold; switching in a fresh Memtable
  INFO [PERIODIC-FLUSHER-POOL:1] 2009-10-09 11:24:40,239  
ColumnFamilyStore.java (line 1178) Enqueuing flush of Memtable 
(LocationInfo)@2116316013
  INFO [MEMTABLE-FLUSHER-POOL:1] 2009-10-09 11:24:41,039 Memtable.java  
(line 186) Flushing Memtable(LocationInfo)@2116316013
DEBUG [COMMIT-LOG-WRITER] 2009-10-09 11:24:45,191 CommitLog.java (line  
466) discard completed log segments for CommitLogContext(file='/var/ 
lib/cassandra/commitlog/CommitLog-1255087417263.log', position=257),  
column family 0. CFIDs are system: TableMetadata(LocationInfo: 0,  
HintsColumnFamily: 1, }), Fetcher: TableMetadata(PageSentences: 2,  
Pages: 3, PageWords: 4, WordPages: 6, SentencePages: 5, }), }
DEBUG [COMMIT-LOG-WRITER] 2009-10-09 11:24:45,243 CommitLog.java (line  
509) Marking replay position 257 on commit log /var/lib/cassandra/ 
commitlog/CommitLog-1255087417263.log
  INFO [MEMTABLE-FLUSHER-POOL:1] 2009-10-09 11:24:45,243 Memtable.java  
(line 220) Completed flushing /mnt/cassandra/data/system/ 
LocationInfo-19-Data.db
DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228  
SSTableReader.java (line 58) index size for bloom filter calc for  
file  : /mnt/cassandra/data/Fetcher/WordPages-347-Data.db   : 256
DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228  
SSTableReader.java (line 58) index size for bloom filter calc for  
file  : /mnt/cassandra/data/Fetcher/WordPages-416-Data.db   : 512
DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228  
SSTableReader.java (line 58) index size for bloom filter calc for  
file  : /mnt/cassandra/data/Fetcher/WordPages-486-Data.db   : 768
DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228  
SSTableReader.java (line 58) index size for bloom filter calc for  
file  : /mnt/cassandra/data/Fetcher/WordPages-555-Data.db   : 1024
DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228  
ColumnFamilyStore.java (line 1048) Expected bloom filter size : 1024
DEBUG [Timer-0] 2009-10-09 11:28:39,859 LoadDisseminator.java (line  
40) Disseminating load info ...
DEBUG [Timer-0] 2009-10-09 11:33:40,783 LoadDisseminator.java (line  
40) Disseminating load info ...
DEBUG [Timer-0] 2009-10-09 11:38:40,956 LoadDisseminator.java (line  
40) Disseminating load info ...
DEBUG [Timer-0] 2009-10-09 11:43:40,064 LoadDisseminator.java (line  
40) Disseminating load info ...


If I try to insert anything, I get stuff like this:

ERROR [pool-1-thread-5324] 2009-10-09 10:12:36,574 StorageProxy.java  
(line 179) error writing key md5
java.util.concurrent.TimeoutException: Operation timed out - received  
only 0 responses from  .
	at org.apache.cassandra.service.QuorumResponseHandler.get 
(QuorumResponseHandler.java:88)
	at org.apache.cassandra.service.StorageProxy.insertBlocking 
(StorageProxy.java:164)
	at org.apache.cassandra.service.CassandraServer.doInsert 
(CassandraServer.java:468)
	at org.apache.cassandra.service.CassandraServer.insert 
(CassandraServer.java:421)
	at org.apache.cassandra.service.Cassandra$Processor$insert.process 
(Cassandra.java:824)
	at org.apache.cassandra.service.Cassandra$Processor.process 
(Cassandra.java:627)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run 
(TThreadPoolServer.java:253)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask 
(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run 
(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:619)


Any ideas?

Best regards
Dan

Re: Irresponsive nodes

Posted by Dan Larsen <da...@techba.se>.

That is exactly, what I have discovered by now ;-)
Looking forward for the next release then!

Thanks Jonathan!

Best
Dan


On 09/10/2009, at 16.21, Jonathan Ellis wrote:

> if you swamp it with inserts faster than it can write them, it will
> start spending more and more time trying to GC.  that's what's
> happening...  trunk is smarter about this and will stop accepting
> writes before it gets to that point, but for 0.4 you just need to be a
> little careful.
>
> -Jonathan
>
> On Fri, Oct 9, 2009 at 7:11 AM, Dan Larsen <da...@techba.se> wrote:
>> Thanks for the tips Eric.
>>
>> I was just about to try it, when I noticed, it had become  
>> responsive again.
>> It took exactly 1 hour, before it was done!...
>>
>> But when I restart now, it's ready almost immediatly... Weird stuff!!
>>
>> I will try out your tips, next time this happens!
>>
>> It sounds like, it's pretty well-defined, when the JVM dies under  
>> GC load..?
>> Any pointers there?
>> I was just thinking, that it might be possible, to add nodes based on
>> current knowledge?
>>
>> #Dan
>>
>> On 09/10/2009, at 13.58, Eric Bowman wrote:
>>
>>> A few things to try:
>>>
>>> 1. Enable verbose GC logging to see if your JVM is dying under GC  
>>> load.
>>> 2. pkill -3 java will dump some nice stack traces from all running
>>> threads, could be some clues there.
>>>
>>>
>>> Dan Larsen wrote:
>>>>
>>>> Hi again :-)
>>>>
>>>> O.k... New problem...
>>>> I have an Amazon EC2 node with 4 "CPUs" and 7.5 GB of RAM.
>>>> Running CommitLog on 1 disk and data on another.
>>>> Cassandra 0.4.0 - (yes I have checked... correct version :-P)
>>>> 6GB set in the cassandra.in.sh.
>>>>
>>>> I started throwing data at it, without problems.
>>>> All of a sudden, the node becomes irresponsive.
>>>>
>>>> I only have 6.6GB of data in the DBs.
>>>>
>>>> I experienced the same thing, while running much smaller nodes.
>>>>
>>>> I tried restarting cassandra (kill [pid]).
>>>>
>>>> When it starts up, it goes crazy for a while, trying to fill up the
>>>> RAM or something.
>>>> Then it stops filling RAM, but keeps a load of ~100% CPU.
>>>> It doesn't respond to anything, but a nodeprobe info, which  
>>>> responds,
>>>> but VERY slowly.
>>>>
>>>>
>>>> The log doesn't give me anything - not that I can understand  
>>>> anyways...
>>>>
>>>> [.....]
>>>> INFO [main] 2009-10-09 11:23:37,320 CassandraDaemon.java (line 142)
>>>> Cassandra starting up...
>>>> INFO [PERIODIC-FLUSHER-POOL:1] 2009-10-09 11:24:40,239
>>>> ColumnFamilyStore.java (line 369) LocationInfo has reached its
>>>> threshold; switching in a fresh Memtable
>>>> INFO [PERIODIC-FLUSHER-POOL:1] 2009-10-09 11:24:40,239
>>>> ColumnFamilyStore.java (line 1178) Enqueuing flush of
>>>> Memtable(LocationInfo)@2116316013
>>>> INFO [MEMTABLE-FLUSHER-POOL:1] 2009-10-09 11:24:41,039  
>>>> Memtable.java
>>>> (line 186) Flushing Memtable(LocationInfo)@2116316013
>>>> DEBUG [COMMIT-LOG-WRITER] 2009-10-09 11:24:45,191 CommitLog.java  
>>>> (line
>>>> 466) discard completed log segments for
>>>>
>>>> CommitLogContext(file='/var/lib/cassandra/commitlog/ 
>>>> CommitLog-1255087417263.log',
>>>> position=257), column family 0. CFIDs are system:
>>>> TableMetadata(LocationInfo: 0, HintsColumnFamily: 1, }), Fetcher:
>>>> TableMetadata(PageSentences: 2, Pages: 3, PageWords: 4,  
>>>> WordPages: 6,
>>>> SentencePages: 5, }), }
>>>> DEBUG [COMMIT-LOG-WRITER] 2009-10-09 11:24:45,243 CommitLog.java  
>>>> (line
>>>> 509) Marking replay position 257 on commit log
>>>> /var/lib/cassandra/commitlog/CommitLog-1255087417263.log
>>>> INFO [MEMTABLE-FLUSHER-POOL:1] 2009-10-09 11:24:45,243  
>>>> Memtable.java
>>>> (line 220) Completed flushing
>>>> /mnt/cassandra/data/system/LocationInfo-19-Data.db
>>>> DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228
>>>> SSTableReader.java (line 58) index size for bloom filter calc for  
>>>> file
>>>> : /mnt/cassandra/data/Fetcher/WordPages-347-Data.db : 256
>>>> DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228
>>>> SSTableReader.java (line 58) index size for bloom filter calc for  
>>>> file
>>>> : /mnt/cassandra/data/Fetcher/WordPages-416-Data.db : 512
>>>> DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228
>>>> SSTableReader.java (line 58) index size for bloom filter calc for  
>>>> file
>>>> : /mnt/cassandra/data/Fetcher/WordPages-486-Data.db : 768
>>>> DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228
>>>> SSTableReader.java (line 58) index size for bloom filter calc for  
>>>> file
>>>> : /mnt/cassandra/data/Fetcher/WordPages-555-Data.db : 1024
>>>> DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228
>>>> ColumnFamilyStore.java (line 1048) Expected bloom filter size :  
>>>> 1024
>>>> DEBUG [Timer-0] 2009-10-09 11:28:39,859 LoadDisseminator.java (line
>>>> 40) Disseminating load info ...
>>>> DEBUG [Timer-0] 2009-10-09 11:33:40,783 LoadDisseminator.java (line
>>>> 40) Disseminating load info ...
>>>> DEBUG [Timer-0] 2009-10-09 11:38:40,956 LoadDisseminator.java (line
>>>> 40) Disseminating load info ...
>>>> DEBUG [Timer-0] 2009-10-09 11:43:40,064 LoadDisseminator.java (line
>>>> 40) Disseminating load info ...
>>>>
>>>>
>>>> If I try to insert anything, I get stuff like this:
>>>>
>>>> ERROR [pool-1-thread-5324] 2009-10-09 10:12:36,574  
>>>> StorageProxy.java
>>>> (line 179) error writing key md5
>>>> java.util.concurrent.TimeoutException: Operation timed out -  
>>>> received
>>>> only 0 responses from .
>>>> at
>>>>
>>>> org.apache.cassandra.service.QuorumResponseHandler.get 
>>>> (QuorumResponseHandler.java:88)
>>>>
>>>> at
>>>>
>>>> org.apache.cassandra.service.StorageProxy.insertBlocking 
>>>> (StorageProxy.java:164)
>>>>
>>>> at
>>>>
>>>> org.apache.cassandra.service.CassandraServer.doInsert 
>>>> (CassandraServer.java:468)
>>>>
>>>> at
>>>>
>>>> org.apache.cassandra.service.CassandraServer.insert 
>>>> (CassandraServer.java:421)
>>>>
>>>> at
>>>>
>>>> org.apache.cassandra.service.Cassandra$Processor$insert.process 
>>>> (Cassandra.java:824)
>>>>
>>>> at
>>>>
>>>> org.apache.cassandra.service.Cassandra$Processor.process 
>>>> (Cassandra.java:627)
>>>>
>>>> at
>>>>
>>>> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run 
>>>> (TThreadPoolServer.java:253)
>>>>
>>>> at
>>>>
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask 
>>>> (ThreadPoolExecutor.java:886)
>>>>
>>>> at
>>>>
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run 
>>>> (ThreadPoolExecutor.java:908)
>>>>
>>>> at java.lang.Thread.run(Thread.java:619)
>>>>
>>>>
>>>> Any ideas?
>>>>
>>>> Best regards
>>>> Dan
>>>
>>>
>>> --
>>> Eric Bowman
>>> Boboco Ltd
>>> ebowman@boboco.ie
>>> http://www.boboco.ie/ebowman/pubkey.pgp
>>> +35318394189/+353872801532
>>>
>>>
>>
>>
>

Re: Irresponsive nodes

Posted by Jonathan Ellis <jb...@gmail.com>.

if you swamp it with inserts faster than it can write them, it will
start spending more and more time trying to GC.  that's what's
happening...  trunk is smarter about this and will stop accepting
writes before it gets to that point, but for 0.4 you just need to be a
little careful.

-Jonathan

On Fri, Oct 9, 2009 at 7:11 AM, Dan Larsen <da...@techba.se> wrote:
> Thanks for the tips Eric.
>
> I was just about to try it, when I noticed, it had become responsive again.
> It took exactly 1 hour, before it was done!...
>
> But when I restart now, it's ready almost immediatly... Weird stuff!!
>
> I will try out your tips, next time this happens!
>
> It sounds like, it's pretty well-defined, when the JVM dies under GC load..?
> Any pointers there?
> I was just thinking, that it might be possible, to add nodes based on
> current knowledge?
>
> #Dan
>
> On 09/10/2009, at 13.58, Eric Bowman wrote:
>
>> A few things to try:
>>
>> 1. Enable verbose GC logging to see if your JVM is dying under GC load.
>> 2. pkill -3 java will dump some nice stack traces from all running
>> threads, could be some clues there.
>>
>>
>> Dan Larsen wrote:
>>>
>>> Hi again :-)
>>>
>>> O.k... New problem...
>>> I have an Amazon EC2 node with 4 "CPUs" and 7.5 GB of RAM.
>>> Running CommitLog on 1 disk and data on another.
>>> Cassandra 0.4.0 - (yes I have checked... correct version :-P)
>>> 6GB set in the cassandra.in.sh.
>>>
>>> I started throwing data at it, without problems.
>>> All of a sudden, the node becomes irresponsive.
>>>
>>> I only have 6.6GB of data in the DBs.
>>>
>>> I experienced the same thing, while running much smaller nodes.
>>>
>>> I tried restarting cassandra (kill [pid]).
>>>
>>> When it starts up, it goes crazy for a while, trying to fill up the
>>> RAM or something.
>>> Then it stops filling RAM, but keeps a load of ~100% CPU.
>>> It doesn't respond to anything, but a nodeprobe info, which responds,
>>> but VERY slowly.
>>>
>>>
>>> The log doesn't give me anything - not that I can understand anyways...
>>>
>>> [.....]
>>> INFO [main] 2009-10-09 11:23:37,320 CassandraDaemon.java (line 142)
>>> Cassandra starting up...
>>> INFO [PERIODIC-FLUSHER-POOL:1] 2009-10-09 11:24:40,239
>>> ColumnFamilyStore.java (line 369) LocationInfo has reached its
>>> threshold; switching in a fresh Memtable
>>> INFO [PERIODIC-FLUSHER-POOL:1] 2009-10-09 11:24:40,239
>>> ColumnFamilyStore.java (line 1178) Enqueuing flush of
>>> Memtable(LocationInfo)@2116316013
>>> INFO [MEMTABLE-FLUSHER-POOL:1] 2009-10-09 11:24:41,039 Memtable.java
>>> (line 186) Flushing Memtable(LocationInfo)@2116316013
>>> DEBUG [COMMIT-LOG-WRITER] 2009-10-09 11:24:45,191 CommitLog.java (line
>>> 466) discard completed log segments for
>>>
>>> CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1255087417263.log',
>>> position=257), column family 0. CFIDs are system:
>>> TableMetadata(LocationInfo: 0, HintsColumnFamily: 1, }), Fetcher:
>>> TableMetadata(PageSentences: 2, Pages: 3, PageWords: 4, WordPages: 6,
>>> SentencePages: 5, }), }
>>> DEBUG [COMMIT-LOG-WRITER] 2009-10-09 11:24:45,243 CommitLog.java (line
>>> 509) Marking replay position 257 on commit log
>>> /var/lib/cassandra/commitlog/CommitLog-1255087417263.log
>>> INFO [MEMTABLE-FLUSHER-POOL:1] 2009-10-09 11:24:45,243 Memtable.java
>>> (line 220) Completed flushing
>>> /mnt/cassandra/data/system/LocationInfo-19-Data.db
>>> DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228
>>> SSTableReader.java (line 58) index size for bloom filter calc for file
>>> : /mnt/cassandra/data/Fetcher/WordPages-347-Data.db : 256
>>> DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228
>>> SSTableReader.java (line 58) index size for bloom filter calc for file
>>> : /mnt/cassandra/data/Fetcher/WordPages-416-Data.db : 512
>>> DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228
>>> SSTableReader.java (line 58) index size for bloom filter calc for file
>>> : /mnt/cassandra/data/Fetcher/WordPages-486-Data.db : 768
>>> DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228
>>> SSTableReader.java (line 58) index size for bloom filter calc for file
>>> : /mnt/cassandra/data/Fetcher/WordPages-555-Data.db : 1024
>>> DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228
>>> ColumnFamilyStore.java (line 1048) Expected bloom filter size : 1024
>>> DEBUG [Timer-0] 2009-10-09 11:28:39,859 LoadDisseminator.java (line
>>> 40) Disseminating load info ...
>>> DEBUG [Timer-0] 2009-10-09 11:33:40,783 LoadDisseminator.java (line
>>> 40) Disseminating load info ...
>>> DEBUG [Timer-0] 2009-10-09 11:38:40,956 LoadDisseminator.java (line
>>> 40) Disseminating load info ...
>>> DEBUG [Timer-0] 2009-10-09 11:43:40,064 LoadDisseminator.java (line
>>> 40) Disseminating load info ...
>>>
>>>
>>> If I try to insert anything, I get stuff like this:
>>>
>>> ERROR [pool-1-thread-5324] 2009-10-09 10:12:36,574 StorageProxy.java
>>> (line 179) error writing key md5
>>> java.util.concurrent.TimeoutException: Operation timed out - received
>>> only 0 responses from .
>>> at
>>>
>>> org.apache.cassandra.service.QuorumResponseHandler.get(QuorumResponseHandler.java:88)
>>>
>>> at
>>>
>>> org.apache.cassandra.service.StorageProxy.insertBlocking(StorageProxy.java:164)
>>>
>>> at
>>>
>>> org.apache.cassandra.service.CassandraServer.doInsert(CassandraServer.java:468)
>>>
>>> at
>>>
>>> org.apache.cassandra.service.CassandraServer.insert(CassandraServer.java:421)
>>>
>>> at
>>>
>>> org.apache.cassandra.service.Cassandra$Processor$insert.process(Cassandra.java:824)
>>>
>>> at
>>>
>>> org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:627)
>>>
>>> at
>>>
>>> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
>>>
>>> at
>>>
>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>
>>> at
>>>
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>
>>> at java.lang.Thread.run(Thread.java:619)
>>>
>>>
>>> Any ideas?
>>>
>>> Best regards
>>> Dan
>>
>>
>> --
>> Eric Bowman
>> Boboco Ltd
>> ebowman@boboco.ie
>> http://www.boboco.ie/ebowman/pubkey.pgp
>> +35318394189/+353872801532
>>
>>
>
>

Re: Irresponsive nodes

Posted by Dan Larsen <da...@techba.se>.

Thanks for the tips Eric.

I was just about to try it, when I noticed, it had become responsive  
again.
It took exactly 1 hour, before it was done!...

But when I restart now, it's ready almost immediatly... Weird stuff!!

I will try out your tips, next time this happens!

It sounds like, it's pretty well-defined, when the JVM dies under GC  
load..?
Any pointers there?
I was just thinking, that it might be possible, to add nodes based on  
current knowledge?

#Dan

On 09/10/2009, at 13.58, Eric Bowman wrote:

> A few things to try:
>
> 1. Enable verbose GC logging to see if your JVM is dying under GC  
> load.
> 2. pkill -3 java will dump some nice stack traces from all running
> threads, could be some clues there.
>
>
> Dan Larsen wrote:
>> Hi again :-)
>>
>> O.k... New problem...
>> I have an Amazon EC2 node with 4 "CPUs" and 7.5 GB of RAM.
>> Running CommitLog on 1 disk and data on another.
>> Cassandra 0.4.0 - (yes I have checked... correct version :-P)
>> 6GB set in the cassandra.in.sh.
>>
>> I started throwing data at it, without problems.
>> All of a sudden, the node becomes irresponsive.
>>
>> I only have 6.6GB of data in the DBs.
>>
>> I experienced the same thing, while running much smaller nodes.
>>
>> I tried restarting cassandra (kill [pid]).
>>
>> When it starts up, it goes crazy for a while, trying to fill up the
>> RAM or something.
>> Then it stops filling RAM, but keeps a load of ~100% CPU.
>> It doesn't respond to anything, but a nodeprobe info, which responds,
>> but VERY slowly.
>>
>>
>> The log doesn't give me anything - not that I can understand  
>> anyways...
>>
>> [.....]
>> INFO [main] 2009-10-09 11:23:37,320 CassandraDaemon.java (line 142)
>> Cassandra starting up...
>> INFO [PERIODIC-FLUSHER-POOL:1] 2009-10-09 11:24:40,239
>> ColumnFamilyStore.java (line 369) LocationInfo has reached its
>> threshold; switching in a fresh Memtable
>> INFO [PERIODIC-FLUSHER-POOL:1] 2009-10-09 11:24:40,239
>> ColumnFamilyStore.java (line 1178) Enqueuing flush of
>> Memtable(LocationInfo)@2116316013
>> INFO [MEMTABLE-FLUSHER-POOL:1] 2009-10-09 11:24:41,039 Memtable.java
>> (line 186) Flushing Memtable(LocationInfo)@2116316013
>> DEBUG [COMMIT-LOG-WRITER] 2009-10-09 11:24:45,191 CommitLog.java  
>> (line
>> 466) discard completed log segments for
>> CommitLogContext(file='/var/lib/cassandra/commitlog/ 
>> CommitLog-1255087417263.log',
>> position=257), column family 0. CFIDs are system:
>> TableMetadata(LocationInfo: 0, HintsColumnFamily: 1, }), Fetcher:
>> TableMetadata(PageSentences: 2, Pages: 3, PageWords: 4, WordPages: 6,
>> SentencePages: 5, }), }
>> DEBUG [COMMIT-LOG-WRITER] 2009-10-09 11:24:45,243 CommitLog.java  
>> (line
>> 509) Marking replay position 257 on commit log
>> /var/lib/cassandra/commitlog/CommitLog-1255087417263.log
>> INFO [MEMTABLE-FLUSHER-POOL:1] 2009-10-09 11:24:45,243 Memtable.java
>> (line 220) Completed flushing
>> /mnt/cassandra/data/system/LocationInfo-19-Data.db
>> DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228
>> SSTableReader.java (line 58) index size for bloom filter calc for  
>> file
>> : /mnt/cassandra/data/Fetcher/WordPages-347-Data.db : 256
>> DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228
>> SSTableReader.java (line 58) index size for bloom filter calc for  
>> file
>> : /mnt/cassandra/data/Fetcher/WordPages-416-Data.db : 512
>> DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228
>> SSTableReader.java (line 58) index size for bloom filter calc for  
>> file
>> : /mnt/cassandra/data/Fetcher/WordPages-486-Data.db : 768
>> DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228
>> SSTableReader.java (line 58) index size for bloom filter calc for  
>> file
>> : /mnt/cassandra/data/Fetcher/WordPages-555-Data.db : 1024
>> DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228
>> ColumnFamilyStore.java (line 1048) Expected bloom filter size : 1024
>> DEBUG [Timer-0] 2009-10-09 11:28:39,859 LoadDisseminator.java (line
>> 40) Disseminating load info ...
>> DEBUG [Timer-0] 2009-10-09 11:33:40,783 LoadDisseminator.java (line
>> 40) Disseminating load info ...
>> DEBUG [Timer-0] 2009-10-09 11:38:40,956 LoadDisseminator.java (line
>> 40) Disseminating load info ...
>> DEBUG [Timer-0] 2009-10-09 11:43:40,064 LoadDisseminator.java (line
>> 40) Disseminating load info ...
>>
>>
>> If I try to insert anything, I get stuff like this:
>>
>> ERROR [pool-1-thread-5324] 2009-10-09 10:12:36,574 StorageProxy.java
>> (line 179) error writing key md5
>> java.util.concurrent.TimeoutException: Operation timed out - received
>> only 0 responses from .
>> at
>> org.apache.cassandra.service.QuorumResponseHandler.get 
>> (QuorumResponseHandler.java:88)
>>
>> at
>> org.apache.cassandra.service.StorageProxy.insertBlocking 
>> (StorageProxy.java:164)
>>
>> at
>> org.apache.cassandra.service.CassandraServer.doInsert 
>> (CassandraServer.java:468)
>>
>> at
>> org.apache.cassandra.service.CassandraServer.insert 
>> (CassandraServer.java:421)
>>
>> at
>> org.apache.cassandra.service.Cassandra$Processor$insert.process 
>> (Cassandra.java:824)
>>
>> at
>> org.apache.cassandra.service.Cassandra$Processor.process 
>> (Cassandra.java:627)
>>
>> at
>> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run 
>> (TThreadPoolServer.java:253)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask 
>> (ThreadPoolExecutor.java:886)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run 
>> (ThreadPoolExecutor.java:908)
>>
>> at java.lang.Thread.run(Thread.java:619)
>>
>>
>> Any ideas?
>>
>> Best regards
>> Dan
>
>
> -- 
> Eric Bowman
> Boboco Ltd
> ebowman@boboco.ie
> http://www.boboco.ie/ebowman/pubkey.pgp
> +35318394189/+353872801532
>
>

Re: Irresponsive nodes

Posted by Eric Bowman <eb...@boboco.ie>.

A few things to try:

1. Enable verbose GC logging to see if your JVM is dying under GC load.
2. pkill -3 java will dump some nice stack traces from all running
threads, could be some clues there.


Dan Larsen wrote:
> Hi again :-)
>
> O.k... New problem...
> I have an Amazon EC2 node with 4 "CPUs" and 7.5 GB of RAM.
> Running CommitLog on 1 disk and data on another.
> Cassandra 0.4.0 - (yes I have checked... correct version :-P)
> 6GB set in the cassandra.in.sh.
>
> I started throwing data at it, without problems.
> All of a sudden, the node becomes irresponsive.
>
> I only have 6.6GB of data in the DBs.
>
> I experienced the same thing, while running much smaller nodes.
>
> I tried restarting cassandra (kill [pid]).
>
> When it starts up, it goes crazy for a while, trying to fill up the
> RAM or something.
> Then it stops filling RAM, but keeps a load of ~100% CPU.
> It doesn't respond to anything, but a nodeprobe info, which responds,
> but VERY slowly.
>
>
> The log doesn't give me anything - not that I can understand anyways...
>
> [.....]
> INFO [main] 2009-10-09 11:23:37,320 CassandraDaemon.java (line 142)
> Cassandra starting up...
> INFO [PERIODIC-FLUSHER-POOL:1] 2009-10-09 11:24:40,239
> ColumnFamilyStore.java (line 369) LocationInfo has reached its
> threshold; switching in a fresh Memtable
> INFO [PERIODIC-FLUSHER-POOL:1] 2009-10-09 11:24:40,239
> ColumnFamilyStore.java (line 1178) Enqueuing flush of
> Memtable(LocationInfo)@2116316013
> INFO [MEMTABLE-FLUSHER-POOL:1] 2009-10-09 11:24:41,039 Memtable.java
> (line 186) Flushing Memtable(LocationInfo)@2116316013
> DEBUG [COMMIT-LOG-WRITER] 2009-10-09 11:24:45,191 CommitLog.java (line
> 466) discard completed log segments for
> CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1255087417263.log',
> position=257), column family 0. CFIDs are system:
> TableMetadata(LocationInfo: 0, HintsColumnFamily: 1, }), Fetcher:
> TableMetadata(PageSentences: 2, Pages: 3, PageWords: 4, WordPages: 6,
> SentencePages: 5, }), }
> DEBUG [COMMIT-LOG-WRITER] 2009-10-09 11:24:45,243 CommitLog.java (line
> 509) Marking replay position 257 on commit log
> /var/lib/cassandra/commitlog/CommitLog-1255087417263.log
> INFO [MEMTABLE-FLUSHER-POOL:1] 2009-10-09 11:24:45,243 Memtable.java
> (line 220) Completed flushing
> /mnt/cassandra/data/system/LocationInfo-19-Data.db
> DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228
> SSTableReader.java (line 58) index size for bloom filter calc for file
> : /mnt/cassandra/data/Fetcher/WordPages-347-Data.db : 256
> DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228
> SSTableReader.java (line 58) index size for bloom filter calc for file
> : /mnt/cassandra/data/Fetcher/WordPages-416-Data.db : 512
> DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228
> SSTableReader.java (line 58) index size for bloom filter calc for file
> : /mnt/cassandra/data/Fetcher/WordPages-486-Data.db : 768
> DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228
> SSTableReader.java (line 58) index size for bloom filter calc for file
> : /mnt/cassandra/data/Fetcher/WordPages-555-Data.db : 1024
> DEBUG [MINOR-COMPACTION-POOL:1] 2009-10-09 11:27:08,228
> ColumnFamilyStore.java (line 1048) Expected bloom filter size : 1024
> DEBUG [Timer-0] 2009-10-09 11:28:39,859 LoadDisseminator.java (line
> 40) Disseminating load info ...
> DEBUG [Timer-0] 2009-10-09 11:33:40,783 LoadDisseminator.java (line
> 40) Disseminating load info ...
> DEBUG [Timer-0] 2009-10-09 11:38:40,956 LoadDisseminator.java (line
> 40) Disseminating load info ...
> DEBUG [Timer-0] 2009-10-09 11:43:40,064 LoadDisseminator.java (line
> 40) Disseminating load info ...
>
>
> If I try to insert anything, I get stuff like this:
>
> ERROR [pool-1-thread-5324] 2009-10-09 10:12:36,574 StorageProxy.java
> (line 179) error writing key md5
> java.util.concurrent.TimeoutException: Operation timed out - received
> only 0 responses from .
> at
> org.apache.cassandra.service.QuorumResponseHandler.get(QuorumResponseHandler.java:88)
>
> at
> org.apache.cassandra.service.StorageProxy.insertBlocking(StorageProxy.java:164)
>
> at
> org.apache.cassandra.service.CassandraServer.doInsert(CassandraServer.java:468)
>
> at
> org.apache.cassandra.service.CassandraServer.insert(CassandraServer.java:421)
>
> at
> org.apache.cassandra.service.Cassandra$Processor$insert.process(Cassandra.java:824)
>
> at
> org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:627)
>
> at
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>
> at java.lang.Thread.run(Thread.java:619)
>
>
> Any ideas?
>
> Best regards
> Dan


-- 
Eric Bowman
Boboco Ltd
ebowman@boboco.ie
http://www.boboco.ie/ebowman/pubkey.pgp
+35318394189/+353872801532