You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Oleg Proudnikov <ol...@cloudorange.com> on 2011/02/07 05:26:04 UTC
Cassandra Thrift calls block indefinitely

Hi All,

This is the first time I see this. I am using Hector for a bulk load into a 3
node Cassandra 0.7.0 cluster. I have been doing this for a while now but this
time the load was more intense compared to the ones before and it was running
from a single client machine because I was afraid to overload the cluster.

My client blocked on a batch_insert for a very long time. Actually the call
never returned. This happened during bulk load at night. In fact the client got
auto-killed and restarted 20 times and every time it was blocked. After seeing
this in the morning many hours later I ran new Java stress test and saw this:

./stress -o INSERT -e ONE -c 30 -n 100000 -l 1 -i 1 -t 1
Created keyspaces. Sleeping 1s for propagation.
total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
0,0,0,NaN,1
0,0,0,NaN,2
0,0,0,NaN,3
0,0,0,NaN,4
0,0,0,NaN,5
0,0,0,NaN,6

Then I used cli to insert a single column into CF Standard1 created by the test
above. After this stress test was able to run. Here is what the log had in the
morning - the time is European. This is likely related to the stress test. I am
including this for completeness.

 INFO [MigrationStage:1] 2011-02-05 14:29:52,689 ColumnFamilyStore.java (line
648) switching in a fresh Memtable for Migrations at
CommitLogContext(file='/data/social2/
commitlog/CommitLog-1296861890465.log', position=6798985)
 INFO [MigrationStage:1] 2011-02-05 14:29:52,689 ColumnFamilyStore.java (line
952) Enqueuing flush of Memtable-Migrations@24152491(9724 bytes, 1 operations)
 INFO [MigrationStage:1] 2011-02-05 14:29:52,689 ColumnFamilyStore.java (line
648) switching in a fresh Memtable for Schema at
CommitLogContext(file='/data/social2/comm
itlog/CommitLog-1296861890465.log', position=6798985)
 INFO [FlushWriter:1] 2011-02-05 14:29:52,689 Memtable.java (line 155) Writing
Memtable-Migrations@24152491(9724 bytes, 1 operations)
INFO [MigrationStage:1] 2011-02-05 14:29:52,689 ColumnFamilyStore.java (line
952) Enqueuing flush of Memtable-Schema@22965127(5481 bytes, 5 operations)
INFO [FlushWriter:1] 2011-02-05 14:29:52,955 Memtable.java (line 162) Completed
flushing /data/social2/data/system/Migrations-e-15-Data.db (9857 bytes)
INFO [FlushWriter:1] 2011-02-05 14:29:52,955 Memtable.java (line 155) Writing
Memtable-Schema@22965127(5481 bytes, 5 operations)
ERROR [MutationStage:12] 2011-02-05 14:29:53,014 RowMutationVerbHandler.java
(line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find 
cfId=1009

================

Here is a summary of some unusual activity on one of the nodes before the first
blockage. This node is different from the other nodes - it is slower according
to my stress tests due to non-Cassandra background and disk activity:

- node had an OOM 10 minutes prior to the first blockage, 
- node restarted normally
- replayed commit logs
- lost contact with 2 other nodes for a few milliseconds
- flushed one memtable with 640K operations
- filled the same memtable with another 640K columns in 9 seconds ??? impossible
 ??? Maybe I am misreading the log below?

INFO [MutationStage:4] 2011-02-05 00:18:14,437 ColumnFamilyStore.java (line 952)
Enqueuing flush of Memtable-Domains@29765464(14150991 bytes, 640000 operations)
INFO [MutationStage:4] 2011-02-05 00:18:23,743 ColumnFamilyStore.java (line 952)
Enqueuing flush of Memtable-Domains@1573293(13948680 bytes, 640000 operations)

- compacted 4 SSTables for this CF

- compacted 5 SSTables for the same CF - very unusual

- reported a large row - this CF does have a large row but it should not have
been touched in this load.

- while the above is happening CMS is reclaiming 1.2G several times

- node lost contact with another node that seems under some pressure with high
heap and no GC to speak of

- the CF in question now receives only 10K columns in 4 minutes

- the last activity is hinted hand off to the stressed node that reappeared.

After the blockage started the only activity I see is all CFs being flushed due
to time and all caches being saved. No other activity until the stress test in
the morning.

Thank you,
Oleg