Posted to commits@cassandra.apache.org by "Łukasz Mrożkiewicz (JIRA)" <ji...@apache.org> on 2015/07/14 12:57:04 UTC
[jira] [Created] (CASSANDRA-9798) Cassandra seems to have deadlocks during flush operations

Łukasz Mrożkiewicz created CASSANDRA-9798:
---------------------------------------------

             Summary: Cassandra seems to have deadlocks during flush operations
                 Key: CASSANDRA-9798
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9798
             Project: Cassandra
          Issue Type: Bug
          Components: Core
         Environment: 4x HP Gen9 dl 360 servers
2x8 cpu each (Intel(R) Xeon E5-2667 v3 @ 3.20GHz)
6x900GB 10kRPM disk for data
1x900GB 10kRPM disk for commitlog
64GB ram
ETH: 10Gb/s
Red Hat Enterprise Linux Server release 6.6 (Santiago) 2.6.32-504.el6.x86_64
java build 1.8.0_45-b14 (openjdk) (tested on oracle java 8 too)
            Reporter: Łukasz Mrożkiewicz
         Attachments: cassandra.log, cassandra.yaml, gc.log.0.current

Hi,
We have noticed a problem with dropped mutations. Usually, on one random node, the following situation occurs:
MutationStage: "active" is full, "pending" is increasing, "completed" is stalled.
MemtableFlushWriter: "active" is 6, "pending" is 25, "completed" is stalled.
MemtablePostFlush: "active" is 1, "pending" is 29, "completed" is stalled.

After some time (30 s to 10 min) the pending mutations are dropped and everything works again.
When it happens:
1. CPU idle is ~95%.
2. There are no long GC pauses or increased activity.
3. Memory usage is 3.5GB of 8GB.
4. Only writes are being processed by Cassandra.
5. The problems appeared once LOAD exceeded 400GB/node.
6. The version is Cassandra 2.1.6.

There is a gap in the logs:
INFO  08:47:01 Timed out replaying hints to /192.168.100.83; aborting (0 delivered)
INFO  08:47:01 Enqueuing flush of hints: 7870567 (0%) on-heap, 0 (0%) off-heap
INFO  08:47:30 Enqueuing flush of ltemessages: 95301807 (4%) on-heap, 0 (0%) off-heap
INFO  08:47:31 Enqueuing flush of ltemessages: 60462632 (3%) on-heap, 0 (0%) off-heap
INFO  08:47:31 Enqueuing flush of ltecalls: 76973746 (4%) on-heap, 0 (0%) off-heap
INFO  08:47:31 Enqueuing flush of ltemessages: 84290135 (4%) on-heap, 0 (0%) off-heap
INFO  08:47:32 Enqueuing flush of ltecallsbycell: 56926652 (3%) on-heap, 0 (0%) off-heap
INFO  08:47:32 Enqueuing flush of ltemessages: 85124218 (4%) on-heap, 0 (0%) off-heap
INFO  08:47:33 Enqueuing flush of ltecalls: 95663415 (4%) on-heap, 0 (0%) off-heap
INFO  08:47:58 CompactionManager                 2        39
INFO  08:47:58 Writing Memtable-ltecalls@1767938721(13843064 serialized bytes, 162359 ops, 4%/0% of on/off-heap limit)
INFO  08:47:58 Writing Memtable-hints@1433125911(478703 serialized bytes, 424 ops, 0%/0% of on/off-heap limit)
INFO  08:47:58 Writing Memtable-ltecalls@1318583275(11783615 serialized bytes, 137378 ops, 4%/0% of on/off-heap limit)
INFO  08:47:58 Enqueuing flush of compactions_in_progress: 969 (0%) on-heap, 0 (0%) off-heap
INFO  08:47:58 Writing Memtable-ltemessages@541175113(17221327 serialized bytes, 180792 ops, 4%/0% of on/off-heap limit)
INFO  08:47:58 Writing Memtable-ltemessages@1361154669(27138519 serialized bytes, 273472 ops, 6%/0% of on/off-heap limit)

INFO  08:48:03 2176 MUTATION messages dropped in last 5000ms
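
To locate such gaps in a longer log, a small script can compare consecutive timestamps. The following is only a sketch: it assumes every entry starts with a level and an HH:MM:SS timestamp as in the excerpt above, that the log does not cross midnight, and the function name find_gaps and the 20-second threshold are hypothetical choices rather than anything Cassandra provides.

```python
import re
from datetime import datetime

# Matches entries such as "INFO  08:47:01 Enqueuing flush of hints: ..."
# (format assumed from the excerpt above).
LINE_RE = re.compile(r"^[A-Z]+\s+(\d{2}:\d{2}:\d{2})\s")


def find_gaps(lines, threshold_s=20):
    """Return (start, end, seconds) for each silence of at least threshold_s."""
    gaps = []
    prev = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # blank or wrapped continuation lines carry no timestamp
        ts = datetime.strptime(m.group(1), "%H:%M:%S")
        if prev is not None:
            delta = int((ts - prev).total_seconds())
            if delta >= threshold_s:
                gaps.append(
                    (prev.strftime("%H:%M:%S"), ts.strftime("%H:%M:%S"), delta)
                )
        prev = ts
    return gaps


# A few entries copied from the log excerpt above.
sample = [
    "INFO  08:47:01 Enqueuing flush of hints: 7870567 (0%) on-heap, 0 (0%) off-heap",
    "INFO  08:47:30 Enqueuing flush of ltemessages: 95301807 (4%) on-heap, 0 (0%) off-heap",
    "INFO  08:47:33 Enqueuing flush of ltecalls: 95663415 (4%) on-heap, 0 (0%) off-heap",
    "INFO  08:47:58 CompactionManager                 2        39",
    "INFO  08:48:03 2176 MUTATION messages dropped in last 5000ms",
]

print(find_gaps(sample))
```

On these entries the script reports two silences, 08:47:01 to 08:47:30 (29 s) and 08:47:33 to 08:47:58 (25 s).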


Use case:
100% writes at 100Mb/s, a couple of CFs with ~10 columns each, max cell size 100B.
Both CMS and G1GC were tested, with no difference.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)