Posted to user@cassandra.apache.org by "PenguinWhispererThe ." <th...@gmail.com> on 2015/12/02 11:21:07 UTC

Re: Cassandra compaction stuck? Should I disable?

So it seems I found the problem.

The node opening a stream is waiting for the other node to respond, but that
node never responds due to a broken pipe, which makes Cassandra wait forever.

It's basically this issue:
https://issues.apache.org/jira/browse/CASSANDRA-8472
And this is the workaround/fix:
https://issues.apache.org/jira/browse/CASSANDRA-8611

So:
- update Cassandra to >= 2.0.11
- add the option streaming_socket_timeout_in_ms = 10000 (see the snippet below)
- do a rolling restart of Cassandra
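
For reference, the cassandra.yaml change looks roughly like this (the option
name and the 10000 ms value are the ones from the steps above; as far as I
know the default is 0, i.e. never time out, which is exactly what lets a dead
peer hang a stream forever):

# cassandra.yaml, on every node, followed by the rolling restart
# 0 = never time out; 10000 ms makes a broken peer fail the stream
# instead of leaving it hanging
streaming_socket_timeout_in_ms: 10000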

What's weird is that the IOException: Broken pipe never shows up in my logs
(not on any node), even though my logging is set to INFO in the log4j config.
I have this config in log4j-server.properties:
# output messages into a rolling log file as well as stdout
log4j.rootLogger=INFO,stdout,R

# stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%5p %d{HH:mm:ss,SSS} %m%n

# rolling log file
log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.maxFileSize=20MB
log4j.appender.R.maxBackupIndex=50
log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=%5p [%t] %d{ISO8601} %F (line %L) %m%n
# Edit the next line to point to your logs directory
log4j.appender.R.File=/var/log/cassandra/system.log

# Application logging options
#log4j.logger.org.apache.cassandra=DEBUG
#log4j.logger.org.apache.cassandra.db=DEBUG
#log4j.logger.org.apache.cassandra.service.StorageProxy=DEBUG

# Adding this to avoid thrift logging disconnect errors.
log4j.logger.org.apache.thrift.server.TNonblockingServer=ERROR
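
If I wanted to catch those stream errors next time, I'd probably turn up
logging for the streaming classes, something along these lines (the logger
name is my assumption, by analogy with the commented-out examples above, and
not something I've verified on this cluster):

# possible addition to log4j-server.properties; org.apache.cassandra.streaming
# is the package the stream sessions log from, as far as I know
#log4j.logger.org.apache.cassandra.streaming=DEBUG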

Too bad nobody else could point me to those. Hope this saves someone else a
lot of time.
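
In case someone wants to confirm they're hitting the same thing: I'd look at
the node that appears stuck with nodetool netstats and check whether a stream
sits at the same byte count run after run, and compare nodetool
compactionstats output taken a few minutes apart. Something like:

# run on the node that looks stuck, then again a few minutes later;
# a stream that never advances matches the broken-pipe hang described above
nodetool netstats
nodetool compactionstats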

2015-11-11 15:42 GMT+01:00 Sebastian Estevez <sebastian.estevez@datastax.com
>:

> Use 'nodetool compactionhistory'
>
> all the best,
>
> Sebastián
> On Nov 11, 2015 3:23 AM, "PenguinWhispererThe ." <
> th3penguinwhisperer@gmail.com> wrote:
>
>> Does compactionstats show only stats for completed compactions (100%)?
>> It might be that the compaction is running constantly, over and over again.
>> In that case I need to know what I might be able to do to stop this
>> constant compaction so I can start a nodetool repair.
>>
>> Note that there is a lot of traffic on this columnfamily so I'm not sure
>> if temporarily disabling compaction is an option. The repair will probably
>> take long as well.
>>
>> Sebastian and Rob: do you have any more ideas about the things I
>> put in this thread? Any help is appreciated!
>>
>> 2015-11-10 20:03 GMT+01:00 PenguinWhispererThe . <
>> th3penguinwhisperer@gmail.com>:
>>
>>> Hi Sebastian,
>>>
>>> Thanks for your response.
>>>
>>> No swap is used. No offense, I just don't see a reason why having swap
>>> would be the issue here. I put swappiness on 1. I also have JNA installed.
>>> That should prevent Java from being swapped out as well, AFAIK.
>>>
>>>
>>> 2015-11-10 19:50 GMT+01:00 Sebastian Estevez <
>>> sebastian.estevez@datastax.com>:
>>>
>>>> Turn off Swap.
>>>>
>>>>
>>>> http://docs.datastax.com/en/cassandra/2.1/cassandra/install/installRecommendSettings.html?scroll=reference_ds_sxl_gf3_2k__disable-swap
>>>>
>>>>
>>>> All the best,
>>>>
>>>>
>>>> Sebastián Estévez
>>>>
>>>> Solutions Architect | 954 905 8615 | sebastian.estevez@datastax.com
>>>>
>>>>
>>>> On Tue, Nov 10, 2015 at 1:48 PM, PenguinWhispererThe . <
>>>> th3penguinwhisperer@gmail.com> wrote:
>>>>
>>>>> I also have the following memory usage:
>>>>> [root@US-BILLINGDSX4 cassandra]# free -m
>>>>>              total       used       free     shared    buffers     cached
>>>>> Mem:         12024       9455       2569          0        110       2163
>>>>> -/+ buffers/cache:       7180       4844
>>>>> Swap:         2047          0       2047
>>>>>
>>>>> Still a lot of memory free and a lot in buffers/cache.
>>>>>
>>>>> 2015-11-10 19:45 GMT+01:00 PenguinWhispererThe . <
>>>>> th3penguinwhisperer@gmail.com>:
>>>>>
>>>>>> Still stuck with this. However I enabled GC logging. This shows the
>>>>>> following:
>>>>>>
>>>>>> [root@myhost cassandra]# tail -f gc-1447180680.log
>>>>>> 2015-11-10T18:41:45.516+0000: 225.428: [GC 2721842K->2066508K(6209536K), 0.0199040 secs]
>>>>>> 2015-11-10T18:41:45.977+0000: 225.889: [GC 2721868K->2066511K(6209536K), 0.0221910 secs]
>>>>>> 2015-11-10T18:41:46.437+0000: 226.349: [GC 2721871K->2066524K(6209536K), 0.0222140 secs]
>>>>>> 2015-11-10T18:41:46.897+0000: 226.809: [GC 2721884K->2066539K(6209536K), 0.0224140 secs]
>>>>>> 2015-11-10T18:41:47.359+0000: 227.271: [GC 2721899K->2066538K(6209536K), 0.0302520 secs]
>>>>>> 2015-11-10T18:41:47.821+0000: 227.733: [GC 2721898K->2066557K(6209536K), 0.0280530 secs]
>>>>>> 2015-11-10T18:41:48.293+0000: 228.205: [GC 2721917K->2066571K(6209536K), 0.0218000 secs]
>>>>>> 2015-11-10T18:41:48.790+0000: 228.702: [GC 2721931K->2066780K(6209536K), 0.0292470 secs]
>>>>>> 2015-11-10T18:41:49.290+0000: 229.202: [GC 2722140K->2066843K(6209536K), 0.0288740 secs]
>>>>>> 2015-11-10T18:41:49.756+0000: 229.668: [GC 2722203K->2066818K(6209536K), 0.0283380 secs]
>>>>>> 2015-11-10T18:41:50.249+0000: 230.161: [GC 2722178K->2067158K(6209536K), 0.0218690 secs]
>>>>>> 2015-11-10T18:41:50.713+0000: 230.625: [GC 2722518K->2067236K(6209536K), 0.0278810 secs]
>>>>>>
>>>>>> This is a VM with 12GB of RAM. I raised the HEAP_SIZE to 6GB and
>>>>>> HEAP_NEWSIZE to 800MB.
>>>>>>
>>>>>> Still the same result.
>>>>>>
>>>>>> This looks very similar to following issue:
>>>>>>
>>>>>> http://mail-archives.apache.org/mod_mbox/cassandra-user/201411.mbox/%3CCAJ=3xgRLsvpnZe0uXEYjG94rKhfXeU+jBR=Q3A-_C3rsdD5kug@mail.gmail.com%3E
>>>>>>
>>>>>> Is upgrading memory the only option? I mean, I can't believe
>>>>>> it's just loading all its data into memory. Would that mean I have to keep
>>>>>> scaling up the node to keep it working?
>>>>>>
>>>>>>
>>>>>> 2015-11-10 9:36 GMT+01:00 PenguinWhispererThe . <
>>>>>> th3penguinwhisperer@gmail.com>:
>>>>>>
>>>>>>> Correction...
>>>>>>> I was grepping for Segmentation in the strace output and it happens a lot.
>>>>>>>
>>>>>>> Do I need to run a scrub?
>>>>>>>
>>>>>>> 2015-11-10 9:30 GMT+01:00 PenguinWhispererThe . <
>>>>>>> th3penguinwhisperer@gmail.com>:
>>>>>>>
>>>>>>>> Hi Rob,
>>>>>>>>
>>>>>>>> Thanks for your reply.
>>>>>>>>
>>>>>>>> 2015-11-09 23:17 GMT+01:00 Robert Coli <rc...@eventbrite.com>:
>>>>>>>>
>>>>>>>>> On Mon, Nov 9, 2015 at 1:29 PM, PenguinWhispererThe . <
>>>>>>>>> th3penguinwhisperer@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> In Opscenter I see one of the nodes is orange. It seems like it's
>>>>>>>>>> working on compaction. I used nodetool compactionstats and whenever I did
>>>>>>>>>> this, the Completed value and percentage stay the same (even with hours in
>>>>>>>>>> between).
>>>>>>>>>>
>>>>>>>>> Are you the same person from IRC, or a second report today of
>>>>>>>>> compaction hanging in this way?
>>>>>>>>>
>>>>>>>> Same person ;) I just didn't have much to work with from the chat
>>>>>>>> there. I want to understand the issue more, see what I can tune or fix. I
>>>>>>>> want to do nodetool repair before upgrading to 2.1.11 but the compaction is
>>>>>>>> blocking it.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> What version of Cassandra?
>>>>>>>>>
>>>>>>>> 2.0.9
>>>>>>>>
>>>>>>>>> I currently don't see cpu load from cassandra on that node. So it
>>>>>>>>>> seems stuck (somewhere mid 60%). Also some other nodes have compaction on
>>>>>>>>>> the same columnfamily. I don't see any progress.
>>>>>>>>>>
>>>>>>>>>>  WARN [RMI TCP Connection(554)-192.168.0.68] 2015-11-09 17:18:13,677 ColumnFamilyStore.java (line 2101) Unable to cancel in-progress compactions for usage_record_ptd.  Probably there is an unusually large row in progress somewhere.  It is also possible that buggy code left some sstables compacting after it was done with them
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - How can I make sure that nothing is happening?
>>>>>>>>>>
>>>>>>>>>> Find the thread that is doing compaction and strace it. Generally
>>>>>>>>> it is one of the threads with a lower thread priority.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I have 141 threads. Not sure if that's normal.
>>>>>>>>
>>>>>>>> This seems to be the one:
>>>>>>>>  61404 cassandr  24   4 8948m 4.3g 820m R 90.2 36.8 292:54.47 java
>>>>>>>>
>>>>>>>> In the strace output I basically see this part repeating (with the occasional
>>>>>>>> "resource temporarily unavailable"):
>>>>>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50,
>>>>>>>> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>>>>>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
>>>>>>>> getpriority(PRIO_PROCESS, 61404)        = 16
>>>>>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50,
>>>>>>>> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>>>>>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>>>>> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494045, NULL) = -1 EAGAIN
>>>>>>>> (Resource temporarily unavailable)
>>>>>>>> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50,
>>>>>>>> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>>>>>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
>>>>>>>> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494047, NULL) = 0
>>>>>>>> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50,
>>>>>>>> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>>>>>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
>>>>>>>> getpriority(PRIO_PROCESS, 61404)        = 16
>>>>>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50,
>>>>>>>> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>>>>>> futex(0x7f5c64145e28, FUTEX_WAKE_PRIVATE, 1) = 1
>>>>>>>> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494049, NULL) = 0
>>>>>>>> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>>>>> getpriority(PRIO_PROCESS, 61404)        = 16
>>>>>>>>
>>>>>>>> But wait!
>>>>>>>> I also see this:
>>>>>>>> futex(0x7f5c64145e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f5c64145e50,
>>>>>>>> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>>>>>>>> futex(0x1233854, FUTEX_WAIT_PRIVATE, 494055, NULL) = 0
>>>>>>>> futex(0x1233828, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>>>>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>>>>>>>>
>>>>>>>> This doesn't seem to happen that often though.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Compaction often appears hung when decompressing a very large row,
>>>>>>>>> but usually not for "hours".
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - Is it recommended to disable compaction above a certain data
>>>>>>>>>>    size? (I believe 25GB on each node).
>>>>>>>>>>
>>>>>>>>>> It is almost never recommended to disable compaction.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - Can I stop this compaction? nodetool stop compaction
>>>>>>>>>>    doesn't seem to work.
>>>>>>>>>>
>>>>>>>>>> Killing the JVM ("the dungeon collapses!") would certainly stop
>>>>>>>>> it, but it'd likely just start again when you restart the node.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - Is stopping the compaction dangerous?
>>>>>>>>>>
>>>>>>>>>>  Not if you're in a version that properly cleans up partial
>>>>>>>>> compactions, which is most of them.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - Is killing the cassandra process dangerous while
>>>>>>>>>>    compacting(I did nodetool drain on one node)?
>>>>>>>>>>
>>>>>>>>>> No. But probably nodetool drain couldn't actually stop the
>>>>>>>>> in-progress compaction either, FWIW.
>>>>>>>>>
>>>>>>>>>> This is output of nodetool compactionstats grepped for the
>>>>>>>>>> keyspace that seems stuck.
>>>>>>>>>>
>>>>>>>>>> Do you have gigantic rows in that keyspace? What does cfstats say
>>>>>>>>> about the largest row compaction has seen/do you have log messages about
>>>>>>>>> compacting large rows?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I don't know about the gigantic rows. How can I check?
>>>>>>>>
>>>>>>>> I've checked the logs and found this:
>>>>>>>>  INFO [CompactionExecutor:67] 2015-11-10 02:34:19,077
>>>>>>>> CompactionController.java (line 192) Compacting large row
>>>>>>>> billing/usage_record_ptd:177727:2015-10-14 00\:00Z (243992466 bytes)
>>>>>>>> incrementally
>>>>>>>> So this is from 6 hours ago.
>>>>>>>>
>>>>>>>> I also see a lot of messages like this:
>>>>>>>> INFO [OptionalTasks:1] 2015-11-10 06:36:06,395 MeteredFlusher.java
>>>>>>>> (line 58) flushing high-traffic column family CFS(Keyspace='mykeyspace',
>>>>>>>> ColumnFamily='mycolumnfamily') (estimated 100317609 bytes)
>>>>>>>> And (although it's unrelated, might this impact compaction
>>>>>>>> performance?):
>>>>>>>>  WARN [Native-Transport-Requests:10514] 2015-11-10 06:33:34,172
>>>>>>>> BatchStatement.java (line 223) Batch of prepared statements for
>>>>>>>> [billing.usage_record_ptd] is of size 13834, exceeding specified threshold
>>>>>>>> of 5120 by 8714.
>>>>>>>>
>>>>>>>> It's like the compaction is only doing one sstable at a time and is
>>>>>>>> doing nothing for a long time in between.
>>>>>>>>
>>>>>>>> cfstats for this keyspace and columnfamily gives the following:
>>>>>>>>                 Table: mycolumnfamily
>>>>>>>>                 SSTable count: 26
>>>>>>>>                 Space used (live), bytes: 319858991
>>>>>>>>                 Space used (total), bytes: 319860267
>>>>>>>>                 SSTable Compression Ratio: 0.24265700071674673
>>>>>>>>                 Number of keys (estimate): 6656
>>>>>>>>                 Memtable cell count: 22710
>>>>>>>>                 Memtable data size, bytes: 3310654
>>>>>>>>                 Memtable switch count: 31
>>>>>>>>                 Local read count: 0
>>>>>>>>                 Local read latency: 0.000 ms
>>>>>>>>                 Local write count: 997667
>>>>>>>>                 Local write latency: 0.000 ms
>>>>>>>>                 Pending tasks: 0
>>>>>>>>                 Bloom filter false positives: 0
>>>>>>>>                 Bloom filter false ratio: 0.00000
>>>>>>>>                 Bloom filter space used, bytes: 12760
>>>>>>>>                 Compacted partition minimum bytes: 1332
>>>>>>>>                 Compacted partition maximum bytes: 43388628
>>>>>>>>                 Compacted partition mean bytes: 234682
>>>>>>>>                 Average live cells per slice (last five minutes): 0.0
>>>>>>>>                 Average tombstones per slice (last five minutes): 0.0
>>>>>>>>
>>>>>>>>
>>>>>>>>> I also see frequently lines like this in system.log:
>>>>>>>>>>
>>>>>>>>>> WARN [Native-Transport-Requests:11935] 2015-11-09 20:10:41,886 BatchStatement.java (line 223) Batch of prepared statements for [billing.usage_record_by_billing_period, billing.metric] is of size 53086, exceeding specified threshold of 5120 by 47966.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> Unrelated.
>>>>>>>>>
>>>>>>>>> =Rob
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Can I upgrade to 2.1.11 without doing a nodetool repair, given that the
>>>>>>>> compaction is stuck?
>>>>>>>> Another thing to mention is that nodetool repair has never been run. Cassandra
>>>>>>>> got installed but nobody bothered to schedule the repair.
>>>>>>>>
>>>>>>>> Thanks for looking into this!
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>

Re: Cassandra compaction stuck? Should I disable?

Posted by Kai Wang <de...@gmail.com>.
Thank you for the investigation.