Posted to user@cassandra.apache.org by Igor Leão <ig...@ubee.in> on 2017/08/21 14:49:06 UTC

Cassandra 3.11 is compacting forever

Hi there,

I've been trying to upgrade a Cassandra 3.9 cluster to Cassandra 3.11.
Whenever I add a new Cassandra 3.11 node to the main datacenter, using
`-Dcassandra.force_3_0_protocol_version=true` on the new node, the new
node uses almost 100% of its CPU. Checking `nodetool compactionstats -H`,
I can see there are always pending compaction tasks.

t0:
```
[aladdin@ip-172-16-1-10 ~]$ nodetool compactionstats -H
pending tasks: 11
- x.morning: 9
- y.afternoon: 2

id                                   compaction type keyspace table     completed  total     unit  progress
b0f174d0-8671-10e8-8429-114a220c107d Compaction      x        morning   170.15 MiB 23.64 GiB bytes 0.70%
38b02640-8439-10e8-8429-114a220c107d Compaction      y        afternoon 2.55 GiB   40.17 GiB bytes 6.35%
Active compaction remaining time :   1h05m09s
```

t1:
```
[aladdin@ip-172-16-1-10 ~]$ nodetool compactionstats -H
pending tasks: 11
- x.morning: 9
- y.afternoon: 2

id                                   compaction type keyspace table     completed  total     unit  progress
b0f174d0-8671-10e8-8429-114a220c107d Compaction      x        morning   404.3 MiB  23.64 GiB bytes 30.20%
38b02640-8439-10e8-8429-114a220c107d Compaction      y        afternoon 2.55 GiB   40.17 GiB bytes 6.35%
Active compaction remaining time :   1h04m55s
```

t2:
```
[aladdin@ip-172-16-1-10 ~]$ nodetool compactionstats -H

pending tasks: 8
- system.size_estimates: 1
- x.morning: 4
- y.afternoon: 3

id                                   compaction type keyspace table     completed  total     unit  progress
38b02640-8439-10e8-8429-114a220c107d Compaction      y        afternoon 2.55 GiB   40.17 GiB bytes 6.35%
06f3a660-867b-10e8-8429-114a220c107d Compaction      x        morning   538.04 MiB 11.15 GiB bytes 4.71%
```


`38b02640-8439-10e8-8429-114a220c107d` seems to have been stuck at 6.35% for
the last 4 days, and x.morning keeps compacting forever, always with a
different id.
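One rough way to check whether a task is actually stuck is to diff the (id, progress) pairs from two `nodetool compactionstats` samples taken some time apart. A minimal sketch, using the t0/t2 rows above as canned input (on a live node you would capture the real output instead; the multi-word sizes are joined into single fields here to keep the columns simple):

```
#!/bin/sh
# Two snapshots of the compactionstats data rows (pasted from t0/t2 above).
cat > t0.txt <<'EOF'
b0f174d0-8671-10e8-8429-114a220c107d Compaction x morning 170.15MiB 23.64GiB bytes 0.70%
38b02640-8439-10e8-8429-114a220c107d Compaction y afternoon 2.55GiB 40.17GiB bytes 6.35%
EOF
cat > t2.txt <<'EOF'
38b02640-8439-10e8-8429-114a220c107d Compaction y afternoon 2.55GiB 40.17GiB bytes 6.35%
06f3a660-867b-10e8-8429-114a220c107d Compaction x morning 538.04MiB 11.15GiB bytes 4.71%
EOF
# Reduce each snapshot to (id, progress) pairs; a pair that appears
# unchanged in both samples is a candidate for being stuck.
awk '{print $1, $NF}' t0.txt | sort > t0.pairs
awk '{print $1, $NF}' t2.txt | sort > t2.pairs
comm -12 t0.pairs t2.pairs   # lines common to both samples
```

On this data the only common pair is the `38b02640-...` task at 6.35%, matching what I see on the node.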

Nodes are configured to use `GossipingPropertyFileSnitch` and run on i3.large
(AWS) instances, and this problem seems to happen only in the main datacenter.
The secondary datacenter has one node running Cassandra 3.9 and one node
running Cassandra 3.11 without any further issues.

Have you guys faced this problem before?

Thanks in advance!
Igor

Re: Cassandra 3.11 is compacting forever

Posted by Ioannis Zafiropoulos <jo...@gmail.com>.
Maybe a long shot, but have you deleted the cassandra-topology.properties
file, since you are using GossipingPropertyFileSnitch?
I am sure I have seen a ticket about problems caused in some cases when that
file stays around.
I removed it from all the nodes and the non-stop compaction stopped (after a
proper full restart, not a rolling one).
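For reference, on a package install that would look roughly like this on each node (the config path and service name are assumptions; adjust for your layout, and stop/start the whole DC rather than rolling):

```
# remove the stale legacy snitch file on every node
sudo rm /etc/cassandra/cassandra-topology.properties
# stop all the nodes first, then start them back up (full, not rolling)
sudo systemctl stop cassandra
sudo systemctl start cassandra
```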


On Fri, Sep 8, 2017 at 4:24 PM, Romain Hardouin <romainh_ml@yahoo.fr.invalid
> wrote:

> Hi,
>
> It might be useful to enable compaction logging with log_all subproperties.
>
> Best,
>
> Romain
>
> Le vendredi 8 septembre 2017 à 00:15:19 UTC+2, kurt greaves <
> kurt@instaclustr.com> a écrit :
>
>
> Might be worth turning on debug logging for that node and when the
> compaction kicks off and CPU skyrockets send through the logs.​
>

Re: Cassandra 3.11 is compacting forever

Posted by Romain Hardouin <ro...@yahoo.fr.INVALID>.
Hi,
It might be useful to enable compaction logging with log_all subproperties.
Best,
Romain 
    Le vendredi 8 septembre 2017 à 00:15:19 UTC+2, kurt greaves <ku...@instaclustr.com> a écrit :  
 
 Might be worth turning on debug logging for that node and when the compaction kicks off and CPU skyrockets send through the logs.​
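Romain's `log_all` suggestion can be applied per table from cqlsh, since it is a compaction subproperty (available since Cassandra 3.6). A hedged example, with the table name and compaction strategy standing in for your own schema (note the class must be restated when altering compaction options):

```
ALTER TABLE x.morning
  WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'log_all': 'true'};
```

Compaction decisions should then show up in a dedicated compaction log file under the Cassandra log directory.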

Re: Cassandra 3.11 is compacting forever

Posted by kurt greaves <ku...@instaclustr.com>.
Might be worth turning on debug logging for that node and when the
compaction kicks off and CPU skyrockets send through the logs.​
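Debug logging for just the compaction path can be toggled at runtime without a restart, e.g. (logger name is an assumption based on the package layout):

```
nodetool setlogginglevel org.apache.cassandra.db.compaction DEBUG
# back to normal once the logs are captured:
nodetool setlogginglevel org.apache.cassandra.db.compaction INFO
```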

Re: Cassandra 3.11 is compacting forever

Posted by Igor Leão <ig...@ubee.in>.
Last week I moved all nodes back to Cassandra 3.9, and everything worked fine
since then.
Yesterday I tried to upgrade again, running a rolling restart after the
upgrade. The nodes were just fine. Today one node started consuming 94.6% of
its CPU, and compaction is running all the time on this node.
I'm afraid the remaining nodes will see their CPU usage climb over the next
couple of days, as happened last week.

2017-09-03 22:28 GMT-03:00 kurt greaves <ku...@instaclustr.com>:

> Can't say that message explains why the compaction would be stuck.
> Generally not a good sign and you might need to investigate more but
> hopefully shouldn't be related. Has that stuck compaction moved since last
> week?
>
>
> On 1 September 2017 at 22:54, Fay Hou [Storage Service] ­ <
> fayhou@coupang.com> wrote:
>
>> try to do a rolling restart for the cluster before doing a compaction
>>
>


-- 
Igor Leão  Site Reliability Engineer

Mobile: +55 81 99727-1083
Skype: igorvpcleao
Office: +55 81 4042-9757
Website: inlocomedia.com

Re: Cassandra 3.11 is compacting forever

Posted by kurt greaves <ku...@instaclustr.com>.
Can't say that message explains why the compaction would be stuck.
Generally not a good sign and you might need to investigate more but
hopefully shouldn't be related. Has that stuck compaction moved since last
week?

On 1 September 2017 at 22:54, Fay Hou [Storage Service] ­ <
fayhou@coupang.com> wrote:

> try to do a rolling restart for the cluster before doing a compaction
>

Re: Cassandra 3.11 is compacting forever

Posted by Fay Hou [Storage Service] <fa...@coupang.com>.
try to do a rolling restart for the cluster before doing a compaction
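A typical per-node sequence for such a rolling restart (service name is an assumption; wait for each node to come back UN before moving to the next):

```
nodetool drain                    # flush memtables and stop accepting writes
sudo systemctl restart cassandra  # service name assumed
nodetool status                   # wait until the node shows UN again
```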


Re: Cassandra 3.11 is compacting forever

Posted by Igor Leão <ig...@ubee.in>.
Some generic errors:

[aladdin@ip-172-16-1-10 cassandra]$ tail cassandra.log | grep -i error
[aladdin@ip-172-16-1-10 cassandra]$ tail cassandra.log | grep -i excep
[aladdin@ip-172-16-1-10 cassandra]$ tail cassandra.log | grep -i fail
[aladdin@ip-172-16-1-10 cassandra]$ tail debug.log | grep -i error
[aladdin@ip-172-16-1-10 cassandra]$ tail debug.log | grep -i exce
[aladdin@ip-172-16-1-10 cassandra]$ tail debug.log | grep -i fail
DEBUG [GossipStage:1] 2017-09-01 15:33:27,046 FailureDetector.java:457 - Ignoring interval time of 2108299431 for /172.16.1.112
DEBUG [GossipStage:1] 2017-09-01 15:33:29,051 FailureDetector.java:457 - Ignoring interval time of 2005507384 for /172.16.1.74
DEBUG [GossipStage:1] 2017-09-01 15:33:45,968 FailureDetector.java:457 - Ignoring interval time of 2003371497 for /172.16.1.74
DEBUG [GossipStage:1] 2017-09-01 15:33:51,133 FailureDetector.java:457 - Ignoring interval time of 2013260173 for /172.16.1.74
DEBUG [GossipStage:1] 2017-09-01 15:33:58,981 FailureDetector.java:457 - Ignoring interval time of 2009620081 for /172.16.1.112
DEBUG [GossipStage:1] 2017-09-01 15:34:19,235 FailureDetector.java:457 - Ignoring interval time of 2010956256 for /172.16.1.74
DEBUG [GossipStage:1] 2017-09-01 15:34:19,235 FailureDetector.java:457 - Ignoring interval time of 2011127930 for /10.0.1.122
[aladdin@ip-172-16-1-10 cassandra]$ tail system.log | grep -i error
io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer
[aladdin@ip-172-16-1-10 cassandra]$ tail system.log | grep -i exce
INFO  [Native-Transport-Requests-5] 2017-09-01 15:22:58,806 Message.java:619 - Unexpected exception during request; channel = [id: 0xdd63db2f, L:/10.0.1.47:9042 ! R:/10.0.44.196:41422]
io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer
[aladdin@ip-172-16-1-10 cassandra]$ tail system.log | grep -i fail
io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer


Some interesting errors:

1.
DEBUG [ReadRepairStage:1] 2017-09-01 15:34:58,485 ReadCallback.java:242 - Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(5988282114260523734, 32623331326162652d633533332d343237632d626334322d306466643762653836343830) (023d99bbcf2263f0fa450c2312fdce88 vs a60ba37a46e0a61227a8b560fa4e0dfb)
	at org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) ~[apache-cassandra-3.11.0.jar:3.11.0]
	at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233) ~[apache-cassandra-3.11.0.jar:3.11.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_112]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_112]
	at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) [apache-cassandra-3.11.0.jar:3.11.0]
	at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_112]

2.
INFO  [Native-Transport-Requests-5] 2017-09-01 15:22:58,806 Message.java:619 - Unexpected exception during request; channel = [id: 0xdd63db2f, L:/10.0.1.47:9042 ! R:/10.0.44.196:41422]
io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer
	at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown Source) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
INFO  [Native-Transport-Requests-11] 2017-09-01 15:31:42,722 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB

INFO  [CompactionExecutor:470] 2017-09-01 10:16:42,026 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
INFO  [CompactionExecutor:475] 2017-09-01 10:31:42,032 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
INFO  [CompactionExecutor:478] 2017-09-01 10:46:42,108 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
INFO  [CompactionExecutor:482] 2017-09-01 11:01:42,131 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB

About this last error, I tried to increase `file_cache_size_in_mb` on this
node to 2048, but the error only changed to:
INFO  [ReadStage-2] 2017-09-01 16:18:38,657 NoSpamLogger.java:91 - Maximum memory usage reached (2.000GiB), cannot allocate chunk of 1.000MiB
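For reference, that setting lives in cassandra.yaml and controls the off-heap chunk cache, which is why raising it may just move the ceiling rather than remove the message (the 2048 value is the one tried above):

```
# cassandra.yaml
file_cache_size_in_mb: 2048
```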

2017-09-01 9:07 GMT-03:00 kurt greaves <ku...@instaclustr.com>:

> are you seeing any errors in the logs? Is that one compaction still
> getting stuck?
>

Re: Cassandra 3.11 is compacting forever

Posted by kurt greaves <ku...@instaclustr.com>.
Are you seeing any errors in the logs? Is that one compaction still getting
stuck?

Re: Cassandra 3.11 is compacting forever

Posted by Igor Leão <ig...@ubee.in>.
Hey Kurt,

Thanks for your reply.
As soon as the whole cluster was upgraded (using existing nodes) it worked
pretty well. After a while, the high CPU usage and pending compactions came
back, affecting the whole cluster.
It's still an open problem.

2017-08-21 20:24 GMT-03:00 kurt greaves <ku...@instaclustr.com>:

> Why are you adding new nodes? If you're upgrading you should upgrade the
> existing nodes first and then add nodes. ​
>

Re: Cassandra 3.11 is compacting forever

Posted by kurt greaves <ku...@instaclustr.com>.
Why are you adding new nodes? If you're upgrading you should upgrade the
existing nodes first and then add nodes. ​
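In rough terms, the per-node order kurt describes would look like this on each existing node, one at a time (the package and service names are assumptions; use your own package manager or tarball layout):

```
nodetool drain                      # flush memtables and quiesce the node
sudo systemctl stop cassandra
sudo yum install -y cassandra-3.11  # package name assumed
sudo systemctl start cassandra
nodetool upgradesstables            # rewrite SSTables in the new on-disk format
```

Only once every existing node is on 3.11 would new nodes be added.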