Posted to commits@cassandra.apache.org by "Swen Fuhrmann (Jira)" <ji...@apache.org> on 2020/06/25 13:56:00 UTC
[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Swen Fuhrmann updated CASSANDRA-15902:
--------------------------------------
Description:
In our cluster, some nodes slowly run out of memory after a while. On those nodes we observed that Cassandra Reaper cancels repairs with a JMX call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} because they reach the 30-minute timeout.
In the heap dump we see more than 100 instances of {{io.netty.util.concurrent.FastThreadLocalThread}}. In the thread dump we see a lot of repair threads:
{noformat}
grep "Repair#" threaddump.txt | wc -l
50 {noformat}
The repair jobs are waiting for the validation to finish:
{noformat}
"Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x0000000012fc5000 nid=0x542a waiting on condition [0x00007f81ee414000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000007939bcfc8> (a com.google.common.util.concurrent.AbstractFuture$Sync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
    at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
    at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
    at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
    at org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown Source)
    at java.lang.Thread.run(Thread.java:748) {noformat}
That's the line where the threads are stuck:
{noformat}
// Wait for validation to complete
Futures.getUnchecked(validations); {noformat}
The call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops the thread pool executor. It looks like futures that are still in progress are therefore never completed, so the repair thread waits forever and never finishes.
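To illustrate the suspected mechanism outside of Cassandra, here is a minimal sketch. It rests on assumptions: that terminating the sessions amounts to a {{shutdownNow()}} on the executor, and that the wait behaves like Guava's {{Uninterruptibles.getUninterruptibly}} seen in the stack trace. The class {{StuckRepairThreadSketch}}, the method {{workerExitsAfterShutdown}} and the local {{getUninterruptibly}} helper are hypothetical names used only for this example; none of this is Cassandra code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class StuckRepairThreadSketch {

    // Simplified stand-in for Guava's Uninterruptibles.getUninterruptibly:
    // an interrupt is remembered, but the wait is retried until the future completes.
    static <V> V getUninterruptibly(Future<V> future) throws ExecutionException {
        boolean interrupted = false;
        try {
            while (true) {
                try {
                    return future.get();
                } catch (InterruptedException e) {
                    interrupted = true;
                }
            }
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt(); // restore the interrupt flag
            }
        }
    }

    // Starts one worker that waits on a future, stops the executor, and reports
    // whether the worker thread exited within half a second.
    static boolean workerExitsAfterShutdown(boolean failPendingFuture) {
        // Assumption: plays the role of the "validations" future; nothing completes it on its own.
        CompletableFuture<Void> validations = new CompletableFuture<>();
        CountDownLatch started = new CountDownLatch(1);
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "Repair#sketch:1");
            t.setDaemon(true); // let the JVM exit even if the worker stays parked
            return t;
        });
        executor.execute(() -> {
            started.countDown();
            try {
                getUninterruptibly(validations);
            } catch (ExecutionException e) {
                // The future failed: the worker is released and can finish.
            }
        });
        try {
            started.await();
            executor.shutdownNow(); // interrupts the worker, which simply retries the wait
            if (failPendingFuture) {
                validations.completeExceptionally(new RuntimeException("repair terminated"));
            }
            return executor.awaitTermination(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("exits when future is left pending: " + workerExitsAfterShutdown(false));
        System.out.println("exits when future is failed:       " + workerExitsAfterShutdown(true));
    }
}
```

In this sketch the first call should report false, because the worker stays parked after {{shutdownNow()}}, and the second should report true, because failing the pending future releases the waiting thread. Whether failing or cancelling the pending validation futures is the right fix for Cassandra itself is not something this example can show.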
Environment:
Cassandra version: 3.11.4
Cassandra Reaper: 1.4.0
Java Runtime:
{noformat}
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) {noformat}
The same issue is described here: https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973
As suggested in the comments, I created this new, specific ticket.
> OOM because repair session thread not closed when terminating repair
> --------------------------------------------------------------------
>
> Key: CASSANDRA-15902
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15902
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Repair
> Reporter: Swen Fuhrmann
> Assignee: Swen Fuhrmann
> Priority: Normal