Posted to user@cassandra.apache.org by James Brown <jb...@easypost.com> on 2021/11/26 22:40:13 UTC

Incremental repairs getting stuck a lot

We're on 4.0.1 and switched to incremental repairs a couple of months ago.
They work fine about 95% of the time, but once in a while a session will
get stuck and will have to be cancelled (with `nodetool repair_admin cancel
-s <uuid>`). Typically the session will be in REPAIRING but nothing will
actually be happening.

Output of nodetool repair_admin:

$ nodetool repair_admin
id                                   | state     | last activity | coordinator                          | participants | participants_wp
3a059b10-4ef6-11ec-925f-8f7bcf0ba035 | REPAIRING | 6771 (s)      | /[fd00:ea51:d057:200:1:0:0:8e]:25472 | fd00:ea51:d057:200:1:0:0:8e,fd00:ea51:d057:200:1:0:0:8f,fd00:ea51:d057:200:1:0:0:92,fd00:ea51:d057:100:1:0:0:571,fd00:ea51:d057:100:1:0:0:570,fd00:ea51:d057:200:1:0:0:93,fd00:ea51:d057:100:1:0:0:573,fd00:ea51:d057:200:1:0:0:90,fd00:ea51:d057:200:1:0:0:91,fd00:ea51:d057:100:1:0:0:572,fd00:ea51:d057:100:1:0:0:575,fd00:ea51:d057:100:1:0:0:574,fd00:ea51:d057:200:1:0:0:94,fd00:ea51:d057:100:1:0:0:577,fd00:ea51:d057:200:1:0:0:95,fd00:ea51:d057:100:1:0:0:576 | [fd00:ea51:d057:200:1:0:0:8e]:25472,[fd00:ea51:d057:200:1:0:0:8f]:25472,[fd00:ea51:d057:200:1:0:0:92]:25472,[fd00:ea51:d057:100:1:0:0:571]:25472,[fd00:ea51:d057:100:1:0:0:570]:25472,[fd00:ea51:d057:200:1:0:0:93]:25472,[fd00:ea51:d057:100:1:0:0:573]:25472,[fd00:ea51:d057:200:1:0:0:90]:25472,[fd00:ea51:d057:200:1:0:0:91]:25472,[fd00:ea51:d057:100:1:0:0:572]:25472,[fd00:ea51:d057:100:1:0:0:575]:25472,[fd00:ea51:d057:100:1:0:0:574]:25472,[fd00:ea51:d057:200:1:0:0:94]:25472,[fd00:ea51:d057:100:1:0:0:577]:25472,[fd00:ea51:d057:200:1:0:0:95]:25472,[fd00:ea51:d057:100:1:0:0:576]:25472
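For anyone scripting around this, here is a minimal sketch of how one might flag these stuck sessions by parsing the pipe-separated rows above. The column order and the "6771 (s)" last-activity format are assumptions taken from this output, and the one-hour threshold is an arbitrary choice:

```python
def parse_repair_admin(row: str):
    """Split one pipe-separated data row of `nodetool repair_admin` output.

    Column order and the "6771 (s)" last-activity format are assumptions
    taken from the output above.
    """
    fields = [f.strip() for f in row.split("|")]
    session_id, state, last_activity = fields[0], fields[1], fields[2]
    idle_seconds = int(last_activity.split()[0])
    return session_id, state, idle_seconds


def is_stuck(state: str, idle_seconds: int, threshold_s: int = 3600) -> bool:
    # Heuristic only: a session still in REPAIRING after an hour of no
    # activity matches the hangs described here; the threshold is arbitrary.
    return state == "REPAIRING" and idle_seconds >= threshold_s
```

A session like the one above (REPAIRING, idle 6771 s) would be flagged by `is_stuck`; sessions in terminal states, or recently active ones, would not.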

Running `jstack` on the coordinator shows two repair threads, both idle:

"Repair#167:1" #602177 daemon prio=5 os_prio=0 cpu=9.60ms elapsed=57359.81s tid=0x00007fa6d1741800 nid=0x18e6c waiting on condition  [0x00007fc529f9a000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
- parking to wait for  <0x000000045ba93a18> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:2123)
at java.util.concurrent.LinkedBlockingQueue.poll(java.base@11.0.11/LinkedBlockingQueue.java:458)
at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.11/ThreadPoolExecutor.java:1053)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1114)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)

"Repair#170:1" #654814 daemon prio=5 os_prio=0 cpu=9.62ms elapsed=7369.98s tid=0x00007fa6aec09000 nid=0x1a96f waiting on condition  [0x00007fc535aae000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
- parking to wait for  <0x00000004c45bf7d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:2123)
at java.util.concurrent.LinkedBlockingQueue.poll(java.base@11.0.11/LinkedBlockingQueue.java:458)
at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.11/ThreadPoolExecutor.java:1053)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1114)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)

nodetool netstats says there is nothing happening:

$ nodetool netstats | head -n 2
Mode: NORMAL
Not sending any streams.

There's nothing interesting in the logs for this repair; the last relevant thing was a burst of "Created 0 sync tasks based on 6 merkle tree responses for 3a059b10-4ef6-11ec-925f-8f7bcf0ba035 (took: 0ms)" messages, followed by a couple of hours of back-and-forth like:

2021-11-26T21:33:20Z cassandra10nuq 129529 | INFO  [OptionalTasks:1] LocalSessions.java:938 - Attempting to learn the outcome of unfinished local incremental repair session 3a059b10-4ef6-11ec-925f-8f7bcf0ba035
2021-11-26T21:33:20Z cassandra10nuq 129529 | INFO  [AntiEntropyStage:1] LocalSessions.java:987 - Received StatusResponse for repair session 3a059b10-4ef6-11ec-925f-8f7bcf0ba035 with state REPAIRING, which is not actionable. Doing nothing.

Typically, cancelling the session and rerunning with the exact same command
line will succeed.
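Since cancel-and-rerun reliably clears it, a small sketch of automating the cancel step (the nodetool invocation is the same one quoted above; the session id is the one from this thread, and whether you want this running unattended is a judgment call):

```python
import subprocess


def cancel_cmd(session_id: str) -> list:
    # Same invocation as `nodetool repair_admin cancel -s <uuid>` above.
    return ["nodetool", "repair_admin", "cancel", "-s", session_id]


def cancel_session(session_id: str) -> None:
    # Cancel the stuck session; rerunning the original repair command
    # afterwards typically succeeds, per the observation above.
    subprocess.run(cancel_cmd(session_id), check=True)


if __name__ == "__main__":
    cancel_session("3a059b10-4ef6-11ec-925f-8f7bcf0ba035")
```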
-- 
James Brown
Engineer

Re: Incremental repairs getting stuck a lot

Posted by James Brown <jb...@easypost.com>.
I filed this as CASSANDRA-17172
<https://issues.apache.org/jira/browse/CASSANDRA-17172>

On Fri, Nov 26, 2021 at 5:33 PM Dinesh Joshi <dj...@apache.org> wrote:

> Could you file a jira with the details?
>
> Dinesh

-- 
James Brown
Engineer

Re: Incremental repairs getting stuck a lot

Posted by Dinesh Joshi <dj...@apache.org>.
Could you file a jira with the details?

Dinesh
