Posted to user@cassandra.apache.org by Paolo Crosato <pa...@targaubiest.com> on 2014/01/08 17:52:32 UTC

nodetool repair stalled

Hi,

I have two nodes running Cassandra 2.0.3, where repair sessions hang
indefinitely. I run nodetool repair once a week on every node, on
different days. Currently there are about 4 repair sessions running on
each node, one of them for 3 weeks, and none has finished.
Reading the logs I didn't find any exception; apparently one of the
repair sessions got stuck at this step:

  INFO [AntiEntropySessions:10] 2014-01-05 01:00:02,804 RepairJob.java 
(line 116) [repair #5385ea40-759c-11e3-93dc-a1357a0d9222] requesting 
merkle trees for events (to [/10.255.235.19, /10.255.235.18])

Does anybody have a suggestion on why a nodetool repair might get stuck and
how to debug it?

Regards,

Paolo Crosato
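
One way to approach the "how to debug it" question is to scan the Cassandra log for repair sessions that started requesting merkle trees but never logged a completion line. The sketch below is only an illustration: the function name `unfinished_repairs` is hypothetical, and the patterns are taken solely from the log lines quoted in this thread (Cassandra 2.0.3), so other versions may word them differently.

```python
import re

# Matches the repair-session lines quoted in this thread, for example:
#   [repair #5385ea40-...] requesting merkle trees for events (to [...])
#   [repair #02f3f620-...] session completed successfully
REPAIR_LINE = re.compile(
    r"\[repair #(?P<id>[0-9a-f-]{36})\] "
    r"(?P<event>requesting merkle trees|session completed)"
)


def unfinished_repairs(log_text):
    """Return ids of sessions that requested merkle trees but never logged completion.

    "session completed with the following error" also counts as completed here,
    because such a session has ended and is no longer hanging.
    """
    # Collapse whitespace so log lines wrapped by a mail client still match.
    flat = " ".join(log_text.split())
    requested, completed = [], set()
    for m in REPAIR_LINE.finditer(flat):
        if m.group("event") == "session completed":
            completed.add(m.group("id"))
        elif m.group("id") not in requested:
            requested.append(m.group("id"))
    return [rid for rid in requested if rid not in completed]


# Two lines copied from this thread: one session that only requested
# merkle trees, and a different session that completed.
sample = """
 INFO [AntiEntropySessions:10] 2014-01-05 01:00:02,804 RepairJob.java (line 116) [repair #5385ea40-759c-11e3-93dc-a1357a0d9222] requesting merkle trees for events (to [/10.255.235.19, /10.255.235.18])
 INFO [AntiEntropySessions:161] 2014-01-14 03:11:40,751 RepairSession.java (line 274) [repair #02f3f620-7cbe-11e3-a6c2-a1357a0d9222] session completed successfully
"""
print(unfinished_repairs(sample))
```

A session id reported by this check is only a candidate: a repair that is still legitimately running looks the same in the log as one that is stuck.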

Re: nodetool repair stalled

Posted by Paolo Crosato <pa...@targaubiest.com>.

I was able to complete the repair by repairing one keyspace and cf at a time.
However, the last session is still shown as an active process even though
it completed successfully. This is the log:

  INFO [CompactionExecutor:252] 2014-01-14 03:10:13,105 
CompactionTask.java (line 275) Compacted 12 sstables to 
[/data/cassandra/data/system/compactions_in_progress/system-compactions_in_progress-jb-9492,]. 
1,371 bytes to 42 (~3% of original) in 56ms = 0.000715MB/s.  13 total 
partitions merged to 1.  Partition merge counts were {1:1, 2:6, }
  INFO [STREAM-IN-/10.255.235.19] 2014-01-14 03:11:40,750 
StreamResultFuture.java (line 181) [Stream 
#6cf54d20-7cbf-11e3-a6c2-a1357a0d9222] Session with /10.255.235.19 is 
complete
  INFO [STREAM-IN-/10.255.235.19] 2014-01-14 03:11:40,750 
StreamResultFuture.java (line 215) [Stream 
#6cf54d20-7cbf-11e3-a6c2-a1357a0d9222] All sessions completed
  INFO [STREAM-IN-/10.255.235.19] 2014-01-14 03:11:40,751 
StreamingRepairTask.java (line 96) [repair 
#02f3f620-7cbe-11e3-a6c2-a1357a0d9222] streaming task succeed, returning 
response to /10.255.235.18
  INFO [AntiEntropyStage:1] 2014-01-14 03:11:40,751 RepairSession.java 
(line 214) [repair #02f3f620-7cbe-11e3-a6c2-a1357a0d9222] positions is 
fully synced
  INFO [AntiEntropySessions:161] 2014-01-14 03:11:40,751 
RepairSession.java (line 274) [repair 
#02f3f620-7cbe-11e3-a6c2-a1357a0d9222] session completed successfully

This is what ps -eaf |grep java shows:

500      25488 25459  0 Jan13 ?        00:00:43 /usr/bin/java -cp 
/etc/cassandra/conf:/usr/share/java/jna.jar:/usr/share/cassandra/lib/antlr-3.2.jar:/usr/share/cassandra/lib/apache-cassandra-2.0.3.jar:/usr/share/cassandra/lib/apache-cassandra-clientutil-2.0.3.jar:/usr/share/cassandra/lib/apache-cassandra-thrift-2.0.3.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-lang3-3.1.jar:/usr/share/cassandra/lib/compress-lzf-0.8.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.3.jar:/usr/share/cassandra/lib/disruptor-3.0.1.jar:/usr/share/cassandra/lib/guava-15.0.jar:/usr/share/cassandra/lib/high-scale-lib-1.1.2.jar:/usr/share/cassandra/lib/jackson-core-asl-1.9.2.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.9.2.jar:/usr/share/cassandra/lib/jamm-0.2.5.jar:/usr/share/cassandra/lib/jbcrypt-0.3m.jar:/usr/share/cassandra/lib/jline-1.0.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/libthrift-0.9.1.jar:/usr/share/cassandra/lib/log4j-1.2.16.jar:/usr/share/cassandra/lib/lz4-1.2.0.jar:/usr/share/cassandra/lib/metrics-core-2.2.0.jar:/usr/share/cassandra/lib/netty-3.6.6.Final.jar:/usr/share/cassandra/lib/reporter-config-2.1.0.jar:/usr/share/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/share/cassandra/lib/slf4j-api-1.7.2.jar:/usr/share/cassandra/lib/slf4j-log4j12-1.7.2.jar:/usr/share/cassandra/lib/snakeyaml-1.11.jar:/usr/share/cassandra/lib/snappy-java-1.0.5.jar:/usr/share/cassandra/lib/snaptree-0.1.jar:/usr/share/cassandra/lib/stress.jar:/usr/share/cassandra/lib/thrift-server-0.3.2.jar 
-Xmx32m -Dlog4j.configuration=log4j-tools.properties 
-Dstorage-config=/etc/cassandra/conf org.apache.cassandra.tools.NodeCmd 
-p 7199 repair tiergast positions

Is this a known bug?

Regards,

Paolo Crosato
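
The workaround described above (repairing one keyspace and cf at a time) can be scripted. The following is a minimal sketch, assuming `nodetool` is on the PATH and that the JMX port is 7199 as in the `ps` output; the helper names `repair_commands` and `run_repairs` are hypothetical, and the keyspace and cf names are just the ones that appear in the logs in this thread.

```python
import subprocess


def repair_commands(tables_by_keyspace, port=7199):
    """Build one `nodetool repair <keyspace> <cf>` command per column family."""
    return [
        ["nodetool", "-p", str(port), "repair", keyspace, cf]
        for keyspace, cfs in tables_by_keyspace.items()
        for cf in cfs
    ]


def run_repairs(commands, timeout_s=None, runner=subprocess.run):
    """Run the commands one at a time.

    Returns the first command that exits with a non-zero status, or None if
    all of them succeed. A timeout raises subprocess.TimeoutExpired and only
    stops the local nodetool process.
    """
    for cmd in commands:
        result = runner(cmd, timeout=timeout_s)
        if result.returncode != 0:
            return cmd
    return None


# Names taken from the logs in this thread; adjust to the actual schema.
plan = repair_commands(
    {"tiergast": ["positions"], "OpsCenter": ["rollups60", "rollups300"]}
)
for cmd in plan:
    print(" ".join(cmd))
```

Printing the plan first makes it easy to check the order before anything is executed; passing `plan` to `run_repairs` would then start the repairs sequentially.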

On 13/01/2014 10:25, Paolo Crosato wrote:
> [...]

Re: nodetool repair stalled

Posted by Paolo Crosato <pa...@targaubiest.com>.

Hi,

I rebooted the nodes and started a fresh repair session. The repair 
session was started on node 1.

This time I actually got this error on the node that started the repair:

ERROR [AntiEntropySessions:2] 2014-01-10 09:44:46,360 RepairSession.java 
(line 278) [repair #728f4860-79d3-11e3-8c98-a1357a0d9222] session 
completed with the following error
org.apache.cassandra.exceptions.RepairException: [repair 
#728f4860-79d3-11e3-8c98-a1357a0d9222 on OpsCenter/rollups300, 
(4515884230644880127,4556138740897423021]] Sync failed between 
/10.255.235.18 and /10.255.235.19
     at 
org.apache.cassandra.repair.RepairSession.syncComplete(RepairSession.java:200)
     at 
org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:193)
     at 
org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:59)
     at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:60)
     at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:744)
ERROR [AntiEntropySessions:2] 2014-01-10 09:44:46,399 
CassandraDaemon.java (line 187) Exception in thread 
Thread[AntiEntropySessions:2,5,RMI Runtime]
java.lang.RuntimeException: 
org.apache.cassandra.exceptions.RepairException: [repair 
#728f4860-79d3-11e3-8c98-a1357a0d9222 on OpsCenter/rollups300, 
(4515884230644880127,4556138740897423021]] Sync failed between 
/10.255.235.18 and /10.255.235.19
     at com.google.common.base.Throwables.propagate(Throwables.java:160)
     at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
     at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
     at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.cassandra.exceptions.RepairException: [repair 
#728f4860-79d3-11e3-8c98-a1357a0d9222 on OpsCenter/rollups300, 
(4515884230644880127,4556138740897423021]] Sync failed between 
/10.255.235.18 and /10.255.235.19
     at 
org.apache.cassandra.repair.RepairSession.syncComplete(RepairSession.java:200)
     at 
org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:193)
     at 
org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:59)
     at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:60)
     ... 3 more

On the other node I left some blank lines between these timestamps:

INFO [ValidationExecutor:3] 2014-01-10 09:42:41,320 SSTableReader.java 
(line 223) Opening 
/data/cassandra/data/OpsCenter/rollups60/snapshots/29e4d5d0-79d3-11e3-8c98-a1357a0d9222/OpsCenter-rollups60-jb-11522 
(88 bytes)

  INFO [ValidationExecutor:14] 2014-01-10 10:37:48,509 
SSTableReader.java (line 223) Opening 
/data/cassandra/data/OpsCenter/rollups60/snapshots/d5176b00-79da-11e3-8c98-a1357a0d9222/OpsCenter-rollups60-jb-16275 
(493003 b

In between, I have many log files full of "Opening ..." entries.

I've noticed that the repair sessions always seem to hang on the OpsCenter
keyspace. Would uninstalling and reinstalling OpsCenter help resolve the issue?

Anyway, I attached the logs for the nodes involved; I'm sorry if there
is a lot of noise.

Thanks for any input.

Regards,

Paolo Crosato
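
To check the impression that repairs always fail on the same keyspace, the "Sync failed" exceptions can be tallied per keyspace and cf. This is a rough sketch, assuming the message format stays exactly as in the RepairException quoted above; the names `sync_failures` and `failures_by_table` are hypothetical.

```python
import re
from collections import Counter

# Matches the RepairException message quoted in this thread:
#   [repair #<id> on <keyspace>/<cf>, (<start>,<end>]] Sync failed between /<ip> and /<ip>
SYNC_FAILED = re.compile(
    r"\[repair #(?P<id>[0-9a-f-]{36}) on (?P<keyspace>\w+)/(?P<cf>\w+), "
    r"\((?P<start>-?\d+),(?P<end>-?\d+)\]\] "
    r"Sync failed between /(?P<a>[\d.]+) and /(?P<b>[\d.]+)"
)


def sync_failures(log_text):
    """Return one dict per "Sync failed" message found in the log text."""
    # Collapse whitespace so messages wrapped over several lines still match.
    flat = " ".join(log_text.split())
    return [m.groupdict() for m in SYNC_FAILED.finditer(flat)]


def failures_by_table(log_text):
    """Count distinct failed (session, token range) pairs per keyspace/cf."""
    # The same exception is logged more than once (top level and "Caused by"),
    # so identical session/table/range combinations are counted only once.
    seen = {
        (f["id"], f["keyspace"], f["cf"], f["start"], f["end"])
        for f in sync_failures(log_text)
    }
    return Counter(f"{ks}/{cf}" for _, ks, cf, _, _ in seen)


# The exception from this thread, as it appears twice in the stack trace.
sample = """
org.apache.cassandra.exceptions.RepairException: [repair
#728f4860-79d3-11e3-8c98-a1357a0d9222 on OpsCenter/rollups300,
(4515884230644880127,4556138740897423021]] Sync failed between
/10.255.235.18 and /10.255.235.19
Caused by: org.apache.cassandra.exceptions.RepairException: [repair
#728f4860-79d3-11e3-8c98-a1357a0d9222 on OpsCenter/rollups300,
(4515884230644880127,4556138740897423021]] Sync failed between
/10.255.235.18 and /10.255.235.19
"""
print(failures_by_table(sample))
```

Run over the full logs of both nodes, a count dominated by `OpsCenter/...` entries would support the suspicion about that keyspace; a spread across keyspaces would point elsewhere.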

On 09/01/2014 03:54, sankalp kohli wrote:
> [...]


-- 
Paolo Crosato
Software engineer/Custom Solutions
e-mail: paolo.crosato@targaubiest.com
Office phone: +3904221722825


Re: nodetool repair stalled

Posted by sankalp kohli <ko...@gmail.com>.

Hi,
    Can you attach the logs from around the repair? Please do that for the
node which triggered it and for the nodes involved in the repair. I will try
to find something useful.

Thanks,
Sankalp


On Wed, Jan 8, 2014 at 10:18 AM, Robert Coli <rc...@eventbrite.com> wrote:

> [...]

Re: nodetool repair stalled

Posted by Robert Coli <rc...@eventbrite.com>.

On Wed, Jan 8, 2014 at 8:52 AM, Paolo Crosato <paolo.crosato@targaubiest.com> wrote:

> I have two nodes running Cassandra 2.0.3, where repair sessions hang
> indefinitely. I run nodetool repair once a week on every node, on
> different days. Currently there are about 4 repair sessions running on each
> node, one of them for 3 weeks, and none has finished.
> Reading the logs I didn't find any exception; apparently one of the repair
> sessions got stuck at this step:
>
> Does anybody have a suggestion on why a nodetool repair might get stuck and how
> to debug it?
>

Cassandra repair has never quite worked right. It got a wholesale rewrite
in 2.0.x and "should" be more robust and at the very least log more than
before. But unfortunately I have heard a few reports like yours, so it is
probably not completely fixed.

That said, the only option you have for failed repairs seems to be to
restart the affected nodes. Your input as an operator of 2.0.x who would
appreciate an alternative is welcome at:

https://issues.apache.org/jira/browse/CASSANDRA-3486

=Rob