Posted to user@cassandra.apache.org by "gloCalHelp.com" <ww...@sina.com> on 2019/12/29 08:54:20 UTC

Re: repair failed

To Oliver: Maybe the repair should be executed only after all data in the memtables has been flushed to disk?
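
If flushing first is the concern, something like the following could be run on each node just before the repair (a minimal sketch; nodetool flush writes the local node's memtables out to SSTables on disk):

# flush all memtables on the local node to SSTables on disk
nodetool flush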





Sincerely yours,

Georgelin

www_8ems_com@sina.com

mobile:0086 180 5986 1565




----- Original Message -----
From: Oliver Herrmann <o....@gmail.com>
To: user@cassandra.apache.org
Subject: repair failed
Date: 2019-12-28 23:15

Hello,

Today, for the second time, our weekly repair job failed after working for many months without a problem. We have multiple Cassandra nodes in two data centers.

The repair command is started on only one node, with the following parameters:

nodetool repair -full -dcpar 

Is it problematic if the repair is started only on one node? 
The repair fails after one hour with the following error message:

 failed with error Could not create snapshot at /192.168.13.232 (progress: 0%)
[2019-12-28 05:00:04,295] Some repair failed
[2019-12-28 05:00:04,296] Repair command #1 finished in 1 hour 0 minutes 2 seconds
error: Repair job has failed with the error message: [2019-12-28 05:00:04,295] Some repair failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2019-12-28 05:00:04,295] Some repair failed
        at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
        at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(Unknown Source)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(Unknown Source)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(Unknown Source)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(Unknown Source)

In the logfile on 192.168.13.232, which is in the second data center, I could find only the following messages in debug.log:

DEBUG [COMMIT-LOG-ALLOCATOR] 2019-12-28 04:21:20,143 AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating a fresh one
DEBUG [MessagingService-Outgoing-192.168.13.120-Small] 2019-12-28 04:31:00,450 OutboundTcpConnection.java:410 - Socket to 192.168.13.120 closed
DEBUG [MessagingService-Outgoing-192.168.13.120-Small] 2019-12-28 04:31:00,450 OutboundTcpConnection.java:349 - Error writing to 192.168.13.120
java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_111]

We tried to run the repair a few more times, but it always failed with the same error. After restarting all nodes it finally succeeded.

Any idea what could be wrong?
Regards,
Oliver

Re: repair failed

Posted by Ben Mills <be...@bitbrew.com>.
Hi Oliver,

I don't have a quick answer (or any answer yet), though we ran into a
similar issue and I'm wondering about your environment and some configs.

- Operating system?
- Cloud or on-premise?
- Version of Cassandra?
- Version of Java?
- Compaction strategy?
- Primarily read or primarily write (or a blend of both)?
- How much memory allocated to heap?
- How long do all the repair commands typically take per node?

nodetool repair -full -dcpar will stream data across data centers - is it
possible that the number of nodes, or the amount of data, or the number of
keyspaces has grown enough over time to cause streaming issues (and
timeouts)?
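
If it would help to check, one rough way to watch for this (a sketch, assuming shell access to the nodes) is to compare per-node data load and observe streaming activity while a repair is running:

nodetool status     # Load column shows per-node data size
nodetool netstats   # shows active streaming sessions during a repair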

You wrote:

Is it problematic if the repair is started only on one node?

Are you asking whether it's ok to run -full repairs one node at a time (on
all nodes)? Or are you saying that you are only repairing one node in each
cluster or DC?
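
For comparison, a common alternative is a full primary-range repair run on every node in turn, instead of a single -dcpar run from one node (a sketch; the hostnames are placeholders and SSH access is assumed):

# -pr repairs only each node's primary token ranges,
# so every node in the cluster must be covered
for host in node1 node2 node3; do
    ssh "$host" nodetool repair -full -pr
done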

Thanks,
Ben



