Posted to user@cassandra.apache.org by "gloCalHelp.com" <ww...@sina.com> on 2019/12/29 08:54:20 UTC
Re: repair failed
TO Oliver: Maybe repair should be executed only after all data in the memtables has been flushed to disk?
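Whether flushing actually avoids the snapshot error is an open question, but the suggested ordering can be sketched as a small wrapper (hypothetical; `DRY_RUN` and `run` are illustrative names, and the real commands assume `nodetool` is on the PATH of the node being repaired):

```shell
# Sketch of the suggestion above: flush memtables to SSTables on disk
# before starting the repair. DRY_RUN=1 only prints the commands; unset
# it to actually execute them against the local node.
DRY_RUN=1
run() {
  if [ -n "$DRY_RUN" ]; then
    echo "$@"          # dry run: show the command instead of executing it
  else
    "$@"
  fi
}
run nodetool flush                 # persist all memtables to SSTables
run nodetool repair -full -dcpar   # then start the full cross-DC repair
```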
Sincerely yours,
Georgelin
www_8ems_com@sina.com
mobile:0086 180 5986 1565
----- Original Message -----
From: Oliver Herrmann <o....@gmail.com>
To: user@cassandra.apache.org
Subject: repair failed
Date: 2019-12-28 23:15
Hello,
Today, for the second time, our weekly repair job failed after working for many months without a problem. We have multiple Cassandra nodes in two data centers.
The repair command is started only on one node with the following parameters:
nodetool repair -full -dcpar
Is it problematic if the repair is started only on one node?
The repair fails after one hour with the following error message:
failed with error Could not create snapshot at /192.168.13.232 (progress: 0%)
[2019-12-28 05:00:04,295] Some repair failed
[2019-12-28 05:00:04,296] Repair command #1 finished in 1 hour 0 minutes 2 seconds
error: Repair job has failed with the error message: [2019-12-28 05:00:04,295] Some repair failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2019-12-28 05:00:04,295] Some repair failed
at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(Unknown Source)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(Unknown Source)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(Unknown Source)
at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(Unknown Source)
In the logfile on 192.168.13.232, which is in the second data center, I could find only the following messages in debug.log:
DEBUG [COMMIT-LOG-ALLOCATOR] 2019-12-28 04:21:20,143 AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating a fresh one
DEBUG [MessagingService-Outgoing-192.168.13.120-Small] 2019-12-28 04:31:00,450 OutboundTcpConnection.java:410 - Socket to 192.168.13.120 closed
DEBUG [MessagingService-Outgoing-192.168.13.120-Small] 2019-12-28 04:31:00,450 OutboundTcpConnection.java:349 - Error writing to 192.168.13.120
java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_111]
We tried to run the repair a few more times, but it always failed with the same error. After restarting all nodes, it finally succeeded.
Any idea what could be wrong?
Regards
Oliver
Re: repair failed
Posted by Ben Mills <be...@bitbrew.com>.
Hi Oliver,
I don't have a quick answer (or any answer yet), though we ran into a
similar issue and I'm wondering about your environment and some configs.
- Operating system?
- Cloud or on-premise?
- Version of Cassandra?
- Version of Java?
- Compaction strategy?
- Primarily read or primarily write (or a blend of both)?
- How much memory allocated to heap?
- How long do all the repair commands typically take per node?
nodetool repair -full -dcpar will stream data across data centers - is it
possible that the number of nodes, or the amount of data, or the number of
keyspaces has grown enough over time to cause streaming issues (and
timeouts)?
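If streaming timeouts turn out to be the culprit, the relevant knobs live in cassandra.yaml. A sketch for a 3.x cluster follows (values shown are the common defaults, not recommendations, and the exact set of options depends on the Cassandra version, which is still unknown here):

```yaml
# cassandra.yaml (Cassandra 3.x) - illustrative values only
streaming_socket_timeout_in_ms: 86400000   # how long a streaming socket may block before timing out
streaming_keep_alive_period_in_secs: 300   # keep-alive on streaming connections (3.10+)
request_timeout_in_ms: 10000               # default timeout for other inter-node requests
```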
You wrote:
Is it problematic if the repair is started only on one node?
Are you asking whether it's ok to run -full repairs one node at a time (on
all nodes)? Or are you saying that you are only repairing one node in each
cluster or DC?
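For reference, the usual per-node alternative to a single -dcpar run is a full primary-range repair on every node in turn, so each node repairs only the token ranges it owns. A sketch (NODES is a hypothetical host list; the echo keeps it a dry run):

```shell
# Hypothetical per-node repair loop: -pr limits each run to the node's
# primary token ranges, so looping over every node covers the whole ring
# exactly once. Replace NODES with the real host list and drop echo to execute.
NODES="node1 node2 node3"
for host in $NODES; do
  echo nodetool -h "$host" repair -full -pr
done
```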
Thanks,
Ben