Posted to commits@cassandra.apache.org by "Jorge Gallegos (JIRA)" <ji...@apache.org> on 2013/09/12 19:31:11 UTC

[jira] [Commented] (CASSANDRA-5396) Repair process is a joke leading to a downward spiralling and eventually unusable cluster

    [ https://issues.apache.org/jira/browse/CASSANDRA-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765665#comment-13765665 ] 

Jorge Gallegos commented on CASSANDRA-5396:
-------------------------------------------

Regardless of the tone used, the fact remains that points 1), 2), and 3) are true.

An example of a repair session going AWOL (point 4):

{noformat}
Exception in thread "main" java.io.IOException: Some repair session(s) failed (see log for details).
        at org.apache.cassandra.service.StorageService.forceTableRepairPrimaryRange(StorageService.java:2003)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
        at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
        at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
        at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
        at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
        at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
        at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
        at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427)
        at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
        at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265)
        at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360)
        at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788)
        at sun.reflect.GeneratedMethodAccessor335.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:305)
        at sun.rmi.transport.Transport$1.run(Transport.java:159)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
{noformat}
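
For context, this exception is what nodetool prints; nodetool itself just invokes the primary-range repair operation over JMX, which is the path visible at the top of the stack trace. A minimal sketch of that invocation follows, purely for illustration: the host/port are placeholders, and the exact parameter signature of forceTableRepairPrimaryRange changed between Cassandra versions, so the argument list shown here is an assumption.

{noformat}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PrimaryRangeRepair
{
    public static void main(String[] args) throws Exception
    {
        // Placeholder host/port; 7199 is the usual Cassandra JMX port.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName ss = new ObjectName("org.apache.cassandra.db:type=StorageService");
            // Same MBean operation that appears at the top of the stack trace above.
            // The (String keyspace, String... columnFamilies) signature is assumed;
            // it differs between versions. "venkman" is the keyspace seen in the logs.
            mbs.invoke(ss,
                       "forceTableRepairPrimaryRange",
                       new Object[] { "venkman", new String[0] },
                       new String[] { String.class.getName(), String[].class.getName() });
        }
        finally
        {
            connector.close();
        }
    }
}
{noformat}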

And the logs show these stack traces, which support point 5):

{noformat}
ERROR [GossipTasks:1] 2013-09-11 17:34:32,233 AbstractStreamSession.java (line 113) Stream failed because /10.128.9.106 died or was restarted/removed (streams may still be active in background, but further streams won't be started)
ERROR [AntiEntropySessions:5] 2013-09-11 17:34:32,235 AntiEntropyService.java (line 716) [repair #fa3e3400-1b24-11e3-0000-5f4707e79ffb] session completed with the following error
java.io.IOException: Endpoint /10.128.9.106 died
        at org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:787)
        at org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:821)
        at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:193)
        at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:634)
        at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:61)
        at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:166)
        at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:79)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
ERROR [GossipTasks:1] 2013-09-11 17:34:32,233 AbstractStreamSession.java (line 113) Stream failed because /10.128.9.106 died or was restarted/removed (streams may still be active in background, but further streams won't be started)
ERROR [AntiEntropySessions:5] 2013-09-11 17:34:32,235 AntiEntropyService.java (line 716) [repair #fa3e3400-1b24-11e3-0000-5f4707e79ffb] session completed with the following error
java.io.IOException: Endpoint /10.128.9.106 died
        at org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:787)
        at org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:821)
        at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:193)
        at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:634)
        at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:61)
        at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:166)
        at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:79)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: java.io.IOException: Endpoint /10.128.9.106 died
        at org.apache.cassandra.utils.FBUtilities.unchecked(FBUtilities.java:628)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        ... 3 more
Caused by: java.io.IOException: Endpoint /10.128.9.106 died
        at org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:787)
        at org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:821)
        at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:193)
        at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:634)
        at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:61)
        at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:166)
        at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:79)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
        ... 3 more
{noformat}

Needless to say, the peer node didn't "die", nor was it "restarted/removed". In fact, the peer node's CPU and other metrics didn't look abnormally high. This seems pretty similar to CASSANDRA-3838 (and by extension CASSANDRA-3569?), although in our case everything is in the same datacenter. Admittedly this is conjecture, since there are no logs for over an hour on that particular peer node. Here are the two lines it logged in that time frame:

{noformat}
 INFO [FlushWriter:4414] 2013-09-11 16:38:05,237 Memtable.java (line 305) Completed flushing /mntj/data/cassandra/venkman/app_device_tags/venkman-app_device_tags-hf-37100-Data.db (15478504 bytes) for commitlog position ReplayPosition(segmentId=1374175540571, position=10412888)
 INFO [OptionalTasks:1] 2013-09-11 17:55:26,130 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace='venkman', ColumnFamily='app_device_tags') (estimated 231388067 bytes)
{noformat}
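
Since the claim that the peer never actually went down rests only on its own logs, one more way to sanity-check it would be to ask the node that coordinated the repair what its failure detector thinks of /10.128.9.106 (essentially what nodetool gossipinfo prints). A rough sketch is below; the FailureDetector MBean name and the AllEndpointStates attribute are written from memory and may not match this exact version, so treat them as assumptions.

{noformat}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class GossipStateDump
{
    public static void main(String[] args) throws Exception
    {
        // Placeholder host/port for the node that ran the repair.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Assumed MBean name/attribute; this mirrors what nodetool gossipinfo reads.
            ObjectName fd = new ObjectName("org.apache.cassandra.net:type=FailureDetector");
            String states = (String) mbs.getAttribute(fd, "AllEndpointStates");
            // The dump includes a STATUS line per endpoint, so a peer that really was
            // removed or restarted should stand out here.
            System.out.println(states);
        }
        finally
        {
            connector.close();
        }
    }
}
{noformat}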

Points 6) and 7) are also true: every time I attempted to repair a node, the amount of data on disk increased. Running a cleanup doesn't alleviate the problem, and no, not even re-creating the node has a meaningful impact (I tested this personally).

Points 8) and 9) are just consequences of the above, and they are also true, to the point that I'm considering rebuilding the nodes outright instead of repairing them; I'm not sure even that will work, though, so I haven't tried it.

If this is already being tracked in a different JIRA issue, then perhaps this one should be marked as a duplicate. Thoughts?
                
> Repair process is a joke leading to a downward spiralling and eventually unusable cluster
> -----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-5396
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5396
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.2.3
>         Environment: all
>            Reporter: David Berkman
>            Priority: Critical
>
> Let's review the repair process...
> 1) It's mandatory to run repair.
> 2) Repair has a high impact and can take hours.
> 3) Repair provides no estimation of completion time and no progress indicator.
> 4) Repair is extremely fragile, and can fail to complete, or become stuck quite easily in real operating environments.
> 5) When repair fails it provides no feedback whatsoever of the problem or possible resolution.
> 6) A failed repair operation saddles the affected nodes with a huge amount of extra data (judging from node size).
> 7) There is no way to rid the node of the extra data associated with a failed repair short of completely rebuilding the node.
> 8) The extra data from a failed repair makes any subsequent repair take longer and increases the likelihood that it will simply become stuck or fail, leading to yet more node corruption.
> 9) Eventually no repair operation will complete successfully, and node operations will eventually become impacted leading to a failing cluster.
> Who would design such a system for a service meant to operate as a fault tolerant clustered data store operating on a lot of commodity hardware?
> Solution...
> 1) Repair must be robust.
> 2) Repair must *never* become 'stuck'.
> 3) Failure to complete must result in reasonable feedback.
> 4) Failure to complete must not result in a node whose state is worse than before the operation began.
> 5) Repair must provide some means of determining completion percentage.
> 6) It would be nice if repair could estimate its run time, even if it could do so only based upon previous runs.
