Posted to commits@cassandra.apache.org by "Brandon Williams (JIRA)" <ji...@apache.org> on 2010/11/22 21:25:13 UTC

[jira] Created: (CASSANDRA-1766) Streaming never makes progress

Streaming never makes progress
------------------------------

                 Key: CASSANDRA-1766
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1766
             Project: Cassandra
          Issue Type: Bug
    Affects Versions: 0.6.7
            Reporter: Brandon Williams
             Fix For: 0.6.9


I have a client that can never complete a bootstrap.  Anticompaction (AC) finishes and streaming begins.  Stream initiate completes, and the sources wait for the transfer to finish, but no stream ever makes progress.  Nodetool reports that streaming is in progress and the socket is held open, but nothing happens.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-1766) Streaming never makes progress

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974332#action_12974332 ] 

Jonathan Ellis commented on CASSANDRA-1766:
-------------------------------------------

Should we just turn on socket keepalive?


[jira] Commented: (CASSANDRA-1766) Streaming never makes progress

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935406#action_12935406 ] 

Jonathan Ellis commented on CASSANDRA-1766:
-------------------------------------------

Thanks for the patch, Erik.  Moving followup on that to CASSANDRA-1438 to leave this for the "streaming doesn't start at all" problem.


[jira] Updated: (CASSANDRA-1766) Streaming never makes progress

Posted by "Erik Onnen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik Onnen updated CASSANDRA-1766:
----------------------------------

    Attachment: CASSANDRA-1766.patch

Not sure it's exactly related, but I encountered an issue where a stream failed after AE (anti-entropy repair) and was just wedged, with the following stack trace:

"STREAM-STAGE:1" prio=10 tid=0x00007ff2440a5800 nid=0x3c3c in Object.wait() [0x00007ff24a21f000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00007ff28884fad8> (a org.apache.cassandra.utils.SimpleCondition)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:38)
        - locked <0x00007ff28884fad8> (a org.apache.cassandra.utils.SimpleCondition)
        at org.apache.cassandra.streaming.StreamOutManager.waitForStreamCompletion(StreamOutManager.java:164)
        at org.apache.cassandra.streaming.StreamOut.transferSSTables(StreamOut.java:138)
        at org.apache.cassandra.service.AntiEntropyService$Differencer$1.runMayThrow(AntiEntropyService.java:511)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

We suspect that this occurred because the destination node was in drain state, although from reading the code it appears that any failed stream where the destination goes away would be susceptible to this issue. In this case, the StreamManager will never unblock, which makes subsequent repairs to any node that was pending transfer impossible.
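To illustrate why an unbounded wait wedges when the peer disappears, here is a minimal sketch. It is not Cassandra code: `BoundedWaitSketch`, `signalComplete` and `awaitCompletion` are hypothetical names, and a `CountDownLatch` stands in for `SimpleCondition`. It only shows that a wait with a timeout returns control to the caller when no completion signal ever arrives.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration, not the actual StreamOutManager.
final class BoundedWaitSketch {
    private final CountDownLatch done = new CountDownLatch(1);

    /** Called when the remote side acknowledges that the stream completed. */
    void signalComplete() {
        done.countDown();
    }

    /** Returns true if completion was signalled within the timeout, false otherwise. */
    boolean awaitCompletion(long timeout, TimeUnit unit) {
        try {
            return done.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        // Peer went away: nobody ever signals, so the bounded wait gives up.
        BoundedWaitSketch lost = new BoundedWaitSketch();
        System.out.println("peer gone, completed = "
                + lost.awaitCompletion(100, TimeUnit.MILLISECONDS));

        // Peer acknowledged: the wait returns immediately.
        BoundedWaitSketch acked = new BoundedWaitSketch();
        acked.signalComplete();
        System.out.println("peer acked, completed = "
                + acked.awaitCompletion(100, TimeUnit.MILLISECONDS));
    }
}
```

Whether a timeout is the right policy for streaming is a separate question. The point is only that the caller gets a chance to clean up instead of blocking forever.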

I've attached a patch that smooths out some possible streaming issues:

* Catches streaming errors. As near as I can tell, if an error occurred during streaming because the remote node went away, it would bubble all the way out of the executor and not even be logged. Worse, it would keep the current pending file wedged and never allow it to be cleared. This patch removes the failed transfer when an IOException occurs. It could be that the catch should be more general.
* Allows manual purging of pending files to a host via JMX, which means un-sticking a wedged transfer no longer requires a restart of that node. Unfortunately it also removes the file, which could require anti-compaction again, but this was the least painful path through the code.
* Corrects an unlikely but potentially fatal scenario where concurrent mutation of and reads from the file and fileMap references could result in dirty reads, by making them concurrency-safe collections. The only way I could see this happening is if someone ran repair multiple times in succession while streaming was happening. That is unlikely but possible, and unsafe map reads can leave the JVM completely unresponsive.
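The three points above can be sketched roughly as follows. This is not the attached patch: `PendingTransfersSketch`, the `Sender` interface and all method names are hypothetical, the JMX exposure is omitted, and the code uses Java 8 syntax rather than what a 0.6-era tree would compile with. It only shows the shape of the idea: concurrency-safe collections, dropping a failed transfer on IOException, and a manual purge per host.

```java
import java.io.IOException;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;

// Hypothetical illustration, not the code in CASSANDRA-1766.patch.
final class PendingTransfersSketch {
    /** Stand-in for whatever actually pushes bytes to the peer. */
    interface Sender {
        void send(String host, String file) throws IOException;
    }

    // Concurrency-safe collections, so overlapping repairs cannot corrupt the map.
    private final ConcurrentMap<String, Queue<String>> pendingByHost = new ConcurrentHashMap<>();

    void enqueue(String host, String file) {
        pendingByHost.computeIfAbsent(host, h -> new ConcurrentLinkedQueue<>()).add(file);
    }

    /** Streams the head file for a host. A failed file is logged and dropped instead of wedging. */
    boolean streamNext(String host, Sender sender) {
        Queue<String> files = pendingByHost.get(host);
        String file = (files == null) ? null : files.peek();
        if (file == null) {
            return false; // nothing pending for this host
        }
        try {
            sender.send(host, file);
            files.remove(file);
            return true;
        } catch (IOException e) {
            System.err.println("stream of " + file + " to " + host + " failed: " + e.getMessage());
            files.remove(file); // clear the failed transfer so later work is not blocked
            return false;
        }
    }

    /** Manual escape hatch: drops everything queued for a host and reports how many files were dropped. */
    int purge(String host) {
        Queue<String> removed = pendingByHost.remove(host);
        return removed == null ? 0 : removed.size();
    }

    int pendingCount(String host) {
        Queue<String> files = pendingByHost.get(host);
        return files == null ? 0 : files.size();
    }

    public static void main(String[] args) {
        PendingTransfersSketch pending = new PendingTransfersSketch();
        pending.enqueue("10.0.0.1", "a.db");
        pending.enqueue("10.0.0.1", "b.db");
        // Simulate the destination going away mid-stream.
        pending.streamNext("10.0.0.1", (h, f) -> { throw new IOException("peer went away"); });
        System.out.println("still pending: " + pending.pendingCount("10.0.0.1"));
        System.out.println("purged: " + pending.purge("10.0.0.1"));
    }
}
```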


I'm not entirely sure this is the right thing to do, but I thought I'd float it out there for review. Whatever the correct fix, I think there needs to be a way to cancel pending streams so that they aren't stuck.


[jira] Commented: (CASSANDRA-1766) Streaming never makes progress

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974340#action_12974340 ] 

Brandon Williams commented on CASSANDRA-1766:
---------------------------------------------

That would hack around the problem, but the real issue is that the stream initiate done (SID) message can get lost on the wire and hang the streaming process forever.  For 0.6, maybe keepalive is the least invasive thing to do.


[jira] Commented: (CASSANDRA-1766) Streaming never makes progress

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974329#action_12974329 ] 

Brandon Williams commented on CASSANDRA-1766:
---------------------------------------------

What's happening in my case is that there is a firewall/VPN between the bootstrapping node and the source.  The source takes a long time to anticompact, and during that time the firewall kills the TCP connection for being idle.  As a result, the stream initiate done message is never received, because OutboundTcpConnection doesn't actually retry, it only buffers.


[jira] Commented: (CASSANDRA-1766) Streaming never makes progress

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12975372#action_12975372 ] 

Hudson commented on CASSANDRA-1766:
-----------------------------------

Integrated in Cassandra-0.6 #36 (See [https://hudson.apache.org/hudson/job/Cassandra-0.6/36/])
    enable keepalive on intra-cluster sockets
patch by jbellis; reviewed by brandonwilliams for CASSANDRA-1766


> Streaming never makes progress
> ------------------------------
>
>                 Key: CASSANDRA-1766
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1766
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.6
>            Reporter: Brandon Williams
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.9, 0.7.1
>
>         Attachments: 1766-keepalive.txt, CASSANDRA-1766.patch
>
>
> I have a client that can never complete a bootstrap.  AC finishes, streaming begins.  Stream initiate completes, and the sources wait on the transfer to finish, but progress is never made on any stream.  Nodetool reports streaming is happening, the socket is held open, but nothing happens.


[jira] Commented: (CASSANDRA-1766) Streaming never makes progress

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974775#action_12974775 ] 

Brandon Williams commented on CASSANDRA-1766:
---------------------------------------------

+1 on keepalive


[jira] Updated: (CASSANDRA-1766) Streaming never makes progress

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-1766:
--------------------------------------

    Attachment: 1766-keepalive.txt

keepalive patch against 0.6
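The attached patch is not reproduced in this thread. As a rough sketch of what enabling keepalive on a Java socket looks like, the example below uses the standard `java.net.Socket.setKeepAlive` call. The class `KeepAliveSketch`, the helper `connectWithKeepAlive` and the loopback demo are hypothetical and say nothing about where the real patch sets the option.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical illustration, not the contents of 1766-keepalive.txt.
final class KeepAliveSketch {
    /** Opens a client socket with SO_KEEPALIVE enabled before connecting. */
    static Socket connectWithKeepAlive(InetSocketAddress peer, int connectTimeoutMillis)
            throws IOException {
        Socket socket = new Socket();
        socket.setKeepAlive(true); // ask the OS to probe the peer while the connection is idle
        socket.connect(peer, connectTimeoutMillis);
        return socket;
    }

    /** Loopback demo: returns true if both ends report keepalive as enabled. */
    static boolean demo() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            InetSocketAddress peer =
                    new InetSocketAddress(server.getInetAddress(), server.getLocalPort());
            try (Socket client = connectWithKeepAlive(peer, 2000);
                 Socket accepted = server.accept()) {
                accepted.setKeepAlive(true); // the accepting side needs the option set as well
                return client.getKeepAlive() && accepted.getKeepAlive();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("keepalive enabled on both ends: " + demo());
    }
}
```

Note that `setKeepAlive(true)` only turns the option on. How long a connection must be idle before the first probe is sent is decided by the operating system (two hours by default on Linux), so the OS setting may need to be lowered below the firewall's idle timeout for this to help.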
