Posted to commits@cassandra.apache.org by "Sylvain Lebresne (JIRA)" <ji...@apache.org> on 2011/03/08 20:11:00 UTC

[jira] Created: (CASSANDRA-2290) Repair hangs if one of the neighbor is dead

Repair hangs if one of the neighbor is dead
-------------------------------------------

                 Key: CASSANDRA-2290
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2290
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.7.3
            Reporter: Sylvain Lebresne




--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2290) Repair hangs if one of the neighbor is dead

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021529#comment-13021529 ] 

Hudson commented on CASSANDRA-2290:
-----------------------------------

Integrated in Cassandra-0.8 #19 (See [https://hudson.apache.org/hudson/job/Cassandra-0.8/19/])
    Fix unit tests for CASSANDRA-2290


> Repair hangs if one of the neighbor is dead
> -------------------------------------------
>
>                 Key: CASSANDRA-2290
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2290
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>            Priority: Minor
>             Fix For: 0.7.5, 0.8
>
>         Attachments: 0001-Don-t-start-repair-if-a-neighbor-is-dead.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Repair doesn't cope well with dead/dying neighbors. There are 2 problems:
>   # Repair doesn't check whether a node is dead before sending a TreeRequest; this is easily fixable.
>   # If a neighbor dies mid-repair, the repair will also hang forever.
> The second point is not easy to deal with; the best approach is probably CASSANDRA-1740. That is, if we add a way to query the state of a repair (with the query correctly checking all neighbors) and a way to cancel a repair, this would probably be enough.


[jira] [Commented] (CASSANDRA-2290) Repair hangs if one of the neighbor is dead

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021528#comment-13021528 ] 

Sylvain Lebresne commented on CASSANDRA-2290:
---------------------------------------------

Tests fixed, sorry about that.



[jira] [Resolved] (CASSANDRA-2290) Repair hangs if one of the neighbor is dead

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne resolved CASSANDRA-2290.
-----------------------------------------

    Resolution: Fixed



[jira] [Commented] (CASSANDRA-2290) Repair hangs if one of the neighbor is dead

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021306#comment-13021306 ] 

Jonathan Ellis commented on CASSANDRA-2290:
-------------------------------------------

+1 the check-neighbors patch

> Repair hangs if one of the neighbor is dead
> -------------------------------------------
>
>                 Key: CASSANDRA-2290
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2290
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>            Priority: Minor
>             Fix For: 0.7.5
>
>         Attachments: 0001-Don-t-start-repair-if-a-neighbor-is-dead.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Repair doesn't cope well with dead/dying neighbors. There are 2 problems:
>   # Repair doesn't check whether a node is dead before sending a TreeRequest; this is easily fixable.
>   # If a neighbor dies mid-repair, the repair will also hang forever.
> The second point is not easy to deal with; the best approach is probably CASSANDRA-1740. That is, if we add a way to query the state of a repair (with the query correctly checking all neighbors) and a way to cancel a repair, this would probably be enough.


[jira] Issue Comment Edited: (CASSANDRA-2290) Repair hangs if one of the neighbor is dead

Posted by "Aaron Morton (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004707#comment-13004707 ] 

Aaron Morton edited comment on CASSANDRA-2290 at 3/9/11 6:45 PM:
-----------------------------------------------------------------

Not sure if this helps. I found a place where AES was hanging while testing failure during streaming transfer for CASSANDRA-2088 (against 0.7). I broke the FileStreamTask so that it only sends one range and then closes the sending channel.

IncomingStreamReader.readFile() got stuck in an infinite loop because it does not check the return value of FileChannel.transferFrom(), which was returning 0 bytes read. FileStreamTask likewise does not check the number of bytes sent by transferTo().

While stuck in the loop, the socket it was reading from was in this state (127.0.0.1 was in the loop, 127.0.0.2 was sending):
java      25371 aaron   73u  IPv4 0xffffff8010742ff8      0t0  TCP 127.0.0.1:7000->127.0.0.2:52759 (CLOSE_WAIT)

When I was debugging, the socketChannel was still reporting that it was open.

Update: Modified FileStreamTask to call System.exit() after sending the first section and got the same result.

      was (Author: amorton):
    Not sure if this helps. I found a place where AES was hanging while testing failure during streaming transfer for CASSANDRA-2088 (against 0.7). I broke the FileStreamTask so that it only sends one range and then closes the sending channel.

IncomingStreamReader.readFile() got stuck in an infinite loop because it does not check the return value of FileChannel.transferFrom(), which was returning 0 bytes read. FileStreamTask likewise does not check the number of bytes sent by transferTo().

While stuck in the loop, the socket it was reading from was in this state (127.0.0.1 was in the loop, 127.0.0.2 was sending):
java      25371 aaron   73u  IPv4 0xffffff8010742ff8      0t0  TCP 127.0.0.1:7000->127.0.0.2:52759 (CLOSE_WAIT)

When I was debugging, the socketChannel was still reporting that it was open.
  
> Repair hangs if one of the neighbor is dead
> -------------------------------------------
>
>                 Key: CASSANDRA-2290
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2290
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>            Priority: Minor
>             Fix For: 0.7.4
>
>         Attachments: 0001-Don-t-start-repair-if-a-neighbor-is-dead.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Repair doesn't cope well with dead/dying neighbors. There are 2 problems:
>   # Repair doesn't check whether a node is dead before sending a TreeRequest; this is easily fixable.
>   # If a neighbor dies mid-repair, the repair will also hang forever.
> The second point is not easy to deal with; the best approach is probably CASSANDRA-1740. That is, if we add a way to query the state of a repair (with the query correctly checking all neighbors) and a way to cancel a repair, this would probably be enough.


[jira] [Commented] (CASSANDRA-2290) Repair hangs if one of the neighbor is dead

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021527#comment-13021527 ] 

Hudson commented on CASSANDRA-2290:
-----------------------------------

Integrated in Cassandra #856 (See [https://hudson.apache.org/hudson/job/Cassandra/856/])
    



[jira] [Reopened] (CASSANDRA-2290) Repair hangs if one of the neighbor is dead

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis reopened CASSANDRA-2290:
---------------------------------------


oops, need to fix AESTest now



[jira] Issue Comment Edited: (CASSANDRA-2290) Repair hangs if one of the neighbor is dead

Posted by "Aaron Morton (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004707#comment-13004707 ] 

Aaron Morton edited comment on CASSANDRA-2290 at 3/9/11 6:32 PM:
-----------------------------------------------------------------

Not sure if this helps. I found a place where AES was hanging while testing failure during streaming transfer for CASSANDRA-2088 (against 0.7). I broke the FileStreamTask so that it only sends one range and then closes the sending channel.

IncomingStreamReader.readFile() got stuck in an infinite loop because it does not check the return value of FileChannel.transferFrom(), which was returning 0 bytes read. FileStreamTask likewise does not check the number of bytes sent by transferTo().

While stuck in the loop, the socket it was reading from was in this state (127.0.0.1 was in the loop, 127.0.0.2 was sending):
java      25371 aaron   73u  IPv4 0xffffff8010742ff8      0t0  TCP 127.0.0.1:7000->127.0.0.2:52759 (CLOSE_WAIT)

When I was debugging, the socketChannel was still reporting that it was open.

      was (Author: amorton):
    Not sure if this helps. I found a place where AES was hanging while testing failure during streaming transfer for CASSANDRA-2088. I broke the FileStreamTask so that it only sends one range and then closes the sending channel.

IncomingStreamReader.readFile() got stuck in an infinite loop because it does not check the return value of FileChannel.transferFrom(), which was returning 0 bytes read. FileStreamTask likewise does not check the number of bytes sent by transferTo().

While stuck in the loop, the socket it was reading from was in this state (127.0.0.1 was in the loop, 127.0.0.2 was sending):
java      25371 aaron   73u  IPv4 0xffffff8010742ff8      0t0  TCP 127.0.0.1:7000->127.0.0.2:52759 (CLOSE_WAIT)

When I was debugging, the socketChannel was still reporting that it was open.
  


[jira] Updated: (CASSANDRA-2290) Repair hangs if one of the neighbor is dead

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2290:
----------------------------------------

    Attachment: 0001-Don-t-start-repair-if-a-neighbor-is-dead.patch

Attaching a patch for the first problem above. It checks that all neighbors are alive before attempting the repair; if not, it doesn't start the repair. Another option would be to still repair with whichever neighbors are alive (if any), but I think refusing to repair is a saner default, and I'm fine waiting until someone needs the second option before considering adding it.
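
The pre-check described above can be sketched roughly as follows. This is only an illustration, not the attached patch: the class name RepairPrecheck, its methods, and the isAlive predicate are hypothetical stand-ins for whatever failure detection the node actually uses.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch: refuse to start a repair unless every neighbor is alive.
final class RepairPrecheck {
    // Returns the neighbors that the (assumed) liveness predicate reports as dead, sorted.
    static List<String> deadNeighbors(Set<String> neighbors, Predicate<String> isAlive) {
        return neighbors.stream()
                .filter(isAlive.negate())
                .sorted()
                .collect(Collectors.toList());
    }

    // Runs the repair action only when no neighbor is dead; returns whether it ran.
    static boolean startRepairIfAllAlive(Set<String> neighbors,
                                         Predicate<String> isAlive,
                                         Runnable repair) {
        List<String> dead = deadNeighbors(neighbors, isAlive);
        if (!dead.isEmpty()) {
            System.err.println("Not starting repair, dead neighbors: " + dead);
            return false;
        }
        repair.run();
        return true;
    }
}
```

The alternative mentioned above (repairing with whichever neighbors are alive) would amount to filtering the neighbor set with the same predicate instead of refusing outright.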


> Repair hangs if one of the neighbor is dead
> -------------------------------------------
>
>                 Key: CASSANDRA-2290
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2290
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7.3
>            Reporter: Sylvain Lebresne
>         Attachments: 0001-Don-t-start-repair-if-a-neighbor-is-dead.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Repair doesn't cope well with dead/dying neighbors. There are 2 problems:
>   # Repair doesn't check whether a node is dead before sending a TreeRequest; this is easily fixable.
>   # If a neighbor dies mid-repair, the repair will also hang forever.
> The second point is not easy to deal with; the best approach is probably CASSANDRA-1740. That is, if we add a way to query the state of a repair (with the query correctly checking all neighbors) and a way to cancel a repair, this would probably be enough.


[jira] Commented: (CASSANDRA-2290) Repair hangs if one of the neighbor is dead

Posted by "Aaron Morton (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004707#comment-13004707 ] 

Aaron Morton commented on CASSANDRA-2290:
-----------------------------------------

Not sure if this helps. I found a place where AES was hanging while testing failure during streaming transfer for CASSANDRA-2088. I broke the FileStreamTask so that it only sends one range and then closes the sending channel.

IncomingStreamReader.readFile() got stuck in an infinite loop because it does not check the return value of FileChannel.transferFrom(), which was returning 0 bytes read. FileStreamTask likewise does not check the number of bytes sent by transferTo().

While stuck in the loop, the socket it was reading from was in this state (127.0.0.1 was in the loop, 127.0.0.2 was sending):
java      25371 aaron   73u  IPv4 0xffffff8010742ff8      0t0  TCP 127.0.0.1:7000->127.0.0.2:52759 (CLOSE_WAIT)

When I was debugging, the socketChannel was still reporting that it was open.
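
To illustrate the kind of check that seems to be missing, here is a minimal sketch of a receive loop that looks at the value returned by FileChannel.transferFrom() and gives up after repeated zero-byte transfers instead of spinning. It is not the Cassandra code: the class CheckedTransfer, the method readFully, and the maxEmptyReads limit are assumptions made for the example.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.ReadableByteChannel;

// Hypothetical sketch: a transfer loop that cannot spin forever on 0-byte reads.
final class CheckedTransfer {
    // Copies exactly `length` bytes from `source` into `target`, starting at `offset`.
    // Throws EOFException once `maxEmptyReads` consecutive calls transfer nothing.
    static long readFully(ReadableByteChannel source, FileChannel target,
                          long offset, long length, int maxEmptyReads) throws IOException {
        long done = 0;
        int emptyReads = 0;
        while (done < length) {
            long n = target.transferFrom(source, offset + done, length - done);
            if (n > 0) {
                done += n;
                emptyReads = 0; // progress was made, reset the counter
            } else if (++emptyReads >= maxEmptyReads) {
                throw new EOFException("source stopped after " + done + " of " + length + " bytes");
            }
        }
        return done;
    }
}
```

The same idea would apply on the sending side: compare the value returned by transferTo() with the number of bytes expected rather than assuming the whole section was sent.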



[jira] Updated: (CASSANDRA-2290) Repair hangs if one of the neighbor is dead

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2290:
----------------------------------------

           Description: 
Repair doesn't cope well with dead/dying neighbors. There are 2 problems:

  # Repair doesn't check whether a node is dead before sending a TreeRequest; this is easily fixable.
  # If a neighbor dies mid-repair, the repair will also hang forever.

The second point is not easy to deal with; the best approach is probably CASSANDRA-1740. That is, if we add a way to query the state of a repair (with the query correctly checking all neighbors) and a way to cancel a repair, this would probably be enough.

    Remaining Estimate: 1h
     Original Estimate: 1h
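
As a rough illustration of the second point in the description, the sketch below shows one way a repair could expose a state query that rechecks pending neighbors, plus a cancel operation, so that a caller is never forced to wait forever. Nothing here comes from CASSANDRA-1740 or the Cassandra code base: RepairHandle, its State values, and the isAlive predicate are all hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

// Hypothetical sketch: a repair whose state can be queried and which can be cancelled.
final class RepairHandle {
    enum State { RUNNING, DONE, CANCELLED, NEIGHBOR_DEAD }

    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    private final CountDownLatch finished = new CountDownLatch(1);
    private volatile State state = State.RUNNING;

    RepairHandle(Set<String> neighbors) {
        pending.addAll(neighbors);
        if (pending.isEmpty()) finish(State.DONE);
    }

    // Called when a neighbor has answered; the repair is done once none are pending.
    void onNeighborResponse(String neighbor) {
        if (pending.remove(neighbor) && pending.isEmpty()) finish(State.DONE);
    }

    // Rechecks every neighbor still pending; a dead one ends the repair instead of hanging.
    State query(Predicate<String> isAlive) {
        if (state == State.RUNNING && !pending.stream().allMatch(isAlive)) {
            finish(State.NEIGHBOR_DEAD);
        }
        return state;
    }

    void cancel() {
        finish(State.CANCELLED);
    }

    // Bounded wait: returns false on timeout so the caller can query or cancel.
    boolean await(long timeout, TimeUnit unit) throws InterruptedException {
        return finished.await(timeout, unit);
    }

    // Only the first terminal state wins; later calls are ignored.
    private synchronized void finish(State terminal) {
        if (state == State.RUNNING) {
            state = terminal;
            finished.countDown();
        }
    }
}
```

A caller would typically loop on await() with a timeout, call query() after each timeout, and call cancel() if an operator asks for the repair to stop.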

> Repair hangs if one of the neighbor is dead
> -------------------------------------------
>
>                 Key: CASSANDRA-2290
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2290
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7.3
>            Reporter: Sylvain Lebresne
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Repair doesn't cope well with dead/dying neighbors. There are 2 problems:
>   # Repair doesn't check whether a node is dead before sending a TreeRequest; this is easily fixable.
>   # If a neighbor dies mid-repair, the repair will also hang forever.
> The second point is not easy to deal with; the best approach is probably CASSANDRA-1740. That is, if we add a way to query the state of a repair (with the query correctly checking all neighbors) and a way to cancel a repair, this would probably be enough.


[jira] [Commented] (CASSANDRA-2290) Repair hangs if one of the neighbor is dead

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021595#comment-13021595 ] 

Hudson commented on CASSANDRA-2290:
-----------------------------------

Integrated in Cassandra-0.7 #447 (See [https://hudson.apache.org/hudson/job/Cassandra-0.7/447/])
    



[jira] Updated: (CASSANDRA-2290) Repair hangs if one of the neighbor is dead

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2290:
--------------------------------------

             Reviewer: stuhood
             Priority: Minor  (was: Major)
    Affects Version/s:     (was: 0.7.3)
                       0.6
        Fix Version/s: 0.7.4
             Assignee: Sylvain Lebresne

