Posted to commits@cassandra.apache.org by "Chris Goffinet (JIRA)" <ji...@apache.org> on 2009/12/15 18:36:18 UTC

[jira] Created: (CASSANDRA-634) Hinted Handoff Exception

Hinted Handoff Exception
------------------------

                 Key: CASSANDRA-634
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-634
             Project: Cassandra
          Issue Type: Bug
    Affects Versions: 0.5
            Reporter: Chris Goffinet
             Fix For: 0.5


Updated to the latest codebase from the cassandra-0.5 branch. All nodes booted up fine, and then I started noticing this error:

ERROR [HINTED-HANDOFF-POOL:1] 2009-12-14 22:05:34,191 CassandraDaemon.java (line 71) Fatal exception in thread Thread[HINTED-HANDOFF-POOL:1,5,main]
java.lang.RuntimeException: java.lang.NullPointerException
        at org.apache.cassandra.db.HintedHandOffManager$1.run(HintedHandOffManager.java:253)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
        at org.apache.cassandra.gms.FailureDetector.isAlive(FailureDetector.java:146)
        at org.apache.cassandra.db.HintedHandOffManager.sendMessage(HintedHandOffManager.java:106)
        at org.apache.cassandra.db.HintedHandOffManager.deliverAllHints(HintedHandOffManager.java:177)
        at org.apache.cassandra.db.HintedHandOffManager.access$000(HintedHandOffManager.java:75)
        at org.apache.cassandra.db.HintedHandOffManager$1.run(HintedHandOffManager.java:249)
        ... 3 more

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-634) Hinted Handoff Exception

Posted by "Chris Goffinet (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792972#action_12792972 ] 

Chris Goffinet commented on CASSANDRA-634:
------------------------------------------

Does this patch address the NullPointerException? I tried the latest from the 0.5 branch and am still seeing this exception.


[jira] Commented: (CASSANDRA-634) Hinted Handoff Exception

Posted by "Chris Goffinet (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793939#action_12793939 ] 

Chris Goffinet commented on CASSANDRA-634:
------------------------------------------

Can you rebase this on 0.5? I am fixing it up manually. Strange that it complained, since you only added a comment to INTERVAL_IN_MS.

> Hinted Handoff Exception
> ------------------------
>
>                 Key: CASSANDRA-634
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-634
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.5
>            Reporter: Chris Goffinet
>            Assignee: Jaakko Laine
>             Fix For: 0.5
>
>         Attachments: 634-1st-part-gossip-about-all-nodes.patch, 634-discard-obsolete.patch
>
>


[jira] Commented: (CASSANDRA-634) Hinted Handoff Exception

Posted by "Jaakko Laine (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791400#action_12791400 ] 

Jaakko Laine commented on CASSANDRA-634:
----------------------------------------

Gossip SYN only includes digests for live endpoints. 

You're right, this would indeed cause wrong ranges on new nodes that have not seen the dead node when it was still alive. Writes would still go to the right node, but they would lack the HINT flag.

Simply gossiping state about all nodes (dead or alive) would solve this problem, but I have to take another look tomorrow morning to check whether it would cause any side effects with the failure detector on the new node (it might momentarily consider the dead node to be alive, although this would probably not be a problem at all).


[jira] Commented: (CASSANDRA-634) Hinted Handoff Exception

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791394#action_12791394 ] 

Jonathan Ellis commented on CASSANDRA-634:
------------------------------------------

Are you sure we don't gossip state for down nodes to new cluster members?  If we don't, the token ranges will be wrong, which is a bigger problem.


[jira] Commented: (CASSANDRA-634) Hinted Handoff Exception

Posted by "Jaakko Laine (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791392#action_12791392 ] 

Jaakko Laine commented on CASSANDRA-634:
----------------------------------------

Oops, sorry. I need a bigger font :-(

(but still "volatile" is not needed there :-)

OK, could it be like this: there are nodes A and B in the cluster, with replication factor 2. Node B goes down, and node C is introduced as a new node after this. Now A knows there are A, B and C in the cluster, but C only knows about A. Suppose at this time a client sends a write request to A, which falls into A's range (with a replica in B's range). B is offline, so a hinted write will go to C instead. The problem is that C will later try to deliver this hint to B, but its Gossiper has never heard of B, so the endpoint state will be null.

If this is the cause, then a simple fix

if (epState == null)
    return false;

before line 146 should do the trick.
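
A minimal sketch of that guard, assuming a simplified stand-in for the gossiper's endpoint map. The names used here (FailureDetectorSketch, EndpointState, endpointStates, record) are hypothetical and are not the actual Cassandra classes; only the null check itself comes from the suggestion above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for illustration; not the real org.apache.cassandra.gms classes.
final class EndpointState {
    private final boolean alive;
    EndpointState(boolean alive) { this.alive = alive; }
    boolean isAlive() { return alive; }
}

final class FailureDetectorSketch {
    // Endpoint address -> last known state, as learned through gossip.
    private final Map<String, EndpointState> endpointStates = new ConcurrentHashMap<>();

    // Records what gossip has told us about an endpoint.
    void record(String endpoint, boolean alive) {
        endpointStates.put(endpoint, new EndpointState(alive));
    }

    boolean isAlive(String endpoint) {
        EndpointState epState = endpointStates.get(endpoint);
        // An endpoint never seen in gossip has no state; report it as not alive
        // instead of dereferencing null.
        if (epState == null)
            return false;
        return epState.isAlive();
    }
}
```

With this guard, a lookup for node B on node C in the scenario above would return false rather than throw.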


[jira] Resolved: (CASSANDRA-634) Hinted Handoff Exception

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-634.
--------------------------------------

    Resolution: Fixed
      Assignee: Jaakko Laine

Committed to 0.5 and trunk.

Created CASSANDRA-644 for the gossip membership problem.


[jira] Updated: (CASSANDRA-634) Hinted Handoff Exception

Posted by "Jaakko Laine (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jaakko Laine updated CASSANDRA-634:
-----------------------------------

    Attachment: 634-1st-part-gossip-about-all-nodes.patch

The way a node's alive/dead status is currently determined is as follows:

1st time there is gossip about a certain node:
- add the new node to endpoint state and mark it alive (onAlive will be called)
- call onJoin

2nd and subsequent gossip about this node:
- notify failuredetector whenever there is gossip about this endpoint -> failuredetector starts to monitor this node and sets the node's status to dead if needed (it will not set it to alive)
- node is marked alive whenever there is gossip about it

The important things here are: (1) a node is assumed to be alive when the 1st info about it arrives and (2) failuredetector does not know anything about the node before the 2nd gossip. That means we cannot simply start gossiping info about dead nodes, as their status would remain "alive" forever (that is, until the dead node comes online and activates the failure detector).

Proposed fix (patch attached):

1st time gossip:
- add new node to endpoint state, but set its status as dead
- call onJoin (token metadata will be updated)

2nd and subsequent gossips:
- Unchanged. This 2nd gossip will trigger markAlive (and call onAlive) and activate failuredetector -> normal situation

In short: assume a node is dead until proven otherwise by subsequent gossip. If the node is alive, it will be marked so within seconds. If it is dead, we have knowledge about its existence, but we consider it (correctly) to be dead.
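
The proposed rule can be sketched roughly as follows. This is an illustrative model only, not the attached patch: the class and method names are hypothetical, and treating "subsequent gossip" as gossip that carries a newer heartbeat version is an assumption made for the sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the proposed rule; names are illustrative, not Cassandra's.
final class GossipStateSketch {
    private final Map<String, Integer> versions = new HashMap<>();
    private final Map<String, Boolean> alive = new HashMap<>();
    private int joinCalls = 0;

    // Called whenever gossip about an endpoint arrives.
    // 'version' stands in for the heartbeat version carried by the gossip (assumption).
    void onGossip(String endpoint, int version) {
        Integer seen = versions.get(endpoint);
        if (seen == null) {
            // 1st gossip: remember the node so token metadata can be updated,
            // but assume it is dead.
            versions.put(endpoint, version);
            alive.put(endpoint, false);
            joinCalls++; // stands in for onJoin
        } else if (version > seen) {
            // Newer gossip: the node is making progress, so mark it alive.
            // In the real system the failure detector starts monitoring here.
            versions.put(endpoint, version);
            alive.put(endpoint, true);
        }
        // Same or older version: nothing new, status is left unchanged.
    }

    boolean isKnown(String endpoint) { return versions.containsKey(endpoint); }
    boolean isAlive(String endpoint) { return alive.getOrDefault(endpoint, false); }
    int joinCalls() { return joinCalls; }
}
```

A dead node whose last state is merely repeated by other nodes stays known but dead, while a live node is marked alive as soon as a newer heartbeat arrives.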

There is a possibility of a false "alive" interpretation, though: the cluster has nodes A, B and C. Suppose C has just gossiped to B and dies. At this time C's status in A is different (older) than in B. Now suppose at this instant node D enters the cluster and first gossips with A. In this case D will get the old gossip and only later the new one. This second, newer gossip will cause C to be marked alive even though it is already dead. However, since the second gossip will also activate the failure detector, C will be correctly marked as dead in a few seconds, so this is probably OK (and anyway a very rare occurrence).

Two open issues:
- Now that we're gossiping about dead nodes as well, the gossip digest continues to grow without bound as nodes come and go. This information will never disappear, as it will be propagated to new nodes no matter how old and obsolete it is. To counter this, we need some mechanism to either (1) remove the dead node from endpointstateinfo or (2) at some point stop gossiping about it, or both.

For (1): when we get a removetoken command, it is probably safe to remove the endpoint immediately (STATE_LEFT is broadcast by a different endpoint, so info about the token removal will remain in the gossiper). Another thing we could do is keep track of nodes that have left. If nothing is heard about such a node for some time, we could assume that it is gone for good and remove it from the gossiper after giving its STATE_LEFT enough time to spread.

For (2): we could gossip info only about nodes in either liveEndpoints or unreachableEndpoints (as opposed to endPointStateMap). Nodes are removed from unreachableEndpoints after three days of silence, so this would discard old information from the gossiper. The side effect would of course be that a node that is down for more than three days but comes back later might miss some of its data (new nodes that booted after the three-day period would know nothing about this node).
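
A rough sketch of option (2), assuming a simplified model in which the set of endpoints to gossip about is the union of the live and unreachable sets, and unreachable entries expire after three days of silence. The class, its methods and the EXPIRE_MS constant are hypothetical names for illustration; only liveEndpoints, unreachableEndpoints and the three-day period come from the description above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.TimeUnit;

// Hypothetical model of option (2); not the actual Gossiper implementation.
final class GossipDigestScopeSketch {
    // Three days of silence before an unreachable endpoint is forgotten.
    static final long EXPIRE_MS = TimeUnit.DAYS.toMillis(3);

    private final Set<String> liveEndpoints = new TreeSet<>();
    // Unreachable endpoint -> time (ms) it was last heard from.
    private final Map<String, Long> unreachableEndpoints = new HashMap<>();

    void markLive(String endpoint) {
        unreachableEndpoints.remove(endpoint);
        liveEndpoints.add(endpoint);
    }

    void markUnreachable(String endpoint, long lastHeardMs) {
        liveEndpoints.remove(endpoint);
        unreachableEndpoints.put(endpoint, lastHeardMs);
    }

    // Endpoints to include in gossip at time nowMs: live ones plus unreachable
    // ones that have not yet been silent for longer than EXPIRE_MS.
    Set<String> endpointsToGossip(long nowMs) {
        unreachableEndpoints.values().removeIf(last -> nowMs - last > EXPIRE_MS);
        Set<String> result = new TreeSet<>(liveEndpoints);
        result.addAll(unreachableEndpoints.keySet());
        return result;
    }
}
```

The trade-off described above is visible here: once an endpoint expires, nodes that join later will never hear about it from this node.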

(Attached patch should work as such, but does not take into account these last two issues)


[jira] Commented: (CASSANDRA-634) Hinted Handoff Exception

Posted by "Chris Goffinet (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794177#action_12794177 ] 

Chris Goffinet commented on CASSANDRA-634:
------------------------------------------

+1. No more exceptions, and I now see HH data being discarded for the node we removed.


[jira] Reopened: (CASSANDRA-634) Hinted Handoff Exception

Posted by "Jaakko Laine (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jaakko Laine reopened CASSANDRA-634:
------------------------------------


[jira] Commented: (CASSANDRA-634) Hinted Handoff Exception

Posted by "Jaakko Laine (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793134#action_12793134 ] 

Jaakko Laine commented on CASSANDRA-634:
----------------------------------------

It was supposed to solve it, but obviously it did not fully do so.

The problem in your case might be that hinted handoff data is persistent and gossiper data is not. Suppose there are nodes A and B. Suppose B goes down and A stores hinted data for it. Later A is restarted -> A still has hinted data for B, but after the restart its gossiper knows nothing about B. It does not help even if we gossip about dead nodes, as nobody has ever heard of B. If B is gone forever, A can never get rid of the hinted data.

I don't know what the best thing to do here would be. The removetoken command could try to redirect hints to a new destination in case a hinted target is removed. However, if the endpoint has been lost from gossip/tokenmetadata, then there is nothing it can do, as it does not know who the endpoint was. Another option would be to add a manual command to redirect hinted data.

Other options?


[jira] Commented: (CASSANDRA-634) Hinted Handoff Exception

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791385#action_12791385 ] 

Jonathan Ellis commented on CASSANDRA-634:
------------------------------------------

No, 146 in the 0.5 branch is the next line:

        return epState.isAlive();


[jira] Commented: (CASSANDRA-634) Hinted Handoff Exception

Posted by "Chris Goffinet (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790838#action_12790838 ] 

Chris Goffinet commented on CASSANDRA-634:
------------------------------------------

I'm wondering if this might be due to the fact that one of the nodes in the cluster is currently down. This might offer a clue.




[jira] Commented: (CASSANDRA-634) Hinted Handoff Exception

Posted by "Jaakko Laine (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793444#action_12793444 ] 

Jaakko Laine commented on CASSANDRA-634:
----------------------------------------

If dropping HH data is OK, then this patch looks good to me.

One small thing: after a node starts, it takes some time before it knows about all cluster nodes. If a request to deliver hints arrives during this window (not possible?), HH data will be discarded just because the node does not yet know about the other endpoint.


> Hinted Handoff Exception
> ------------------------
>
>                 Key: CASSANDRA-634
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-634
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.5
>            Reporter: Chris Goffinet
>            Assignee: Jaakko Laine
>             Fix For: 0.5
>
>         Attachments: 634-1st-part-gossip-about-all-nodes.patch, 634-discard-obsolete.patch
>
>
> Updated to the latest codebase from cassandra-0.5 branch. All nodes booted up fine and then I start noticing this error:
> ERROR [HINTED-HANDOFF-POOL:1] 2009-12-14 22:05:34,191 CassandraDaemon.java (line 71) Fatal exception in thread Thread[HINTED-HANDOFF-POOL:1,5,main]
> java.lang.RuntimeException: java.lang.NullPointerException
>         at org.apache.cassandra.db.HintedHandOffManager$1.run(HintedHandOffManager.java:253)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.NullPointerException
>         at org.apache.cassandra.gms.FailureDetector.isAlive(FailureDetector.java:146)
>         at org.apache.cassandra.db.HintedHandOffManager.sendMessage(HintedHandOffManager.java:106)
>         at org.apache.cassandra.db.HintedHandOffManager.deliverAllHints(HintedHandOffManager.java:177)
>         at org.apache.cassandra.db.HintedHandOffManager.access$000(HintedHandOffManager.java:75)
>         at org.apache.cassandra.db.HintedHandOffManager$1.run(HintedHandOffManager.java:249)
>         ... 3 more



[jira] Commented: (CASSANDRA-634) Hinted Handoff Exception

Posted by "Jaakko Laine (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791382#action_12791382 ] 

Jaakko Laine commented on CASSANDRA-634:
----------------------------------------

Very strange one.

Line 146 in FailureDetector is this:

EndPointState epState = Gossiper.instance().getEndPointStateForEndPoint(ep);

According to the stack trace, execution does not even reach getEndPointStateForEndPoint, and even if it did, ep comes from InetAddress.getByAddress, so it cannot be null. Also, if the culprit were the Gossiper constructor, that should show in the trace as well.

This would mean that Gossiper.instance() returns null, but I don't know how that can happen. The "volatile" on gossiper_ is actually not needed, but I don't know whether having it there could cause such a thing. "volatile" was added here pretty recently, so it might be one possible explanation for why this came up now.

BTW, what version of Java are you using?
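
For illustration only, here is a minimal sketch of a liveness check that cannot throw a NullPointerException when an endpoint's state is missing. The classes LivenessSketch and EndpointStateSketch and the helper addr are hypothetical stand-ins, not the actual FailureDetector or Gossiper code, and the sketch does not claim to identify the cause of the trace; it only shows one defensive shape such a check could take.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a per-endpoint state record.
final class EndpointStateSketch {
    private final boolean alive;
    EndpointStateSketch(boolean alive) { this.alive = alive; }
    boolean isAlive() { return alive; }
}

// Hypothetical stand-in for a failure detector backed by gossip state.
final class LivenessSketch {
    // Endpoints this node has heard about so far (assumed to be fed by gossip).
    private final Map<InetAddress, EndpointStateSketch> states = new ConcurrentHashMap<>();

    void record(InetAddress ep, boolean alive) {
        states.put(ep, new EndpointStateSketch(alive));
    }

    // An endpoint with no recorded state is reported as not alive
    // instead of being dereferenced.
    boolean isAlive(InetAddress ep) {
        EndpointStateSketch state = states.get(ep);
        return state != null && state.isAlive();
    }

    // Builds an IPv4 address from four octets without a DNS lookup.
    static InetAddress addr(int a, int b, int c, int d) {
        try {
            return InetAddress.getByAddress(new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

With this shape, asking about an endpoint that was never recorded returns false rather than failing, so a caller such as a hint sender could skip that endpoint and move on.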





[jira] Updated: (CASSANDRA-634) Hinted Handoff Exception

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-634:
-------------------------------------

    Attachment: 634-discard-obsolete.patch

Discard hints for nodes that are no longer part of the gossip network.




[jira] Commented: (CASSANDRA-634) Hinted Handoff Exception

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793244#action_12793244 ] 

Jonathan Ellis commented on CASSANDRA-634:
------------------------------------------

IMO we should just drop the HH data with a warning (in case ops does intend to bring the dead node back).  In almost all cases, if a node is down that long it is going to be replaced entirely, and the replacement will bootstrap and not need the old HH data.

Since you should repair after bringing a dead node online anyway (b/c there is a window before the FD is aware that we should start doing HH), HH is just an optimization and this is OK.
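
As a rough sketch of the "drop with a warning" idea, assuming a simple in-memory map of pending hints keyed by endpoint name: the class HintDiscardSketch, its methods, and the use of java.util.logging are all illustrative assumptions, not the contents of the attached patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical sketch; names and data structures are illustrative assumptions.
final class HintDiscardSketch {
    private static final Logger LOG = Logger.getLogger(HintDiscardSketch.class.getName());

    // Hints pending per target endpoint (endpoints are plain strings here).
    private final Map<String, List<String>> hintsByEndpoint = new LinkedHashMap<>();

    void addHint(String endpoint, String hint) {
        hintsByEndpoint.computeIfAbsent(endpoint, k -> new ArrayList<>()).add(hint);
    }

    // Drops hints whose target is not among the endpoints known to gossip,
    // logs a warning for each dropped endpoint, and returns those endpoints.
    List<String> discardObsolete(Set<String> knownEndpoints) {
        List<String> dropped = new ArrayList<>();
        hintsByEndpoint.entrySet().removeIf(entry -> {
            if (knownEndpoints.contains(entry.getKey())) {
                return false; // still part of the cluster, keep its hints
            }
            LOG.warning("Discarding " + entry.getValue().size()
                    + " hint(s) for unknown endpoint " + entry.getKey());
            dropped.add(entry.getKey());
            return true;
        });
        return dropped;
    }

    int pendingFor(String endpoint) {
        List<String> hints = hintsByEndpoint.get(endpoint);
        return hints == null ? 0 : hints.size();
    }
}
```

The warning is the only trace left for ops if the dead node was in fact meant to come back, which is why the sketch logs before removing rather than dropping silently.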




[jira] Commented: (CASSANDRA-634) Hinted Handoff Exception

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793470#action_12793470 ] 

Jonathan Ellis commented on CASSANDRA-634:
------------------------------------------

Hint delivery is attempted when a node is signaled to be up, or every 1h (starting after a 1h delay), so this seems OK.
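
To make the two triggers concrete, here is a small sketch using java.util.concurrent. HintScheduleSketch and its method names are hypothetical, and the delivery itself is reduced to a counter; only the shape of the schedule (an initial delay equal to the period, plus an on-demand run when an endpoint comes up) is taken from the description.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the two delivery triggers; not the actual HintedHandOffManager.
final class HintScheduleSketch {
    // Daemon thread so an abandoned sketch never keeps the JVM alive.
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "hint-delivery-sketch");
                t.setDaemon(true);
                return t;
            });
    private final AtomicInteger deliveryAttempts = new AtomicInteger();

    // Periodic trigger: first run after one period, then one period after each run ends.
    void start(long period, TimeUnit unit) {
        executor.scheduleWithFixedDelay(this::deliverAllHints, period, period, unit);
    }

    // Event trigger: called when an endpoint is signaled to be up.
    void onEndpointUp(String endpoint) {
        executor.execute(this::deliverAllHints);
    }

    private void deliverAllHints() {
        deliveryAttempts.incrementAndGet(); // placeholder for real delivery work
    }

    int attempts() { return deliveryAttempts.get(); }

    // Lets already-queued one-shot work finish; the periodic task is cancelled.
    void stop() {
        executor.shutdown();
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Calling start(1, TimeUnit.HOURS) gives the hourly schedule with a one-hour initial delay, so a freshly started node gets time to learn about the cluster before the first periodic attempt.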

