Posted to commits@cassandra.apache.org by "Narendra Sharma (JIRA)" <ji...@apache.org> on 2011/04/20 02:30:06 UTC

[jira] [Created] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes

batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes
-----------------------------------------------------------------------------------------------------------

                 Key: CASSANDRA-2514
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2514
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.7.4
         Environment: 1. Cassandra 0.7.4 running on RHEL 5.5
2. 2 DC setup
3. RF = 4 (DC1 = 2, DC2 = 2)
4. CL = LOCAL_QUORUM
            Reporter: Narendra Sharma
             Fix For: 0.7.5


We have a 2 DC setup with RF = 4. There are 2 nodes in each DC. Following is the keyspace definition:
<snip>
keyspaces:
    - name: KeyspaceMetadata
      replica_placement_strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
      strategy_options:
        DC1 : 2
        DC2 : 2
      replication_factor: 4
</snip>

I shut down all nodes except one and waited for the live node to recognize that the other nodes were dead. Following is the nodetool ring output on the live node:
Address         Status State   Load            Owns    Token                                       
                                                       169579575332184635438912517119426957796     
10.17.221.19    Down   Normal  ?               29.20%  49117425183422571410176530597442406739      
10.17.221.17    Up     Normal  81.64 KB        4.41%   56615248844645582918169246064691229930      
10.16.80.54     Down   Normal  ?               21.13%  92563519227261352488017033924602789201      
10.17.221.18    Down   Normal  ?               45.27%  169579575332184635438912517119426957796     

I expected an UnavailableException when I sent a batch_mutate request to the node that is up. However, it returned a TimedOutException:
TimedOutException()
    at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:16493)
    at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:916)
    at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:890)

Following is the cassandra-topology.properties
# Cassandra Node IP=Data Center:Rack
10.17.221.17=DC1:RAC1
10.17.221.19=DC1:RAC2

10.17.221.18=DC2:RAC1
10.16.80.54=DC2:RAC2
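To make the expectation concrete, here is a small self-contained sketch (illustrative only, not Cassandra code; class and method names are made up) that derives per-DC replica counts from a cassandra-topology.properties-style mapping and computes the quorum LOCAL_QUORUM requires per DC:

```java
import java.util.*;

// Illustrative sketch: derive per-DC node counts from a
// cassandra-topology.properties-style mapping and compute the quorum
// that LOCAL_QUORUM requires in each DC (quorum = floor(rf/2) + 1).
public class TopologyQuorum {
    // count nodes per data center from "ip=DC:Rack" style entries
    static Map<String, Integer> nodesPerDc(Map<String, String> topology) {
        Map<String, Integer> counts = new HashMap<>();
        for (String dcRack : topology.values()) {
            String dc = dcRack.split(":")[0];   // "DC1:RAC1" -> "DC1"
            counts.merge(dc, 1, Integer::sum);
        }
        return counts;
    }

    static int quorum(int rf) { return rf / 2 + 1; }

    public static void main(String[] args) {
        Map<String, String> topology = new HashMap<>();
        topology.put("10.17.221.17", "DC1:RAC1");
        topology.put("10.17.221.19", "DC1:RAC2");
        topology.put("10.17.221.18", "DC2:RAC1");
        topology.put("10.16.80.54", "DC2:RAC2");
        // DC1 has RF=2, so LOCAL_QUORUM needs 2 live local replicas; with
        // only 10.17.221.17 up, a LOCAL_QUORUM write can never succeed,
        // so the coordinator should fail fast with UnavailableException.
        System.out.println(quorum(nodesPerDc(topology).get("DC1"))); // prints 2
    }
}
```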


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes

Posted by "Narendra Sharma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021883#comment-13021883 ] 

Narendra Sharma commented on CASSANDRA-2514:
--------------------------------------------

The code to reproduce this issue is a simple batch_mutate operation; the one I performed added 2 columns to a SuperColumn. Let me know if it is not reproducible and I will provide the sample code.

[jira] [Commented] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes

Posted by "Narendra Sharma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022238#comment-13022238 ] 

Narendra Sharma commented on CASSANDRA-2514:
--------------------------------------------

Looks good to me.

Just one comment/question: hintedEndpoints is a subset of writeEndpoints, so is the additional check writeEndpoints.contains(destination) needed while we are iterating over hintedEndpoints? I think an assert would be better here.



[jira] [Updated] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes

Posted by "Narendra Sharma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Narendra Sharma updated CASSANDRA-2514:
---------------------------------------

    Attachment: CASSANDRA-2514.patch

Use hintedEndpoints instead of writeEndpoints to work on live endpoints only.

[jira] [Resolved] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-2514.
---------------------------------------

    Resolution: Fixed
      Reviewer: jbellis
      Assignee: Narendra Sharma

committed, thanks!

[jira] [Commented] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022248#comment-13022248 ] 

Jonathan Ellis commented on CASSANDRA-2514:
-------------------------------------------

That's the point: hintedEndpoints is *usually*, but not always, a subset of writeEndpoints. Here is the code from getHintedEndpoints:

{code}
        // assign dead endpoints to be hinted to the closest live one, or to the local node
        // (since it is trivially the closest) if none are alive.  This way, the cost of doing
        // a hint is only adding the hint header, rather than doing a full extra write, if any
        // destination nodes are alive.
        //
        // we do a 2nd pass on targets instead of using temporary storage,
        // to optimize for the common case (everything was alive).
        InetAddress localAddress = FBUtilities.getLocalAddress();
        for (InetAddress ep : targets)
        {
            if (map.containsKey(ep))
                continue;
            if (!StorageProxy.shouldHint(ep))
            {
                if (logger.isDebugEnabled())
                    logger.debug("not hinting " + ep + " which has been down " + Gossiper.instance.getEndpointDowntime(ep) + "ms");
                continue;
            }

            InetAddress destination = map.isEmpty()
                                    ? localAddress
                                    : snitch.getSortedListByProximity(localAddress, map.keySet()).get(0);
            map.put(destination, ep);
        }
{code}
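That fallback can be modeled with a small self-contained sketch (a toy model, not the real method; the shouldHint/downtime check is omitted and names are made up): when every target is down but still hintable, the local address becomes a key of the map even though it is not a write endpoint.

```java
import java.util.*;

// Toy model of getHintedEndpoints' second pass (not the real
// implementation): live targets deliver to themselves; dead targets are
// hinted via a live destination, or via the local node if none are alive.
public class HintedEndpointsModel {
    static Map<String, List<String>> hintedEndpoints(List<String> targets,
                                                     Set<String> live,
                                                     String localAddress) {
        Map<String, List<String>> map = new HashMap<>();
        // first pass: live targets map to themselves (no hint needed)
        for (String ep : targets)
            if (live.contains(ep))
                map.computeIfAbsent(ep, k -> new ArrayList<>()).add(ep);
        // second pass: assign each dead target to a live key,
        // or to the local node if the map is still empty
        for (String ep : targets) {
            if (map.containsKey(ep))
                continue;
            String destination = map.isEmpty()
                               ? localAddress
                               : map.keySet().iterator().next();
            map.computeIfAbsent(destination, k -> new ArrayList<>()).add(ep);
        }
        return map;
    }

    public static void main(String[] args) {
        // all replicas down; the coordinator "local" is not a replica itself
        Map<String, List<String>> m = hintedEndpoints(
                Arrays.asList("a", "b"), Collections.emptySet(), "local");
        // "local" is a key of hintedEndpoints but not in writeEndpoints
        System.out.println(m.keySet()); // prints [local]
    }
}
```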

[jira] [Updated] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2514:
--------------------------------------

             Priority: Minor  (was: Major)
    Affects Version/s:     (was: 0.7.4)
                       0.7.0

how does that look to you?

[jira] [Commented] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022249#comment-13022249 ] 

Jonathan Ellis commented on CASSANDRA-2514:
-------------------------------------------

That is: our last-resort local hint destination may not be part of writeEndpoints (and probably won't be, on a large cluster).

[jira] [Commented] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes

Posted by "Narendra Sharma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021875#comment-13021875 ] 

Narendra Sharma commented on CASSANDRA-2514:
--------------------------------------------

I think the issue is that DatacenterWriteResponseHandler.assureSufficientLiveNodes is not checking for live nodes.

DatacenterWriteResponseHandler.assureSufficientLiveNodes works on writeEndpoints, which contains all the replica endpoints (possibly more, if nodes are bootstrapping), whether or not they are alive.

I think either writeEndpoints should exclude dead/unreachable nodes, or DatacenterWriteResponseHandler.assureSufficientLiveNodes should use hintedEndpoints.keySet(), since that contains the live endpoints. I compared the implementation with WriteResponseHandler.assureSufficientLiveNodes and found that it uses hintedEndpoints.


I am attaching the patch that works for me.
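The difference between the two checks can be sketched as follows (a minimal illustration with made-up names, not the actual Cassandra classes): counting writeEndpoints counts dead replicas too, so the write proceeds and later times out, while counting only live endpoints fails fast.

```java
import java.util.*;

// Illustrative comparison (not the real handlers): the buggy check counts
// every local write endpoint, dead or alive; the fixed check counts only
// live ones, analogous to using hintedEndpoints.keySet().
public class SufficientLiveNodes {
    static int quorum(int rf) { return rf / 2 + 1; }

    // buggy: dead replicas still count, so no UnavailableException is raised
    static boolean buggyCheck(List<String> localWriteEndpoints, int localRf) {
        return localWriteEndpoints.size() >= quorum(localRf);
    }

    // fixed: only live local endpoints count toward the quorum
    static boolean fixedCheck(List<String> localWriteEndpoints,
                              Set<String> liveEndpoints, int localRf) {
        int alive = 0;
        for (String ep : localWriteEndpoints)
            if (liveEndpoints.contains(ep))
                alive++;
        return alive >= quorum(localRf);
    }

    public static void main(String[] args) {
        // the cluster from the report: DC1 replicas, only one node up
        List<String> dc1 = Arrays.asList("10.17.221.17", "10.17.221.19");
        Set<String> live = Collections.singleton("10.17.221.17");
        System.out.println(buggyCheck(dc1, 2));       // prints true  (write proceeds, then times out)
        System.out.println(fixedCheck(dc1, live, 2)); // prints false (UnavailableException up front)
    }
}
```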

[jira] [Commented] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes

Posted by "Narendra Sharma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022258#comment-13022258 ] 

Narendra Sharma commented on CASSANDRA-2514:
--------------------------------------------

Got it. In my setup I had hinted handoff (HH) disabled, so I overlooked the rest of getHintedEndpoints.

The change looks good to me now. Thanks!



[jira] [Commented] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022332#comment-13022332 ] 

Hudson commented on CASSANDRA-2514:
-----------------------------------

Integrated in Cassandra-0.7 #451 (See [https://builds.apache.org/hudson/job/Cassandra-0.7/451/])
    fixes for verifying destination availability under hinted conditions
patch by Narendra Sharma and jbellis for CASSANDRA-2514




[jira] [Updated] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2514:
--------------------------------------

    Attachment: 2514-v2.txt

Good catch, that is a bug.

v2 adds a couple of improvements:

- only count a hinted endpoint towards the live count if it's a normal write destination (hints can be sent elsewhere when all the write destinations are dead)
- a similar fix for DSWRH (EACH_QUORUM)
- an unrelated fix in WRH so that CL.ANY does not fall through into the CL.QUORUM/ALL code
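A hedged sketch of the first bullet (hypothetical names, not the patch itself): in 0.7 the hinted-endpoint map pairs each live node with the replica it will write for, or hold a hint for. Only pairs where the live node is itself the target replica represent normal write destinations, so only those should count towards the consistency-level check:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only -- not the actual patch; names are hypothetical.
public class HintedLiveCount {
    // Each pair is {liveEndpoint, targetReplica}. When the two match, the
    // live node is a normal write destination; when they differ, the live
    // node merely stores a hint for a dead replica and must not satisfy
    // the consistency-level check.
    static int liveDestinationCount(List<String[]> hintedEndpoints) {
        int live = 0;
        for (String[] pair : hintedEndpoints)
            if (pair[0].equals(pair[1]))
                live++;
        return live;
    }

    public static void main(String[] args) {
        // One live node writing for itself, plus the same node holding a
        // hint for a dead local replica (the scenario in this ticket).
        List<String[]> endpoints = Arrays.asList(
                new String[]{"10.17.221.17", "10.17.221.17"},  // real write
                new String[]{"10.17.221.17", "10.17.221.19"}); // hint only
        System.out.println(liveDestinationCount(endpoints));   // 1, below the local quorum of 2
    }
}
```

Counting both entries would report two live destinations and let the write proceed (then time out); counting only the matching pair reports one, which is below the local quorum and correctly triggers UnavailableException.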

