Posted to commits@cassandra.apache.org by "Narendra Sharma (JIRA)" <ji...@apache.org> on 2011/04/20 02:30:06 UTC
[jira] [Created] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes
batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes
-----------------------------------------------------------------------------------------------------------
Key: CASSANDRA-2514
URL: https://issues.apache.org/jira/browse/CASSANDRA-2514
Project: Cassandra
Issue Type: Bug
Components: Core
Affects Versions: 0.7.4
Environment: 1. Cassandra 0.7.4 running on RHEL 5.5
2. 2 DC setup
3. RF = 4 (DC1 = 2, DC2 = 2)
4. CL = LOCAL_QUORUM
Reporter: Narendra Sharma
Fix For: 0.7.5
We have a 2 DC setup with RF = 4. There are 2 nodes in each DC. Following is the keyspace definition:
<snip>
keyspaces:
    - name: KeyspaceMetadata
      replica_placement_strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
      strategy_options:
          DC1 : 2
          DC2 : 2
      replication_factor: 4
</snip>
I shut down all nodes except one and waited for the live node to recognize that the other nodes are dead. Following is the nodetool ring output on the live node:
Address Status State Load Owns Token
169579575332184635438912517119426957796
10.17.221.19 Down Normal ? 29.20% 49117425183422571410176530597442406739
10.17.221.17 Up Normal 81.64 KB 4.41% 56615248844645582918169246064691229930
10.16.80.54 Down Normal ? 21.13% 92563519227261352488017033924602789201
10.17.221.18 Down Normal ? 45.27% 169579575332184635438912517119426957796
I expected an UnavailableException when sending a batch_mutate request to the node that is up. However, it returned a TimedOutException:
TimedOutException()
at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:16493)
at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:916)
at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:890)
Following is the cassandra-topology.properties:
# Cassandra Node IP=Data Center:Rack
10.17.221.17=DC1:RAC1
10.17.221.19=DC1:RAC2
10.17.221.18=DC2:RAC1
10.16.80.54=DC2:RAC2
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes
Posted by "Narendra Sharma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021883#comment-13021883 ]
Narendra Sharma commented on CASSANDRA-2514:
--------------------------------------------
The code to reproduce this issue is a simple batch_mutate operation. The operation I performed involved adding 2 columns to a SuperColumn. Let me know if it is not reproducible, and I will provide the sample code.
[jira] [Commented] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes
Posted by "Narendra Sharma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022238#comment-13022238 ]
Narendra Sharma commented on CASSANDRA-2514:
--------------------------------------------
Looks good to me.
Just one comment/question:
hintedEndpoints is a subset of writeEndpoints. So is the additional check writeEndpoints.contains(destination) needed while we are iterating over hintedEndpoints? I think an assert would be better here.
[jira] [Updated] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes
Posted by "Narendra Sharma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Narendra Sharma updated CASSANDRA-2514:
---------------------------------------
Attachment: CASSANDRA-2514.patch
Use hintedEndpoints instead of writeEndpoints to work on live endpoints only.
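The patched check can be illustrated with a self-contained sketch (plain java.util collections and hypothetical method names, not the actual Cassandra source): writeEndpoints lists every replica, dead or alive, while the keys of the hinted-endpoints map are live destinations, so only the latter can tell us whether enough nodes are up.

```java
import java.util.*;

// Illustrative sketch only (hypothetical names, not the Cassandra source):
// the bug is that assureSufficientLiveNodes counted writeEndpoints (all
// replicas, dead or alive); the patch counts hintedEndpoints.keySet()
// (live destinations only).
public class LiveNodeCheck {
    // Buggy check: every replica counts, so dead nodes satisfy the quorum
    // and the write later times out instead of failing fast.
    static boolean buggyHasEnoughLive(Set<String> writeEndpoints, int required) {
        return writeEndpoints.size() >= required;
    }

    // Patched check: only live destinations (keys of the hinted-endpoints
    // map: live node -> replicas whose writes/hints it receives) count.
    static boolean patchedHasEnoughLive(Map<String, List<String>> hintedEndpoints,
                                        int required) {
        return hintedEndpoints.keySet().size() >= required;
    }
}
```

In the reported scenario (RF=4, one node up, LOCAL_QUORUM needing 2 acks) the buggy check passes and the request times out; the patched check fails immediately, which is what produces the expected UnavailableException.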
[jira] [Resolved] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes
Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Ellis resolved CASSANDRA-2514.
---------------------------------------
Resolution: Fixed
Reviewer: jbellis
Assignee: Narendra Sharma
committed, thanks!
[jira] [Commented] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes
Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022248#comment-13022248 ]
Jonathan Ellis commented on CASSANDRA-2514:
-------------------------------------------
That's the point: hintedEndpoints is *usually* but not always a subset of writeEndpoints. Here is the code from getHintedEndpoints:
{code}
// assign dead endpoints to be hinted to the closest live one, or to the local node
// (since it is trivially the closest) if none are alive. This way, the cost of doing
// a hint is only adding the hint header, rather than doing a full extra write, if any
// destination nodes are alive.
//
// we do a 2nd pass on targets instead of using temporary storage,
// to optimize for the common case (everything was alive).
InetAddress localAddress = FBUtilities.getLocalAddress();
for (InetAddress ep : targets)
{
    if (map.containsKey(ep))
        continue;

    if (!StorageProxy.shouldHint(ep))
    {
        if (logger.isDebugEnabled())
            logger.debug("not hinting " + ep + " which has been down " + Gossiper.instance.getEndpointDowntime(ep) + "ms");
        continue;
    }

    InetAddress destination = map.isEmpty()
                              ? localAddress
                              : snitch.getSortedListByProximity(localAddress, map.keySet()).get(0);
    map.put(destination, ep);
}
{code}
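A runnable miniature of this two-pass assignment (plain collections standing in for the Multimap and the snitch; "closest live node" is simplified to "first live destination", so all names here are hypothetical) makes the point concrete: when every target is down, the map's only key is the local address, which is not one of the write endpoints.

```java
import java.util.*;

// Simplified, hypothetical model of the two-pass hint assignment above.
// Real Cassandra uses a Multimap and the snitch's proximity sort; here a
// dead target is hinted to the first live destination, or to the local
// node when nothing is alive.
public class HintAssignment {
    static Map<String, List<String>> assignHints(List<String> targets,
                                                 Set<String> live,
                                                 String localAddress) {
        Map<String, List<String>> map = new LinkedHashMap<>();
        // first pass: live targets receive their own writes
        for (String ep : targets)
            if (live.contains(ep))
                map.computeIfAbsent(ep, k -> new ArrayList<>()).add(ep);
        // second pass: dead targets are hinted to a live destination,
        // or to the local node (the "last resort") if none are alive
        for (String ep : targets) {
            if (map.containsKey(ep))
                continue;
            String destination = map.isEmpty()
                    ? localAddress
                    : map.keySet().iterator().next();
            map.computeIfAbsent(destination, k -> new ArrayList<>()).add(ep);
        }
        return map;
    }
}
```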
[jira] [Updated] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes
Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Ellis updated CASSANDRA-2514:
--------------------------------------
Priority: Minor (was: Major)
Affects Version/s: 0.7.0 (was: 0.7.4)
how does that look to you?
[jira] [Commented] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes
Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022249#comment-13022249 ]
Jonathan Ellis commented on CASSANDRA-2514:
-------------------------------------------
that is: our last-resort local hint storage may not be part of writeEndpoints (probably won't be, on a large cluster).
[jira] [Commented] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes
Posted by "Narendra Sharma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021875#comment-13021875 ]
Narendra Sharma commented on CASSANDRA-2514:
--------------------------------------------
I think the issue is that DatacenterWriteResponseHandler.assureSufficientLiveNodes is not checking for live nodes.
DatacenterWriteResponseHandler.assureSufficientLiveNodes works on writeEndpoints, which contains the list of all replica endpoints (possibly more, if nodes are bootstrapping).
I think either writeEndpoints should ignore dead/unreachable nodes, or DatacenterWriteResponseHandler.assureSufficientLiveNodes should use hintedEndpoints.keySet(), as that contains the live endpoints.
I compared the implementation with WriteResponseHandler.assureSufficientLiveNodes and found that it uses hintedEndpoints.
I am attaching a patch that works for me.
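The arithmetic behind the expectation can be spelled out (assuming the usual quorum formula, replicas/2 + 1): DC1 holds 2 replicas, so LOCAL_QUORUM needs 2 acknowledgements, but only one DC1 node is live, meaning the request can never succeed and should fail fast with UnavailableException rather than waiting out the rpc timeout.

```java
// Quorum arithmetic sketch for this setup (assumed formula: quorum =
// replicas/2 + 1). DC1 has 2 replicas, so LOCAL_QUORUM needs 2 acks;
// with only one live DC1 node the write can never be acknowledged.
public class QuorumMath {
    static int quorumFor(int replicas) {
        return replicas / 2 + 1;
    }
}
```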
[jira] [Commented] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes
Posted by "Narendra Sharma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022258#comment-13022258 ]
Narendra Sharma commented on CASSANDRA-2514:
--------------------------------------------
Got it. In my setup I had hinted handoff disabled, so I overlooked the rest of getHintedEndpoints.
The change looks good to me now. Thanks!
[jira] [Commented] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022332#comment-13022332 ]
Hudson commented on CASSANDRA-2514:
-----------------------------------
Integrated in Cassandra-0.7 #451 (See [https://builds.apache.org/hudson/job/Cassandra-0.7/451/])
fixes for verifying destination availability under hinted conditions
patch by Narendra Sharma and jbellis for CASSANDRA-2514
[jira] [Updated] (CASSANDRA-2514) batch_mutate operations with CL=LOCAL_QUORUM throw TimeOutException when there aren't sufficient live nodes
Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CASSANDRA-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Ellis updated CASSANDRA-2514:
--------------------------------------
Attachment: 2514-v2.txt
Good catch, that is a bug.
v2 adds a couple of improvements:
- only count a hinted endpoint towards the live count if it is a normal write destination (hints can be sent elsewhere if all the write destinations are dead)
- a similar fix for DatacenterSyncWriteResponseHandler (EACH_QUORUM)
- an unrelated fix in WriteResponseHandler so that CL.ANY does not continue through to the CL.QUORUM/ALL code
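The first improvement can be sketched as follows (a hypothetical helper with made-up names, not the v2 patch itself): a hint destination counts toward the live total only when it is also a normal write endpoint, so the local last-resort hint target can never satisfy a quorum.

```java
import java.util.*;

// Sketch of the v2 counting rule (hypothetical names): only hint
// destinations that are also normal write endpoints count as live --
// the local last-resort target does not.
public class V2LiveCount {
    static int liveCount(Map<String, List<String>> hintedEndpoints,
                         Set<String> writeEndpoints) {
        int live = 0;
        for (String destination : hintedEndpoints.keySet())
            if (writeEndpoints.contains(destination))
                live++;
        return live;
    }
}
```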