You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Mck SembWever (JIRA)" <ji...@apache.org> on 2011/01/22 17:50:44 UTC

[jira] Issue Comment Edited: (CASSANDRA-1831) NetworkTopologyStrategy allows mismatched RF resulting in obscure failures

    [ https://issues.apache.org/jira/browse/CASSANDRA-1831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985152#action_12985152 ] 

Mck SembWever edited comment on CASSANDRA-1831 at 1/22/11 11:49 AM:
--------------------------------------------------------------------

I hit this issue. It wasn't immediately obvious to me what was required configuration to get NetworkTopologyStrategy working. The best docs i can find is in http://svn.apache.org/repos/asf/cassandra/trunk/conf/cassandra.yaml
A wiki page on NetworkTopologyStrategy would help a lot. (Or someone just telling me to try first OldNetworkTopologyStrategy, why wasn't it called instead SimpleNetworkTopologyStrategy?)

(http://wiki.apache.org/cassandra/Operations#Network_topology still refers to the old RackAwareStrategy)

      was (Author: michaelsembwever):
    I hit this issue. It wasn't immediately obvious to me what was required configuration to get NetworkTopologyStrategy working. The best docs i can find is in http://svn.apache.org/repos/asf/cassandra/trunk/conf/cassandra.yaml
A wiki page on NetworkTopologyStrategy would help a lot.

(http://wiki.apache.org/cassandra/Operations#Network_topology still refers to the old RackAwareStrategy)
  
> NetworkTopologyStrategy allows mismatched RF resulting in obscure failures
> --------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1831
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1831
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.0 rc 1
>            Reporter: Peter Schuller
>
> On today's 0.7 branch:
> Creating a keyspace like this (not how to do it in production, but that's not the point):
>    create keyspace MyKeySpace with replication_factor = 2 and placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy';
> This is accepted by Cassandra in spite of there being no strategy options. Describing the keyspace will then give output similar to:
> Keyspace: MyKeySpace:
>  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
> null
> Attempts to write and read respectively gives the errors included at the bottom of this comment.
> What happens is that the NTS's getReplicationFactor() returns the sum of RF for each DC. But lacking any replicate placement options for DC:s, the sum will always be 0. The result is that NTS.calculateNaturalEndpoints() yields 0 endpoints thus triggering the assertion failures apparent in the strack traces.
> This was caused by misconfiguration during testing but should be handled better. What are people's thoughts on the set of changes that would constitute a proper fix?
> Is there a reason for NTS to ever conclude that RF is different than that of the CF def? If not, I would say that one fix is to make the NTS bail early if the calculated RF adding up the DC placements does not match the configured RF for the column family. (I'll submit a patch if people agree.)
> Beyond that, what else, if anything should be done? Should the creation fail due to the RF being inconsistent with strategy options? Is it correct that code assumes that naturalEndPoints will never return fewer nodes than RF? It seems natural to me that the natural endpoint count should always match RF, unless the total number of nodes in the cluster is lacking. But this gets complicated with NTS since the requirement is suddenly that you have enough in each DC. This probably relates to previous discussions on whether or not to allow an RF which is higher than the number of nodes in a cluster.
> In this case, we failed hard because we got exactly 0 endpoints and triggered assertions. In other cases we might have gotten say 1, in which case we may have successfully been able to read and write as if we had a lower RF even though the column family RF was set to 2. This seems dangerous.
> ERROR [pool-1-thread-2] 2010-12-07 11:18:40,638 Cassandra.java (line
> 3044) Internal error processing batch_mutate
> java.lang.AssertionError: invalid response count 1 for replication factor 0
>        at org.apache.cassandra.service.WriteResponseHandler.determineBlockFor(WriteResponseHandler.java:98)
>        at org.apache.cassandra.service.WriteResponseHandler.<init>(WriteResponseHandler.java:48)
>        at org.apache.cassandra.service.WriteResponseHandler.create(WriteResponseHandler.java:61)
>        at org.apache.cassandra.locator.AbstractReplicationStrategy.getWriteResponseHandler(AbstractReplicationStrategy.java:125)
>        at org.apache.cassandra.locator.NetworkTopologyStrategy.getWriteResponseHandler(NetworkTopologyStrategy.java:166)
>        at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:114)
>        at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:446)
>        at org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:419)
>        at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3036)
>        at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
>        at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
> ERROR [pool-1-thread-3] 2010-12-07 11:18:50,474 Cassandra.java (line
> 2876) Internal error processing get_range_slices
> java.lang.AssertionError
>        at org.apache.cassandra.service.RangeSliceResponseResolver.<init>(RangeSliceResponseResolver.java:53)
>        at org.apache.cassandra.service.StorageProxy.getRangeSlice(StorageProxy.java:450)
>        at org.apache.cassandra.thrift.CassandraServer.get_range_slices(CassandraServer.java:507)
>        at org.apache.cassandra.thrift.Cassandra$Processor$get_range_slices.process(Cassandra.java:2868)
>        at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
>        at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
>  INFO [MigrationStage:1] 2010-12-07 11:24:09,220

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.