Posted to commits@cassandra.apache.org by "Benjamin Lerer (JIRA)" <ji...@apache.org> on 2015/05/08 22:38:01 UTC
[jira] [Commented] (CASSANDRA-8940) Inconsistent select count and select distinct

    [ https://issues.apache.org/jira/browse/CASSANDRA-8940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535458#comment-14535458 ] 

Benjamin Lerer commented on CASSANDRA-8940:
-------------------------------------------

[~frensjan], you could not reproduce the problem with ccm because ccm disables {{vnodes}} by default. This means that the data is distributed over only 3 contiguous ranges and the {{StoreProxy}} has to perform at most 2 requests.

If the ccm cluster is created with the {{vnodes}} option, the data is distributed over 256 * 3 = 768 ranges and the problem is easily reproducible.

In our scenario, for each page of data (5000 CQL rows), the {{StoreProxy}} first issues a single request and checks whether enough results were returned. If not, it estimates, based on the amount of data returned for the first range, how many more ranges it needs to query, and queries them in parallel.

In the worst case, where no results were found in the first range, the {{StoreProxy}} assumes that there is only a small amount of data per range and issues 767 concurrent requests to get the remaining data.
A third of those requests will target ranges of data located on the coordinator node.
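
To make the arithmetic concrete, here is a minimal sketch of how such an estimate could work. It is only an illustration and not the actual Cassandra code: the function {{ranges_to_query_next}}, its parameters and the rounding rule are all hypothetical.

```python
import math


def ranges_to_query_next(rows_needed, rows_from_first_range, ranges_remaining):
    """Guess how many of the remaining ranges to query concurrently.

    Hypothetical helper: it extrapolates from the number of rows
    returned by the first range.
    """
    rows_missing = rows_needed - rows_from_first_range
    if rows_missing <= 0:
        return 0  # the page is already full
    if rows_from_first_range == 0:
        # Nothing came back: assume the data is sparse and query every range left.
        return ranges_remaining
    estimate = math.ceil(rows_missing / rows_from_first_range)
    return min(estimate, ranges_remaining)


total_ranges = 256 * 3  # 3 nodes with 256 vnodes each

# Worst case: the first range returned nothing for a 5000-row page.
worst_case = ranges_to_query_next(5000, 0, total_ranges - 1)

# A denser first range (1000 rows) leads to far fewer concurrent requests.
denser_case = ranges_to_query_next(5000, 1000, total_ranges - 1)
```

Under these assumptions, {{worst_case}} is 767, matching the number above, while {{denser_case}} is only 4.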

Cassandra optimises those local requests by not serializing and deserializing them.

The problem was that the {{SliceQueryFilter}}, which is part of the request and is used to filter the data, ended up being shared between the threads. It should not have been, because it is mutable.
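
The effect of sharing a mutable filter can be illustrated with a small, deliberately simplified sketch. {{CountingFilter}} and {{query_ranges}} are hypothetical stand-ins, not Cassandra classes, and the sketch assumes the filter carries a per-request counter. The requests run sequentially here so that the outcome is deterministic. With real threads, which rows get dropped would depend on the interleaving.

```python
import copy


class CountingFilter:
    """Hypothetical mutable filter: tracks how many rows it may still accept."""

    def __init__(self, limit):
        self.remaining = limit

    def collect(self, rows):
        accepted = rows[:self.remaining]
        self.remaining -= len(accepted)  # mutates the filter's state
        return accepted


def query_ranges(ranges, filter_, share_filter):
    """Apply the filter to each range, either shared or copied per request."""
    results = []
    for rows in ranges:
        # A serialization round trip would implicitly give each request its own copy.
        request_filter = filter_ if share_filter else copy.copy(filter_)
        results.extend(request_filter.collect(rows))
    return results


ranges = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Shared instance: state left over from one range leaks into the next.
shared = query_ranges(ranges, CountingFilter(4), share_filter=True)

# One copy per request: every range is filtered independently.
isolated = query_ranges(ranges, CountingFilter(4), share_filter=False)
```

In this sketch {{shared}} contains only 4 rows, whereas {{isolated}} contains all 9. Rows are silently lost once the filter state is shared.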

I described the worst-case scenario, but the problem can also occur with a smaller number of concurrent requests.


> Inconsistent select count and select distinct
> ---------------------------------------------
>
>                 Key: CASSANDRA-8940
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8940
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: 2.1.2
>            Reporter: Frens Jan Rumph
>            Assignee: Benjamin Lerer
>         Attachments: 7b74fb00-e935-11e4-b10c-317579db7eb4.csv, 8d5899d0-e935-11e4-847b-2d06da75a6cd.csv, Vagrantfile, install_cassandra.sh, setup_hosts.sh
>
>
> When performing {{select count( * ) from ...}} I expect the results to be consistent over multiple query executions if the table at hand is not written to / deleted from in the meantime. However, in my set-up they are not. The counts returned vary considerably (by several percent). The same holds for {{select distinct partition-key-columns from ...}}.
> I have a table in a keyspace with replication_factor = 1 which is something like:
> {code}
> CREATE TABLE tbl (
>     id frozen<id_type>,
>     bucket bigint,
>     offset int,
>     value double,
>     PRIMARY KEY ((id, bucket), offset)
> )
> {code}
> The frozen udt is:
> {code}
> CREATE TYPE id_type (
>     tags map<text, text>
> );
> {code}
> The table contains around 35k rows (I'm not trying to be funny here ...). The consistency level for the queries was ONE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)