Posted to commits@cassandra.apache.org by "Russell Alexander Spitzer (JIRA)" <ji...@apache.org> on 2015/08/12 01:43:46 UTC

[jira] [Updated] (CASSANDRA-10050) Secondary Index Performance Dependent on TokenRange Searched in Analytics

     [ https://issues.apache.org/jira/browse/CASSANDRA-10050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Russell Alexander Spitzer updated CASSANDRA-10050:
--------------------------------------------------
    Description: 
While doing some test work on the Spark Cassandra Connector, I saw some odd performance when pushing down range queries with Secondary Index filters. When running the queries we see a huge amount of time during which the C* server is not doing any work and the query seems to be hanging. This investigation led to the work in this document:

https://docs.google.com/spreadsheets/d/1aJg3KX7nPnY77RJ9ZT-IfaYADgJh0A--nAxItvC6hb4/edit#gid=0

The Spark Cassandra Connector builds token-range-specific queries and allows the user to push down relevant fields to C*. Here we have two indexed fields, (size) and (color), being pushed down to C*.

{code}
SELECT count(*) FROM ks.tab WHERE token("store") > $min AND token("store") <= $max AND color = 'red' AND size = 'P' ALLOW FILTERING;
{code}
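
As a rough illustration of how per-range queries of this shape could be generated, here is a minimal sketch. It is not the connector's actual splitting logic: it assumes the default Murmur3Partitioner (tokens from -2^63 to 2^63 - 1) and an even split of the ring, and the helper names split_token_range and build_queries are hypothetical.

```python
# Sketch only: evenly split the Murmur3 token ring and fill in the query
# template from this issue. The even split and the helper names are
# assumptions for illustration, not the Spark Cassandra Connector's code.

MIN_TOKEN = -2**63        # lowest Murmur3Partitioner token
MAX_TOKEN = 2**63 - 1     # highest Murmur3Partitioner token

QUERY_TEMPLATE = (
    'SELECT count(*) FROM ks.tab '
    'WHERE token("store") > {min} AND token("store") <= {max} '
    "AND color = 'red' AND size = 'P' ALLOW FILTERING;"
)


def split_token_range(num_splits):
    """Return num_splits contiguous (min, max] token ranges covering the ring."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (span * i) // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def build_queries(num_splits):
    """Build one CQL query string per token range."""
    return [
        QUERY_TEMPLATE.format(min=lo, max=hi)
        for lo, hi in split_token_range(num_splits)
    ]


if __name__ == "__main__":
    for query in build_queries(4):
        print(query)
```

Each generated query corresponds to one task; the first range starts at Min(token) and the last ends at Max(token).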

These queries will have different token ranges inserted and are executed as separate Spark tasks. Spark tasks with token ranges near Min(token) end up executing much faster than those near Max(token), which also happen to throw errors.

{code}
Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
{code}

I took the queries and ran them through CQLSH to see the difference in time. Query time grows linearly with where the queried tokenRange sits in the ring, starting at only 2 seconds for queries near the beginning of the full token spectrum and rising to over 12 seconds at the end of the spectrum.
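
A comparison like this could also be scripted rather than run by hand. The following is a minimal sketch of a timing harness, not what was used for the numbers above: time_queries is a hypothetical helper, and execute stands for whatever callable actually sends a query string to C* (for example, a thin wrapper around a driver session).

```python
# Sketch only: time each query and record any error instead of aborting,
# so slow or timed-out token ranges still show up in the results.
# time_queries and the execute callable are assumptions for illustration.
import time


def time_queries(execute, queries, clock=time.perf_counter):
    """Run each query through execute and return (query, seconds, error) tuples."""
    timings = []
    for query in queries:
        started = clock()
        try:
            execute(query)
            error = None
        except Exception as exc:  # e.g. a coordinator timeout
            error = exc
        timings.append((query, clock() - started, error))
    return timings
```

Recording the error alongside the elapsed time keeps the timeouts seen near Max(token) in the same result set as the successful queries.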

The question is: can this behavior be improved, or should we not recommend using secondary indexes with Analytics workloads?



  was:
In doing some test work on the Spark Cassandra Connector I saw some odd performance when pushing down range queries with Secondary Index filters. When running the queries we see huge amount of time when the C* server is not doing any work and the query seem to be hanging. This investigation led to the work in this document

https://docs.google.com/spreadsheets/d/1aJg3KX7nPnY77RJ9ZT-IfaYADgJh0A--nAxItvC6hb4/edit#gid=0

The Spark Cassandra Connector builds up token range specific queries and allows the user to pushdown relevant fields to C*. Here we should two indexed fields (size) and (color) being pushed down to C*. 

{code}
SELECT count(*) FROM ks.tab WHERE token("store") > $min AND token("store") <= $max AND color = 'red' AND size = 'P' ALLOW FILTERING;{code}

These queries will have different token ranges inserted and executed as separate spark tasks. Spark tasks with token ranges near the Min(token) end up executing much faster than those near Max(token) which also happen to through errors.

{code}
Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
{code}

I took the queries and ran them through CQLSH to see the difference in time. A linear relationship is seen based on where the tokenRange being queried is starting with only 2 second for queries near the beginning of the full token spectrum and over 12 seconds at the end of the spectrum. 

The question is, can this behavior be improved? or should we not recommend using secondary indexes with Analytics workloads?




> Secondary Index Performance Dependent on TokenRange Searched in Analytics
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-10050
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10050
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Single node, macbook, 2.1.8
>            Reporter: Russell Alexander Spitzer


