You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Jan Chochol (JIRA)" <ji...@apache.org> on 2014/01/15 10:18:19 UTC
[jira] [Created] (CASSANDRA-6587) Slow query when using token range and secondary index

Jan Chochol created CASSANDRA-6587:
--------------------------------------

             Summary: Slow query when using token range and secondary index
                 Key: CASSANDRA-6587
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6587
             Project: Cassandra
          Issue Type: Bug
          Components: Core
            Reporter: Jan Chochol


We are using token ranges to simulate pagination on external API. To achieve this, we use similar queries:
{noformat}
SELECT * FROM table WHERE TOKEN(partition_key) > TOKEN('offset') AND secondary_key = 'value' LIMIT 1000;
{noformat}

We found that such statement is quite ineffective, and we do not know how to solve it.

Let's try some example.

You can fill Cassandra with folowing script:
{noformat}
perl -e "print(\"DROP KEYSPACE t;\nCREATE KEYSPACE t WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 1};\nuse t;\nCREATE TABLE t (a varchar PRIMARY KEY, b varchar, c varchar, d varchar);\nCREATE INDEX t_b ON t (b);\nCREATE INDEX t_c ON t (c);\nCREATE INDEX t_d ON t (d);\n\");\$max = 100000; for(\$i = 0; \$i < \$max; \$i++) { \$j = int(\$i * 10 / \$max); \$k = int(\$i * 100 / \$max); print(\"INSERT INTO t (a, b, c, d) VALUES ('a\$i', 'b\$j', 'c\$k', 'd\$i');\n\")}; for(\$i = 0; \$i < \$max; \$i++) { print(\"INSERT INTO t (a, b, c, d) VALUES ('e\$i', 'f\$j', 'g\$k', 'h\$i');\n\")}" | cqlsh
{noformat}

First we looked for last but one parition key:
{noformat}
[root@jch3-devel:~/c4] echo "SELECT a FROM t.t WHERE b = 'b1' LIMIT 100000;" | cqlsh | tail
 a18283
 a11336
 a14712
 a11476
 a19396
 a14269
 a10719
 a14521
 a13934
{noformat}

Than we issue following commands for some interesting behaviour:
{noformat}
SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) = TOKEN('a14521') LIMIT 10000000;
SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) = TOKEN('a14521') LIMIT 10;
SELECT a, d FROM t.t WHERE b = 'b1' AND a = 'a14521' LIMIT 10;
{noformat}

And here is result:
{noformat}
[root@jch3-devel:~/c4] time echo "SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) = TOKEN('a14521') LIMIT 10000000;" | cqlsh

 a      | d
--------+--------
 a14521 | d14521


real    0m0.647s
user    0m0.307s
sys     0m0.076s

[root@jch3-devel:~/c4] time echo "SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) = TOKEN('a14521') LIMIT 10;" | cqlsh

 a      | d
--------+--------
 a14521 | d14521


real    0m16.454s
user    0m0.341s
sys     0m0.090s

[root@jch3-devel:~/c4] time echo "SELECT a, d FROM t.t WHERE b = 'b1' AND a = 'a14521' LIMIT 10;" | cqlsh

 a      | d
--------+--------
 a14521 | d14521


real    0m0.404s
user    0m0.309s
sys     0m0.071s
{noformat}

Problem with {{LIMIT}} is described in CASSANDRA-6348, and is quite funny - lower the limit, slower the requst (and with different structure of data it can be even worse).
This query is quite silly in reality (asking with secondary key, when you have primary key), but is close as possible to our use case:
{noformat}
SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) > TOKEN('a14521') LIMIT 10;
{noformat}
But we simply can not do:
{noformat}
SELECT a, d FROM t.t WHERE b = 'b1' AND a > 'a14521' LIMIT 10;
{noformat}
As this is unsupported.

{{CompositesSearcher.java}} gives us some clue about the problem:
{noformat}
        /*
         * XXX: If the range requested is a token range, we'll have to start at the beginning (and stop at the end) of
         * the indexed row unfortunately (which will be inefficient), because we have not way to intuit the small
         * possible key having a given token. A fix would be to actually store the token along the key in the
         * indexed row.
         */
{noformat}
Index row contains parition keys in partion key ordering (ordering exposed in CQL3 as {{TOKEN(partition_key)}}), so these two request are expected to return same values:
{noformat}
SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) > TOKEN('a14521') LIMIT 1;
SELECT a, d FROM t.t WHERE b = 'b1' AND a > 'a14521' LIMIT 1;
{noformat}
But the second is not supported.

Currently we are considering to go to our production with this patch:
{noformat}
diff --git a/src/java/org/apache/cassandra/cql3/statements/SelectStatement.java b/src/java/org/apache/cassandra/cql3/statements/SelectStatement.java
index 44a1e64..0228c3a 100644
--- a/src/java/org/apache/cassandra/cql3/statements/SelectStatement.java
+++ b/src/java/org/apache/cassandra/cql3/statements/SelectStatement.java
@@ -1123,8 +1123,10 @@ public class SelectStatement implements CQLStatement
                         stmt.keyIsInRelation = true;
                     }
                 }
-                else
+                else if (stmt.metadataRestrictions.isEmpty())
                 {
+                    // This is poor man heuristic, and probably far from correct, but it allows us these requests:
+                    // SELECT * FROM table WHERE partition_key > 'offset' AND secondary_key = 'value'
                     throw new InvalidRequestException("Only EQ and IN relation are supported on the partition key (you will need to use the token() function for non equality based relation)");
                 }
                 previous = cname;
diff --git a/src/java/org/apache/cassandra/db/index/composites/CompositesSearcher.java b/src/java/org/apache/cassandra/db/index/composites/CompositesSearcher.java
index 5ab1df6..2ba2845 100644
--- a/src/java/org/apache/cassandra/db/index/composites/CompositesSearcher.java
+++ b/src/java/org/apache/cassandra/db/index/composites/CompositesSearcher.java
@@ -190,7 +190,8 @@ public class CompositesSearcher extends SecondaryIndexSearcher
 
             private int meanColumns = Math.max(index.getIndexCfs().getMeanColumns(), 1);
             // We shouldn't fetch only 1 row as this provides buggy paging in case the first row doesn't satisfy all clauses
-            private final int rowsPerQuery = Math.max(Math.min(filter.maxRows(), filter.maxColumns() / meanColumns), 2);
+            // We  are not too interested in memory consumption, as we are using only "small data"
+            private final int rowsPerQuery = Math.max(Math.min(filter.maxRows(), filter.maxColumns()), 2);
 
             public boolean needsFiltering()
             {
{noformat}
It is probably far from correct, but at least it gives us acceptable performance:


{noformat}
[root@jch3-devel:~/c4] time echo "SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) > TOKEN('a14521') LIMIT 1;" | cqlsh

 a      | d
--------+--------
 a13934 | d13934


real    0m15.359s
user    0m0.313s
sys     0m0.090s

[root@jch3-devel:~/c4] time echo "SELECT a, d FROM t.t WHERE b = 'b1' AND a > 'a14521' LIMIT 1;" | cqlsh

 a      | d
--------+--------
 a13934 | d13934


real    0m0.449s
user    0m0.323s
sys     0m0.106s
{noformat}

We do not think that {{TOKEN}} range requests with secondary indexes should be so much slower.
Can you please exentend CQL3 with possibility to parition keys range requests with secondary indexes (than change this issue to feature request), or make {{TOKEN}} range requests with secondary indexes faster?



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)