You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@cassandra.apache.org by "Philip Thompson (JIRA)" <ji...@apache.org> on 2015/03/18 22:03:39 UTC

[jira] [Updated] (CASSANDRA-7805) Performance regression in multi-get (in clause) due to automatic paging

     [ https://issues.apache.org/jira/browse/CASSANDRA-7805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip Thompson updated CASSANDRA-7805:
---------------------------------------
    Fix Version/s: 2.0.14

> Performance regression in multi-get (in clause) due to automatic paging
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-7805
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7805
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: J.B. Langston
>            Priority: Minor
>             Fix For: 2.0.14
>
>
> Comparative benchmarking of 1.2 vs. 2.0 shows a regression in multi-get (in clause) queries due to automatic paging.  Take the following example:
> select myId, col1, col2, col3 from myTable where col1 = 'xyz' and myId IN (id1, id1, ..., id100); // primary key is (myId, col1)
> We were suprised to see that in 2.0, the above query was giving an order of magnitude worse performance than in 1.2. Digging in, I believe it is due to the issue described in the comment at the top of MultiPartitionPager.java (v2.0.9): "Note that this is not easy to make efficient. Indeed, we need to page the first command fully before returning results from the next one, but if the result returned by each command is small (compared to pageSize), paging the commands one at a time under-performs compared to parallelizing."
> The perf regression is due to the new paging feature in 2.0. The server is executing the read for each id in the IN clause *sequentially* in order to implement the paging semantics.
> The wisdom of using multi-get like this has been debated in other forums, but the thing that's unfortunate from a user point of view, is if they had a bunch of code working against 1.2 and then they upgrade their cluster to 2.0 and all of a sudden start to see an order of magnitude or worse perf regression. That will be perceived as a problem. I think it would surprise anyone not familiar with the code that the separate reads for the IN clause would be done sequentially rather than in parallel.
> As a workaround, disable paging in the Java driver by setting fetchSize to Integer.MAX_VALUE on your QueryOptions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)