Posted to commits@cassandra.apache.org by "jonathan lacefield (JIRA)" <ji...@apache.org> on 2015/03/26 01:38:53 UTC
[jira] [Comment Edited] (CASSANDRA-9028) Optimize LIMIT execution to mitigate need for a full partition scan

    [ https://issues.apache.org/jira/browse/CASSANDRA-9028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381123#comment-14381123 ] 

jonathan lacefield edited comment on CASSANDRA-9028 at 3/26/15 12:38 AM:
-------------------------------------------------------------------------

Hi Sylvain,

  My apologies if my statement isn't strictly correct.  We thought this because of some behavior observed with a user.  We executed 2 types of queries with tracing enabled and noticed that the LIMIT query "touched" every sstable in which the particular partition existed.  However, the clustering key specific query only "touched" the individual sstable that contained the partition and the clustering key value that was queried.
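
  The observation above can be illustrated with a toy model.  This is only a sketch and not Cassandra's actual read path: the SSTable class, the two read functions, and the rule that a point lookup skips an sstable based on its min/max clustering bounds are all assumptions made for illustration.  Timestamps and tombstones are ignored, and the partition is assumed to use descending clustering order.

```python
from dataclasses import dataclass


@dataclass
class SSTable:
    """Hypothetical stand-in for one sstable: rows of a single partition,
    keyed by clustering value."""
    rows: dict

    @property
    def min_ck(self):
        return min(self.rows)

    @property
    def max_ck(self):
        return max(self.rows)


def read_limit(sstables, limit):
    """Slice-style read (assumed): consult every sstable that holds the
    partition, merge the rows, then apply the limit."""
    touched = 0
    merged = {}
    for sst in sstables:
        touched += 1
        merged.update(sst.rows)
    # Assumed CLUSTERING ORDER BY ... DESC, so the "latest" row comes first.
    ordered = sorted(merged.items(), reverse=True)
    return ordered[:limit], touched


def read_by_clustering_key(sstables, ck):
    """Point read (assumed): skip any sstable whose clustering bounds
    cannot contain the requested value."""
    touched = 0
    result = None
    for sst in sstables:
        if not (sst.min_ck <= ck <= sst.max_ck):
            continue  # skipped without being read
        touched += 1
        if ck in sst.rows:
            result = (ck, sst.rows[ck])
    return result, touched


# One partition spread across three hypothetical sstables.
sstables = [
    SSTable({1: "a", 2: "b", 3: "c"}),
    SSTable({4: "d", 5: "e", 6: "f"}),
    SSTable({7: "g", 8: "h", 9: "i"}),
]

limit_rows, limit_touched = read_limit(sstables, 1)
point_row, point_touched = read_by_clustering_key(sstables, 9)
print(limit_rows, limit_touched)
print(point_row, point_touched)
```

  In this model both reads return the same row, but the LIMIT 1 read consults all three sstables while the point read consults one, which mirrors the difference seen in the traces.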

  I have recreated this behavior using the following, and attached, examples.
  test.ddl - simple table with a partition key and clustering column
  tracing.out - output of tracing the 2 types of queries (with queries)
  Data.*.json - output of each json file.

   If my statement is incorrect, will you please help clarify the internals that are impacting these results?  

  The goal of this enhancement request is for each query to perform the same from a latency perspective.  My thinking is that for the performance to be the same, the traces should look the same.


was (Author: jlacefie):
Hi Sylvain,

  My apologies if my statement isn't specifically correct.  We thought this because of some behavior observed with a user.  The behavior observed was that we executed 2 types of queries with tracing enabled and noticed that the LIMIT query "touched" all sstables for which the particular partition existed.  However, the clustering key specific query only "touched" the individual sstable that contained the partition and the clustering key value that was queried.

  I have recreated this behavior using the following, and attached, examples.
  test.ddl - simple table with a partition key and clustering column
  trace.out - output of tracing the 2 types of queries (with queries) 
  Data.*.json - output of each json file.

   If my statement is incorrect, will you please help clarify the internals that are impacting these results?  The goal of this enhancement is that each query would perform, from a latency perspective, the same.  My thinking is that the tracing should appear the same for the performance to be the same.

> Optimize LIMIT execution to mitigate need for a full partition scan
> -------------------------------------------------------------------
>
>                 Key: CASSANDRA-9028
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9028
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API, Core
>            Reporter: jonathan lacefield
>         Attachments: Data.1.json, Data.2.json, Data.3.json, test.ddl, tracing.out
>
>
> Currently, a SELECT statement for a single Partition Key that contains a LIMIT X clause will fetch an entire partition from a node and place the partition into memory prior to applying the limit clause and returning results to be served to the client via the coordinator.
> This JIRA is to request an optimization for the CQL LIMIT clause to avoid the entire partition retrieval step, and instead only retrieve the components to satisfy the LIMIT condition.
> Ideally, any LIMIT X would avoid the need to retrieve a full partition.  This may not be possible, though.  As a compromise, it would still be incredibly beneficial if a LIMIT 1 clause could be optimized to retrieve only the "latest" item.  Ideally, a LIMIT 1 would "operationally behave" the same way as a WHERE clause that specifies the clustering key value of the "latest" item, i.e. the row that LIMIT 1 would return.
> We can supply some trace results, if desired, to help show the difference between 2 different queries that perform the same logical function.
>   For example, a query that returns the latest value for a clustering col where QUERY 1 uses a LIMIT 1 clause and QUERY 2 uses a WHERE <clustering col> = <latest value>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)