Posted to commits@cassandra.apache.org by "Edward Capriolo (JIRA)" <ji...@apache.org> on 2012/11/05 22:44:12 UTC

[jira] [Created] (CASSANDRA-4915) CQL should force limit when query samples data.

Edward Capriolo created CASSANDRA-4915:
------------------------------------------

             Summary: CQL should force limit when query samples data.
                 Key: CASSANDRA-4915
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4915
             Project: Cassandra
          Issue Type: Improvement
    Affects Versions: 1.2.0 beta 1
            Reporter: Edward Capriolo
            Priority: Minor


When issuing a query like:
{noformat}
CREATE TABLE videos (
  videoid uuid,
  videoname varchar,
  username varchar,
  description varchar,
  tags varchar,
  upload_date timestamp,
  PRIMARY KEY (videoid,videoname)
);
SELECT * FROM videos WHERE videoname = 'My funny cat';
{noformat}

Cassandra samples some data using get_range_slice and then applies the query.

This is very confusing to me because, as an end user, I am not sure whether the query is fast because Cassandra is performing an optimized query (over an index, or using a slicePredicate) or because Cassandra is simply sampling some random rows and returning me some results.

My suggestions:
1) force people to supply a LIMIT clause on any query that is going to
page over get_range_slice
2) having some type of explain support so I can establish if this
query will work in the

I will champion suggestion 1) because CQL has put itself in a rather unique, un-SQL-like position by applying an automatic LIMIT clause without the user asking for it. I also do not believe the CQL language should let the user issue queries that will not work as intended with "larger-than-auto-limit" size data sets.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4915) CQL should force limit when query samples data.

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491820#comment-13491820 ] 

Sylvain Lebresne commented on CASSANDRA-4915:
---------------------------------------------

bq. and relying on the limit being X vs 10X or 0.1X is silly

While I agree on the silliness, I don't fully share your optimism that people won't start relying on it. I would also prefer being able to clearly specify that "without a limit we return as many results as there are, with the technical limitation that it's Integer.MAX_VALUE", rather than having to settle for "without a limit we return results with a limit that depends on the weather and whose exact value you shouldn't rely on". I also think that having an arbitrary default limit is a very bad OOM protection (I think it's still fairly easy to OOM even with the 10,000 limit unless you are mindful of your query). But I'd rather discuss that in CASSANDRA-4918 for the sake of not mixing unrelated issues.

Because I do think there is an issue here that has nothing to do whatsoever with the limit and preventing OOMing. That issue is that we allow some queries that do not scale with the number of records in the database. And to be clear, 'not scale with the number of records in the database' means that even for a *constant* query output it doesn't scale. Those queries are:
# the one in the description of this ticket
# as Jonathan said (and I don't disagree with his statement), secondary index queries with additional restrictions.

Now I agree that we can't completely protect people against those short of refusing the queries. But I do think we have some discrepancies in what we support and don't support: we refuse 'SELECT * FROM t WHERE partition_key = .. AND clustering_key_part2 = ...' based on the argument that, because clustering_key_part1 is not provided, we would have to do a full scan of the internal row and the inefficiency of that would be too surprising for the user. But we do allow the query in the description of this ticket even though honestly it's the same kind of query (i.e., it's a query where we don't have *any* index to really start with).
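
The refusal described above can be pictured as a prefix check on the clustering columns. The following is only a minimal sketch under stated assumptions: the function name restricts_clustering_prefix is hypothetical, it is not Cassandra's validation code, and it ignores relation types (equality versus range) that a real check would also have to consider.

```python
def restricts_clustering_prefix(clustering_columns, restricted_columns):
    """Return True if the restricted clustering columns form a prefix
    of the clustering key (hypothetical helper, not Cassandra code)."""
    restricted = set(restricted_columns)
    seen_gap = False
    for column in clustering_columns:
        if column in restricted:
            if seen_gap:
                # A later column is restricted while an earlier one is not:
                # answering this would mean scanning the whole internal row.
                return False
        else:
            seen_gap = True
    return True


clustering = ["clustering_key_part1", "clustering_key_part2"]

# Only part2 is restricted, part1 is skipped: this is the refused case.
refused = not restricts_clustering_prefix(clustering, ["clustering_key_part2"])

# Restricting part1 alone, or both parts, keeps the lookup a contiguous slice.
accepted = restricts_clustering_prefix(clustering, ["clustering_key_part1"])
```

The query in the ticket description would not be caught by a check like this one, because it provides no partition key at all, and that gap is the discrepancy being pointed out.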

And I don't like discrepancies. In other words, we've claimed that an advantage of Cassandra is that query performance is predictable, but having queries that, for the same output (even a very small one), have an execution time proportional to the number of records in the database is imho the exact definition of query performance being non-predictable (or at least non-scalable). So I think it would be of interest to clarify what exactly we guarantee in terms of query performance being predictable. For that I see a number of options:
# We leave things as they are, but then the rules for when a query will have predictable performance (which for me means that the performance depends almost only on the query output) are fairly opaque and not very coherent. In particular, in that case it feels random to refuse queries that would require a full internal row scan when we happily do the ones that require an entire ring scan.
# We get strict about allowing only queries that we can guarantee have predictable performance (with the definition above, which I think is reasonable). That does mean refusing the query in the description, but also indeed queries on 2ndary indexes that have more than one restriction, which probably makes that solution too restrictive to be desirable.
# We try to hit some middle ground where, while we allow some queries for which we can't guarantee predictability, we at least make the rule for when predictability is guaranteed easy to understand and follow. My proposition for "ALLOW FULL SCAN" above was an attempt at that. If we allow that, and unless I forget something, which is possible, I think we can say that a query will have predictable performance unless it either uses a 2ndary index or uses 'allow full scan'. For 2ndary indexes we can refine that a bit and say 'it will still have guaranteed predictable performance if you only use one restriction in the query'. But at least we'd have a clear guarantee without 2ndary indexes, and I do think that 1) it's very useful and 2) it's not crazy to say that 2ndary indexes involve more complex processing and thus offer fewer guarantees in terms of predictability.
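
The rule proposed in the third option is small enough to write down as a function. This is a sketch of the rule as stated in this comment only; the function and parameter names are hypothetical and nothing here reflects what Cassandra actually implements.

```python
def has_predictable_performance(uses_secondary_index, restriction_count,
                                allows_full_scan):
    """Encode the proposed rule (illustrative names, not a Cassandra API)."""
    if allows_full_scan:
        # The user explicitly opted in to a scan: no guarantee is offered.
        return False
    if uses_secondary_index:
        # Refinement: still predictable with exactly one restriction.
        return restriction_count == 1
    # Without a 2ndary index and without 'allow full scan',
    # cost is proportional to the query output.
    return True
```

The point of the rule is that a user can read the answer off the query text, without knowing how the query is executed internally.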

In favor of my third point, I want to mention that this is exactly the guarantee that thrift provides today. A non-2ndary-index query in thrift always gives you predictable performance, in the sense that the query performance is proportional to the query output (which you can control with the limit): a get_range_slice in thrift (without IndexExpression) with a count of 1 will only ever scan one row (and if that one row doesn't have anything for the filter, the result will be an empty row). That is *not* how CQL3 works today.
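
The difference between the two behaviours can be illustrated with a toy model that counts how many rows each style examines. This is a simplified sketch, not Cassandra or thrift code: the table is a plain Python list, and the function names thrift_style_range_slice and cql3_style_select are made up for the illustration.

```python
def thrift_style_range_slice(rows, predicate, count):
    """Scan at most `count` rows; a row failing the filter comes back empty."""
    examined = rows[:count]
    results = [row if predicate(row) else None for row in examined]
    return results, len(examined)


def cql3_style_select(rows, predicate, limit):
    """Keep scanning until `limit` matching rows are found or rows run out."""
    results = []
    examined = 0
    for row in rows:
        examined += 1
        if predicate(row):
            results.append(row)
            if len(results) >= limit:
                break
    return results, examined


def make_table(n):
    """Build n rows in which only the last one matches the predicate."""
    rows = [{"videoid": i, "videoname": "other"} for i in range(n - 1)]
    rows.append({"videoid": n - 1, "videoname": "My funny cat"})
    return rows


def is_cat(row):
    return row["videoname"] == "My funny cat"


# Rows examined for a table of 1000 rows with a single match at the end.
table = make_table(1000)
_, thrift_cost = thrift_style_range_slice(table, is_cat, count=1)
_, cql3_cost = cql3_style_select(table, is_cat, limit=1)
```

In this model the thrift-style cost depends only on the requested count, while the CQL3-style cost grows with the table size even though the output is a single row.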

                

[jira] [Commented] (CASSANDRA-4915) CQL should prevent or warn about inefficient queries

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502776#comment-13502776 ] 

Jonathan Ellis commented on CASSANDRA-4915:
-------------------------------------------

It looks like this is reading entire rows into memory.  If so let's just leave the restriction in for 1.2.0 and push the predicate down into the scan for 1.2.x.
                

[jira] [Commented] (CASSANDRA-4915) CQL should prevent or warn about inefficient queries

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503021#comment-13503021 ] 

Jonathan Ellis commented on CASSANDRA-4915:
-------------------------------------------

All right, let's just run with part 1 for now.
                

[jira] [Updated] (CASSANDRA-4915) CQL should prevent or warn about inefficient queries

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-4915:
----------------------------------------

    Attachment:     (was: 4915.patch)
    

[jira] [Updated] (CASSANDRA-4915) CQL should prevent or warn about inefficient queries

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-4915:
--------------------------------------

             Reviewer: jbellis
             Priority: Major  (was: Minor)
    Affects Version/s:     (was: 1.2.0 beta 1)
                       0.8.0
        Fix Version/s: 1.2.0 rc1
             Assignee: Sylvain Lebresne
    

[jira] [Commented] (CASSANDRA-4915) CQL should prevent or warn about inefficient queries

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502785#comment-13502785 ] 

Sylvain Lebresne commented on CASSANDRA-4915:
---------------------------------------------

bq. It looks like this is reading entire rows into memory

It does and I'm fine leaving that second part to later.
                

[jira] [Commented] (CASSANDRA-4915) CQL should prevent or warn about inefficient queries

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500124#comment-13500124 ] 

Jonathan Ellis commented on CASSANDRA-4915:
-------------------------------------------

Tagging for 1.2 so we don't break working queries later.
                

[jira] [Commented] (CASSANDRA-4915) CQL should prevent or warn about inefficient queries

Posted by "Edward Capriolo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498239#comment-13498239 ] 

Edward Capriolo commented on CASSANDRA-4915:
--------------------------------------------

I am unclear on what "WITH FILTERING ALLOWED" does. Is it a clause that must be applied to let this query run? Or is it a clause that changes how the query runs?

SELECT * FROM videos WHERE videoname = 'My funny cat' WITH FILTERING ALLOWED;

SELECT * FROM videos WHERE videoname = 'My funny cat' ;

I do agree that LIMIT is the wrong concept because that limits the result set and that will not short-circuit a query.

A better syntax could be this: 

SELECT * FROM videos WHERE videoname = 'My funny cat' WITH MAX_EXAMINED = 5;

Where MAX_EXAMINED would be a count of columns/rows (?) processed. Whenever we iterate over more than MAX_EXAMINED, we short-circuit and return what we have (which may be nothing).
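
The MAX_EXAMINED idea can be sketched as a scan with a budget on rows examined rather than on rows returned. This is a minimal illustration assuming rows are the unit being counted; the function name and the extra truncated flag are additions for the sketch and are not part of the proposal.

```python
def select_with_max_examined(rows, predicate, max_examined):
    """Return (matches, examined, truncated) after examining at most
    `max_examined` rows. Illustrative helper, not a Cassandra API."""
    matches = []
    examined = 0
    for row in rows:
        if examined >= max_examined:
            # Budget exhausted with rows left over: short-circuit.
            return matches, examined, True
        examined += 1
        if predicate(row):
            matches.append(row)
    return matches, examined, False


def is_funny_cat(row):
    return row["videoname"] == "My funny cat"


# Ten rows, with the only match at position 7.
videos = [{"videoname": "other"} for _ in range(10)]
videos[7] = {"videoname": "My funny cat"}

# A budget of 5 stops before the match is reached and returns nothing.
early = select_with_max_examined(videos, is_funny_cat, max_examined=5)

# A budget of 10 covers the whole table and finds the match.
full = select_with_max_examined(videos, is_funny_cat, max_examined=10)
```

Unlike LIMIT, the budget bounds the work done, so an empty result can mean either "no match exists" or "no match in the rows examined"; the truncated flag in this sketch is one way to tell the two apart.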
                

[jira] [Commented] (CASSANDRA-4915) CQL should prevent or warn about inefficient queries

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498130#comment-13498130 ] 

Sylvain Lebresne commented on CASSANDRA-4915:
---------------------------------------------

bq. How about WITH FILTERING ALLOWED?

Agreed, that's a better name.
                

[jira] [Commented] (CASSANDRA-4915) CQL should force limit when query samples data.

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491549#comment-13491549 ] 

Sylvain Lebresne commented on CASSANDRA-4915:
---------------------------------------------

bq. What do you think about forcing the construct 'WHERE token(key)=0'?

I don't think it solves the problem, honestly. Nothing in that construct tells you that we won't use an index to answer your query and that the query will almost surely time out if you have lots of rows but few matching the "videoname = 'My funny cat'" predicate. In fact, when/if we support indexing on a clustering key component (videoname in that case), it will make complete sense to do an indexed query with a 'token(key) > 0' condition (meaning, we allow this for indexed queries today and that doesn't imply the query is a full scan).
                

[jira] [Updated] (CASSANDRA-4915) CQL should prevent or warn about inefficient queries

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-4915:
----------------------------------------

    Attachment: 0002-Allow-non-indexed-expr-with-ALLOW-FILTERING.txt
                0001-4915.txt
    
> CQL should prevent or warn about inefficient queries
> ----------------------------------------------------
>
>                 Key: CASSANDRA-4915
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4915
>             Project: Cassandra
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Edward Capriolo
>            Assignee: Sylvain Lebresne
>             Fix For: 1.2.0 rc1
>
>         Attachments: 0001-4915.txt, 0002-Allow-non-indexed-expr-with-ALLOW-FILTERING.txt
>
>

[jira] [Commented] (CASSANDRA-4915) CQL should prevent or warn about inefficient queries

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498119#comment-13498119 ] 

Jonathan Ellis commented on CASSANDRA-4915:
-------------------------------------------

Okay, I think I'm on board with option 3.

How about {{WITH FILTERING ALLOWED}}?  {{FULL SCAN}} is a bit misleading in the indexed case.  (And in the case of a simple {{SELECT *}}, which we wouldn't require the modifier for.)
                

[jira] [Commented] (CASSANDRA-4915) CQL should prevent or warn about inefficient queries

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502687#comment-13502687 ] 

Sylvain Lebresne commented on CASSANDRA-4915:
---------------------------------------------

We can, but unfortunately that doesn't work out of the box: if we scan without an index, we read the full internal row and then need to be able to extract the CQL3 rows within it that match the filter. I'm attaching a second patch that does exactly that (and I have a dtest to check that this works as expected).
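
A minimal Python sketch of the extraction step described above. It assumes a simplified model, not the actual storage engine, in which an internal row is a dict mapping (clustering value, column name) pairs to values; `internal_row` and `cql3_rows_matching` are hypothetical names used only for illustration.

```python
from collections import defaultdict


def cql3_rows_matching(partition_key, internal_row, clustering_value):
    """Group the cells of one internal row into CQL3 rows and keep only
    those whose clustering column equals clustering_value.

    internal_row maps (clustering, column_name) -> value. This layout is a
    simplification for illustration, not Cassandra's real on-disk format.
    """
    grouped = defaultdict(dict)
    for (clustering, column), value in internal_row.items():
        grouped[clustering][column] = value
    return [
        {"videoid": partition_key, "videoname": clustering, **columns}
        for clustering, columns in grouped.items()
        if clustering == clustering_value
    ]


# One internal row (one videoid) holding two CQL3 rows.
internal_row = {
    ("My funny cat", "username"): "ed",
    ("My funny cat", "tags"): "cats",
    ("Other video", "username"): "ed",
}
rows = cql3_rows_matching("id-1", internal_row, "My funny cat")
```

The whole internal row is read even though only one of its CQL3 rows survives the filter, which is the extra work the comment refers to.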
                

[jira] [Commented] (CASSANDRA-4915) CQL should force limit when query samples data.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491734#comment-13491734 ] 

Jonathan Ellis commented on CASSANDRA-4915:
-------------------------------------------

Note that while implicit {{LIMIT}} does not prevent expensive queries like this, it does keep you from OOMing the server!  So it is useful in that respect.

I don't see why we'd need a CQL4 when we don't need that anymore, though.  We respect user-specified {{LIMIT}} already, and relying on the limit being X vs 10X or 0.1X is silly.  But we could codify that as "Cassandra may, but is not required to, impose a limit if none is specified."
                

[jira] [Commented] (CASSANDRA-4915) CQL should force limit when query samples data.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491730#comment-13491730 ] 

Jonathan Ellis commented on CASSANDRA-4915:
-------------------------------------------

Short of real native paging (CASSANDRA-4415), I don't think this is really preventable.  {{ALLOW FULL SCAN}} would only give you a false sense of security; consider {{SELECT * FROM users WHERE first_name='Ben' AND last_name='Higgenbotham'}}.  If first_name is indexed but not last_name, and you have millions of Bens and a handful of Higgenbothams, you have the same problem even though our simplistic heuristic of "is it indexed?" would consider it "safe."
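
The Ben/Higgenbotham case above can be simulated with a small Python sketch. It is an in-memory model only: the list comprehension standing in for the index lookup and the name `indexed_then_filtered` are assumptions made for illustration, not Cassandra code.

```python
def indexed_then_filtered(users, first_name, last_name):
    """Simulate a query served by an index on first_name only.

    Returns (matches, examined): the rows satisfying both predicates, and
    the number of rows fetched through the index before the last_name
    filter is applied.
    """
    # Stand-in for the secondary index lookup on first_name.
    by_first_name = [u for u in users if u["first_name"] == first_name]
    # last_name is not indexed, so it is checked row by row afterwards.
    matches = [u for u in by_first_name if u["last_name"] == last_name]
    return matches, len(by_first_name)


users = [{"first_name": "Ben", "last_name": f"Smith{i}"} for i in range(100_000)]
users.append({"first_name": "Ben", "last_name": "Higgenbotham"})
matches, examined = indexed_then_filtered(users, "Ben", "Higgenbotham")
```

The query counts as "indexed", yet every Ben is read to return a single row, so an "is it indexed?" check alone says little about the cost.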
                

[jira] [Updated] (CASSANDRA-4915) CQL should prevent or warn about inefficient queries

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-4915:
----------------------------------------

    Attachment: 4915.patch

Attaching patch for the ALLOW FILTERING part. As said above, I'd rather leave the short-circuiting when we've filtered more than X records to a later ticket, as this is more involved (and we can add the support without it being a breaking change anyway).
                

[jira] [Commented] (CASSANDRA-4915) CQL should prevent or warn about inefficient queries

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13499052#comment-13499052 ] 

Sylvain Lebresne commented on CASSANDRA-4915:
---------------------------------------------

bq. Is it a clause that must be applied to let this query run?

Yes, that's the idea.

bq. Whenever we iterate over more than MAX_EXAMINED we short circuit and return what we have

That's a good idea. However I'd rather see it as a way to fine-tune the behavior of the {{FILTERING ALLOWED}} idea above (even if, in the end, we end up with pretty much the same as what you suggest). Let me explain.

What I'd like to do is:
# refuse queries as they are today when they might involve "filtering" data.  By filtering here I mean that some records are read but discarded from the resultSet.
# add an {{ALLOW FILTERING}} syntax that "unlocks" those queries (as in, allows the query to run).
# when {{ALLOW FILTERING}} is used, allow specifying the maximum number of filtered records with, say, {{ALLOW FILTERING MAX 500}}.

I believe we've reached consensus on 1., and the arguments are above.
Now 2. + 3. is pretty much the equivalent of Ed's idea (more precisely, using {{LIMIT X ALLOW FILTERING MAX Y}} would be the equivalent of {{LIMIT X MAX_PREPARED X+Y}} if I understand Ed's proposal right). However, the reasons why I think we should allow 2. alone are that:
* I do think 2. is useful in its own right. Or rather, you may have cases where you want all results, period. Of course you could provide a very big value for the max filtered, but that's lame. Or another way to put it is that it's one thing to say "I understand this query may do some unknown amount of useless work underneath but go ahead" and a slightly different one to control exactly how much of that useless work you allow.
* Part 3. is a bit of a break of the API abstraction. What I mean here is that the actual behavior/result of a MAX_EXAMINED will depend on implementation details. Say tomorrow we somehow optimize how many records are actually examined to answer a query; then a query with MAX_EXAMINED may return a different result tomorrow even on the exact same setting. Part 2. doesn't have this problem, and so while I'm good with having 3. because I see how it can be useful, I'd rather not have it alone.
* On the very practical side of things, part 3. is more complex to implement.  I'm pretty sure it'll require some storage engine changes, for instance. Also, I think there are points to clarify: if you shortcut the query, how does the user know if the query was shortcut or not? We can probably add some flag to the ResultSet I suppose, or something else, but the point is that I'd rather take the time to do that part right. Meaning that I think shoving it into 1.2.0 at this point is imho a bad idea. So I'd rather do part 2. now, which I'm confident is well defined, and improve with part 3. later.
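
One possible reading of parts 2. and 3. can be sketched in Python. This is an illustration under assumptions, not the attached patch: `filtered_scan`, `max_filtered` and the returned `truncated` flag are hypothetical names, and the flag stands in for the "flag on the ResultSet" idea mentioned above.

```python
def filtered_scan(records, predicate, limit, max_filtered=None):
    """Scan records, keeping those that satisfy predicate.

    Returns (results, truncated). The scan stops normally once `limit`
    results are collected. If max_filtered is given, the scan is cut short
    (truncated=True) as soon as more than max_filtered records have been
    read and discarded. max_filtered=None models plain ALLOW FILTERING.
    """
    results = []
    filtered = 0
    for record in records:
        if len(results) >= limit:
            break
        if predicate(record):
            results.append(record)
        else:
            filtered += 1
            if max_filtered is not None and filtered > max_filtered:
                return results, True
    return results, False


# Roughly "LIMIT 10 ALLOW FILTERING MAX 250" over 1000 records.
capped, capped_truncated = filtered_scan(
    range(1000), lambda x: x % 100 == 0, limit=10, max_filtered=250
)
# Roughly "LIMIT 3 ALLOW FILTERING" with no cap on discarded records.
uncapped, uncapped_truncated = filtered_scan(
    range(1000), lambda x: x % 100 == 0, limit=3
)
```

Both calls return the same three results, but only the first is truncated, so without the flag the caller could not tell a complete answer from a cut-short one.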

                

[jira] [Commented] (CASSANDRA-4915) CQL should prevent or warn about inefficient queries

Posted by "Edward Capriolo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13499117#comment-13499117 ] 

Edward Capriolo commented on CASSANDRA-4915:
--------------------------------------------

I agree on all counts. 

Part 3's undefined nature makes it difficult to solve. I think it would be beneficial for very wide rows or large collections. In a thrift slice today we have start, end and size to constrain the problem.

                

[jira] [Commented] (CASSANDRA-4915) CQL should force limit when query samples data.

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492414#comment-13492414 ] 

Sylvain Lebresne commented on CASSANDRA-4915:
---------------------------------------------

Btw, I realized that I would be fine with requiring 'ALLOW FULL SCAN' for 2ndary index queries when we know we suck at them (i.e. when we have a restriction that is not indexed or for which we don't use the index).
                

[jira] [Commented] (CASSANDRA-4915) CQL should force limit when query samples data.

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491326#comment-13491326 ] 

Sylvain Lebresne commented on CASSANDRA-4915:
---------------------------------------------

I agree that us doing a full scan for that kind of query is confusing. In fact, that's a break of our otherwise applied rule: we don't allow queries that are not indexed. In that case we don't have an index and fall back to a full scan, which we never do otherwise.

So the "logical" thing would just be to refuse that type of query (but to be clear, I think "SELECT * FROM videos" should always be allowed, because there is no surprise here, you've asked everything). We talked about allowing indexing the component of the clustering key (and though it's not done yet, I see no reason not to do it eventually), and once that is done we will be able to do those queries efficiently and it's only then that we should, again in theory, allow them.

Now in practice there is the fact that those queries more or less correspond to range_slice_queries and there is a good chance people would complain if we disallow them. I do note that it's not fully equivalent to the thrift case however, in the sense that in the thrift case you're literally asking for some sub-slice of all rows (or at least a range of rows), and in the result you will get all the rows, but with an empty set of columns if the provided filter selected nothing. In CQL3, you select _only_ the rows _where_ some predicate is true, so you won't get all those internal rows that have nothing for you.

bq. force people to supply a LIMIT clause

I really don't think this is a LIMIT problem and thus I don't think forcing (or doing anything with) LIMIT is the solution. Namely, if you have billions of rows and none of them has {{videoname = 'My funny cat'}}, then whatever limit you provide (even 1), this query will time out. Now I have some things to say about LIMIT and I've created CASSANDRA-4918 for that, but this is a completely orthogonal problem imo.
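
The point about LIMIT can be checked with a small Python sketch. It models a full scan over an in-memory list; `scan_with_limit` is a hypothetical helper, not how Cassandra executes the query.

```python
def scan_with_limit(rows, predicate, limit):
    """Full scan that stops after `limit` matches.

    Returns (matches, rows_read) so the amount of work done is visible.
    """
    matches = []
    rows_read = 0
    for row in rows:
        rows_read += 1
        if predicate(row):
            matches.append(row)
            if len(matches) == limit:
                break
    return matches, rows_read


rows = [{"videoname": f"video {i}"} for i in range(50_000)]
matches, rows_read = scan_with_limit(
    rows, lambda r: r["videoname"] == "My funny cat", limit=1
)
```

With no matching row, `LIMIT 1` never triggers and every row is read, so the limit bounds the result size but not the work.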

So in terms of solutions, here are the ones I would suggest by order of preferences:
# we could add a new {{ALLOW FULL SCAN}} option to {{SELECT}} queries that would explicitly say "I allow the engine to do a full scan and thus I understand my query performance may suck immensely". We would then not allow queries like
{noformat}
SELECT * FROM videos WHERE videoname = 'My funny cat'
{noformat}
  until we support 2ndary indexing videoname, but we would allow
{noformat}
SELECT * FROM videos WHERE videoname = 'My funny cat' ALLOW FULL SCAN
{noformat}
  (alternative syntax could be 'ALLOW NON-INDEXED SCAN' or whatever). I think this would be in line with what we want for Cassandra: make users explicitly conscious of the performance implications of their queries. We could even later extend the support of this 'ALLOW FULL SCAN' bit by bit to other types of queries we refuse today (though I'm certainly not implying this should be a priority).
# if others really don't like my previous idea, I do think that the logical next best thing is to refuse that type of query pure and simple.
# as a last resort (though I don't really like it tbh), we could add some form of simple explain that would tell you whether a query is indexed or not (but I largely prefer the 'you have to explicitly say you're fine with non-indexed' solution).
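
The first option in the list above amounts to a validation step before execution. A rough Python sketch follows, assuming a deliberately simplistic "is the restricted column indexed?" check; `validate_select`, `InvalidQuery` and the `indexed_columns` set are hypothetical and do not come from the Cassandra code base.

```python
class InvalidQuery(Exception):
    """Raised when a query needs a non-indexed scan and has not opted in."""


def validate_select(restricted_columns, indexed_columns, allow_full_scan=False):
    """Refuse restrictions on non-indexed columns unless explicitly allowed.

    A query with no restriction at all (a plain SELECT *) is always
    accepted, since asking for everything carries no surprise.
    """
    unindexed = [c for c in restricted_columns if c not in indexed_columns]
    if unindexed and not allow_full_scan:
        raise InvalidQuery(
            f"restriction on non-indexed column(s) {unindexed}; "
            "add ALLOW FULL SCAN to run the query anyway"
        )
    return True


# Assumption for the example: only the partition key can be looked up directly.
indexed = {"videoid"}
plain_select_ok = validate_select([], indexed)
opted_in_ok = validate_select(["videoname"], indexed, allow_full_scan=True)
```

Calling `validate_select(["videoname"], indexed)` without the opt-in raises `InvalidQuery`, which mirrors refusing the query until the user writes the extra clause.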

                

[jira] [Updated] (CASSANDRA-4915) CQL should prevent or warn about inefficient queries

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-4915:
--------------------------------------

    Summary: CQL should prevent or warn about inefficient queries  (was: CQL should force limit when query samples data.)
    

[jira] [Commented] (CASSANDRA-4915) CQL should force limit when query samples data.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13490994#comment-13490994 ] 

Jonathan Ellis commented on CASSANDRA-4915:
-------------------------------------------

I think you may be seeing CASSANDRA-4858 -- we don't do pre-query sampling.
                


[jira] [Commented] (CASSANDRA-4915) CQL should force limit when query samples data.

Posted by "Edward Capriolo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491523#comment-13491523 ] 

Edward Capriolo commented on CASSANDRA-4915:
--------------------------------------------

What do you think about forcing the construct 'WHERE token(key)=0'? This is like the LIMIT concept, but I believe it makes clear that the query is a range-scanning query and that it starts at some key.
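
For illustration only, the proposed construct applied to the videos table from the description might look like the sketch below. The starting UUID and the LIMIT value are placeholders, the '>' comparison is an assumption (the comment above writes '='), and whether a given Cassandra version accepts the additional videoname filter has not been verified here.

{noformat}
-- Sketch: the token() restriction states explicitly that this is a range scan
-- over partitions, starting from the token of a placeholder videoid.
SELECT * FROM videos
 WHERE token(videoid) > token(00000000-0000-0000-0000-000000000000)
   AND videoname = 'My funny cat'
 LIMIT 100;
{noformat}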
                
