You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Alex Petrov (JIRA)" <ji...@apache.org> on 2016/05/13 08:34:12 UTC

[jira] [Resolved] (CASSANDRA-11319) SELECT DISTINCT Should allow filtering by where clause to support time series

     [ https://issues.apache.org/jira/browse/CASSANDRA-11319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov resolved CASSANDRA-11319.
-------------------------------------
    Resolution: Fixed

{{SELECT DISTINCT}} queries with {{WHERE}} clause only supports restriction by __full__ partition key and/or static columns for now (you can find out more here: [https://issues.apache.org/jira/browse/CASSANDRA-11339]).

I'd first like to remind that allowing filtering, depending on the data layout, might have a severe performance impacts. 

Although judging from what you'd like to query you can actually still avoid keeping the index by using static columns. Since static columns are all same for partition (which is certainly the case in your example), you could keep a combination of {{sensorUnitId}} and {{sensorId}} in a static column and use filtering without an additional index: 

{{code}}
CREATE TABLE tablewithstatic (a int, b int, c int, d int, e int, s int static, PRIMARY KEY ((a,b,c), d));
select distinct a,b,c FROM tablewithstatic WHERE s=1 ALLOW FILTERING;
{{code}}

You can also refer to Patrick's guide on possible alternative partitioning schemas [here|https://academy.datastax.com/resources/getting-started-time-series-data-modeling] for a couple more ideas.

There's also [11031|https://issues.apache.org/jira/browse/CASSANDRA-11031] which will allow filtering for parts of partition key.

> SELECT DISTINCT Should allow filtering by where clause to support time series
> -----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-11319
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11319
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: CQL
>         Environment: Cassandra 3.0.3
>            Reporter: Jason Kania
>
> Due to the lack of built in sharding, we have been trying to split very wide rows. However in trying to find all the sharding column values after the fact, we have not been able to find a solution that is manageable. If a table is defined as follows:
> {code}
> CREATE TABLE IF NOT EXISTS "sensorReadings"
> (
> 	"measurementList" blob,
> 	"sensorId" int,
> 	"sensorUnitId" int,
> 	"shardId" int,
> 	"time" timestamp,
> 	PRIMARY KEY ( ("sensorUnitId", "sensorId", "shardId"), "time" )
> );
> {code}
> then
> {code}
> select DISTINCT "sensorUnitId","sensorId","shardId" from "sensorReadings";
> {code}
> will give all the unique partition keys but this can still be a very large set and so it should be possible to refine this with a where clause that contains only partition columns ie:
> {code}
> select DISTINCT "sensorUnitId","sensorId","shardId" from "sensorReadings" WHERE "sensorUnitId"='sensor17' AND "sensorId"=8;
> {code}
> Without this ability, we am forced to keep a table with available shardIds and update on every write so that we can even query the original table. While several scenarios allow the shardId to be determined automatically, attempts to iterate over the shards are seriously hampered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)