You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "graham sanderson (JIRA)" <ji...@apache.org> on 2014/06/26 03:15:25 UTC

[jira] [Updated] (CASSANDRA-7449) Variation of SELECT DISTINCT to find clustering keys with only static columns

     [ https://issues.apache.org/jira/browse/CASSANDRA-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

graham sanderson updated CASSANDRA-7449:
----------------------------------------

    Description: 
A possible use case for static columns involves (per partition) multiple small TTL time series data values combined with a potentially much larger static piece of data.

While the TTL time series data will go away on its own, there is no way to TTL the static data (and keep it updated with the latest TTL) without re-inserting it every time to reset the TTL (which is undesirable since it is large and unchanged)

The use case looks something like this:

{code}
CREATE KEYSPACE test WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': '1'
};

USE test;

CREATE TABLE expiring_series (
	id text,
	series_order int,
	small_data text,
	large_data text static,
	PRIMARY KEY (id, series_order)
);

INSERT INTO expiring_series (id, large_data) VALUES ('123', 'this is large and should not be inserted every time');
SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;
SELECT id FROM expiring_series;
SELECT DISTINCT id FROM expiring_series;
SELECT STATICONLY id FROM expiring_series;

INSERT INTO expiring_series (id, large_data) VALUES ('123', 'this is large and should not be inserted every time');
INSERT INTO expiring_series (id, series_order, small_data) VALUES ('123', 1, 'antelope') USING TTL 120;

// time passes (point A)

INSERT INTO expiring_series (id, series_order, small_data) VALUES ('123', 2, 'gibbon') USING TTL 120;

// time passes (point B)

INSERT INTO expiring_series (id, series_order, small_data) VALUES ('123', 3, 'firebucket') USING TTL 120;

// time passes (point C)

// time passes and the first row expires (point D)

// more time passes and eventually all the "rows" expire (point E)
{code}

GIven the way the storage engine works, there is no trivial way to make the static column expire when the last row expires, however if there was an easy way to find partitions with no regular rows (just static columns), then that would make manual clean up easy

The possible implementation of such a feature is very similar to SELECT DISTINCT, so I'm suggesting SELECT STATICONLY

Looking at the points again

h4. Point A
{code}
cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;

 id  | series_order | small_data | large_data                                          | ttl(small_data)
-----+--------------+------------+-----------------------------------------------------+-----------------
 123 |            1 |   antelope | this is large and should not be inserted every time |             108

(1 rows)

cqlsh:test> SELECT id FROM expiring_series;

 id
-----
 123

(1 rows)

cqlsh:test> SELECT DISTINCT id FROM expiring_series;

 id
-----
 123

(1 rows)

cqlsh:test> SELECT STATICONLY id FROM expiring_series;

(0 rows)
{code}
h4. Point B
{code}
cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;

 id  | series_order | small_data | large_data                                          | ttl(small_data)
-----+--------------+------------+-----------------------------------------------------+-----------------
 123 |            1 |   antelope | this is large and should not be inserted every time |              87
 123 |            2 |     gibbon | this is large and should not be inserted every time |             111

(2 rows)

cqlsh:test> SELECT id FROM expiring_series;

 id
-----
 123
 123

(2 rows)

cqlsh:test> SELECT DISTINCT id FROM expiring_series;

 id
-----
 123

(1 rows)

cqlsh:test> SELECT STATICONLY id FROM expiring_series;

(0 rows)
{code}
h4. Point C
{code}
cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;

 id  | series_order | small_data | large_data                                          | ttl(small_data)
-----+--------------+------------+-----------------------------------------------------+-----------------
 123 |            1 |   antelope | this is large and should not be inserted every time |              67
 123 |            2 |     gibbon | this is large and should not be inserted every time |              91
 123 |            3 | firebucket | this is large and should not be inserted every time |             110

(3 rows)

cqlsh:test> SELECT id FROM expiring_series;

 id
-----
 123
 123
 123

(3 rows)

cqlsh:test> SELECT DISTINCT id FROM expiring_series;

 id
-----
 123

(1 rows)

cqlsh:test> SELECT STATICONLY id FROM expiring_series;

(0 rows)
{code}
h4. Point D
{code}
cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;

 id  | series_order | small_data | large_data                                          | ttl(small_data)
-----+--------------+------------+-----------------------------------------------------+-----------------
 123 |            2 |     gibbon | this is large and should not be inserted every time |              22
 123 |            3 | firebucket | this is large and should not be inserted every time |              41

(2 rows)

cqlsh:test> SELECT id FROM expiring_series;

 id
-----
 123
 123

(2 rows)

cqlsh:test> SELECT DISTINCT id FROM expiring_series;

 id
-----
 123

(1 rows)

cqlsh:test> SELECT STATICONLY id FROM expiring_series;

(0 rows)
{code}
h4. Point E
{code}
cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;

 id  | series_order | small_data | large_data                                          | ttl(small_data)
-----+--------------+------------+-----------------------------------------------------+-----------------
 123 |         null |       null | this is large and should not be inserted every time |            null

(1 rows)

cqlsh:test> SELECT id FROM expiring_series;

(0 rows)

cqlsh:test> SELECT DISTINCT id FROM expiring_series;

 id
-----
 123

(1 rows)

cqlsh:test> SELECT STATICONLY id FROM expiring_series;

 id
-----
 123

(1 rows)
{code}

Notice that after the last id has expired SELECT STATICONLY id return the row, and it can be deleted (under whatever concurrency rules the application needs)

{code}
cqlsh:test> DELETE FROM expiring_series where id = '123';
cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;

(0 rows)

cqlsh:test> SELECT id FROM expiring_series;

(0 rows)

cqlsh:test> SELECT DISTINCT id FROM expiring_series;

(0 rows)

cqlsh:test> SELECT STATICONLY id FROM expiring_series;

(0 rows)
{code}

  was:
A possible use case for static columns involves (per partition) multiple small TTL time series data values combined with a potentially much larger static piece of data.

While the TTL time series data will go away on its own, there is no way to TTL the static data (and keep it updated with the latest TTL) without re-inserting it every time to reset the TTL (which is undesirable since it is large and unchanged)

The use case looks something like this:

{code}
CREATE KEYSPACE test WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': '1'
};

USE test;

CREATE TABLE expiring_series (
	id text,
	series_order int,
	small_data text,
	large_data text static,
	PRIMARY KEY (id, series_order)
);

INSERT INTO expiring_series (id, large_data) VALUES ('123', 'this is large and should not be inserted every time');
SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;
SELECT id FROM expiring_series;
SELECT DISTINCT id FROM expiring_series;
SELECT STATICONLY id FROM expiring_series;

INSERT INTO expiring_series (id, large_data) VALUES ('123', 'this is large and should not be inserted every time');
INSERT INTO expiring_series (id, series_order, small_data) VALUES ('123', 1, 'antelope') USING TTL 120;

// time passes (point A)

INSERT INTO expiring_series (id, series_order, small_data) VALUES ('123', 2, 'gibbon') USING TTL 120;

// time passes (point B)

INSERT INTO expiring_series (id, series_order, small_data) VALUES ('123', 3, 'firebucket') USING TTL 120;
{code}

// time passes (point C)

// time passes and the first row expires (point D)

// more time passes and eventually all the "rows" expire (point E)
{code}

GIven the way the storage engine works, there is no trivial way to make the static column expire when the last row expires, however if there was an easy way to find partitions with no regular rows (just static columns), then that would make manual clean up easy

The possible implementation of such a feature is very similar to SELECT DISTINCT, so I'm suggesting SELECT STATICONLY

Looking at the points again

h4. Point A
{code}
cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;

 id  | series_order | small_data | large_data                                          | ttl(small_data)
-----+--------------+------------+-----------------------------------------------------+-----------------
 123 |            1 |   antelope | this is large and should not be inserted every time |             108

(1 rows)

cqlsh:test> SELECT id FROM expiring_series;

 id
-----
 123

(1 rows)

cqlsh:test> SELECT DISTINCT id FROM expiring_series;

 id
-----
 123

(1 rows)

cqlsh:test> SELECT STATICONLY id FROM expiring_series;

(0 rows)
{code}
h4. Point B
{code}
cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;

 id  | series_order | small_data | large_data                                          | ttl(small_data)
-----+--------------+------------+-----------------------------------------------------+-----------------
 123 |            1 |   antelope | this is large and should not be inserted every time |              87
 123 |            2 |     gibbon | this is large and should not be inserted every time |             111

(2 rows)

cqlsh:test> SELECT id FROM expiring_series;

 id
-----
 123
 123

(2 rows)

cqlsh:test> SELECT DISTINCT id FROM expiring_series;

 id
-----
 123

(1 rows)

cqlsh:test> SELECT STATICONLY id FROM expiring_series;

(0 rows)
{code}
h4. Point C
{code}
cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;

 id  | series_order | small_data | large_data                                          | ttl(small_data)
-----+--------------+------------+-----------------------------------------------------+-----------------
 123 |            1 |   antelope | this is large and should not be inserted every time |              67
 123 |            2 |     gibbon | this is large and should not be inserted every time |              91
 123 |            3 | firebucket | this is large and should not be inserted every time |             110

(3 rows)

cqlsh:test> SELECT id FROM expiring_series;

 id
-----
 123
 123
 123

(3 rows)

cqlsh:test> SELECT DISTINCT id FROM expiring_series;

 id
-----
 123

(1 rows)

cqlsh:test> SELECT STATICONLY id FROM expiring_series;

(0 rows)
{code}
h4. Point D
{code}
cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;

 id  | series_order | small_data | large_data                                          | ttl(small_data)
-----+--------------+------------+-----------------------------------------------------+-----------------
 123 |            2 |     gibbon | this is large and should not be inserted every time |              22
 123 |            3 | firebucket | this is large and should not be inserted every time |              41

(2 rows)

cqlsh:test> SELECT id FROM expiring_series;

 id
-----
 123
 123

(2 rows)

cqlsh:test> SELECT DISTINCT id FROM expiring_series;

 id
-----
 123

(1 rows)

cqlsh:test> SELECT STATICONLY id FROM expiring_series;

(0 rows)
{code}
h4. Point E
{code}
cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;

 id  | series_order | small_data | large_data                                          | ttl(small_data)
-----+--------------+------------+-----------------------------------------------------+-----------------
 123 |         null |       null | this is large and should not be inserted every time |            null

(1 rows)

cqlsh:test> SELECT id FROM expiring_series;

(0 rows)

cqlsh:test> SELECT DISTINCT id FROM expiring_series;

 id
-----
 123

(1 rows)

cqlsh:test> SELECT STATICONLY id FROM expiring_series;

 id
-----
 123

(1 rows)
{code}

Notice that after the last id has expired SELECT STATICONLY id return the row, and it can be deleted (under whatever concurrency rules the application needs)

{code}
cqlsh:test> DELETE FROM expiring_series where id = '123';
cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;

(0 rows)

cqlsh:test> SELECT id FROM expiring_series;

(0 rows)

cqlsh:test> SELECT DISTINCT id FROM expiring_series;

(0 rows)

cqlsh:test> SELECT STATICONLY id FROM expiring_series;

(0 rows)
{code}


> Variation of SELECT DISTINCT to find clustering keys with only static columns
> -----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7449
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7449
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API, Core
>            Reporter: graham sanderson
>
> A possible use case for static columns involves (per partition) multiple small TTL time series data values combined with a potentially much larger static piece of data.
> While the TTL time series data will go away on its own, there is no way to TTL the static data (and keep it updated with the latest TTL) without re-inserting it every time to reset the TTL (which is undesirable since it is large and unchanged)
> The use case looks something like this:
> {code}
> CREATE KEYSPACE test WITH replication = {
>   'class': 'SimpleStrategy',
>   'replication_factor': '1'
> };
> USE test;
> CREATE TABLE expiring_series (
> 	id text,
> 	series_order int,
> 	small_data text,
> 	large_data text static,
> 	PRIMARY KEY (id, series_order)
> );
> INSERT INTO expiring_series (id, large_data) VALUES ('123', 'this is large and should not be inserted every time');
> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;
> SELECT id FROM expiring_series;
> SELECT DISTINCT id FROM expiring_series;
> SELECT STATICONLY id FROM expiring_series;
> INSERT INTO expiring_series (id, large_data) VALUES ('123', 'this is large and should not be inserted every time');
> INSERT INTO expiring_series (id, series_order, small_data) VALUES ('123', 1, 'antelope') USING TTL 120;
> // time passes (point A)
> INSERT INTO expiring_series (id, series_order, small_data) VALUES ('123', 2, 'gibbon') USING TTL 120;
> // time passes (point B)
> INSERT INTO expiring_series (id, series_order, small_data) VALUES ('123', 3, 'firebucket') USING TTL 120;
> // time passes (point C)
> // time passes and the first row expires (point D)
> // more time passes and eventually all the "rows" expire (point E)
> {code}
> GIven the way the storage engine works, there is no trivial way to make the static column expire when the last row expires, however if there was an easy way to find partitions with no regular rows (just static columns), then that would make manual clean up easy
> The possible implementation of such a feature is very similar to SELECT DISTINCT, so I'm suggesting SELECT STATICONLY
> Looking at the points again
> h4. Point A
> {code}
> cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;
>  id  | series_order | small_data | large_data                                          | ttl(small_data)
> -----+--------------+------------+-----------------------------------------------------+-----------------
>  123 |            1 |   antelope | this is large and should not be inserted every time |             108
> (1 rows)
> cqlsh:test> SELECT id FROM expiring_series;
>  id
> -----
>  123
> (1 rows)
> cqlsh:test> SELECT DISTINCT id FROM expiring_series;
>  id
> -----
>  123
> (1 rows)
> cqlsh:test> SELECT STATICONLY id FROM expiring_series;
> (0 rows)
> {code}
> h4. Point B
> {code}
> cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;
>  id  | series_order | small_data | large_data                                          | ttl(small_data)
> -----+--------------+------------+-----------------------------------------------------+-----------------
>  123 |            1 |   antelope | this is large and should not be inserted every time |              87
>  123 |            2 |     gibbon | this is large and should not be inserted every time |             111
> (2 rows)
> cqlsh:test> SELECT id FROM expiring_series;
>  id
> -----
>  123
>  123
> (2 rows)
> cqlsh:test> SELECT DISTINCT id FROM expiring_series;
>  id
> -----
>  123
> (1 rows)
> cqlsh:test> SELECT STATICONLY id FROM expiring_series;
> (0 rows)
> {code}
> h4. Point C
> {code}
> cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;
>  id  | series_order | small_data | large_data                                          | ttl(small_data)
> -----+--------------+------------+-----------------------------------------------------+-----------------
>  123 |            1 |   antelope | this is large and should not be inserted every time |              67
>  123 |            2 |     gibbon | this is large and should not be inserted every time |              91
>  123 |            3 | firebucket | this is large and should not be inserted every time |             110
> (3 rows)
> cqlsh:test> SELECT id FROM expiring_series;
>  id
> -----
>  123
>  123
>  123
> (3 rows)
> cqlsh:test> SELECT DISTINCT id FROM expiring_series;
>  id
> -----
>  123
> (1 rows)
> cqlsh:test> SELECT STATICONLY id FROM expiring_series;
> (0 rows)
> {code}
> h4. Point D
> {code}
> cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;
>  id  | series_order | small_data | large_data                                          | ttl(small_data)
> -----+--------------+------------+-----------------------------------------------------+-----------------
>  123 |            2 |     gibbon | this is large and should not be inserted every time |              22
>  123 |            3 | firebucket | this is large and should not be inserted every time |              41
> (2 rows)
> cqlsh:test> SELECT id FROM expiring_series;
>  id
> -----
>  123
>  123
> (2 rows)
> cqlsh:test> SELECT DISTINCT id FROM expiring_series;
>  id
> -----
>  123
> (1 rows)
> cqlsh:test> SELECT STATICONLY id FROM expiring_series;
> (0 rows)
> {code}
> h4. Point E
> {code}
> cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;
>  id  | series_order | small_data | large_data                                          | ttl(small_data)
> -----+--------------+------------+-----------------------------------------------------+-----------------
>  123 |         null |       null | this is large and should not be inserted every time |            null
> (1 rows)
> cqlsh:test> SELECT id FROM expiring_series;
> (0 rows)
> cqlsh:test> SELECT DISTINCT id FROM expiring_series;
>  id
> -----
>  123
> (1 rows)
> cqlsh:test> SELECT STATICONLY id FROM expiring_series;
>  id
> -----
>  123
> (1 rows)
> {code}
> Notice that after the last id has expired SELECT STATICONLY id return the row, and it can be deleted (under whatever concurrency rules the application needs)
> {code}
> cqlsh:test> DELETE FROM expiring_series where id = '123';
> cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;
> (0 rows)
> cqlsh:test> SELECT id FROM expiring_series;
> (0 rows)
> cqlsh:test> SELECT DISTINCT id FROM expiring_series;
> (0 rows)
> cqlsh:test> SELECT STATICONLY id FROM expiring_series;
> (0 rows)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)