You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Tupshin Harper <tu...@tupshin.com> on 2014/03/01 01:54:19 UTC

Re: Getting the most-recent version from time-series data

You are correct that with that schema, all data for a give key would be in
a single partition, and hence on the same node(s). I missed that before.

-Tupshin




On Fri, Feb 28, 2014 at 12:47 PM, Clint Kelly <cl...@gmail.com> wrote:

> Hi Tupshin,
>
> Thanks for your help once again, I really appreciate it.  Quick question
> regarding the issue of token-aware routing, etc.  Let's say that I am using
> the table described earlier:
>
>
> CREATE TABLE time_series_stuff (
>   key text,
>   family text,
>   version int,
>   val text,
>   PRIMARY KEY (key, family, version)
> ) WITH CLUSTERING ORDER BY (family ASC, version DESC)
>
> I want to retrieve values for the most-recent version of every family for
> a given key, doing something like:
>
>     SELECT * from time_series_stuff where key='mykey'
>
> but getting only one version per family.
>
> All of this data should live on the same node (or set of replica nodes),
> correct?  I am specifying the partition key here, and I thought that only
> the partition key determined on what physical nodes data exists.
> Therefore, I would think that all of the results from this query would come
> from a single replica node (or set of replica nodes, if the consistency
> level is greater than 1).
>
> Would you mind clarifying?  Thanks a lot!
>
> Best regards,
> Clint
>
>
>
>
>
>
> On Wed, Feb 26, 2014 at 4:56 AM, Tupshin Harper <tu...@tupshin.com>wrote:
>
>> And one last clarification. Where I said "stored procedure" earlier, I
>> meant "prepared statement". Sorry for the confusion. Too much typing while
>> tired.
>>
>> -Tupshin
>>
>>
>> On Tue, Feb 25, 2014 at 10:36 PM, Tupshin Harper <tu...@tupshin.com>wrote:
>>
>>> I failed to address the matter of not knowing the families in advance.
>>>
>>> I can't really recommend any solution to that other than storing the
>>> list of families in another structure that is readily queryable. I don't
>>> know how many families you are thinking, but if it is in the millions or
>>> more, You might consider constructing another table such as:
>>> CREATE TABLE families (
>>>   key int,
>>>   family text,
>>>   PRIMARY KEY (key, family)
>>> );
>>>
>>>
>>> store your families there, with a knowable set of keys (I suggest
>>> something like the last 3 digits of the md5 hash of the family). So then
>>> you could retrieve your families in nice sized batches
>>> SELECT family FROM id WHERE key=0;
>>> and then do the fan-out selects that I described previously.
>>>
>>> -Tupshin
>>>
>>>
>>> On Tue, Feb 25, 2014 at 10:15 PM, Tupshin Harper <tu...@tupshin.com>wrote:
>>>
>>>> Hi Clint,
>>>>
>>>> What you are describing could actually be accomplished with the Thrift
>>>> API and a multiget_slice with a slicerange having a count of 1. Initially I
>>>> was thinking that this was an important feature gap between Thrift and CQL,
>>>> and was going to suggest that it should be implemented (possible syntax is
>>>> in https://issues.apache.org/jira/browse/CASSANDRA-6167 which is
>>>> almost a superset of this feature).
>>>>
>>>> But then I was convinced by some colleagues, that with a modern CQL
>>>> driver that is token aware, you are actually better off (in terms of
>>>> latency, throughput, and reliability), by doing each query separately on
>>>> the client.
>>>>
>>>> The reasoning is that if you did this with a single query, it would
>>>> necessarily be sent to a coordinator that wouldn't own most of the data
>>>> that you are looking for. That coordinator would then need to fan out the
>>>> read to all the nodes owning the partitions you are looking for.
>>>>
>>>> Far better to just do it directly on the client. The token aware client
>>>> will send each request for a row straight to a node that owns it. With a
>>>> separate connection open to each node, this is done in parallel from the
>>>> get-go. Fewer hops. Less load on the coordinator. No bottlenecks. And with
>>>> a stored procedure, very very little additional overhead to the client,
>>>> server, or network.
>>>>
>>>> -Tupshin
>>>>
>>>>
>>>> On Tue, Feb 25, 2014 at 7:48 PM, Clint Kelly <cl...@gmail.com>wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> Let's say that I have a table that looks like the following:
>>>>>
>>>>> CREATE TABLE time_series_stuff (
>>>>>   key text,
>>>>>   family text,
>>>>>   version int,
>>>>>   val text,
>>>>>   PRIMARY KEY (key, family, version)
>>>>> ) WITH CLUSTERING ORDER BY (family ASC, version DESC) AND
>>>>>   bloom_filter_fp_chance=0.010000 AND
>>>>>   caching='KEYS_ONLY' AND
>>>>>   comment='' AND
>>>>>   dclocal_read_repair_chance=0.000000 AND
>>>>>   gc_grace_seconds=864000 AND
>>>>>   index_interval=128 AND
>>>>>   read_repair_chance=0.100000 AND
>>>>>   replicate_on_write='true' AND
>>>>>   populate_io_cache_on_flush='false' AND
>>>>>   default_time_to_live=0 AND
>>>>>   speculative_retry='99.0PERCENTILE' AND
>>>>>   memtable_flush_period_in_ms=0 AND
>>>>>   compaction={'class': 'SizeTieredCompactionStrategy'} AND
>>>>>   compression={'sstable_compression': 'LZ4Compressor'};
>>>>>
>>>>> cqlsh:fiddle> select * from time_series_stuff ;
>>>>>
>>>>>  key    | family  | version | val
>>>>> --------+---------+---------+--------
>>>>>  monday | revenue |       3 | $$$$$$
>>>>>  monday | revenue |       2 |    $$$
>>>>>  monday | revenue |       1 |     $$
>>>>>  monday | revenue |       0 |      $
>>>>>  monday | traffic |       2 | medium
>>>>>  monday | traffic |       1 |  light
>>>>>  monday | traffic |       0 |  heavy
>>>>>
>>>>> (7 rows)
>>>>>
>>>>> Now let's say that I'd like to perform a query that gets me the most
>>>>> recent N versions of "revenue" and "traffic."
>>>>>
>>>>> Is there a CQL query to do this?  Let's say that N=1.  Then I know
>>>>> that I can do:
>>>>>
>>>>> cqlsh:fiddle> select * from time_series_stuff where key='monday' and
>>>>> family='revenue' limit 1;
>>>>>
>>>>>  key    | family  | version | val
>>>>> --------+---------+---------+--------
>>>>>  monday | revenue |       3 | $$$$$$
>>>>>
>>>>> (1 rows)
>>>>>
>>>>> cqlsh:fiddle> select * from time_series_stuff where key='monday' and
>>>>> family='traffic' limit 1;
>>>>>
>>>>>  key    | family  | version | val
>>>>> --------+---------+---------+--------
>>>>>  monday | traffic |       2 | medium
>>>>>
>>>>> (1 rows)
>>>>>
>>>>> But what if I have lots of "families" and I want to get the most
>>>>> recent N versions of all of them in a single CQL statement.  Is that
>>>>> possible?  Unfortunately I am working on something where the family names
>>>>> and the number of most-recent versions are not known a priori (I am porting
>>>>> some code that was designed for HBase).
>>>>>
>>>>> Best regards,
>>>>> Clint
>>>>>
>>>>
>>>>
>>>
>>
>