Posted to user@cassandra.apache.org by Serega Sheypak <se...@gmail.com> on 2015/04/04 13:03:43 UTC

Timeseries analysis using Cassandra and partition by date period

Hi, I switched from HBase to Cassandra and am trying to find a solution for
time-series analysis on top of Cassandra.
I have an entity named "Event".
"Event" has these attributes:
user_id - the user who triggered the event
event_ts - when the event happened
event_type - the type of event
some_other_attr - some other attributes we don't care about right now.

The DDL for the event entity looks like this:

CREATE TABLE user_plans (
  id timeuuid,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text,
  PRIMARY KEY (user_id, event_ts)
);

The table is "infinite": it will grow continuously over the application's
lifetime.
I want to ask this question:
Cassandra, give me all events where event_ts >= xxx and event_ts <= yyy.

Right now that would lead to a full table scan.

There is a trick in HBase. HBase has a table abstraction and a column
family abstraction.
A column family must be declared in advance.
Physically, a column family is a set of HFiles (the counterpart of SSTables
in C*).
So I can easily add partitioning to my HBase table:
alter 'hbase_events', NAME => '2015_01'
and store all January 2015 data in the column family named '2015_01'.

When I want to get the January data, I access the column family named
'2015_01' directly and don't touch all the data in the table, just this piece.
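
The month-per-column-family idea above boils down to mapping each event
timestamp to a 'YYYY_MM' label. A minimal sketch of that mapping follows; the
function name month_family is hypothetical, and using UTC for the month
boundary is an assumption, not something stated in this thread.

```python
from datetime import datetime, timezone


def month_family(event_ts: datetime) -> str:
    """Return the 'YYYY_MM' label for the month that contains event_ts.

    Hypothetical helper: the month boundary is taken in UTC (an assumption).
    """
    ts = event_ts.astimezone(timezone.utc)
    return f"{ts.year:04d}_{ts.month:02d}"


# An event from mid-January 2015 lands in the '2015_01' family.
print(month_family(datetime(2015, 1, 17, 9, 30, tzinfo=timezone.utc)))  # 2015_01
```

The same label could serve as a bucket value in any store that groups data by
month.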

What is the approach in C* in this case?
One idea is to create several tables: event_2015_01, event_2015_02, etc.,
but that looks rather ugly given my current understanding of how it works.

Re: Timeseries analysis using Cassandra and partition by date period

Posted by Serega Sheypak <se...@gmail.com>.

Thank you, we'll take a look at that tool.

2015-04-06 12:30 GMT+02:00 Srinivasa T N <se...@gmail.com>:

> Comparison to OpenTSDB HBase
>
> For one we do not use id’s for strings. The string data (metric names and
> tags) are written to row keys and the appropriate indexes. Because
> Cassandra has much wider rows there are far fewer keys written to the
> database. The space saved by using id’s is minor and by not using id’s we
> avoid having to use any kind of locks across the cluster.
>
> As mentioned, Cassandra has wider rows. The default row size in
> OpenTSDB HBase is 1 hour; in Cassandra it is set to 3 weeks.
> http://kairosdb.github.io/kairosdocs/CassandraSchema.html

Re: Timeseries analysis using Cassandra and partition by date period

Posted by Srinivasa T N <se...@gmail.com>.

 Comparison to OpenTSDB HBase

For one we do not use id’s for strings. The string data (metric names and
tags) are written to row keys and the appropriate indexes. Because
Cassandra has much wider rows there are far fewer keys written to the
database. The space saved by using id’s is minor and by not using id’s we
avoid having to use any kind of locks across the cluster.

As mentioned, Cassandra has wider rows. The default row size in OpenTSDB
HBase is 1 hour; in Cassandra it is set to 3 weeks.
http://kairosdb.github.io/kairosdocs/CassandraSchema.html
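
As a rough illustration of what a fixed row width means, a timestamp can be
mapped to the start of its row bucket by rounding down to a multiple of the
width. This is a minimal sketch, not KairosDB or OpenTSDB code: the function
names and the use of millisecond timestamps are assumptions made for the
example.

```python
HOUR_MS = 60 * 60 * 1000
THREE_WEEKS_MS = 3 * 7 * 24 * HOUR_MS


def row_start(timestamp_ms: int, width_ms: int) -> int:
    """Round a timestamp down to the start of its row bucket."""
    return timestamp_ms - (timestamp_ms % width_ms)


def rows_spanned(start_ms: int, end_ms: int, width_ms: int) -> int:
    """Count the row buckets touched by the inclusive range [start_ms, end_ms]."""
    first = row_start(start_ms, width_ms)
    last = row_start(end_ms, width_ms)
    return (last - first) // width_ms + 1


# Six weeks of data for one series: many 1-hour rows versus two 3-week rows.
six_weeks_ms = 2 * THREE_WEEKS_MS
print(rows_spanned(0, six_weeks_ms - 1, HOUR_MS))         # 1008
print(rows_spanned(0, six_weeks_ms - 1, THREE_WEEKS_MS))  # 2
```

The wider the row, the fewer row keys a range read has to visit, at the cost
of larger rows.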

On Mon, Apr 6, 2015 at 3:27 PM, Serega Sheypak <se...@gmail.com>
wrote:

> Thanks, is it a kind of OpenTSDB?

Re: Timeseries analysis using Cassandra and partition by date period

Posted by Serega Sheypak <se...@gmail.com>.

Thanks, is it a kind of OpenTSDB?

2015-04-05 18:28 GMT+02:00 Kevin Burton <bu...@spinn3r.com>:

> > Hi, I switched from HBase to Cassandra and am trying to find a solution
> > for time-series analysis on top of Cassandra.
>
> Depending on what you’re looking for, you might want to check out KairosDB.
>
> 0.95 beta2 just shipped yesterday as well so you have good timing.
>
> https://github.com/kairosdb/kairosdb

Re: Timeseries analysis using Cassandra and partition by date period

Posted by Kevin Burton <bu...@spinn3r.com>.

> Hi, I switched from HBase to Cassandra and am trying to find a solution
> for time-series analysis on top of Cassandra.

Depending on what you’re looking for, you might want to check out KairosDB.

0.95 beta2 just shipped yesterday as well so you have good timing.

https://github.com/kairosdb/kairosdb

On Sat, Apr 4, 2015 at 11:29 AM, Serega Sheypak <se...@gmail.com>
wrote:

> Okay, so whether to bucket by day, week, or month depends on capacity
> planning and on the actual questions I want to ask.
> To conclude:
> I have an events table
>
> CREATE TABLE user_plans (
>   id timeuuid,
>   user_id timeuuid,
>   event_ts timestamp,
>   event_type int,
>   some_other_attr text,
>   PRIMARY KEY (user_id, event_ts)
> );
> which fits tactical queries:
> select smth from user_plans where user_id='xxx' and event_ts > now()
>
> Then I create a second table, user_plans_daily (or weekly, or monthly),
>
> with this DDL:
> CREATE TABLE user_plans_daily/weekly/monthly (
>   ymd int,
>   user_id timeuuid,
>   event_ts timestamp,
>   event_type int,
>   some_other_attr text,
>   PRIMARY KEY ((ymd, user_id), event_ts)
> )
> WITH CLUSTERING ORDER BY (event_ts DESC);
>
> This table is good for answering strategic questions:
> select * from
> user_plans_daily/weekly/monthly
> where ymd in (....)
> And I should avoid a long list inside the IN clause, which is why you
> suggest that I create a bigger bucket, correct?
>
>
> 2015-04-04 20:00 GMT+02:00 Jack Krupansky <ja...@gmail.com>:
>
>> It sounds like your time bucket should be a month, but it depends on the
>> amount of data per user per day and your main query range. Within the
>> partition you can then query for a range of days.
>>
>> Yes, all of the rows within a partition are stored on one physical node
>> as well as the replica nodes.
>>
>> -- Jack Krupansky
>>
>> On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak <se...@gmail.com>
>> wrote:
>>
>>> >non-equal relation on a partition key is not supported
>>> OK, can I generate a select query like this:
>>> select some_attributes
>>> from events where ymd = 20150101 or ymd = 20150102 or ymd = 20150103 ...
>>> or ymd = 20150331
>>>
>>> > The partition key determines which node can satisfy the query
>>> So you mean that all rows with the same *(ymd, user_id)* would be on
>>> one physical node?
>>>
>>>
>>> 2015-04-04 16:38 GMT+02:00 Jack Krupansky <ja...@gmail.com>:
>>>
>>>> Unfortunately, a non-equal relation on a partition key is not
>>>> supported. You would need to bucket by some larger unit, like a month, and
>>>> then use the date/time as a clustering column for the row key. Then you
>>>> could query within the partition. The partition key determines which node
>>>> can satisfy the query. Designing your partition key judiciously is the key
>>>> (haha!) to performant Cassandra applications.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <
>>>> serega.sheypak@gmail.com> wrote:
>>>>
>>>>> Hi, we plan to have 10^8 users, and each user could generate 10 events
>>>>> per day.
>>>>> So we have:
>>>>> 10^9 records per day
>>>>> 10^9*30 records per month.
>>>>> Our analysis time window could be from 1 to 6 months.
>>>>>
>>>>> Right now the PK is PRIMARY KEY (user_id, event_ts), where event_ts is
>>>>> the exact timestamp of the event.
>>>>>
>>>>> So you suggest this approach:
>>>>> PRIMARY KEY ((ymd, user_id), event_ts)
>>>>> WITH CLUSTERING ORDER BY (event_ts DESC);
>>>>>
>>>>> where ymd=20150102 (the second of January)?
>>>>>
>>>>> *What happens to writes:*
>>>>> Do SSTables with past days (ymd < current_day) stay untouched and take
>>>>> no part in the compaction process, since there are no changes to them?
>>>>>
>>>>> *What happens to reads:*
>>>>> I issue this query:
>>>>> select some_attributes
>>>>> from events where ymd >= 20150101 and ymd < 20150301
>>>>> Does Cassandra skip SSTables which don't have ymd in the specified range
>>>>> and give me a kind of partition elimination, like in traditional DBs?
>>>>>
>>>>>
>>>>> 2015-04-04 14:41 GMT+02:00 Jack Krupansky <ja...@gmail.com>:
>>>>>
>>>>>> It depends on the actual number of events per user, but simply
>>>>>> bucketing the partition key can give you the same effect - clustering rows
>>>>>> by time range. A composite partition key could be comprised of the user
>>>>>> name and the date.
>>>>>>
>>>>>> It also depends on the data rate - is it many events per day or just
>>>>>> a few events per week, or over what time period. You need to be careful -
>>>>>> you don't want your Cassandra partitions to be too big (millions of rows)
>>>>>> or too small (just a few or even one row per partition.)
>>>>>>
>>>>>> -- Jack Krupansky
>>>>>>
>>>>>
>>>>
>>>
>>
>


-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>

Re: Timeseries analysis using Cassandra and partition by date period

Posted by Serega Sheypak <se...@gmail.com>.

Okay, so the bucket size (day/week/month) is a matter of capacity planning
and of the actual questions I want to ask.
In conclusion:
I have a table of events

CREATE TABLE user_plans (
  id timeuuid,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text,
  PRIMARY KEY (user_id, event_ts)
);
which fits tactical queries:
select smth from user_plans where user_id = xxx and event_ts > current_time

Then I create a second table, user_plans_daily (or weekly, monthly),

with DDL:
CREATE TABLE user_plans_daily (  -- or user_plans_weekly / user_plans_monthly
  ymd int,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text,
  PRIMARY KEY ((ymd, user_id), event_ts)
)
WITH CLUSTERING ORDER BY (event_ts DESC);

And this table is good for answering strategic questions:
select * from
user_plans_daily  -- or weekly / monthly
where ymd in (....)
And I should avoid a long list of values inside the IN clause, which is why
you suggest a bigger bucket, correct?
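
A rough way to see how the bucket size drives the length of the IN list: the
Python sketch below counts how many bucket values a query window expands to.
The helper names day_buckets and month_buckets are hypothetical, and the
sketch assumes daily buckets are integers like 20150102 and monthly buckets
are integers like 201501. Nothing here is part of Cassandra or its drivers.

```python
from datetime import date, timedelta

def day_buckets(start, end):
    """Daily bucket values (yyyymmdd) covering start..end inclusive."""
    days = (end - start).days
    return [int((start + timedelta(days=i)).strftime("%Y%m%d"))
            for i in range(days + 1)]

def month_buckets(start, end):
    """Monthly bucket values (yyyymm) covering start..end inclusive."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(year * 100 + month)
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return buckets

window = (date(2015, 1, 1), date(2015, 3, 31))
print(len(day_buckets(*window)))   # 90 values for a three-month window
print(month_buckets(*window))      # [201501, 201502, 201503]
```

For the same three-month window, daily buckets need 90 values where monthly
buckets need 3.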



Re: Timeseries analysis using Cassandra and partition by date period

Posted by Jack Krupansky <ja...@gmail.com>.

It sounds like your time bucket should be a month, but it depends on the
amount of data per user per day and your main query range. Within the
partition you can then query for a range of days.

Yes, all of the rows within a partition are stored on one physical node as
well as the replica nodes.
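
As an illustration of querying a range of days inside one monthly partition,
here is a minimal Python sketch. The table name events_by_month, the bucket
column ym and the helper day_range_params are assumptions made for this
example, not names from the thread; the statement is only built as a string
with bind markers and is not sent to a cluster.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical table with PRIMARY KEY ((ym, user_id), event_ts).
# Equality on both partition key columns, range on the clustering column.
RANGE_QUERY = (
    "SELECT event_ts, event_type FROM events_by_month "
    "WHERE ym = ? AND user_id = ? "
    "AND event_ts >= ? AND event_ts < ?"
)

def day_range_params(user_id, year, month, first_day, last_day):
    """Bind values selecting first_day..last_day (inclusive) of one month."""
    start = datetime(year, month, first_day, tzinfo=timezone.utc)
    # Exclusive upper bound: midnight following last_day.
    end = datetime(year, month, last_day, tzinfo=timezone.utc) + timedelta(days=1)
    return (year * 100 + month, user_id, start, end)

print(RANGE_QUERY)
print(day_range_params("some-user-id", 2015, 1, 10, 20))
```

A window that spans several months would need one such query per month
bucket.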

-- Jack Krupansky


Re: Timeseries analysis using Cassandra and partition by date period

Posted by Serega Sheypak <se...@gmail.com>.

>non-equal relation on a partition key is not supported
OK, can I generate a select query like this:
select some_attributes
from events where ymd = 20150101 or ymd = 20150102 or ymd = 20150103 ... or
ymd = 20150331

> The partition key determines which node can satisfy the query
So you mean that all rows with the same (ymd, user_id) would be on one
physical node?
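
Note that the CQL WHERE clause has no OR operator, so the query above cannot
be written literally. One possible alternative, sketched in Python below, is
to issue one equality query per bucket. The helper per_bucket_params is
hypothetical, and the sketch assumes a table whose partition key is
(ymd, user_id), so user_id has to be supplied as well. It only prepares the
bind values and does not talk to a cluster.

```python
from datetime import date, timedelta

# Assumed table with PRIMARY KEY ((ymd, user_id), event_ts).
PER_BUCKET_QUERY = (
    "SELECT some_attributes FROM events WHERE ymd = ? AND user_id = ?"
)

def per_bucket_params(user_id, start, end):
    """Yield one (ymd, user_id) tuple per day in start..end inclusive."""
    day = start
    while day <= end:
        yield (day.year * 10000 + day.month * 100 + day.day, user_id)
        day += timedelta(days=1)

params = list(per_bucket_params("user-1", date(2015, 1, 1), date(2015, 3, 31)))
print(len(params), params[0], params[-1])
```

Each tuple corresponds to a single-partition read. The number of such reads
grows with the window length, which is the cost a larger bucket reduces.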



Re: Timeseries analysis using Cassandra and partition by date period

Posted by Jack Krupansky <ja...@gmail.com>.

Unfortunately, a non-equal relation on a partition key is not supported.
You would need to bucket by some larger unit, like a month, and then use
the date/time as a clustering column for the row key. Then you could query
within the partition. The partition key determines which node can satisfy
the query. Designing your partition key judiciously is the key (haha!) to
performant Cassandra applications.
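
A minimal sketch of that layout, in Python with the DDL held as a string. The
table name events_by_month, the bucket column ym and the helper month_bucket
are assumptions for illustration only; the other columns are copied from the
original post.

```python
from datetime import datetime, timezone

# Assumed schema: month bucket plus user in the partition key,
# event time as the clustering column.
CREATE_EVENTS_BY_MONTH = """
CREATE TABLE events_by_month (
  ym int,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text,
  PRIMARY KEY ((ym, user_id), event_ts)
) WITH CLUSTERING ORDER BY (event_ts DESC);
"""

def month_bucket(event_ts):
    """Return the yyyymm bucket (in UTC) for a timezone-aware datetime."""
    utc = event_ts.astimezone(timezone.utc)
    return utc.year * 100 + utc.month

print(month_bucket(datetime(2015, 1, 2, 13, 3, tzinfo=timezone.utc)))  # 201501
```

The writer computes ym from the event timestamp at insert time, and readers
compute the same value to address the partition.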

-- Jack Krupansky


Re: Timeseries analysis using Cassandra and partition by date period

Posted by Serega Sheypak <se...@gmail.com>.

Hi, we plan to have 10^8 users and each user could generate 10 events per
day.
So we have up to:
10^9 records per day
10^9*30 records per month.
Our time window for analysis could be from 1 to 6 months.
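
Taking the stated inputs at face value (10^8 users, up to 10 events per user
per day), the upper bound on volume can be worked out as in the short Python
sketch below. The 30-day month is a simplifying assumption.

```python
USERS = 10**8                  # from the message above
EVENTS_PER_USER_PER_DAY = 10   # stated as an upper bound ("could generate")
DAYS_PER_MONTH = 30            # simplifying assumption

def records(days):
    """Upper bound on the number of records written over `days` days."""
    return USERS * EVENTS_PER_USER_PER_DAY * days

for label, days in [("day", 1), ("month", DAYS_PER_MONTH),
                    ("6 months", 6 * DAYS_PER_MONTH)]:
    print(f"{label}: {records(days):.1e} records")
```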

Right now the PK is PRIMARY KEY (user_id, ends), where ends is the exact
timestamp of the event.

So you suggest this approach:
PRIMARY KEY ((ymd, user_id), event_ts)
WITH CLUSTERING ORDER BY (event_ts DESC);

where ymd=20150102 (the second of January)?

What happens to writes:
Do SSTables with past days (ymd < current_day) stay untouched and not take
part in the compaction process, since there are no changes to them?

What happens to reads:
I issue the query:
select some_attributes
from events where ymd >= 20150101 and ymd < 20150301
Does Cassandra skip SSTables which don't have ymd in the specified range and
give me a kind of partition elimination, like in traditional DBs?



Re: Timeseries analysis using Cassandra and partition by date period

Posted by Jack Krupansky <ja...@gmail.com>.

It depends on the actual number of events per user, but simply bucketing
the partition key can give you the same effect - clustering rows by time
range. A composite partition key could be comprised of the user name and
the date.

It also depends on the data rate - is it many events per day or just a few
events per week, or over what time period. You need to be careful - you
don't want your Cassandra partitions to be too big (millions of rows) or
too small (just a few or even one row per partition.)
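
To make the "too big or too small" trade-off concrete, the Python sketch below
estimates rows per partition for a composite (bucket, user) partition key. It
uses the figure of 10 events per user per day given elsewhere in this thread;
the bucket lengths and the helper rows_per_partition are assumptions for
illustration.

```python
EVENTS_PER_USER_PER_DAY = 10   # figure given elsewhere in this thread

# Bucket lengths in days; the 30-day month is an approximation.
BUCKET_DAYS = {"day": 1, "week": 7, "month": 30}

def rows_per_partition(bucket):
    """Rows in one (bucket, user_id) partition at the assumed event rate."""
    return EVENTS_PER_USER_PER_DAY * BUCKET_DAYS[bucket]

for bucket in BUCKET_DAYS:
    print(bucket, rows_per_partition(bucket))
```

At that rate a daily bucket holds only about 10 rows per partition, while a
monthly bucket holds about 300, which is still far below millions of rows.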

-- Jack Krupansky

On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <se...@gmail.com>
wrote:

> Hi, I switched from HBase to Cassandra and try to find problem solution
> for timeseries analysis on top Cassandra.
> I have a entity named "Event".
> "Event" has attributes:
> user_id - a guy who triggered event
> event_ts - when even happened
> event_type - type of event
> some_other_attr - some other attrs we don't care about right now.
>
> The DDL for entity event looks this way:
>
> CREATE TABLE user_plans (
>
>   id timeuuid,
>   user_id timeuuid,
>   event_ts timestamp,
>   event_type int,
>   some_other_attr text
>
> PRIMARY KEY (user_id, ends)
> );
>
> Table is "infinite", It would grow continuously during application
> lifetime.
> I want to ask question:
> Cassandra, give me all event where event_ts >= xxx and event_ts <=yyy.
>
> Right now it would lead to full table scan.
>
> There is a trick in HBase. HBase has table abstraction and HBase has
> Column Family abstraction.
> Column family should be declared in advance.
> Column family - physically is a pack of HFiles ("SSTables in C*").
> So I can easily add partitioning for my HBase table:
> alter table hbase_events add column familiy '2015_01'
> and store all 2015 January data to Column familiy named '2015_01'.
>
> When I want to get January data, I would directly access column family
> named '2015_01' and I won't massage all data in table, just this piece.
>
> What is approach in C* in this case?
> I have an idea create several tables: event_2015_01, event_2015_02, e.t.c.
> but it looks rather ugly from my current understanding how it works.
>
>
>