Posted to user@cassandra.apache.org by Naresh Yadav <ny...@gmail.com> on 2014/01/09 14:15:55 UTC

Help on Designing Cassandra table for my usecase

Hi all,

I have a use case involving a huge amount of data that I have not been able to model in Cassandra.

Table name : MetricResult

Sample Data :

Metric=Sales,    Time=Month, Period=Jan-10,     Tag=U.S.A, Tag=Pen,    Value=10
Metric=Sales,    Time=Month, Period=Jan-10,     Tag=U.S.A, Tag=Pencil, Value=20
Metric=Sales,    Time=Month, Period=Feb-10,     Tag=U.S.A, Tag=Pen,    Value=30
Metric=Sales,    Time=Month, Period=Feb-10,     Tag=U.S.A, Tag=Pencil, Value=10
Metric=Sales,    Time=Month, Period=Feb-10,     Tag=India,             Value=90
Metric=Sales,    Time=Year,  Period=2010,       Tag=U.S.A,             Value=70
Metric=Cost,     Time=Year,  Period=2010,       Tag=CPU,               Value=8000
Metric=Cost,     Time=Year,  Period=2010,       Tag=RAM,               Value=4000
Metric=Cost,     Time=Year,  Period=2011,       Tag=CPU,               Value=9000
Metric=Resource, Time=Week,  Period=Week1-2013,                        Value=100

So the above case involves:
         Time-series data, i.e. the Time and Period columns
         Dynamic columns, i.e. the Tag column
         Indexing on dynamic columns, i.e. the Tag column
         Aggregations: SUM, AVERAGE
         If a value arrives again for the same Metric, Time, Period, and Tag,
         it overwrites the existing one

Queries I need to support:
--------------------------------------
a) Give data for Metric=Sales AND Time=Month
       Output: 5 rows
b) Give data for Metric=Sales AND Time=Month AND Period=Jan-10
       Output: 2 rows
c) Give data for Metric=Sales AND Tag=U.S.A
       Output: 5 rows
d) Give data for Metric=Sales AND Period=Jan-10 AND Tag=U.S.A AND Tag=Pen
       Output: 1 row


This table can hold terabytes of data, and a single Metric and Period
combination can have millions of rows.

Please suggest how to design/model this table in Cassandra. If Cassandra has
a limitation here, please suggest the best technology to handle this.


Thanks
Naresh

unsubscribe

Posted by Earl Ruby <er...@webcdr.com>.

unsubscribe

Re: Help on Designing Cassandra table for my usecase

Posted by Thunder Stumpges <th...@gmail.com>.

It does sound like that could work for you. From the sample data it doesn't look like Tag will be high cardinality (relative to the number of rows), so you should be fine as long as you won't have rows with too many tags (collections are best kept small; they claim a collection can hold hundreds of items, but it must not exceed 64k). I don't have any experience with secondary indexes under load, though, and definitely not with indexes on collections.

Looks promising though!
Good luck,
Thunder



> On Jan 10, 2014, at 5:02 AM, Naresh Yadav <ny...@gmail.com> wrote:
> 
> @Vivek, thanks for pointing that out. Other than the primary key, I am defining only one secondary index, on tags. In my case the same tags will certainly repeat across periods for a given metric (e.g. Sales), and the same set of tags can also appear across metrics (Sales, Cost) to some extent, though not always.
> 
> 
> Thanks
> Naresh
> 
> 

Re: Help on Designing Cassandra table for my usecase

Posted by Naresh Yadav <ny...@gmail.com>.

@Vivek, thanks for pointing that out. Other than the primary key, I am
defining only one secondary index, on tags. In my case the same tags will
certainly repeat across periods for a given metric (e.g. Sales), and the
same set of tags can also appear across metrics (Sales, Cost) to some
extent, though not always.


Thanks
Naresh


On Fri, Jan 10, 2014 at 6:05 PM, Vivek Mishra <mi...@gmail.com> wrote:

> @Naresh
> Too many indices or indices with high cardinality should be discouraged
> and are always performance issues. A set will not contain duplicate values.
>
> -Vivek
>
>

Re: Help on Designing Cassandra table for my usecase

Posted by Peter Lin <wo...@gmail.com>.

Indexes on high-cardinality columns are a general database issue, so this
is not unique to Cassandra or NoSQL.


On Fri, Jan 10, 2014 at 7:35 AM, Vivek Mishra <mi...@gmail.com> wrote:

> @Naresh
> Too many indices or indices with high cardinality should be discouraged
> and are always performance issues. A set will not contain duplicate values.
>
> -Vivek
>
>

Re: Help on Designing Cassandra table for my usecase

Posted by Vivek Mishra <mi...@gmail.com>.

@Naresh
Too many indices, or indices on high-cardinality columns, should be
discouraged, as they are always a performance issue. A set will not contain
duplicate values.

-Vivek


On Fri, Jan 10, 2014 at 5:48 PM, Naresh Yadav <ny...@gmail.com> wrote:

> @Thunder
> I just learned about CASSANDRA-4511
> (https://issues.apache.org/jira/browse/CASSANDRA-4511), which allows
> indexes on collections and will be part of release 2.1.
> I hope that in that case my problem can be solved by changing the table
> you designed so that the tag column is a set<text> with a secondary index
> defined on it. Is there any risk of performance problems with this design,
> keeping in mind the huge data volume?
>
>
> Naresh
>

Re: Help on Designing Cassandra table for my usecase

Posted by Naresh Yadav <ny...@gmail.com>.

@Thunder
I just learned about CASSANDRA-4511
(https://issues.apache.org/jira/browse/CASSANDRA-4511), which allows
indexes on collections and will be part of release 2.1.
I hope that in that case my problem can be solved by changing the table
you designed so that the tag column is a set<text> with a secondary index
defined on it. Is there any risk of performance problems with this design,
keeping in mind the huge data volume?


Naresh
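
A minimal sketch of the set<text> variant described above, assuming a
Cassandra release that includes CASSANDRA-4511 (2.1, according to this
message). The table name metric_data_by_tags, the index name, and the
tag_key column are hypothetical and not part of the thread: tag_key is an
assumed canonical string built from the sorted tags, added so that rows
differing only in their tag set do not overwrite each other. The sketch has
not been run against a real cluster.

```cql
-- Hypothetical variant of the thread's table with the tags held in a set.
CREATE TABLE metric_data_by_tags (
  metric  text,
  time    text,
  period  text,
  tag_key text,       -- assumption: sorted tags joined with '|', e.g. 'Pen|U.S.A'
  tags    set<text>,
  value   int,
  PRIMARY KEY ((metric, time), period, tag_key)
);

-- Secondary index on the collection values (the feature tracked by CASSANDRA-4511).
CREATE INDEX metric_data_by_tags_tags_idx ON metric_data_by_tags (tags);

-- First sample row from the original post.
INSERT INTO metric_data_by_tags (metric, time, period, tag_key, tags, value)
VALUES ('Sales', 'Month', 'Jan-10', 'Pen|U.S.A', {'Pen', 'U.S.A'}, 10);

-- Rows in the Sales/Month partition whose tag set contains 'U.S.A'.
SELECT period, tags, value
FROM metric_data_by_tags
WHERE metric = 'Sales' AND time = 'Month' AND tags CONTAINS 'U.S.A';
```

Because (metric, time) is still the partition key, this lookup is restricted
to one partition; finding a tag across every Time value would still take one
such query per granularity.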

On Fri, Jan 10, 2014 at 10:26 AM, Naresh Yadav <ny...@gmail.com> wrote:

> @Thunder, thanks for suggesting a design, but my main problem is
> indexing/querying the dynamic Tag values on each row. They are the main
> context of each row, and most queries will include them.
>
> As an alternative to Cassandra, I tried Apache Blur. In a Blur table I am
> able to store exactly the same data, and all the queries worked, so Blur
> allows dynamic indexing of the tag column. BUT by moving away from
> Cassandra I lose its strengths, so I am not confident in this decision, as
> the data will be huge in my case.
>
> Please guide me on this with better suggestions.
>
> Thanks
> Naresh
>

Re: Help on Designing Cassandra table for my usecase

Posted by Naresh Yadav <ny...@gmail.com>.

@Thunder, thanks for suggesting a design, but my main problem is
indexing/querying the dynamic Tag values on each row. They are the main
context of each row, and most queries will include them.

As an alternative to Cassandra, I tried Apache Blur. In a Blur table I am
able to store exactly the same data, and all the queries worked, so Blur
allows dynamic indexing of the tag column. BUT by moving away from
Cassandra I lose its strengths, so I am not confident in this decision, as
the data will be huge in my case.

Please guide me on this with better suggestions.

Thanks
Naresh

On Fri, Jan 10, 2014 at 2:33 AM, Thunder Stumpges <
thunder.stumpges@gmail.com> wrote:

> Well I think you have essentially time-series data, which C* should handle
> well, however I think your "Tag" column is going to cause troubles. C* does
> have collection columns, but they are not indexable nor usable in WHERE
> clause. Your example has both the uniqueness of the data (primary key) and
> query filtering on potentially multiple "Tag" columns. That is not
> supported in C* AFAIK.If it were a single Tag, that could be a column that
> is Indexed possibly.
>
> Ignoring that issue with the many different Tags, You could model the
> table as:
>
> CREATE TABLE metric_data (
>   metric text,
>   time text,
>   period text,
>   tag text,
>   value int,
>   PRIMARY KEY( (metric,time), period, tag)
> )
>
> That would make a composite partitioning key on metric and time meaning
> you'd always have to pass those (or else randomly page via TOKEN through
> all rows). After specifying metric and time, you could optionally also
> specify period and/or tag, and results would be ordered (clustered) by
> period. This would satisfy your queries a,b, and d but not c (as you did
> not specify time). If Time was a granularity column, does it even make
> sense to return records across differing time values? What does it mean to
> return the 4 month rows and 1 year row in your example? Could you issue N
> queries in this case (where N is a small number of each of your time
> granularities) ?
>
> I'm not sure how close that gets you, or if you can re-work your concept
> of Tag at all.
> Good luck.
> Thunder

Re: Help on Designing Cassandra table for my usecase

Posted by Thunder Stumpges <th...@gmail.com>.

Well, I think you essentially have time-series data, which C* should handle
well; however, I think your "Tag" column is going to cause trouble. C* does
have collection columns, but they are not indexable nor usable in a WHERE
clause. Your example relies on potentially multiple "Tag" columns both for
the uniqueness of the data (primary key) and for query filtering. That is not
supported in C* AFAIK. If it were a single Tag, it could possibly be an
indexed column.

Ignoring that issue with the many different Tags, you could model the table
as:

CREATE TABLE metric_data (
  metric text,
  time text,
  period text,
  tag text,
  value int,
  PRIMARY KEY ((metric, time), period, tag)
);

That would make a composite partition key on metric and time, meaning
you'd always have to pass those (or else page through all rows in token
order via TOKEN). After specifying metric and time, you could optionally also
specify period, and then tag, and results would be ordered (clustered) by
period and then tag. This would satisfy your queries a, b, and d but not c (as
you did not specify time). If Time is a granularity column, does it even make
sense to return records across differing time values? What does it mean to
return the 4 month rows and 1 year row in your example? Could you issue N
queries in this case (where N is the small number of time granularities you
have)?
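
As a rough illustration of how reads against this table behave, here is a small Python model of it: a dict keyed by the partition key (metric, time), with rows inside each partition sorted by the clustering key (period, tag). It is a sketch, not driver code. The helper names `select` and `select_across_times` are hypothetical, only the single-tag rows from the sample data are used, and the list of time granularities is an assumption. CQL normally expects clustering columns to be restricted in order, so the sketch rejects a tag filter that comes without a period.

```python
from collections import defaultdict

# Single-tag rows taken from the sample data: (metric, time, period, tag, value).
ROWS = [
    ("Sales", "Month", "Feb-10", "India", 90),
    ("Sales", "Year", "2010", "U.S.A", 70),
    ("Cost", "Year", "2010", "CPU", 8000),
    ("Cost", "Year", "2010", "RAM", 4000),
    ("Cost", "Year", "2011", "CPU", 9000),
]

# Partition key (metric, time) -> {clustering key (period, tag): value}.
partitions = defaultdict(dict)
for metric, time, period, tag, value in ROWS:
    partitions[(metric, time)][(period, tag)] = value


def select(metric, time, period=None, tag=None):
    """Read one partition, optionally narrowed by period and then tag."""
    if tag is not None and period is None:
        raise ValueError("tag can only be restricted once period is given")
    rows = sorted(partitions.get((metric, time), {}).items())
    return [
        (p, t, v)
        for (p, t), v in rows
        if (period is None or p == period) and (tag is None or t == tag)
    ]


def select_across_times(metric, tag, times=("Week", "Month", "Year")):
    """Query (c) workaround: one read per granularity, filtered client-side."""
    found = []
    for time in times:
        found += [(time, p, t, v) for p, t, v in select(metric, time) if t == tag]
    return found
```

Here `select("Cost", "Year")` has the shape of query (a), `select("Cost", "Year", period="2010")` the shape of query (b), and `select_across_times("Sales", "U.S.A")` issues one read per granularity for query (c). Note that the last helper reads each whole partition and filters on the client, which may be too expensive when a partition holds millions of rows.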

I'm not sure how close that gets you, or if you can re-work your concept of
Tag at all.
Good luck.
Thunder

On Thu, Jan 9, 2014 at 10:45 AM, Hannu Kröger <hk...@gmail.com> wrote:

> To my eye that looks something what the traditional analytics systems do.
> You can check out e.g. Acunu Analytics which uses Cassandra as a backend.
>
> Cheers,
> Hannu

Re: Help on Designing Cassandra table for my usecase

Posted by Hannu Kröger <hk...@gmail.com>.

To my eye that looks like something that traditional analytics systems do.
You could check out e.g. Acunu Analytics, which uses Cassandra as a backend.

Cheers,
Hannu



Re: Help on Designing Cassandra table for my usecase

Posted by Naresh Yadav <ny...@gmail.com>.

@thunder It will be write-once 80% of the time, but there can be cases where
the client makes a correction to the data, and then we need to overwrite it.
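
A minimal Python sketch of the overwrite behaviour described above: a later write with the same Metric, Time, Period and set of Tags replaces the earlier value instead of adding a second row. The `table` dict and `write` helper are hypothetical stand-ins, and treating the tags as an unordered set is an assumption. In Cassandra, a write with an already existing primary key likewise replaces the stored value.

```python
# Hypothetical in-memory stand-in for a table keyed by its full primary key.
table = {}


def write(metric, time, period, tags, value):
    """Store a value; a later write with the same key replaces the earlier one."""
    # frozenset makes the key independent of the order the tags arrive in.
    key = (metric, time, period, frozenset(tags))
    table[key] = value


write("Sales", "Month", "Jan-10", ["U.S.A", "Pen"], 10)
# A client correction for the same key; 12 is a made-up value for illustration.
write("Sales", "Month", "Jan-10", ["Pen", "U.S.A"], 12)
```

After both writes the table still holds a single row for that key, carrying the corrected value.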

Thanks
Naresh


On Thu, Jan 9, 2014 at 11:49 PM, Naresh Yadav <ny...@gmail.com> wrote:

> @thunder thanks for guidance queries will be fired by application on this
> table when users login and browse the application and also through mobile
> apps through webservice. Response needs to be quick as user will be doing
> analysis over this data on the fly. Writes also needs to be fast as there
> is time limit we need to show this data to user everyday.
>
> Aggregation we can build in application outside cassandra. But we are not
> clear what table we should design in cassandra for the queries we
> need..Please give guidance on the possible design to handle dynamic tags
> indexing for queries..
>
> Thanks
> Naresh
>

Re: Help on Designing Cassandra table for my usecase

Posted by Naresh Yadav <ny...@gmail.com>.

@thunder, thanks for the guidance. Queries will be fired by the application
on this table when users log in and browse the application, and also from
mobile apps through a web service. Responses need to be quick, as users will
be analysing this data on the fly. Writes also need to be fast, as there is a
time limit within which we need to show this data to users every day.

We can build the aggregation in the application, outside Cassandra. But we
are not clear what table we should design in Cassandra for the queries we
need. Please give guidance on a possible design to handle indexing of dynamic
tags for queries.

Thanks
Naresh


On Thu, Jan 9, 2014 at 9:41 PM, Thunder Stumpges <thunder.stumpges@gmail.com
> wrote:

> This sort of work sounds much more like a Hadoop/Hive/Pig type of analysis.
>
> What are your latency requirements on queries? Are they ad-hoc or part of
> an application? What is the case where you would need to change an existing
> value? If it is write once, then Hadoop/Hive is great, if it changes
> randomly, then not so much.
>
> Cassandra has limitations that it does not support aggregation, that must
> be done by a client. In my experience it is really suited to quickly write
> lots of data and look it back up in a "random io" type manner if you
> already know the "key" you are looking for.
>
> If you have the high speed write and rewrite needs, but also the "full
> data" analytical requirements, there are plugins for using C* as a backing
> store for Pig/Hive. It is a little finicky to get working depending on all
> your versions but does work fairly well in my limited experience.
>
> Perhaps with a little better understanding of your workload needs others
> can chime in too. Good luck.
>
> -Thunder

Re: Help on Designing Cassandra table for my usecase

Posted by Thunder Stumpges <th...@gmail.com>.

This sort of work sounds much more like a Hadoop/Hive/Pig type of analysis. 

What are your latency requirements on queries? Are they ad hoc or part of an application? What is the case where you would need to change an existing value? If it is write-once, then Hadoop/Hive is great; if it changes randomly, then not so much.

Cassandra has the limitation that it does not support aggregation; that must be done by a client. In my experience it is really suited to quickly writing lots of data and looking it back up in a "random io" type manner if you already know the "key" you are looking for.
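
To show what doing the aggregation in the client could look like, here is a minimal Python sketch that computes SUM and AVERAGE over rows already fetched from one partition. The `aggregate` helper and the dict row shape are assumptions for illustration; the values are the Cost/Year rows from the sample data.

```python
from statistics import mean

# Rows as a client might receive them from a query on one partition.
rows = [
    {"period": "2010", "tag": "CPU", "value": 8000},
    {"period": "2010", "tag": "RAM", "value": 4000},
    {"period": "2011", "tag": "CPU", "value": 9000},
]


def aggregate(rows, key):
    """Group rows by `key` and compute SUM and AVERAGE of `value` per group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row["value"])
    return {
        group: {"sum": sum(values), "average": mean(values)}
        for group, values in groups.items()
    }
```

Calling `aggregate(rows, "period")` groups by period and `aggregate(rows, "tag")` groups by tag. Every row has to travel to the client first, so this only stays quick while the partitions being read are reasonably small.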

If you have the high-speed write and rewrite needs, but also the "full data" analytical requirements, there are plugins for using C* as a backing store for Pig/Hive. It is a little finicky to get working depending on all your versions, but it does work fairly well in my limited experience.

Perhaps with a little better understanding of your workload needs others can chime in too. Good luck. 

-Thunder
