You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Tamar Fraenkel <ta...@tok-media.com> on 2012/01/20 10:11:54 UTC

Data Model Question

Hi!
I am a newbie to Cassandra and seeking some advice regarding the data model I
should use to best address my needs.
For simplicity, what I want to accomplish is:
I have a system that has users (potentially ~10,000 per day) and they perform
actions in the system (total of ~50,000 a day).
Each User’s action is taking place in a certain point in time, and is also
classified into categories (1 to 5) and tagged by 1-30 tags. Each action’s
Categories and Tags has a score associated with it, the score is between 0 to 1
(let’s assume precision of 0.0001).
I want to be able to identify similar actions in the system (performed usually
by more than one user). Similarity of actions is calculated based on their
common Categories and Tags taking scores into account.
I need the system to store:

The list of my users with attributes like name, age etc
For each action – the categories and tags associated with it and their score,
the time of the action, and the user who performed it.
Groups of similar actions (ActionGroups) – the id’s of actions in the group, the
categories and tags describing the group, with their scores. Those are
calculated using an algorithm that takes into account the categories and tags of
the actions in the group.

When a user performs a new action in the system, I want to add it to a fitting
ActionGroups (with similar categories and tags).
For this I need to be able to perform the following:
Find all the recent ActionGroups (those who were updated with actions performed
during the last T minutes), who has at list one of the new action’s categories
AND at list one of the new action’s tags.
 
I thought of two ways to address the issue and I would appreciate your insights.
 
First one using secondary indexes
Column Family:Users
Key: userId
Compare with Bytes Type
Columns: name: <>, age: <> etc…
 
Column Family:Actions
Key: actionId
Compare with Bytes Type
Columns:  Category1 : <Score> ….
          CategoriN: <Score>,
          Tag1 : <Score>, ….
          TagK:<Score>
          Time: timestamp
          user: userId
 
Column Family:ActionGroups
Key: actionGroupId
Compare with Bytes Type
Columns: Category1 : <Score> ….
         CategoriN: <Score>,
         Tag1 : <Score> ….
         TagK:<Score>
         lastUpdateTime: timestamp
         actionId1: null, … ,
         actionIdM: null
 
I will then define secondary index on each tag columns, category columns, and
the update time column.
Let’s assume the new action I want to add to ActionGroup has NewActionCategory1
- NewActionCategoryK, and has NewActionTag1 – NewActionTagN. I will perform the
following query:
Select  * From ActionGroups where
   (NewActionCategory1 > 0  … or NewActionCategoryK > 0) and
   (NewActionTag1 > 0  … or NewActionTagN > 0) and
   lastUpdateTime > T;
 
Second solution
Have the same CF as in the first solutionwithout the secondaryindex, and have
two additional CF-ies:
Column Family:CategoriesToActionGroupId
Key: categoryId
Compare with ByteType
Columns: {Timestamp, ActionGroupsId1 } : null
         {Timestamp, ActionGroupsId2} : null
         ...
*timestamp is the update time for the ActionGroup
 
A similar CF will be defined for tags.
 
I will then be able to run several queries on CategoriesToActionGroupId (one for
each of the new story Categories), with column slice for the right update time
of the ActionGroup.
I will do the same for the TagsToActionGroupId.
I will then use my client code to remove duplicates (ActionGroups who are
associated with more than one Tag or Category).
 
My questions are:

Are the two solutions viable? If yes, which is better
Is there any better way of doing this?
Can I use jdbc and CQL with both method, or do I have to use Hector (I am using
Java).

Thanks
Tamar
 
 

Re: Data Model Question

Posted by aaron morton <aa...@thelastpickle.com>.
> 1. regarding time slicing, if at any point of time I am interested in what happened in the last T minutes, then I will need to query more than one row of the DimentionUpdates, right?
Yerp. 
Sometimes that's is what's needed. 

> 2. What did you mean by "You will also want to partition the list of actions in an actiongroup"
In the ActionGroup CF in your model you have a column for every action in the group. You may want to partition that so the row does not grow for ever. 

> 3. For insert of a new ActionGroup, I will need to
Sounds about right. Don't forget you want to round the current time to the start of the current time partition. So if it's 08:38 and you have 5 minute partitions it will be written to the 08:35 row. 

> Is there any problem of synchronization there?
You can send them as a batch, but only the mutations for the same rows are atomic. And the changes are not isolated, i.e. a read can get a partially updated row. So your client needs to handle cases where referential integrity is somehow broken.

See this presentation from Matt Dennis about how you can implement an application level transaction log if that's something you need http://www.slideshare.net/mattdennis/cassandra-nyc-2011-data-modeling

> Does that makes sense? Do you think your solution is still better?
> 

You may get poor read performance with that option. Overwriting the same row over a long period can result in poor read performance, this is better in 1.0 with leveled compaction but I cannot tell you how much better off the top of my head. Also TTL columns must be discarded during a read, and are only purged on compaction. A high number of dead columns in a row can reduce read performance. So you will want to reverse the comparator (http://thelastpickle.com/2011/10/03/Reverse-Comparators/) so you can avoid reading them. 

Cheers
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/01/2012, at 8:43 PM, Tamar Fraenkel wrote:

> Hi!
> Thank you very much for your response!
>  
> I have couple of questions regarding it, some are just to make sure I understood you:
>  
> 1. regarding time slicing, if at any point of time I am interested in what happened in the last T minutes, then I will need to query more than one row of the DimentionUpdates, right?
> 
> 2. What did you mean by "You will also want to partition the list of actions in an actiongroup"
> 
> 3. For insert of a new ActionGroup, I will need to
> 
> a) add (in some cases overwrite) columns for all the new ActionGroup catetories and tags in the time_partition row of  DimensionUpdates.
> b) add to all the rows in DimentionFacts that are relevant (time, categories and tags) a column for the ActionGroupId.
> c) add all the information to the ActionGroup CF and Action CF, including the tags\categories as column names and their scores as values.
> Is there any problem of synchronization there?
> 
> 4. regarding the row size, wouldn't that be solved by giving TTL to the columns
> 
>  
> 5. After reading a bit I also thought that built in secondary indices won't work. There is also another complication, that having tagA with CategoryA is different than having tagA with CategoryB.
> 
> But I still thought that something like
> 
>  
> CF: ActionGroupByCategoryTag
> Stores which action groups are tagged by each tag.
> key: <category_name:tag_name> 
> col_names: <action_group_creation_time>
> col_value: <action_group_id>
>  
> And then I will get categories*tags rows from this index table, and get sliced columns by the time range i am interested in. I can define some TTL on the columns to get only ActionGroups created in the last T minutes.
> 
>  
> Does that makes sense? Do you think your solution is still better?
> 
>  
> Tamar
> 
> 
> 
> On January 22, 2012 at 9:53 PM aaron morton <aa...@thelastpickle.com> wrote: 
> 
> In general if you are collecting data over time you should consider partitioning the row's to avoid creating very large rows. Also if you have a common request you want to support consider modeling it directly rather than using secondary indexes. 
>  
> Assuming my understanding of the problem is in some one way correct I would consider this for the Action Groups…
>  
> Pick a time partition, this is the minimum time resolution you are interested in (your T Minutes). 
>  
> CF: DimensionUpdates
> Stores which dimensions (tags, categories) were updated in the time partition. 
>  
> key: <time_partition>  is the start of the partition e.g. 2011-01-23T08:30  
> col_names: <dimension_name:dimension_value> where <dimension_name> is "tag" or "category" and <dimension_value> is a value from that domain. e.g. <tag:foo>
> col_value: empty
>  
>  
> CF: DimensionFacts
> Stores the facts that included the dimension in a time partition. 
>  
> key: <time_partition:dimension_name:dimension_value> definitions as above.
> col_names: ActionGroupID. 
> col_values: empty
>  
> So to…
>  
>  
> Find all the recent ActionGroups (those who were updated with actions performed during the last T minutes), who has at list one of the new action’s categories AND at list one of the new action’s tags. 
>  
> 1) Query the DimensionUpdates CF with the  current time partition as the key, and the tags and columns the action group has. 
> 2) For each column returned from (1) query the rows in DimensionFacts to get the ActionGroups.
> 3) Filter the unique set of ActionGroups client side. 
>  
>  
> Some notes:
> 1) Row size in all cases are bound to the time partition size. This will make your life easier when it comes to repair and compaction. By default rows sizes of 64MB will take a slower 2 pass approach that will cost you IO. 
>  
> 2) All queries are bound. Query 1 will only want request 1 to 35 columns from a row that contains 0 to 35 columns. Query 2 can be done as either a multi get (select with lots of KEY clauses) or a series of multi gets, and can be further bound by limiting the number of columns in each request. Making queries that take for a lot of rows at once can harm overall query throughput.   
>  
> 3) Overwrites (writing the to the same row) are bound by the time partition. Depending on load this *may* mean that rows are only physically written to one SSTable. 
>  
> 4) You will also want to partition the list of actions in an actiongroup.
>  
> Hope that helps. 
>  
>  
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 20/01/2012, at 10:11 PM, Tamar Fraenkel wrote:
> 
> Hi!
> I am a newbie to Cassandra and seeking some advice regarding the data model I should use to best address my needs.
> 
> For simplicity, what I want to accomplish is:
> 
> I have a system that has users (potentially ~10,000 per day) and they perform actions in the system (total of ~50,000 a day).
> 
> Each User’s action is taking place in a certain point in time, and is also classified into categories (1 to 5) and tagged by 1-30 tags. Each action’s Categories and Tags has a score associated with it, the score is between 0 to 1 (let’s assume precision of 0.0001).
> 
> I want to be able to identify similar actions in the system (performed usually by more than one user). Similarity of actions is calculated based on their common Categories and Tags taking scores into account.
> 
> I need the system to store:
> 
> The list of my users with attributes like name, age etc
> For each action – the categories and tags associated with it and their score, the time of the action, and the user who performed it.
> Groups of similar actions (ActionGroups) – the id’s of actions in the group, the categories and tags describing the group, with their scores. Those are calculated using an algorithm that takes into account the categories and tags of the actions in the group.
> When a user performs a new action in the system, I want to add it to a fitting ActionGroups (with similar categories and tags).
> 
> For this I need to be able to perform the following:
> 
> Find all the recent ActionGroups (those who were updated with actions performed during the last T minutes), who has at list one of the new action’s categories AND at list one of the new action’s tags.
> 
>  
> I thought of two ways to address the issue and I would appreciate your insights.
> 
>  
> First one using secondary indexes
> 
> Column Family: Users
> 
> Key: userId
> 
> Compare with Bytes Type
> 
> Columns: name: <>, age: <> etc…
> 
>  
> Column Family: Actions
> 
> Key: actionId
> 
> Compare with Bytes Type
> 
> Columns:  Category1 : <Score> ….
> 
>           CategoriN: <Score>,
> 
>           Tag1 : <Score>, ….
> 
>           TagK:<Score>
> 
>           Time: timestamp
> 
>           user: userId
> 
>  
> Column Family: ActionGroups
> 
> Key: actionGroupId
> 
> Compare with Bytes Type
> 
> Columns: Category1 : <Score> ….
> 
>          CategoriN: <Score>,
> 
>          Tag1 : <Score> ….
> 
>          TagK:<Score>
> 
>          lastUpdateTime: timestamp
> 
>          actionId1: null, … ,
> 
>          actionIdM: null
> 
>  
> I will then define secondary index on each tag columns, category columns, and the update time column.
> 
> Let’s assume the new action I want to add to ActionGroup has NewActionCategory1 - NewActionCategoryK, and has NewActionTag1 – NewActionTagN. I will perform the following query:
> 
> Select  * From ActionGroups where
> 
>    (NewActionCategory1 > 0  … or NewActionCategoryK > 0) and
> 
>    (NewActionTag1 > 0  … or NewActionTagN > 0) and
> 
>    lastUpdateTime > T;
> 
>  
> Second solution
> 
> Have the same CF as in the first solution without the secondary index , and have two additional CF-ies:
> 
> Column Family: CategoriesToActionGroupId
> 
> Key: categoryId
> 
> Compare with ByteType
> 
> Columns: {Timestamp, ActionGroupsId1 } : null
> 
>          {Timestamp, ActionGroupsId2} : null
> 
>          ...
> 
> *timestamp is the update time for the ActionGroup
> 
>  
> A similar CF will be defined for tags.
> 
>  
> I will then be able to run several queries on CategoriesToActionGroupId (one for each of the new story Categories), with column slice for the right update time of the ActionGroup.
> 
> I will do the same for the TagsToActionGroupId.
> 
> I will then use my client code to remove duplicates (ActionGroups who are associated with more than one Tag or Category).
> 
>  
> My questions are:
> 
> Are the two solutions viable? If yes, which is better
> Is there any better way of doing this?
> Can I use jdbc and CQL with both method, or do I have to use Hector (I am using Java).
> Thanks
> 
> Tamar
> 
>  
>  
> 
>  


Re: Data Model Question

Posted by Tamar Fraenkel <ta...@tok-media.com>.
Hi!
Thank you very much for your response!
 
I have couple of questions regarding it, some are just to make sure I understood
you:
 
1. regarding time slicing, ifat any point of time I am interested in what
happened in the last T minutes, then I will need to query more than one row of
the DimentionUpdates, right?


2. What did you mean by "You will also want to partition the list of actions in
an actiongroup"


3. For insert of a new ActionGroup, I will need to


a) add (in some cases overwrite) columns for all the new ActionGroup catetories
and tags in the time_partition row of DimensionUpdates.
b) add to all the rows in DimentionFacts that are relevant (time, categories and
tags) a column for the ActionGroupId.
c) add all the information to the ActionGroup CF and Action CF, including the
tags\categories as column names and their scores as values.
Is there any problem of synchronization there?


4. regarding the row size, wouldn't that be solved by giving TTL to the columns
 
5. After reading a bit I also thought that built in secondary indices won't
work. There is also another complication, that having tagA with CategoryA is
different than having tagA with CategoryB.
But I still thought that something like
 

CF: ActionGroupByCategoryTag
Stores which action groups are tagged by each tag.
key: <category_name:tag_name> 
col_names: <action_group_creation_time>
col_value: <action_group_id> 
And then I will get categories*tags rows from this index table, and get sliced
columns by the time range i am interested in. I can define some TTL on the
columns to get only ActionGroups created in the last T minutes.
 
Does that makes sense? Do you think your solution is still better?
 
Tamar





On January 22, 2012 at 9:53 PM aaron morton <aa...@thelastpickle.com> wrote:


> In general if you are collecting data over time you should consider
> partitioning the row's to avoid creating very large rows. Also if you have a
> common request you want to support consider modeling it directly rather than
> using secondary indexes. 
>  
> Assuming my understanding of the problem is in some one way correct I would
> consider this for the Action Groups…
>  
> Pick a time partition, this is the minimum time resolution you are interested
> in (your T Minutes). 
>  
> CF: DimensionUpdates
> Stores which dimensions (tags, categories) were updated in the time
> partition. 
>  
> key: <time_partition>  is the start of the partition e.g. 2011-01-23T08:30  
> col_names: <dimension_name:dimension_value> where <dimension_name> is "tag" or
> "category" and <dimension_value> is a value from that domain. e.g. <tag:foo>
> col_value: empty
>  
>  
> CF: DimensionFacts
> Stores the facts that included the dimension in a time partition. 
>  
> key: <time_partition:dimension_name:dimension_value> definitions as above.
> col_names: ActionGroupID. 
> col_values: empty
>  
> So to…
>  
>  
> > 
> > Find all the recent ActionGroups (those who were updated with actions
> > performed during the last T minutes), who has at list one of the new
> > action’s categories AND at list one of the new action’s tags. 

>  
> 1) Query the DimensionUpdates CF with the  current time partition as the key,
> and the tags and columns the action group has. 
> 2) For each column returned from (1) query the rows in DimensionFacts to get
> the ActionGroups.
> 3) Filter the unique set of ActionGroups client side. 
>  
>  
> Some notes:
> 1) Row size in all cases are bound to the time partition size. This will make
> your life easier when it comes to repair and compaction. By default rows sizes
> of 64MB will take a slower 2 pass approach that will cost you IO. 
>  
> 2) All queries are bound. Query 1 will only want request 1 to 35 columns from
> a row that contains 0 to 35 columns. Query 2 can be done as either a multi get
> (select with lots of KEY clauses) or a series of multi gets, and can be
> further bound by limiting the number of columns in each request. Making
> queries that take for a lot of rows at once can harm overall query throughput.
>   
> 
>  
> 3) Overwrites (writing the to the same row) are bound by the time partition.
> Depending on load this *may* mean that rows are only physically written to one
> SSTable. 
>  
> 4) You will also want to partition the list of actions in an actiongroup.
>  
> Hope that helps. 
>  
>  
> 
> 
> 
> 
> 
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> 
> On 20/01/2012, at 10:11 PM, Tamar Fraenkel wrote:
> 
> > 
> > 
> > Hi!I am a newbie to Cassandra and seeking some advice regarding the data
> > model I should use to best address my needs.
> > For simplicity, what I want to accomplish is:
> > I have a system that has users (potentially ~10,000 per day) and they
> > perform actions in the system (total of ~50,000 a day).
> > Each User’s action is taking place in a certain point in time, and is also
> > classified into categories (1 to 5) and tagged by 1-30 tags. Each action’s
> > Categories and Tags has a score associated with it, the score is between 0
> > to 1 (let’s assume precision of 0.0001).
> > I want to be able to identify similar actions in the system (performed
> > usually by more than one user). Similarity of actions is calculated based on
> > their common Categories and Tags taking scores into account.
> > I need the system to store:
> > 
> > The list of my users with attributes like name, age etc
> > For each action – the categories and tags associated with it and their
> > score, the time of the action, and the user who performed it.
> > Groups of similar actions (ActionGroups) – the id’s of actions in the group,
> > the categories and tags describing the group, with their scores. Those are
> > calculated using an algorithm that takes into account the categories and
> > tags of the actions in the group.
> > 
> > When a user performs a new action in the system, I want to add it to a
> > fitting ActionGroups (with similar categories and tags).
> > For this I need to be able to perform the following:
> > Find all the recent ActionGroups (those who were updated with actions
> > performed during the last T minutes), who has at list one of the new
> > action’s categories AND at list one of the new action’s tags.
> > 
> >  I thought of two ways to address the issue and I would appreciate your
> > insights.
> > 
> >  First one using secondary indexes
> > Column Family:Users
> > Key: userId
> > Compare with Bytes Type
> > Columns: name: <>, age: <> etc…
> > 
> >  Column Family:Actions
> > Key: actionId
> > Compare with Bytes Type
> > Columns:  Category1 : <Score> ….
> >           CategoriN: <Score>,
> >           Tag1 : <Score>, ….
> >           TagK:<Score>
> >           Time: timestamp
> >           user: userId
> > 
> >  Column Family:ActionGroups
> > Key: actionGroupId
> > Compare with Bytes Type
> > Columns: Category1 : <Score> ….
> >          CategoriN: <Score>,
> >          Tag1 : <Score> ….
> >          TagK:<Score>
> >          lastUpdateTime: timestamp
> >          actionId1: null, … ,
> >          actionIdM: null
> > 
> >  I will then define secondary index on each tag columns, category columns,
> > and the update time column.
> > Let’s assume the new action I want to add to ActionGroup has
> > NewActionCategory1 - NewActionCategoryK, and has NewActionTag1 –
> > NewActionTagN. I will perform the following query:
> > Select  * From ActionGroups where
> >    (NewActionCategory1 > 0  … or NewActionCategoryK > 0) and
> >    (NewActionTag1 > 0  … or NewActionTagN > 0) and
> >    lastUpdateTime > T;
> > 
> >  Second solution
> > Have the same CF as in the first solutionwithout the secondaryindex, and
> > have two additional CF-ies:
> > Column Family:CategoriesToActionGroupId
> > Key: categoryId
> > Compare with ByteType
> > Columns: {Timestamp, ActionGroupsId1 } : null
> >          {Timestamp, ActionGroupsId2} : null
> >          ...
> > *timestamp is the update time for the ActionGroup
> > 
> >  A similar CF will be defined for tags.
> > 
> >  I will then be able to run several queries on CategoriesToActionGroupId
> > (one for each of the new story Categories), with column slice for the right
> > update time of the ActionGroup.
> > I will do the same for the TagsToActionGroupId.
> > I will then use my client code to remove duplicates (ActionGroups who are
> > associated with more than one Tag or Category).
> > 
> >  My questions are:
> > 
> > Are the two solutions viable? If yes, which is better
> > Is there any better way of doing this?
> > Can I use jdbc and CQL with both method, or do I have to use Hector (I am
> > using Java).
> > 
> > Thanks
> > Tamar
> > 
> >  
> >  


 

Re: Data Model Question

Posted by aaron morton <aa...@thelastpickle.com>.
In general if you are collecting data over time you should consider partitioning the row's to avoid creating very large rows. Also if you have a common request you want to support consider modeling it directly rather than using secondary indexes. 

Assuming my understanding of the problem is in some one way correct I would consider this for the Action Groups…

Pick a time partition, this is the minimum time resolution you are interested in (your T Minutes). 

CF: DimensionUpdates
Stores which dimensions (tags, categories) were updated in the time partition. 

key: <time_partition>  is the start of the partition e.g. 2011-01-23T08:30  
col_names: <dimension_name:dimension_value> where <dimension_name> is "tag" or "category" and <dimension_value> is a value from that domain. e.g. <tag:foo>
col_value: empty


CF: DimensionFacts
Stores the facts that included the dimension in a time partition. 

key: <time_partition:dimension_name:dimension_value> definitions as above.
col_names: ActionGroupID. 
col_values: empty

So to…

> Find all the recent ActionGroups (those who were updated with actions performed during the last T minutes), who has at list one of the new action’s categories AND at list one of the new action’s tags. 

1) Query the DimensionUpdates CF with the  current time partition as the key, and the tags and columns the action group has. 
2) For each column returned from (1) query the rows in DimensionFacts to get the ActionGroups.
3) Filter the unique set of ActionGroups client side. 


Some notes:
1) Row size in all cases are bound to the time partition size. This will make your life easier when it comes to repair and compaction. By default rows sizes of 64MB will take a slower 2 pass approach that will cost you IO. 

2) All queries are bound. Query 1 will only want request 1 to 35 columns from a row that contains 0 to 35 columns. Query 2 can be done as either a multi get (select with lots of KEY clauses) or a series of multi gets, and can be further bound by limiting the number of columns in each request. Making queries that take for a lot of rows at once can harm overall query throughput.   

3) Overwrites (writing the to the same row) are bound by the time partition. Depending on load this *may* mean that rows are only physically written to one SSTable. 

4) You will also want to partition the list of actions in an actiongroup.

Hope that helps. 


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 20/01/2012, at 10:11 PM, Tamar Fraenkel wrote:

> Hi!
> I am a newbie to Cassandra and seeking some advice regarding the data model I should use to best address my needs.
> 
> For simplicity, what I want to accomplish is:
> 
> I have a system that has users (potentially ~10,000 per day) and they perform actions in the system (total of ~50,000 a day).
> 
> Each User’s action is taking place in a certain point in time, and is also classified into categories (1 to 5) and tagged by 1-30 tags. Each action’s Categories and Tags has a score associated with it, the score is between 0 to 1 (let’s assume precision of 0.0001).
> 
> I want to be able to identify similar actions in the system (performed usually by more than one user). Similarity of actions is calculated based on their common Categories and Tags taking scores into account.
> 
> I need the system to store:
> 
> The list of my users with attributes like name, age etc
> For each action – the categories and tags associated with it and their score, the time of the action, and the user who performed it.
> Groups of similar actions (ActionGroups) – the id’s of actions in the group, the categories and tags describing the group, with their scores. Those are calculated using an algorithm that takes into account the categories and tags of the actions in the group.
> When a user performs a new action in the system, I want to add it to a fitting ActionGroups (with similar categories and tags).
> 
> For this I need to be able to perform the following:
> 
> Find all the recent ActionGroups (those who were updated with actions performed during the last T minutes), who has at list one of the new action’s categories AND at list one of the new action’s tags.
> 
>  
> I thought of two ways to address the issue and I would appreciate your insights.
> 
>  
> First one using secondary indexes
> 
> Column Family: Users
> 
> Key: userId
> 
> Compare with Bytes Type
> 
> Columns: name: <>, age: <> etc…
> 
>  
> Column Family: Actions
> 
> Key: actionId
> 
> Compare with Bytes Type
> 
> Columns:  Category1 : <Score> ….
> 
>           CategoriN: <Score>,
> 
>           Tag1 : <Score>, ….
> 
>           TagK:<Score>
> 
>           Time: timestamp
> 
>           user: userId
> 
>  
> Column Family: ActionGroups
> 
> Key: actionGroupId
> 
> Compare with Bytes Type
> 
> Columns: Category1 : <Score> ….
> 
>          CategoriN: <Score>,
> 
>          Tag1 : <Score> ….
> 
>          TagK:<Score>
> 
>          lastUpdateTime: timestamp
> 
>          actionId1: null, … ,
> 
>          actionIdM: null
> 
>  
> I will then define secondary index on each tag columns, category columns, and the update time column.
> 
> Let’s assume the new action I want to add to ActionGroup has NewActionCategory1 - NewActionCategoryK, and has NewActionTag1 – NewActionTagN. I will perform the following query:
> 
> Select  * From ActionGroups where
> 
>    (NewActionCategory1 > 0  … or NewActionCategoryK > 0) and
> 
>    (NewActionTag1 > 0  … or NewActionTagN > 0) and
> 
>    lastUpdateTime > T;
> 
>  
> Second solution
> 
> Have the same CF as in the first solution without the secondary index , and have two additional CF-ies:
> 
> Column Family: CategoriesToActionGroupId
> 
> Key: categoryId
> 
> Compare with ByteType
> 
> Columns: {Timestamp, ActionGroupsId1 } : null
> 
>          {Timestamp, ActionGroupsId2} : null
> 
>          ...
> 
> *timestamp is the update time for the ActionGroup
> 
>  
> A similar CF will be defined for tags.
> 
>  
> I will then be able to run several queries on CategoriesToActionGroupId (one for each of the new story Categories), with column slice for the right update time of the ActionGroup.
> 
> I will do the same for the TagsToActionGroupId.
> 
> I will then use my client code to remove duplicates (ActionGroups who are associated with more than one Tag or Category).
> 
>  
> My questions are:
> 
> Are the two solutions viable? If yes, which is better
> Is there any better way of doing this?
> Can I use jdbc and CQL with both method, or do I have to use Hector (I am using Java).
> Thanks
> 
> Tamar
> 
>  
>