Posted to user@cassandra.apache.org by Courtney Robinson <sa...@live.co.uk> on 2011/10/06 21:34:20 UTC

CF design

I was hoping someone could share their opinions on the following CF design or suggest a better way of doing it.
My app is constantly receiving new data that contains URLs. I was
thinking of hashing each URL to form a key. The data is a JSON object with
several properties. For now most of its properties will be ignored and only 4
are of interest: URL, title, username and user_rating. Often the same URL
is received again, shared by a different user. I’m wondering if anyone can suggest
a better approach than the one I propose below, which should be able to answer the
following queries.
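
As a rough illustration of the hashing idea, a minimal Python sketch might look like the following. The choice of SHA-1, the whitespace trimming and the name url_key are assumptions made for the example, not something the app already does.

```python
import hashlib


def url_key(url: str) -> str:
    """Return a fixed-length hex key for a URL.

    SHA-1 and the strip() normalisation are assumptions for this sketch;
    any stable hash would serve the same purpose.
    """
    normalized = url.strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


# The same URL shared by two different users maps to the same key.
key_a = url_key("http://example.com/article")
key_b = url_key("http://example.com/article ")
```

Because the key depends only on the URL, repeated shares of the same link by different users end up under one ID.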


Queries:

I’ll be asking the following questions:

1) Give me the N most frequently shared items over:
   a) The last 30 minutes
   b) The last 24 hours
   c) Between D1 and D2 (where D1 and D2 represent the start and end dates of interest)

2) Give me the N most shared items over the 3 time periods above WHERE the average user rating is above 5

3) Give me X for the item with the ID 123 (where X is a property of the item with the ID 123)

Proposed approach

Use timestamps as row keys in the CF; that should take care of
the queries under 1 and partially handle 2. Common fields such as the
title, which will be the same no matter how many users share the same link,
will have their own columns in the row. The other column names will be the
users’ usernames, and the value of each of those columns will be whatever JSON
is left over once the common fields are removed (for example that user’s
rating).

For the rest of 2, I
can get the N items we’re interested in and calculate the average user rating
for each item client side. Of course, using the timestamp as the key means we need to
maintain an index from the “real” key/ID of each item to its timestamp, which would allow us to
answer “Give me the item with the ID 123”.
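
To make the client-side part concrete, here is a small Python sketch that works on plain dictionaries instead of a real Cassandra client. The numeric timestamps, the sample rows, and the names cf1, SHARED_COLUMNS and top_shared are all hypothetical; it also assumes every username column holds JSON with a "rating" field.

```python
import json
from collections import defaultdict

# Hypothetical in-memory stand-in for CF1: row key (timestamp) -> columns.
# "Title" and "ID" are the common columns; every other column is a username.
SHARED_COLUMNS = {"Title", "ID"}

cf1 = {
    1000: {"Title": "value", "ID": "ID1",
           "Username3": '{"rating":5}', "Username2": '{"rating":0}'},
    2000: {"Title": "Value1", "ID": "ID2",
           "Username24": '{"rating":1}', "Username87": '{"rating":9}',
           "Username7": '{"rating":8}'},
}


def top_shared(rows, start, end, n, min_avg_rating=None):
    """Return up to n (item_id, share_count, avg_rating) tuples, most shared first."""
    ratings = defaultdict(list)
    for ts, columns in rows.items():
        if not (start <= ts <= end):
            continue  # row falls outside the requested time window
        for name, value in columns.items():
            if name not in SHARED_COLUMNS:
                ratings[columns["ID"]].append(json.loads(value)["rating"])
    items = []
    for item_id, values in ratings.items():
        avg = sum(values) / len(values)
        if min_avg_rating is None or avg > min_avg_rating:
            items.append((item_id, len(values), avg))
    items.sort(key=lambda item: item[1], reverse=True)
    return items[:n]
```

Calling top_shared(cf1, 0, 3000, 1) covers query 1, and passing min_avg_rating=5 adds the filter from query 2. In a real deployment the time-window slice would come from Cassandra and only the averaging would stay on the client.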

Finally, to address 3, I was thinking of using the index to get
the timestamp of the item, fetching that row, and finding the property of
interest on the client side.
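
The two-step lookup for query 3 could be sketched in Python roughly as below. The dictionaries stand in for the two column families, and the name get_property is made up for the example; a real client would issue two reads against Cassandra instead.

```python
# Hypothetical stand-ins for the two column families.
cf1 = {
    "Timestamp1": {"Title": "value", "ID": "ID1"},
    "Timestamp2": {"Title": "Value1", "ID": "ID2"},
}
cf2 = {"ID1": "Timestamp1", "ID2": "Timestamp2"}  # index: item ID -> CF1 row key


def get_property(item_id, prop):
    """Resolve the item ID through the index, then read one property client side."""
    timestamp = cf2.get(item_id)
    if timestamp is None:
        return None  # unknown item ID
    return cf1.get(timestamp, {}).get(prop)


title = get_property("ID2", "Title")
```

The cost of this design is one extra read per lookup, plus keeping the index in step with the main CF on every write.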

CF1

Row key      Column name   Column value
Timestamp1   Title         value
             ID            ID1
             Username3     {"rating":5}
             Username2     {"rating":0}
             Username2     {"rating":4}
Timestamp2   Title         Value1
             ID            ID2
             Username24    {"rating":1}
             Username87    {"rating":9}
             Username7     {"rating":2}

CF2

Item ID   CF1 row key
ID1       Timestamp1
ID2       Timestamp2


In the Username columns, I'd ideally like to avoid storing the other properties as JSON, but I couldn't think of a sensible way of doing that once the JSON grows to 10 other properties. Does this sound like a sensible approach to designing my CFs?