Posted to user@cassandra.apache.org by M Vieira <mv...@gmail.com> on 2011/09/29 11:24:58 UTC

Cassandra data modeling

I'm trying to get my head around Cassandra data modeling, but I can't quite
see what would be the best approach to the problem I have.
The supposed scenario:
You have around 100 domains, and each domain has from a few hundred to millions
of possible URLs (think of different combinations of GET args:
example.org?a=one&b=two is different from example.org?b=two&a=one)


The application requirements:
- two columns storing an average of 500KB each and four (maybe six) columns
storing 1KB each
- retrieve the single oldest/newest URL for any single domain
- retrieve a range of the oldest/newest URLs for any single domain
- retrieve the single oldest/newest URL across all domains
- retrieve a range of the oldest/newest URLs across all domains
- entries will be edited at least once a day (heavy read+write)

Having considered the following:
http://wiki.apache.org/cassandra/CassandraLimitations
http://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage
http://wiki.apache.org/cassandra/MemtableThresholds#Memtable_Thresholds
https://issues.apache.org/jira/browse/CASSANDRA-16


Which of the models below would you go for, and why?
Any input would be appreciated


Model A
Hundreds of rows (domain names as row keys),
each holding hundreds of thousands of columns (pages within that domain),
and each of those columns then holding a few subcolumns (5 in this case).
Biggest row: "example.net" ~350GB
Secondary index: column holding the URL
{
   "example.com": {
       "example.com/a": ["1", "2", "3", "4", "5"],
       "example.com/b": ["1", "2", "3", "4", "5"],
       "example.com/c": ["1", "2", "3", "4", "5"],
   },
   "example.net": {
       "example.net/a": ["1", "2", "3", "4", "5"],
       "example.net/b": ["1", "2", "3", "4", "5"],
       "example.net/c": ["1", "2", "3", "4", "5"],
   },
   "example.org": {
       "example.org/a": ["1", "2", "3", "4", "5"],
       "example.org/b": ["1", "2", "3", "4", "5"],
       "example.org/c": ["1", "2", "3", "4", "5"],
   }
}


Model B
Millions of rows (URLs as row keys), each holding a few columns (6 in this
case).
Biggest row: any, ~1004KB
Secondary index: column holding the domain name
{
   "example.com/a": ["1", "2", "3", "4", "5", "example.com"],
   "example.com/b": ["1", "2", "3", "4", "5", "example.com"],
   "example.com/c": ["1", "2", "3", "4", "5", "example.com"],
   "example.net/a": ["1", "2", "3", "4", "5", "example.net"],
   "example.net/b": ["1", "2", "3", "4", "5", "example.net"],
   "example.net/c": ["1", "2", "3", "4", "5", "example.net"],
   "example.org/a": ["1", "2", "3", "4", "5", "example.org"],
   "example.org/b": ["1", "2", "3", "4", "5", "example.org"],
   "example.org/c": ["1", "2", "3", "4", "5", "example.org"],
}


Model C
Millions of rows (TimeUUIDs as row keys), each holding a few columns (7 in
this case).
Biggest row: any, ~1004KB
Secondary index: column holding the domain name & column holding the URL
{
   "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/a"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/b"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/c"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/a"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/b"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/c"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/a"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/b"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/c"],
}

//END

Re: Cassandra data modeling

Posted by aaron morton <aa...@thelastpickle.com>.
If you are collecting time series data, and assuming the flying turtles we live on that swim through time do not stop, you will want to partition your data. (background http://www.slideshare.net/mattdennis/cassandra-data-modeling)

Let's say it makes sense for you to partition by month (may not be the case, but it's easy for now), so your partition keys will look like "201109". Also, I'm not sure about the first requirement for columns storing 500KB of data, so I'll just talk about the URLs.
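That partition key can be computed from the write timestamp. A minimal sketch (the "YYYYMM" format is an assumption matching the "201109" example; nothing here is an actual Cassandra API):

```python
from datetime import datetime, timezone

def month_partition(ts: float) -> str:
    """Derive a month partition key like '201109' from a Unix timestamp (UTC)."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"{dt.year:04d}{dt.month:02d}"
```

A nice property of "YYYYMM" keys is that lexical order matches chronological order, so finding the newest partition is just a string comparison.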

CF: domain_partitions - used to find which partitions the domain has data in
key = <domain> 
column name = <partition_key>
column value = EMPTY

CF: url_time_series - stores the URLs for a domain in a partition
key = <domain> '+' <partition_key>
column name =  time uuid
column value = url


CF: url_payload - store additional url data
key = <domain> '+' <partition_key> '+' <time_uuid>

Requests:

* store a new hit
	- work out the current partition
	- batch mutate to update domain_partitions, url_time_series and, if needed, url_payload
	- use a special "ALL" domain and store it there too

* get the oldest / newest URL for a domain (same thing for a range)
	- get the oldest / newest column from the domain_partitions CF
	- get the oldest / newest column from the url_time_series CF using that partition

* get the oldest / newest for ALL domains
	- do the same as above but use the all domain
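Putting the write and read paths together, here is an in-memory sketch of the scheme (plain dicts stand in for the three column families; the names store_hit, newest_url, the "ALL" key and the '+' separators are assumptions taken from the description above, not a client API):

```python
import time
import uuid
from collections import defaultdict
from datetime import datetime, timezone

# Plain dicts stand in for the three column families described above.
domain_partitions = defaultdict(dict)  # <domain>             -> {partition: b""}
url_time_series = defaultdict(dict)    # <domain>+<partition> -> {TimeUUID: url}
url_payload = {}                       # <domain>+<partition>+<TimeUUID> -> payload

ALL = "ALL"  # the special domain that aggregates hits across all domains

def month_partition(ts):
    # "YYYYMM" partition key, e.g. "201109"; lexical order == chronological order
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"{dt.year:04d}{dt.month:02d}"

def store_hit(domain, url, payload=None, ts=None):
    """Batch-mutate equivalent: update all three CFs for the domain and ALL."""
    ts = time.time() if ts is None else ts
    tu = uuid.uuid1()  # a version-1 UUID carries its creation timestamp
    pk = month_partition(ts)
    for d in (domain, ALL):
        domain_partitions[d][pk] = b""
        url_time_series[f"{d}+{pk}"][tu] = url
        if payload is not None:
            url_payload[f"{d}+{pk}+{tu}"] = payload

def newest_url(domain):
    """Newest partition for the domain, then the newest column inside it."""
    pk = max(domain_partitions[domain])
    series = url_time_series[f"{domain}+{pk}"]
    tu = max(series, key=lambda u: u.time)  # compare TimeUUIDs by timestamp
    return series[tu]
```

The oldest-side queries are the same with min() instead of max(), and a range query is a sorted slice of the same series.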

Notes:
- I split the payload out because I was not sure when you just wanted the URL and when you wanted all the other data. 
- You should look at using composite types http://www.slideshare.net/edanuff/indexing-in-cassandra
- I've probably missed things
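On the composite-types note: a composite column name compares component by component, left to right, which plain Python tuples happen to mimic. A hypothetical sketch with (timestamp, url) pairs as column names (the values here are made-up placeholders):

```python
# Hypothetical sketch: composite column names as (timestamp, url) tuples.
# Cassandra's CompositeType compares component-by-component, left to right,
# which is exactly how Python orders tuples.
columns = {
    (3, "example.org/c"): "payload-c",
    (1, "example.org/a"): "payload-a",
    (2, "example.org/b"): "payload-b",
}

oldest_first = sorted(columns)  # slice order: oldest column name first
newest_name = max(columns)      # the newest (timestamp, url) pair
```

With that layout, one slice over a row returns the URL and its timestamp together, without the separate url_time_series lookup.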

Hope that helps, good luck. 

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com


Re: Cassandra data modeling

Posted by Thamizh <tc...@yahoo.co.in>.
If URL retrieval is based on TimeUUID, then Model C, with ByteOrderedPartitioner and the row key as the long value of the TimeUUID, could be the right choice; it lets you apply range queries based on TimeUUID.

Regards,
Thamizhannal P

