Posted to user@cassandra.apache.org by Coen Stevens <be...@gmail.com> on 2013/04/09 16:00:51 UTC

Column index vs Row index vs Denormalizing

Hi all,

We are working on a data model for storing tweets for multiple streams
(where a stream is defined by a number of keyword filters on the full
twitter firehose), and retrieving the tweets by timestamp and hashtag. My
question is whether the following data model would be a good way of doing
that, where I'm creating a column name index for the hashtags.

ColumnFamily: streamTweets
     key: streamID + dayTimestamp (creating daily buckets for each stream)
     columns => name: TimeUUID, value: tweet json (storing all the tweets
for this stream in a wide row with a TimeUUID)

ColumnFamily: streamHashTagTweets
     key: streamID + dayTimestamp + hashTag (e.g. 123_2013-04-02_cassandra)
     columns => name: TimeUUID (referencing the TimeUUID value in the
streamTweets ColumnFamily), value: tweetID
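
To make the layout concrete, here is a minimal write-path sketch using
the Thrift-era pycassa client; the keyspace, host, and tweet field
names are assumptions, not part of the proposal:

    import pycassa
    from pycassa.util import convert_time_to_uuid

    pool = pycassa.ConnectionPool('tweetks', ['localhost:9160'])
    stream_tweets = pycassa.ColumnFamily(pool, 'streamTweets')
    hashtag_tweets = pycassa.ColumnFamily(pool, 'streamHashTagTweets')

    def store_tweet(stream_id, tweet):
        # tweet is assumed to be a dict carrying created_at (a
        # datetime), id, the raw json string, and a list of hashtags.
        day = tweet['created_at'].strftime('%Y-%m-%d')
        ts = convert_time_to_uuid(tweet['created_at'], randomize=True)
        # One wide row per stream per day, columns ordered by TimeUUID.
        stream_tweets.insert('%s_%s' % (stream_id, day),
                             {ts: tweet['json']})
        # Index rows carry only the tweetID, not the payload.
        for tag in tweet['hashtags']:
            hashtag_tweets.insert('%s_%s_%s' % (stream_id, day, tag),
                                  {ts: str(tweet['id'])})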

Retrieving the latest 1000 tweets (of a given day) is trivial by requesting
the streamTweets columnFamily. Getting the latest tweets for a given
hashtag would mean you have to get the TimeUUIDs from the
streamHashTagTweets first, and then do a second get call on the
streamTweets with the former TimeUUIDs as the list of columns we like to
retrieve (column index).
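
In pycassa terms the two-step read might look like this (a sketch,
reusing the example key from above):

    # Step 1: newest-first TimeUUIDs from the hashtag index row.
    index = hashtag_tweets.get('123_2013-04-02_cassandra',
                               column_count=1000, column_reversed=True)

    # Step 2: fetch exactly those columns from the day's wide row.
    tweets = stream_tweets.get('123_2013-04-02',
                               columns=list(index.keys()))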

Is referencing column names (TimeUUIDs) a smart thing to do when we have
wide rows spanning millions of columns? It seems easier (a single get
call) to do this than to reference key values and run a multi-get to
fetch all the rows (we want to get up to 1000 tweets).
Referencing key values requires another columnFamily for tweets (key:
tweetId, columns: 1 column with data).
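
The row-key alternative would look roughly like this, assuming a
hypothetical tweets columnFamily keyed by tweetId:

    tweets_cf = pycassa.ColumnFamily(pool, 'tweets')

    index = hashtag_tweets.get('123_2013-04-02_cassandra',
                               column_count=1000, column_reversed=True)
    # The index values hold tweetIDs, so a multi-get over row keys
    # replaces the single wide-row column fetch.
    rows = tweets_cf.multiget(list(index.values()))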

Of course we could instead denormalize the data and store the tweet in
the streamHashTagTweet columns as well, but we want to do the same
thing for other indexes too (topics, twitter usernames, links, etc.),
so the required storage space quickly adds up. Moreover, we will
request tweets by these secondary indexes quite infrequently, while
tweets by timestamp will be requested heavily.

Since we estimate we will store many TBs of tweets, we would prefer to
set up machines with spinning disks (2TB per node) to save costs.

We would love to hear your feedback.

Cheers,
Coen

Re: Column index vs Row index vs Denormalizing

Posted by Coen Stevens <be...@gmail.com>.
Thanks for the feedback! We will go ahead with implementing and
deploying the proposed model, and test it out.

Cheers,
Coen


Re: Column index vs Row index vs Denormalizing

Posted by aaron morton <aa...@thelastpickle.com>.
> Retrieving the latest 1000 tweets (of a given day) is trivial by requesting the streamTweets columnFamily. 
If you normally want to get the most recent items, use a reverse
comparator on the column name;
see http://thelastpickle.com/2011/10/03/Reverse-Comparators/
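
A reverse comparator makes newest-first the storage order; as a
read-time stopgap, pycassa can also reverse per query (a sketch):

    # Newest 1000 columns from the day's wide row, without a
    # reverse comparator on the CF.
    latest = stream_tweets.get('123_2013-04-02',
                               column_count=1000, column_reversed=True)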

> Getting the latest tweets for a given hashtag would mean you have to get the TimeUUIDs from the streamHashTagTweets first, and then do a second get call on the streamTweets with the former TimeUUIDs as the list of columns we like to retrieve (column index).
Your choices here depend on which queries are the most frequent and how much disk space you have.

Your current model makes sense if the stream-by-day query is the most frequent one and you want to conserve disk space. If disk space is not an issue you can denormalise further and store the tweet JSON.

If you have potentially many streamHashTagTweets rows where a single tweet is replicated it may make sense to stick with the current design to reduce disk use. 

> (we want to get up to 1000 tweets). 
If you want to get 1000 of anything from Cassandra, please break the multiget up into multiple calls. Each row request becomes a task in the thread pools on RF nodes, so on a smallish cluster one client asking for 1000 rows will temporarily block other clients and hurt request throughput.
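
A sketch of chunking the multiget (the chunk size is illustrative):

    def multiget_in_chunks(cf, keys, chunk_size=100):
        # Each chunk is a separate call, so no single request queues
        # hundreds of row lookups on the replica nodes at once.
        rows = {}
        for i in range(0, len(keys), chunk_size):
            rows.update(cf.multiget(keys[i:i + chunk_size]))
        return rows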

>  Referencing key values requires another columnFamily for tweets (key: tweetId, columns: 1 column with data).
This will be a more efficient (aka faster) read than reading from a wide row.

> Moreover, we will request tweets by these secondary indexes quite infrequently, while tweets by timestamp will be requested heavily.
If the hot path is the streamTweets calls, denormalise into that CF, normalise the tweet storage into its own CF, and reference the tweets from the streamHashTagTweets. Having a canonical store of the events / tweets / entities, addressable by their business key, can give you more flexibility.
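
Concretely, that hybrid could be declared along these lines with
pycassa's SystemManager (the keyspace name is an assumption):

    from pycassa.system_manager import (SystemManager, TIME_UUID_TYPE,
                                        UTF8_TYPE)

    mgr = SystemManager('localhost:9160')
    # Hot path: tweet JSON denormalised into the per-day stream rows.
    mgr.create_column_family('tweetks', 'streamTweets',
                             comparator_type=TIME_UUID_TYPE,
                             default_validation_class=UTF8_TYPE)
    # Canonical store: one row per tweet, keyed by its business key.
    mgr.create_column_family('tweetks', 'tweets',
                             default_validation_class=UTF8_TYPE)
    # Cold path: hashtag (and other) indexes reference tweetIDs only.
    mgr.create_column_family('tweetks', 'streamHashTagTweets',
                             comparator_type=TIME_UUID_TYPE,
                             default_validation_class=UTF8_TYPE)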

> Since we estimate we will store many TBs of tweets, we would prefer to set up machines with spinning disks (2TB per node) to save costs.
If you have spinning disks and 1G networking, the rule of thumb is 300GB to 500GB per node. See previous discussions about size per node.
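
To make that concrete with purely illustrative numbers: 10TB of raw
tweets at RF=3 is ~30TB on disk, which is about 75 nodes at 400GB
each, versus the ~15 nodes that 2TB disks would suggest.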

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com
