Posted to user@cassandra.apache.org by jason kowalewski <ja...@gmail.com> on 2012/05/17 17:55:16 UTC

Data modeling for read performance

We have been attempting to change our data model to get more read 
performance out of our cluster. 

There are a couple of ways we could model the data, and I was 
wondering if some people out there could help us out. 

We are storing time-series data, currently keyed by a user id. This 
approach is leading to some hot-spotting of nodes, likely because 
the key distribution is not representative of the usage pattern. 
We are currently using super columns (the super column name is the 
timestamp), which we also intend to get rid of with this data model 
redesign.   

Our first idea is to shard the data into time buckets using composite row 
keys: 

UserId:<TimeBucket> : { 
  <timestamp>:<colname> = <col value1>,
  <timestamp>:<colname2> = <col value2>
... and so on.
}

We can then use a wide row index for tracking these in the future: 
<TimeBucket>: { 
  <userId> = null
} 

With this first approach, the data would always be retrieved by its composite 
row key. 
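
For illustration, here is a rough sketch of how this first approach might look
from a Python Thrift client such as pycassa. The column family names, the
bucket width and the client library are placeholders, and the events CF is
assumed to use a CompositeType(LongType, UTF8Type) comparator:

# Rough sketch of approach 1: bucketed row keys plus a per-bucket index row.
# 'UserEvents' and 'BucketIndex' are placeholder CF names; pycassa is assumed.
import pycassa

BUCKET_SECONDS = 86400  # placeholder bucket width, sized to keep rows around 200 events

pool = pycassa.ConnectionPool('MyKeyspace', ['127.0.0.1:9160'])
events = pycassa.ColumnFamily(pool, 'UserEvents')
bucket_index = pycassa.ColumnFamily(pool, 'BucketIndex')

def bucket_for(ts):
    # Truncate a unix timestamp to the start of its bucket.
    return int(ts) - (int(ts) % BUCKET_SECONDS)

def write_event(user_id, ts, values):
    bucket = bucket_for(ts)
    row_key = '%s:%d' % (user_id, bucket)                            # UserId:<TimeBucket>
    cols = dict(((ts, name), val) for name, val in values.items())   # <timestamp>:<colname>
    events.insert(row_key, cols)
    bucket_index.insert(str(bucket), {user_id: ''})                  # wide index row: <TimeBucket> -> <userId>

def read_bucket(user_id, bucket):
    # Point get on one bucketed row; at <= 200 events the whole row is a cheap slice.
    return events.get('%s:%d' % (user_id, bucket), column_count=1000)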

Alternatively we could just do wide rows using composite columns: 

UserId : { 
  <timestamp>:<colname> = <col value1>, 
  <timestamp>:<colname2> = <col value2>

... and so on
}


The second approach has less granular keys, but it makes it easier to keep 
historical time series together rather than sharding the data into buckets. It 
also relies solely on range slices over the columns to retrieve 
the data. 
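
For comparison, a similar sketch of the second approach, again assuming pycassa
and a placeholder 'UserEvents' CF with a CompositeType(LongType, UTF8Type)
comparator; the read pages through the row with bounded slices rather than
pulling hundreds of thousands of columns in one call:

# Rough sketch of approach 2: one wide row per user, range-sliced by timestamp.
# CF name and client library are placeholders, as above.
import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['127.0.0.1:9160'])
events = pycassa.ColumnFamily(pool, 'UserEvents')

def write_event(user_id, ts, values):
    events.insert(user_id, dict(((ts, name), val) for name, val in values.items()))

def read_range(user_id, start_ts, end_ts, page_size=1000):
    # xget() streams the slice in pages of page_size columns instead of
    # materialising the whole (potentially huge) row in one response.
    for (ts, colname), value in events.xget(user_id,
                                            column_start=(start_ts,),
                                            column_finish=(end_ts,),
                                            buffer_size=page_size):
        yield ts, colname, value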

Is there a speed advantage to a point get on a row in the first approach versus 
a range scan over the columns in the second approach? In the first approach 
each bucket would hold no more than 200 events; in the second approach we 
would expect the number of columns to run from the thousands into the hundreds 
of thousands. Our reads currently (using super columns) are PAINFULLY slow - 
the cluster is constantly timing out on many nodes and disk I/O is very high. 

Also, instead of having each column name be a new composite column, is it 
better to serialize the multiple values into some format (JSON, binary, etc.) to 
reduce the number of disk seeks when paging over this time-series data? 

Thanks for any ideas out there! 


-Jason


Re: Data modeling for read performance

Posted by aaron morton <aa...@thelastpickle.com>.
I would bucket the time stats as well.

If you write all the attributes at the same time, and always want to read them together, storing them in something like a JSON blob is a legitimate approach. 
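
For example, a minimal sketch of packing one event into a single column whose
value is a JSON blob (pure Python; the attribute names are made up for
illustration):

# Sketch: one column per event, attributes serialised into a single JSON value,
# instead of one composite column per attribute.
import json

def pack_event(ts, values):
    # Column name is the event timestamp, column value is the JSON blob.
    return ts, json.dumps(values, sort_keys=True, separators=(',', ':'))

def unpack_event(ts, blob):
    return ts, json.loads(blob)

# e.g. pack_event(1337299261, {'attr_a': 1, 'attr_b': 'x'})
#      -> (1337299261, '{"attr_a":1,"attr_b":"x"}')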

Other Aaron, can you elaborate on 
> I'm not using composite row keys (it's just
> AsciiType) as that can lead to hotspots on disk.  

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 18/05/2012, at 4:56 AM, Aaron Turner wrote:

> On Thu, May 17, 2012 at 8:55 AM, jason kowalewski
> <ja...@gmail.com> wrote:
>> [original message snipped - quoted in full above]
> 
> 
> You didn't say what your queries look like, but the way I did it was:
> 
> <userid>|<stat_name>|<timebucket> : {
>  <timestamp> = <value>
> }
> 
> This provides very efficient reads for a given user/stat combination.
> If I need to get multiple stats per user, I just use more threads on
> the client side.  I'm not using composite row keys (it's just
> AsciiType) as that can lead to hotspots on disk.  My timestamps are
> also just plain Unix epochs, as that takes less space than something
> like TimeUUID.


Re: Data modeling for read performance

Posted by Aaron Turner <sy...@gmail.com>.
On Thu, May 17, 2012 at 8:55 AM, jason kowalewski
<ja...@gmail.com> wrote:
> [original message snipped - quoted in full above]


You didn't say what your queries look like, but the way I did it was:

<userid>|<stat_name>|<timebucket> : {
  <timestamp> = <value>
}

This provides very efficient reads for a given user/stat combination.
If I need to get multiple stats per user, I just use more threads on
the client side.  I'm not using composite row keys (it's just
AsciiType) as that can lead to hotspots on disk.  My timestamps are
also just plain Unix epochs, as that takes less space than something
like TimeUUID.
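
For what it's worth, a rough sketch of that layout (the key format follows the
description above; pycassa, the 'Stats' CF name and the thread pool are
assumptions for illustration):

# Sketch of the <userid>|<stat_name>|<timebucket> layout with plain epoch column names.
# 'Stats' is a placeholder CF name, assumed to use a LongType comparator.
from multiprocessing.pool import ThreadPool
import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['127.0.0.1:9160'])
stats = pycassa.ColumnFamily(pool, 'Stats')

def row_key(user_id, stat_name, bucket):
    return '%s|%s|%d' % (user_id, stat_name, bucket)   # plain AsciiType key, no composite

def read_stat(user_id, stat_name, bucket, start_ts, end_ts):
    # One user/stat combination is a single row, sliced by epoch timestamp.
    return stats.get(row_key(user_id, stat_name, bucket),
                     column_start=start_ts, column_finish=end_ts,
                     column_count=10000)

def read_many_stats(user_id, stat_names, bucket, start_ts, end_ts):
    # Multiple stats per user -> more client-side threads, one row read each.
    workers = ThreadPool(len(stat_names))
    try:
        return workers.map(
            lambda s: read_stat(user_id, s, bucket, start_ts, end_ts),
            stat_names)
    finally:
        workers.close()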



-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"