Posted to user@cassandra.apache.org by jason kowalewski <ja...@gmail.com> on 2012/05/17 17:55:16 UTC
Data modeling for read performance
We have been attempting to change our data model to get more read
performance out of our cluster.
Currently there are a couple of ways to model the data, and I was
wondering if some people out there could help us out.
We are storing time-series data, currently keyed by user id. This
approach is leading to some hot-spotting of nodes, likely because the
key distribution is not representative of the usage pattern.
We are currently using super columns (the super column name is the
timestamp), which we also intend to get rid of in this data model
redesign.
The first idea we had is that we can shard the data using composite row
keys into time buckets:
UserId:<TimeBucket> : {
    <timestamp>:<colname> = <col value1>,
    <timestamp>:<colname2> = <col value2>,
    ... and so on
}
We can then use a wide row index for tracking these in the future:
<TimeBucket> : {
    <userId> = null
}
This first approach would always have the data be retrieved by the composite
row key.
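For concreteness, the first approach could be sketched like this (a minimal Python model, with in-memory dicts standing in for the two column families; the one-hour bucket width and the exact key format are assumptions for illustration only):

```python
import time

BUCKET_SECONDS = 3600  # assumed bucket width: one hour per row


def time_bucket(ts):
    """Round a Unix timestamp down to the start of its bucket."""
    return int(ts) - int(ts) % BUCKET_SECONDS


def row_key(user_id, ts):
    """Composite row key: 'UserId:<TimeBucket>'."""
    return "%s:%d" % (user_id, time_bucket(ts))


# In-memory stand-ins for the two column families described above.
events = {}        # row key -> {(timestamp, colname): value}
bucket_index = {}  # time bucket -> set of user ids (the wide index row)


def write_event(user_id, ts, colname, value):
    """Write one event column and register the user in the bucket index."""
    key = row_key(user_id, ts)
    events.setdefault(key, {})[(ts, colname)] = value
    bucket_index.setdefault(time_bucket(ts), set()).add(user_id)


write_event("user42", 1337270116, "lat", "40.7")
write_event("user42", 1337270116, "lon", "-74.0")
```

Reads are then always a point get on the composite key, and the index row lets you discover which users have data in a given bucket.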
Alternatively we could just do wide rows using composite columns:
UserId : {
    <timestamp>:<colname> = <col value1>,
    <timestamp>:<colname2> = <col value2>,
    ... and so on
}
The second approach has less granular keys, but makes it easier to group
historical time series than sharding the data into buckets does. This second
approach also depends solely on range slices over the columns to retrieve
the data.
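The range-slice behaviour of the second approach can be modelled like this (a Python sketch over one in-memory wide row; the assumption is that composite column names sort by (timestamp, colname), the way a LongType/AsciiType comparator would order them on disk):

```python
from bisect import bisect_left, bisect_right

# One wide row: composite column name (timestamp, colname) -> value,
# kept in comparator order when sliced.
row = {
    (1000, "lat"): "40.7", (1000, "lon"): "-74.0",
    (2000, "lat"): "40.8", (2000, "lon"): "-74.1",
    (3000, "lat"): "40.9",
}


def range_slice(row, start_ts, end_ts):
    """Return all columns whose timestamp component is in [start_ts, end_ts]."""
    names = sorted(row)
    # (start_ts,) sorts before any (start_ts, colname) composite,
    # and (end_ts, "\xff") sorts after any ASCII colname at end_ts.
    lo = bisect_left(names, (start_ts,))
    hi = bisect_right(names, (end_ts, "\xff"))
    return [(name, row[name]) for name in names[lo:hi]]
```

This is the same start/finish trick used when slicing composite columns: bound the slice with a name that is a prefix (or just past the prefix) of the composite.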
Is there a speed advantage to a point get of a row in the first approach versus
range scans over the columns in the second approach? In the first approach
each bucket would have no more than 200 events; in the second approach we
would expect the number of columns per row to be in the thousands to hundreds of
thousands. Our reads currently (using super columns) are PAINFULLY slow -
the cluster is constantly timing out on many nodes and disk I/O is very high.
Also, instead of storing each value as a separate composite column, is it
better to serialize the multiple values into some format (JSON, binary, etc.) to
reduce the number of disk seeks when paging over this time-series data?
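A minimal sketch of that blob-per-timestamp layout, assuming JSON and made-up attribute names (the original does not say what the values are):

```python
import json


def pack_event(lat, lon, speed):
    """Serialize all of an event's attributes into one JSON column value."""
    return json.dumps({"lat": lat, "lon": lon, "speed": speed})


def unpack_event(row, ts):
    """Read one timestamp's column and decode the packed attributes."""
    return json.loads(row[ts])


# One column per timestamp instead of one column per attribute:
#   <timestamp> = '{"lat": ..., "lon": ..., "speed": ...}'
row = {1337270116: pack_event(40.7, -74.0, 12.5)}
```

The trade-off is that every read and write moves the whole blob, so it fits best when the attributes are always written and read together.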
Thanks for any ideas out there!
-Jason
Re: Data modeling for read performance
Posted by aaron morton <aa...@thelastpickle.com>.
I would bucket the time stats as well.
If you write all the attributes at the same time, and always want to read them together, storing them in something like a JSON blob is a legitimate approach.
Other Aaron, can you elaborate on
> I'm not using composite row keys (it's just
> AsciiType) as that can lead to hotspots on disk.
Cheers
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
On 18/05/2012, at 4:56 AM, Aaron Turner wrote:
> On Thu, May 17, 2012 at 8:55 AM, jason kowalewski
> <ja...@gmail.com> wrote:
>> [...]
>
>
> You didn't say what your queries look like, but the way I did it was:
>
> <userid>|<stat_name>|<timebucket> : {
> <timestamp> = <value>
> }
>
> This provides very efficient reads for a given user/stat combination.
> If I need to get multiple stats per user, I just use more threads on
> the client side. I'm not using composite row keys (it's just
> AsciiType), as that can lead to hotspots on disk. My timestamps are
> also just plain Unix epochs, as they take less space than something
> like TimeUUID.
Re: Data modeling for read performance
Posted by Aaron Turner <sy...@gmail.com>.
On Thu, May 17, 2012 at 8:55 AM, jason kowalewski
<ja...@gmail.com> wrote:
> [...]
You didn't say what your queries look like, but the way I did it was:
<userid>|<stat_name>|<timebucket> : {
    <timestamp> = <value>
}
This provides very efficient reads for a given user/stat combination.
If I need to get multiple stats per user, I just use more threads on
the client side. I'm not using composite row keys (it's just
AsciiType), as that can lead to hotspots on disk. My timestamps are
also just plain Unix epochs, as they take less space than something
like TimeUUID.
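That scheme could be sketched like this (Python, with a plain dict standing in for the column family; the daily bucket width and the stat names are assumptions, not from the original setup):

```python
import threading


def stat_row_key(user_id, stat_name, ts, bucket_seconds=86400):
    """Plain-ASCII row key '<userid>|<stat_name>|<timebucket>' (daily buckets assumed)."""
    bucket = int(ts) - int(ts) % bucket_seconds
    return "%s|%s|%d" % (user_id, stat_name, bucket)


# Stand-in for the column family: row key -> {timestamp: value}.
store = {
    stat_row_key("user42", "cpu", 1337270116): {1337270116: "0.85"},
    stat_row_key("user42", "mem", 1337270116): {1337270116: "512"},
}


def fetch_stats(user_id, stats, ts):
    """Fetch several stats for one user in parallel, one thread per row."""
    results = {}

    def worker(stat):
        results[stat] = store.get(stat_row_key(user_id, stat, ts), {})

    threads = [threading.Thread(target=worker, args=(s,)) for s in stats]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each user/stat/bucket combination is its own narrow-ish row, so a single stat read is one point get, and multi-stat reads fan out client-side.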
--
Aaron Turner
http://synfin.net/ Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
-- Benjamin Franklin
"carpe diem quam minimum credula postero"